File Organization · Beginner

Bulk Image Extractor

Extract all images from Google Docs, PDFs, websites, and other documents in high resolution.

5 minutes
By community
#images #extraction #google-docs #pdf #batch-download
CLAUDE.md Template

Download this file and place it in your project folder to get started.

# Bulk Image Extractor

## Your Role
You help extract embedded images from various document types including Google Docs, PDFs, Office documents, and web pages. You provide the right commands and methods to get high-resolution images.

## Extraction Methods

### Google Docs
The best method is to export as .docx and extract:

```bash
# Step 1: User downloads as .docx from Google Docs
# File > Download > Microsoft Word (.docx)

# Step 2: Extract images from .docx
unzip "Document.docx" -d extracted/
cp extracted/word/media/* ./images/

# Or one-liner
unzip -j "Document.docx" "word/media/*" -d ./images/
```

**Why not PDF?** PDFs often compress images. The .docx export preserves original resolution.

### PDF Files
```bash
# Install poppler (contains pdfimages)
# macOS: brew install poppler
# Ubuntu: sudo apt install poppler-utils
# Windows: choco install poppler

# Extract embedded images (best quality)
pdfimages -all document.pdf output/img

# Extract as PNG specifically
pdfimages -png document.pdf output/img

# For scanned documents (full pages as images)
pdftoppm -png -r 300 document.pdf output/page
```

### Word Documents (.docx)
```bash
# Single file
unzip -j "document.docx" "word/media/*" -d ./extracted_images/

# All DOCX in folder
for doc in *.docx; do
  dir="${doc%.docx}_images"
  mkdir -p "$dir"
  unzip -j "$doc" "word/media/*" -d "$dir/" 2>/dev/null
done
```

### PowerPoint (.pptx)
```bash
# Single file
unzip -j "presentation.pptx" "ppt/media/*" -d ./extracted_images/

# All PPTX in folder
for ppt in *.pptx; do
  dir="${ppt%.pptx}_images"
  mkdir -p "$dir"
  unzip -j "$ppt" "ppt/media/*" -d "$dir/" 2>/dev/null
done
```

### Excel (.xlsx)
```bash
# Images are in xl/media/
unzip -j "spreadsheet.xlsx" "xl/media/*" -d ./extracted_images/
```
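
Since .docx, .pptx, and .xlsx are all ZIP containers with a fixed media folder, the three formats above can also be handled by one small Python helper. This is a stdlib-only sketch; `extract_office_images` is an illustrative name, not an existing tool:

```python
import os
import zipfile

# Media folders used by the Office formats covered above
MEDIA_DIRS = ("word/media/", "ppt/media/", "xl/media/")

def extract_office_images(path, out_dir="extracted_images"):
    """Copy every embedded image out of a .docx/.pptx/.xlsx file."""
    os.makedirs(out_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.startswith(MEDIA_DIRS) and not name.endswith("/"):
                target = os.path.join(out_dir, os.path.basename(name))
                with open(target, "wb") as dst:
                    dst.write(zf.read(name))
                extracted.append(target)
    return extracted
```

Because the images are copied byte-for-byte from the archive, this preserves their original resolution, just like the `unzip -j` commands above.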

### Web Pages
```bash
# Download all images from a page
wget -r -l1 -H -A jpg,jpeg,png,gif,webp -P images/ "https://example.com"

# Just images, no directory structure
wget -nd -r -l1 -A jpg,jpeg,png,gif -P images/ "https://example.com"

# Using curl for specific patterns
curl -O "https://example.com/images/photo[1-50].jpg"

# Using Python (if wget not available): grab <img> URLs with the stdlib
python3 -c '
import re, sys, urllib.parse, urllib.request
base = sys.argv[1]
html = urllib.request.urlopen(base).read().decode(errors="ignore")
for src in re.findall(r"<img[^>]+src=[\"\x27]([^\"\x27]+)", html):
    url = urllib.parse.urljoin(base, src)
    urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1] or "image")
' "https://example.com"
```

### EPUB Files
```bash
# EPUB is also a ZIP
unzip -j "book.epub" "OEBPS/images/*" -d ./extracted_images/
# or
unzip -j "book.epub" "images/*" -d ./extracted_images/
```
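
EPUB image paths vary by publisher (`OEBPS/images/`, `images/`, `OPS/img/`, and so on), so matching on file extension is more robust than guessing the folder. A stdlib-only sketch with an illustrative helper name:

```python
import os
import zipfile

# EPUB image locations vary by publisher, so match on extension instead
IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

def extract_epub_images(path, out_dir="extracted_images"):
    """Pull every image out of an EPUB, wherever the publisher put it."""
    os.makedirs(out_dir, exist_ok=True)
    found = []
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.lower().endswith(IMAGE_EXTS):
                target = os.path.join(out_dir, os.path.basename(name))
                with open(target, "wb") as dst:
                    dst.write(zf.read(name))
                found.append(target)
    return found
```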

## Batch Processing

```bash
#!/bin/bash
# extract_all_images.sh

OUTPUT_DIR="extracted_images"
mkdir -p "$OUTPUT_DIR"

# Process all PDFs
for pdf in *.pdf; do
  [ -e "$pdf" ] || continue
  echo "Processing: $pdf"
  subdir="$OUTPUT_DIR/${pdf%.pdf}"
  mkdir -p "$subdir"
  pdfimages -all "$pdf" "$subdir/img"
done

# Process all DOCX
for doc in *.docx; do
  [ -e "$doc" ] || continue
  echo "Processing: $doc"
  subdir="$OUTPUT_DIR/${doc%.docx}"
  mkdir -p "$subdir"
  unzip -j "$doc" "word/media/*" -d "$subdir/" 2>/dev/null
done

# Process all PPTX
for ppt in *.pptx; do
  [ -e "$ppt" ] || continue
  echo "Processing: $ppt"
  subdir="$OUTPUT_DIR/${ppt%.pptx}"
  mkdir -p "$subdir"
  unzip -j "$ppt" "ppt/media/*" -d "$subdir/" 2>/dev/null
done

echo "Extraction complete. Images in: $OUTPUT_DIR"
```

## Output Organization

```markdown
extracted_images/
├── by_source/
│   ├── document1/
│   │   ├── img-001.png
│   │   └── img-002.jpg
│   └── document2/
│       └── img-001.png
├── by_type/
│   ├── png/
│   ├── jpg/
│   └── gif/
└── _manifest.txt
```

### Organization Commands
```bash
# Organize by file type (errors from empty globs are suppressed)
mkdir -p by_type/{png,jpg,gif,svg}
mv *.png by_type/png/ 2>/dev/null
mv *.jpg *.jpeg by_type/jpg/ 2>/dev/null
mv *.gif by_type/gif/ 2>/dev/null
mv *.svg by_type/svg/ 2>/dev/null

# Rename with zero-padded prefixes
i=1; for f in *.png; do mv "$f" "image_$(printf %03d "$i").png"; ((i++)); done

# Generate manifest (stat -f works on macOS, stat -c on Linux)
find . \( -name "*.png" -o -name "*.jpg" \) | while IFS= read -r f; do
  echo "$(basename "$f"),$(stat -f%z "$f" 2>/dev/null || stat -c%s "$f"),$(file -b "$f")"
done > manifest.csv
```
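
If you also want pixel dimensions in the manifest without installing ImageMagick, a PNG's width and height can be read straight from its IHDR chunk. A stdlib-only sketch (`png_size` and `write_manifest` are hypothetical helper names):

```python
import csv
import os
import struct

def png_size(path):
    """Read width/height from a PNG's IHDR chunk (bytes 16-24 of the file)."""
    with open(path, "rb") as f:
        header = f.read(24)
    if header[:8] != b"\x89PNG\r\n\x1a\n" or len(header) < 24:
        return None
    return struct.unpack(">II", header[16:24])

def write_manifest(folder, out_csv="manifest.csv"):
    """Write filename, byte size, and (for PNGs) dimensions to a CSV."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "bytes", "dimensions"])
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue
            dims = png_size(path) if name.lower().endswith(".png") else None
            writer.writerow([name, os.path.getsize(path),
                             "%dx%d" % dims if dims else ""])
```

JPEG dimensions require walking the JFIF segment list, so for mixed formats a library such as Pillow is the simpler route.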

## Quality Considerations

### Getting Best Quality

| Source | Best Method | Notes |
|--------|-------------|-------|
| Google Docs | .docx export | Not PDF - preserves resolution |
| PDF (embedded) | pdfimages -all | Gets original embedded images |
| PDF (scanned) | pdftoppm -r 300 | Higher DPI = better quality |
| Office docs | Direct unzip | Images stored at original size |
| Web | Find original URLs | Look for -original or full-size links |

### Common Issues
- **PDF shows low-res**: The images were embedded at low resolution; the PDF never contained more detail
- **DOCX missing images**: They may be linked, not embedded
- **Web images small**: Those are thumbnails; look for high-res versions
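
For the "DOCX missing images" case, you can confirm whether images are linked rather than embedded by checking the document's relationships part for `TargetMode="External"` entries. A stdlib sketch (`external_image_links` is an illustrative name):

```python
import xml.etree.ElementTree as ET
import zipfile

RELS_PATH = "word/_rels/document.xml.rels"
NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

def external_image_links(docx_path):
    """Return URLs of images that are linked, not embedded, in a .docx."""
    with zipfile.ZipFile(docx_path) as zf:
        if RELS_PATH not in zf.namelist():
            return []
        root = ET.fromstring(zf.read(RELS_PATH))
    return [rel.get("Target")
            for rel in root.iter(NS + "Relationship")
            if rel.get("TargetMode") == "External"
            and rel.get("Type", "").endswith("/image")]
```

Any URLs it returns have to be downloaded separately; they will never appear in `word/media/`.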

## Output Format

```markdown
## Image Extraction Report

### Source
- Type: [Google Doc / PDF / DOCX / etc.]
- File: [filename]

### Extraction Method
```bash
[command used]
```

### Results
- Images extracted: [count]
- Total size: [size]
- Formats: [PNG, JPG, etc.]

### Image Details
| # | Filename | Dimensions | Size | Format |
|---|----------|------------|------|--------|
| 1 | img-001.png | 1920x1080 | 2.3MB | PNG |

### Output Location
[path to extracted images]

### Notes
- [Any quality observations]
- [Missing images if any]
```

## Instructions

1. Identify document type
2. Recommend best extraction method
3. Provide copy-paste commands
4. Offer organization options
5. Note any quality limitations
6. Suggest batch processing if multiple files

## Commands

```
"Extract images from [document]"
"Get all images from this PDF"
"Download images from this Google Doc"
"Batch extract from all documents"
"Organize extracted images"
"What format gives best quality?"
```

README.md

## What This Does

Extract all embedded images from Google Docs, PDFs, Word documents, PowerPoints, and web pages. Get high-resolution versions organized in a folder.

## Quick Start

### Step 1: Create an Extraction Folder

```bash
mkdir -p ~/Documents/Extracted-Images
```

### Step 2: Download the Template

Click Download above, then:

```bash
mv ~/Downloads/CLAUDE.md ~/Documents/Extracted-Images/
```

### Step 3: Run Claude Code

```bash
cd ~/Documents/Extracted-Images
claude
```

### Step 4: Request Extraction

Say: "Extract all images from [document/URL]"

## Sources Supported

| Source | Method | Notes |
|--------|--------|-------|
| Google Docs | Export as .docx, extract | Preserves original quality |
| PDF | pdfimages (poppler) | Gets embedded images |
| Word (.docx) | Unzip, find images | Images in word/media/ |
| PowerPoint | Unzip, find images | Images in ppt/media/ |
| Web Pages | wget/curl | Downloads linked images |
| ZIP Archives | Extract, filter | Find images recursively |

## Extraction Methods

### From Google Docs

1. File > Download > Microsoft Word (.docx)
2. Then extract from the .docx:

```bash
# Unzip the docx (it's a ZIP file)
unzip document.docx -d extracted/

# Images are in word/media/
ls extracted/word/media/
```

### From PDF

```bash
# Using pdfimages (from poppler)
pdfimages -all document.pdf output_prefix

# Using pdftoppm for full pages
pdftoppm -png document.pdf page

# Install poppler
# macOS: brew install poppler
# Ubuntu: sudo apt install poppler-utils
```

### From Word/PowerPoint

```bash
# DOCX files
unzip document.docx -d doc_extracted/
cp doc_extracted/word/media/* ./images/

# PPTX files
unzip presentation.pptx -d ppt_extracted/
cp ppt_extracted/ppt/media/* ./images/
```

### From Web Pages

```bash
# Download all images from URL
wget -r -l1 -A jpg,jpeg,png,gif -P images/ "https://example.com/page"

# Or using curl
curl -O "https://example.com/image[1-10].jpg"
```

## Example Output

```
Extracted-Images/
├── google-doc-export/
│   ├── image1.png (1920x1080)
│   ├── image2.jpg (2400x1600)
│   └── image3.png (800x600)
├── pdf-images/
│   ├── page-001.png
│   └── page-002.png
└── _extraction_log.txt
```

## Batch Extraction

```bash
# All PDFs in folder
for pdf in *.pdf; do
  [ -e "$pdf" ] || continue
  mkdir -p "${pdf%.pdf}_images"
  pdfimages -all "$pdf" "${pdf%.pdf}_images/img"
done

# All DOCX files
for doc in *.docx; do
  [ -e "$doc" ] || continue
  mkdir -p "${doc%.docx}_images"
  unzip -j "$doc" "word/media/*" -d "${doc%.docx}_images/"
done
```

## Image Quality

| Source | Quality | Notes |
|--------|---------|-------|
| Google Docs | High | Use .docx export, not PDF |
| Word | Original | Embedded at full resolution |
| PDF | Varies | Depends on how the PDF was created |
| Web | Varies | May be compressed |

### Getting Best Quality

- Google Docs: Download as .docx, not PDF
- PDFs: Use pdfimages with the -all flag
- Web: Look for original/full-size links
- Presentations: Export slides as images if needed

## Tips

- **Rename systematically**: Add prefixes for organization
- **Check dimensions**: Verify you got full resolution
- **Keep metadata**: Some tools preserve EXIF data
- **Deduplicate**: Remove identical images after extraction
- **Convert formats**: Standardize to PNG or JPG
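
The deduplication tip can be sketched as a content-hash pass: byte-identical files share a SHA-256 digest, so every file after the first with a given digest can be deleted (`dedupe_images` is an illustrative helper, stdlib only):

```python
import hashlib
import os

def dedupe_images(folder):
    """Delete byte-identical duplicates, keeping the first file seen."""
    seen = {}
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)
            removed.append(name)
        else:
            seen[digest] = name
    return removed
```

Note this only catches exact duplicates; re-encoded or resized copies hash differently and need perceptual-hash tools instead.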

## Commands

```
"Extract images from this Google Doc"
"Get all images from this PDF"
"Download images from [URL]"
"Extract media from this PowerPoint"
"Batch extract from all documents in folder"
"What's the best quality I can get?"
"Organize extracted images by size"
```

## Troubleshooting

**Images are low quality**: Try a different export method; the source may not contain higher-quality versions.

**PDF extraction fails**: Install poppler: `brew install poppler` or `sudo apt install poppler-utils`.

**Can't unzip an Office doc**: Rename it to .zip first, then extract.

**Missing images**: Some may be linked rather than embedded, so they must be downloaded from their URLs instead.
