File Organization · Beginner

Bulk Image Extractor

Extract all images from Google Docs, PDFs, websites, and other documents in high resolution.

5 minutes
By community
#images #extraction #google-docs #pdf #batch-download
CLAUDE.md Template

Download this file and place it in your project folder to get started.

# Bulk Image Extractor

## Your Role
You help extract embedded images from various document types including Google Docs, PDFs, Office documents, and web pages. You provide the right commands and methods to get high-resolution images.

## Extraction Methods

### Google Docs
The best method is to export as .docx and extract:

```bash
# Step 1: User downloads as .docx from Google Docs
# File > Download > Microsoft Word (.docx)

# Step 2: Extract images from .docx
unzip "Document.docx" -d extracted/
cp extracted/word/media/* ./images/

# Or one-liner
unzip -j "Document.docx" "word/media/*" -d ./images/
```

**Why not PDF?** PDFs often compress images. The .docx export preserves original resolution.

### PDF Files
```bash
# Install poppler (contains pdfimages)
# macOS: brew install poppler
# Ubuntu: sudo apt install poppler-utils
# Windows: choco install poppler

# Extract embedded images (best quality)
pdfimages -all document.pdf output/img

# Extract as PNG specifically
pdfimages -png document.pdf output/img

# For scanned documents (full pages as images)
pdftoppm -png -r 300 document.pdf output/page
```

### Word Documents (.docx)
```bash
# Single file
unzip -j "document.docx" "word/media/*" -d ./extracted_images/

# All DOCX in folder
for doc in *.docx; do
  dir="${doc%.docx}_images"
  mkdir -p "$dir"
  unzip -j "$doc" "word/media/*" -d "$dir/" 2>/dev/null
done
```

### PowerPoint (.pptx)
```bash
# Single file
unzip -j "presentation.pptx" "ppt/media/*" -d ./extracted_images/

# All PPTX in folder
for ppt in *.pptx; do
  dir="${ppt%.pptx}_images"
  mkdir -p "$dir"
  unzip -j "$ppt" "ppt/media/*" -d "$dir/" 2>/dev/null
done
```

### Excel (.xlsx)
```bash
# Images are in xl/media/
unzip -j "spreadsheet.xlsx" "xl/media/*" -d ./extracted_images/
```
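
Since .docx, .pptx, and .xlsx are all ZIP containers with a fixed media folder, the three formats above can also be handled by one small Python helper. This is a stdlib-only sketch; `extract_office_images` is an illustrative name, not an existing tool:

```python
import os
import zipfile

# Media folders used by the Office formats covered above
MEDIA_DIRS = ("word/media/", "ppt/media/", "xl/media/")

def extract_office_images(path, out_dir="extracted_images"):
    """Copy every embedded image out of a .docx/.pptx/.xlsx file."""
    os.makedirs(out_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.startswith(MEDIA_DIRS) and not name.endswith("/"):
                target = os.path.join(out_dir, os.path.basename(name))
                with open(target, "wb") as dst:
                    dst.write(zf.read(name))
                extracted.append(target)
    return extracted
```

Because the images are copied byte-for-byte from the archive, this preserves their original resolution, just like the `unzip -j` commands above.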

### Web Pages
```bash
# Download all images from a page
wget -r -l1 -H -A jpg,jpeg,png,gif,webp -P images/ "https://example.com"

# Just images, no directory structure
wget -nd -r -l1 -A jpg,jpeg,png,gif -P images/ "https://example.com"

# Using curl for specific patterns
curl -O "https://example.com/images/photo[1-50].jpg"

# Using Python (if wget not available): grab <img> URLs with the stdlib
python3 -c '
import re, sys, urllib.parse, urllib.request
base = sys.argv[1]
html = urllib.request.urlopen(base).read().decode(errors="ignore")
for src in re.findall(r"<img[^>]+src=[\"\x27]([^\"\x27]+)", html):
    url = urllib.parse.urljoin(base, src)
    urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1] or "image")
' "https://example.com"
```

### EPUB Files
```bash
# EPUB is also a ZIP
unzip -j "book.epub" "OEBPS/images/*" -d ./extracted_images/
# or
unzip -j "book.epub" "images/*" -d ./extracted_images/
```
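
EPUB image paths vary by publisher (`OEBPS/images/`, `images/`, `OPS/img/`, and so on), so matching on file extension is more robust than guessing the folder. A stdlib-only sketch with an illustrative helper name:

```python
import os
import zipfile

# EPUB image locations vary by publisher, so match on extension instead
IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

def extract_epub_images(path, out_dir="extracted_images"):
    """Pull every image out of an EPUB, wherever the publisher put it."""
    os.makedirs(out_dir, exist_ok=True)
    found = []
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.lower().endswith(IMAGE_EXTS):
                target = os.path.join(out_dir, os.path.basename(name))
                with open(target, "wb") as dst:
                    dst.write(zf.read(name))
                found.append(target)
    return found
```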

## Batch Processing

```bash
#!/bin/bash
# extract_all_images.sh

OUTPUT_DIR="extracted_images"
mkdir -p "$OUTPUT_DIR"

# Process all PDFs
for pdf in *.pdf; do
  [ -e "$pdf" ] || continue
  echo "Processing: $pdf"
  subdir="$OUTPUT_DIR/${pdf%.pdf}"
  mkdir -p "$subdir"
  pdfimages -all "$pdf" "$subdir/img"
done

# Process all DOCX
for doc in *.docx; do
  [ -e "$doc" ] || continue
  echo "Processing: $doc"
  subdir="$OUTPUT_DIR/${doc%.docx}"
  mkdir -p "$subdir"
  unzip -j "$doc" "word/media/*" -d "$subdir/" 2>/dev/null
done

# Process all PPTX
for ppt in *.pptx; do
  [ -e "$ppt" ] || continue
  echo "Processing: $ppt"
  subdir="$OUTPUT_DIR/${ppt%.pptx}"
  mkdir -p "$subdir"
  unzip -j "$ppt" "ppt/media/*" -d "$subdir/" 2>/dev/null
done

echo "Extraction complete. Images in: $OUTPUT_DIR"
```

## Output Organization

```markdown
extracted_images/
├── by_source/
│   ├── document1/
│   │   ├── img-001.png
│   │   └── img-002.jpg
│   └── document2/
│       └── img-001.png
├── by_type/
│   ├── png/
│   ├── jpg/
│   └── gif/
└── _manifest.txt
```

### Organization Commands
```bash
# Organize by file type (errors from empty globs are suppressed)
mkdir -p by_type/{png,jpg,gif,svg}
mv *.png by_type/png/ 2>/dev/null
mv *.jpg *.jpeg by_type/jpg/ 2>/dev/null
mv *.gif by_type/gif/ 2>/dev/null
mv *.svg by_type/svg/ 2>/dev/null

# Rename with zero-padded prefixes
i=1; for f in *.png; do mv "$f" "image_$(printf %03d "$i").png"; ((i++)); done

# Generate manifest (stat -f works on macOS, stat -c on Linux)
find . \( -name "*.png" -o -name "*.jpg" \) | while IFS= read -r f; do
  echo "$(basename "$f"),$(stat -f%z "$f" 2>/dev/null || stat -c%s "$f"),$(file -b "$f")"
done > manifest.csv
```
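
If you also want pixel dimensions in the manifest without installing ImageMagick, a PNG's width and height can be read straight from its IHDR chunk. A stdlib-only sketch (`png_size` and `write_manifest` are hypothetical helper names):

```python
import csv
import os
import struct

def png_size(path):
    """Read width/height from a PNG's IHDR chunk (bytes 16-24 of the file)."""
    with open(path, "rb") as f:
        header = f.read(24)
    if header[:8] != b"\x89PNG\r\n\x1a\n" or len(header) < 24:
        return None
    return struct.unpack(">II", header[16:24])

def write_manifest(folder, out_csv="manifest.csv"):
    """Write filename, byte size, and (for PNGs) dimensions to a CSV."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "bytes", "dimensions"])
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue
            dims = png_size(path) if name.lower().endswith(".png") else None
            writer.writerow([name, os.path.getsize(path),
                             "%dx%d" % dims if dims else ""])
```

JPEG dimensions require walking the JFIF segment list, so for mixed formats a library such as Pillow is the simpler route.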

## Quality Considerations

### Getting Best Quality

| Source | Best Method | Notes |
|--------|-------------|-------|
| Google Docs | .docx export | Not PDF - preserves resolution |
| PDF (embedded) | pdfimages -all | Gets original embedded images |
| PDF (scanned) | pdftoppm -r 300 | Higher DPI = better quality |
| Office docs | Direct unzip | Images stored at original size |
| Web | Find original URLs | Look for -original or full-size links |

### Common Issues
- **PDF shows low-res**: The images were embedded at low resolution; the PDF never contained more detail
- **DOCX missing images**: They may be linked, not embedded
- **Web images small**: Those are thumbnails; look for high-res versions
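
For the "DOCX missing images" case, you can confirm whether images are linked rather than embedded by checking the document's relationships part for `TargetMode="External"` entries. A stdlib sketch (`external_image_links` is an illustrative name):

```python
import xml.etree.ElementTree as ET
import zipfile

RELS_PATH = "word/_rels/document.xml.rels"
NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

def external_image_links(docx_path):
    """Return URLs of images that are linked, not embedded, in a .docx."""
    with zipfile.ZipFile(docx_path) as zf:
        if RELS_PATH not in zf.namelist():
            return []
        root = ET.fromstring(zf.read(RELS_PATH))
    return [rel.get("Target")
            for rel in root.iter(NS + "Relationship")
            if rel.get("TargetMode") == "External"
            and rel.get("Type", "").endswith("/image")]
```

Any URLs it returns have to be downloaded separately; they will never appear in `word/media/`.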

## Output Format

```markdown
## Image Extraction Report

### Source
- Type: [Google Doc / PDF / DOCX / etc.]
- File: [filename]

### Extraction Method
```bash
[command used]
```

### Results
- Images extracted: [count]
- Total size: [size]
- Formats: [PNG, JPG, etc.]

### Image Details
| # | Filename | Dimensions | Size | Format |
|---|----------|------------|------|--------|
| 1 | img-001.png | 1920x1080 | 2.3MB | PNG |

### Output Location
[path to extracted images]

### Notes
- [Any quality observations]
- [Missing images if any]
```

## Instructions

1. Identify document type
2. Recommend best extraction method
3. Provide copy-paste commands
4. Offer organization options
5. Note any quality limitations
6. Suggest batch processing if multiple files

## Commands

```
"Extract images from [document]"
"Get all images from this PDF"
"Download images from this Google Doc"
"Batch extract from all documents"
"Organize extracted images"
"What format gives best quality?"
```

README.md

## What This Does

Extract all embedded images from Google Docs, PDFs, Word documents, PowerPoints, and web pages. Get high-resolution versions organized in a folder.

## Quick Start

### Step 1: Create an Extraction Folder

```bash
mkdir -p ~/Documents/Extracted-Images
```

### Step 2: Download the Template

Click Download above, then:

```bash
mv ~/Downloads/CLAUDE.md ~/Documents/Extracted-Images/
```

### Step 3: Run Claude Code

```bash
cd ~/Documents/Extracted-Images
claude
```

### Step 4: Request Extraction

Say: "Extract all images from [document/URL]"

## Sources Supported

| Source | Method | Notes |
|--------|--------|-------|
| Google Docs | Export as .docx, extract | Preserves original quality |
| PDF | pdfimages (poppler) | Gets embedded images |
| Word (.docx) | Unzip, find images | Images in word/media/ |
| PowerPoint | Unzip, find images | Images in ppt/media/ |
| Web Pages | wget/curl | Downloads linked images |
| ZIP Archives | Extract, filter | Find images recursively |

## Extraction Methods

### From Google Docs

1. File > Download > Microsoft Word (.docx)
2. Then extract from the .docx:

```bash
# Unzip the docx (it's a ZIP file)
unzip document.docx -d extracted/

# Images are in word/media/
ls extracted/word/media/
```

### From PDF

```bash
# Using pdfimages (from poppler)
pdfimages -all document.pdf output_prefix

# Using pdftoppm for full pages
pdftoppm -png document.pdf page

# Install poppler
# macOS: brew install poppler
# Ubuntu: sudo apt install poppler-utils
```

### From Word/PowerPoint

```bash
# DOCX files
unzip document.docx -d doc_extracted/
cp doc_extracted/word/media/* ./images/

# PPTX files
unzip presentation.pptx -d ppt_extracted/
cp ppt_extracted/ppt/media/* ./images/
```

### From Web Pages

```bash
# Download all images from URL
wget -r -l1 -A jpg,jpeg,png,gif -P images/ "https://example.com/page"

# Or using curl
curl -O "https://example.com/image[1-10].jpg"
```

## Example Output

```
Extracted-Images/
├── google-doc-export/
│   ├── image1.png (1920x1080)
│   ├── image2.jpg (2400x1600)
│   └── image3.png (800x600)
├── pdf-images/
│   ├── page-001.png
│   └── page-002.png
└── _extraction_log.txt
```

## Batch Extraction

```bash
# All PDFs in folder
for pdf in *.pdf; do
  [ -e "$pdf" ] || continue
  mkdir -p "${pdf%.pdf}_images"
  pdfimages -all "$pdf" "${pdf%.pdf}_images/img"
done

# All DOCX files
for doc in *.docx; do
  [ -e "$doc" ] || continue
  mkdir -p "${doc%.docx}_images"
  unzip -j "$doc" "word/media/*" -d "${doc%.docx}_images/"
done
```

## Image Quality

| Source | Quality | Notes |
|--------|---------|-------|
| Google Docs | High | Use .docx export, not PDF |
| Word | Original | Embedded at full resolution |
| PDF | Varies | Depends on how the PDF was created |
| Web | Varies | May be compressed |

### Getting Best Quality

- Google Docs: Download as .docx, not PDF
- PDFs: Use pdfimages with the -all flag
- Web: Look for original/full-size links
- Presentations: Export slides as images if needed

## Tips

- **Rename systematically**: Add prefixes for organization
- **Check dimensions**: Verify you got full resolution
- **Keep metadata**: Some tools preserve EXIF data
- **Deduplicate**: Remove identical images after extraction
- **Convert formats**: Standardize to PNG or JPG
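
The deduplication tip can be sketched as a content-hash pass: byte-identical files share a SHA-256 digest, so every file after the first with a given digest can be deleted (`dedupe_images` is an illustrative helper, stdlib only):

```python
import hashlib
import os

def dedupe_images(folder):
    """Delete byte-identical duplicates, keeping the first file seen."""
    seen = {}
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(path)
            removed.append(name)
        else:
            seen[digest] = name
    return removed
```

Note this only catches exact duplicates; re-encoded or resized copies hash differently and need perceptual-hash tools instead.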

## Commands

```
"Extract images from this Google Doc"
"Get all images from this PDF"
"Download images from [URL]"
"Extract media from this PowerPoint"
"Batch extract from all documents in folder"
"What's the best quality I can get?"
"Organize extracted images by size"
```

## Troubleshooting

**Images are low quality**: Try a different export method; the source may not contain higher-quality versions.

**PDF extraction fails**: Install poppler: `brew install poppler` or `sudo apt install poppler-utils`.

**Can't unzip an Office doc**: Rename it to .zip first, then extract.

**Missing images**: Some may be linked rather than embedded, so they must be downloaded from their URLs instead.
