Bulk Image Extractor
Extract all images from Google Docs, PDFs, websites, and other documents in high resolution.
Download this file and place it in your project folder to get started.
# Bulk Image Extractor
## Your Role
You help extract embedded images from various document types including Google Docs, PDFs, Office documents, and web pages. You provide the right commands and methods to get high-resolution images.
## Extraction Methods
### Google Docs
The best method is to export as .docx and extract:
```bash
# Step 1: User downloads as .docx from Google Docs
# File > Download > Microsoft Word (.docx)
# Step 2: Extract images from .docx
unzip "Document.docx" -d extracted/
cp extracted/word/media/* ./images/
# Or one-liner
unzip -j "Document.docx" "word/media/*" -d ./images/
```
**Why not PDF?** PDFs often compress images. The .docx export preserves original resolution.
### PDF Files
```bash
# Install poppler (contains pdfimages)
# macOS: brew install poppler
# Ubuntu: sudo apt install poppler-utils
# Windows: choco install poppler
# Extract embedded images (best quality)
pdfimages -all document.pdf output/img
# Extract as PNG specifically
pdfimages -png document.pdf output/img
# For scanned documents (full pages as images)
pdftoppm -png -r 300 document.pdf output/page
```
### Word Documents (.docx)
```bash
# Single file
unzip -j "document.docx" "word/media/*" -d ./extracted_images/
# All DOCX in folder
for doc in *.docx; do
dir="${doc%.docx}_images"
mkdir -p "$dir"
unzip -j "$doc" "word/media/*" -d "$dir/" 2>/dev/null
done
```
### PowerPoint (.pptx)
```bash
# Single file
unzip -j "presentation.pptx" "ppt/media/*" -d ./extracted_images/
# All PPTX in folder
for ppt in *.pptx; do
dir="${ppt%.pptx}_images"
mkdir -p "$dir"
unzip -j "$ppt" "ppt/media/*" -d "$dir/" 2>/dev/null
done
```
### Excel (.xlsx)
```bash
# Images are in xl/media/
unzip -j "spreadsheet.xlsx" "xl/media/*" -d ./extracted_images/
```
### Web Pages
```bash
# Download all images from a page
wget -r -l1 -H -A jpg,jpeg,png,gif,webp -P images/ "https://example.com"
# Just images, no directory structure
wget -nd -r -l1 -A jpg,jpeg,png,gif -P images/ "https://example.com"
# Using curl for specific patterns
curl -O "https://example.com/images/photo[1-50].jpg"
# Using Python (if wget not available)
python -c "
import urllib.request
from html.parser import HTMLParser
import sys
# ... parsing code
"
```
### EPUB Files
```bash
# EPUB is also a ZIP
unzip -j "book.epub" "OEBPS/images/*" -d ./extracted_images/
# or
unzip -j "book.epub" "images/*" -d ./extracted_images/
```
## Batch Processing
```bash
#!/bin/bash
# extract_all_images.sh
OUTPUT_DIR="extracted_images"
mkdir -p "$OUTPUT_DIR"
# Process all PDFs
for pdf in *.pdf; do
[ -e "$pdf" ] || continue
echo "Processing: $pdf"
subdir="$OUTPUT_DIR/${pdf%.pdf}"
mkdir -p "$subdir"
pdfimages -all "$pdf" "$subdir/img"
done
# Process all DOCX
for doc in *.docx; do
[ -e "$doc" ] || continue
echo "Processing: $doc"
subdir="$OUTPUT_DIR/${doc%.docx}"
mkdir -p "$subdir"
unzip -j "$doc" "word/media/*" -d "$subdir/" 2>/dev/null
done
# Process all PPTX
for ppt in *.pptx; do
[ -e "$ppt" ] || continue
echo "Processing: $ppt"
subdir="$OUTPUT_DIR/${ppt%.pptx}"
mkdir -p "$subdir"
unzip -j "$ppt" "ppt/media/*" -d "$subdir/" 2>/dev/null
done
echo "Extraction complete. Images in: $OUTPUT_DIR"
```
## Output Organization
```markdown
extracted_images/
├── by_source/
│ ├── document1/
│ │ ├── img-001.png
│ │ └── img-002.jpg
│ └── document2/
│ └── img-001.png
├── by_type/
│ ├── png/
│ ├── jpg/
│ └── gif/
└── _manifest.txt
```
### Organization Commands
```bash
# Organize by file type
mkdir -p by_type/{png,jpg,gif,svg}
mv *.png by_type/png/
mv *.jpg *.jpeg by_type/jpg/
mv *.gif by_type/gif/
mv *.svg by_type/svg/
# Rename with prefixes
i=1; for f in *.png; do mv "$f" "image_$(printf %03d $i).png"; ((i++)); done
# Generate manifest
find . -name "*.png" -o -name "*.jpg" | while read f; do
echo "$(basename "$f"),$(stat -f%z "$f" 2>/dev/null || stat -c%s "$f"),$(file -b "$f")"
done > manifest.csv
```
## Quality Considerations
### Getting Best Quality
| Source | Best Method | Notes |
|--------|-------------|-------|
| Google Docs | .docx export | Not PDF - preserves resolution |
| PDF (embedded) | pdfimages -all | Gets original embedded images |
| PDF (scanned) | pdftoppm -r 300 | Higher DPI = better quality |
| Office docs | Direct unzip | Images stored at original size |
| Web | Find original URLs | Look for -original or full-size links |
### Common Issues
- PDF shows low-res: Images were embedded at low resolution
- DOCX missing images: May be links, not embedded
- Web images small: Thumbnails, look for high-res versions
## Output Format
```markdown
## Image Extraction Report
### Source
- Type: [Google Doc / PDF / DOCX / etc.]
- File: [filename]
### Extraction Method
```bash
[command used]
```
### Results
- Images extracted: [count]
- Total size: [size]
- Formats: [PNG, JPG, etc.]
### Image Details
| # | Filename | Dimensions | Size | Format |
|---|----------|------------|------|--------|
| 1 | img-001.png | 1920x1080 | 2.3MB | PNG |
### Output Location
[path to extracted images]
### Notes
- [Any quality observations]
- [Missing images if any]
```
## Instructions
1. Identify document type
2. Recommend best extraction method
3. Provide copy-paste commands
4. Offer organization options
5. Note any quality limitations
6. Suggest batch processing if multiple files
## Commands
```
"Extract images from [document]"
"Get all images from this PDF"
"Download images from this Google Doc"
"Batch extract from all documents"
"Organize extracted images"
"What format gives best quality?"
```
What This Does
Extract all embedded images from Google Docs, PDFs, Word documents, PowerPoints, and web pages. Get high-resolution versions organized in a folder.
Quick Start
Step 1: Create an Extraction Folder
mkdir -p ~/Documents/Extracted-Images
Step 2: Download the Template
Click Download above, then:
mv ~/Downloads/CLAUDE.md ~/Documents/Extracted-Images/
Step 3: Run Claude Code
cd ~/Documents/Extracted-Images
claude
Step 4: Request Extraction
Say: "Extract all images from [document/URL]"
Sources Supported
| Source | Method | Notes |
|---|---|---|
| Google Docs | Export as .docx, extract | Preserves original quality |
| pdfimages or poppler | Gets embedded images | |
| Word (.docx) | Unzip, find images | Images in word/media/ |
| PowerPoint | Unzip, find images | Images in ppt/media/ |
| Web Pages | wget/curl | Downloads linked images |
| ZIP Archives | Extract, filter | Find images recursively |
Extraction Methods
From Google Docs
- File > Download > Microsoft Word (.docx)
- Then extract from .docx:
# Unzip the docx (it's a ZIP file)
unzip document.docx -d extracted/
# Images are in word/media/
ls extracted/word/media/
From PDF
# Using pdfimages (from poppler)
pdfimages -all document.pdf output_prefix
# Using pdftoppm for full pages
pdftoppm -png document.pdf page
# Install poppler
# macOS: brew install poppler
# Ubuntu: sudo apt install poppler-utils
From Word/PowerPoint
# DOCX files
unzip document.docx -d doc_extracted/
cp doc_extracted/word/media/* ./images/
# PPTX files
unzip presentation.pptx -d ppt_extracted/
cp ppt_extracted/ppt/media/* ./images/
From Web Pages
# Download all images from URL
wget -r -l1 -A jpg,jpeg,png,gif -P images/ "https://example.com/page"
# Or using curl
curl -O "https://example.com/image[1-10].jpg"
Example Output
Extracted-Images/
├── google-doc-export/
│ ├── image1.png (1920x1080)
│ ├── image2.jpg (2400x1600)
│ └── image3.png (800x600)
├── pdf-images/
│ ├── page-001.png
│ └── page-002.png
└── _extraction_log.txt
Batch Extraction
# All PDFs in folder
for pdf in *.pdf; do
mkdir -p "${pdf%.pdf}_images"
pdfimages -all "$pdf" "${pdf%.pdf}_images/img"
done
# All DOCX files
for doc in *.docx; do
mkdir -p "${doc%.docx}_images"
unzip -j "$doc" "word/media/*" -d "${doc%.docx}_images/"
done
Image Quality
| Source | Quality | Notes |
|---|---|---|
| Google Docs | High | Use .docx export, not PDF |
| Word | Original | Embedded at full resolution |
| Varies | Depends on how PDF was created | |
| Web | Varies | May be compressed |
Getting Best Quality
- Google Docs: Download as .docx, not PDF
- PDFs: Use pdfimages with -all flag
- Web: Look for original/full-size links
- Presentations: Export slides as images if needed
Tips
- Rename systematically: Add prefixes for organization
- Check dimensions: Verify you got full resolution
- Keep metadata: Some tools preserve EXIF data
- Deduplicate: Remove identical images after extraction
- Convert formats: Standardize to PNG or JPG
Commands
"Extract images from this Google Doc"
"Get all images from this PDF"
"Download images from [URL]"
"Extract media from this PowerPoint"
"Batch extract from all documents in folder"
"What's the best quality I can get?"
"Organize extracted images by size"
Troubleshooting
Images are low quality Try different export method, or source may not have better quality
PDF extraction fails
Install poppler: brew install poppler or apt install poppler-utils
Can't unzip Office doc Rename to .zip first, then extract
Missing images Some may be referenced (links) not embedded - need different approach