Home
cd ../playbooks
Academic ResearchIntermediate

Paper Figures to PowerPoint

Extract all figures, tables, and equations from a research paper PDF and generate a polished PowerPoint deck automatically.

10 minutes
By RebeccaStevensonSource
#powerpoint#research-papers#figures#tables#pdf#academic#presentations
CLAUDE.md Template

Download this file and place it in your project folder to get started.

# Paper Figures PowerPoint Generator

Generate a PowerPoint presentation from a Marker-converted paper folder or directly from raw PDFs. The deck includes a title slide with paper metadata followed by one slide per figure, table, and equation, preserving markdown document order for content slides.

## When to Use

Activate when the user wants to:
- Create a PowerPoint from a research paper's figures and tables
- Generate a visual summary deck from a Marker-converted PDF
- Extract all figures/tables/equations from a paper into slides
- Generate PPTX decks from existing Marker output folders
- Generate PPTX decks directly from PDF(s) when markdown does not exist yet

## Input Requirements

Preferred input is a paper folder from `outputs/library_markdown_output/` containing:
- **Primary `.md` file** - the Marker conversion (not `_citation.md` or `_no_refs.md`)
- **`_citation.md`** - enriched metadata with DOI, APA citation, abstract
- **Figure image files** - `_page_*_Figure_*.(jpeg|jpg|png|webp)` extracted by Marker (`_Picture_` assets are intentionally ignored)

You can point this workflow at a Marker output directory and select one paper folder (or all folders), or provide raw PDF input and let the wrapper convert first.

## Workflow

### 1. Preferred: one-step workflow from raw PDF(s)

```bash
bash scripts/marker_output_to_pptx.sh \
  --pdf-input /path/to/paper_or_pdf_folder \
  --marker-output /path/to/marker_output \
  --all \
  --output-dir /path/to/pptx_output
```

Behavior:
- The wrapper runs `marker_single` directly (one PDF at a time).
- Conversion state is tracked in `<marker_output>/state` so reruns can resume.
- Keep image extraction enabled (do **not** disable image extraction), because this workflow needs `_page_*_Figure_*` files.
- Figure-slide default mode is **figure-only** (no caption text block).

### 1b. Reprocess a PDF for slide-quality figure/caption extraction (recommended when captions are noisy)

Use this when figure slides come out with missing/weak captions, too many `_Picture_` assets, or poor OCR text around figure headings.

```bash
PDF=/path/to/paper.pdf
OUT=/path/to/marker_output_reprocessed

bash scripts/marker_output_to_pptx.sh \
  --pdf-input "$PDF" \
  --marker-output "$OUT" \
  --output-dir /tmp/pptx_out \
  --marker-use-llm \
  --marker-service-timeout 120 \
  --marker-extra-arg --force_ocr \
  --marker-extra-arg --strip_existing_ocr \
  --marker-extra-arg --highres_image_dpi \
  --marker-extra-arg 256 \
  --marker-extra-arg --lowres_image_dpi \
  --marker-extra-arg 128 \
  --marker-extra-arg --format_lines
```

Notes:
- Use a fresh output directory for reprocessing so you can compare baseline vs reprocessed markdown/images.
- For born-digital PDFs with already clean text, try the same command without `--force_ocr` and `--strip_existing_ocr`.

Quick quality checks before PPTX generation:
- Figure assets count: `find "$OUT" -type f | rg '_page_[0-9]+_Figure_[0-9]+\.(jpeg|jpg|png|webp)$' | wc -l`
- Figure heading coverage: `rg -n '^(#{1,6}\s+)?(Figure|Fig\.)\s+[A-Za-z0-9.-]+' "$OUT"`

### 2. Generate PPTX from Marker output

```bash
bash scripts/marker_output_to_pptx.sh \
  --marker-output /path/to/marker_output
```

Useful options:
- `--paper <name-or-path>` - choose one paper when marker output contains multiple paper folders
- `--all` - generate decks for all detected paper folders
- `--output` / `-o` - output `.pptx` path for a single paper
- `--output-dir` - output directory for one or many generated decks
- `--pdf-input <path>` - optional conversion pre-step from raw PDFs
- `--experimental-figure-captions` - experimental mode that adds extracted caption text to figure slides (prone to OCR/parse errors)
- `--with-captions` - alias for `--experimental-figure-captions`
- `--marker-force` - reprocess previously converted PDFs
- `--marker-fail-fast` - stop on first PDF conversion failure
- `--marker-timeout <seconds>` - per-PDF conversion timeout (default 1000)
- `--marker-state-dir <path>` - custom location for conversion logs/state files
- `--skip-tables` - omit table slides
- `--skip-equations` - omit equation slides

### 3. Direct generator usage (advanced)

If the user gives a paper name or partial match, find the folder:
```bash
ls outputs/library_markdown_output/ | grep -i "<search_term>"
```

```bash
python scripts/generate_paper_pptx.py \
  "outputs/library_markdown_output/<paper_folder>" \
  --output "outputs/<paper_name>_figures.pptx"
```

**Options:**
- `--output` / `-o` - Custom output path (default: `<folder>/<name>_figures.pptx`)
- `--experimental-figure-captions` - experimental caption mode for figure slides
- `--with-captions` - alias for `--experimental-figure-captions`
- `--skip-tables` - Omit table slides
- `--skip-equations` - Omit equation slides

### 4. Verify output

After generation, report to the user:
- Total number of slides
- Breakdown: N figures, N tables, N equations
- Output file path

## Scripts

| Script | Purpose |
|---|---|
| `scripts/marker_output_to_pptx.sh` | Wrapper that can convert raw PDF(s) first, then generate one or many decks |
| `scripts/generate_paper_pptx.py` | Core PPTX generator for a single Marker-converted paper folder |

## Slide Layout

- **Slide 1 (Title)**: Paper title, authors, APA citation, DOI, truncated abstract
- **Figure slides (default)**: Figure image with simplified heading `Figure N` (no caption text block).
- **Figure slides (experimental captions mode)**: Figure image (left ~65%) + extracted caption block (right ~35%). This mode is prone to OCR/parse errors.
- **Table slides**: Table title + native PowerPoint table with header row styling
- **Equation slides**: Equation label + rendered equation image (matplotlib mathtext) in a highlighted box + preceding context paragraph. Falls back to raw LaTeX source if rendering is unavailable.

## Dependencies

- Python 3.10+
- `python-pptx` (install: `pip install python-pptx`)
- `Pillow` (optional, for correct image aspect ratios; install: `pip install Pillow`)
- `matplotlib` (optional, for rendering equation LaTeX into slide images; install: `pip install matplotlib`)
- `marker_single` on `PATH` for PDF-to-markdown conversion

## Marker Output Patterns

The script parses these patterns from Marker-generated markdown:

- **Images**: `![](_page_X_Figure_Y.jpeg)` (and related image extensions) — inline figure references (`_Picture_` images are skipped)
- **Figure captions**: `## Figure N. ...` or plain `Figure N. ...` on nearby lines; the generator combines heading + nearby caption lines into one caption block
- **Tables**: Markdown pipe tables (`| col1 | col2 |`) preceded by `Table N. Title` headings
- **Equations**: `$$...$$` blocks with optional `\tag{N}` or `\n(N)` equation numbers

## Customization

To adjust colors, fonts, or layout, edit `scripts/generate_paper_pptx.py`. Key constants at the top:
- `ACCENT_COLOR`, `TITLE_COLOR`, `BODY_COLOR` — color palette
- `TITLE_FONT`, `BODY_FONT` — typography
- `SLIDE_WIDTH`, `SLIDE_HEIGHT` — slide dimensions (default: widescreen 16:9)
README.md

What This Does

Converts a research paper PDF into a PowerPoint presentation with one slide per figure, table, and equation. Uses Marker for PDF-to-markdown conversion, then parses the output to build a polished 16:9 deck with a title slide (metadata, DOI, abstract) followed by content slides preserving document order.


Quick Start

Step 1: Install Dependencies

pip install python-pptx Pillow matplotlib

You also need marker_single on your PATH for PDF conversion. See Marker installation.

Step 2: Download the Template

Click Download above to get the CLAUDE.md file, then place it in your project folder along with the scripts from the source repository.

Step 3: Generate a Deck

claude

Say: "Create a PowerPoint from this paper's figures" and provide your PDF path.


Slide Layout

Slide Type Content
Title Paper title, authors, APA citation, DOI, truncated abstract
Figure Figure image with heading (optional caption in experimental mode)
Table Table title + native PowerPoint table with header row styling
Equation Label + rendered LaTeX image in highlighted box + context paragraph

Two Ways to Run

From raw PDF (recommended):

bash scripts/marker_output_to_pptx.sh \
  --pdf-input /path/to/paper.pdf \
  --marker-output /path/to/marker_output \
  --output-dir /path/to/pptx_output

From existing Marker output:

bash scripts/marker_output_to_pptx.sh \
  --marker-output /path/to/marker_output

Key Options

Flag What It Does
--all Generate decks for all papers in the output folder
--paper <name> Choose one specific paper
--with-captions Experimental: add extracted caption text to figure slides
--skip-tables Omit table slides
--skip-equations Omit equation slides
--marker-force Reprocess previously converted PDFs
--marker-use-llm Use LLM-enhanced extraction for better captions

Tips

  • Caption quality varies — the default figure-only mode is more reliable; use --with-captions experimentally
  • Reprocess noisy PDFs — use --marker-use-llm with --force_ocr for scanned papers with poor text extraction
  • Check before generating — verify figure count with find "$OUT" -type f | rg '_page_[0-9]+_Figure_[0-9]+' | wc -l
  • Customize the look — edit generate_paper_pptx.py constants for colors, fonts, and slide dimensions

Example Prompts

"Create a PowerPoint from this paper's figures"
"Extract all figures and tables from my PDF into slides"
"Generate a presentation deck from my Marker output folder"
"Make a figure deck from this paper, skip equations"
"Reprocess this PDF with better OCR and generate slides"

$Related Playbooks