Office to Markdown Converter
Convert Office documents to Markdown for version control and AI processing using Microsoft's markitdown.
You can commit a Word doc to Git, but you can't diff it meaningfully. You can't grep inside a PowerPoint or feed a PDF straight into an AI workflow. Your team's knowledge lives in binary formats that most text-based tools can't read. Converting to Markdown unlocks version control, search, and AI processing.
Who it's for:
- Teams migrating documentation from Word to Markdown-based systems
- Developers who need to process Office docs programmatically
- Organizations moving from SharePoint to Git-based documentation
- Technical writers converting legacy docs for static site generators
- Anyone feeding Office documents into AI pipelines
Example
"Convert our 50 Word SOPs to Markdown for our docs site" → Clean Markdown files preserving headings, tables, lists, and images, with frontmatter metadata, ready for Jekyll/Hugo/Docusaurus, and a conversion report noting any formatting that needed manual review
# Office to Markdown
## Overview
This workflow enables conversion from various Office formats to Markdown using **markitdown** - Microsoft's open-source tool for converting documents to Markdown. Perfect for making Office content searchable, version-controllable, and AI-friendly.
## How to Use
1. Provide the Office file (Word, Excel, PowerPoint, PDF, etc.)
2. Optionally specify conversion options
3. I'll convert it to clean Markdown
**Example prompts:**
- "Convert this Word document to Markdown"
- "Turn this PowerPoint into Markdown notes"
- "Extract content from this PDF as Markdown"
- "Convert this Excel file to Markdown tables"
## Domain Knowledge
### markitdown Fundamentals
```python
from markitdown import MarkItDown

# Initialize the converter
md = MarkItDown()

# Convert a file
result = md.convert("document.docx")
print(result.text_content)

# Save to a file (explicit UTF-8 avoids platform-dependent default encodings)
with open("output.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)
```
### Supported Formats
| Format | Extension | Notes |
|--------|-----------|-------|
| Word | .docx | Full text, tables, basic formatting |
| Excel | .xlsx | Converts to Markdown tables |
| PowerPoint | .pptx | Slides as sections |
| PDF | .pdf | Text extraction |
| HTML | .html | Clean markdown |
| Images | .jpg, .png | OCR with vision model |
| Audio | .mp3, .wav | Transcription |
| ZIP | .zip | Processes contained files |
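
Before converting a mixed folder, it can help to sort files by whether their extension appears in the table above. The sketch below is a minimal, standard-library-only illustration; `SUPPORTED_EXTENSIONS`, `describe_format`, and `split_supported` are hypothetical names introduced here, not part of markitdown, and the check looks at the extension only, not at the file contents.

```python
from pathlib import Path

# Extensions copied from the table above. This mapping is illustrative;
# it is not something markitdown itself exposes.
SUPPORTED_EXTENSIONS = {
    ".docx": "Word",
    ".xlsx": "Excel",
    ".pptx": "PowerPoint",
    ".pdf": "PDF",
    ".html": "HTML",
    ".jpg": "Image",
    ".png": "Image",
    ".mp3": "Audio",
    ".wav": "Audio",
    ".zip": "ZIP",
}


def describe_format(path):
    """Return the format label for a path, or None if the extension is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())


def split_supported(paths):
    """Partition paths into (convertible, skipped) based on extension only."""
    convertible, skipped = [], []
    for p in paths:
        (convertible if describe_format(p) else skipped).append(p)
    return convertible, skipped
```

Files in the skipped list (for example legacy `.doc` files) can then be reported for manual handling instead of failing mid-batch.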
### Basic Usage
#### Python API
```python
from markitdown import MarkItDown

# Simple conversion
md = MarkItDown()
result = md.convert("document.docx")

# Access content
markdown_text = result.text_content

# With options
md = MarkItDown(
    llm_client=None,  # Optional LLM client for enhanced processing
    llm_model=None,   # Model name if using an LLM
)
```
#### Command Line
```bash
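# Install (the [all] extra pulls in the optional Office and PDF converters)
pip install 'markitdown[all]'

# Convert a file and redirect the result
markitdown document.docx > output.md

# Or name the output file
markitdown document.docx -o output.md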
```
### Word Document Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
# Convert Word document
result = md.convert("report.docx")
# Output preserves:
# - Headings (as # headers)
# - Bold/italic formatting
# - Lists (bulleted and numbered)
# - Tables (as markdown tables)
# - Hyperlinks
print(result.text_content)
```
**Example Output:**
```markdown
# Annual Report 2024
## Executive Summary
This report summarizes the key achievements and challenges...
### Key Metrics
| Metric | 2023 | 2024 | Change |
|--------|------|------|--------|
| Revenue | $10M | $12M | +20% |
| Users | 50K | 75K | +50% |
## Detailed Analysis
The following sections provide...
```
### Excel Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("data.xlsx")
# Each sheet becomes a section
# Data becomes markdown tables
print(result.text_content)
```
**Example Output:**
```markdown
## Sheet1
| Name | Department | Salary |
|------|------------|--------|
| John | Engineering | $80,000 |
| Jane | Marketing | $75,000 |
## Sheet2
| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| Widget A | 100 | 120 | 150 | 180 |
```
### PowerPoint Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("presentation.pptx")
# Each slide becomes a section
# Speaker notes included if present
print(result.text_content)
```
**Example Output:**
```markdown
# Slide 1: Company Overview
Our mission is to...
## Key Points
- Innovation first
- Customer focused
- Global reach
---
# Slide 2: Market Analysis
The market opportunity is significant...
**Notes:** Mention the competitor analysis here
```
### PDF Conversion
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
# Extracts text content
# Tables converted where detected
print(result.text_content)
```
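
PDF conversion here is text extraction, so a scanned PDF without a text layer will likely come back nearly empty. A rough way to flag such files for manual review is to count the words in the converted text. This is a heuristic sketch only; `has_usable_text` and its `min_words` threshold are assumptions made for illustration and should be tuned to your documents.

```python
def has_usable_text(markdown_text, min_words=20):
    """Heuristic check: True if the converted text has at least `min_words` words.

    str.split() with no arguments splits on any whitespace, so blank lines
    and form feeds left over from page breaks do not count as words.
    """
    return len(markdown_text.split()) >= min_words
```

Calling `has_usable_text(result.text_content)` right after `md.convert(...)` gives a quick signal that a PDF may need OCR or a different source file.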
### Image Conversion (with Vision Model)
```python
from markitdown import MarkItDown
from openai import OpenAI

# markitdown calls the client through the OpenAI chat-completions
# interface, so pass an OpenAI-compatible client here
client = OpenAI()  # reads OPENAI_API_KEY from the environment
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
)

result = md.convert("diagram.png")
print(result.text_content)
# Output: description of the image content
```
### Batch Conversion
```python
from pathlib import Path

from markitdown import MarkItDown


def batch_convert(input_dir, output_dir):
    """Convert all Office files in input_dir to Markdown files in output_dir."""
    md = MarkItDown()
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    extensions = ['.docx', '.xlsx', '.pptx', '.pdf']
    for ext in extensions:
        for file in input_path.glob(f'*{ext}'):
            try:
                result = md.convert(str(file))
                # Note: report.docx and report.pdf would both map to report.md
                output_file = output_path / f"{file.stem}.md"
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(result.text_content)
                print(f"Converted: {file.name}")
            except Exception as e:
                # Keep going so one bad file does not stop the batch
                print(f"Error converting {file.name}: {e}")


batch_convert('./documents', './markdown')
```
## Best Practices
1. **Check Output Quality**: Review converted Markdown for accuracy
2. **Handle Tables**: Complex tables may need manual adjustment
3. **Preserve Structure**: Use consistent heading levels in source docs
4. **Image Handling**: Consider using vision models for important images
5. **Version Control**: Store converted Markdown in Git for tracking
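
The first three practices can be partly automated with a small check over the converted Markdown. The sketch below flags table rows whose cell count differs from the first row of their table, and headings that skip a level. The function names are hypothetical, the check is deliberately naive (escaped pipes and headings inside code fences are not handled), and it is meant as a review aid, not a validator.

```python
import re


def count_cells(row):
    """Count cells in a pipe-delimited Markdown table row."""
    return len(row.strip().strip("|").split("|"))


def find_ragged_rows(markdown_text):
    """Return 1-based line numbers of table rows whose cell count
    differs from the first row of the same table."""
    problems = []
    expected = None
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        if line.strip().startswith("|"):
            cells = count_cells(line)
            if expected is None:
                expected = cells  # first row of a new table sets the width
            elif cells != expected:
                problems.append(number)
        else:
            expected = None  # any non-table line ends the current table
    return problems


def find_heading_jumps(markdown_text):
    """Return 1-based line numbers where a heading skips a level
    (for example a # heading followed directly by ###)."""
    problems = []
    previous = 0
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        match = re.match(r"^(#{1,6})\s", line)
        if match:
            level = len(match.group(1))
            if previous and level > previous + 1:
                problems.append(number)
            previous = level
    return problems
```

The line numbers returned by both functions can go straight into a conversion report listing the spots that need manual review.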
## Common Patterns
### Document Archive
```python
import os
from datetime import datetime

from markitdown import MarkItDown


def archive_document(doc_path, archive_dir):
    """Convert an Office document to Markdown and archive it with metadata."""
    md = MarkItDown()
    result = md.convert(doc_path)

    # Derive the output name from the source file
    date_str = datetime.now().strftime('%Y-%m-%d')
    filename = os.path.basename(doc_path)
    base_name = os.path.splitext(filename)[0]

    # Prepend front matter; the template starts at column 0 so the
    # "---" delimiters land at the start of their lines
    output_content = f"""---
source: {filename}
converted: {date_str}
---

{result.text_content}
"""

    os.makedirs(archive_dir, exist_ok=True)
    output_path = os.path.join(archive_dir, f"{base_name}.md")
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(output_content)
    return output_path
```
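
One caveat with writing front matter by string interpolation: a filename containing a colon or a leading special character can produce invalid YAML. A minimal workaround, sketched below, is to quote every value with `json.dumps`, since a JSON string is also a valid YAML double-quoted scalar. `build_front_matter` is a hypothetical helper introduced for illustration, and it handles flat string values only.

```python
import json


def build_front_matter(metadata):
    """Render a flat dict as a YAML front matter block, quoting every value.

    json.dumps produces a double-quoted, escaped string, which YAML
    parsers also accept, so colons and quotes in values stay harmless.
    """
    lines = ["---"]
    for key, value in metadata.items():
        lines.append(f"{key}: {json.dumps(str(value))}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

In the archive function above, the hand-written block could then be replaced with `build_front_matter({"source": filename, "converted": date_str})` followed by the converted text.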
### AI-Ready Corpus
```python
import json
from pathlib import Path

from markitdown import MarkItDown


def create_ai_corpus(doc_folder, output_file):
    """Convert documents to a JSON corpus for AI training/RAG."""
    md = MarkItDown()
    corpus = []

    for doc in Path(doc_folder).glob('**/*'):
        # Compare lowercased suffixes so .DOCX and .docx are both picked up
        if doc.suffix.lower() in ['.docx', '.pdf', '.pptx', '.xlsx']:
            try:
                result = md.convert(str(doc))
                corpus.append({
                    'source': str(doc),
                    'filename': doc.name,
                    'content': result.text_content,
                    'type': doc.suffix[1:].lower()
                })
            except Exception as e:
                print(f"Skipped {doc.name}: {e}")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(corpus, f, indent=2)

    print(f"Created corpus with {len(corpus)} documents")
    return corpus
```
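
For retrieval use, whole documents are often too large to index as single entries. One common approach is to split the converted Markdown at its headings. The sketch below is one possible way to do that with the standard library; `split_by_headings` and its `max_level` parameter are hypothetical names, and deeper headings are intentionally left inside the body of their parent chunk.

```python
import re


def split_by_headings(markdown_text, max_level=2):
    """Split Markdown into (heading, body) chunks at headings up to max_level.

    Text before the first heading is returned with a heading of None.
    """
    pattern = re.compile(rf"^(#{{1,{max_level}}})\s+(.*)$")
    chunks = []
    heading, body = None, []

    def flush():
        # Emit the pending chunk unless it is an empty preamble
        if heading is not None or any(part.strip() for part in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = pattern.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each corpus entry's `content` could be passed through this function so that every chunk is stored with its source filename and heading.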
## Examples
### Example 1: Convert Documentation Suite
```python
from pathlib import Path

from markitdown import MarkItDown


def convert_docs_to_wiki(docs_folder, wiki_folder):
    """Convert all Word docs to a Markdown wiki that mirrors the folder structure."""
    md = MarkItDown()
    docs_path = Path(docs_folder)
    wiki_path = Path(wiki_folder)

    # Create wiki root
    wiki_path.mkdir(parents=True, exist_ok=True)

    # Start the index page
    index_content = "# Documentation Index\n\n"

    for doc in sorted(docs_path.glob('**/*.docx')):
        try:
            result = md.convert(str(doc))

            # Mirror the source's relative path inside the wiki
            rel_path = doc.relative_to(docs_path)
            output_file = wiki_path / rel_path.with_suffix('.md')
            output_file.parent.mkdir(parents=True, exist_ok=True)

            # Write markdown
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(result.text_content)

            # Add to index (as_posix gives forward slashes on Windows too)
            link = rel_path.with_suffix('.md').as_posix()
            index_content += f"- [{doc.stem}]({link})\n"
            print(f"Converted: {doc.name}")
        except Exception as e:
            print(f"Error: {doc.name} - {e}")

    # Write index
    with open(wiki_path / 'index.md', 'w', encoding='utf-8') as f:
        f.write(index_content)


convert_docs_to_wiki('./company_docs', './wiki')
```
### Example 2: Meeting Notes Processor
```python
from datetime import datetime

from markitdown import MarkItDown


def process_meeting_notes(pptx_path):
    """Extract and structure meeting notes from PowerPoint."""
    md = MarkItDown()
    result = md.convert(pptx_path)
    content = result.text_content

    # Buckets for the bullet points found under each kind of section
    sections = {
        'attendees': [],
        'agenda': [],
        'decisions': [],
        'action_items': []
    }

    current_section = None
    for line in content.split('\n'):
        stripped = line.strip()
        line_lower = stripped.lower()
        # Bullets are checked first so a bullet that mentions e.g. "action"
        # is stored in the current section instead of switching sections.
        # Requiring a space after the marker skips "---" and "**bold**" lines.
        if stripped.startswith(('- ', '* ', '• ')) and current_section:
            sections[current_section].append(stripped[1:].strip())
        elif 'attendee' in line_lower or 'participant' in line_lower:
            current_section = 'attendees'
        elif 'agenda' in line_lower:
            current_section = 'agenda'
        elif 'decision' in line_lower:
            current_section = 'decisions'
        elif 'action' in line_lower:
            current_section = 'action_items'

    # Generate structured output (template lines start at column 0 on purpose)
    output = f"""# Meeting Notes

**Date:** {datetime.now().strftime('%Y-%m-%d')}
**Source:** {pptx_path}

## Attendees
{chr(10).join('- ' + a for a in sections['attendees'])}

## Agenda
{chr(10).join('- ' + a for a in sections['agenda'])}

## Decisions Made
{chr(10).join('- ' + d for d in sections['decisions'])}

## Action Items
{chr(10).join('- [ ] ' + a for a in sections['action_items'])}
"""
    return output


notes = process_meeting_notes('team_meeting.pptx')
print(notes)
```
### Example 3: Excel to Documentation
```python
from datetime import datetime  # needed for the change log date below

from markitdown import MarkItDown


def excel_to_data_dictionary(xlsx_path):
    """Convert Excel data model to data dictionary documentation."""
    md = MarkItDown()
    result = md.convert(xlsx_path)

    # Wrap the converted tables in a documentation skeleton
    doc = f"""# Data Dictionary

Generated from: `{xlsx_path}`

{result.text_content}

## Usage Notes

- All tables are derived from the source Excel file
- Review data types and constraints before use
- Contact data team for clarifications

## Change Log

| Date | Change | Author |
|------|--------|--------|
| {datetime.now().strftime('%Y-%m-%d')} | Initial generation | Auto |
"""
    return doc


documentation = excel_to_data_dictionary('data_model.xlsx')
with open('data_dictionary.md', 'w', encoding='utf-8') as f:
    f.write(documentation)
```
## Limitations
- Complex formatting may be simplified
- Images are not embedded (use vision model for descriptions)
- Some table structures may not convert perfectly
- Track changes in Word are not preserved
- Comments may not be extracted
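
Because images are not embedded, it can be useful to list whatever image references do appear in the converted Markdown so they can be reviewed or described separately. The sketch below assumes the output uses standard `![alt](target)` image syntax; `find_image_references` and `images_missing_alt_text` are hypothetical helpers, and the regular expression covers only the common inline form.

```python
import re

# Matches ![alt](target) and ![alt](target "title"); the target stops at
# the first space so an optional title is not captured with it.
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def find_image_references(markdown_text):
    """Return a list of (alt_text, target) tuples for inline images."""
    return IMAGE_PATTERN.findall(markdown_text)


def images_missing_alt_text(markdown_text):
    """Return targets of images that have no alt text and may need a description."""
    return [target for alt, target in find_image_references(markdown_text) if not alt.strip()]
```

Targets returned by `images_missing_alt_text` are natural candidates for a vision-model description pass.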
## Installation
```bash
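# Base package (in markitdown 0.1.0 and later, format converters are optional extras)
pip install markitdown
# All optional converters, including Office, PDF, and audio
pip install 'markitdown[all]'
# Or only the extras you need
pip install 'markitdown[docx,pptx,xlsx,pdf]'   # Office and PDF converters
pip install 'markitdown[audio-transcription]'  # Audio transcription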
```
## Resources
- [GitHub Repository](https://github.com/microsoft/markitdown)
- [PyPI Package](https://pypi.org/project/markitdown/)
- [Supported Formats](https://github.com/microsoft/markitdown#supported-formats)
## Quick Start
1. **Create a project folder:** `mkdir -p ~/Documents/OfficeToMd`
2. **Download the template:** download `CLAUDE.md`, then move it into the project folder with `mv ~/Downloads/CLAUDE.md ~/Documents/OfficeToMd/`
3. **Start working:** run `cd ~/Documents/OfficeToMd`, then `claude`