# Replication-First Coding
When working with existing code or papers, match the original exactly before extending. Verify against gold standard numbers to catch silent bugs.
You extended a published model and the results look wrong — but is it your extension or a bug in the base implementation? Without first replicating the original results exactly, you can't tell where your code diverges. Replication-first development catches silent bugs before they contaminate your novel work.
**Who it's for:**
- PhD students implementing algorithms from published papers
- Research engineers reproducing baseline results before extending models
- Computational scientists verifying numerical methods against known solutions
- ML researchers replicating benchmark results as a foundation for new work
- Postdocs validating inherited codebases from previous lab members
## Example
"Replicate the results from this transformer paper before adding our modifications" → Replication workflow: gold standard numbers extracted from the paper, implementation matching the original architecture exactly, verification tests comparing output to published results, documented discrepancies with root causes, and a clean branch point for your extensions
New here? Start with the Step-by-Step Setup below. Already set up? Copy the template that follows.
# Replication-First Protocol
## The Danger of Silent Bugs
The worst bugs aren't crashes — they're wrong answers that look plausible.
**Example**: Translating a statistical analysis from Stata to Python. The code runs, produces numbers, no errors. But a subtle indexing difference means the results are systematically wrong. Students learn incorrect findings. Papers get published with errors.
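A toy illustration of how such a bug stays silent, assuming the original analysis indexed observations from 1 (as Stata does); the numbers are made up:

```python
import numpy as np

# Quarterly revenue; the original (1-indexed) code refers to "observation 4"
revenue = np.array([10.0, 12.0, 11.0, 15.0, 14.0])

wrong = revenue[4]   # naive translation: runs fine, returns 14.0, silently off by one
right = revenue[3]   # Python is 0-indexed, so observation 4 lives at index 3

print(wrong, right)  # 14.0 15.0 -- both look plausible; only one is correct
```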
## The Protocol
### Phase 1: Inventory Original
Before touching code:
1. Identify "gold standard" numbers from original source
2. Document them explicitly:
```
## Gold Standard Values
- Table 3, Column 2: 0.847 (SE: 0.023)
- Figure 2 data point at x=5: 12.3
- Sum of coefficients: -0.156
```
3. Note data sources, versions, random seeds
4. Screenshot or copy exact original output
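One way to make those numbers checkable by code is to record them in a small structured file next to your notes. A minimal Python sketch using the illustrative values above (the file name, key names, and tolerances are assumptions, not part of any original source):

```python
# gold_standard.py -- illustrative values copied from the replication notes
GOLD_STANDARD = {
    "table3_col2_estimate": {"value": 0.847,  "tolerance": 0.01},
    "table3_col2_se":       {"value": 0.023,  "tolerance": 0.05},
    "figure2_x5":           {"value": 12.3,   "tolerance": 0.01},
    "coefficient_sum":      {"value": -0.156, "tolerance": 0.01},
}
```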
### Phase 2: Replicate Exactly
Implement the translation/refactor:
1. Match original specification EXACTLY
- Same variables, same order
- Same methodology, same parameters
- Same random seed if applicable (see the seed-pinning sketch after this list)
2. Generate output in same format
3. Do NOT optimize or "improve" yet
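If the original pipeline is stochastic, pinning the seed is part of matching the specification, as noted in the list above. A minimal sketch (use the seed documented in Phase 1, not a new one):

```python
import random
import numpy as np

SEED = 42  # assumption: the seed recorded in your Phase 1 notes
random.seed(SEED)
np.random.seed(SEED)

# Caveat: identical seeds only reproduce results when both implementations
# use the same RNG algorithm and draw random numbers in the same order.
```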
### Phase 3: Verify Match
Compare every target value:
```
| Target | Original | Ours | Match? |
|--------|----------|------|--------|
| Table 3, Col 2 | 0.847 | 0.847 | ✓ |
| Table 3, Col 3 | 0.156 | 0.158 | ✗ |
```
**Tolerance thresholds**:
- Estimates: < 0.01 difference
- Standard errors: < 0.05 difference
- Counts: exact match
**If mismatch**: STOP. Investigate before proceeding.
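This comparison is worth scripting so the match table is generated rather than eyeballed. A sketch that assumes the hypothetical GOLD_STANDARD structure from Phase 1:

```python
def verify(gold: dict, ours: dict) -> bool:
    """Print a match table and return True only if every value is within tolerance."""
    all_match = True
    print("| Target | Original | Ours | Match? |")
    print("|--------|----------|------|--------|")
    for name, spec in gold.items():
        ok = abs(ours[name] - spec["value"]) <= spec["tolerance"]
        all_match = all_match and ok
        print(f"| {name} | {spec['value']} | {ours[name]} | {'✓' if ok else '✗'} |")
    return all_match

# Usage (values illustrative): verify(GOLD_STANDARD, {"table3_col2_estimate": 0.847, ...})
```

For counts, set the tolerance to 0 so only an exact match passes.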
### Phase 4: Only Then Extend
After replication is verified:
1. Document that replication is complete
2. Now you can add extensions, optimizations, new features
3. Each extension gets its own verification
## Verification Commands
Before reporting "done", always run:
```
Compare output to gold standard:
- [List each target value]
- [Show differences]
```
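One way to make this check hard to skip is to encode it as a test that runs with the rest of your suite. A sketch assuming the hypothetical verify-style setup above; `run_analysis` and the module names are placeholders for your own code:

```python
# test_replication.py -- hypothetical pytest guard; adapt the imports to your project
import pytest

from analysis import run_analysis        # placeholder: your pipeline's entry point
from gold_standard import GOLD_STANDARD  # placeholder: the Phase 1 values

@pytest.mark.parametrize("name,spec", list(GOLD_STANDARD.items()))
def test_matches_gold_standard(name, spec):
    results = run_analysis()  # expected to return a dict of named results
    assert results[name] == pytest.approx(spec["value"], abs=spec["tolerance"])
```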
## Red Flags
**Stop and investigate if**:
- Any value differs by more than tolerance
- Counts don't match exactly
- Signs are different (positive vs negative)
- Order of magnitude is off
**Common causes**:
- Different default parameters (e.g., intercept included vs excluded; see the sketch after this list)
- Different handling of missing values
- 0-indexed vs 1-indexed
- Different random number generators
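The intercept default in the first bullet is a classic case: R's lm() adds an intercept automatically, while statsmodels' OLS does not. A minimal sketch with made-up data:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.2, 7.1, 8.9, 11.2, 12.8])  # roughly 3 + 2x

no_intercept = sm.OLS(y, x).fit()                     # silent bug: slope absorbs the offset
with_intercept = sm.OLS(y, sm.add_constant(x)).fit()  # matches R's lm(y ~ x)

print(no_intercept.params, with_intercept.params)     # clearly different slopes
```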
## Documentation
After successful replication:
```markdown
## Replication Verified
**Original source**: [paper/repo/code]
**Replication date**: YYYY-MM-DD
### Values matched:
- [Value 1]: ✓
- [Value 2]: ✓
### Methodology notes:
- [Any gotchas discovered]
### Now safe to extend.
```
## What This Does
When translating, refactoring, or extending existing code, this playbook enforces a replication-first approach: match the original output exactly before making any changes. This catches silent bugs that produce wrong results without obvious errors — the most dangerous kind of bug.
## Prerequisites
- Claude Code installed and configured
- Original code/implementation to replicate
- Known "gold standard" outputs to verify against
## Step-by-Step Setup
### Step 1: Add to your CLAUDE.md
Add the replication-first protocol to your project's CLAUDE.md.
### Step 2: Document gold standards
Before any translation/refactoring, create a gold standard document:
```markdown
# Gold Standard: [Original Name]
## Source
- Paper: [citation]
- Code: [repo link]
- Data: [data source]
## Target Values
- Main result: X.XXX
- Secondary result: Y.YYY
- Control values: [list]
## Methodology Notes
- Uses clustered standard errors
- Includes year fixed effects
- Sample size: N = 1,234
```

### Step 3: Test the protocol
I'm translating this Python script to TypeScript. Here are the gold standard outputs it produces: [values]. Replicate exactly, then verify match before any changes.
## Example Usage
User:
"Translate this R script to Python. The original produces these outputs: coefficient = 0.234, SE = 0.045, R-squared = 0.678"
Claude:
Replication-First: R to Python Translation
Phase 1: Gold Standards Documented
- Coefficient: 0.234
- Standard Error: 0.045
- R-squared: 0.678
Phase 2: Implementing Python Version
[Creates Python code matching R methodology exactly]
Phase 3: Verification
| Target | R Original | Python | Match? |
|--------|-----------|--------|--------|
| Coefficient | 0.234 | 0.234 | ✓ |
| Std Error | 0.045 | 0.045 | ✓ |
| R-squared | 0.678 | 0.678 | ✓ |

All values match within tolerance.
Phase 4: Replication Complete
Python implementation verified. Safe to extend.
Want me to proceed with any optimizations or enhancements?
Example with mismatch:
Claude:
Phase 3: Verification
| Target | R Original | Python | Match? |
|--------|-----------|--------|--------|
| Coefficient | 0.234 | 0.234 | ✓ |
| Std Error | 0.045 | 0.051 | ✗ |
| R-squared | 0.678 | 0.678 | ✓ |

**MISMATCH DETECTED**: Standard error differs by 0.006.
Investigating...
Found the issue: R uses heteroskedasticity-robust standard errors by default in this package. Python's statsmodels uses non-robust by default.
Fix: Adding `cov_type='HC3'` to the Python regression.

Re-verification:

| Target | R Original | Python | Match? |
|--------|-----------|--------|--------|
| Std Error | 0.045 | 0.045 | ✓ |
Now all values match. Proceeding.
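For reference, the one-line change described above looks roughly like this in statsmodels (the data is simulated only to show the call; HC3 is one of several robust options, so match whichever the original package used):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=200))
y = X @ np.array([1.0, 0.234]) + rng.normal(size=200)

default_fit = sm.OLS(y, X).fit()               # non-robust SEs (statsmodels default)
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")  # heteroskedasticity-robust SEs

print(default_fit.bse[1], robust_fit.bse[1])   # point estimates are identical; SEs differ
```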
## Tips
- No shortcuts: "It's probably fine" is not verification. Check every value.
- Document the gotchas: What caused mismatches? Future-you will hit the same issues.
- Random seeds matter: Different RNGs produce different results. Document seeds explicitly.
- Version lock: Package versions can change defaults. Document exact versions used.
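A quick way to capture those versions at replication time; list whichever packages actually drive your numerical results (the three named here are just examples):

```python
import importlib.metadata

for pkg in ("numpy", "pandas", "statsmodels"):
    print(pkg, importlib.metadata.version(pkg))  # record these in the replication notes
```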
## Troubleshooting

**Problem**: Values don't match but the code looks correct.

**Solution**: Check defaults. Different languages and packages have different defaults for missing value handling, standard error calculation, intercepts, etc.

**Problem**: Can't get an exact match; values are always off by a small, consistent amount (say 0.001).

**Solution**: This may be floating-point precision or a minor numerical-library difference. If the differences are this small, consistent, and within your tolerance thresholds, it is acceptable; document the precision caveat.

**Problem**: The original code doesn't have clear outputs.

**Solution**: Create your own gold standard first. Run the original, document its exact outputs, then replicate against those.