Scientific Skill: KEGG Database
Direct REST API access to KEGG (academic use only). Pathway analysis, gene-pathway mapping, metabolic pathways, drug interactions, ID conversion. For Python workflows with multiple databases, prefer bioservices. Use this for direct HTTP/REST work ...
# KEGG Database
## Overview
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a comprehensive bioinformatics resource for biological pathway analysis and molecular interaction networks.
**Important**: KEGG API is made available only for academic use by academic users.
## When to Use This Skill
This skill should be used when querying pathways, genes, compounds, enzymes, diseases, and drugs across multiple organisms using KEGG's REST API.
## Quick Start
The skill provides:
1. Python helper functions (`scripts/kegg_api.py`) for all KEGG REST API operations
2. Comprehensive reference documentation (`references/kegg_reference.md`) with detailed API specifications
When users request KEGG data, determine which operation is needed and use the appropriate function from `scripts/kegg_api.py`.
## Core Operations
### 1. Database Information (`kegg_info`)
Retrieve metadata and statistics about KEGG databases.
**When to use**: Understanding database structure, checking available data, getting release information.
**Usage**:
```python
from scripts.kegg_api import kegg_info
# Get pathway database info
info = kegg_info('pathway')
# Get organism-specific info
hsa_info = kegg_info('hsa') # Human genome
```
**Common databases**: `kegg`, `pathway`, `module`, `brite`, `genes`, `genome`, `compound`, `glycan`, `reaction`, `enzyme`, `disease`, `drug`
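The contents of `scripts/kegg_api.py` are not shown here, so the following is only a sketch of how a helper such as `kegg_info` might map onto the REST interface. It assumes the public base URL `https://rest.kegg.jp` and a `/<operation>/<argument>` path layout; the function name `build_kegg_url` is hypothetical.

```python
from urllib.parse import quote

KEGG_BASE_URL = "https://rest.kegg.jp"  # assumed public REST endpoint


def build_kegg_url(operation, *args):
    """Join an operation and its arguments into a KEGG REST URL (hypothetical helper)."""
    # Leave ':' and '+' unescaped: KEGG uses them inside identifiers
    # and to join multiple entries in one request.
    parts = [quote(str(arg), safe=":+") for arg in args if arg]
    return "/".join([KEGG_BASE_URL, operation, *parts])


print(build_kegg_url("info", "pathway"))  # https://rest.kegg.jp/info/pathway
print(build_kegg_url("info", "hsa"))      # https://rest.kegg.jp/info/hsa
```

Keeping URL construction in one place makes it easy to log or inspect the exact request before it is sent.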
### 2. Listing Entries (`kegg_list`)
List entry identifiers and names from KEGG databases.
**When to use**: Getting all pathways for an organism, listing genes, retrieving compound catalogs.
**Usage**:
```python
from scripts.kegg_api import kegg_list
# List all reference pathways
pathways = kegg_list('pathway')
# List human-specific pathways
hsa_pathways = kegg_list('pathway', 'hsa')
# List specific genes (max 10)
genes = kegg_list('hsa:10458+hsa:10459')
```
**Common organism codes**: `hsa` (human), `mmu` (mouse), `dme` (fruit fly), `sce` (yeast), `eco` (E. coli)
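List results are easier to work with once parsed. The sketch below assumes `kegg_list` returns the raw tab-delimited text (one `identifier<TAB>description` row per line), as the workflows in this document imply. `parse_kegg_list` is a hypothetical name, and the sample rows are illustrative rather than live API output.

```python
def parse_kegg_list(text):
    """Turn tab-delimited list output into {entry_id: description}."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, including the trailing one
        entry_id, _, description = line.partition("\t")
        entries[entry_id] = description
    return entries


# Illustrative sample in the assumed format
sample = (
    "hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "hsa00020\tCitrate cycle (TCA cycle) - Homo sapiens (human)\n"
)
pathways = parse_kegg_list(sample)
print(len(pathways), pathways["hsa00010"])
```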
### 3. Searching (`kegg_find`)
Search KEGG databases by keywords or molecular properties.
**When to use**: Finding genes by name/description, searching compounds by formula or mass, discovering entries by keywords.
**Usage**:
```python
from scripts.kegg_api import kegg_find
# Keyword search
results = kegg_find('genes', 'p53')
shiga_toxin = kegg_find('genes', 'shiga toxin')
# Chemical formula search (partial match; atom order is ignored)
compounds = kegg_find('compound', 'C7H10N4O2', 'formula')
# Exact mass range search
drugs = kegg_find('drug', '300-310', 'exact_mass')
```
**Search options** (compound and drug databases only): `formula` (partial match), `exact_mass` (single value or range), `mol_weight` (single value or range)
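A small pre-flight check can catch malformed searches before a request is sent. This is a sketch under two assumptions: that the chemical search options apply only to the `compound` and `drug` databases, and that mass queries are either a single number or a `low-high` range. `check_find_request` is a hypothetical helper, not part of `scripts/kegg_api.py`.

```python
CHEMICAL_OPTIONS = {"formula", "exact_mass", "mol_weight"}
CHEMICAL_DATABASES = {"compound", "drug"}


def check_find_request(database, query, option=None):
    """Return the arguments unchanged, or raise ValueError if they look invalid."""
    if option is None:
        return database, query, None  # plain keyword search
    if option not in CHEMICAL_OPTIONS:
        raise ValueError(f"unknown search option: {option!r}")
    if database not in CHEMICAL_DATABASES:
        raise ValueError(f"{option!r} applies only to compound or drug, not {database!r}")
    if option != "formula":
        # Mass queries: one number, or two numbers joined by '-'
        bounds = query.split("-")
        try:
            values = [float(bound) for bound in bounds]
        except ValueError:
            raise ValueError(f"not a numeric mass or range: {query!r}") from None
        if len(values) > 2 or (len(values) == 2 and values[0] > values[1]):
            raise ValueError(f"invalid mass range: {query!r}")
    return database, query, option


print(check_find_request("drug", "300-310", "exact_mass"))
```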
### 4. Retrieving Entries (`kegg_get`)
Get complete database entries or specific data formats.
**When to use**: Retrieving pathway details, getting gene/protein sequences, downloading pathway maps, accessing compound structures.
**Usage**:
```python
from scripts.kegg_api import kegg_get
# Get pathway entry
pathway = kegg_get('hsa00010') # Glycolysis pathway
# Get multiple entries (max 10)
genes = kegg_get(['hsa:10458', 'hsa:10459'])
# Get protein sequence (FASTA)
sequence = kegg_get('hsa:10458', 'aaseq')
# Get nucleotide sequence
nt_seq = kegg_get('hsa:10458', 'ntseq')
# Get compound structure
mol_file = kegg_get('cpd:C00002', 'mol') # ATP in MOL format
# Get BRITE hierarchy as JSON (single entry only)
brite_json = kegg_get('br:br08301', 'json')
# Get pathway image (single entry only)
pathway_img = kegg_get('hsa05130', 'image')
```
**Output formats**: `aaseq` (protein FASTA), `ntseq` (nucleotide FASTA), `mol` (MOL format), `kcf` (KCF format), `image` (PNG), `kgml` (XML), `json` (BRITE hierarchy JSON)
**Important**: Image, KGML, and JSON formats allow only one entry at a time.
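When more than 10 entries are needed, the list has to be split into batches. The following sketch shows one way to do that; `batch_ids` is a hypothetical helper and the gene IDs are generated purely for illustration, so they are not guaranteed to exist in KEGG.

```python
def batch_ids(entry_ids, size=10):
    """Yield successive lists of at most `size` identifiers."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ids = list(entry_ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


gene_ids = [f"hsa:{n}" for n in range(10458, 10483)]  # 25 illustrative IDs
for batch in batch_ids(gene_ids):
    query = "+".join(batch)  # multiple entries are joined with '+'
    print(len(batch), query[:29])
```

Each batch could then be passed to `kegg_get`, with a short pause between calls.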
### 5. ID Conversion (`kegg_conv`)
Convert identifiers between KEGG and external databases.
**When to use**: Integrating KEGG data with other databases, mapping gene IDs, converting compound identifiers.
**Usage**:
```python
from scripts.kegg_api import kegg_conv
# Convert all human genes to NCBI Gene IDs
conversions = kegg_conv('ncbi-geneid', 'hsa')
# Convert specific gene
gene_id = kegg_conv('ncbi-geneid', 'hsa:10458')
# Convert to UniProt
uniprot_id = kegg_conv('uniprot', 'hsa:10458')
# Convert compounds to PubChem
pubchem_ids = kegg_conv('pubchem', 'compound')
# Reverse conversion (NCBI Gene ID to KEGG)
kegg_id = kegg_conv('hsa', 'ncbi-geneid')
```
**Supported conversions**: `ncbi-geneid`, `ncbi-proteinid`, `uniprot`, `pubchem`, `chebi`
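One KEGG identifier can map to several external identifiers, so a one-to-many structure is safer than a flat dictionary. The sketch below assumes `kegg_conv` returns two tab-separated columns per line, source ID first; `parse_conv` is a hypothetical name and the sample rows use made-up placeholder accessions.

```python
from collections import defaultdict


def parse_conv(text):
    """Map each source ID to the list of target IDs it converts to."""
    mapping = defaultdict(list)
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        source_id, target_id = fields
        mapping[source_id].append(target_id)
    return dict(mapping)


# Made-up rows in the assumed format (placeholder accessions, not real data)
sample = "hsa:10458\tup:EXAMPLE1\nhsa:10458\tup:EXAMPLE2\nhsa:10459\tup:EXAMPLE3\n"
conversions = parse_conv(sample)
print(conversions["hsa:10458"])  # ['up:EXAMPLE1', 'up:EXAMPLE2']
```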
### 6. Cross-Referencing (`kegg_link`)
Find related entries within and between KEGG databases.
**When to use**: Finding pathways containing genes, getting genes in a pathway, mapping genes to KO groups, finding compounds in pathways.
**Usage**:
```python
from scripts.kegg_api import kegg_link
# Find pathways linked to human genes
pathways = kegg_link('pathway', 'hsa')
# Get genes in a specific pathway
genes = kegg_link('genes', 'hsa00010') # Glycolysis genes
# Find pathways containing a specific gene
gene_pathways = kegg_link('pathway', 'hsa:10458')
# Find compounds in a pathway
compounds = kegg_link('compound', 'hsa00010')
# Map genes to KO (orthology) groups
ko_groups = kegg_link('ko', 'hsa:10458')
```
**Common links**: genes ↔ pathway, pathway ↔ compound, pathway ↔ enzyme, genes ↔ ko (orthology)
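Link results are pairs, so they can be grouped in either direction. This sketch assumes `kegg_link` returns two tab-separated columns per line; `group_links` is a hypothetical helper and the sample rows are illustrative.

```python
from collections import defaultdict


def group_links(text, key_column=0):
    """Group two-column link output by the chosen column (0 or 1)."""
    if key_column not in (0, 1):
        raise ValueError("key_column must be 0 or 1")
    groups = defaultdict(set)
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        groups[fields[key_column]].add(fields[1 - key_column])
    return dict(groups)


# Illustrative rows in the assumed gene<TAB>pathway format
sample = "hsa:7157\tpath:hsa04115\nhsa:7157\tpath:hsa05200\nhsa:1026\tpath:hsa04115\n"
by_gene = group_links(sample)                    # gene -> pathways
by_pathway = group_links(sample, key_column=1)   # pathway -> genes
print(sorted(by_pathway["path:hsa04115"]))  # ['hsa:1026', 'hsa:7157']
```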
### 7. Drug-Drug Interactions (`kegg_ddi`)
Check for drug-drug interactions.
**When to use**: Analyzing drug combinations, checking for contraindications, pharmacological research.
**Usage**:
```python
from scripts.kegg_api import kegg_ddi
# Check single drug
interactions = kegg_ddi('D00001')
# Check multiple drugs (max 10)
interactions = kegg_ddi(['D00001', 'D00002', 'D00003'])
```
## Common Analysis Workflows
### Workflow 1: Gene to Pathway Mapping
**Use case**: Finding pathways associated with genes of interest (e.g., for pathway enrichment analysis).
```python
from scripts.kegg_api import kegg_find, kegg_link, kegg_get
# Step 1: Find gene ID by name
gene_results = kegg_find('genes', 'p53')
# Step 2: Link gene to pathways
pathways = kegg_link('pathway', 'hsa:7157') # TP53 gene
# Step 3: Get detailed pathway information
for pathway_line in pathways.split('\n'):
    if pathway_line:
        pathway_id = pathway_line.split('\t')[1].replace('path:', '')
        pathway_info = kegg_get(pathway_id)
        # Process pathway information
```
### Workflow 2: Pathway Enrichment Context
**Use case**: Getting all genes in organism pathways for enrichment analysis.
```python
from scripts.kegg_api import kegg_list, kegg_link
# Step 1: List all human pathways
pathways = kegg_list('pathway', 'hsa')
# Step 2: For each pathway, get associated genes
for pathway_line in pathways.split('\n'):
    if pathway_line:
        pathway_id = pathway_line.split('\t')[0]
        genes = kegg_link('genes', pathway_id)
        # Process genes for enrichment analysis
```
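The loop above issues one request per pathway, so it helps to pause between calls and to collect results into a structure that enrichment tools can consume. The sketch below takes the link function as a parameter so it can be exercised offline; `collect_gene_sets` and `fake_link` are hypothetical names, and the two-column `pathway<TAB>gene` output format is an assumption.

```python
import time


def collect_gene_sets(pathway_listing, link_fn, delay=0.5):
    """Map each pathway ID to the set of gene IDs returned by link_fn."""
    gene_sets = {}
    for line in pathway_listing.splitlines():
        if not line.strip():
            continue
        pathway_id = line.split("\t")[0]
        genes = set()
        for row in link_fn("genes", pathway_id).splitlines():
            fields = row.split("\t")
            if len(fields) == 2:
                genes.add(fields[1])  # second column assumed to hold the gene ID
        gene_sets[pathway_id] = genes
        time.sleep(delay)  # pause between requests
    return gene_sets


def fake_link(target, source):
    """Offline stand-in for kegg_link that returns canned, illustrative rows."""
    canned = {"hsa00010": "path:hsa00010\thsa:3101\npath:hsa00010\thsa:2821\n"}
    return canned.get(source, "")


listing = "hsa00010\tGlycolysis / Gluconeogenesis\nhsa00020\tCitrate cycle (TCA cycle)\n"
gene_sets = collect_gene_sets(listing, fake_link, delay=0)
print({pathway: len(genes) for pathway, genes in gene_sets.items()})
```

In real use, `kegg_link` would be passed in place of `fake_link`, with a non-zero `delay`.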
### Workflow 3: Compound to Pathway Analysis
**Use case**: Finding metabolic pathways containing compounds of interest.
```python
from scripts.kegg_api import kegg_find, kegg_link, kegg_get
# Step 1: Search for compound
compound_results = kegg_find('compound', 'glucose')
# Step 2: Link compound to reactions
reactions = kegg_link('reaction', 'cpd:C00031') # Glucose
# Step 3: Link reactions to pathways
pathways = kegg_link('pathway', 'rn:R00299') # Specific reaction
# Step 4: Get pathway details
pathway_info = kegg_get('map00010') # Glycolysis
```
### Workflow 4: Cross-Database Integration
**Use case**: Integrating KEGG data with UniProt, NCBI, or PubChem databases.
```python
from scripts.kegg_api import kegg_conv, kegg_get
# Step 1: Convert KEGG gene IDs to external database IDs
uniprot_map = kegg_conv('uniprot', 'hsa')
ncbi_map = kegg_conv('ncbi-geneid', 'hsa')
# Step 2: Parse conversion results
for line in uniprot_map.split('\n'):
    if line:
        kegg_id, uniprot_id = line.split('\t')
        # Use external IDs for integration
# Step 3: Get sequences using KEGG
sequence = kegg_get('hsa:10458', 'aaseq')
```
### Workflow 5: Organism-Specific Pathway Analysis
**Use case**: Comparing pathways across different organisms.
```python
from scripts.kegg_api import kegg_list, kegg_get
# Step 1: List pathways for multiple organisms
human_pathways = kegg_list('pathway', 'hsa')
mouse_pathways = kegg_list('pathway', 'mmu')
yeast_pathways = kegg_list('pathway', 'sce')
# Step 2: Get reference pathway for comparison
ref_pathway = kegg_get('map00010') # Reference glycolysis
# Step 3: Get organism-specific versions
hsa_glycolysis = kegg_get('hsa00010')
mmu_glycolysis = kegg_get('mmu00010')
```
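One simple way to compare organisms is to strip the organism prefix and compare the five-digit map numbers. The sketch below assumes the pathway listings are tab-delimited text with the pathway ID in the first column; `pathway_numbers` is a hypothetical helper and the listings are made-up samples.

```python
import re


def pathway_numbers(listing):
    """Extract the five-digit map numbers from pathway list output."""
    numbers = set()
    for line in listing.splitlines():
        entry_id = line.split("\t")[0]
        match = re.search(r"(\d{5})$", entry_id)
        if match:
            numbers.add(match.group(1))
    return numbers


# Made-up listings in the assumed format
human = "hsa00010\tGlycolysis / Gluconeogenesis\nhsa05010\tAlzheimer disease\n"
yeast = "sce00010\tGlycolysis / Gluconeogenesis\nsce00020\tCitrate cycle (TCA cycle)\n"

shared = pathway_numbers(human) & pathway_numbers(yeast)
human_only = pathway_numbers(human) - pathway_numbers(yeast)
print(sorted(shared), sorted(human_only))  # ['00010'] ['05010']
```

A shared number can be turned back into a reference ID by prefixing it with `map`.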
## Pathway Categories
KEGG organizes pathways into seven major categories. Use them when interpreting pathway IDs or recommending pathways to users:
1. **Metabolism** (e.g., `map00010` - Glycolysis, `map00190` - Oxidative phosphorylation)
2. **Genetic Information Processing** (e.g., `map03010` - Ribosome, `map03040` - Spliceosome)
3. **Environmental Information Processing** (e.g., `map04010` - MAPK signaling, `map02010` - ABC transporters)
4. **Cellular Processes** (e.g., `map04140` - Autophagy, `map04210` - Apoptosis)
5. **Organismal Systems** (e.g., `map04610` - Complement cascade, `map04910` - Insulin signaling)
6. **Human Diseases** (e.g., `map05200` - Pathways in cancer, `map05010` - Alzheimer disease)
7. **Drug Development** (chronological and target-based classifications)
See `references/kegg_reference.md` for detailed pathway lists and classifications.
## Important Identifiers and Formats
### Pathway IDs
- `map#####` - Reference pathway (generic, not organism-specific)
- `hsa#####` - Human pathway
- `mmu#####` - Mouse pathway
### Gene IDs
- Format: `organism:gene_number` (e.g., `hsa:10458`)
### Compound IDs
- Format: `cpd:C#####` (e.g., `cpd:C00002` for ATP)
### Drug IDs
- Format: `dr:D#####` (e.g., `dr:D00001`)
### Enzyme IDs
- Format: `ec:EC_number` (e.g., `ec:1.1.1.1`)
### KO (KEGG Orthology) IDs
- Format: `ko:K#####` (e.g., `ko:K00001`)
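The formats above can be checked mechanically before a request is made. The patterns in this sketch are approximations inferred from the examples in this section (for instance, a two-to-four-letter prefix for pathways), not an official grammar; `classify_kegg_id` is a hypothetical helper.

```python
import re

# Order matters: specific prefixes are tested before the generic gene pattern.
ID_PATTERNS = [
    ("compound", re.compile(r"^(cpd:)?C\d{5}$")),
    ("drug", re.compile(r"^(dr:)?D\d{5}$")),
    ("ko", re.compile(r"^(ko:)?K\d{5}$")),
    ("enzyme", re.compile(r"^ec:\d+(\.(\d+|-)){3}$")),
    ("pathway", re.compile(r"^(path:)?[a-z]{2,4}\d{5}$")),
    ("gene", re.compile(r"^[a-z]{3,4}:\S+$")),
]


def classify_kegg_id(identifier):
    """Return the first matching identifier type, or None if nothing matches."""
    for kind, pattern in ID_PATTERNS:
        if pattern.match(identifier):
            return kind
    return None


for example in ["map00010", "hsa:10458", "cpd:C00002", "dr:D00001", "ec:1.1.1.1", "ko:K00001"]:
    print(example, "->", classify_kegg_id(example))
```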
## API Limitations
Respect these constraints when using the KEGG API:
1. **Entry limits**: Maximum 10 entries per operation (except image/kgml/json: 1 entry only)
2. **Academic use**: API is for academic use only; commercial use requires licensing
3. **HTTP status codes**: Check for 200 (success), 400 (bad request), 404 (not found)
4. **Rate limiting**: No explicit limit, but avoid rapid-fire requests
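A direct HTTP call can respect these constraints with only the standard library. This is a sketch, not the implementation in `scripts/kegg_api.py`: `fetch_kegg` and `describe_status` are hypothetical names, the one-second pause is an arbitrary choice, and the function decodes text responses only (binary formats such as `image` would need `response.read()` without decoding).

```python
import time
import urllib.error
import urllib.request

STATUS_MESSAGES = {
    200: "success",
    400: "bad request: check the operation syntax and parameters",
    404: "not found: check the identifier, database, or organism code",
}


def describe_status(code):
    """Translate an HTTP status code into a short explanation."""
    return STATUS_MESSAGES.get(code, f"unexpected HTTP status {code}")


def fetch_kegg(url, pause=1.0, timeout=30):
    """GET a KEGG REST URL and return its text; raise RuntimeError on HTTP errors."""
    time.sleep(pause)  # simple throttle to avoid rapid-fire requests
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        raise RuntimeError(f"{url}: {describe_status(err.code)}") from err


print(describe_status(404))
```

A call such as `fetch_kegg("https://rest.kegg.jp/info/pathway")` would perform a live request, so it is left out of the example.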
## Detailed Reference
For comprehensive API documentation, database specifications, organism codes, and advanced usage, refer to `references/kegg_reference.md`. This includes:
- Complete list of KEGG databases
- Detailed API operation syntax
- All organism codes
- HTTP status codes and error handling
- Integration with Biopython and R/Bioconductor
- Best practices for API usage
## Troubleshooting
- **404 Not Found**: Entry or database doesn't exist; verify IDs and organism codes
- **400 Bad Request**: Syntax error in API call; check parameter formatting
- **Empty results**: Search term may not match entries; try broader keywords
- **Image/KGML errors**: These formats only work with single entries; remove batch processing
## Additional Tools
For interactive pathway visualization and annotation:
- **KEGG Mapper**: https://www.kegg.jp/kegg/mapper/
- **BlastKOALA**: Automated genome annotation
- **GhostKOALA**: Metagenome/metatranscriptome annotation