Scientific PubChem Database
Query PubChem (110M+ compounds) via the PUG-REST API or PubChemPy. Search by name, CID, or SMILES; retrieve properties and bioactivity data; and run similarity and substructure searches for cheminformatics work.
You need chemical properties, bioactivity data, and structural analogs for a compound, but each piece of information lives in a different database. PubChem aggregates 110M+ compounds with properties, bioassay results, and cross-references behind a single API, so many cheminformatics queries reduce to a function call.
Who it's for: medicinal chemists looking up compound properties and bioactivity data for drug design; cheminformatics scientists performing similarity and substructure searches across chemical libraries; toxicologists retrieving safety and hazard data for chemical risk assessment; chemical biologists finding bioactive compounds that target specific proteins; and analytical chemists identifying unknown compounds by molecular formula and properties.
Example
"Find structural analogs of our lead compound and retrieve their bioactivity data" → PubChem query: compound lookup by name, CID, or SMILES structure, similarity search for structural analogs within specified Tanimoto threshold, bioactivity data retrieval from linked bioassays, chemical property comparison table across analogs, and a structure-activity report highlighting promising scaffolds
# PubChem Database
## Overview
PubChem is the world's largest freely available chemical database, with 110M+ compounds and 270M+ bioactivities. Using the PUG-REST API and PubChemPy, you can query chemical structures by name, CID, or SMILES, retrieve molecular properties, perform similarity and substructure searches, and access bioactivity data.
## When to Use This Skill
This skill should be used when:
- Searching for chemical compounds by name, structure (SMILES/InChI), or molecular formula
- Retrieving molecular properties (MW, LogP, TPSA, hydrogen bonding descriptors)
- Performing similarity searches to find structurally related compounds
- Conducting substructure searches for specific chemical motifs
- Accessing bioactivity data from screening assays
- Converting between chemical identifier formats (CID, SMILES, InChI)
- Batch processing multiple compounds for drug-likeness screening or property analysis
## Core Capabilities
### 1. Chemical Structure Search
Search for compounds using multiple identifier types:
**By Chemical Name**:
```python
import pubchempy as pcp
compounds = pcp.get_compounds('aspirin', 'name')
compound = compounds[0]
```
**By CID (Compound ID)**:
```python
compound = pcp.Compound.from_cid(2244) # Aspirin
```
**By SMILES**:
```python
compound = pcp.get_compounds('CC(=O)OC1=CC=CC=C1C(=O)O', 'smiles')[0]
```
**By InChI**:
```python
compound = pcp.get_compounds('InChI=1S/C9H8O4/...', 'inchi')[0]
```
**By Molecular Formula**:
```python
compounds = pcp.get_compounds('C9H8O4', 'formula')
# Returns all compounds matching this formula
```
### 2. Property Retrieval
Retrieve molecular properties either from a full `Compound` object or by requesting only the properties you need:
**Using PubChemPy (Recommended)**:
```python
import pubchempy as pcp
# Get compound object with all properties
compound = pcp.get_compounds('caffeine', 'name')[0]
# Access individual properties
molecular_formula = compound.molecular_formula
molecular_weight = compound.molecular_weight
iupac_name = compound.iupac_name
smiles = compound.canonical_smiles
inchi = compound.inchi
xlogp = compound.xlogp # Partition coefficient
tpsa = compound.tpsa # Topological polar surface area
```
**Get Specific Properties**:
```python
# Request only specific properties
properties = pcp.get_properties(
    ['MolecularFormula', 'MolecularWeight', 'CanonicalSMILES', 'XLogP'],
    'aspirin',
    'name'
)
# Returns list of dictionaries
```
**Batch Property Retrieval**:
```python
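import pandas as pd
import pubchempy as pcp

compound_names = ['aspirin', 'ibuprofen', 'paracetamol']
all_properties = []
for name in compound_names:
    # One request per name; each call returns a list of dictionaries
    props = pcp.get_properties(
        ['MolecularFormula', 'MolecularWeight', 'XLogP'],
        name,
        'name'
    )
    all_properties.extend(props)
# One row per returned compound, one column per property
df = pd.DataFrame(all_properties)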
```
**Available Properties**: MolecularFormula, MolecularWeight, CanonicalSMILES, IsomericSMILES, InChI, InChIKey, IUPACName, XLogP, TPSA, HBondDonorCount, HBondAcceptorCount, RotatableBondCount, Complexity, Charge, and many more (see `references/api_reference.md` for complete list).
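PUG-REST also accepts comma-separated CIDs and property names in a single URL, so a batch lookup can be one request rather than one per compound. The following is a minimal sketch of building such a URL with plain string handling; the helper name `build_property_url` is hypothetical, and the URL pattern should be checked against the PUG-REST documentation before relying on it.

```python
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(cids, properties, output="JSON"):
    """Build a PUG-REST property URL for several CIDs (hypothetical helper)."""
    if not cids or not properties:
        raise ValueError("at least one CID and one property are required")
    # int() rejects anything that is not a plain integer CID
    cid_part = ",".join(str(int(cid)) for cid in cids)
    prop_part = ",".join(properties)
    return f"{BASE_URL}/compound/cid/{cid_part}/property/{prop_part}/{output}"


# Two example CIDs and two property names from the list above
url = build_property_url([2244, 3672], ["MolecularFormula", "XLogP"])
print(url)
```

The resulting URL can be passed to `requests.get` in the same way as the other direct API calls in this document.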
### 3. Similarity Search
Find structurally similar compounds using Tanimoto similarity:
```python
import pubchempy as pcp
# Start with a query compound
query_compound = pcp.get_compounds('gefitinib', 'name')[0]
query_smiles = query_compound.canonical_smiles
# Perform similarity search
similar_compounds = pcp.get_compounds(
    query_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,  # Similarity threshold (0-100)
    MaxRecords=50
)
# Process results
for compound in similar_compounds[:10]:
    print(f"CID {compound.cid}: {compound.iupac_name}")
    print(f"  MW: {compound.molecular_weight}")
```
**Note**: Similarity searches are asynchronous for large queries and may take 15-30 seconds to complete. PubChemPy handles the asynchronous pattern automatically.
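To make the `Threshold` parameter more concrete, here is a small sketch of the Tanimoto coefficient itself: the number of fingerprint bits two compounds share, divided by the number of bits set in either. The bit positions below are made up for illustration and are not real PubChem fingerprints; the function name `tanimoto` is likewise just a local helper.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Toy fingerprints (made-up bit positions, for illustration only)
query = {1, 4, 7, 9, 12}
candidate = {1, 4, 7, 9, 15}
score = tanimoto(query, candidate)
print(f"Tanimoto: {score:.2f}; passes Threshold=85: {score * 100 >= 85}")
```

A threshold of 85 therefore corresponds to a coefficient of at least 0.85 on this scale.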
### 4. Substructure Search
Find compounds containing a specific structural motif:
```python
import pubchempy as pcp
# Search for compounds containing pyridine ring
pyridine_smiles = 'c1ccncc1'
matches = pcp.get_compounds(
    pyridine_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)
print(f"Found {len(matches)} compounds containing pyridine")
```
**Common Substructures**:
- Benzene ring: `c1ccccc1`
- Pyridine: `c1ccncc1`
- Phenol: `c1ccc(O)cc1`
- Carboxylic acid: `C(=O)O`
### 5. Format Conversion
Convert between different chemical structure formats:
```python
import pubchempy as pcp
compound = pcp.get_compounds('aspirin', 'name')[0]
# Convert to different formats
smiles = compound.canonical_smiles
inchi = compound.inchi
inchikey = compound.inchikey
cid = compound.cid
# Download structure files
pcp.download('SDF', 'aspirin', 'name', 'aspirin.sdf', overwrite=True)
pcp.download('JSON', '2244', 'cid', 'aspirin.json', overwrite=True)
```
### 6. Structure Visualization
Generate 2D structure images:
```python
import pubchempy as pcp
# Download compound structure as PNG
pcp.download('PNG', 'caffeine', 'name', 'caffeine.png', overwrite=True)
# Using direct URL (via requests)
import requests
cid = 2244 # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url)
with open('structure.png', 'wb') as f:
    f.write(response.content)
```
### 7. Synonym Retrieval
Get all known names and synonyms for a compound:
```python
import pubchempy as pcp
synonyms_data = pcp.get_synonyms('aspirin', 'name')
if synonyms_data:
    cid = synonyms_data[0]['CID']
    synonyms = synonyms_data[0]['Synonym']
    print(f"CID {cid} has {len(synonyms)} synonyms:")
    for syn in synonyms[:10]:  # First 10
        print(f"  - {syn}")
```
### 8. Bioactivity Data Access
Retrieve biological activity data from assays:
```python
import requests

# Get bioassay summary for a compound
cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)
if response.status_code == 200:
    data = response.json()
    # Process bioassay information: one entry in 'Row' per bioassay record
    table = data.get('Table', {})
    rows = table.get('Row', [])
    print(f"Found {len(rows)} bioassay records")
```
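The rows in this response are positional, so it can help to pair each cell with its column name before filtering. The sketch below assumes the payload has the shape `Table -> Columns -> Column` (a list of names) and `Table -> Row -> Cell` (a list of values); both that shape and the column name `Activity Outcome` are assumptions to verify against a real response, and the sample data is made up.

```python
def table_to_records(data):
    """Convert an assaysummary-style payload into a list of dicts, one per row."""
    table = data.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    records = []
    for row in table.get("Row", []):
        cells = row.get("Cell", [])
        # Pair each cell with the column name at the same position
        records.append(dict(zip(columns, cells)))
    return records


# Made-up payload with the assumed shape (values are illustrative only)
sample = {
    "Table": {
        "Columns": {"Column": ["AID", "Activity Outcome"]},
        "Row": [{"Cell": ["1", "Inactive"]}, {"Cell": ["2", "Active"]}],
    }
}
records = table_to_records(sample)
active = [r for r in records if r.get("Activity Outcome") == "Active"]
print(f"{len(active)} of {len(records)} records are active")
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame` if tabular analysis is needed.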
**For more complex bioactivity queries**, use the `scripts/bioactivity_query.py` helper script which provides:
- Bioassay summaries with activity outcome filtering
- Assay target identification
- Search for compounds by biological target
- Active compound lists for specific assays
### 9. Comprehensive Compound Annotations
Access detailed compound information through PUG-View:
```python
import requests
cid = 2244
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
response = requests.get(url)
if response.status_code == 200:
    annotations = response.json()
    # Contains extensive data including:
    # - Chemical and Physical Properties
    # - Drug and Medication Information
    # - Pharmacology and Biochemistry
    # - Safety and Hazards
    # - Toxicity
    # - Literature references
    # - Patents
```
**Get Specific Section**:
```python
# Get only drug information
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Drug+and+Medication+Information"
```
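Once a PUG-View record has been downloaded, a section can also be located locally by walking the nested JSON. This is a sketch under the assumption that sections are nested under `Section` lists and identified by a `TOCHeading` key; the helper name `find_section` is hypothetical, and the sample record is made up to mirror that assumed shape.

```python
def find_section(node, heading):
    """Depth-first search for a section whose TOCHeading equals `heading`."""
    for section in node.get("Section", []):
        if section.get("TOCHeading") == heading:
            return section
        found = find_section(section, heading)
        if found is not None:
            return found
    return None


# Made-up record mirroring the assumed PUG-View layout
sample = {
    "Record": {
        "RecordNumber": 2244,
        "Section": [
            {
                "TOCHeading": "Safety and Hazards",
                "Section": [
                    {"TOCHeading": "Hazards Identification", "Information": []}
                ],
            },
            {"TOCHeading": "Toxicity", "Section": []},
        ],
    }
}
section = find_section(sample["Record"], "Hazards Identification")
print(section["TOCHeading"] if section else "not found")
```

With a real response, the call would be `find_section(annotations["Record"], "Toxicity")`, again assuming the top-level key is `Record`.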
## Installation Requirements
Install PubChemPy for Python-based access:
```bash
uv pip install pubchempy
```
For direct API access and bioactivity queries:
```bash
uv pip install requests
```
Optional for data analysis:
```bash
uv pip install pandas
```
## Helper Scripts
This skill includes Python scripts for common PubChem tasks:
### scripts/compound_search.py
Provides utility functions for searching and retrieving compound information:
**Key Functions**:
- `search_by_name(name, max_results=10)`: Search compounds by name
- `search_by_smiles(smiles)`: Search by SMILES string
- `get_compound_by_cid(cid)`: Retrieve compound by CID
- `get_compound_properties(identifier, namespace, properties)`: Get specific properties
- `similarity_search(smiles, threshold, max_records)`: Perform similarity search
- `substructure_search(smiles, max_records)`: Perform substructure search
- `get_synonyms(identifier, namespace)`: Get all synonyms
- `batch_search(identifiers, namespace, properties)`: Batch search multiple compounds
- `download_structure(identifier, namespace, format, filename)`: Download structures
- `print_compound_info(compound)`: Print formatted compound information
**Usage**:
```python
from scripts.compound_search import search_by_name, get_compound_properties
# Search for a compound
compounds = search_by_name('ibuprofen')
# Get specific properties
props = get_compound_properties('aspirin', 'name', ['MolecularWeight', 'XLogP'])
```
### scripts/bioactivity_query.py
Provides functions for retrieving biological activity data:
**Key Functions**:
- `get_bioassay_summary(cid)`: Get bioassay summary for compound
- `get_compound_bioactivities(cid, activity_outcome)`: Get filtered bioactivities
- `get_assay_description(aid)`: Get detailed assay information
- `get_assay_targets(aid)`: Get biological targets for assay
- `search_assays_by_target(target_name, max_results)`: Find assays by target
- `get_active_compounds_in_assay(aid, max_results)`: Get active compounds
- `get_compound_annotations(cid, section)`: Get PUG-View annotations
- `summarize_bioactivities(cid)`: Generate bioactivity summary statistics
- `find_compounds_by_bioactivity(target, threshold, max_compounds)`: Find compounds by target
**Usage**:
```python
from scripts.bioactivity_query import get_bioassay_summary, summarize_bioactivities
# Get bioactivity summary
summary = summarize_bioactivities(2244) # Aspirin
print(f"Total assays: {summary['total_assays']}")
print(f"Active: {summary['active']}, Inactive: {summary['inactive']}")
```
## API Rate Limits and Best Practices
**Rate Limits**:
- Maximum 5 requests per second
- Maximum 400 requests per minute
- Maximum 300 seconds running time per minute
**Best Practices**:
1. **Use CIDs for repeated queries**: CIDs are more efficient than names or structures
2. **Cache results locally**: Store frequently accessed data
3. **Batch requests**: Combine multiple queries when possible
4. **Implement delays**: Add 0.2-0.3 second delays between requests
5. **Handle errors gracefully**: Check for HTTP errors and missing data
6. **Use PubChemPy**: Higher-level abstraction handles many edge cases
7. **Leverage asynchronous pattern**: For large similarity/substructure searches
8. **Specify MaxRecords**: Limit results to avoid timeouts
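The delay recommendation above can be wrapped in a small throttle so that every request path shares the same pacing. This is a generic sketch, not part of PubChemPy; the class name `Throttle` and its parameters are hypothetical, and the clock and sleep functions are injectable only to make the behavior easy to test.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, min_interval=0.25, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to keep calls `min_interval` seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# 0.25 s between calls is 4 requests per second, under the 5 per second limit
throttle = Throttle(min_interval=0.25)
for cid in [2244, 2519]:
    throttle.wait()
    # a request for `cid` would go here
```

Calling `throttle.wait()` immediately before each request keeps the spacing consistent even when the work between requests takes a variable amount of time.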
**Error Handling**:
```python
import pubchempy as pcp
from pubchempy import BadRequestError, NotFoundError, TimeoutError

try:
    compound = pcp.get_compounds('query', 'name')[0]
except NotFoundError:
    print("Compound not found")
except BadRequestError:
    print("Invalid request format")
except TimeoutError:
    print("Request timed out - try reducing scope")
except IndexError:
    # An empty result list means no compound matched the query
    print("No results returned")
```
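For transient failures such as timeouts, a retry with increasing delays is often enough. The following is a generic sketch rather than a PubChemPy feature; the function name `with_retries` and its parameters are hypothetical, and which exceptions are worth retrying is a judgment call left to the caller.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**n, then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A call might look like `with_retries(lambda: pcp.get_compounds('aspirin', 'name'), retry_on=(TimeoutError,))`, using the `TimeoutError` imported from `pubchempy` so that only timeouts are retried.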
## Common Workflows
### Workflow 1: Chemical Identifier Conversion Pipeline
Convert between different chemical identifiers:
```python
import pubchempy as pcp
# Start with any identifier type
compound = pcp.get_compounds('caffeine', 'name')[0]
# Extract all identifier formats
identifiers = {
    'CID': compound.cid,
    'Name': compound.iupac_name,
    'SMILES': compound.canonical_smiles,
    'InChI': compound.inchi,
    'InChIKey': compound.inchikey,
    'Formula': compound.molecular_formula
}
```
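When identifiers arrive as mixed free text, a small heuristic can pick the namespace to pass to `pcp.get_compounds`. This sketch only distinguishes CIDs, InChI strings, and InChIKeys and falls back to `name` for everything else; the function name `guess_namespace` is hypothetical, and SMILES and formulas are deliberately not guessed because they are easy to confuse with names.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def guess_namespace(identifier):
    """Guess a PubChemPy namespace for an identifier string (heuristic only)."""
    text = str(identifier).strip()
    if text.isdigit():
        return "cid"
    if text.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(text):
        return "inchikey"
    return "name"  # SMILES and formulas are ambiguous, so they are not guessed


for value in ["2244", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "caffeine"]:
    print(value, "->", guess_namespace(value))
```

The returned string can be used as the second argument of `pcp.get_compounds`, for example `pcp.get_compounds(value, guess_namespace(value))`.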
### Workflow 2: Drug-Like Property Screening
Screen compounds using Lipinski's Rule of Five:
```python
import pubchempy as pcp
def check_drug_likeness(compound_name):
    compound = pcp.get_compounds(compound_name, 'name')[0]
    # molecular_weight may be a string depending on the PubChemPy version
    mw = float(compound.molecular_weight)
    xlogp = compound.xlogp
    # Lipinski's Rule of Five; a missing XLogP is recorded as None (not evaluated)
    rules = {
        'MW <= 500': mw <= 500,
        'LogP <= 5': xlogp <= 5 if xlogp is not None else None,
        'HBD <= 5': compound.h_bond_donor_count <= 5,
        'HBA <= 10': compound.h_bond_acceptor_count <= 10
    }
    # Count only rules that were evaluated and failed
    violations = sum(1 for v in rules.values() if v is False)
    return rules, violations

rules, violations = check_drug_likeness('aspirin')
print(f"Lipinski violations: {violations}")
```
### Workflow 3: Finding Similar Drug Candidates
Identify structurally similar compounds to a known drug:
```python
import pubchempy as pcp
# Start with known drug
reference_drug = pcp.get_compounds('imatinib', 'name')[0]
reference_smiles = reference_drug.canonical_smiles
# Find similar compounds
similar = pcp.get_compounds(
    reference_smiles,
    'smiles',
    searchtype='similarity',
    Threshold=85,
    MaxRecords=20
)
# Filter by drug-like properties, skipping compounds with missing values
candidates = []
for comp in similar:
    if comp.molecular_weight is None or comp.xlogp is None:
        continue
    # float() guards against molecular_weight being returned as a string
    if 200 <= float(comp.molecular_weight) <= 600 and -1 <= comp.xlogp <= 5:
        candidates.append(comp)
print(f"Found {len(candidates)} drug-like candidates")
```
### Workflow 4: Batch Compound Property Comparison
Compare properties across multiple compounds:
```python
import pubchempy as pcp
import pandas as pd
compound_list = ['aspirin', 'ibuprofen', 'naproxen', 'celecoxib']
properties_list = []
for name in compound_list:
    try:
        compound = pcp.get_compounds(name, 'name')[0]
        properties_list.append({
            'Name': name,
            'CID': compound.cid,
            'Formula': compound.molecular_formula,
            'MW': compound.molecular_weight,
            'LogP': compound.xlogp,
            'TPSA': compound.tpsa,
            'HBD': compound.h_bond_donor_count,
            'HBA': compound.h_bond_acceptor_count
        })
    except Exception as e:
        print(f"Error processing {name}: {e}")
df = pd.DataFrame(properties_list)
print(df.to_string(index=False))
```
### Workflow 5: Substructure-Based Virtual Screening
Screen for compounds containing specific pharmacophores:
```python
import pubchempy as pcp
# Define pharmacophore (e.g., sulfonamide group)
pharmacophore_smiles = 'S(=O)(=O)N'
# Search for compounds containing this substructure
hits = pcp.get_compounds(
    pharmacophore_smiles,
    'smiles',
    searchtype='substructure',
    MaxRecords=100
)
# Further filter by properties; float() guards against string-valued weights
filtered_hits = [
    comp for comp in hits
    if comp.molecular_weight is not None and float(comp.molecular_weight) < 500
]
print(f"Found {len(filtered_hits)} compounds with desired substructure")
```
## Reference Documentation
For detailed API documentation, including complete property lists, URL patterns, advanced query options, and more examples, consult `references/api_reference.md`. This comprehensive reference includes:
- Complete PUG-REST API endpoint documentation
- Full list of available molecular properties
- Asynchronous request handling patterns
- PubChemPy API reference
- PUG-View API for annotations
- Common workflows and use cases
- Links to official PubChem documentation
## Troubleshooting
**Compound Not Found**:
- Try alternative names or synonyms
- Use CID if known
- Check spelling and chemical name format
**Timeout Errors**:
- Reduce MaxRecords parameter
- Add delays between requests
- Use CIDs instead of names for faster queries
**Empty Property Values**:
- Not all properties are available for all compounds
- Check if property exists before accessing: `if compound.xlogp is not None:` (a plain `if compound.xlogp:` would also skip a valid value of 0)
- Some properties only available for certain compound types
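Missing properties come back as `None`, and a plain truthiness test would also skip a legitimate value of 0. A small pair of helpers can make property filters explicit about the difference between "missing" and "out of range". This is a sketch with hypothetical names (`to_float`, `in_range`); it also tolerates numeric values that arrive as strings.

```python
def to_float(value):
    """Return value as a float, or None when it is missing or not numeric."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def in_range(value, low, high):
    """True or False when the value is known, None when it is missing."""
    number = to_float(value)
    if number is None:
        return None
    return low <= number <= high


print(in_range(0, -1, 5))          # a LogP of 0 is a real value, not a missing one
print(in_range(None, -1, 5))       # missing value is reported as None
print(in_range("180.16", 0, 500))  # numeric strings are converted first
```

Returning `None` instead of `False` for missing data lets downstream code decide whether to exclude such compounds or flag them for manual review.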
**Rate Limit Exceeded**:
- Implement delays (0.2-0.3 seconds) between requests
- Use batch operations where possible
- Consider caching results locally
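Local caching can be as simple as one JSON file per key. The sketch below is a generic pattern, not part of PubChemPy; the function name `cached_fetch` and the default directory `pubchem_cache` are hypothetical, and it assumes the key is safe to use in a file name and the fetched result is JSON-serializable.

```python
import json
from pathlib import Path


def cached_fetch(key, fetch, cache_dir="pubchem_cache"):
    """Return fetch(key), storing the JSON-serializable result on disk for reuse."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        # Cache hit: no network request is made
        return json.loads(path.read_text(encoding="utf-8"))
    result = fetch(key)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Here `fetch` would be any function that takes the key (for example a CID) and returns parsed JSON, such as a thin wrapper around `requests.get(...).json()`. Cached files never expire in this sketch, so they need to be deleted manually when fresh data is required.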
**Similarity/Substructure Search Hangs**:
- These are asynchronous operations that may take 15-30 seconds
- PubChemPy handles polling automatically
- Reduce MaxRecords if timing out
## Additional Resources
- PubChem Home: https://pubchem.ncbi.nlm.nih.gov/
- PUG-REST Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
- PUG-REST Tutorial: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest-tutorial
- PubChemPy Documentation: https://pubchempy.readthedocs.io/
- PubChemPy GitHub: https://github.com/mcs07/PubChemPy
Quick Start
Step 1: Create a Project Folder
mkdir -p ~/Projects/pubchem-database
Step 2: Download the Template
Click Download above, then:
mv ~/Downloads/CLAUDE.md ~/Projects/pubchem-database/
Step 3: Start Claude Code
cd ~/Projects/pubchem-database
claude
Tips
- Read the docs: Check the official PubChem and PubChemPy documentation for the latest API changes
- Start simple: Begin with basic examples before tackling complex workflows
- Save your work: Keep intermediate results in case of long-running analyses