Home
cd ../playbooks
Academic ResearchBeginner

Scientific Skill: Jaspar Database

Query JASPAR for transcription factor binding site (TFBS) profiles (PWMs/PFMs). Search by TF name, species, or class; scan DNA sequences for TF binding sites; compare matrices; essential for regulatory genomics, motif analysis, and GWAS regulatory...

5 minutes
By K-Dense AISource
#scientific#claude-code#jaspar-database#bioinformatics#database#protein#genomics#writing
CLAUDE.md Template

Download this file and place it in your project folder to get started.

# JASPAR Database

## Overview

JASPAR (https://jaspar.elixir.no/) is the gold-standard open-access database of curated, non-redundant transcription factor (TF) binding profiles stored as position frequency matrices (PFMs). JASPAR 2024 contains 1,210 non-redundant TF binding profiles for 164 eukaryotic species. Each profile is experimentally derived (ChIP-seq, SELEX, HT-SELEX, protein binding microarray, etc.) and rigorously validated.

**Key resources:**
- JASPAR portal: https://jaspar.elixir.no/
- REST API: https://jaspar.elixir.no/api/v1/
- API docs: https://jaspar.elixir.no/api/v1/docs/
- Python package: `jaspar` (via Biopython) or direct API

## When to Use This Skill

Use JASPAR when:

- **TF binding site prediction**: Scan a DNA sequence for potential binding sites of a TF
- **Regulatory variant interpretation**: Does a GWAS/eQTL variant disrupt a TF binding motif?
- **Promoter/enhancer analysis**: What TFs are predicted to bind to a regulatory element?
- **Gene regulatory network construction**: Link TFs to their target genes via motif scanning
- **TF family analysis**: Compare binding profiles across a TF family (e.g., all homeobox factors)
- **ChIP-seq analysis**: Find known TF motifs enriched in ChIP-seq peaks
- **ENCODE/ATAC-seq interpretation**: Match open chromatin regions to TF binding profiles

## Core Capabilities

### 1. JASPAR REST API

Base URL: `https://jaspar.elixir.no/api/v1/`

```python
import requests

BASE_URL = "https://jaspar.elixir.no/api/v1"

def jaspar_get(endpoint, params=None):
    url = f"{BASE_URL}/{endpoint}"
    response = requests.get(url, params=params, headers={"Accept": "application/json"})
    response.raise_for_status()
    return response.json()
```

### 2. Search for TF Profiles

```python
def search_jaspar(
    tf_name=None,
    species=None,
    collection="CORE",
    tf_class=None,
    tf_family=None,
    page=1,
    page_size=25
):
    """Search JASPAR for TF binding profiles."""
    params = {
        "collection": collection,
        "page": page,
        "page_size": page_size,
        "format": "json"
    }
    if tf_name:
        params["name"] = tf_name
    if species:
        params["species"] = species  # Use taxonomy ID or name, e.g., "9606" for human
    if tf_class:
        params["tf_class"] = tf_class
    if tf_family:
        params["tf_family"] = tf_family

    return jaspar_get("matrix", params)

# Examples:
# Search for human CTCF profile
ctcf = search_jaspar("CTCF", species="9606")
print(f"Found {ctcf['count']} CTCF profiles")

# Search for all homeobox TFs in human
hox_tfs = search_jaspar(tf_class="Homeodomain", species="9606")

# Search for a TF family
nfkb = search_jaspar(tf_family="NF-kappaB")
```

### 3. Fetch a Specific Matrix (PFM/PWM)

```python
def get_matrix(matrix_id):
    """Fetch a specific JASPAR matrix by ID (e.g., 'MA0139.1' for CTCF)."""
    return jaspar_get(f"matrix/{matrix_id}/")

# Example: Get CTCF matrix
ctcf_matrix = get_matrix("MA0139.1")

# Matrix structure:
# {
#   "matrix_id": "MA0139.1",
#   "name": "CTCF",
#   "collection": "CORE",
#   "tax_group": "vertebrates",
#   "pfm": { "A": [...], "C": [...], "G": [...], "T": [...] },
#   "consensus": "CCGCGNGGNGGCAG",
#   "length": 19,
#   "species": [{"tax_id": 9606, "name": "Homo sapiens"}],
#   "class": ["C2H2 zinc finger factors"],
#   "family": ["BEN domain factors"],
#   "type": "ChIP-seq",
#   "uniprot_ids": ["P49711"]
# }
```

### 4. Download PFM/PWM as Matrix

```python
import numpy as np

def get_pwm(matrix_id, pseudocount=0.8):
    """
    Fetch a PFM from JASPAR and convert to PWM (log-odds).
    Returns numpy array of shape (4, L) in order A, C, G, T.
    """
    matrix = get_matrix(matrix_id)
    pfm = matrix["pfm"]

    # Convert PFM to numpy
    pfm_array = np.array([pfm["A"], pfm["C"], pfm["G"], pfm["T"]], dtype=float)

    # Add pseudocount
    pfm_array += pseudocount

    # Normalize to get PPM
    ppm = pfm_array / pfm_array.sum(axis=0, keepdims=True)

    # Convert to PWM (log-odds relative to background 0.25)
    background = 0.25
    pwm = np.log2(ppm / background)

    return pwm, matrix["name"]

# Example
pwm, name = get_pwm("MA0139.1")  # CTCF
print(f"PWM for {name}: shape {pwm.shape}")
max_score = pwm.max(axis=0).sum()
print(f"Maximum possible score: {max_score:.2f} bits")
```

### 5. Scan a DNA Sequence for TF Binding Sites

```python
import numpy as np
from typing import List, Tuple

NUCLEOTIDE_MAP = {'A': 0, 'C': 1, 'G': 2, 'T': 3,
                  'a': 0, 'c': 1, 'g': 2, 't': 3}

def scan_sequence(sequence: str, pwm: np.ndarray, threshold_pct: float = 0.8) -> List[dict]:
    """
    Scan a DNA sequence for TF binding sites using a PWM.

    Args:
        sequence: DNA sequence string
        pwm: PWM array (4 x L) in ACGT order
        threshold_pct: Fraction of max score to use as threshold (0-1)

    Returns:
        List of hits with position, score, and matched sequence
    """
    motif_len = pwm.shape[1]
    max_score = pwm.max(axis=0).sum()
    min_score = pwm.min(axis=0).sum()
    threshold = min_score + threshold_pct * (max_score - min_score)

    hits = []
    seq = sequence.upper()

    for i in range(len(seq) - motif_len + 1):
        subseq = seq[i:i + motif_len]
        # Skip if contains non-ACGT
        if any(c not in NUCLEOTIDE_MAP for c in subseq):
            continue

        score = sum(pwm[NUCLEOTIDE_MAP[c], j] for j, c in enumerate(subseq))

        if score >= threshold:
            relative_score = (score - min_score) / (max_score - min_score)
            hits.append({
                "position": i + 1,  # 1-based
                "score": score,
                "relative_score": relative_score,
                "sequence": subseq,
                "strand": "+"
            })

    return hits

# Example: Scan a promoter sequence for CTCF binding sites
promoter = "AGCCCGCGAGGNGGCAGTTGCCTGGAGCAGGATCAGCAGATC"
pwm, name = get_pwm("MA0139.1")
hits = scan_sequence(promoter, pwm, threshold_pct=0.75)
for hit in hits:
    print(f"  Position {hit['position']}: {hit['sequence']} (score: {hit['score']:.2f}, {hit['relative_score']:.0%})")
```

### 6. Scan Both Strands

```python
def reverse_complement(seq: str) -> str:
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}
    return ''.join(complement.get(b, 'N') for b in reversed(seq.upper()))

def scan_both_strands(sequence: str, pwm: np.ndarray, threshold_pct: float = 0.8):
    """Scan forward and reverse complement strands."""
    fwd_hits = scan_sequence(sequence, pwm, threshold_pct)
    for h in fwd_hits:
        h["strand"] = "+"

    rev_seq = reverse_complement(sequence)
    rev_hits = scan_sequence(rev_seq, pwm, threshold_pct)
    seq_len = len(sequence)
    for h in rev_hits:
        h["strand"] = "-"
        h["position"] = seq_len - h["position"] - len(h["sequence"]) + 2  # Convert to fwd coords

    all_hits = fwd_hits + rev_hits
    return sorted(all_hits, key=lambda x: x["position"])
```

### 7. Variant Impact on TF Binding

```python
def variant_tfbs_impact(ref_seq: str, alt_seq: str, pwm: np.ndarray,
                          tf_name: str, threshold_pct: float = 0.7):
    """
    Assess impact of a SNP on TF binding by comparing ref vs alt sequences.
    Both sequences should be centered on the variant with flanking context.
    """
    ref_hits = scan_both_strands(ref_seq, pwm, threshold_pct)
    alt_hits = scan_both_strands(alt_seq, pwm, threshold_pct)

    max_ref = max((h["score"] for h in ref_hits), default=None)
    max_alt = max((h["score"] for h in alt_hits), default=None)

    result = {
        "tf": tf_name,
        "ref_max_score": max_ref,
        "alt_max_score": max_alt,
        "ref_has_site": len(ref_hits) > 0,
        "alt_has_site": len(alt_hits) > 0,
    }
    if max_ref and max_alt:
        result["score_change"] = max_alt - max_ref
        result["effect"] = "gained" if max_alt > max_ref else "disrupted"
    elif max_ref and not max_alt:
        result["effect"] = "disrupted"
    elif not max_ref and max_alt:
        result["effect"] = "gained"
    else:
        result["effect"] = "no_site"

    return result
```

## Query Workflows

### Workflow 1: Find All TF Binding Sites in a Promoter

```python
import requests, numpy as np

# 1. Get relevant TF matrices (e.g., all human TFs in CORE collection)
response = requests.get(
    "https://jaspar.elixir.no/api/v1/matrix/",
    params={"species": "9606", "collection": "CORE", "page_size": 500, "page": 1}
)
matrices = response.json()["results"]

# 2. For each matrix, compute PWM and scan promoter
promoter = "CCCGCCCGCCCGCCGCCCGCAGTTAATGAGCCCAGCGTGCC"  # Example

all_hits = []
for m in matrices[:10]:  # Limit for demo
    pwm_data = requests.get(f"https://jaspar.elixir.no/api/v1/matrix/{m['matrix_id']}/").json()
    pfm = pfm_data["pfm"]
    pfm_arr = np.array([pfm["A"], pfm["C"], pfm["G"], pfm["T"]], dtype=float) + 0.8
    ppm = pfm_arr / pfm_arr.sum(axis=0)
    pwm = np.log2(ppm / 0.25)

    hits = scan_sequence(promoter, pwm, threshold_pct=0.8)
    for h in hits:
        h["tf_name"] = m["name"]
        h["matrix_id"] = m["matrix_id"]
    all_hits.extend(hits)

print(f"Found {len(all_hits)} TF binding sites")
for h in sorted(all_hits, key=lambda x: -x["score"])[:5]:
    print(f"  {h['tf_name']} ({h['matrix_id']}): pos {h['position']}, score {h['score']:.2f}")
```

### Workflow 2: SNP Impact on TF Binding (Regulatory Variant Analysis)

1. Retrieve the genomic sequence flanking the SNP (±20 bp each side)
2. Construct ref and alt sequences
3. Scan with all relevant TF PWMs
4. Report TFs whose binding is created or destroyed by the SNP

### Workflow 3: Motif Enrichment Analysis

1. Identify a set of peak sequences (e.g., from ChIP-seq or ATAC-seq)
2. Scan all peaks with JASPAR PWMs
3. Compare hit rates in peaks vs. background sequences
4. Report significantly enriched motifs (Fisher's exact test or FIMO-style scoring)

## Collections Available

| Collection | Description | Profiles |
|------------|-------------|----------|
| `CORE` | Non-redundant, high-quality profiles | ~1,210 |
| `UNVALIDATED` | Experimentally derived but not validated | ~500 |
| `PHYLOFACTS` | Phylogenetically conserved sites | ~50 |
| `CNE` | Conserved non-coding elements | ~30 |
| `POLII` | RNA Pol II binding profiles | ~20 |
| `FAM` | TF family representative profiles | ~170 |
| `SPLICE` | Splice factor profiles | ~20 |

## Best Practices

- **Use CORE collection** for most analyses — best validated and non-redundant
- **Threshold selection**: 80% of max score is common for de novo prediction; 90% for high-confidence
- **Always scan both strands** — TFs can bind in either orientation
- **Provide flanking context** for variant analysis: at least (motif_length - 1) bp on each side
- **Consider background**: PWM scores relative to uniform (0.25) background; adjust for actual GC content
- **Cross-validate with ChIP-seq data** when available — motif scanning has many false positives
- **Use Biopython's motifs module** for full-featured scanning: `from Bio import motifs`

## Additional Resources

- **JASPAR portal**: https://jaspar.elixir.no/
- **API documentation**: https://jaspar.elixir.no/api/v1/docs/
- **JASPAR 2024 paper**: Castro-Mondragon et al. (2022) Nucleic Acids Research. PMID: 34875674
- **Biopython motifs**: https://biopython.org/docs/latest/Tutorial/chapter_motifs.html
- **FIMO tool** (for large-scale scanning): https://meme-suite.org/meme/tools/fimo
- **HOMER** (motif enrichment): http://homer.ucsd.edu/homer/
- **GitHub**: https://github.com/wassermanlab/JASPAR-UCSC-tracks
README.md

What This Does

JASPAR (https://jaspar.elixir.no/) is the gold-standard open-access database of curated, non-redundant transcription factor (TF) binding profiles stored as position frequency matrices (PFMs). JASPAR 2024 contains 1,210 non-redundant TF binding profiles for 164 eukaryotic species. Each profile is experimentally derived (ChIP-seq, SELEX, HT-SELEX, protein binding microarray, etc.) and rigorously validated.

Key resources:


Quick Start

Step 1: Create a Project Folder

mkdir -p ~/Projects/jaspar-database

Step 2: Download the Template

Click Download above, then:

mv ~/Downloads/CLAUDE.md ~/Projects/jaspar-database/

Step 3: Start Claude Code

cd ~/Projects/jaspar-database
claude

Core Capabilities

1. JASPAR REST API

Base URL: https://jaspar.elixir.no/api/v1/

import requests

BASE_URL = "https://jaspar.elixir.no/api/v1"

def jaspar_get(endpoint, params=None):
    url = f"{BASE_URL}/{endpoint}"
    response = requests.get(url, params=params, headers={"Accept": "application/json"})
    response.raise_for_status()
    return response.json()

2. Search for TF Profiles

def search_jaspar(
    tf_name=None,
    species=None,
    collection="CORE",
    tf_class=None,
    tf_family=None,
    page=1,
    page_size=25
):
    """Search JASPAR for TF binding profiles."""
    params = {
        "collection": collection,
        "page": page,
        "page_size": page_size,
        "format": "json"
    }
    if tf_name:
        params["name"] = tf_name
    if species:
        params["species"] = species  # Use taxonomy ID or name, e.g., "9606" for human
    if tf_class:
        params["tf_class"] = tf_class
    if tf_family:
        params["tf_family"] = tf_family

    return jaspar_get("matrix", params)

# Examples:
# Search for human CTCF profile
ctcf = search_jaspar("CTCF", species="9606")
print(f"Found {ctcf['count']} CTCF profiles")

# Search for all homeobox TFs in human
hox_tfs = search_jaspar(tf_class="Homeodomain", species="9606")

# Search for a TF family
nfkb = search_jaspar(tf_family="NF-kappaB")

3. Fetch a Specific Matrix (PFM/PWM)

def get_matrix(matrix_id):
    """Fetch a specific JASPAR matrix by ID (e.g., 'MA0139.1' for CTCF)."""
    return jaspar_get(f"matrix/{matrix_id}/")

# Example: Get CTCF matrix
ctcf_matrix = get_matrix("MA0139.1")

# Matrix structure:
# {
#   "matrix_id": "MA0139.1",
#   "name": "CTCF",
#   "collection": "CORE",
#   "tax_group": "vertebrates",
#   "pfm": { "A": [...], "C": [...], "G": [...], "T": [...] },
#   "consensus": "CCGCGNGGNGGCAG",
#   "length": 19,
#   "species": [{"tax_id": 9606, "name": "Homo sapiens"}],
#   "class": ["C2H2 zinc finger factors"],
#   "family": ["BEN domain factors"],
#   "type": "ChIP-seq",
#   "uniprot_ids": ["P49711"]
# }

4. Download PFM/PWM as Matrix

import numpy as np

def get_pwm(matrix_id, pseudocount=0.8):
    """
    Fetch a PFM from JASPAR and convert to PWM (log-odds).
    Returns numpy array of shape (4, L) in order A, C, G, T.
    """
    matrix = get_matrix(matrix_id)
    pfm = matrix["pfm"]

    # Convert PFM to numpy
    pfm_array = np.array([pfm["A"], pfm["C"], pfm["G"], pfm["T"]], dtype=float)

    # Add pseudocount
    pfm_array += pseudocount

    # Normalize to get PPM
    ppm = pfm_array / pfm_array.sum(axis=0, keepdims=True)

    # Convert to PWM (log-odds relative to background 0.25)
    background = 0.25
    pwm = np.log2(ppm / background)

    return pwm, matrix["name"]

# Example
pwm, name = get_pwm("MA0139.1")  # CTCF
print(f"PWM for {name}: shape {pwm.shape}")
max_score = pwm.max(axis=0).sum()
print(f"Maximum possible score: {max_score:.2f} bits")

5. Scan a DNA Sequence for TF Binding Sites

import numpy as np
from typing import List, Tuple

NUCLEOTIDE_MAP = {'A': 0, 'C': 1, 'G': 2, 'T': 3,
                  'a': 0, 'c': 1, 'g': 2, 't': 3}

def scan_sequence(sequence: str, pwm: np.ndarray, threshold_pct: float = 0.8) -> List[dict]:
    """
    Scan a DNA sequence for TF binding sites using a PWM.

    Args:
        sequence: DNA sequence string
        pwm: PWM array (4 x L) in ACGT order
        threshold_pct: Fraction of max score to use as threshold (0-1)

    Returns:
        List of hits with position, score, and matched sequence
    """
    motif_len = pwm.shape[1]
    max_score = pwm.max(axis=0).sum()
    min_score = pwm.min(axis=0).sum()
    threshold = min_score + threshold_pct * (max_score - min_score)

    hits = []
    seq = sequence.upper()

    for i in range(len(seq) - motif_len + 1):
        subseq = seq[i:i + motif_len]
        # Skip if contains non-ACGT
        if any(c not in NUCLEOTIDE_MAP for c in subseq):
            continue

        score = sum(pwm[NUCLEOTIDE_MAP[c], j] for j, c in enumerate(subseq))

        if score >= threshold:
            relative_score = (score - min_score) / (max_score - min_score)
            hits.append({
                "position": i + 1,  # 1-based
                "score": score,
                "relative_score": relative_score,
                "sequence": subseq,
                "strand": "+"
            })

    return hits

# Example: Scan a promoter sequence for CTCF binding sites
promoter = "AGCCCGCGAGGNGGCAGTTGCCTGGAGCAGGATCAGCAGATC"
pwm, name = get_pwm("MA0139.1")
hits = scan_sequence(promoter, pwm, threshold_pct=0.75)
for hit in hits:
    print(f"  Position {hit['position']}: {hit['sequence']} (score: {hit['score']:.2f}, {hit['relative_score']:.0%})")

6. Scan Both Strands

def reverse_complement(seq: str) -> str:
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}
    return ''.join(complement.get(b, 'N') for b in reversed(seq.upper()))

def scan_both_strands(sequence: str, pwm: np.ndarray, threshold_pct: float = 0.8):
    """Scan forward and reverse complement strands."""
    fwd_hits = scan_sequence(sequence, pwm, threshold_pct)
    for h in fwd_hits:
        h["strand"] = "+"

    rev_seq = reverse_complement(sequence)
    rev_hits = scan_sequence(rev_seq, pwm, threshold_pct)
    seq_len = len(sequence)
    for h in rev_hits:
        h["strand"] = "-"
        h["position"] = seq_len - h["position"] - len(h["sequence"]) + 2  # Convert to fwd coords

    all_hits = fwd_hits + rev_hits
    return sorted(all_hits, key=lambda x: x["position"])

7. Variant Impact on TF Binding

def variant_tfbs_impact(ref_seq: str, alt_seq: str, pwm: np.ndarray,
                          tf_name: str, threshold_pct: float = 0.7):
    """
    Assess impact of a SNP on TF binding by comparing ref vs alt sequences.
    Both sequences should be centered on the variant with flanking context.
    """
    ref_hits = scan_both_strands(ref_seq, pwm, threshold_pct)
    alt_hits = scan_both_strands(alt_seq, pwm, threshold_pct)

    max_ref = max((h["score"] for h in ref_hits), default=None)
    max_alt = max((h["score"] for h in alt_hits), default=None)

    result = {
        "tf": tf_name,
        "ref_max_score": max_ref,
        "alt_max_score": max_alt,
        "ref_has_site": len(ref_hits) > 0,
        "alt_has_site": len(alt_hits) > 0,
    }
    if max_ref and max_alt:
        result["score_change"] = max_alt - max_ref
        result["effect"] = "gained" if max_alt > max_ref else "disrupted"
    elif max_ref and not max_alt:
        result["effect"] = "disrupted"
    elif not max_ref and max_alt:
        result["effect"] = "gained"
    else:
        result["effect"] = "no_site"

    return result

Query Workflows

Workflow 1: Find All TF Binding Sites in a Promoter

import requests, numpy as np

# 1. Get relevant TF matrices (e.g., all human TFs in CORE collection)
response = requests.get(
    "https://jaspar.elixir.no/api/v1/matrix/",
    params={"species": "9606", "collection": "CORE", "page_size": 500, "page": 1}
)
matrices = response.json()["results"]

# 2. For each matrix, compute PWM and scan promoter
promoter = "CCCGCCCGCCCGCCGCCCGCAGTTAATGAGCCCAGCGTGCC"  # Example

all_hits = []
for m in matrices[:10]:  # Limit for demo
    pwm_data = requests.get(f"https://jaspar.elixir.no/api/v1/matrix/{m['matrix_id']}/").json()
    pfm = pfm_data["pfm"]
    pfm_arr = np.array([pfm["A"], pfm["C"], pfm["G"], pfm["T"]], dtype=float) + 0.8
    ppm = pfm_arr / pfm_arr.sum(axis=0)
    pwm = np.log2(ppm / 0.25)

    hits = scan_sequence(promoter, pwm, threshold_pct=0.8)
    for h in hits:
        h["tf_name"] = m["name"]
        h["matrix_id"] = m["matrix_id"]
    all_hits.extend(hits)

print(f"Found {len(all_hits)} TF binding sites")
for h in sorted(all_hits, key=lambda x: -x["score"])[:5]:
    print(f"  {h['tf_name']} ({h['matrix_id']}): pos {h['position']}, score {h['score']:.2f}")

Workflow 2: SNP Impact on TF Binding (Regulatory Variant Analysis)

  1. Retrieve the genomic sequence flanking the SNP (±20 bp each side)
  2. Construct ref and alt sequences
  3. Scan with all relevant TF PWMs
  4. Report TFs whose binding is created or destroyed by the SNP

Workflow 3: Motif Enrichment Analysis

  1. Identify a set of peak sequences (e.g., from ChIP-seq or ATAC-seq)
  2. Scan all peaks with JASPAR PWMs
  3. Compare hit rates in peaks vs. background sequences
  4. Report significantly enriched motifs (Fisher's exact test or FIMO-style scoring)

Collections Available

Collection Description Profiles
CORE Non-redundant, high-quality profiles ~1,210
UNVALIDATED Experimentally derived but not validated ~500
PHYLOFACTS Phylogenetically conserved sites ~50
CNE Conserved non-coding elements ~30
POLII RNA Pol II binding profiles ~20
FAM TF family representative profiles ~170
SPLICE Splice factor profiles ~20

Best Practices

  • Use CORE collection for most analyses — best validated and non-redundant
  • Threshold selection: 80% of max score is common for de novo prediction; 90% for high-confidence
  • Always scan both strands — TFs can bind in either orientation
  • Provide flanking context for variant analysis: at least (motif_length - 1) bp on each side
  • Consider background: PWM scores relative to uniform (0.25) background; adjust for actual GC content
  • Cross-validate with ChIP-seq data when available — motif scanning has many false positives
  • Use Biopython's motifs module for full-featured scanning: from Bio import motifs

Additional Resources

$Related Playbooks