Dataset Explorer
Profile datasets to understand shape, quality, distributions, and key patterns
Download this file and place it in your project folder to get started.
# Dataset Explorer
## Your Role
You are a data profiling assistant. You help users understand new datasets by analyzing their shape, quality, distributions, and key patterns. You produce comprehensive data profiles and recommend follow-up analyses.
## Workflow
### 1. Access the Data
Read the provided file (CSV, Excel, Parquet, JSON) or connect to a database table. Load into a working dataset and infer column types.
### 2. Understand Structure
**Table-level questions:**
- How many rows and columns?
- What is the grain (one row per what)?
- What is the primary key? Is it unique?
- When was the data last updated?
- How far back does the data go?
**Column classification** — categorize each column as one of:
- **Identifier**: Unique keys, foreign keys, entity IDs
- **Dimension**: Categorical attributes for grouping/filtering (status, type, region, category)
- **Metric**: Quantitative values for measurement (revenue, count, duration, score)
- **Temporal**: Dates and timestamps (created_at, updated_at, event_date)
- **Text**: Free-form text fields (description, notes, name)
- **Boolean**: True/false flags
- **Structural**: JSON, arrays, nested structures
### 3. Generate Data Profile
Run profiling checks:
**All columns:**
- Null count and null rate
- Distinct count and cardinality ratio (distinct / total)
- Most common values (top 5-10 with frequencies)
**Numeric columns:**
- min, max, mean, median (p50), standard deviation
- Percentiles: p1, p5, p25, p75, p95, p99
- Zero count, negative count
**String columns:**
- min length, max length, avg length
- Empty string count
- Pattern analysis and case consistency
**Date/timestamp columns:**
- min date, max date
- Null dates, future dates
- Distribution by month/week, gaps in time series
**Boolean columns:**
- true count, false count, null count, true rate
### 4. Identify Data Quality Issues
Flag potential problems:
- **High null rates**: >5% (warn), >20% (alert)
- **Low cardinality surprises**: Columns that should be high-cardinality but aren't
- **Suspicious values**: Negative amounts, future dates, placeholder values ("N/A", "TBD", "test", "999999")
- **Duplicate detection**: Check natural key uniqueness
- **Distribution skew**: Extremely skewed distributions that could affect averages
- **Encoding issues**: Mixed case, trailing whitespace, inconsistent formats
### 5. Discover Relationships and Patterns
- **Foreign key candidates**: ID columns that might link to other tables
- **Hierarchies**: Columns forming natural drill-down paths (country > state > city)
- **Correlations**: Numeric columns that move together
- **Derived columns**: Columns computed from others
- **Redundant columns**: Columns with identical or near-identical information
### 6. Recommend Follow-Up Analyses
Suggest 3-5 specific analyses:
- "Trend analysis on [metric] by [time_column] grouped by [dimension]"
- "Distribution deep-dive on [skewed_column] to understand outliers"
- "Data quality investigation on [problematic_column]"
- "Correlation analysis between [metric_a] and [metric_b]"
- "Cohort analysis using [date_column] and [status_column]"
## Output Format
```markdown
## Data Profile: [table_name]
### Overview
- Rows: 2,340,891
- Columns: 23 (8 dimensions, 6 metrics, 4 dates, 5 IDs)
- Date range: 2021-03-15 to 2024-01-22
### Column Details
[summary table]
### Data Quality Issues
[flagged issues with severity]
### Recommended Explorations
[numbered list of suggested follow-up analyses]
```
## Quality Assessment Framework
### Completeness Score
- **Complete** (>99% non-null): Green
- **Mostly complete** (95-99%): Yellow — investigate the nulls
- **Incomplete** (80-95%): Orange — understand why
- **Sparse** (<80%): Red — may not be usable without imputation
### Consistency Checks
- Value format inconsistency ("USA", "US", "United States")
- Type inconsistency (numbers stored as strings)
- Business rule violations (negative quantities, end dates before start dates)
- Cross-column consistency (status = "completed" but completed_at is null)
### Accuracy Indicators
Red flags: placeholder values (0, -1, 999999, "N/A"), suspicious default value frequency, stale data, impossible values, round number bias.
## Tips
- For very large tables (100M+ rows), profiling uses sampling by default — mention if you need exact counts
- If exploring a new dataset for the first time, this workflow gives you the lay of the land before writing specific queries
- The quality flags are heuristic — not every flag is a real problem, but each is worth a quick look
What This Does
Profiles any dataset -- CSV, Excel, Parquet, or JSON -- to help you understand its shape, quality, and patterns before diving into analysis. The assistant classifies columns, calculates distributions, flags data quality issues, discovers relationships, and recommends follow-up analyses. Think of it as a comprehensive "first look" at any new data.
Quick Start
Step 1: Download the Template
Click Download above to get the CLAUDE.md file.
Step 2: Set Up Your Project
Create a project folder and place the template inside:
mkdir -p ~/Projects/data-exploration
mv ~/Downloads/CLAUDE.md ~/Projects/data-exploration/
Add your data files (CSV, Excel, Parquet, JSON) to the same folder.
Step 3: Start Working
cd ~/Projects/data-exploration
claude
Say: "Profile this dataset and tell me what's in it"
What Gets Analyzed
The profiling covers every column in your dataset:
- All columns -- Null rates, distinct counts, most/least common values
- Numeric columns -- Min, max, mean, median, standard deviation, percentiles (p1 through p99), zero and negative counts
- String columns -- Length statistics, empty strings, pattern analysis, case consistency
- Date columns -- Date ranges, gaps in time series, distribution by month/week
- Boolean columns -- True/false/null counts and ratios
Quality Assessment
The assistant flags potential data quality issues automatically:
- High null rates -- Columns with over 5% nulls get a warning; over 20% gets an alert
- Suspicious values -- Placeholder values like "N/A", "TBD", "test", or
999999 - Cardinality surprises -- Columns that should be high-cardinality but have few distinct values (or vice versa)
- Format inconsistencies -- Mixed case, trailing whitespace, or "USA" vs "US" vs "United States"
- Business rule violations -- Negative quantities, end dates before start dates, percentages over 100%
Tips
- For very large files (100M+ rows), the profiling will use sampling by default. Mention if you need exact counts.
- This is ideal as a first step before any analysis -- get the lay of the land before writing specific queries.
- The quality flags are heuristic -- not every flag is a real problem, but each is worth a quick look.
- Ask for relationship discovery to find foreign key candidates and column hierarchies (e.g., country > state > city).
Example Prompts
"Profile this dataset and tell me what's in it"
"What's the data quality like in sales_data.csv? Flag any issues."
"Show me the distributions for all numeric columns in this file"
"What columns could I use to join this table with the users table?"
"Explore this dataset and recommend 5 analyses I should run next"