Scientific Dask
Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, integration with existing pandas code. For...
Your pandas DataFrame is 50GB and Python crashes with a MemoryError every time you try to load it. Dask lets you scale your existing pandas and NumPy code to datasets larger than RAM — same API, parallel execution, optional cluster scaling — without rewriting your analysis in Spark or switching languages.
Who it's for: data scientists working with datasets that exceed available RAM, bioinformaticians processing large genomic datasets locally, analysts scaling pandas workflows to multi-file datasets, ML engineers parallelizing feature engineering on large datasets, researchers who know pandas but need to process terabyte-scale data
Example
"Process our 200GB log dataset using Dask instead of pandas" → Dask pipeline: lazy DataFrame loading with partitioned file reading, parallel aggregation and groupby operations matching your existing pandas code, memory-efficient joins across large tables, distributed computation with progress monitoring, and output export in partitioned parquet format
# Dask
## Overview
Dask is a Python library for parallel and distributed computing that enables three critical capabilities:
- **Larger-than-memory execution** on single machines for data exceeding available RAM
- **Parallel processing** for improved computational speed across multiple cores
- **Distributed computation** supporting terabyte-scale datasets across multiple machines
Dask scales from laptops (processing ~100 GiB) to clusters (processing ~100 TiB) while maintaining familiar Python APIs.
## When to Use This Skill
This skill should be used when you need to:
- Process datasets that exceed available RAM
- Scale pandas or NumPy operations to larger datasets
- Parallelize computations for performance improvements
- Process multiple files efficiently (CSVs, Parquet, JSON, text logs)
- Build custom parallel workflows with task dependencies
- Distribute workloads across multiple cores or machines
## Core Capabilities
Dask provides five main components, each suited to different use cases:
### 1. DataFrames - Parallel Pandas Operations
**Purpose**: Scale pandas operations to larger datasets through parallel processing.
**When to Use**:
- Tabular data exceeds available RAM
- Need to process multiple CSV/Parquet files together
- Pandas operations are slow and need parallelization
- Scaling from pandas prototype to production
**Reference Documentation**: For comprehensive guidance on Dask DataFrames, refer to `references/dataframes.md` which includes:
- Reading data (single files, multiple files, glob patterns)
- Common operations (filtering, groupby, joins, aggregations)
- Custom operations with `map_partitions`
- Performance optimization tips
- Common patterns (ETL, time series, multi-file processing)
**Quick Example**:
```python
import dask.dataframe as dd
# Read multiple files as single DataFrame
ddf = dd.read_csv('data/2024-*.csv')
# Operations are lazy until compute()
filtered = ddf[ddf['value'] > 100]
result = filtered.groupby('category').mean().compute()
```
**Key Points**:
- Operations are lazy (they build a task graph) until `.compute()` is called
- Use `map_partitions` for efficient custom operations (see the sketch below)
- Convert to DataFrame early when working with structured data from other sources
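As an illustration of the `map_partitions` point above, here is a minimal sketch; the `clean_partition` function and the `value` column are assumptions for the example:
```python
import pandas as pd
import dask.dataframe as dd

def clean_partition(df: pd.DataFrame) -> pd.DataFrame:
    # Ordinary pandas code, run once per partition
    df = df.dropna(subset=['value'])
    df['value'] = df['value'].clip(lower=0)
    return df

ddf = dd.read_csv('data/2024-*.csv')
# Dask infers the output schema by running the function on an empty partition
cleaned = ddf.map_partitions(clean_partition)
result = cleaned['value'].sum().compute()
```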
### 2. Arrays - Parallel NumPy Operations
**Purpose**: Extend NumPy capabilities to datasets larger than memory using blocked algorithms.
**When to Use**:
- Arrays exceed available RAM
- NumPy operations need parallelization
- Working with scientific datasets (HDF5, Zarr, NetCDF)
- Need parallel linear algebra or array operations
**Reference Documentation**: For comprehensive guidance on Dask Arrays, refer to `references/arrays.md` which includes:
- Creating arrays (from NumPy, random, from disk)
- Chunking strategies and optimization
- Common operations (arithmetic, reductions, linear algebra)
- Custom operations with `map_blocks`
- Integration with HDF5, Zarr, and XArray
**Quick Example**:
```python
import dask.array as da
# Create large array with chunks
x = da.random.random((100000, 100000), chunks=(10000, 10000))
# Operations are lazy
y = x + 100
z = y.mean(axis=0)
# Compute result
result = z.compute()
```
**Key Points**:
- Chunk size is critical (aim for ~100 MB per chunk)
- Operations work on chunks in parallel
- Rechunk data when needed for efficient operations
- Use `map_blocks` for operations not available in Dask
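For example, a minimal `map_blocks` sketch (the thresholding function is purely illustrative):
```python
import numpy as np
import dask.array as da

def soft_threshold(block: np.ndarray) -> np.ndarray:
    # Plain NumPy, applied independently to each chunk
    return np.sign(block) * np.maximum(np.abs(block) - 0.5, 0)

x = da.random.random((20000, 20000), chunks=(2000, 2000))
y = x.map_blocks(soft_threshold, dtype=x.dtype)
result = y.mean().compute()
```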
### 3. Bags - Parallel Processing of Unstructured Data
**Purpose**: Process unstructured or semi-structured data (text, JSON, logs) with functional operations.
**When to Use**:
- Processing text files, logs, or JSON records
- Data cleaning and ETL before structured analysis
- Working with Python objects that don't fit array/dataframe formats
- Need memory-efficient streaming processing
**Reference Documentation**: For comprehensive guidance on Dask Bags, refer to `references/bags.md` which includes:
- Reading text and JSON files
- Functional operations (map, filter, fold, groupby)
- Converting to DataFrames
- Common patterns (log analysis, JSON processing, text processing)
- Performance considerations
**Quick Example**:
```python
import dask.bag as db
import json
# Read and parse JSON files
bag = db.read_text('logs/*.json').map(json.loads)
# Filter and transform
valid = bag.filter(lambda x: x['status'] == 'valid')
processed = valid.map(lambda x: {'id': x['id'], 'value': x['value']})
# Convert to DataFrame for analysis
ddf = processed.to_dataframe()
```
**Key Points**:
- Use for initial data cleaning, then convert to DataFrame/Array
- Use `foldby` instead of `groupby` for better performance (see the sketch after this list)
- Operations are streaming and memory-efficient
- Convert to structured formats (DataFrame) for complex operations
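A minimal `foldby` sketch, assuming each record has a `category` field to group on and a numeric `value` field:
```python
import json
import dask.bag as db

bag = db.read_text('logs/*.json').map(json.loads)
valid = bag.filter(lambda x: x['status'] == 'valid')

# Sum 'value' per 'category' without the full shuffle a groupby would need
totals = valid.foldby(
    key=lambda x: x['category'],
    binop=lambda acc, record: acc + record['value'],
    initial=0,
    combine=lambda a, b: a + b,
    combine_initial=0,
)
result = dict(totals.compute())  # {category: total}
```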
### 4. Futures - Task-Based Parallelization
**Purpose**: Build custom parallel workflows with fine-grained control over task execution and dependencies.
**When to Use**:
- Building dynamic, evolving workflows
- Need immediate task execution (not lazy)
- Computations depend on runtime conditions
- Implementing custom parallel algorithms
- Need stateful computations
**Reference Documentation**: For comprehensive guidance on Dask Futures, refer to `references/futures.md` which includes:
- Setting up distributed client
- Submitting tasks and working with futures
- Task dependencies and data movement
- Advanced coordination (queues, locks, events, actors)
- Common patterns (parameter sweeps, dynamic tasks, iterative algorithms)
**Quick Example**:
```python
from dask.distributed import Client
client = Client() # Create local cluster
# Submit tasks (executes immediately)
def process(x):
    return x ** 2
futures = client.map(process, range(100))
# Gather results
results = client.gather(futures)
client.close()
```
**Key Points**:
- Requires distributed client (even for single machine)
- Tasks execute immediately when submitted
- Pre-scatter large data to avoid repeated transfers
- ~1ms overhead per task (not suitable for millions of tiny tasks)
- Use actors for stateful workflows
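For the stateful-workflow case, a minimal actor sketch (the `Counter` class is only an illustration):
```python
from dask.distributed import Client

class Counter:
    """State lives on one worker; method calls are routed to it."""
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

client = Client()
counter_future = client.submit(Counter, actor=True)  # create the actor on a worker
counter = counter_future.result()                     # proxy to the remote object
print(counter.increment().result())                   # 1
print(counter.increment().result())                   # 2
client.close()
```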
### 5. Schedulers - Execution Backends
**Purpose**: Control how and where Dask tasks execute (threads, processes, distributed).
**When to Choose Scheduler**:
- **Threads** (default): NumPy/Pandas operations, GIL-releasing libraries, shared memory benefit
- **Processes**: Pure Python code, text processing, GIL-bound operations
- **Synchronous**: Debugging with pdb, profiling, understanding errors
- **Distributed**: Need dashboard, multi-machine clusters, advanced features
**Reference Documentation**: For comprehensive guidance on Dask Schedulers, refer to `references/schedulers.md` which includes:
- Detailed scheduler descriptions and characteristics
- Configuration methods (global, context manager, per-compute)
- Performance considerations and overhead
- Common patterns and troubleshooting
- Thread configuration for optimal performance
**Quick Example**:
```python
import dask
import dask.dataframe as dd
# Use threads for DataFrame (default, good for numeric)
ddf = dd.read_csv('data.csv')
result1 = ddf.mean().compute() # Uses threads
# Use processes for Python-heavy work
import dask.bag as db
bag = db.read_text('logs/*.txt')
result2 = bag.map(python_function).compute(scheduler='processes')
# Use synchronous for debugging
dask.config.set(scheduler='synchronous')
result3 = problematic_computation.compute() # Can use pdb
# Use distributed for monitoring and scaling
from dask.distributed import Client
client = Client()
result4 = computation.compute() # Uses distributed with dashboard
```
**Key Points**:
- Threads: Lowest overhead (~10 µs/task), best for numeric work
- Processes: Avoids GIL (~10 ms/task), best for Python work
- Distributed: Monitoring dashboard (~1 ms/task), scales to clusters
- Can switch schedulers per computation or globally
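A brief sketch of switching schedulers per call, within a scope, and globally (the bag computations are placeholders):
```python
import dask
import dask.bag as db

bag = db.read_text('logs/*.txt')

# Per-compute: override the scheduler for a single call
n_lines = bag.count().compute(scheduler='processes')

# Scoped: the context manager restores the previous setting afterwards
with dask.config.set(scheduler='processes'):
    total_chars = bag.map(len).sum().compute()

# Global: applies to every subsequent compute()
dask.config.set(scheduler='synchronous')
```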
## Best Practices
For comprehensive performance optimization guidance, memory management strategies, and common pitfalls to avoid, refer to `references/best-practices.md`. Key principles include:
### Start with Simpler Solutions
Before using Dask, explore:
- Better algorithms
- Efficient file formats (Parquet instead of CSV)
- Compiled code (Numba, Cython)
- Data sampling
### Critical Performance Rules
**1. Don't Load Data Locally Then Hand to Dask**
```python
import pandas as pd
import dask.dataframe as dd

# Wrong: loads all data into local memory first
df = pd.read_csv('large.csv')
ddf = dd.from_pandas(df, npartitions=10)

# Correct: let Dask handle the loading
ddf = dd.read_csv('large.csv')
```
**2. Avoid Repeated compute() Calls**
```python
# Wrong: Each compute is separate
for item in items:
    result = dask_computation(item).compute()
# Correct: Single compute for all
computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)
```
**3. Don't Build Excessively Large Task Graphs**
- Increase chunk sizes if millions of tasks
- Use `map_partitions`/`map_blocks` to fuse operations
- Check task graph size: `len(ddf.__dask_graph__())`
**4. Choose Appropriate Chunk Sizes**
- Target: ~100 MB per chunk, small enough that roughly 10 chunks per core fit in worker memory (see the sketch below)
- Too large: Memory overflow
- Too small: Scheduling overhead
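A rough sketch of applying these targets; the `blocksize` string and rechunk shape are illustrative values, not requirements:
```python
import dask.dataframe as dd
import dask.array as da

# DataFrames: control partition size at read time
ddf = dd.read_csv('raw_data/*.csv', blocksize='100MB')
print(ddf.npartitions)

# Arrays: rechunk when the current layout is awkward for the computation
x = da.random.random((100000, 10000), chunks=(1000, 10000))
x = x.rechunk((10000, 1000))  # same data, different chunk layout
print(x.chunksize)
```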
**5. Use the Dashboard**
```python
from dask.distributed import Client
client = Client()
print(client.dashboard_link) # Monitor performance, identify bottlenecks
```
## Common Workflow Patterns
### ETL Pipeline
```python
import dask.dataframe as dd
# Extract: Read data
ddf = dd.read_csv('raw_data/*.csv')
# Transform: Clean and process
ddf = ddf[ddf['status'] == 'valid']
ddf['amount'] = ddf['amount'].astype('float64')
ddf = ddf.dropna(subset=['important_col'])
# Load: Aggregate and save
summary = ddf.groupby('category').agg({'amount': ['sum', 'mean']})
# Flatten the MultiIndex columns produced by agg (Parquet needs string column names)
summary.columns = ['_'.join(col) for col in summary.columns]
summary.to_parquet('output/summary.parquet')
```
### Unstructured to Structured Pipeline
```python
import dask.bag as db
import json
# Start with Bag for unstructured data
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'valid')
# Convert to DataFrame for structured analysis
ddf = bag.to_dataframe()
result = ddf.groupby('category').mean().compute()
```
### Large-Scale Array Computation
```python
import dask.array as da
# Load or create large array
x = da.from_zarr('large_dataset.zarr')
# Process in chunks
normalized = (x - x.mean()) / x.std()
# Save result
da.to_zarr(normalized, 'normalized.zarr')
```
### Custom Parallel Workflow
```python
from dask.distributed import Client
client = Client()
# Scatter large dataset once
data = client.scatter(large_dataset)
# Process in parallel with dependencies
futures = []
for param in parameters:
    future = client.submit(process, data, param)
    futures.append(future)
# Gather results
results = client.gather(futures)
```
## Selecting the Right Component
Use this decision guide to choose the appropriate Dask component:
**Data Type**:
- Tabular data → **DataFrames**
- Numeric arrays → **Arrays**
- Text/JSON/logs → **Bags** (then convert to DataFrame)
- Custom Python objects → **Bags** or **Futures**
**Operation Type**:
- Standard pandas operations → **DataFrames**
- Standard NumPy operations → **Arrays**
- Custom parallel tasks → **Futures**
- Text processing/ETL → **Bags**
**Control Level**:
- High-level, automatic → **DataFrames/Arrays**
- Low-level, manual → **Futures**
**Workflow Type**:
- Static computation graph → **DataFrames/Arrays/Bags**
- Dynamic, evolving → **Futures**
## Integration Considerations
### File Formats
- **Efficient**: Parquet, HDF5, Zarr (columnar, compressed, parallel-friendly)
- **Compatible but slower**: CSV (use for initial ingestion only)
- **For Arrays**: HDF5, Zarr, NetCDF
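A quick sketch of reading the formats above (paths are placeholders):
```python
import dask.dataframe as dd
import dask.array as da

# Parquet is columnar: only the requested columns are read
ddf = dd.read_parquet('warehouse/events/', columns=['category', 'amount'])

# Zarr's chunked storage maps naturally onto Dask array chunks
x = da.from_zarr('arrays/measurements.zarr')
```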
### Conversion Between Collections
```python
# Bag → DataFrame
ddf = bag.to_dataframe()
# DataFrame → Array (for numeric data)
arr = ddf.to_dask_array(lengths=True)
# Array → DataFrame
ddf = dd.from_dask_array(arr, columns=['col1', 'col2'])
```
### With Other Libraries
- **XArray**: Wraps Dask arrays with labeled dimensions (geospatial, imaging); see the sketch below
- **Dask-ML**: Machine learning with scikit-learn compatible APIs
- **Distributed**: Advanced cluster management and monitoring
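For instance, a small sketch of wrapping a Dask array in an xarray `DataArray` (dimension names and sizes are illustrative):
```python
import dask.array as da
import xarray as xr

data = da.random.random((365, 360, 720), chunks=(30, 360, 720))
temperature = xr.DataArray(data, dims=('time', 'lat', 'lon'), name='temperature')

# Reductions stay lazy and chunked until computed
time_mean = temperature.mean(dim='time').compute()
```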
## Debugging and Development
### Iterative Development Workflow
1. **Test on small data with synchronous scheduler**:
```python
dask.config.set(scheduler='synchronous')
result = computation.compute() # Can use pdb, easy debugging
```
2. **Validate with threads on sample**:
```python
sample = ddf.head(1000) # Small sample
# Test logic, then scale to full dataset
```
3. **Scale with distributed for monitoring**:
```python
from dask.distributed import Client
client = Client()
print(client.dashboard_link) # Monitor performance
result = computation.compute()
```
### Common Issues
**Memory Errors**:
- Decrease chunk sizes
- Use `persist()` strategically and delete references when done (see the sketch below)
- Check for memory leaks in custom functions
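A short sketch of using `persist()` deliberately and releasing the memory when finished (columns follow the ETL example above):
```python
from dask.distributed import Client
import dask.dataframe as dd

client = Client()
ddf = dd.read_csv('raw_data/*.csv')

# Keep a filtered intermediate in distributed memory for reuse
valid = ddf[ddf['status'] == 'valid'].persist()

by_category = valid.groupby('category')['amount'].sum().compute()
row_count = len(valid)  # reuses the persisted partitions

del valid  # drop the reference so workers can free that memory
```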
**Slow Start**:
- Task graph too large (increase chunk sizes)
- Use `map_partitions` or `map_blocks` to reduce tasks
**Poor Parallelization**:
- Chunks too large (increase number of partitions)
- Using threads with Python code (switch to processes)
- Data dependencies preventing parallelism
## Reference Files
All reference documentation files can be read as needed for detailed information:
- `references/dataframes.md` - Complete Dask DataFrame guide
- `references/arrays.md` - Complete Dask Array guide
- `references/bags.md` - Complete Dask Bag guide
- `references/futures.md` - Complete Dask Futures and distributed computing guide
- `references/schedulers.md` - Complete scheduler selection and configuration guide
- `references/best-practices.md` - Comprehensive performance optimization and troubleshooting
Load these files when users need detailed information about specific Dask components, operations, or patterns beyond the quick guidance provided here.