Introduction to DuckDB
DuckDB has emerged as a game-changer in the data processing landscape. It’s an in-process SQL OLAP database management system, similar to SQLite but optimized for analytical queries rather than transactional workloads. What makes DuckDB particularly attractive is its remarkable performance for data analysis tasks combined with its simplicity - no server setup, no complex configuration, just a library you can import into your Python scripts or notebooks.
In this guide, we’ll explore how to use DuckDB effectively in both Jupyter Notebooks and Python scripts for Extract, Transform, Load (ETL) operations. We’ll cover installation, basic operations, and show practical examples for both environments.
Why Choose DuckDB for ETL?
Before diving into the how-to, let’s understand why DuckDB is worth considering for your ETL workflows:
- Exceptional Performance: DuckDB can process large datasets at speeds comparable to (or even faster than) many distributed systems for single-node workloads.
- Simplicity: No separate server process, no complex setup - just import and use.
- SQL Interface: Familiar SQL syntax makes it accessible for those already comfortable with SQL.
- Seamless Integration: Works directly with Pandas DataFrames, Parquet files, CSV, and other common data formats.
- Column-oriented Storage: Optimized for analytical queries common in ETL processes.
Installing DuckDB
Let’s start with installation. DuckDB is available via pip:
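```bash
pip install duckdb
```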
For SQL cell magic in Jupyter, you can also install jupysql together with the duckdb-engine SQLAlchemy driver:

```bash
pip install jupysql duckdb-engine
```
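If you go the jupysql route, a minimal notebook setup looks like the sketch below (this assumes both packages installed cleanly; the plain `import duckdb` approach used in the rest of this guide works either way):

```python
# In a notebook cell: load the SQL cell magic provided by jupysql
%load_ext sql

# Point the magic at an in-memory DuckDB database via duckdb-engine
%sql duckdb:///:memory:
```

With that in place, `%%sql` cells run directly against DuckDB.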
Using DuckDB in Jupyter Notebook
Jupyter Notebooks provide an interactive environment perfect for data exploration and prototyping ETL processes. Here’s how to integrate DuckDB into your notebook workflow:
Basic Setup and Queries
```python
import duckdb
import pandas as pd
import numpy as np

# Create a connection (in-memory database)
conn = duckdb.connect(database=':memory:')

# Create some sample data
df = pd.DataFrame({
    'id': range(1, 1001),
    'value': np.random.randn(1000),
    'category': np.random.choice(['A', 'B', 'C', 'D'], 1000)
})

# Query the DataFrame directly with SQL
result = conn.execute("""
    SELECT
        category,
        COUNT(*) AS count,
        AVG(value) AS avg_value,
        SUM(value) AS sum_value
    FROM df
    GROUP BY category
    ORDER BY count DESC
""").fetchdf()

result
```
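Here DuckDB resolves `df` by looking up a Python variable with that name in the calling scope (a replacement scan). If you prefer an explicit mapping, the connection's `register` method attaches a DataFrame under a name of your choosing; a quick sketch (the view name `sample_data` is arbitrary):

```python
# Explicitly register the DataFrame as a view named 'sample_data'
conn.register('sample_data', df)

category_counts = conn.execute("""
    SELECT category, COUNT(*) AS count
    FROM sample_data
    GROUP BY category
    ORDER BY count DESC
""").fetchdf()
```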
Loading and Processing Data Files
DuckDB excels at processing data files directly. Here’s an example loading and querying a CSV file:
```python
# Create a table directly from a CSV file (no pandas round-trip needed)
conn.execute("CREATE TABLE sales AS SELECT * FROM read_csv_auto('sales_data.csv')")

# Execute an analytical query
result = conn.execute("""
    SELECT
        date_trunc('month', order_date) AS month,
        region,
        SUM(sales_amount) AS total_sales,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM sales
    WHERE order_date >= '2022-01-01'
    GROUP BY month, region
    ORDER BY month, total_sales DESC
""").fetchdf()

# The result is a pandas DataFrame
result.head()
```
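The same direct-query pattern applies to Parquet; the file name below is just a placeholder for illustration:

```python
# Query a Parquet file in place; DuckDB reads only the columns it needs
regional_totals = conn.execute("""
    SELECT region, SUM(sales_amount) AS total_sales
    FROM read_parquet('sales_data.parquet')
    GROUP BY region
    ORDER BY total_sales DESC
""").fetchdf()
```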
Interactive Visual Analysis
Combine DuckDB with visualization libraries for powerful visual analysis:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Execute query and get results
monthly_sales = conn.execute("""
    SELECT
        date_trunc('month', order_date) AS month,
        SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY month
    ORDER BY month
""").fetchdf()

# Plot results
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_sales, x='month', y='total_sales')
plt.title('Monthly Sales Trend')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```
Using DuckDB in Python Scripts for ETL
While Jupyter is great for exploration, production ETL processes often run as standalone Python scripts. Here’s how to use DuckDB effectively in this context:
Creating a Simple ETL Pipeline
```python
import duckdb
import os
from datetime import datetime

# Configuration
input_dir = '/path/to/input'
output_dir = '/path/to/output'
log_file = 'etl_log.txt'


def log_message(message):
    """Write a log message with a timestamp"""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_file, 'a') as f:
        f.write(f"{timestamp} - {message}\n")


def main():
    # Log start
    log_message("ETL process started")
    conn = None
    try:
        # Create a connection (persisted to disk)
        conn = duckdb.connect('etl_process.db')

        # Extract: load data from all CSV files matching a glob pattern
        log_message("Extracting data from CSV files")
        conn.execute(f"""
            CREATE OR REPLACE TABLE raw_data AS
            SELECT * FROM read_csv_auto('{input_dir}/sales_*.csv')
        """)

        # Transform: clean and aggregate data
        log_message("Transforming data")
        conn.execute("""
            CREATE OR REPLACE TABLE transformed_data AS
            SELECT
                date_trunc('month', order_date) AS month,
                product_category,
                region,
                SUM(CASE WHEN order_status = 'completed' THEN amount ELSE 0 END) AS completed_sales,
                COUNT(DISTINCT customer_id) AS customer_count
            FROM raw_data
            WHERE order_date >= '2022-01-01'
              AND amount > 0
            GROUP BY month, product_category, region
        """)

        # Load: export to Parquet for downstream use
        output_file = os.path.join(output_dir, f"sales_summary_{datetime.now().strftime('%Y%m%d')}.parquet")
        log_message(f"Loading data to {output_file}")
        conn.execute(f"""
            COPY transformed_data TO '{output_file}' (FORMAT PARQUET)
        """)

        # Get and log record counts
        raw_count = conn.execute("SELECT COUNT(*) FROM raw_data").fetchone()[0]
        transformed_count = conn.execute("SELECT COUNT(*) FROM transformed_data").fetchone()[0]
        log_message(f"Processed {raw_count} raw records into {transformed_count} aggregated records")

        log_message("ETL process completed successfully")
    except Exception as e:
        log_message(f"Error in ETL process: {str(e)}")
        raise
    finally:
        # Close the connection if it was opened
        if conn is not None:
            conn.close()


if __name__ == "__main__":
    main()
```
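One refinement: the cutoff date in the transform step is interpolated directly into the SQL string. DuckDB's Python API also supports prepared-statement parameters, so values like this can be bound instead. A minimal sketch (the `cutoff` variable and the extra count query are illustrative additions, placed inside the try block where `conn` is open):

```python
# Bind the cutoff date as a query parameter instead of hard-coding it in the SQL
cutoff = '2022-01-01'
recent_count = conn.execute(
    "SELECT COUNT(*) FROM raw_data WHERE order_date >= ?",
    [cutoff],
).fetchone()[0]
log_message(f"{recent_count} records on or after {cutoff}")
```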
Handling Large Datasets Efficiently
For larger datasets, you can leverage DuckDB’s efficient processing capabilities:
```python
import duckdb
import os
import argparse
from datetime import datetime


def process_large_dataset(input_files, output_dir, partition_by=None):
    """Process large datasets with optional partitioning"""
    conn = duckdb.connect(':memory:')

    # Build a SQL list literal of input files (could be CSV, Parquet, etc.)
    file_list_sql = "['" + "', '".join(input_files) + "']"

    # Load data; DuckDB scans the files in parallel
    conn.execute(f"""
        CREATE OR REPLACE TABLE source_data AS
        SELECT * FROM read_parquet({file_list_sql})
    """)

    # Perform transformations
    conn.execute("""
        CREATE OR REPLACE TABLE transformed_data AS
        SELECT
            /* Select your transformed columns here */
            date_part('year', event_timestamp) AS year,
            date_part('month', event_timestamp) AS month,
            user_id,
            SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) AS purchase_amount,
            COUNT(DISTINCT session_id) AS session_count
        FROM source_data
        WHERE event_timestamp >= '2022-01-01'
        GROUP BY year, month, user_id
    """)

    # Write output (partitioned if requested)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    if partition_by:
        partition_cols = ', '.join(partition_by)
        output_path = os.path.join(output_dir, f"data_{timestamp}")
        conn.execute(f"""
            COPY transformed_data
            TO '{output_path}' (
                FORMAT PARQUET,
                PARTITION_BY ({partition_cols})
            )
        """)
    else:
        output_file = os.path.join(output_dir, f"data_{timestamp}.parquet")
        conn.execute(f"""
            COPY transformed_data
            TO '{output_file}' (FORMAT PARQUET)
        """)

    return conn.execute("SELECT COUNT(*) FROM transformed_data").fetchone()[0]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Process large datasets with DuckDB')
    parser.add_argument('--input', nargs='+', required=True, help='Input file paths')
    parser.add_argument('--output-dir', required=True, help='Output directory')
    parser.add_argument('--partition-by', help='Column(s) to partition by, comma-separated')
    args = parser.parse_args()

    partition_cols = args.partition_by.split(',') if args.partition_by else None
    record_count = process_large_dataset(args.input, args.output_dir, partition_cols)
    print(f"Processed {record_count} records successfully")
```
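Assuming the script above is saved as, say, process_data.py (the file name, input pattern, and directories here are just placeholders), it could be invoked like this:

```bash
python process_data.py \
    --input data/events_2022_*.parquet \
    --output-dir ./output \
    --partition-by year,month
```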
Performance Optimization Tips
To get the most out of DuckDB in your ETL processes, consider these optimization techniques:
- Use Appropriate Data Types: DuckDB works best when data types match the content (integers for IDs, DATE/TIMESTAMP for dates, etc.).
- Leverage Parallelism: DuckDB can use multiple CPU cores. Control the thread count with:

```python
conn.execute("PRAGMA threads=4")  # Adjust the number based on your CPU
```

- Process Files Directly: When possible, let DuckDB read files itself rather than loading them into pandas first:

```python
# Instead of:
df = pd.read_csv('large_file.csv')
result = conn.execute("SELECT * FROM df").fetchdf()

# Do this:
result = conn.execute("SELECT * FROM read_csv_auto('large_file.csv')").fetchdf()
```

- Use Persistent Storage Wisely: For very large datasets, consider a file-based database instead of an in-memory one:

```python
conn = duckdb.connect('my_database.db')  # Persists to disk
```

- Partition Outputs: For large output datasets, use partitioning to create more manageable files:

```python
conn.execute("""
    COPY results TO 'output_directory' (
        FORMAT PARQUET,
        PARTITION_BY (year, month)
    )
""")
```
Comparing DuckDB with Alternatives
When should you choose DuckDB over alternatives like Pandas, SQL databases, or Spark?
| Tool | Best For | Limitations |
|------|----------|-------------|
| DuckDB | Single-node analytics, medium to large datasets, SQL users | Limited transaction support, single machine only |
| Pandas | Small to medium datasets, complex transformations | Memory-constrained, less SQL support |
| Traditional SQL DB | ACID transactions, complex relations | Setup overhead, less analytical performance |
| Spark | Truly big data, distributed processing | Complexity, overhead for smaller datasets |
Conclusion
DuckDB offers a powerful and easy-to-use solution for ETL processes in both Jupyter Notebooks and Python scripts. Its SQL interface, excellent performance, and seamless integration with Python data tools make it an ideal choice for many data engineering tasks.
Whether you’re exploring data interactively in a notebook or building production-ready ETL pipelines, DuckDB can significantly simplify your workflow while providing the performance needed for modern data processing challenges.
For more information and advanced usage, check out the official DuckDB documentation.
Have you tried DuckDB in your data projects? What has your experience been? Feel free to share in the comments below!