Install DuckDB
Run duckdb to start the DuckDB shell
load the TPC-H extension
INSTALL tpch;
LOAD tpch;Generate the TPC-H data at scale factor 1
CALL dbgen(sf = 1);
The scale factor can be adjusted as needed (e.g.,
sf = 10 for scale factor 10). Be careful with larger scale
factors as they will consume more disk space and memory.
| Scale factor | Database size | Generation time | Single-step generation memory |
|---|---|---|---|
| 100 | 26 GB | 17 minutes | 71 GB |
| 300 | 78 GB | 51 minutes | 211 GB |
| 1,000 | 265 GB | 2 h 53 minutes | 647 GB |
| 3,000 | 796 GB | 8 h 30 minutes | 1,799 GB |
Verify the data has been generated by querying a table, for example:
SELECT COUNT(*) FROM lineitem;Additional documentation at https://duckdb.org/docs/extensions/tpch
CALL dbgen(sf=1)): 1, 2, 3, 4, 5, 6, 7,
8, 9, 10SET memory_limit = '<value>';):
1GB, 2GB, 5GB, 10GBSET threads = <value>;): 1, 2, 4, 8temp_directory,
preserve_insertion_order, or any settings from
https://duckdb.org/docs/stable/configuration/overview, record the
rationale and observed effectI recommend having a setup.sql file that contains all
the setup commands. The dbgen call can take awhile, so its good to
separate this into its own file and isolate it from the rest of your
queries.
-- setup.sql
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 1); -- adjust scale factor as needed
SET memory_limit = '10GB'; -- adjust memory limit as needed
SET threads = 4; -- adjust number of threads as needed, if constantThen create separate files for each query you want to run, for
example query3.sql:
.timer on
SET threads = 4; -- adjust number of threads as needed, if not constant
SELECt ...You can invoke DuckDB from the command line and run your setup and query files like this:
duckdb tpch.db < setup.sql
duckdb tpch.db < query3.sql # for one-offOr you can open the DuckDB shell and read in the files interactively
with duckdb tpch.db
duckdb> .read query3.sql
duckdb> .read query3.sql
duckdb> .read query3.sql