Groups


Setup

  1. Install DuckDB

  2. Run duckdb to start the DuckDB shell

  3. load the TPC-H extension

    INSTALL tpch;
    LOAD tpch;
  4. Generate the TPC-H data at scale factor 1

    CALL dbgen(sf = 1);

    The scale factor can be adjusted as needed (e.g., sf = 10 for scale factor 10). Be careful with larger scale factors as they will consume more disk space and memory.

    Scale factor Database size Generation time Single-step generation memory
    100 26 GB 17 minutes 71 GB
    300 78 GB 51 minutes 211 GB
    1,000 265 GB 2 h 53 minutes 647 GB
    3,000 796 GB 8 h 30 minutes 1,799 GB
  5. Verify the data has been generated by querying a table, for example:

    SELECT COUNT(*) FROM lineitem;
  6. Additional documentation at https://duckdb.org/docs/extensions/tpch

Configuration knobs

Working with DuckDB

I recommend having a setup.sql file that contains all the setup commands. The dbgen call can take awhile, so its good to separate this into its own file and isolate it from the rest of your queries.

-- setup.sql
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 1); -- adjust scale factor as needed
SET memory_limit = '10GB'; -- adjust memory limit as needed
SET threads = 4; -- adjust number of threads as needed, if constant

Then create separate files for each query you want to run, for example query3.sql:

.timer on
SET threads = 4; -- adjust number of threads as needed, if not constant

SELECt ...

You can invoke DuckDB from the command line and run your setup and query files like this:

duckdb tpch.db < setup.sql
duckdb tpch.db < query3.sql # for one-off

Or you can open the DuckDB shell and read in the files interactively with duckdb tpch.db

duckdb> .read query3.sql
duckdb> .read query3.sql
duckdb> .read query3.sql