
Commit

revised for explicit note
hulmanaseer00 committed Dec 5, 2024
1 parent 701517b commit c6e67f4
Showing 1 changed file with 3 additions and 1 deletion.
docs/website/docs/reference/performance.md: 4 changes (3 additions & 1 deletion)
@@ -146,7 +146,9 @@ As before, **if you have just a single table with millions of records, you shoul

<!--@@@DLT_SNIPPET ./performance_snippets/toml-snippets.toml::normalize_workers_2_toml-->

-The normalize stage in `dlt` uses a process pool to create load packages concurrently, and the settings for `file_max_items` and `file_max_bytes` significantly influence load behavior. By setting a lower value for `file_max_items` or `file_max_bytes`, you can reduce the size of each data chunk sent to the destination database. This is particularly helpful for managing memory constraints on the database server and ensures data is inserted in manageable chunks. Without explicit configuration, `dlt` writes all data rows into one large intermediary file, attempting to insert all data at once. Adjusting these settings enables file rotation and splits the data into smaller, more efficient chunks, improving performance and avoiding potential memory issues, especially when working with large tables containing millions of records.
+The **normalize** stage in `dlt` uses a process pool to create load packages concurrently, and the settings for `file_max_items` and `file_max_bytes` play a crucial role in determining the size of data chunks. Lower values for these settings reduce the size of each chunk sent to the destination database, which is particularly helpful for managing memory constraints on the database server. By default, `dlt` writes all data rows into one large intermediary file, attempting to load all data at once. Configuring these settings enables file rotation, splitting the data into smaller, more manageable chunks. This not only improves performance but also minimizes memory-related issues when working with large tables containing millions of records.
+
+**Note:** The intermediary files generated during the **normalize** stage are also used in the **load** stage. Therefore, adjusting `file_max_items` and `file_max_bytes` in the **normalize** stage directly impacts the size and number of data chunks sent to the destination, influencing loading behavior and performance.

### Parallel pipeline config example
The example below simulates the loading of a large database table with 1,000,000 records. The **config.toml** below sets the parallelization as follows:
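For context, a minimal `config.toml` sketch of the rotation settings the changed paragraph describes; the `[normalize.data_writer]` section name and the example values are assumptions based on the TOML snippet referenced in the diff, not part of this commit:

```toml
# Sketch only: enable file rotation in the normalize stage so data is split
# into smaller intermediary files instead of one large file (values illustrative).
[normalize.data_writer]
file_max_items = 100000      # rotate the intermediary file after ~100k rows
file_max_bytes = 100000000   # or after ~100 MB, whichever limit is reached first
```

With either limit configured, `dlt` should rotate the intermediary file once the threshold is reached, so the load stage sends several smaller packages to the destination rather than attempting to load everything at once.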
