@@ -76,9 +76,8 @@ workload across multiple batches, using a defined chunk size.

You will observe that the optimal chunk size highly depends on the shape of
your data, specifically the width of each record, i.e. the number of columns
- and their individual sizes. You will need to determine a good chunk size by
- running corresponding experiments on your own behalf. For that purpose, you
- can use the `insert_pandas.py`_ program as a blueprint.
+ and their individual sizes, which will in the end determine the total size of
+ each batch/chunk.

A few details should be taken into consideration when determining the optimal
chunk size for a specific dataset. We are outlining the two major ones.
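
To make the point about record width more tangible, here is a minimal sketch
that estimates how much data a given chunk size amounts to. It uses the
in-memory footprint reported by pandas as a rough proxy for the serialized
payload; the DataFrame shape and the candidate chunk sizes are illustrative
assumptions, not recommendations.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Stand-in for your own data: 100_000 records with ten float columns.
    df = pd.DataFrame(
        np.random.random((100_000, 10)),
        columns=[f"col_{i}" for i in range(10)],
    )

    # In-memory bytes per record, as a rough proxy for the serialized payload.
    bytes_per_record = df.memory_usage(deep=True, index=False).sum() / len(df)

    for chunksize in (1_000, 5_000, 20_000):
        megabytes = bytes_per_record * chunksize / 1e6
        print(f"chunksize={chunksize}: ~{megabytes:.1f} MB per batch")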
@@ -106,8 +105,11 @@ workload across multiple batches, using a defined chunk size.

It is a good idea to start your explorations with a chunk size of 5_000, and
then see if performance improves when you increase or decrease that figure.
- Chunk sizes of 20000 may also be applicable, but make sure to take the limits
- of your HTTP infrastructure into consideration.
+ Users report that 10_000-20_000 is their optimal setting, but if you process,
+ for example, just three "small" columns, you may also experiment with
+ `leveling up to 200_000`_, because `the chunksize should not be too small`_.
+ If it is too small, the per-batch I/O overhead outweighs the benefit of
+ batching.

In order to learn more about what wide- vs. long-form (tidy, stacked, narrow)
data means in the context of `DataFrame computing`_, let us refer you to `a
@@ -125,14 +127,16 @@ workload across multiple batches, using a defined chunk size.
.. _chunking: https://en.wikipedia.org/wiki/Chunking_(computing)
.. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
.. _DataFrame computing: https://realpython.com/pandas-dataframe/
- .. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py
+ .. _insert_pandas.py: https://github.com/crate/cratedb-examples/blob/main/by-language/python-sqlalchemy/insert_pandas.py
+ .. _leveling up to 200_000: https://acepor.github.io/2017/08/03/using-chunksize/
.. _NumPy: https://en.wikipedia.org/wiki/NumPy
.. _pandas: https://en.wikipedia.org/wiki/Pandas_(software)
.. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
.. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
.. _relational databases: https://en.wikipedia.org/wiki/Relational_database
.. _SQL: https://en.wikipedia.org/wiki/SQL
.. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
+ .. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/
.. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data
.. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow
.. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data
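
As a complement to the chunk size guidance above, the following sketch runs
the same insert with a few candidate chunk sizes and reports the elapsed
wall-clock time for each run. It assumes a CrateDB instance listening on
localhost:4200 and an installed CrateDB SQLAlchemy dialect that registers the
``crate://`` protocol; the table name ``testdrive`` and the synthetic
DataFrame are hypothetical placeholders for your own data.

.. code-block:: python

    import time

    import numpy as np
    import pandas as pd
    import sqlalchemy as sa

    # Synthetic stand-in data; replace with your own DataFrame.
    df = pd.DataFrame(
        np.random.random((100_000, 10)),
        columns=[f"col_{i}" for i in range(10)],
    )

    # Assumes the CrateDB SQLAlchemy dialect is installed, providing `crate://`.
    engine = sa.create_engine("crate://localhost:4200")

    for chunksize in (1_000, 5_000, 20_000):
        start = time.perf_counter()
        df.to_sql(
            "testdrive",          # hypothetical target table
            engine,
            if_exists="replace",  # start each run with a fresh table
            index=False,
            chunksize=chunksize,  # number of records submitted per batch
            method="multi",       # multi-row INSERT statements per batch
        )
        print(f"chunksize={chunksize}: {time.perf_counter() - start:.2f} seconds")

The absolute timings will vary with network latency and the limits of your
HTTP infrastructure, so treat the figures as relative indicators rather than
benchmarks.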