Arrow: Avoid buffer overflow by avoiding a sort #1539
Fokko wants to merge 0 commits into apache:main
Conversation
Force-pushed from e548117 to 4658c3c
kevinjqliu left a comment:
LGTM, I left a few comments.
| y = ["fixed_string"] * 30_000 | ||
| tb = pa.chunked_array([y] * 10_000) | ||
| # Create pa.table | ||
| arrow_table = pa.table({"a": ta, "b": tb}) |
It wasn't obvious to me that this test's offsets go beyond 32 bits, but I ran it and 4800280000 is > 2^32 (4294967296):

```
>>> len(arrow_table)
300000000
>>> arrow_table.get_total_buffer_size()
4800280000
```
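For anyone who wants to re-run that check, a minimal sketch of it (hypothetical: the `ta` column from the quoted test code is not defined in the snippet, so it is omitted here and the exact byte count will differ; the string column alone already pushes the buffers past 2**32, and building it allocates several GB):

```python
import pyarrow as pa

# Sketch of the check above, using only the string column from the quoted test
# code ("ta" is not shown in the snippet, so column "a" is left out here).
y = ["fixed_string"] * 30_000             # 12 bytes per value
tb = pa.chunked_array([y] * 10_000)       # 300_000_000 values in total
arrow_table = pa.table({"b": tb})

print(len(arrow_table))                        # 300000000
print(arrow_table.get_total_buffer_size())     # > 2**32 (4294967296)
assert arrow_table.get_total_buffer_size() > 2**32
```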
Force-pushed from 04a8218 to 3841fe7
pyiceberg/partitioning.py (outdated):

```python
# When adding files, it can be that we still need to convert from logical types to physical types
iceberg_typed_value = _to_partition_representation(iceberg_type, value)
```
Is this due to the fact that we already transform the partition key value with `partition.transform.pyarrow_transform(source_field.field_type)(arrow_table[source_field.name])`, and this expects the untransformed value? If that's the case, can we just omit the transformation before the `group_by`?
Ah, of course. We want to know the output tuples after the transform, so omitting the transformation is not possible. I think we could do a follow-up PR where we split out the logic for the write path and the add-files path. After this PR it is not needed for partitioned writes; we only need it to preprocess when importing partitions.
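To make the approach concrete, a rough standalone sketch of "determine the unique partition tuples, then filter on them individually" (plain pyarrow only; an identity transform stands in for the real partition transform, and none of the pyiceberg helpers discussed above are used):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Toy table: "ts_day" plays the role of an already-transformed partition column.
arrow_table = pa.table({"ts_day": [1, 1, 2, 2, 2], "payload": ["a", "b", "c", "d", "e"]})

# Stand-in for partition.transform.pyarrow_transform(...)(arrow_table[...]);
# here the transform is simply the identity.
transformed = arrow_table["ts_day"]

# Determine the unique partition tuples, then filter the table once per tuple
# instead of sorting the whole table first.
for value in pc.unique(transformed):
    partition_chunk = arrow_table.filter(pc.equal(transformed, value))
    print(value, partition_chunk.num_rows)
```

Each filtered chunk then corresponds to one partition, without ever sorting the full table.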
Force-pushed from 3841fe7 to c84dd8d
Ugh, accidentally pushed

:'(
Second attempt of #1539

This was already being discussed back here: #208 (comment)

This PR changes from doing a sort and then a single pass over the table to the approach where we determine the unique partition tuples and filter on them individually.

Fixes #1491, because the sort caused buffers to be joined where they would overflow in Arrow. I think this is an issue on the Arrow side, and it should automatically break up into smaller buffers. The `combine_chunks` method does this correctly.

Now:

```
Run 0 took: 0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds
```

Before:

```
Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds
```

So it comes with a nice speedup as well :)

---------

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
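As extra context on the overflow mentioned above, a small standalone sketch (assumes only pyarrow; the 2**31 - 1 bound is the Arrow format's offset limit for the classic `string` type, not anything from this PR's code):

```python
import pyarrow as pa

# The classic string type uses signed 32-bit offsets, so one contiguous array can
# address at most 2**31 - 1 bytes of character data. The test table carries
# roughly 3.6 GB of string data, which only fits if it stays split across chunks.
bytes_needed = 300_000_000 * len("fixed_string")
offset_limit = 2**31 - 1
print(bytes_needed, ">", offset_limit, "->", bytes_needed > offset_limit)

# A chunked array keeps a separate (small) offset buffer per chunk:
chunked = pa.chunked_array([["fixed_string"] * 30_000] * 3)
print(chunked.type, chunked.num_chunks)   # string, 3
```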
# Rationale for this change
Found out I broke this myself after doing a `git bisect`:
````
36d383d is the first bad commit
commit 36d383d
Author: Fokko Driesprong <fokko@apache.org>
Date: Thu Jan 23 07:50:54 2025 +0100
PyArrow: Avoid buffer-overflow by avoid doing a sort (#1555)
Second attempt of #1539
This was already being discussed back here:
#208 (comment)
This PR changes from doing a sort, and then a single pass over the table
to the approach where we determine the unique partition tuples filter on
them individually.
Fixes #1491
Because the sort caused buffers to be joined where it would overflow in
Arrow. I think this is an issue on the Arrow side, and it should
automatically break up into smaller buffers. The `combine_chunks` method
does this correctly.
Now:
```
0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds
```
Before:
```
Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds
```
So it comes with a nice speedup as well :)
---------
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
pyiceberg/io/pyarrow.py | 129 ++-
pyiceberg/partitioning.py | 39 +-
pyiceberg/table/__init__.py | 6 +-
pyproject.toml | 1 +
tests/benchmark/test_benchmark.py | 72 ++
tests/integration/test_partitioning_key.py | 1299 ++++++++++++++--------------
tests/table/test_locations.py | 2 +-
7 files changed, 805 insertions(+), 743 deletions(-)
create mode 100644 tests/benchmark/test_benchmark.py
````
Closes #1917
# Are these changes tested?
# Are there any user-facing changes?