Arrow: Avoid buffer overflow by avoiding a sort #1539
Fokko wants to merge 0 commits into apache:main
Conversation
Force-pushed from e548117 to 4658c3c
kevinjqliu left a comment:
LGTM, I left a few comments.
| y = ["fixed_string"] * 30_000 | ||
| tb = pa.chunked_array([y] * 10_000) | ||
| # Create pa.table | ||
| arrow_table = pa.table({"a": ta, "b": tb}) |
It wasn't obvious to me that this test's offsets go beyond 32 bits, but I ran it and 4800280000 is > 2^32 (4294967296):

```
>>> len(arrow_table)
300000000
>>> arrow_table.get_total_buffer_size()
4800280000
```
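For anyone who wants to re-run that check, a minimal sketch of it (hypothetical: the `ta` column from the quoted test code is not defined in the snippet, so it is omitted here and the exact byte count will differ; the string column alone already pushes the buffers past 2**32, and building it allocates several GB):

```python
import pyarrow as pa

# Sketch of the check above, using only the string column from the quoted test
# code ("ta" is not shown in the snippet, so column "a" is left out here).
y = ["fixed_string"] * 30_000             # 12 bytes per value
tb = pa.chunked_array([y] * 10_000)       # 300_000_000 values in total
arrow_table = pa.table({"b": tb})

print(len(arrow_table))                        # 300000000
print(arrow_table.get_total_buffer_size())     # > 2**32 (4294967296)
assert arrow_table.get_total_buffer_size() > 2**32
```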
Force-pushed from 04a8218 to 3841fe7
pyiceberg/partitioning.py (outdated):

```python
# When adding files, it can be that we still need to convert from logical types to physical types
iceberg_typed_value = _to_partition_representation(iceberg_type, value)
```
Is this due to the fact that we already transform the partition key value with `partition.transform.pyarrow_transform(source_field.field_type)(arrow_table[source_field.name])`, and this expects the untransformed value? If that's the case, can we just omit the transformation before the `group_by`?
Ah, of course. We want to know the output tuples after the transform, so omitting the transformation is not possible. I think we could do a follow-up PR where we split out the logic for the write path and the add-files path. After this PR it is not needed for partitioned writes; we only need it to preprocess when importing partitions.
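To make the approach concrete, a rough standalone sketch of "determine the unique partition tuples, then filter on them individually" (plain pyarrow only; an identity transform stands in for the real partition transform, and none of the pyiceberg helpers discussed above are used):

```python
import pyarrow as pa
import pyarrow.compute as pc

# Toy table: "ts_day" plays the role of an already-transformed partition column.
arrow_table = pa.table({"ts_day": [1, 1, 2, 2, 2], "payload": ["a", "b", "c", "d", "e"]})

# Stand-in for partition.transform.pyarrow_transform(...)(arrow_table[...]);
# here the transform is simply the identity.
transformed = arrow_table["ts_day"]

# Determine the unique partition tuples, then filter the table once per tuple
# instead of sorting the whole table first.
for value in pc.unique(transformed):
    partition_chunk = arrow_table.filter(pc.equal(transformed, value))
    print(value, partition_chunk.num_rows)
```

Each filtered chunk then corresponds to one partition, without ever sorting the full table.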
Force-pushed from 3841fe7 to c84dd8d
Ugh, accidentally pushed

:'(
Second attempt of #1539

This was already being discussed back here: #208 (comment)

This PR changes from doing a sort and then a single pass over the table to the approach where we determine the unique partition tuples and filter on them individually.

Fixes #1491, because the sort caused buffers to be joined where they would overflow in Arrow. I think this is an issue on the Arrow side, and it should automatically break up into smaller buffers. The `combine_chunks` method does this correctly.

Now:

```
Run 0 took: 0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds
```

Before:

```
Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds
```

So it comes with a nice speedup as well :)

---------

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
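As extra context on the overflow mentioned above, a small standalone sketch (assumes only pyarrow; the 2**31 - 1 bound is the Arrow format's offset limit for the classic `string` type, not anything from this PR's code):

```python
import pyarrow as pa

# The classic string type uses signed 32-bit offsets, so one contiguous array can
# address at most 2**31 - 1 bytes of character data. The test table carries
# roughly 3.6 GB of string data, which only fits if it stays split across chunks.
bytes_needed = 300_000_000 * len("fixed_string")
offset_limit = 2**31 - 1
print(bytes_needed, ">", offset_limit, "->", bytes_needed > offset_limit)

# A chunked array keeps a separate (small) offset buffer per chunk:
chunked = pa.chunked_array([["fixed_string"] * 30_000] * 3)
print(chunked.type, chunked.num_chunks)   # string, 3
```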
# Rationale for this change
Found out I broke this myself after doing a `git bisect`:
````
36d383d is the first bad commit
commit 36d383d
Author: Fokko Driesprong <fokko@apache.org>
Date: Thu Jan 23 07:50:54 2025 +0100
PyArrow: Avoid buffer-overflow by avoid doing a sort (#1555)
Second attempt of #1539
This was already being discussed back here:
#208 (comment)
This PR changes from doing a sort, and then a single pass over the table
to the approach where we determine the unique partition tuples filter on
them individually.
Fixes #1491
Because the sort caused buffers to be joined where it would overflow in
Arrow. I think this is an issue on the Arrow side, and it should
automatically break up into smaller buffers. The `combine_chunks` method
does this correctly.
Now:
```
0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds
```
Before:
```
Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds
```
So it comes with a nice speedup as well :)
---------
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
pyiceberg/io/pyarrow.py | 129 ++-
pyiceberg/partitioning.py | 39 +-
pyiceberg/table/__init__.py | 6 +-
pyproject.toml | 1 +
tests/benchmark/test_benchmark.py | 72 ++
tests/integration/test_partitioning_key.py | 1299 ++++++++++++++--------------
tests/table/test_locations.py | 2 +-
7 files changed, 805 insertions(+), 743 deletions(-)
create mode 100644 tests/benchmark/test_benchmark.py
````
Closes #1917
# Are these changes tested?
# Are there any user-facing changes?