fix: avoid "Unable to determine type" warning with JSON columns in to_dataframe#1876
fix: avoid "Unable to determine type" warning with JSON columns in to_dataframe#1876
to_dataframe#1876Conversation
I might actually want to do something in …

Right now the behavior is inconsistent across the REST and BQ Storage APIs.

Marking as …

Actually, I think this needs a few more tests. I'm testing manually with …
```python
# Prefer the JSON type built into pyarrow (added in 19.0.0), if available.
# Otherwise, fall back to db-dtypes, where JSONArrowType was added in 1.4.0,
# but since an older db-dtypes might be installed, fall back to string after that.
# TODO(https://github.com/pandas-dev/pandas/issues/60958): switch to
# pyarrow.json_(pyarrow.string()) if available and supported by pandas.
if hasattr(db_dtypes, "JSONArrowType"):
    json_arrow_type = db_dtypes.JSONArrowType()
else:
    json_arrow_type = pyarrow.string()
```
This is the key change. Mostly aligns with bigframes, but we've left off pyarrow.json_(pyarrow.string()) because of pandas-dev/pandas#60958.
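For context, here is a minimal sketch of what the preference chain could look like once that pandas issue is resolved. This is hypothetical and not what this PR ships; it assumes `pyarrow.json_` (pyarrow >= 19.0.0) and `db_dtypes` are importable:

```python
import db_dtypes
import pyarrow

# Hypothetical future ordering, once pandas supports pyarrow's JSON
# extension type: prefer the built-in Arrow JSON type, then db-dtypes,
# then plain strings as the last resort.
if hasattr(pyarrow, "json_"):
    json_arrow_type = pyarrow.json_(pyarrow.string())
elif hasattr(db_dtypes, "JSONArrowType"):
    json_arrow_type = db_dtypes.JSONArrowType()
else:
    json_arrow_type = pyarrow.string()
```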
Marking as … Edit: Mailed #2144
I've added regression tests for #1580 |
tests/system/test_arrow.py
| "json_array_col", | ||
| ] | ||
| assert table.shape == (0, 5) | ||
| assert list(table.field("struct_col").type.names) == ["json_field", "int_field"] |
Test failure:

```
____________________ test_to_arrow_query_with_empty_results ____________________
bigquery_client = 

    def test_to_arrow_query_with_empty_results(bigquery_client):
        """
        JSON regression test for https://github.com/googleapis/python-bigquery/issues/1580.
        """
        job = bigquery_client.query(
            """
            select
                123 as int_col,
                '' as string_col,
                to_json('{}') as json_col,
                struct(to_json('[]') as json_field, -1 as int_field) as struct_col,
                [to_json('null')] as json_array_col,
            from unnest([])
            """
        )
        table = job.to_arrow()
        assert list(table.column_names) == [
            "int_col",
            "string_col",
            "json_col",
            "struct_col",
            "json_array_col",
        ]
        assert table.shape == (0, 5)
>       assert list(table.field("struct_col").type.names) == ["json_field", "int_field"]
E       AttributeError: 'pyarrow.lib.StructType' object has no attribute 'names'
```
Need to update this to support older pyarrow.
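One possible fix, sketched under the assumption that iterating a `StructType` yields its child fields (which has been supported far longer than the `.names` attribute):

```python
# Collect child field names without StructType.names, which older
# pyarrow releases don't have. Iterating a StructType yields its fields.
struct_type = table.field("struct_col").type
field_names = [field.name for field in struct_type]
assert field_names == ["json_field", "int_field"]
```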
```python
# but we'd like this to map as closely to the BQ Storage API as
# possible, which uses the string() dtype, as JSON support in BigQuery
# predates JSON support in Arrow by several years.
"JSON": pyarrow.string,
```
Mapping to pa.string won't achieve a round trip, will it? A value saved locally won't be identifiable as JSON when read back. Does this matter to bigframes?
BigQuery sets metadata on the Field that can be used to determine this type. I don't want to diverge from the BigQuery Storage Read API behavior.
In bigframes and pandas-gbq, we have the BigQuery schema available to disambiguate and customize the pandas types.
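To illustrate (not verified against this PR): a consumer could inspect the Arrow field metadata to recover the BigQuery type. The metadata key and value below follow the `ARROW:extension:name` convention I believe the Storage Read API uses for JSON columns; treat both as assumptions:

```python
import pyarrow


def is_bigquery_json(field: pyarrow.Field) -> bool:
    """Heuristic: detect a BigQuery JSON column via Arrow field metadata.

    Assumes JSON columns are tagged with the ARROW:extension:name
    metadata key; adjust if the actual key or value differs.
    """
    metadata = field.metadata or {}
    return metadata.get(b"ARROW:extension:name") == b"google:sqlType:json"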
I can investigate if such an override is also possible here.
I made some progress plumbing a json_type through everywhere it would need to go to be able to override this, but once I got to to_arrow_iterable, it kinda breaks down. There we very much just return the pages we get from the BQ Storage Read API. I don't really want to override that, as it adds new layers of complexity to what was a relatively straightforward internal API.
I'd prefer to leave this as-is without the change to allow overriding the arrow type.
If there are still objections, I can try the same approach with just the pandas data type. That gets a bit awkward when it comes to struct, though.
It might be important to have this json_arrow_type feature for the work Chelsea is doing in bigframes and pandas-gbq. I'll give it another try, but since it's a feature, I think it should be a separate PR.
Started #2149 but I'm hitting some roadblocks. Will pause for now.
fix: avoid "Unable to determine type" warning with JSON columns in `to_dataframe`
* add regression tests for empty dataframe
* fix arrow test to be compatible with old pyarrow
```python
assert list(df.columns) == [
    "int_col",
    "string_col",
    "json_col",
    "struct_col",
    "json_array_col",
]
assert len(df.index) == 0
```
#QUESTION:
The test suite for test_to_arrow_query_with_empty_results is more robust than this one.
Is there a reason for the difference?
We're using object dtype in pandas for STRUCT and ARRAY columns right now. This means there's not much we can inspect about subfields/subtypes like we can with Arrow.
Related: I'd like to introduce a dtype_backend parameter like pandas has to pandas-gbq: https://github.com/googleapis/python-bigquery-pandas/issues/621
In the meantime, I tend to use:

```python
df = (
    results
    .to_arrow(bqstorage_client=bqstorage_client)
    .to_pandas(types_mapper=lambda type_: pandas.ArrowDtype(type_))
)
```

instead of to_dataframe() in my personal projects, as this will give me the more structured Arrow types all the time.
chalmerlowe left a comment:
LGTM.
Included a question for my own edification.
Not a blocker.
Approved.
fix: avoid "Unable to determine type" warning with JSON columns in `to_dataframe` (#1876)
* add regression tests for empty dataframe
* fix arrow test to be compatible with old pyarrow
Based on #2144, which should merge first.
Fixes #1580
🦕