fix: avoid "Unable to determine type" warning with JSON columns in to_dataframe#1876
fix: avoid "Unable to determine type" warning with JSON columns in to_dataframe#1876
to_dataframe#1876Conversation
I might actually want to do something in …

Right now the behavior is inconsistent across the REST and BQ Storage APIs.

Marking as …

Actually, I think this needs a few more tests. I'm testing manually with …
```python
# Prefer the JSON type built into pyarrow (added in 19.0.0), if available.
# Otherwise, fall back to db-dtypes, where JSONArrowType was added in 1.4.0,
# but since an older db-dtypes might be installed, fall back to string after that.
# TODO(https://github.com/pandas-dev/pandas/issues/60958): switch to
# pyarrow.json_(pyarrow.string()) if available and supported by pandas.
if hasattr(db_dtypes, "JSONArrowType"):
    json_arrow_type = db_dtypes.JSONArrowType()
else:
    json_arrow_type = pyarrow.string()
```
This is the key change. Mostly aligns with bigframes, but we've left off pyarrow.json_(pyarrow.string()) because of pandas-dev/pandas#60958.
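For context, here is a minimal sketch of what the preference chain could look like once that pandas issue is resolved. This is hypothetical and not what this PR ships; it assumes `pyarrow.json_` (pyarrow >= 19.0.0) and `db_dtypes` are importable:

```python
import db_dtypes
import pyarrow

# Hypothetical future ordering, once pandas supports pyarrow's JSON
# extension type: prefer the built-in Arrow JSON type, then db-dtypes,
# then plain strings as the last resort.
if hasattr(pyarrow, "json_"):
    json_arrow_type = pyarrow.json_(pyarrow.string())
elif hasattr(db_dtypes, "JSONArrowType"):
    json_arrow_type = db_dtypes.JSONArrowType()
else:
    json_arrow_type = pyarrow.string()
```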
Marking as … Edit: Mailed #2144
I've added regression tests for #1580 |
tests/system/test_arrow.py
| "json_array_col", | ||
| ] | ||
| assert table.shape == (0, 5) | ||
| assert list(table.field("struct_col").type.names) == ["json_field", "int_field"] |
Test failure:

```
____________________ test_to_arrow_query_with_empty_results ____________________
bigquery_client = 

    def test_to_arrow_query_with_empty_results(bigquery_client):
        """
        JSON regression test for https://github.com/googleapis/python-bigquery/issues/1580.
        """
        job = bigquery_client.query(
            """
            select
                123 as int_col,
                '' as string_col,
                to_json('{}') as json_col,
                struct(to_json('[]') as json_field, -1 as int_field) as struct_col,
                [to_json('null')] as json_array_col,
            from unnest([])
            """
        )
        table = job.to_arrow()
        assert list(table.column_names) == [
            "int_col",
            "string_col",
            "json_col",
            "struct_col",
            "json_array_col",
        ]
        assert table.shape == (0, 5)
>       assert list(table.field("struct_col").type.names) == ["json_field", "int_field"]
E       AttributeError: 'pyarrow.lib.StructType' object has no attribute 'names'
```
Need to update this to support older pyarrow.
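One possible fix, sketched under the assumption that iterating a `StructType` yields its child fields (which has been supported far longer than the `.names` attribute):

```python
# Collect child field names without StructType.names, which older
# pyarrow releases don't have. Iterating a StructType yields its fields.
struct_type = table.field("struct_col").type
field_names = [field.name for field in struct_type]
assert field_names == ["json_field", "int_field"]
```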
```python
# but we'd like this to map as closely to the BQ Storage API as
# possible, which uses the string() dtype, as JSON support in BigQuery
# predates JSON support in Arrow by several years.
"JSON": pyarrow.string,
```
Mapping to pa.string won't achieve a round trip, will it? A value saved locally won't be identifiable as JSON when read back. Does this matter to bigframes?
BigQuery sets metadata on the Field that can be used to determine this type. I don't want to diverge from the BigQuery Storage Read API behavior.
In bigframes and pandas-gbq, we have the BigQuery schema available to disambiguate and customize the pandas types.
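To illustrate (not verified against this PR): a consumer could inspect the Arrow field metadata to recover the BigQuery type. The metadata key and value below follow the `ARROW:extension:name` convention I believe the Storage Read API uses for JSON columns; treat both as assumptions:

```python
import pyarrow


def is_bigquery_json(field: pyarrow.Field) -> bool:
    """Heuristic: detect a BigQuery JSON column via Arrow field metadata.

    Assumes JSON columns are tagged with the ARROW:extension:name
    metadata key; adjust if the actual key or value differs.
    """
    metadata = field.metadata or {}
    return metadata.get(b"ARROW:extension:name") == b"google:sqlType:json"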
I can investigate if such an override is also possible here.
I made some progress plumbing a json_type through everywhere it would need to go to be able to override this, but once I got to to_arrow_iterable, it kinda breaks down. There we very much just return the pages we get from the BQ Storage Read API. I don't really want to override that, as it adds new layers of complexity to what was a relatively straightforward internal API.
I'd prefer to leave this as-is without the change to allow overriding the arrow type.
If there are still objections, I can try the same approach with just the pandas data type. That gets a bit awkward when it comes to struct, though.
It might be important to have this json_arrow_type feature for the work Chelsea is doing in bigframes and pandas-gbq. I'll give it another try, but since it's a feature, I think it should be a separate PR.
Started #2149 but I'm hitting some roadblocks. Will pause for now.
fix: avoid "Unable to determine type" warning with JSON columns in `to_dataframe`
* add regression tests for empty dataframe
* fix arrow test to be compatible with old pyarrow
```python
assert list(df.columns) == [
    "int_col",
    "string_col",
    "json_col",
    "struct_col",
    "json_array_col",
]
assert len(df.index) == 0
```
#QUESTION:
The test suite for test_to_arrow_query_with_empty_results is more robust than this one.
Is there a reason for the difference?
We're using object dtype in pandas for STRUCT and ARRAY columns right now. This means there's not much we can inspect about subfields/subtypes like we can with Arrow.
Related: I'd like to introduce a dtype_backend parameter like pandas has to pandas-gbq: https://github.com/googleapis/python-bigquery-pandas/issues/621
In the meantime, I tend to use:

```python
df = (
    results
    .to_arrow(bqstorage_client=bqstorage_client)
    .to_pandas(types_mapper=lambda type_: pandas.ArrowDtype(type_))
)
```

instead of to_dataframe() in my personal projects, as this will give me the more structured Arrow types all the time.
chalmerlowe left a comment:
LGTM.
Included a question for my own edification.
Not a blocker.
Approved.
fix: avoid "Unable to determine type" warning with JSON columns in `to_dataframe` (#1876)
* add regression tests for empty dataframe
* fix arrow test to be compatible with old pyarrow
Based on #2144, which should merge first.
Fixes #1580
🦕