feat: Improve local data validation #1598
Conversation
```python
session.read_pandas(df)

def test_read_pandas_inline_w_noninlineable_type_raises_error():
```
maybe you don't want to remove this test?
everything is inlinable now, no way to get this error anymore
bigframes/core/local_data.py
```python
def _adapt_pandas_series(
    series: pandas.Series,
) -> tuple[Union[pa.ChunkedArray, pa.Array], bigframes.dtypes.Dtype]:
    if series.dtype == np.dtype("O"):
```
Attempting to identify geo_dtype by trying conversion on all object columns seems risky because pandas also classifies PyArrow types as object. I'm concerned this could lead to non-geographic PyArrow columns being incorrectly identified as geo_dtype.
Yeah, valid concern, especially since we don't control the geopandas constructor. Probably should try the pyarrow conversion first, and try geopandas as a fallback.
ok, changed to fall back to attempting object->geo only after other conversions fail.
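For illustration, a minimal sketch of that ordering: try the generic pyarrow conversion first, and attempt the geopandas interpretation only when it fails. The function name, the string dtype labels, and the exception list are assumptions for this sketch, not the PR's actual code:

```python
import geopandas
import pandas
import pyarrow as pa


def adapt_object_series(series: pandas.Series) -> tuple[pa.Array, str]:
    # Try the plain pyarrow conversion first, so non-geographic object
    # columns (including pyarrow-backed values) take the normal path.
    try:
        return pa.array(series), "inferred"
    except (TypeError, pa.ArrowInvalid, pa.ArrowTypeError):
        pass
    # Only once the generic conversion fails do we attempt the
    # object -> geography interpretation via geopandas, serializing to WKT.
    wkt = geopandas.GeoSeries(series).to_wkt()
    return pa.array(wkt, type=pa.string()), "geography"
```

Ordering it this way means non-geographic object columns never reach the geopandas constructor, which addresses the misclassification concern raised above.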
```python
return cls(total_bytes=table.nbytes, row_count=table.num_rows)


_MANAGED_STORAGE_TYPES_OVERRIDES: dict[bigframes.dtypes.Dtype, pa.DataType] = {
```
Now that we use ManagedArrowTable (leveraging PyArrow) for local data, should we centralize our conversion logic around it? Specifically, DataFrameAndLabels also handles conversions (pandas <-> BigQuery), so unifying these under a single class seems beneficial. If we standardize on PyArrow as the intermediate "transfer station," the pandas-to-BigQuery conversion could simply become a pandas -> Arrow -> BigQuery pipeline, simplifying the overall system.
Yes, my plan is that everything goes through local managed storage, to ensure consistent normalization and validation.
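A rough sketch of that "transfer station" idea: pandas data is adapted to Arrow once, and the BigQuery schema is derived from the Arrow schema rather than from pandas dtypes directly. The function names and the deliberately partial type mapping below are hypothetical, not bigframes APIs:

```python
import pandas
import pyarrow as pa

# Partial Arrow -> BigQuery type mapping, for illustration only.
_ARROW_TO_BQ = {
    pa.int64(): "INT64",
    pa.float64(): "FLOAT64",
    pa.bool_(): "BOOL",
    pa.string(): "STRING",
}


def pandas_to_bigquery_schema(df: pandas.DataFrame) -> list[tuple[str, str]]:
    # pandas -> Arrow -> BigQuery: Arrow is the single intermediate
    # representation, so pandas-specific normalization happens exactly once.
    table = pa.Table.from_pandas(df, preserve_index=False)
    return [(field.name, _ARROW_TO_BQ[field.type]) for field in table.schema]
```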
```python
return pa.array(series, type=pa.string()), bigframes.dtypes.GEO_DTYPE
try:
    return _adapt_arrow_array(pa.array(series))
except Exception as e:
```
Instead of catching a generic Exception, what specific exception types should I expect when performing geographic data type conversions? Could TypeError be one of them?
There are a couple of TypeErrors, but also ArrowInvalid. I don't really trust it not to change, though; the docs don't specify what error types they will give you.
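For illustration, what a narrowed except clause might look like. Whether this exception list is complete is exactly the concern raised above, so treat it as an assumption rather than the merged code:

```python
from typing import Optional

import pandas
import pyarrow as pa


def try_arrow_conversion(series: pandas.Series) -> Optional[pa.Array]:
    # Narrows the generic `except Exception` to the error types observed
    # so far. pyarrow does not document which exceptions pa.array raises,
    # so this list may be incomplete or break across pyarrow versions.
    try:
        return pa.array(series)
    except (TypeError, pa.ArrowInvalid, pa.ArrowTypeError):
        return None
```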
Fixes #<issue_number_goes_here> 🦕