feat: Support bigframes.pandas.to_datetime for scalars, iterables and series.#372
feat: Support bigframes.pandas.to_datetime for scalars, iterables and series.#372Genesis929 merged 34 commits intomainfrom
Conversation
| x.re_search(TIMEZONE_POS_REGEX), | ||
| ( | ||
| ( | ||
| x.substr(0, x.length() - 6).to_timestamp(op.format) |
There was a problem hiding this comment.
6 is a bit of a "magic" number here. Please make some constants and comments explaining the intention.
| ) | ||
| .cast(ibis_dtypes.Timestamp(timezone="UTC")) | ||
| .cast(ibis_dtypes.int64) | ||
| - x.substr(x.length() - 5, 2).cast(ibis_dtypes.int64) |
There was a problem hiding this comment.
Likewise, this is more "magic" with regards to which substrings we're look at. Perhaps some helper functions would be useful too.
There was a problem hiding this comment.
Likewise for the rest of this function. Please refactor so that it's easier to validate the correctness of each smaller part.
| elif x.type() == ibis_dtypes.Timestamp(timezone="UTC"): | ||
| return x | ||
| elif x.type() != ibis_dtypes.timestamp: | ||
| unit = op.unit if op.unit is not None else "ns" |
There was a problem hiding this comment.
Let's comment why we are making "ns" the default.
| # Note: Due to an issue where casting directly to a non-UTC | ||
| # timezone does not work, we first cast to UTC. This seems | ||
| # to bypass a potential bug in Ibis's cast function, allowing | ||
| # for subsequent casting to a non-UTC timezone. Further |
There was a problem hiding this comment.
What would it mean to cast to non-UTC timezone type in BigQuery? It only supports UTC at the data-type level, even though other timezones are supported for parsing and formatting.
Please raise an error if someone tries to cast to a non-UTC timezone.
There was a problem hiding this comment.
Sorry about the confusion, will update the comment, this means without timezone. Basically this is for utc=True vs utc=False. Because of some unknown issue related to data type, potentially because of ibis, it's impossible to cast to the proper type when utc=False, unless cast it to utc timezone first.
There was a problem hiding this comment.
Based on tests it would appear although the result of int64 to_timestamp is in utc timezone, the cast function think the datatype is actually without timezone, and skip the cast, this is to fix the issue.
There was a problem hiding this comment.
Comment updated.
bigframes/core/tools/datetimes.py
Outdated
| f"to datetime is not implemented. {constants.FEEDBACK_LINK}" | ||
| ) | ||
|
|
||
| if ~isinstance(arg, bigframes.series.Series): |
There was a problem hiding this comment.
Don't use bitwise negation to negate a boolean.
| if ~isinstance(arg, bigframes.series.Series): | |
| if not isinstance(arg, bigframes.series.Series): |
There was a problem hiding this comment.
Also, let's add a comment that this is intended to support pandas Series (and Index maybe?).
bigframes/core/tools/datetimes.py
Outdated
| # TODO: Currently, data upload is performed using pandas DataFrames | ||
| # combined with the `read_pandas` method due to the BigFrames DataFrame | ||
| # constructor's limitations in handling various data types. Plan to update | ||
| # the upload process to utilize the BigPandas DataFrame constructor directly |
There was a problem hiding this comment.
| # the upload process to utilize the BigPandas DataFrame constructor directly | |
| # the upload process to utilize the BigQuery DataFrame constructor directly |
| bf_result = ( | ||
| bpd.to_datetime(arg, utc=utc, unit=unit, format=format) | ||
| .to_pandas() | ||
| .astype("datetime64[ns, UTC]" if utc else "datetime64[ns]") |
There was a problem hiding this comment.
Fascinating. So utc=False will use DATETIME type in BigQuery?
There was a problem hiding this comment.
Yes, for utc=False, it will be later cast to DATETIME type.
There was a problem hiding this comment.
This is the example sql: SELECT
CAST(t0.0 AS DATETIME) AS Cast_0_ timestamp
FROM ...
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
https://screenshot.googleplex.com/34gywaqpL9yYGSm
https://screenshot.googleplex.com/9QCcdhPgiuwbRk7
https://screenshot.googleplex.com/3sjTQ3xwj3MWp9j
For string/datetime dtypes, currently are limited to utc=True as we have limited timezone support.
Fixes #<issue_number_goes_here> 🦕