perf: defer query in read_gbq with wildcard tables#1661
Conversation
|
Failures look like real failures. While we can query such fields, it looks like we can't materialize them. |
…g pseudocolumns
Fixes this code sample:
import bigframes.pandas as bpd
df = bpd.read_gbq("bigquery-public-data.google_analytics_sample.ga_sessions_*")
df[df["_TABLE_SUFFIX"] == "20161204"].peek()
|
Added |
|
From the notebook tests: Looks like we're missing some imports too. |
|
Tested the failing samples tests locally. I think my latest commits solve the issue of not being able to materialize to |
|
e2e and notebook failures are for the same: I don't think these relate to this change, but I do recall BQML having a hard time with time travel. Potentially we're missing a force cache somewhere now? |
|
Notebook failures appear to be flakes, as they are related to remote functions and succeeded in 3.10 but not 3.11. |
| sql, schema, ordering_info = self.compiler.compile_raw( | ||
| self.logical_plan(array_value.node) | ||
| self.logical_plan(renamed.node) | ||
| ) |
There was a problem hiding this comment.
maybe there is a better way, where the compiler just does not return invalid sql ever?
There was a problem hiding this comment.
I don't think that's possible. This is valid SQL before the renaming. It's just restricted as to where we can materialize.
| cached_replacement = ( | ||
| renamed.as_cached( | ||
| cache_table=self.bqclient.get_table(tmp_table), | ||
| ordering=ordering_info, | ||
| ) | ||
| .rename_columns(dict(zip(new_col_ids, prev_col_ids))) | ||
| .node | ||
| ) |
There was a problem hiding this comment.
I think we can apply the renaming as part of the scan definition of the cache node rather than as a selection node (via rename_columns) on top of it. Should be more robust that way
There was a problem hiding this comment.
I tried that in 5b0d0a0
The problem I was having is that the physical schema often contains more columns than the array value. If we rename there, then we lose track of which columns belong to which id.
There was a problem hiding this comment.
Added a "renames" argument to as_cached, which I does allow me to use the scan list to rename columns instead. Indeed that does seem more robust. I think some of these renames might have been pruned out in recent commits.
|
e2e failure might indicate a real issue |
|
Different set of e2e failures this time: Maybe it was just an LLM-related flake last time? |
|
I'll split to defer part from the pseudocolumn work into #1689 I think the defer step can be done without out that to make this PR a lot smaller. |
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes internal issue 405773140 🦕