Support wildcard tables (and filter on `_TABLE_SUFFIX`) in `read_gbq` / `read_gbq_table`
tswast opened this issue · 2 comments
Is your feature request related to a problem? Please describe.
In https://cloud.google.com/bigquery/docs/create-machine-learning-model there is the following SQL:
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
I'd like to be able to represent all of this in Python without any SQL code. There are two problems right now:
-
bigquery-public-data.google_analytics_sample.ga_sessions_*
isn't supported as a table ID inread_gbq
. This doesn't refer to any single table, so API requests based on the table ID will fail. -
Even if (1) were supported, it would try to copy all the data into a temp table. It would be best to be able to specify a filter on
_TABLE_SUFFIX
at data read time.
Describe the solution you'd like
bpd.read_gbq("bigquery-public-data.google_analytics_sample.ga_sessions_*")
should work.
Also, somewhat inspired by the BigQuery Storage API, accept a row_restriction
parameter to filter rows.
bpd.read_gbq(
"bigquery-public-data.google_analytics_sample.ga_sessions_*",
filters=[("_TABLE_SUFFIX", ">=", '20160801'), ("_TABLE_SUFFIX", "<=", '20170630')],
)
See also: "filters" parameter in pandas.read_parquet
Describe alternatives you've considered
SQL as an input to read_gbq
works as an alternative right now.
Additional context
Related feature request on pandas-gbq:
- googleapis/python-bigquery-pandas#694 Add
filters
parameter toread_gbq
@Genesis929 This is affecting more BQML tutorials. I think we should try and find a solution for this sooner rather than later. I think it may require translation to SQL similar to your filters
implementation.