pandas_corr(FULL_TABLE_ID STRING)
ThomasEllyatt opened this issue · 0 comments
Check the idea has not already been suggested
- I could not find my idea in existing issues
Edit the title above with self-explanatory function name and argument names
- The function name and the argument names I entered in the title above seems self explanatory to me.
BigFunction Description as it would appear in the documentation
This stored procedure, pandas_corr, is designed to replicate the functionality of the corr function found in the Pandas Python package, specifically tailored for BigQuery. It calculates correlation coefficients between all pairs of numeric fields in a specified table.
The procedure neatly organizes the output by row number and field name, mirroring the structured and intuitive output format provided by Pandas.
Examples of (arguments, expected output) as they would appear in the documentation
To execute this procedure, you need to specify the full table ID as a string variable. Below is an example demonstrating how to use the procedure with the 311_service_requests public table.
CALL us_dataset.pandas_corr('bigquery-public-data.austin_311.311_service_requests');
When run, the user will receive three job results.
- Provides an array of the eligible fields from the provided table which will be used to build the correlation map.
- Provide the query this procedure generated (just for reference, good to know how it works under the hood sometimes).
- Provides the correlation map, similar to the PANDAS corr function.
I've included below the code I've pulled together to create this function.
CREATE OR REPLACE PROCEDURE `spreadsheep-20220603.Dashboard_Datasets.pandas_corr`(FULL_TABLE_ID STRING)
BEGIN
DECLARE COLUMN_NAMES ARRAY<STRING>;
DECLARE QUERY STRING DEFAULT '';
DECLARE OUTER_LOOP, INNER_LOOP INT64 DEFAULT 0;
DECLARE PROJECT_DATASET, TABLEID STRING;
SET PROJECT_DATASET = SPLIT(FULL_TABLE_ID, ".")[0] || "." || SPLIT(FULL_TABLE_ID, ".")[1];
SET TABLEID = SPLIT(FULL_TABLE_ID, ".")[2];
EXECUTE IMMEDIATE
(
"SELECT array_agg(column_name) as column_names FROM " || PROJECT_DATASET || ".INFORMATION_SCHEMA.COLUMNS WHERE table_name = '" || TABLEID || "' AND data_type IN ('INT64','FLOAT64','NUMERIC')"
) INTO COLUMN_NAMES
;
LOOP
IF
OUTER_LOOP > ARRAY_LENGTH(COLUMN_NAMES) - 1 THEN LEAVE;
ELSE
SET QUERY = QUERY || "SELECT '" || OUTER_LOOP || "' as row_number, '" || COLUMN_NAMES[OUTER_LOOP] || "' as field, \n";
LOOP
IF
INNER_LOOP > ARRAY_LENGTH(COLUMN_NAMES) - 1 THEN LEAVE;
ELSE
SET QUERY = QUERY || "CORR("|| COLUMN_NAMES[OUTER_LOOP] || ", " || COLUMN_NAMES[INNER_LOOP] || ") as " || COLUMN_NAMES[INNER_LOOP] || ", \n";
END IF;
SET INNER_LOOP = INNER_LOOP + 1;
END LOOP;
SET INNER_LOOP = 0;
IF
OUTER_LOOP < ARRAY_LENGTH(COLUMN_NAMES) - 1 THEN SET QUERY = QUERY || "FROM `" || PROJECT_DATASET || "." || TABLEID || "`\n\n UNION ALL \n\n";
ELSE
SET QUERY = QUERY || "FROM `" || PROJECT_DATASET || "." || TABLEID || "`";
END IF;
END IF;
SET OUTER_LOOP = OUTER_LOOP + 1;
END LOOP;
SELECT QUERY;
EXECUTE IMMEDIATE(QUERY || " ORDER BY row_number");
END;