pandas_corr(FULL_TABLE_ID STRING)

Question

pandas_corr(FULL_TABLE_ID STRING)

ThomasEllyatt opened this issue 10 months ago · 0 comments

ThomasEllyatt commented 10 months ago

Check the idea has not already been suggested

I could not find my idea in existing issues

Edit the title above with self-explanatory function name and argument names

The function name and the argument names I entered in the title above seems self explanatory to me.

BigFunction Description as it would appear in the documentation

This stored procedure, pandas_corr, is designed to replicate the functionality of the corr function found in the Pandas Python package, specifically tailored for BigQuery. It calculates correlation coefficients between all pairs of numeric fields in a specified table.

The procedure neatly organizes the output by row number and field name, mirroring the structured and intuitive output format provided by Pandas.

Examples of (arguments, expected output) as they would appear in the documentation

To execute this procedure, you need to specify the full table ID as a string variable. Below is an example demonstrating how to use the procedure with the 311_service_requests public table.

CALL us_dataset.pandas_corr('bigquery-public-data.austin_311.311_service_requests');

When run, the user will receive three job results.

Provides an array of the eligible fields from the provided table which will be used to build the correlation map.
Provide the query this procedure generated (just for reference, good to know how it works under the hood sometimes).
Provides the correlation map, similar to the PANDAS corr function.

I've included below the code I've pulled together to create this function.

CREATE OR REPLACE PROCEDURE `spreadsheep-20220603.Dashboard_Datasets.pandas_corr`(FULL_TABLE_ID STRING)
BEGIN

DECLARE COLUMN_NAMES ARRAY<STRING>;
DECLARE QUERY STRING DEFAULT '';
DECLARE OUTER_LOOP, INNER_LOOP INT64 DEFAULT 0;
DECLARE PROJECT_DATASET, TABLEID STRING;

SET PROJECT_DATASET = SPLIT(FULL_TABLE_ID, ".")[0] || "." || SPLIT(FULL_TABLE_ID, ".")[1];
SET TABLEID = SPLIT(FULL_TABLE_ID, ".")[2];

EXECUTE IMMEDIATE
(
"SELECT array_agg(column_name) as column_names FROM " || PROJECT_DATASET || ".INFORMATION_SCHEMA.COLUMNS WHERE table_name = '" || TABLEID || "' AND data_type IN ('INT64','FLOAT64','NUMERIC')"
) INTO COLUMN_NAMES
;

LOOP

    IF 
        OUTER_LOOP > ARRAY_LENGTH(COLUMN_NAMES) - 1 THEN LEAVE;
    ELSE
        SET QUERY = QUERY || "SELECT '" || OUTER_LOOP || "' as row_number, '" || COLUMN_NAMES[OUTER_LOOP] || "' as field, \n";

        LOOP

            IF
                INNER_LOOP > ARRAY_LENGTH(COLUMN_NAMES) - 1 THEN LEAVE;
            ELSE
                SET QUERY = QUERY || "CORR("|| COLUMN_NAMES[OUTER_LOOP] || ", " || COLUMN_NAMES[INNER_LOOP] || ") as " || COLUMN_NAMES[INNER_LOOP] || ", \n";
            END IF;

            SET INNER_LOOP = INNER_LOOP + 1;

        END LOOP;
        SET INNER_LOOP = 0;

        IF
            OUTER_LOOP < ARRAY_LENGTH(COLUMN_NAMES) - 1 THEN SET QUERY = QUERY || "FROM `" || PROJECT_DATASET || "." || TABLEID || "`\n\n UNION ALL \n\n";
        ELSE 
            SET QUERY = QUERY || "FROM `" || PROJECT_DATASET || "." || TABLEID || "`";
        END IF;

    END IF;
    SET OUTER_LOOP = OUTER_LOOP + 1;

END LOOP;

SELECT QUERY;
EXECUTE IMMEDIATE(QUERY || " ORDER BY row_number");

END;