Consistent data_type for various FILER_ID fields
Closed this issue · 4 comments
There are 22 CalAccess tables that include a FILER_ID
column. On 15 of these models, the field is typed as integer, and on the remaining seven, it is typed as char.
table_name data_type
-------------------------- -----------------
F501_502_CD character varying
EFS_FILING_LOG_CD character varying
CVR_SO_CD character varying
CVR_REGISTRATION_CD character varying
CVR_LOBBY_DISCLOSURE_CD character varying
CVR_F470_CD character varying
CVR_CAMPAIGN_DISCLOSURE_CD character varying
From the looks of it, we were adhering to the official documentation (e.g., F501_502_CD, CVR_SO_CD).
But do we have to? Writing queries for the processed-data app will be easier if we have consistent data types for columns in the JOIN
criteria. Also, the postgres / mysql syntax would be big hassle otherwise.
Actually, looks like only three of these fields can be cast as integer.
No errors for these three:
select "FILER_ID"::int, "FILER_ID", *
from "F501_502_CD";
select "FILER_ID"::int, "FILER_ID", *
from "CVR_SO_CD";
select "FILER_ID"::int, "FILER_ID", *
from "CVR_CAMPAIGN_DISCLOSURE_CD";
Whereas there are values with non-integer chars in these three:
-- [Error Code: 0, SQL State: 22P02] ERROR: invalid input syntax for integer: "L25358"
select "FILER_ID"::int, "FILER_ID", *
from "CVR_LOBBY_DISCLOSURE_CD";
-- [Error Code: 0, SQL State: 22P02] ERROR: invalid input syntax for integer: "C01851"
select "FILER_ID"::int, "FILER_ID", *
from "EFS_FILING_LOG_CD";
-- [Error Code: 0, SQL State: 22P02] ERROR: invalid input syntax for integer: "L22607"
select "FILER_ID"::int, "FILER_ID", *
from "CVR_REGISTRATION_CD";
Actually...hold up...those three seemingly safe queries above might not actually be.
The FILER_ID
column on the CVR_SO_CD
table is safe. There are no values with non-numeric characters:
SELECT "FILER_ID"
FROM "CVR_SO_CD"
WHERE "FILER_ID" ~ E'\D';
On the F501_502_CD
table, there 9,856 distinct FILER_ID
values:
SELECT COUNT(*)
FROM (
SELECT DISTINCT "FILER_ID"
FROM "F501_502_CD"
WHERE "FILER_ID" is not NULL and "FILER_ID" <> ''
) as foo;
And five of these include non-numeric characters:
SELECT "FILER_ID"
FROM "F501_502_CD"
WHERE "FILER_ID" ~ E'\D';
These:
FILER_ID
--------
CA
R98*4826
R98*3760
R98*1282
R98*4402
There are also 459 F501_502_CD
records with negative integer FILER_ID
values. Not sure what the hell is going on with that.
In CVR_CAMPAIGN_DISCLOSURE_CD
, there are 11,438 distinct FILER_ID
values:
SELECT COUNT(*)
FROM (
SELECT DISTINCT "FILER_ID"
FROM "CVR_CAMPAIGN_DISCLOSURE_CD"
WHERE "FILER_ID" is not NULL and "FILER_ID" <> ''
) as foo;
And six that have include non-numeric characters:
SELECT "FILER_ID"
FROM "CVR_CAMPAIGN_DISCLOSURE_CD"
WHERE "FILER_ID" ~ '\D';
These:
FILER_ID
--------
C24067
C24067
C00578
C01980
C28234
C27120
Again, the only column we're safe to change is CVR_SO_CD.FILER_ID
. Probably not worth the potential hassle of broken morning updates if in the future ever they ship values in this column with non-numeric chars.