palewire/django-calaccess-raw-data

Consistent data_type for various FILER_ID fields

Closed this issue · 4 comments

There are 22 CalAccess tables that include a FILER_ID column. On 15 of these models, the field is typed as integer, and on the remaining seven, it is typed as char.

table_name                  data_type          
--------------------------  -----------------  
F501_502_CD                 character varying  
EFS_FILING_LOG_CD           character varying  
CVR_SO_CD                   character varying  
CVR_REGISTRATION_CD         character varying  
CVR_LOBBY_DISCLOSURE_CD     character varying  
CVR_F470_CD                 character varying  
CVR_CAMPAIGN_DISCLOSURE_CD  character varying  

From the looks of it, we were adhering to the official documentation (e.g., F501_502_CD, CVR_SO_CD).

But do we have to? Writing queries for the processed-data app will be easier if we have consistent data types for columns in the JOIN criteria. Also, the postgres / mysql syntax would be big hassle otherwise.

Actually, looks like only three of these fields can be cast as integer.

No errors for these three:

select "FILER_ID"::int, "FILER_ID", *
from "F501_502_CD";

select "FILER_ID"::int, "FILER_ID", *
from "CVR_SO_CD";

select "FILER_ID"::int, "FILER_ID", *
from "CVR_CAMPAIGN_DISCLOSURE_CD";

Whereas there are values with non-integer chars in these three:

-- [Error Code: 0, SQL State: 22P02]  ERROR: invalid input syntax for integer: "L25358"
select "FILER_ID"::int, "FILER_ID", *
from "CVR_LOBBY_DISCLOSURE_CD";

-- [Error Code: 0, SQL State: 22P02]  ERROR: invalid input syntax for integer: "C01851"
select "FILER_ID"::int, "FILER_ID", *
from "EFS_FILING_LOG_CD";

-- [Error Code: 0, SQL State: 22P02]  ERROR: invalid input syntax for integer: "L22607"
select "FILER_ID"::int, "FILER_ID", *
from "CVR_REGISTRATION_CD";

Actually...hold up...those three seemingly safe queries above might not actually be.

The FILER_ID column on the CVR_SO_CD table is safe. There are no values with non-numeric characters:

SELECT "FILER_ID"
FROM "CVR_SO_CD"
WHERE "FILER_ID" ~ E'\D';

On the F501_502_CD table, there 9,856 distinct FILER_ID values:

SELECT COUNT(*)
FROM (
        SELECT DISTINCT "FILER_ID" 
        FROM "F501_502_CD"
        WHERE "FILER_ID" is not NULL and "FILER_ID" <> ''
) as foo;

And five of these include non-numeric characters:

SELECT "FILER_ID"
FROM "F501_502_CD"
WHERE "FILER_ID" ~ E'\D';

These:

FILER_ID
--------  
CA        
R98*4826  
R98*3760  
R98*1282  
R98*4402  

There are also 459 F501_502_CD records with negative integer FILER_ID values. Not sure what the hell is going on with that.

In CVR_CAMPAIGN_DISCLOSURE_CD, there are 11,438 distinct FILER_ID values:

SELECT COUNT(*)
FROM (
        SELECT DISTINCT "FILER_ID" 
        FROM "CVR_CAMPAIGN_DISCLOSURE_CD"
        WHERE "FILER_ID" is not NULL and "FILER_ID" <> ''
) as foo;

And six that have include non-numeric characters:

SELECT "FILER_ID"
FROM "CVR_CAMPAIGN_DISCLOSURE_CD"
WHERE "FILER_ID" ~ '\D';

These:

FILER_ID  
--------  
C24067    
C24067    
C00578    
C01980    
C28234    
C27120

Again, the only column we're safe to change is CVR_SO_CD.FILER_ID. Probably not worth the potential hassle of broken morning updates if in the future ever they ship values in this column with non-numeric chars.