bxparks/bigquery-schema-generator

error with white spaces & other wrong characters in column names

iuiu34 opened this issue · 6 comments

When the schema is created, column names with spaces are written as they are.
Therefore, uploading to BigQuery generates the following error:
<BigQuery error in load operation: Invalid field name "utm_medium-partners". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.>

Would it be possible to substitute blank spaces and other invalid characters with '_', as the '--autodetect' option does?
For example:
'Column.example 1' would be written as 'Column_example_1'.
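
Something like this substitution, I imagine (just a sketch to illustrate what I mean; the rule is an assumption on my part, not necessarily what --autodetect actually does):

import re

# Replace every character that is not a letter, digit, or underscore with '_'
# (hypothetical rule, for illustration only).
def sanitize(name):
    return re.sub(r'[^0-9a-zA-Z_]', '_', name)

print(sanitize('Column.example 1'))  # prints: Column_example_1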

Hi,
Can you describe the bq load command where you see the automatic normalization of field names? Because when I run bq load --autodetect ... with invalid field names, it prints an error message. I don't see it automatically normalizing the field.

$ cat invalid_name.data.json
{ "Column.example 1": [1, 2] }
{ "example2": 3 }

$  bq load --source_format NEWLINE_DELIMITED_JSON --ignore_unknown_values --autodetect tmp.invalid invalid_name.data.json

Error in query string: Error processing job 'vital-future-582:bqjob_r6c4ce23bd29c0cf4_000001657d7b2300_1': Invalid field name "Column.example 1". Fields must
contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long. Table:
invalid_e9a4f223_84d6_4ca3_900d_7e17c34bf5e4_source

It's easy enough to make generate_schema.py perform this normalization, if bq load is able to make use of it.
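
For what it's worth, if bq load were able to accept it, the normalization could also be done as a post-processing pass over the schema file that generate_schema.py emits. A rough sketch (sanitize_schema is my own name, not part of the tool):

#!/usr/bin/env python3

import json
import re
import sys

def sanitize_schema(fields):
    """Rewrite invalid characters in every field name of a BigQuery schema
    (a JSON array of {"name", "type", "mode", ...} entries) to underscores,
    recursing into RECORD fields."""
    for field in fields:
        field['name'] = re.sub(r'[^0-9a-zA-Z_]', '_', field['name'])
        if field.get('type') == 'RECORD':
            sanitize_schema(field.get('fields', []))
    return fields

if __name__ == '__main__':
    json.dump(sanitize_schema(json.load(sys.stdin)), sys.stdout, indent=2)
    print()

But that only fixes the schema; the data file would still carry the original keys, so the values would not land in the renamed columns. That's why the data itself would have to be pre-processed, as described further down.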

I also tried to upload using the web interface in the BigQuery Console, with the "auto detect" checkbox selected, and I get the same error:

Invalid field name "Column.example 1". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long. Table: invalid_11c968ff_2e40_4c8a_a8ec_eca2fd442215_source

You're right. My mistake, I was thinking of CSV.
I'll ask Google Cloud Platform for the feature: that ingestion automatically removes invalid characters from column names.

I suspect that they will just tell you to pre-process your JSON file to modify your invalid column names. The code to do that is fairly straightforward. For example, save the following as convert.py, and run it like this:

$ ./convert.py < raw.data.json > converted.data.json

convert.py:

#!/usr/bin/env python3

from collections import OrderedDict
import json
import logging
import re
import sys

# Matches a non-letter at the start of a name, or any character that is not a
# letter, digit, or underscore; every match is replaced with '_'.
FIELD_MATCHER = re.compile(r'(^[^a-zA-Z])|(\W)')

line_number = 0


def normalize_file(file):
    """Read newline-delimited JSON records from 'file', normalize the keys of
    each record, and write the normalized records to stdout, one per line."""
    global line_number
    for line in file:
        line_number += 1
        json_object = json.loads(line)

        if not isinstance(json_object, dict):
            logging.error('%s: Top level record must be a dict but was a %s',
                         line_number, type(json_object))
            continue

        try:
            normalized_dict = normalize_dict(json_object)
        except Exception as e:
            logging.error('%s: %s', line_number, e)
            continue
        json.dump(normalized_dict, sys.stdout)
        print()
    logging.info("Processed %s lines", line_number)


def normalize_dict(json_dict):
    """Return a copy of 'json_dict' with invalid characters in every key
    replaced by '_', recursing into nested dicts."""
    normalized_dict = OrderedDict()
    for key, value in json_dict.items():
        key = FIELD_MATCHER.sub('_', key)
        if isinstance(value, dict):
            value = normalize_dict(value)
        normalized_dict[key] = value
    return normalized_dict


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    normalize_file(sys.stdin)
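
Against the sample file from the bq load test above, the output should look roughly like this (illustrative, not a captured run; the INFO line goes to stderr):

$ ./convert.py < invalid_name.data.json
{"Column_example_1": [1, 2]}
{"example2": 3}
INFO:root:Processed 2 lines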

Hi Iuiu,
Can I close this issue, or do you have additional questions/comments?

Closing after 4 months with no activity. Let me know if you have further questions.