Error with whitespace and other invalid characters in column names
iuiu34 opened this issue · 6 comments
When the schema is created, column names with spaces are written as they are. Therefore, uploading to bq generates the following error:
<BigQuery error in load operation: Invalid field name "utm_medium-partners". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.>
Would it be possible to substitute blank spaces and other invalid characters with '_', as the '--autodetect' option does?
For example:
'Column.example 1' is written as 'Column_example_1'
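In other words, something like this substitution (just to illustrate the requested behavior, not something the tool does today):
import re
print(re.sub(r'[^a-zA-Z0-9_]', '_', 'Column.example 1'))  # prints: Column_example_1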
Hi,
Can you describe the bq load command where you see the automatic normalization of field names? When I run bq load --autodetect ... with invalid field names, it prints an error message; I don't see it automatically normalizing the fields.
$ cat invalid_name.data.json
{ "Column.example 1": [1, 2] }
{ "example2": 3 }
$ bq load --source_format NEWLINE_DELIMITED_JSON --ignore_unknown_values --autodetect tmp.invalid invalid_name.data.json
Error in query string: Error processing job 'vital-future-582:bqjob_r6c4ce23bd29c0cf4_000001657d7b2300_1': Invalid field name "Column.example 1". Fields must
contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long. Table:
invalid_e9a4f223_84d6_4ca3_900d_7e17c34bf5e4_source
It's easy enough to make generate_schema.py
perform this normalization, if bq load
is able to make use of it.
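Roughly, the change inside generate_schema.py would look something like this (a sketch using a hypothetical sanitize_name helper that follows the rules in the error message above; the script does not do this today):
import re

# Characters BigQuery does not allow in field names.
INVALID_CHARS = re.compile(r'[^a-zA-Z0-9_]')

def sanitize_name(name):
    # Replace disallowed characters with '_'.
    name = INVALID_CHARS.sub('_', name)
    # Field names must start with a letter or underscore.
    if not re.match(r'[a-zA-Z_]', name):
        name = '_' + name
    # And be at most 128 characters long.
    return name[:128]
But that only helps if bq load can then match the original (unsanitized) keys in the data against the sanitized schema.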
I also tried to upload using the web interface in the BigQuery Console, with the "auto detect" checkbox selected, and I get the same error:
Invalid field name "Column.example 1". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long. Table: invalid_11c968ff_2e40_4c8a_a8ec_eca2fd442215_source
You're right. I mixed it up with CSV.
I'll request this feature on the Google Cloud Platform side: that ingestion automatically removes invalid characters in column names.
I suspect that they will just tell you to pre-process your JSON file to modify your invalid column names. The code to do that is fairly straightforward. For example, save the following as convert.py
, and run it like this:
$ ./convert.py < raw.data.json > converted.data.json
convert.py:
#!/usr/bin/env python3

from collections import OrderedDict
import json
import logging
import re
import sys

# Matches a leading character that is not a letter, or any other non-word
# character; each match is replaced with '_'.
FIELD_MATCHER = re.compile(r'(^[^a-zA-Z])|(\W)')

line_number = 0


def normalize_file(file):
    global line_number
    for line in file:
        line_number += 1
        json_object = json.loads(line)
        if not isinstance(json_object, dict):
            logging.error('%s: Top level record must be a dict but was a %s',
                          line_number, type(json_object))
            continue
        try:
            normalized_dict = normalize_dict(json_object)
        except Exception as e:
            logging.error('%s: %s', line_number, e)
            continue
        json.dump(normalized_dict, sys.stdout)
        print()
    logging.info("Processed %s lines", line_number)


def normalize_dict(json_dict):
    # Rewrite each key, recursing into nested dicts.
    normalized_dict = OrderedDict()
    for key, value in json_dict.items():
        key = FIELD_MATCHER.sub('_', key)
        if isinstance(value, dict):
            value = normalize_dict(value)
        normalized_dict[key] = value
    return normalized_dict


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    normalize_file(sys.stdin)
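A quick sanity check on the record from the bq load example above (assuming convert.py has been made executable; the "Processed ... lines" message from logging goes to stderr):
$ echo '{ "Column.example 1": [1, 2] }' | ./convert.py
{"Column_example_1": [1, 2]}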
Hi Iuiu,
Can I close this issue, or do you have additional questions/comments?
Closing due to no activity for 4 months. Let me know if you have further questions.