Frictionless fails to describe the table with the correct field type when the data file is big
Overview
When a field contains both integers and floats, Frictionless should describe that field as a number type. This works well when the data file is small, but we run into a problem when the file is big. For example, we have a data file of about 2 GB in which one field can hold either 0 or a float. Most rows have the value 0 and only a few contain a float, and when Frictionless describes the table it reports this field as an integer type instead of a number type: it fails to see the float values in those rows. Can this bug be fixed? Thanks!
This is the output of describe for a small data file (test.tsv) and a big data file (TSTFI46007602.tsv) with the same fields. You can see that ref_score is identified as a number type in the small file but as an integer type in the big file:
(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe src/tests/data/test.tsv
# --------
# metadata: src/tests/data/test.tsv
# --------
name: test
type: table
path: src/tests/data/test.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: number
    - name: alt_score
      type: number
    - name: relative_binding_affinity
      type: number
    - name: effect_on_binding
      type: string
(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe src/tests/data/TSTFI46007602.tsv
# --------
# metadata: src/tests/data/TSTFI46007602.tsv
# --------
name: tstfi46007602
type: table
path: src/tests/data/TSTFI46007602.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: integer
    - name: alt_score
      type: integer
    - name: relative_binding_affinity
      type: integer
    - name: effect_on_binding
      type: string
Thx for the report.
Diving into the code, it looks like the sample that is analysed to "guess" the type of a column is hardcoded to 100 rows here.
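For reference, the same sample size appears to be configurable from the Python API through the Detector class. A minimal sketch, assuming the sample_size and field_confidence options mirror the CLI flags tried later in this thread (and note that, as the later comments show, raising the sample size did not turn out to resolve this case):

from frictionless import Detector, describe

# Ask the type detector to inspect more rows, and require full agreement
# across the sample, before settling on a field type.
detector = Detector(sample_size=10000, field_confidence=1)
resource = describe("src/tests/data/TSTFI46007602.tsv", detector=detector)
print(resource.schema)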
I can reproduce with a CSV file with one column: 100 rows of zeros followed by a decimal value.
Can you confirm that your data starts with at least 100 lines of zeros?
Unfortunately I can't think of a workaround right now... Can I ask what your use case is? Is it for validation?
Yes, the first several hundred rows are 0s. We use Frictionless to validate big TSV files. Right now what I do is skip the type errors if no schema is provided. Let me know if there is a better way. Thank you!
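For what it's worth, a rough sketch of that kind of workaround in Python; it assumes the skip_errors option and the "type-error" code behave as in recent frictionless-py releases:

from frictionless import validate

# Without a trusted schema, ignore cell type errors so that the
# mis-inferred integer type does not fail the whole file.
report = validate("src/tests/data/TSTFI46007602.tsv", skip_errors=["type-error"])
print(report.valid)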
Thx for your feedback. The only way I see is to take the output of describe, correct it, and pass it back as a schema - but of course your answer shows you already thought of that.
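A rough sketch of that approach, assuming frictionless-py's v5-style descriptor API (the field names come from the describe output above):

from frictionless import Resource, Schema, describe

# Infer a schema from the big file, then override the mis-detected fields.
descriptor = describe("src/tests/data/TSTFI46007602.tsv").schema.to_descriptor()
for field in descriptor["fields"]:
    if field["name"] in {"ref_score", "alt_score", "relative_binding_affinity"}:
        field["type"] = "number"
schema = Schema.from_descriptor(descriptor)

# Validate the same file against the corrected schema.
report = Resource("src/tests/data/TSTFI46007602.tsv", schema=schema).validate()
print(report.valid)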
Actually the hard-coded SAMPLE_SIZE does not seem to be the culprit.
The following CSV already fails despite having fewer than 100 rows:
a,b
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
1.2,3.4
I tried the following command, which fails as well:
frictionless describe --sample-size=11 --field-confidence=1 test.csv
So there is something wrong here; I need to investigate further.
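For anyone reproducing this from Python rather than the CLI, a sketch of what should be the equivalent call (assuming Detector mirrors the --sample-size and --field-confidence flags):

from frictionless import Detector, describe

# test.csv is the 11-row file above: ten rows of zeros, then "1.2,3.4".
detector = Detector(sample_size=11, field_confidence=1)
resource = describe("test.csv", detector=detector)
# Both fields should come out as "number"; with the reported bug they
# are detected as "integer".
print(resource.schema)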
Please keep me updated. Thank you so much!