frictionlessdata/frictionless-py

Frictionless fails to describe the table with the correct field type when the data file is big

Opened this issue · 4 comments

Overview

When a field contains both integers and floats, Frictionless describes that field as a number type. This works well when the data file is small, but we run into a problem when the data file is big. For example, we have a large data file of about 2GB in which one field can hold either 0 or a float. Most rows have a value of 0 and only a few have a float value. When Frictionless describes the table, it reports this field as an integer type instead of a number type: it fails to see the float values further down the file. Can this bug be fixed? Thanks!

This is the output of describing a small data file (test.tsv) and a big data file (TSTFI46007602.tsv) with the same fields. You can see that ref_score is identified as a number type in the small file but as an integer type in the big file:

(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe  src/tests/data/test.tsv
# --------
# metadata: src/tests/data/test.tsv
# --------

name: test
type: table
path: src/tests/data/test.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: number
    - name: alt_score
      type: number
    - name: relative_binding_affinity
      type: number
    - name: effect_on_binding
      type: string

(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe  src/tests/data/TSTFI46007602.tsv
# --------
# metadata: src/tests/data/TSTFI46007602.tsv
# --------

name: tstfi46007602
type: table
path: src/tests/data/TSTFI46007602.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: integer
    - name: alt_score
      type: integer
    - name: relative_binding_affinity
      type: integer
    - name: effect_on_binding
      type: string

Thx for the report.

Diving into the code, it looks like the sample that is analysed to "guess" the type of a column is hardcoded to 100 rows here.
I can reproduce it with a CSV file with 1 column: 100 rows of zeros followed by a decimal value.
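
For reference, a minimal reproduction sketch using the Python API (the file name and values are made up for illustration):

from frictionless import describe

# Build a CSV whose first 100 data rows are "0" and whose 101st row is
# a decimal value, i.e. the decimal falls outside the default sample.
rows = ["value"] + ["0"] * 100 + ["1.5"]
with open("zeros_then_float.csv", "w") as f:
    f.write("\n".join(rows) + "\n")

resource = describe("zeros_then_float.csv")
# Comes out as "integer" even though the file contains a float further down.
print(resource.schema.get_field("value").type)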

Can you confirm that your data starts with at least 100 lines of zeros?

Unfortunately I can't think of a workaround right now... Can I ask what your use case is? Is it for validation?

Yes, the first several hundred rows are 0s. We use Frictionless to validate big TSV files. Right now what I do is skip the type error if no schema is provided. Let me know if there is a better way. Thank you!
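
In case it helps, here is one way such a workaround could look with the Python API: pin the problematic columns with a schema patch so the detector does not guess them. This is only a sketch; the field names are taken from the describe output above, and I'm assuming the Detector schema_patch option fits this case:

from frictionless import Detector, validate

# Pin the score columns to "number" instead of letting them be guessed
# from the first rows (field names taken from the describe output above).
detector = Detector(
    schema_patch={
        "fields": {
            "ref_score": {"type": "number"},
            "alt_score": {"type": "number"},
            "relative_binding_affinity": {"type": "number"},
        }
    }
)
report = validate("src/tests/data/TSTFI46007602.tsv", detector=detector)
print(report.valid)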

Thx for your feedback. The only way I see is to correct the output of describe inside a schema - but of course your answer shows you already thought of that.
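
A rough sketch of that "describe, then correct the schema" flow, in case it is useful to anyone reading along (written against a recent v5-style Python API, which is an assumption; file names are illustrative):

from frictionless import Schema, describe, validate

# Infer a starting schema, fix the mis-detected field in the descriptor,
# then reuse the corrected schema for validation.
resource = describe("src/tests/data/TSTFI46007602.tsv")
descriptor = resource.schema.to_descriptor()
for field in descriptor["fields"]:
    if field["name"] == "ref_score":
        field["type"] = "number"

schema = Schema.from_descriptor(descriptor)
report = validate("src/tests/data/TSTFI46007602.tsv", schema=schema)
print(report.valid)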

Actually, the hard-coded SAMPLE_SIZE does not seem to be the culprit.

The following CSV already fails despite having fewer than 100 rows:

a,b
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
1.2,3.4

I tried the following command, which fails as well:

frictionless describe --sample-size=11 --field-confidence=1 test.csv

So there is something wrong here; I need to investigate further.
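
For completeness, the same experiment through the Python API (a sketch; the Detector options are meant to mirror the CLI flags above):

from frictionless import Detector, describe

# Mirror `--sample-size=11 --field-confidence=1`: the whole 11-row file fits
# in the sample, and a single decimal value should rule out "integer".
detector = Detector(sample_size=11, field_confidence=1)
resource = describe("test.csv", detector=detector)
print({field.name: field.type for field in resource.schema.fields})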

Please keep me updated. Thank you so much!