`resource.infer(stats=True)` changes value of `dialect.delimiter` depending on resource data
fjuniorr opened this issue · 4 comments
I did not find this information in the documentation, but by extrapolating this behavior:
Also, won't resource.infer() reset the whole schema of the resource?
No, it will not change existing properties except for the recalculation of resource.stats.
I assumed that resource.infer(stats=True)
would not change dialect.delimiter
(even if it was not explicitly set).
However, this is not the case, and data like:
pkey,year,code,desc
2008|1001,2008,1001,PROGRAMA LARES HABITACAO POPULAR
2008|1003,2008,1003,DIVULGACAO DE MINAS GERAIS COMO ESTADO DESCOMPLICADO
is causing resource.infer(stats=True)
to set dialect.delimiter
to |
.
There is a reproducible example here.
Hi, thanks for the report and reproducible example.
In your example, I get indeed dialect.delimiter
to |
for the first test (with data.csv
), but not with the sample or with explicitely specified delimiter (I guess that was intended).
I am not very familiar with the frictionless codebase, so sorry if I'm off base, I am guessing a bit. My understanding is that when it is not explicitly set, frictionless makes a guess about the delimiter being |
(not the best, I agree), and infer
requires to do so, as without delimiter there are e.g. no possible stats about the number of fields. I observe the same delimiter if I resource.read_rows()
instead of infer
(with a TableResource
)
So do I understand correctly that the issue is not that it changes any behaviour, but that it explicitely sets the delimiter in the resource? If yes, why is it an issue?
I did not notice that resource.read_rows()
was also inferring the delimiter of data.csv
to be |
.
In general1 I would expect frictionless-py
to use ,
as delimiter for operations such as read_rows
because ,
is the default value per the data package standard. The wording in CSV Dialect is even more explicit:
delimiter
- specifies the character sequence which should separate fields (aka columns). Default =,
. Example\t
. If not present, consumers should assume that it’s,
.
For my particular use case the inference behaviour is problematic because I have several descriptors in which I do not explicitly set delimiter
when the proper value is ,
. Mostly it works just fine except when it breaks.
Footnotes
-
The exception would be in functionality that is explicitly created for inference purposes, such as
describe
and maybeinfer
(which was my original question). ↩
Ok thanks, I understand what bothers you, and I agree that using ,
as default instead of guessing the delimiter would probably a less error-prone behaviour, and more consistent with the documentation.
I also understand why you specifically asked about infer
. Any change on this (infer
or read_rows
) would however be a breaking change.
In any case I'll try to find the time to understand the exact behaviour of the delimiter detection.
Knowing that this is also the behaviour in read_rows
I agree that changing it is not a good idea.
Maybe in the future making the delimiter detection more robust (which I'm not sure is high priority TBH).