Cannot round-trip a file (read, write, read) in some circumstances
Refer to this discussion on the Julialang Discourse:
Can you file an issue against CSV.jl on GitHub? There’s probably a bug when the cut point to attribute parts of the file to tasks is in a particular position.
The error described there is:
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
ERROR: LoadError: TaskFailedException
nested task error: CSV.Error("thread = 2 fatal error, encountered an invalidly quoted field while parsing around row = 175539, col = 3: \"\"I will undertake a research trip hosted by Michele Bryd-McPhee curator of ‘Ladies of Hip-Hop Festival’ in New York City in March and July 2018 with 3 fundamental areas of enquiry; \n\", error=INVALID: OK | QUOTED | EOF | INVALID_QUOTED_FIELD , check your `quotechar` arguments or manually fix the field in the file itself")
Stacktrace:
[1] fatalerror(buf::Vector{UInt8}, pos::Int64, len::Int64, code::Int16, row::Int64, col::Int64)
@ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:590
[2] parsevalue!(::Type{String}, buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
@ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:798
[3] parserow
@ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:640 [inlined]
[4] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, ::Type{Tuple{}})
@ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:550
[5] multithreadparse(ctx::CSV.Context, pertaskcolumns::Vector{Vector{CSV.Column}}, rowchunkguess::Int64, i::Int64, rows::Vector{Int64}, wholecolumnslock::ReentrantLock)
@ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:360
[6] (::CSV.var"#34#39"{CSV.Context, Vector{Vector{CSV.Column}}, Int64, Int64, Vector{Int64}, ReentrantLock})()
@ CSV C:\Users\TGebbels\.julia\packages\WorkerUtilities\ey0fP\src\WorkerUtilities.jl:384
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base .\task.jl:448
[2] macro expansion
@ .\task.jl:480 [inlined]
[3] CSV.File(ctx::CSV.Context, chunking::Bool)
@ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:240
[4] File
@ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:227 [inlined]
[5] #File#32
@ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:223 [inlined]
[6] CSV.File(source::String)
@ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:162
[7] read(source::String, sink::Type; copycols::Bool, kwargs::@Kwargs{})
@ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\CSV.jl:117
[8] read(source::String, sink::Type)
@ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\CSV.jl:113
[9] top-level scope
@ c:\Users\TGebbels...\Documents\DCMS Database\CompareCsv.jl:361
[10] include(fname::String)
@ Base.MainInclude .\client.jl:489
[11] run(debug_session::VSCodeDebugger.DebugAdapter.DebugSession, error_handler::VSCodeDebugger.var"#3#4"{String})
@ VSCodeDebugger.DebugAdapter c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\packages\DebugAdapter\src\packagedef.jl:126
[12] startdebugger()
@ VSCodeDebugger c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\packages\VSCodeDebugger\src\VSCodeDebugger.jl:45
[13] top-level scope
@ c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\debugger\run_debugger.jl:12
[14] include(mod::Module, _path::String)
@ Base .\Base.jl:495
[15] exec_options(opts::Base.JLOptions)
@ Base .\client.jl:318
[16] _start()
@ Base .\client.jl:552
in expression starting at c:\Users\TGebbels\...\Documents\DCMS Database\CompareCsv.jl:361
@quinnj What's interesting is that the error doesn't happen when passing ntasks=1 to CSV.read.
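As a minimal sketch of that workaround, the snippet below round-trips a tiny stand-in table whose Description column contains an embedded newline, reading it back with ntasks=1. The table contents and the temporary path are made up for illustration. A file this small would not be split across tasks anyway, so this only shows the call shape and does not reproduce the failure.

```julia
using CSV, DataFrames

# Stand-in data (hypothetical); the second Description has an embedded newline,
# like the quoted multi-line field in the failing file.
df = DataFrame(Id = [1, 2],
               Description = ["single line", "first line\nsecond line"])

path = tempname() * ".csv"
CSV.write(path, df)   # fields containing newlines are written quoted

# ntasks=1 parses the whole file in a single task, so no chunk boundary
# can land inside a quoted multi-line field.
df2 = CSV.read(path, DataFrame; ntasks=1)

# Round-trip check: the values read back equal the values written.
@assert df2.Id == df.Id
@assert df2.Description == df.Description
```

On the real file, the same keyword would be passed as CSV.read(path, DataFrame; ntasks=1), at the cost of single-task parsing speed.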
Possibly, but hard to tell without having seen the files and/or identified the root cause.
The files are public, from the UK Department for Culture, Media and Sport, here, or via an HTTP.get call to https://nationallottery.dcms.gov.uk/api/v1/grants/csv-export/. They are typically just over 300 MB, but growing. Updates are relatively frequent as new grant records are added.
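For anyone trying to reproduce this, a rough sketch of fetching the export with HTTP.get follows. The helper name download_grants and the local file name are placeholders, not anything from the original report, and since the export changes over time a fresh download may not hit the same failure.

```julia
using HTTP

# URL taken from the comment above.
const GRANTS_URL = "https://nationallottery.dcms.gov.uk/api/v1/grants/csv-export/"

# Hypothetical helper: fetch the export and save the raw bytes unchanged,
# so the file on disk is exactly what the server sent.
function download_grants(path::AbstractString; url::AbstractString = GRANTS_URL)
    resp = HTTP.get(url)      # throws on a non-2xx status by default
    write(path, resp.body)    # resp.body is a Vector{UInt8}
    return path
end

# Usage (downloads roughly 300 MB):
# download_grants("grants.csv")
```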
At least one field, Description, is a quoted text field that sometimes contains newlines and can be quite lengthy. Only a small proportion of the 700,000 records contain newlines, though, unlike the file in #1139. This may be why the problem is intermittent and depends on sort order.
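One way to check that proportion is to count the rows whose Description contains a line feed. The sketch below uses a made-up four-row table as a stand-in. On the real data, df would come from a read that succeeds, for example one with ntasks=1.

```julia
using DataFrames

# Stand-in rows (hypothetical); in practice df would come from CSV.read.
df = DataFrame(Description = ["one line", "two\nlines", missing, "also one line"])

# True when a non-missing description contains a line feed.
has_newline(d) = !ismissing(d) && occursin('\n', d)

n_multiline = count(has_newline, df.Description)
share = n_multiline / nrow(df)

println("records with embedded newlines: ", n_multiline, " of ", nrow(df))
```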