RFC 4180 compliant, composable CSV parsing and encoding for Elixir.
Add {:csv, "~> 3.0"} to your deps in mix.exs like so:
defp deps do
  [{:csv, "~> 3.0"}]
end
CSV is a notoriously fickle format, with many implementations and files interpreting it differently.
For that reason, CSV implements a normal mode, CSV.decode, that returns a stream of ok: ["field1", "field2"] and error: "Message" tuples. It also reparses lines after a previous line has opened an unterminated escape sequence, ensuring you get all correctly formatted rows.
The goal of this library is to let you extract all correctly formatted rows while reporting descriptive errors for incorrectly formatted rows.
In strict mode, using CSV.decode!, the library raises an exception when it encounters the first error, aborting the operation.
This library uses fast binary matching and is able to parse about half a million rows of a moderately complex CSV file per second in a single process on a small cloud instance (2 vCPUs, 2 GB memory). CSV parsing is unlikely to become a bottleneck in your data pipeline.
If you are reading from a large file, CSV will perform best when streaming with :read_ahead in byte mode:
File.stream!("data.csv", [read_ahead: 100_000], 1000) |> CSV.decode()
While 1000 is usually a good default number of bytes to stream, you should measure performance and fine-tune the byte size according to your environment.
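One way to do that measurement is sketched below. ChunkBench is a hypothetical helper, not part of the library; it times one full pass over a file for each candidate chunk size using only the standard library. The argument order of File.stream! follows the example above; recent Elixir releases emit a deprecation warning for it and prefer the byte count before the modes.

```elixir
defmodule ChunkBench do
  # Hypothetical helper for illustration; not part of the CSV library.
  #
  # Times one full pass over `path` for every chunk size in `sizes` and
  # returns a list of {bytes, microseconds} tuples. `consume` receives the
  # file stream and must run it to completion; by default it only reads
  # the file without decoding it.
  def run(path, sizes, consume \\ &Stream.run/1) do
    for bytes <- sizes do
      stream = File.stream!(path, [read_ahead: 100_000], bytes)
      {micros, _result} = :timer.tc(fn -> consume.(stream) end)
      {bytes, micros}
    end
  end
end
```

To include decoding in the measurement, pass a function such as fn stream -> stream |> CSV.decode() |> Stream.run() end as the third argument. Timings vary between runs, so repeat the measurement a few times before settling on a value.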
The main goal for 3.x has been to streamline the library API and leverage binary matching.
- Parallelism has been removed, alongside its options :num_workers and :worker_work_ratio. You can safely remove them.
- CSV now expects line breaks to be present in the data. If you used to parse strings by applying String.split/2 before passing them to decode, you can now pass the whole string as a single item of a list instead:
  ["a,b,c\nd,e,f"] |> CSV.decode()
- StrayQuoteError is now StrayEscapeCharacterError. If you catch this error in your code, you need to rename it.
- The :strip_fields option needs to be replaced with the :field_transform option:
  File.stream!("data.csv") |> CSV.decode(field_transform: &String.trim/1)
- :validate_row_length now defaults to false. This option produces an error for rows of differing length. Set it to true to get the same behaviour as in 2.x.
- :escape_formulas is now :unescape_formulas for decode and decode!. It is still :escape_formulas for encode. Change :escape_formulas to :unescape_formulas in decode calls to get the same behaviour as in 2.x.
- :escape_max_lines now defaults to 10 instead of 1000. To get the same behaviour as in 2.x, use:
  File.stream!("data.csv") |> CSV.decode(escape_max_lines: 1000)
- :replace has been removed. CSV will now return fields with incorrect encoding as-is. You can use the new :field_transform option to provide a function that transforms fields while they are being parsed. This allows you to, for example, replace incorrect encoding:
  defp replace_bad_encoding(field) do
    if String.valid?(field) do
      field
    else
      field
      |> String.codepoints()
      |> Enum.map(fn codepoint -> if String.valid?(codepoint), do: codepoint, else: "?" end)
      |> Enum.join()
    end
  end

  File.stream!("data.csv") |> CSV.decode(field_transform: &replace_bad_encoding/1)
That's it! Please open an issue if you see any other non-backward compatible behaviour so it can be documented.
- Elixir 1.5.0 is required for all versions above 2.5.0.
- Elixir 1.1.0 is required for all versions above 1.1.5.
This library aims to solve concerns related to CSV parsing in data pipelines, following the UNIX philosophy: it consumes streams or enumerables and produces streams of lists, maps or tuples, depending on configuration. This simplifies using it in data pipelines, where CSV encoding or decoding is only one of the processing steps.
CSV can decode and encode from and to a stream of bytes or lines.
Do this to decode data:
# Decode file line by line
File.stream!("data.csv")
|> CSV.decode()
# Decode a UTF-16 file with BOM
File.stream!("data.csv", [:trim_bom, encoding: {:utf16, :little}])
|> CSV.decode()
# Decode file in chunks of 1000 bytes
File.stream!("data.csv", [], 1000)
|> CSV.decode()
# Decode a csv formatted string
["long,csv,string\nwith,multiple,lines"]
|> CSV.decode()
# Decode a list of arbitrarily chunked csv data
["list,", "of,arbitrarily", "\nchun", "ked,csv,data\n"]
|> CSV.decode()
And you'll get a stream of row tuples:
[ok: ["a", "b"], ok: ["c", "d"]]
And, potentially, error tuples:
[error: "", ok: ["c", "d"]]
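A common next step is to separate the parsed rows from the error messages. The sketch below uses a hand-written list as a stand-in for the output of CSV.decode, so it runs without the library; DecodeResults is a hypothetical helper name, not part of the API.

```elixir
defmodule DecodeResults do
  # Hypothetical helper for illustration; not part of the CSV library.
  #
  # Takes an enumerable of {:ok, row} and {:error, message} tuples and
  # returns {rows, messages}, each in the original order.
  def split(results) do
    {oks, errors} = Enum.split_with(results, &match?({:ok, _}, &1))
    {Enum.map(oks, &elem(&1, 1)), Enum.map(errors, &elem(&1, 1))}
  end
end

# Stand-in for the result of `stream |> CSV.decode()`.
results = [ok: ["a", "b"], error: "example error message", ok: ["c", "d"]]

{rows, messages} = DecodeResults.split(results)
```

Note that Enum.split_with/2 reads the whole stream into memory. For large files, handling each tuple lazily with Stream functions is likely the better fit.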
Use strict mode decode! to get a two-dimensional list, raising errors as they occur, aborting the operation:
File.stream!("data.csv") |> CSV.decode!
For all available options, check the docs on decode and decode!.
Specify a semicolon separator:
stream |> CSV.decode(separator: ?;)
Specify a custom escape character:
stream |> CSV.decode(escape_character: ?@)
Apply a transformation to a field when parsed, e.g. trimming the field:
stream |> CSV.decode(field_transform: &String.trim/1)
Unescape formulas that have been escaped:
stream |> CSV.decode(unescape_formulas: true)
Do this to encode a table (two-dimensional list):
table_data |> CSV.encode
And you'll get a stream of lines ready to be written to an IO device. For example, to write to a file:
file = File.open!("test.csv", [:write, :utf8])
table_data |> CSV.encode |> Enum.each(&IO.write(file, &1))
Use a semicolon separator:
your_data |> CSV.encode(separator: ?;)
Use a specific escape character:
your_data |> CSV.encode(escape_character: ?@)
You can also specify headers when encoding, which will encode map values into the right place:
[%{"a" => "value!"}] |> CSV.encode(headers: ["z", "a"])
# ["z,a\r\n", ",value!\r\n"]
You can also specify a keyword list. Its keys are used to look up the values in each row, and its values are used as the header names in the CSV output:
[%{a: "value!"}] |> CSV.encode(headers: [a: "x", b: "y"])
# ["x,y\r\n", "value!,\r\n"]
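The placement rule in the two examples above can be sketched in plain Elixir. HeaderSketch is a hypothetical module that only mimics the documented behaviour; it is not the library's implementation, and it assumes that a key missing from a row becomes an empty cell, as the outputs above suggest.

```elixir
defmodule HeaderSketch do
  # Illustration only; mimics the documented behaviour, not the library's code.

  # Returns {header_names, lookup_keys}. For a plain list both are the same;
  # for a keyword list the keys look up row values and the values name the columns.
  def split_headers(headers) do
    if Keyword.keyword?(headers) do
      {Keyword.values(headers), Keyword.keys(headers)}
    else
      {headers, headers}
    end
  end

  # Orders the values of a row map by the lookup keys; a missing key
  # yields an empty string (assumption based on the outputs shown above).
  def cells(row, headers) do
    {_names, keys} = split_headers(headers)
    Enum.map(keys, &Map.get(row, &1, ""))
  end
end
```

With headers ["z", "a"], the row %{"a" => "value!"} has no "z" key, so its first cell is empty. With headers [a: "x", b: "y"], the header row uses the names "x" and "y", while the row values are looked up under :a and :b.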
You'll surely appreciate some more info on encode.
Make sure your data gets encoded the way you want: implement the CSV.Encode protocol for whatever you wish to encode:
defimpl CSV.Encode, for: MyData do
  def encode(%MyData{has: fun}, env \\ []) do
    "so much #{fun}" |> CSV.Encode.encode(env)
  end
end
Or similar.
The encoding protocol implements a fallback to Any for types where a simple call to to_string will provide unambiguous results. Protocol dispatch for the fallback to Any is very slow when protocols are not consolidated, so make sure you have consolidate_protocols: true in your mix.exs, or consolidate protocols manually for production, in order to get good performance.
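As a rough sketch, the setting sits in the project/0 keyword list of mix.exs. The module name, app name and version numbers below are placeholders, not taken from this library.

```elixir
# mix.exs (sketch; MyApp, :my_app and the versions are placeholders)
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      elixir: "~> 1.5",
      # Consolidate protocols at compile time so that dispatch
      # to CSV.Encode does not go through the slow fallback path.
      consolidate_protocols: true,
      deps: deps()
    ]
  end

  defp deps do
    [{:csv, "~> 3.0"}]
  end
end
```

To check the result in a running system, Protocol.consolidated?(CSV.Encode) returns true once the protocol has been consolidated.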
There is more to know about everything ™️ - Check the doc
Please make sure to add tests. I will not look at PRs that are either failing or lowering coverage. Also, solve one problem at a time.
Copyright (c) 2022 Beat Richartz
CSV source code is licensed under the MIT License.