csvzip is a standalone CLI tool that reduces the size of CSV files by replacing the values of categorical columns with unique integers.
Running it produces two files:
- a CSV with the compressed values
- a JSON dictionary with the mappings
The input CSV must have a header row.
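The idea behind the conversion (often called dictionary encoding) can be sketched in a few lines of Python. This is only an illustration of the technique, not csvzip's actual implementation: the helper name `encode_columns`, the sample rows, and the choices to strip surrounding whitespace and to hand out codes in order of first appearance are all assumptions made for the example.

```python
import csv
import io

def encode_columns(rows, columns):
    """Replace the values of `columns` with integer codes (hypothetical helper).

    rows: iterable of dicts, e.g. from csv.DictReader
    Returns (encoded_rows, mappings) where mappings[column][value] is the code.
    """
    mappings = {column: {} for column in columns}
    encoded = []
    for row in rows:
        new_row = dict(row)
        for column in columns:
            # Assumption: surrounding whitespace is dropped before encoding
            value = row[column].strip()
            codes = mappings[column]
            if value not in codes:
                # Next free integer, stored as a string like in a JSON dictionary
                codes[value] = str(len(codes))
            new_row[column] = codes[value]
        encoded.append(new_row)
    return encoded, mappings

# Small made-up sample using ';' as the separator
sample = '''"DESC_DISTRITO";"DESC_BARRIO";"COD_EDAD_INT"
"CENTRO ";"PALACIO ";"99"
"CENTRO ";"EMBAJADORES ";"0"
"ARGANZUELA ";"CHOPERA ";"1"
'''
reader = csv.DictReader(io.StringIO(sample), delimiter=";")
encoded_rows, mappings = encode_columns(reader, ["DESC_DISTRITO", "DESC_BARRIO"])
print(mappings)
print(encoded_rows)
```

Columns that are not listed pass through unchanged, so the saving comes entirely from long, frequently repeated strings being replaced by short integers.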
We love csvkit, and csvzip was inspired by that great tool.
You can download the latest binary from the releases page.
You can get the latest macOS (darwin) build from the releases page.
Until the Crystal port for Windows is complete, you can use Windows Subsystem for Linux.
Let's say you have downloaded a very large CSV, for example the Madrid census (Rango_Edades_Seccion_202005.csv).
If we inspect it, it's a 22 MB file that looks like this:
"COD_DISTRITO";"DESC_DISTRITO";"COD_DIST_BARRIO";"DESC_BARRIO";"COD_BARRIO";"COD_DIST_SECCION";"COD_SECCION";"COD_EDAD_INT";"EspanolesHombres";"EspanolesMujeres";"ExtranjerosHombres";"ExtranjerosMujeres"
"1";"CENTRO ";"101";"PALACIO ";"1";"1006";"6";"99";"";"1";"";""
"1";"CENTRO ";"101";"PALACIO ";"1";"1006";"6";"102";"";"1";"";""
"1";"CENTRO ";"101";"PALACIO ";"1";"1007";"7";"0";"2";"2";"";"1"
"1";"CENTRO ";"101";"PALACIO ";"1";"1007";"7";"1";"3";"3";"";""
"1";"CENTRO ";"101";"PALACIO ";"1";"1007";"7";"2";"4";"3";"";""
"1";"CENTRO ";"101";"PALACIO ";"1";"1007";"7";"3";"1";"3";"";""
"1";"CENTRO ";"101";"PALACIO ";"1";"1007";"7";"4";"";"6";"";"1"
"1";"CENTRO ";"101";"PALACIO ";"1";"1007";"7";"5";"2";"1";"";""
"1";"CENTRO ";"101";"PALACIO ";"1";"1007";"7";"6";"3";"4";"";""
...
Let's compress it:
csvzip -i Rango_Edades_Seccion_202005.csv -o compressed.csv -c "DESC_DISTRITO,DESC_BARRIO" -k census -s ';'
The resulting compressed.csv looks like this:
COD_DISTRITO,DESC_DISTRITO,COD_DIST_BARRIO,DESC_BARRIO,COD_BARRIO,COD_DIST_SECCION,COD_SECCION,COD_EDAD_INT,EspanolesHombres,EspanolesMujeres,ExtranjerosHombres,ExtranjerosMujeres
1,0,101,0,1,1006,6,99,,1,,
1,0,101,0,1,1006,6,102,,1,,
1,0,101,0,1,1007,7,0,2,2,,1
1,0,101,0,1,1007,7,1,3,3,,
1,0,101,0,1,1007,7,2,4,3,,
1,0,101,0,1,1007,7,3,1,3,,
...
And the size is now 7.2 MB.
If we inspect the dictionary, it contains the values of those columns:
{
  "census": {
    "DESC_DISTRITO": {
      "CENTRO": "0",
      "ARGANZUELA": "1",
      "RETIRO": "2",
      "SALAMANCA": "3",
      ...
    },
    "DESC_BARRIO": {
      "PALACIO": "0",
      "EMBAJADORES": "1",
      "UNIVERSIDAD": "2",
      "CHOPERA": "3",
      "PACIFICO": "4",
      ...
    }
  }
}
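The dictionary holds everything needed to map the codes back to their labels. A minimal Python sketch of that reverse lookup, assuming the JSON layout shown above (key, then column, then value-to-code mapping); `decode_rows` and the sample rows are hypothetical and not part of csvzip.

```python
import json

def decode_rows(rows, dictionary, key):
    """Swap integer codes back to labels using the JSON dictionary (hypothetical helper)."""
    # Invert each column mapping: code -> original value
    reverse = {
        column: {code: value for value, code in mapping.items()}
        for column, mapping in dictionary[key].items()
    }
    decoded = []
    for row in rows:
        new_row = dict(row)
        for column, codes in reverse.items():
            new_row[column] = codes[row[column]]
        decoded.append(new_row)
    return decoded

# A cut-down dictionary in the same shape as the example above
dictionary = json.loads('''
{"census": {
  "DESC_DISTRITO": {"CENTRO": "0", "ARGANZUELA": "1"},
  "DESC_BARRIO": {"PALACIO": "0", "EMBAJADORES": "1"}
}}
''')
rows = [
    {"COD_DISTRITO": "1", "DESC_DISTRITO": "0", "DESC_BARRIO": "0"},
    {"COD_DISTRITO": "1", "DESC_DISTRITO": "0", "DESC_BARRIO": "1"},
]
restored = decode_rows(rows, dictionary, "census")
print(restored)
```

Judging from the example, the dictionary stores labels without the trailing spaces present in the source file, so a lookup like this would restore the trimmed values rather than the exact original bytes.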
- Improve specs coverage
- Accept a headers parameter
- Add a decompress operation
- Fork it (https://github.com/PopulateTools/csvzip/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
Thanks to marcobellaccini from nanvault for the inspiration to build this CLI tool. The structure of the project and the GitHub Actions scripts are copied from that repository.
- Fernando Blat - creator and maintainer