
Exercise in streaming large CSV files for data analysis

Primary language: Haskell. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

csvstream

I'm learning how to parse big CSV files in Haskell. This is one attempt. The data I'll be analysing are the public records of Australian patents accumulated over more than 100 years. You can get the CSV files from data.gov.au. No, I will not include 785 MB of data (compressed) in this repository.

First task

Reading from a stream.

Second task

Doing some basic data analysis, like counting records.
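A minimal sketch of the first two tasks, assuming cassava's streaming interface (`Data.Csv.Streaming`) and a file with a header line. Each row is decoded as a plain vector of fields, so nothing about the patent schema is assumed; the `sample` data is made up for illustration. The `Foldable` instance of `Records` skips rows that fail to parse, so this counts only the rows that decode.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import           Data.Csv (HasHeader (..))
import qualified Data.Csv.Streaming as S
import           Data.Foldable (foldl')
import           Data.Vector (Vector)

-- | Count the records that decode successfully, skipping the header line.
-- The strict left fold keeps only a running counter in memory.
countRecords :: BL.ByteString -> Int
countRecords input = foldl' (\n _ -> n + 1) 0 rows
  where
    rows :: S.Records (Vector B.ByteString)
    rows = S.decode HasHeader input

-- | Made-up sample data standing in for the real patent files.
sample :: BL.ByteString
sample = BLC.pack "id,title\n1,Widget\n2,Gadget\n"

main :: IO ()
main = print (countRecords sample)
```

For a real file, something like `countRecords <$> BL.readFile path` should read the input lazily, chunk by chunk, rather than loading it all at once.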

Second and a half task

Being able to inspect the stream using something like take or show with indexing. I assume I would be doing it in GHCi.
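One possible way to peek at the stream, again assuming cassava's `Data.Csv.Streaming`: convert the lazily decoded records to a list and use `take`. The helper name `firstRows` and the sample data are made up for illustration. Rows that fail to parse are silently skipped by `toList`.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import           Data.Csv (HasHeader (..))
import qualified Data.Csv.Streaming as S
import           Data.Foldable (toList)
import qualified Data.Vector as V

-- | Return at most the first n rows, header line included.
-- Only as much of the input as those rows need is decoded.
firstRows :: Int -> BL.ByteString -> [V.Vector B.ByteString]
firstRows n = take n . toList . S.decode NoHeader

-- | Made-up sample data standing in for the real patent files.
sample :: BL.ByteString
sample = BLC.pack "id,title\n1,Widget\n2,Gadget\n"

main :: IO ()
main = mapM_ print (firstRows 2 sample)
```

In GHCi this could look like `rows <- firstRows 5 <$> BL.readFile path`, followed by `rows !! 2` to index into the result.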

Third task

Extracting relevant info from unstructured text, such as addresses. That's a big part of what I do for work, and the main motivation for looking beyond Python. I want to move away from regular expressions and do it fast.
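As a rough illustration of the regex-free approach, here is a parser-combinator sketch that pulls a state abbreviation and four-digit postcode out of a free-text address. It uses `Text.ParserCombinators.ReadP` from base only to stay dependency-free; it works on `String` and is not fast, so a library such as attoparsec would likely be the better fit for the real data. The address layout ("NSW 2000" style) and the function names are assumptions.

```haskell
import Data.Char (isAlphaNum, isDigit)
import Data.List (tails)
import Data.Maybe (listToMaybe)
import Text.ParserCombinators.ReadP

-- | A state abbreviation, one or more spaces, then exactly four digits.
statePostcode :: ReadP (String, String)
statePostcode = do
  st <- choice (map string ["NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT"])
  _  <- munch1 (== ' ')
  pc <- count 4 (satisfy isDigit)
  rest <- look
  case rest of
    (c : _) | isDigit c -> pfail          -- reject longer digit runs
    _                   -> return (st, pc)

-- | Try the parser at every position that does not follow a letter or digit,
-- and return the first match, if any.
extractStatePostcode :: String -> Maybe (String, String)
extractStatePostcode s = listToMaybe
  [ r
  | (prev, rest) <- zip (' ' : s) (tails s)
  , not (isAlphaNum prev)
  , (r, _) <- readP_to_S statePostcode rest
  ]

main :: IO ()
main = print (extractStatePostcode "12 Example St, Sydney NSW 2000, Australia")
```

The boundary check on the preceding character is there so that "SA" inside "USA 5000" is not mistaken for a state.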

Fourth task

GROUP BY
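A minimal sketch of a GROUP BY with a count, assuming the grouping key fits comfortably in memory: fold the rows into a strict `Data.Map`. The function name `groupCount` and the sample rows are hypothetical. Because it accepts any `Foldable`, it should also work directly on a stream of decoded records.

```haskell
import           Data.Foldable (foldl')
import qualified Data.Map.Strict as M

-- | Count how many rows share each key, like
-- SELECT key, COUNT(*) ... GROUP BY key.
groupCount :: (Foldable t, Ord k) => (a -> k) -> t a -> M.Map k Int
groupCount key = foldl' step M.empty
  where
    step m x = M.insertWith (+) (key x) 1 m

-- | Made-up rows: (country code, record id).
sampleRows :: [(String, Int)]
sampleRows = [("AU", 1), ("US", 2), ("AU", 3)]

main :: IO ()
main = print (M.toList (groupCount fst sampleRows))
```

Only one map entry per distinct key is kept, so memory grows with the number of groups rather than the number of rows.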

At some point

  • Encoding results back into an output file.
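Writing results back out could be sketched with cassava's `encode`, which turns a list of records into a lazy `ByteString`. The shape of the results (a label and a count) and the name `renderCounts` are assumptions for illustration. Note that cassava ends each row with `\r\n`.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import           Data.Csv (encode)

-- | Render (label, count) pairs as CSV rows, without a header line.
renderCounts :: [(String, Int)] -> BL.ByteString
renderCounts = encode

main :: IO ()
main = BLC.putStr (renderCounts [("AU", 2), ("US", 1)])
```

To write a file instead of printing, `BL.writeFile path (renderCounts results)` should do.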

Proposed libraries

There are good tutorials for cassava by Chris Allen and Stack Builders. I worked through these in that order, so I'll retrace their steps with this new dataset as a starting point. Next, I may move on to conduit because the syntax seems good and I read some comments suggesting it is more resilient to ugly data than pipes. Or machines. Or streaming...

Finally, I eagerly welcome help to move this forward. Get in touch!