Processing 1Million+ rows of csv via Node.js
This was an interesting problem to come across. Filtering 1M+ rows from a csv to sql seems like an architechtural flaw to begin with. That being said, provided csv file had 89 occurences of misplaced column. Probably a large dataset was copy-pasted repeatedly and forgot to put in a newline before pasting.
Problem Approach
I can only think of two ways to solve this problem.
- Stream it! => use
streams
topipe
data in chunks, process it, store it. - All SQL => use
LOAD DATA INFILE
fromMySQL
to directly import it and postprocess into separate tables
I went with the first approach as second would introduce some major vulnerability if done through an webserver. A bash script inside myqsl server may work just fine.
I could have not used a webserver, but for a complete app
included this. avoided using multer
to upload as instruction sets did not mention anything explicitly about it.
The following are the execution time for the process
- 87 seconds - no filter, insert all in one table
- 93 seconds - with filter
- 94 seconds - with filter and export to csv
Project Structure
This is a fastify
application. It is known to be extremely performant and scalabale, beats express anyday.
- app -> contains all encapsulated features
- handlers => app logic only
- services => business logic only
- schemas => request validation & response serialization
Installation Steps
Local
Requirements:
MySQL >= 8
Node.js >= 18
NPM >= 9
create .env
from .env.example
file and replace with proper database values.
Host should be 0.0.0.0
if you are using Docker. Then install the app.
npm install
once you have proper values for DB in env, run
npm run db:migrate
this will process migration and create the necessary tables. incase of testing docker, need to already have migrations in your db as we are using a distroless image.
Docker :
this is a production image build using distroless Node by Google, just barebones Kernel hotwired with Nodejs. Make sure your MySQL
instance is on the same Virtual Network
of your application instance. Otherwise you will not be able to run this app. I have included Kubernetes
manifests file, so feel free to use that as well.
first build your image, give it a name and tag. Do Not Forget
to keep the csv
you want to process within data
folder of app. Otherwise, It's not going to work.
docker build . -t millioncsv:v1
then use the built image to create a compute instance, it will keep running in the background
docker run -d -p 3000:3000 --name appinstance millioncsv:v1
Execution Steps
Please ensure you have the csv
you want to process in the data
folder on the root of your application. Once the installation is done, run npm run dev
and your webserver will start. If you are using docker, just ensure you have your csv in data folder.
There is only one endpoint that takes the filename to look for in data
folder and processes it. Here is the API Doc for it:
filename = file to look for in data folder
HTTP/1.1
HOST: localhost:3000
POST /v1/customer/parse?filename=<filename>
i.e : localhost:3000/v1/customer/parse?filename=<filename>
request returns execution time in ms - miliseconds
.
This will trigger the process. given csv will be parsed, processed, inserted to sql and simultaneously filtered data written to data/output
Note
if you get errors related to columns breaking, use a proper csv
file. Or use VSCode
and regex to filter it. the csv
values are in data/output
folder