Heroku Staging: Live Demo (with Sidekiq / parallel jobs: faster on Heroku and accepts larger files)
UPDATE: Added a Sidekiq-based background job to speed up the import of bulky CSV files (see the parallel_processing branch).
The parallel branch uses Sidekiq for job queueing and Google Drive as a substitute for S3: uploaded files are stored there and processed later.
The process:
Upload -> file saved to Google Drive (public folder) -> Sidekiq queues a job that opens the file from its Google Drive path -> while parsing the file, that job queues one Sidekiq job per CSV row, so that only valid data is recorded in the database.
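The fan-out step above can be sketched with the standard library alone. This is not the code from the parallel_processing branch: a `Queue` stands in for Sidekiq, plain threads stand in for workers, and the names `valid_row?` and `import_rows` as well as the validation rules are assumptions made for illustration. It also assumes `object_changes` arrives as a JSON string.

```ruby
require "json"

# Assumed validation rules; the README only says that invalid rows
# must not reach the database.
def valid_row?(row)
  return false unless row[:object_id].to_s =~ /\A\d+\z/
  return false if row[:object_type].to_s.strip.empty?
  return false unless row[:timestamp].to_s =~ /\A\d+\z/
  JSON.parse(row[:object_changes].to_s).is_a?(Hash)
rescue JSON::ParserError
  false
end

# Stand-in for the Sidekiq fan-out: every row becomes one "job" on a queue,
# and a small pool of threads plays the role of the workers.
def import_rows(rows, worker_count: 4)
  jobs = Queue.new
  rows.each { |row| jobs << row }
  worker_count.times { jobs << :done } # one stop signal per worker

  saved = Queue.new # thread-safe stand-in for the database
  workers = Array.new(worker_count) do
    Thread.new do
      while (row = jobs.pop) != :done
        saved << row if valid_row?(row)
      end
    end
  end
  workers.each(&:join)

  # Sort so the result does not depend on thread scheduling.
  Array.new(saved.size) { saved.pop }
       .sort_by { |r| [r[:timestamp].to_i, r[:object_id].to_i] }
end

rows = [
  { object_id: "1", object_type: "ObjectA", timestamp: "412351252",
    object_changes: '{"property1": "value1"}' },
  { object_id: "x", object_type: "ObjectA", timestamp: "412351999",
    object_changes: '{"property1": "bad id"}' },
  { object_id: "2", object_type: "ObjectA", timestamp: "451232123",
    object_changes: "not json" },
  { object_id: "1", object_type: "ObjectB", timestamp: "456662343",
    object_changes: '{"property1": "another value1"}' }
]

puts import_rows(rows).size
```

Only the first and last sample rows pass validation, so the other two are dropped without stopping the import.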
We have a CSV (Comma Separated Values) file of N rows for the following objects:
ObjectA: {property1, property2, property3}

ObjectB: {property1, property2, property3}

...

object_id | object_type | timestamp | object_changes
:-------: | :---------: | :--------: | :------------
1 | ObjectA | 412351252 | {property1: "value1", property3: "value2"}
1 | ObjectB | 456662343 | {property1: "another value1"}
1 | ObjectA | 467765765 | {property1: "altered value1", property2: "random value2"}
2 | ObjectA | 451232123 | {property2: "some value2"}
... | ... | ... | ...
The CSV columns are:
- **object_id:** is a unique identifier per-object type.
- **object_type:** denotes the object type.
- **timestamp:**
- **object_changes:** the properties changed for specified object at **timestamp**.
The program should run behind a simple HTTP interface, accept a file upload (the CSV file in this case), and then allow users to query the system about the state of an object at a specific point in time, e.g. what were the properties associated with ObjectA holding Id=100 on May 4th, 2016 12:00:00AM UTC?
For simplicity, assume a moderate number of classes/types (no more than 3) and a moderate/acceptable number of properties per class (with simple serializable data types).
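The point-in-time query described above amounts to replaying every change for one object, oldest first, up to the requested moment. Below is a minimal in-memory sketch of that idea using the sample rows from the table. The method name `state_at` is hypothetical, the app itself stores records in MongoDB rather than an array, and treating the cutoff as inclusive is an assumption.

```ruby
# Sample rows from the table above, with object_changes already parsed.
CHANGES = [
  { object_id: 1, object_type: "ObjectA", timestamp: 412351252,
    object_changes: { property1: "value1", property3: "value2" } },
  { object_id: 1, object_type: "ObjectB", timestamp: 456662343,
    object_changes: { property1: "another value1" } },
  { object_id: 1, object_type: "ObjectA", timestamp: 467765765,
    object_changes: { property1: "altered value1", property2: "random value2" } },
  { object_id: 2, object_type: "ObjectA", timestamp: 451232123,
    object_changes: { property2: "some value2" } }
].freeze

# Returns the merged properties of one object as of `time`.
# `time` may be a Time or an Integer holding Unix epoch seconds.
def state_at(changes, object_type, object_id, time)
  cutoff = time.to_i
  changes
    .select do |c|
      c[:object_type] == object_type &&
        c[:object_id] == object_id &&
        c[:timestamp] <= cutoff # inclusive cutoff (assumption)
    end
    .sort_by { |c| c[:timestamp] }
    .reduce({}) { |state, c| state.merge(c[:object_changes]) } # later changes win
end

# Before the second ObjectA change: only the first row applies.
p state_at(CHANGES, "ObjectA", 1, 450_000_000)

# On May 4th, 2016 00:00:00 UTC: both ObjectA changes have been applied.
p state_at(CHANGES, "ObjectA", 1, Time.utc(2016, 5, 4))
```

The first call sees only `property1` and `property3` from the initial row. The second also picks up the altered `property1` and the new `property2`, while `property3` keeps its original value because no later change touched it. An object with no changes before the cutoff yields an empty hash.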
- Ruby version
ruby 2.3.1
- System dependencies (for tests and database drivers; only tested and developed on Mac OS X)
$ bundle install
$ brew install mongodb
$ brew services start mongodb
$ npm install phantomjs
- Configuration
Mongoid configuration can be done through a mongoid.yml that specifies your options and clients. The simplest configuration is as follows, which sets the default client to “localhost:27017” and provides a single database in that client named “mongoid”.
    development:
      clients:
        default:
          database: mongoid
          hosts:
            - localhost:27017
Rails Applications
I have already configured mongoid.yml, but if you want to customize it, you can generate a fresh config file by running the generator below and then editing myapp/config/mongoid.yml to your heart’s desire. Mongoid will handle everything else from there.
$ rails g mongoid:config
- Database creation / initialization
Since MongoDB is a document-based database, it does not need any migrations, so you can proceed immediately.
In a clean / empty state, you can upload your csv file on the running app (e.g. localhost:3000) or by running a premade database seeder (contains 4 default given records and 100 randomly generated records):
$ rake db:seed
- How to run the test suite
I used RSpec in conjunction with Capybara, PhantomJS, FactoryGirl, and Guard. I have already deleted the tests created by the Rails generators, so just run:
$ rake
or for the automated testing:
$ bundle exec guard
(then press Enter at the Guard prompt to run all tests)
- Services (job queues, cache servers, search engines, etc.)
I have used the SmarterCSV gem to make CSV importing faster. According to its documentation, uploads can be made faster still by combining it with job queues. This is currently still under development.
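One way to combine chunked CSV reading with a job queue is to hand each chunk of rows to its own job. SmarterCSV can yield rows in chunks through its `chunk_size` option; the sketch below uses `each_slice` from the standard library as a stand-in so it runs without the gem. The method name `enqueue_batches`, the job hash layout, and the array used as a queue are all hypothetical.

```ruby
# Splits parsed rows into fixed-size batches and pushes one "job" per batch
# onto the given queue. The queue only needs to respond to <<.
def enqueue_batches(rows, batch_size, queue)
  rows.each_slice(batch_size).with_index do |batch, index|
    queue << { job: index, rows: batch }
  end
  queue
end

# 250 fake rows in batches of 100: two full batches and one partial batch.
fake_rows = (1..250).map { |i| { object_id: i } }
jobs = enqueue_batches(fake_rows, 100, [])
puts jobs.map { |job| job[:rows].size }.inspect
```

Batching trades per-row isolation for fewer, larger jobs, so the batch size is something to tune against the memory available on the dyno.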
- Deployment instructions
Note that the deployed environment I am using is Heroku’s free tier, which only has 25 MB of RAM and a shared CPU allocation. It may become really slow on CSV files with 100,000+ records. Only the upload step is slow, though: I have architected the system to use AJAX / lazy loading to speed things up.
$ heroku create
$ heroku addons:create mongolab:sandbox
Then, in your mongoid.yml:
    production:
      clients:
        default:
          # The standard MongoDB connection URI allows for easy
          # replica set connection setup.
          # Use environment variables or a config file to keep your
          # credentials safe e.g. <%= ENV['MONGODB_URI'] %>.
          uri: <%= ENV['MONGODB_URI'] %>
          options:
            # The default timeout is 5, which is the time in seconds
            # for a connection to time out.
            # We recommend 15 because it allows for plenty of time
            # in most operating environments.
            connect_timeout: 15
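The `<%= ... %>` tag works because the file is rendered through ERB before it is parsed as YAML. The sketch below reproduces that two-step load with the standard library only, which is handy for checking what the production block resolves to. The helper name `load_settings` and the fallback URI are assumptions for illustration, not part of this app.

```ruby
require "erb"
require "yaml"

# A trimmed copy of the production block above.
MONGOID_YML = <<~'YML'
  production:
    clients:
      default:
        uri: <%= ENV['MONGODB_URI'] %>
        options:
          connect_timeout: 15
YML

# Renders the ERB tags first, then parses the result as YAML and returns
# the settings for one environment.
def load_settings(template, env_name)
  rendered = ERB.new(template).result
  YAML.load(rendered).fetch(env_name)
end

# Fallback for local experiments only (assumed value).
ENV["MONGODB_URI"] ||= "mongodb://localhost:27017/mongoid"

settings = load_settings(MONGOID_YML, "production")
puts settings["clients"]["default"]["uri"]
puts settings["clients"]["default"]["options"]["connect_timeout"]
```

If `MONGODB_URI` is unset, the tag renders as an empty string and `uri` comes back as `nil`, so it is worth confirming that the add-on has set the variable (for example with `heroku config`) before deploying.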