MusicLoader CLI application
Introduction
This program is a CLI batch application that applies a batch of changes to an input file in order to create an output file.
Author: Bogdan Kulbida November 23, 2019. Seattle, USA
Considerations
Assuming we are building a real-world batch script that can be used in a semi-production environment, the following decisions have been made.
-
Using streams instead of plain files (an input data file is still supported, as requested).
-
Input data validation. The program has to process bad data in the changeset files gracefully, without propagating it downstream. Instructions that do not pass validation are captured in a separate error.log file for further investigation and a potential rerun of the batch command.
-
Security. The proposed solution uses streams instead of files, because a file has to be persisted, stored securely, and deleted afterward. Streams, on the other hand, can easily be consumed and produced over a secure transport such as HTTPS or a TCPSocket. Another consideration is a potential security vulnerability in the changeset files, since they carry serialized class names. As a further improvement, an additional validation layer for changeset files can be implemented.
-
Using streams, we can easily distribute the load across workers that are dispersed across the network.
-
We used data serialization and deserialization to provide data integrity checks and data validation. This slows processing down a little, but since we are building a real-world batch application, data consistency takes priority over performance.
-
The proposed solution should process small to medium-sized JSON files, up to about 500MB per file. For larger files, our suggestion is to replace the home-baked in-memory storage with a NoSQL database. Another option is to use an Elasticsearch service. The choice depends heavily on how the output data will be consumed.
-
The design of the application allows chaining commands and applying several command (changes.json) files through the UNIX STDIN and STDOUT interfaces.
-
Scalability. Due to the distributed nature of the batch command, the proposed solution can be deployed to multiple nodes, which allows changes to be applied incrementally.
-
The proposed solution uses the lazy-loading design pattern where possible to consume resources efficiently.
-
Exceptions for exceptional cases. We have designed this solution to consume bad data and avoid propagating it. However, this may not be the best idea for some specific instances in which data integrity is a significant factor.
-
The proposed design can serve as a foundation for a simple ETL. Data validation allows you to call external services to check e-mail validity or to make database requests to get additional information. See the details below for a complete list of features.
-
Note: for simplicity's sake, we do not check the uniqueness of data objects when adding them to a collection, and we leave out other extra features. The proposed solution allows additional functionality to be developed in a modular way using a public or private API. For example, we could move some of the logic into separate methods or classes; for this example, we decided to keep things simple.
-
There are a few more considerations. One is to move to the AWS serverless model and use AWS Lambda with a queuing mechanism to distribute processing and decouple the code. This approach would also help to save cost dramatically under unpredictable load. Another is to use AWS Spot Block-Instances.
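The stream-oriented design described above can be sketched as a small filter that reads a JSON document from one IO object, applies a change, and writes the result to another. This is only a minimal sketch under our own assumptions, not the actual entrypoint.rb: the method name process_stream and the block-based change are hypothetical.

```ruby
require 'json'
require 'stringio'

# Hypothetical helper: reads one JSON document from `input`, lets the
# caller mutate it in a block, and writes the result to `output`.
# Any IO-like object works: STDIN/STDOUT, a File, a socket, or a StringIO.
def process_stream(input, output)
  data = JSON.parse(input.read)
  yield data
  output.write(JSON.generate(data))
  output
end

# Usage with in-memory streams standing in for STDIN and STDOUT.
input  = StringIO.new('{"playlists":[{"id":"1","user_id":"1","song_ids":["1"]}]}')
output = StringIO.new
process_stream(input, output) do |data|
  data['playlists'].reject! { |playlist| playlist['id'] == '1' }
end
puts output.string
```

Because the helper only depends on read and write, the same code could be fed from a file, a pipe, or a network socket without changes.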
Dependencies
Our program is written in Ruby 2.x and has only one optional dependency, yajl (the yajl-ruby gem). We used this library because it supports loading JSON via TCPSocket, URL, etc. It is optional and can be replaced with the standard Ruby JSON library.
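One way to keep the dependency optional is to fall back to the standard library when the gem is missing. The snippet below is a sketch of that idea rather than the code used in this repository; load_json and JSON_BACKEND are hypothetical names.

```ruby
require 'stringio'

# Prefer yajl-ruby when it is installed; otherwise fall back to the
# JSON library that ships with Ruby.
begin
  require 'yajl'
  JSON_BACKEND = :yajl
rescue LoadError
  require 'json'
  JSON_BACKEND = :json
end

# Hypothetical helper: parses one JSON document from an IO-like object.
def load_json(io)
  if JSON_BACKEND == :yajl
    Yajl::Parser.parse(io)
  else
    JSON.parse(io.read)
  end
end

data = load_json(StringIO.new('{"users":[{"id":"1","name":"Albin Jaye"}]}'))
puts data['users'].first['name']
```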
Execution
1. Checking Ruby version:
ruby --version
=> ruby 2.3
2. Installing dependencies (using rubygems):
gem install yajl-ruby
Building native extensions. This could take a while...
Successfully installed yajl-ruby-1.4.1
Parsing documentation for yajl-ruby-1.4.1
Done installing documentation for yajl-ruby after 0 seconds
1 gem installed
3. Diff-file (or changes.json)
Our change files are named ops0.json ... ops4.json. We will call these files changeset files.
cat ops4.json
[
{
"optype" : "AddSong",
"playlist__id" : "1",
"song_id" : "1"
},
{
"optype" : "RemovePlaylist",
"id" : "1"
},
{
"optype" : "RemovePlaylist",
"id" : "3"
},
{
"optype" : "AddPlaylist",
"user_id" : "1",
"payload" : [
{"song_id" : "1"}
]
},
{
"optype" : "RemovePlaylist",
"id" : "3"
},
{
"optype" : "AddPlaylist",
"user_id" : "2",
"payload" : [
{"song_id" : "1"}
]
},
{
"optype" : "AddSong",
"playlist_id" : "1",
"song_id" : "6"
},
{
"optype" : "AddSong",
"playlist_id" : "2",
"song_id" : "7"
},
{
"optype" : "AddSong",
"playlist_id" : "3",
"song_id" : "5"
},
{
"optype" : "AddPlaylist",
"user_id" : "1",
"payload" : [
{"song_id" : "1"},
{"song_id" : "2"},
{"song_id" : "3"},
{"song_id" : "4"},
{"song_id" : "5"},
{"song_id" : "6"}
]
}
]
The script supports the following mutation classes:
- AddSong
- RemovePlaylist
- AddPlaylist
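Since the optype value names a class, one way to limit what a changeset file can instantiate is an explicit whitelist. The sketch below is an assumption about how such a lookup could work, not the actual Processor code: the three classes are empty stand-ins for the real reducers, and ALLOWED_OPERATIONS and build_operation are hypothetical names.

```ruby
# Empty stand-ins for the real reducer classes defined in reducers.rb.
AddSong        = Struct.new(:data)
RemovePlaylist = Struct.new(:data)
AddPlaylist    = Struct.new(:data)

# Only these optype values may be turned into objects.
ALLOWED_OPERATIONS = {
  'AddSong'        => AddSong,
  'RemovePlaylist' => RemovePlaylist,
  'AddPlaylist'    => AddPlaylist
}.freeze

# Hypothetical helper: returns an operation object for a known optype,
# or nil when the changeset names anything else.
def build_operation(instruction)
  klass = ALLOWED_OPERATIONS[instruction['optype']]
  klass && klass.new(instruction)
end

known   = build_operation({ 'optype' => 'RemovePlaylist', 'id' => '1' })
unknown = build_operation({ 'optype' => 'Kernel' })
puts known.class
puts unknown.inspect
```

A lookup table like this avoids resolving arbitrary constant names taken from untrusted input.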
If you would like to add more operations to the stack, please do the following:
- Create a new mutation operation in the reducers.rb file. For example:
class RemovePlaylist < BaseReducer
  attr_accessor :id

  def initialize(data)
    super
    @id = data['id']
  end

  # Removes the playlist with the given id, or records the failure
  # and writes the failed operation to STDERR.
  def run!(storage)
    playlist = storage[Playlist::SCOPE].detect { |p| p.id == id }
    if playlist
      storage[Playlist::SCOPE] = storage[Playlist::SCOPE] - [playlist]
    else
      errors.push("Playlist not found. Operation #{self.class.name} failed.")
      STDERR.puts(to_json)
    end
  end

  # An operation without an id cannot be applied.
  def valid?
    id
  end
end
- Inherit from the base class BaseReducer.
- Define attributes and implement the constructor, run! and valid? methods.
1. run! - this method is executed by the Processor class as an operation. Here you can define all the logic to mutate the input data or report the issue to the log file.
2. valid? - this method is used for object integrity checks to avoid missing fields and broken inter-object relations.
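To illustrate the steps above, here is a sketch of what an additional operation might look like. RemoveSong is a hypothetical operation that does not exist in this repository, and the BaseReducer and Playlist classes below are minimal stand-ins written only so the example runs on its own; the real classes may differ.

```ruby
# Minimal stand-ins; the real classes live in reducers.rb and serializers.rb.
class BaseReducer
  attr_reader :errors

  def initialize(_data)
    @errors = []
  end
end

class Playlist
  SCOPE = :playlists
  attr_accessor :id, :song_ids

  def initialize(id, song_ids)
    @id = id
    @song_ids = song_ids
  end
end

# Hypothetical new operation: removes one song from one playlist.
class RemoveSong < BaseReducer
  attr_accessor :playlist_id, :song_id

  def initialize(data)
    super
    @playlist_id = data['playlist_id']
    @song_id = data['song_id']
  end

  # Mutates the storage, or records an error when nothing matches.
  def run!(storage)
    playlist = storage[Playlist::SCOPE].detect { |p| p.id == playlist_id }
    if playlist && playlist.song_ids.delete(song_id)
      playlist
    else
      errors.push("Playlist or song not found. Operation #{self.class.name} failed.")
      nil
    end
  end

  # Both references are required for the operation to make sense.
  def valid?
    !!(playlist_id && song_id)
  end
end

storage = { Playlist::SCOPE => [Playlist.new('1', %w[1 6])] }
operation = RemoveSong.new({ 'playlist_id' => '1', 'song_id' => '6' })
operation.run!(storage) if operation.valid?
puts storage[Playlist::SCOPE].first.song_ids.inspect
```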
4. Supported collections:
At this point the program supports 3 types of collections:
- users
- songs
- playlists
If you would like to add more collections for the input data file, please do the following:
- Create a new mutation (serializer) class in serializers.rb, for example:
class User < BaseSerializer
  # JSON key of this collection in the input data file.
  SCOPE = :users

  attr_accessor :id, :name

  # A user needs both an id and a name.
  def valid?
    id && name
  end
end
Here the SCOPE key has to correspond to the JSON key in the input data file; in this case we use :users.
Here is an example from the input JSON file for the SCOPE users:
"users" : [
{
"id" : "1",
"name" : "Albin Jaye"
},
{
"id" : "2",
"name" : "Dipika Crescentia"
},
{
"id" : "3",
"name" : "Ankit Sacnite"
},
{
"id" : "4",
"name" : "Galenos Neville"
},
{
"id" : "5",
"name" : "Loviise Nagib"
},
{
"id" : "6",
"name" : "Ryo Daiki"
},
{
"id" : "7",
"name" : "Seyyit Nedim"
}
],
As you can see, both attributes, id and name, are defined in the User class.
- Inherit the class from BaseSerializer.
- Define attributes and implement the valid? method to make sure the change action from the changeset files (see above) validates. The valid? method here is used when we add a new object to the output file.
- Add the new class to the facade collection in the entrypoint.rb file, L16:
collector = Collector.new(processor, [User, Song, Playlist])
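The relationship between SCOPE and the JSON key can be sketched as follows. This is an illustration under our own assumptions, not the actual Collector: the BaseSerializer below is a minimal stand-in and load_collection is a hypothetical helper.

```ruby
require 'json'

# Minimal stand-in for the real BaseSerializer in serializers.rb: copies
# every key that has a matching attribute writer onto the object.
class BaseSerializer
  def initialize(attributes)
    attributes.each do |key, value|
      writer = "#{key}="
      public_send(writer, value) if respond_to?(writer)
    end
  end
end

class User < BaseSerializer
  SCOPE = :users
  attr_accessor :id, :name

  def valid?
    id && name
  end
end

# Hypothetical loader: SCOPE selects the JSON key that holds the
# collection, and records that fail valid? are dropped.
def load_collection(klass, document)
  rows = document.fetch(klass::SCOPE.to_s, [])
  rows.map { |row| klass.new(row) }.select(&:valid?)
end

document = JSON.parse('{"users":[{"id":"1","name":"Albin Jaye"},{"id":"2"}]}')
users = load_collection(User, document)
puts users.map(&:name).inspect
```

In this sketch the second record has no name, so it does not pass valid? and is left out of the loaded collection.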
5. Ready. Steady. Go!
- Clone this repository
- Make sure you have at least Ruby 2.3 installed.
- To run our program, use the following command:
cat mixtape-data.json | ruby entrypoint.rb ops0.json 2> error.log | ruby entrypoint.rb ops1.json 2>> error.log | ruby entrypoint.rb ops2.json 2>> error.log | ruby entrypoint.rb ops3.json 2>> error.log | ruby entrypoint.rb ops4.json 2>> error.log > output.json
Here we use 5 changeset files (ops0.json through ops4.json), each with 3 to 11 commands. You may add as many commands as you would like.
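For a longer chain it may be easier to generate the command than to type it. The helper below is a hypothetical convenience, not part of this repository; it only builds the command string in the same shape as the one above, where the first stage truncates the log with 2> and later stages append with 2>>.

```ruby
require 'shellwords'

# Hypothetical helper: builds the chained shell command for any number
# of changeset files. It returns a string and does not execute anything.
def pipeline_command(input, changesets, output, log = 'error.log')
  stages = changesets.each_with_index.map do |file, index|
    redirect = index.zero? ? '2>' : '2>>'
    "ruby entrypoint.rb #{file.shellescape} #{redirect} #{log.shellescape}"
  end
  (["cat #{input.shellescape}"] + stages).join(' | ') + " > #{output.shellescape}"
end

puts pipeline_command('mixtape-data.json', %w[ops0.json ops1.json], 'output.json')
```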
6. Errors investigation
The command above produces an error.log file and appends the errors from each run to it. Here is the output:
cat error.log
{"optype":"RemovePlaylist","errors":["Playlist not found. Operation RemovePlaylist failed."],"id":"1"}
{"optype":"RemovePlaylist","errors":["Playlist not found. Operation RemovePlaylist failed."],"id":"3"}
{"optype":"AddSong","errors":["Song not found. Operation AddSong failed."],"playlist_id":"1","song_id":"100"}
{"optype":"RemovePlaylist","errors":["Playlist not found. Operation RemovePlaylist failed."],"id":"1"}
{"optype":"RemovePlaylist","errors":["Playlist not found. Operation RemovePlaylist failed."],"id":"3"}
{"optype":"RemovePlaylist","errors":["Playlist not found. Operation RemovePlaylist failed."],"id":"3"}
{"optype":"AddSong","errors":["Playlist not found. Operation AddSong failed."],"playlist_id":"200","song_id":"1"}
{"optype":"AddSong","errors":["Playlist not found. Operation AddSong failed."],"playlist_id":null,"song_id":"1"}
{"optype":"RemovePlaylist","errors":["Playlist not found. Operation RemovePlaylist failed."],"id":"1"}
{"optype":"RemovePlaylist","errors":["Playlist not found. Operation RemovePlaylist failed."],"id":"3"}
{"optype":"RemovePlaylist","errors":["Playlist not found. Operation RemovePlaylist failed."],"id":"3"}
{"optype":"AddSong","errors":["Playlist not found. Operation AddSong failed."],"playlist_id":"3","song_id":"5"}
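Since the log holds one JSON object per line, it can be summarized with a few lines of Ruby. The following is a sketch only; summarize_errors is a hypothetical helper, and the sample lines are copied from the log above.

```ruby
require 'json'

# Hypothetical helper: counts failed instructions per operation type in
# an error log that holds one JSON object per line.
def summarize_errors(lines)
  counts = Hash.new(0)
  lines.each do |line|
    next if line.strip.empty?
    counts[JSON.parse(line)['optype']] += 1
  end
  counts
end

sample = <<~'LOG'
  {"optype":"RemovePlaylist","errors":["Playlist not found. Operation RemovePlaylist failed."],"id":"1"}
  {"optype":"AddSong","errors":["Song not found. Operation AddSong failed."],"playlist_id":"1","song_id":"100"}
  {"optype":"RemovePlaylist","errors":["Playlist not found. Operation RemovePlaylist failed."],"id":"3"}
LOG

summarize_errors(sample.each_line).each do |optype, count|
  puts "#{optype}: #{count}"
end
```

For a real log file, the same helper could be called as summarize_errors(File.foreach('error.log')).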
7. Results
As requested, the results are available in the output.json file. The file contains no extra whitespace, to save space, but here we provide a pretty-printed version:
{
"users": [
{
"id": "1",
"name": "Albin Jaye"
},
{
"id": "2",
"name": "Dipika Crescentia"
},
{
"id": "3",
"name": "Ankit Sacnite"
},
{
"id": "4",
"name": "Galenos Neville"
},
{
"id": "5",
"name": "Loviise Nagib"
},
{
"id": "6",
"name": "Ryo Daiki"
},
{
"id": "7",
"name": "Seyyit Nedim"
}
],
"songs": [
{
"id": "1",
"artist": "Camila Cabello",
"title": "Never Be the Same"
},
{
"id": "2",
"artist": "Zedd",
"title": "The Middle"
},
{
"id": "3",
"artist": "The Weeknd",
"title": "Pray For Me"
},
{
"id": "4",
"artist": "Drake",
"title": "God's Plan"
},
{
"id": "5",
"artist": "Bebe Rexha",
"title": "Meant to Be"
},
{
"id": "6",
"artist": "Imagine Dragons",
"title": "Whatever It Takes"
},
{
"id": "7",
"artist": "Maroon 5",
"title": "Wait"
},
...
},
{
"id": "20",
"artist": "Taylor Swift",
"title": "Delicate"
},
{
"id": "21",
"artist": "Calvin Harris",
"title": "One Kiss"
},
{
"id": "22",
"artist": "Ed Sheeran",
"title": "Perfect"
},
{
"id": "23",
"artist": "Meghan Trainor",
"title": "No Excuses"
},
{
"id": "24",
"artist": "Niall Horan",
"title": "On The Loose"
},
{
"id": "25",
"artist": "Halsey",
"title": "Alone"
},
{
"id": "26",
"artist": "Charlie Puth",
"title": "Done For Me"
},
...
],
"playlists": [
{
"id": "1",
"user_id": "1",
"song_ids": [
"1",
"6"
]
},
{
"id": "2",
"user_id": "2",
"song_ids": [
"1",
"7"
]
},
{
"id": "3",
"user_id": "1",
"song_ids": [
"1",
"2",
"3",
"4",
"5",
"6"
]
}
]
}
Note: Some lines are omitted.
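A pretty-printed view like the one above can be produced from the compact output with the standard JSON library. This is a small sketch; pretty_json is a hypothetical helper name, and the sample string is a shortened playlist entry from the results above.

```ruby
require 'json'

# Hypothetical helper: re-indents a compact JSON string for reading.
def pretty_json(compact)
  JSON.pretty_generate(JSON.parse(compact))
end

puts pretty_json('{"playlists":[{"id":"1","user_id":"1","song_ids":["1","6"]}]}')
```

For the real file, the argument could be File.read('output.json').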
8. Conclusion
The proposed solution is a tiny framework that allows you to add more functionality. Our goal was also to provide proper testing ergonomics. The code can be tested easily: there is an explicit convention in place, and dependent objects can be stubbed or mocked, which makes testing simpler. We have also used the Dependency Injection pattern (among others) to minimize code coupling.
9. Thank you
If you have more questions, please feel free to reach out.