murakami
To use murakami, start both of its services: the File Import Service and the Query Service.
File Import Service
Reads input files and stores their data in the data/ directory.
Starting the Import Service
java -Dconfig.file=conf/import.conf -jar dist/file-importer-1.0-SNAPSHOT-shaded.jar
To Read a File
Use the importData script in bin/:
bin/importData "file-importer/importData/importAdvanced.psv"
The input argument is the path of the file to import. The file must have a header row and must contain all columns; the columns can appear in any order.
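Because an import needs a header containing every column, it can help to check a file before handing it to bin/importData. The following is a minimal sketch, not part of murakami: it assumes the file is pipe-delimited (suggested by the .psv extension, but not stated here), that header names match the Available Columns list exactly, and it uses a hypothetical helper named missing_columns. The sample row values are invented.

```python
import csv
import io

# Column names taken from the "Available Columns" list in this README.
REQUIRED_COLUMNS = {"STB", "TITLE", "REV", "VIEW_TIME", "DATE", "PROVIDER"}


def missing_columns(text, delimiter="|"):
    """Return the sorted required columns that the header row lacks."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return sorted(REQUIRED_COLUMNS - present)


# Hypothetical sample: the columns are in a different order than the list,
# which the import service is documented to accept.
sample = (
    "TITLE|STB|PROVIDER|DATE|REV|VIEW_TIME\n"
    "the matrix|stb1|provider1|2014-04-01|4.00|1:30\n"
)

print(missing_columns(sample))                  # nothing missing
print(missing_columns("STB|TITLE\nstb1|x\n"))   # incomplete header
```

An empty result means the header is complete; any names returned are the columns that would need to be added before importing.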
Query Service
Queries stored data using a query language that supports selecting, ordering, filtering, grouping, and aggregating.
Starting the Query Service
java -Dconfig.file=conf/query.conf -jar dist/murakami-core-1.0-SNAPSHOT-shaded.jar
To Query the Service
The query options below can be used in any combination to return data.
Available Columns
- STB (String)
- TITLE (String)
- REV (Double)
- VIEW_TIME (Date)
- DATE (Date)
- PROVIDER (String)
Select Queries
A simple select query scans all stored data and returns the specified columns.
bin/query -s $column1,$column2,...
EXAMPLE
bin/query -s STB,TITLE
Ordering Results
Any query can be ordered. Pass any number of columns with -o, and the results are ordered by those columns starting with the first; each following column is used only to break ties.
bin/query -s $column1,$column2,... -o $column1,$column3,...
EXAMPLE
bin/query -s STB,TITLE -o STB
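The tie-breaking rule can be illustrated outside the service. This is a sketch of the described semantics only, not murakami's implementation: the rows are invented, order_by is a hypothetical helper, and ascending order is an assumption because the sort direction is not stated here.

```python
# Hypothetical rows; values are invented for illustration.
rows = [
    {"STB": "stb2", "TITLE": "the hobbit"},
    {"STB": "stb1", "TITLE": "unbreakable"},
    {"STB": "stb1", "TITLE": "the matrix"},
]


def order_by(rows, columns):
    """Sort by the first column; later columns only break ties.

    Ascending order is assumed. Tuples compare element by element,
    which gives exactly the "first column, then tie-breakers" rule.
    """
    return sorted(rows, key=lambda row: tuple(row[c] for c in columns))


# Roughly what "-o STB,TITLE" describes: STB first, TITLE breaks ties.
for row in order_by(rows, ["STB", "TITLE"]):
    print(row["STB"], row["TITLE"])
```

With only STB as the order column, the two stb1 rows tie and no further column decides between them, so their relative order is whatever the underlying sort happens to preserve.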
Advanced Filtering of Results
Queries can be filtered with equality checks on any column, written in the strict format $col="$val". Any number of 'and' and 'or' clauses can be combined, with parentheses for grouping, to build a filter.
bin/query -s $column1,$column2,... -f '$column1="$value1" and ($column2="$value2" or $column1="$value3") ...'
EXAMPLE
bin/query -s STB,TITLE -f 'STB="stb1" or (STB="stb2" and TITLE="the hobbit")'
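To make the and/or grouping in the example concrete, here is the same condition written as a plain Python predicate. It is a sketch of the filter's meaning, not how the query service parses or evaluates filters; the rows are invented and matches is a hypothetical name.

```python
# Hypothetical rows; values are invented for illustration.
rows = [
    {"STB": "stb1", "TITLE": "the matrix"},
    {"STB": "stb2", "TITLE": "the hobbit"},
    {"STB": "stb2", "TITLE": "mighty ducks"},
    {"STB": "stb3", "TITLE": "bruce lee"},
]


def matches(row):
    # Mirrors the example filter:
    #   STB="stb1" or (STB="stb2" and TITLE="the hobbit")
    return row["STB"] == "stb1" or (
        row["STB"] == "stb2" and row["TITLE"] == "the hobbit"
    )


selected = [(row["STB"], row["TITLE"]) for row in rows if matches(row)]
print(selected)
```

Every stb1 row passes, while an stb2 row passes only when its title is "the hobbit"; rows for any other STB are dropped.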
Grouping and Aggregating
Instead of a raw select, a query can group rows by one column and aggregate the remaining columns within each group. To do so, set the group column with -g and specify an aggregate for every other selected column.
Available Aggregates:
- min - Numerical only, finds the minimum value
- max - Numerical only, finds the maximum value
- sum - Numerical only, adds up all values
- count - Any, counts the number of unique values
- collect - Any, creates a list of all unique values
bin/query -s $groupColumn,$column1:agg1,$column2:agg2,... -g $groupColumn
EXAMPLE
bin/query -s STB,TITLE:collect,REV:sum -g STB
OUTPUT
stb1,[the game,unbreakable,the matrix],18.5
stb3,[bruce lee,the matrix],6.0
stb2,[mighty ducks,the hobbit],16.0
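The grouping and aggregate semantics described above can be sketched in a few lines. This is an illustration under stated assumptions, not the query service's code: the rows and REV values are invented (they are not the data behind the output above), group_and_aggregate and unique are hypothetical helpers, and keeping first-seen order for collect is an assumption.

```python
from collections import defaultdict


def unique(values):
    """Drop duplicates while keeping first-seen order (order is assumed)."""
    return list(dict.fromkeys(values))


# Aggregate names follow the list above; the bodies are illustrative only.
AGGREGATES = {
    "min": min,
    "max": max,
    "sum": sum,
    "count": lambda values: len(unique(values)),
    "collect": unique,
}


def group_and_aggregate(rows, group_column, aggregates):
    """Group rows by group_column, then reduce each listed column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_column]].append(row)
    result = []
    for key, members in groups.items():
        out = [key]
        for column, agg in aggregates:
            out.append(AGGREGATES[agg]([m[column] for m in members]))
        result.append(out)
    return result


# Hypothetical rows; the REV values are invented for illustration.
rows = [
    {"STB": "stb1", "TITLE": "the matrix", "REV": 4.0},
    {"STB": "stb1", "TITLE": "unbreakable", "REV": 6.0},
    {"STB": "stb1", "TITLE": "the matrix", "REV": 4.0},
    {"STB": "stb2", "TITLE": "the hobbit", "REV": 8.0},
]

# Mirrors: bin/query -s STB,TITLE:collect,REV:sum -g STB
result = group_and_aggregate(rows, "STB", [("TITLE", "collect"), ("REV", "sum")])
for line in result:
    print(line)
```

Note that collect and count work on unique values, so the repeated "the matrix" row appears once in the collected list, while sum still adds every REV value in the group.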
Future Ideas
- Switch to using column oriented storage for compaction and faster reads
- Parallelize the final combination step
- Order each combiner in parallel before final combination so that sorts are divide-and-conquer
- Don't restrict order of columns, and make taking in new columns seamless
- Metrics and healthchecks