There are a lot of tests for both the Java and SQL code, plus some dependencies. A fully working Gradle project (JDK 8 required) can be found at: https://github.com/piotrturski/wh-task
Algorithms' space and time complexities are documented in the Javadocs. Some of them use expected and amortized time, as they rely on a hash map and a Fibonacci heap.
SQL solutions are in src/main/resources. They were tested on MySQL 5.7.17. To run the SQL tests (written in Java), a MySQL database is needed. The easiest way to set everything up (port, user, password, db name) is to use Docker:

docker run -p 3306:3306 --rm -e MYSQL_ALLOW_EMPTY_PASSWORD=true -e MYSQL_DATABASE=db mysql:5.7.17

If the above values can't be used, they can be changed on the Java side in the DbTest.java file.
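For reference, a hypothetical sketch of the kind of values DbTest.java has to agree on with the Docker command above (the actual field names and structure in the repository may differ):

    // hypothetical names; the real DbTest.java may organize this differently
    static final String URL = "jdbc:mysql://localhost:3306/db"; // port and db name from the docker command
    static final String USER = "root";
    static final String PASSWORD = "";                          // MYSQL_ALLOW_EMPTY_PASSWORD=true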
To run all tests: ./gradlew test
Dependencies:

compile 'com.google.guava:guava:20.0'
compile 'org.neo4j:neo4j-graph-algo:3.1.0'
compile 'one.util:streamex:0.6.4'
compile 'org.apache.spark:spark-core_2.11:2.0.2'
testCompile 'junit:junit:4.12'
testCompile 'com.googlecode.zohhak:zohhak:1.1.1'
testCompile 'org.assertj:assertj-core:3.5.2'
testCompile 'org.springframework:spring-jdbc:4.3.5.RELEASE'
testCompile 'mysql:mysql-connector-java:5.1.40'
testCompile "org.flywaydb:flyway-core:4.0.3"
Two solutions are provided to handle different situations:
There is a simple in-memory streaming solution (see the sketch after this list):
- take a Reader as the input parameter (so anything can be used: a file, a network stream, in-memory data)
- read the input line by line
- update a counter for each label on the fly (storing only unique labels in memory)
- later use a Fibonacci heap to extract the top k labels (it benefits when k << n while maintaining optimal asymptotic complexity for any value of k)
- return a stream allowing the client to take as many labels as needed
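A minimal sketch of that approach, with illustrative names (TopLabelsSketch and topLabels are not the repository's actual API) and a JDK PriorityQueue standing in for the Fibonacci heap:

    import java.io.BufferedReader;
    import java.io.Reader;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;
    import java.util.stream.Stream;

    class TopLabelsSketch {

        static Stream<String> topLabels(Reader input, int k) {
            // read line by line, keeping one counter per unique label in memory
            Map<String, Long> counts = new HashMap<>();
            new BufferedReader(input).lines()
                    .forEach(label -> counts.merge(label, 1L, Long::sum));

            // max-heap ordered by count; a PriorityQueue stands in for the Fibonacci heap
            PriorityQueue<Map.Entry<String, Long>> heap =
                    new PriorityQueue<>((a, b) -> Long.compare(b.getValue(), a.getValue()));
            heap.addAll(counts.entrySet());

            // lazily pop labels by decreasing count; the client takes as many as it needs
            return Stream.generate(heap::poll)
                    .limit(Math.min(k, counts.size()))
                    .map(Map.Entry::getKey);
        }
    }

For example, topLabels(new StringReader("a\nb\na"), 2) yields "a" followed by "b".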
Use Spark (see the sketch after this list):
- take an RDD as input (so an in-memory collection or any Hadoop-supported location can be used: HDFS, S3, HBase, etc.) and the number of top labels we want to get
- count by label and sort the unique labels
- filter to keep only the top k
- return an RDD so the client can process it however they want (store it on Hadoop, continue processing, etc.)
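A minimal sketch of those steps, again with illustrative names rather than the repository's actual API:

    import org.apache.spark.api.java.JavaRDD;
    import scala.Tuple2;

    class SparkTopLabelsSketch {

        static JavaRDD<Tuple2<String, Long>> topLabels(JavaRDD<String> labels, int k) {
            return labels
                    .mapToPair(label -> new Tuple2<>(label, 1L))   // one (label, 1) pair per occurrence
                    .reduceByKey(Long::sum)                        // count by label
                    .mapToPair(t -> new Tuple2<>(t._2(), t._1()))  // (count, label) so sortByKey orders by count
                    .sortByKey(false)                              // descending
                    .zipWithIndex()                                // rank each (count, label) pair
                    .filter(ranked -> ranked._2() < k)             // keep only the top k ranks
                    .map(ranked -> new Tuple2<>(ranked._1()._2(), ranked._1()._1())); // back to (label, count)
        }
    }

Filtering by rank (rather than collecting with take) keeps the result as an RDD, so the caller can keep processing it on the cluster or store it anywhere Hadoop can write.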
It's tested using Spark's concurrent local mode.
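A hypothetical test setup in that spirit (the repository's actual test code may differ), reusing the SparkTopLabelsSketch above with a multi-threaded local master:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    class LocalSparkModeExample {
        public static void main(String[] args) {
            // "local[4]" runs Spark inside this JVM with 4 worker threads
            SparkConf conf = new SparkConf().setAppName("top-labels-test").setMaster("local[4]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> labels = sc.parallelize(Arrays.asList("a", "b", "a", "c", "a", "b"));
                SparkTopLabelsSketch.topLabels(labels, 2)
                        .collect()                     // (a,3) and (b,2)
                        .forEach(System.out::println);
            }
        }
    }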
It reads from the sometbl table and writes to the result_table table.
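As an example of consuming that output, a hypothetical check (not part of the repository's tests) that prints result_table from the dockerized MySQL instance via the spring-jdbc dependency listed above:

    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.jdbc.datasource.DriverManagerDataSource;

    class ResultTableCheck {
        public static void main(String[] args) {
            // connection values match the docker command: db "db", user root, empty password
            DriverManagerDataSource ds = new DriverManagerDataSource(
                    "jdbc:mysql://localhost:3306/db", "root", "");
            new JdbcTemplate(ds).queryForList("SELECT * FROM result_table")
                    .forEach(System.out::println);
        }
    }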
It seems like a good real-world scenario would be to use it as a 'before' trigger to normalize data on the fly as it is inserted into a table.