GitLaw ⚖️
- get access to source code
- Prerequisite
- install Java 8 and Scala 2.11 (Scala 2.12 runs into compatibility issues)
- install Spark and Cassandra on your local machine or on remote servers
- install sbt for dependency management and packaging/shipping
- git clone this repo
- make your contribution
- open a pull request
- ?
- profit
- Version control for laws and legislation that helps make the creation and passage of legislation a more transparent process.
- Demo slides
➡️Data sources:
🔄Processing:
- get files into S3 using the Congress web scraper:
- follow the steps to start the Spark batch job that processes data from S3 into Cassandra and creates the database schema
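
As a rough illustration of the S3 -> Cassandra batch step, here is a minimal sketch using the DataStax Spark Cassandra Connector. The bucket path, the `gitlaw` keyspace, the `bills` table, and the `Bill` case class are all hypothetical names chosen for this example, not GitLaw's actual schema. It also assumes the connector and an S3 filesystem implementation (for `s3a://` paths) are on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

// Hypothetical row shape for illustration; not the real GitLaw schema.
case class Bill(id: String, body: String)

object S3ToCassandra {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val conf = new SparkConf()
      .setAppName("gitlaw-batch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed local Cassandra
    val sc = new SparkContext(conf)

    // Create the (hypothetical) keyspace and single table if they do not exist yet.
    CassandraConnector(conf).withSessionDo { session =>
      session.execute(
        "CREATE KEYSPACE IF NOT EXISTS gitlaw WITH replication = " +
          "{'class': 'SimpleStrategy', 'replication_factor': 1}")
      session.execute(
        "CREATE TABLE IF NOT EXISTS gitlaw.bills (id text PRIMARY KEY, body text)")
    }

    // wholeTextFiles yields one (path, content) pair per file.
    // Here the file name is used as the row id and the raw content as the body.
    val bills = sc.wholeTextFiles("s3a://example-bucket/bills/") // hypothetical bucket
      .map { case (path, content) => Bill(path.split('/').last, content) }

    // Case class fields are mapped to the Cassandra columns of the same name.
    bills.saveToCassandra("gitlaw", "bills")

    sc.stop()
  }
}
```

A job like this would typically be packaged with sbt and launched with `spark-submit`.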
⬅️Output:
- up-to-date U.S. Laws and Legislation in Cassandra
- public facing API
- a minimal web UI
- Cleaning and aggregating data from various formats
- Optimizing Spark performance with a custom partitioner
- Integrating a diffing algorithm
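
To make the diffing item concrete, here is a minimal line-based diff built on the longest common subsequence (LCS). This README does not say which diffing algorithm GitLaw integrates, so treat this as an illustrative sketch of the general technique; the `LineDiff` object and its `Keep`/`Add`/`Remove` types are names made up for the example.

```scala
object LineDiff {
  sealed trait Change
  case class Keep(line: String) extends Change
  case class Add(line: String) extends Change
  case class Remove(line: String) extends Change

  /** Line-based diff of two versions of a text, using an LCS table. */
  def diff(oldLines: Vector[String], newLines: Vector[String]): List[Change] = {
    val n = oldLines.length
    val m = newLines.length

    // lcs(i)(j) = length of the LCS of oldLines.drop(i) and newLines.drop(j)
    val lcs = Array.ofDim[Int](n + 1, m + 1)
    for (i <- n - 1 to 0 by -1; j <- m - 1 to 0 by -1) {
      lcs(i)(j) =
        if (oldLines(i) == newLines(j)) lcs(i + 1)(j + 1) + 1
        else math.max(lcs(i + 1)(j), lcs(i)(j + 1))
    }

    // Walk the table from the start, emitting one change per step.
    val out = List.newBuilder[Change]
    var i = 0
    var j = 0
    while (i < n && j < m) {
      if (oldLines(i) == newLines(j)) {
        out += Keep(oldLines(i)); i += 1; j += 1
      } else if (lcs(i + 1)(j) >= lcs(i)(j + 1)) {
        out += Remove(oldLines(i)); i += 1
      } else {
        out += Add(newLines(j)); j += 1
      }
    }
    // Whatever is left on either side is a pure removal or addition.
    while (i < n) { out += Remove(oldLines(i)); i += 1 }
    while (j < m) { out += Add(newLines(j)); j += 1 }
    out.result()
  }
}
```

For example, `LineDiff.diff(Vector("a", "b", "c"), Vector("a", "c", "d"))` returns `List(Keep("a"), Remove("b"), Keep("c"), Add("d"))`. The table costs O(n·m) time and memory, which is acceptable for a sketch but may need a more efficient algorithm for very long bills.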
- NoSQL vs SQL
- Because the law schema rarely changes, GitLaw can live with a single table, which eliminates join operations and improves performance.
- With a single table, what I value most is retrieval speed and ease of scaling.
- Overall, ease of use was the main concern when building a 4-week MVP.
- Spark RDD vs DataFrame
- In hindsight, a DataFrame, being a table-structured data object, would have been a better choice for working with JSON-formatted files.
- However, as far as I could tell from my research, RDD operations are well documented, thanks to DataStax.
- Ease of development was still the main concern: the RDD, despite being a less suitable data object for my use case, let me shorten development time thanks to its documentation.
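
To illustrate the tradeoff, the sketch below loads the same JSON files both ways. It assumes Spark 2.x (for `SparkSession`) and a hypothetical local path; it is not taken from the GitLaw code base.

```scala
import org.apache.spark.sql.SparkSession

object JsonLoading {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gitlaw-json-loading")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()

    val path = "data/bills/*.json" // hypothetical input location

    // DataFrame route: Spark infers a tabular schema from the JSON fields.
    // By default spark.read.json expects one JSON object per line.
    val df = spark.read.json(path)
    df.printSchema()

    // RDD route: each file arrives as a raw string, and parsing it
    // into fields is left to the application code.
    val rdd = spark.sparkContext.wholeTextFiles(path)
    rdd.map { case (file, content) => (file, content.length) }
      .take(5)
      .foreach(println)

    spark.stop()
  }
}
```

With the DataFrame route the schema comes for free, while the RDD route leaves the JSON parsing and field extraction to hand-written code.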