`app.js` is the file responsible for coordinating execution of all the services. `basic-server.js` is the file responsible for serving web pages. For this application to be fully functional, two distinct Node.js processes need to run simultaneously: one process runs the web server, and another runs the GitHub API key detection services. In other words, both `app.js` and `basic-server.js` must be running at the same time for the application to be fully functional.
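For example, to start both processes (a minimal sketch, assuming both files live in the project root and each command is run in its own terminal):

```
node app.js           # GitHub API key detection services
node basic-server.js  # web server
```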
- Scrape Git repository metadata
- Download Git repositories based on meta data
- Parse Git repository content to detect security flaws
The scraping service is responsible for downloading metadata for the most recently updated GitHub repositories using the GitHub API. It then persists that repository metadata to a MongoDB data store. After the scraping service finishes writing repository metadata to MongoDB, the downloading service takes over and downloads the actual repositories associated with the metadata that the scraping service acquired.
The current query for this API call is:
```
https://api.github.com/search/repositories?q=pushed:>=' + dateString + '&order=desc&per_page=100
```
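For illustration, a minimal sketch of issuing that query with Node's built-in `https` module. The one-day lookback window for `dateString` and the persistence step are assumptions, not the service's actual implementation:

```js
var https = require('https');

// Assumed window: repositories pushed within the last 24 hours.
var date = new Date(Date.now() - 24 * 60 * 60 * 1000);
var dateString = date.toISOString().split('T')[0]; // e.g. "2016-03-01"

https.get({
  hostname: 'api.github.com',
  path: '/search/repositories?q=' + encodeURIComponent('pushed:>=' + dateString) +
        '&order=desc&per_page=100',
  headers: { 'User-Agent': 'api-key-detection' } // GitHub's API requires a User-Agent
}, function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    var items = JSON.parse(body).items; // one metadata object per repository
    // ...persist items to the MongoDB metadata collection here
  });
});
```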
The downloading service is responsible for downloading the GitHub repositories whose metadata was retrieved by the scraping service. It pulls information from the metadata MongoDB collection. Once the downloading service has the metadata collection, it gets the `git_url` property from each metadata document. It then uses the nodegit module to download the contents of each `git_url` from GitHub. Downloaded repositories are stored in the `git_data` directory. After the downloading service finishes downloading repositories to the `git_data` directory, the parsing service is activated and API key detection begins.
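For illustration, a minimal sketch of cloning a single repository with nodegit. The `downloadRepo` helper and the destination-directory naming are assumptions:

```js
var Git = require('nodegit');
var path = require('path');

// Clone one repository from its metadata document into git_data.
// Deriving the local directory from the repo name is an assumption.
function downloadRepo(meta, done) {
  var dest = path.join(__dirname, 'git_data', meta.name);
  Git.Clone(meta.git_url, dest)
    .then(function (repo) { done(null, repo); })
    .catch(function (err) { done(err); });
}
```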
The parsing service initiates after the downloading service finishes acquiring repositories. When this occurs, the parsing service pulls repositories from the database and uses bash and regex to scan them for API keys. As repositories are downloaded to the database, they are given a `processed` property which is initialized to `false`. Once the parsing service pulls a repository from the database, it immediately sets its `processed` property to `true`. This guarantees that the parsing service never processes any single document more than once.
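A minimal sketch of that claim-then-scan step using the MongoDB driver's `findOneAndUpdate`. The `repodata` collection name and the `scanRepository` helper are assumptions:

```js
// Claim the next unprocessed repository so no other pass ever scans it again.
GLOBAL.db.collection('repodata').findOneAndUpdate(
  { processed: false },           // only repositories not yet scanned
  { $set: { processed: true } },  // mark it processed immediately
  function (err, result) {
    if (err || !result.value) return; // nothing left to process
    scanRepository(result.value);     // hypothetical scanning entry point
  }
);
```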
If the parsing service detects an API key, it records the violation in the `hitdata` MongoDB collection. Once all repositories are scanned, garbage collection occurs via the `fileSystem` subservice, and then the entire cycle of scraping, downloading, and parsing is restarted from the beginning.
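A minimal sketch of the bash + regex scan via `child_process`. The 40-character hex token pattern is an assumption for illustration, not the pattern the service actually uses:

```js
var exec = require('child_process').exec;
var path = require('path');

// Scan one downloaded repository for strings that look like API keys.
function scanRepository(meta, done) {
  var repoDir = path.join(__dirname, 'git_data', meta.name);
  exec('grep -rEn "[0-9a-f]{40}" ' + repoDir, function (err, stdout) {
    // grep exits non-zero when nothing matches, so an "error" with no
    // output simply means the repository is clean.
    if (stdout) {
      GLOBAL.db.collection('hitdata').insertOne({
        repo: meta.name,
        hits: stdout.trim().split('\n') // one file:line:match entry per hit
      });
    }
    if (done) done();
  });
}
```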
Rather than each service creating its own database connection, all services share a single connection. This single connection is established in `app.js` and is accessible to all services via `GLOBAL.db`.
Currently, the services are architected so that MongoDB is the first thing to be initialized, and every service is started inside MongoDB's connection callback. Basing the service architecture on callbacks is not ideal, but it is stable for the current data load and necessary to reach MVP.
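A minimal sketch of that initialization order; the database name and the service entry points are assumptions:

```js
var MongoClient = require('mongodb').MongoClient;

// app.js: establish the single shared connection, then start every
// service inside the connection callback.
MongoClient.connect('mongodb://localhost:27017/keydetector', function (err, db) {
  if (err) throw err;
  GLOBAL.db = db; // one connection, shared by all services
  startScraping(function () {        // hypothetical entry points
    startDownloading(function () {
      startParsing();
    });
  });
});
```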
Ideally, services would use EventEmitters instead of callbacks to communicate with each other asynchronously. Using EventEmitters instead of callbacks would decouple the services and allow for much greater code flexibility. Until MVP is reached, however, the current service architecture will be used.
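For illustration, a minimal sketch of what EventEmitter-based coordination could look like; the event bus, event names, and service stubs are assumptions:

```js
var EventEmitter = require('events').EventEmitter;
var bus = new EventEmitter();

// Hypothetical service stubs; each emits its "...Done" event when finished.
function startScraping()    { /* ...scrape... */   bus.emit('scrapingDone'); }
function startDownloading() { /* ...download... */ bus.emit('downloadingDone'); }
function startParsing()     { /* ...parse... */    bus.emit('parsingDone'); }

// Each service reacts to the previous service's completion event instead
// of being nested inside its callback.
bus.on('scrapingDone', startDownloading);
bus.on('downloadingDone', startParsing);
bus.on('parsingDone', function () {
  // garbage-collect git_data, then restart the whole cycle
  setImmediate(startScraping); // defer so the call stack never grows
});

startScraping();
```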
The downloading and parsing services use the `async` module to run an asynchronous `for` loop over the metadata retrieved from MongoDB.
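A minimal sketch using `async.eachSeries`, reusing the hypothetical `downloadRepo` helper from the sketch above; the `repodata` collection name is an assumption:

```js
var async = require('async');

GLOBAL.db.collection('repodata').find({ processed: false }).toArray(function (err, docs) {
  if (err) throw err;
  async.eachSeries(docs, function (meta, next) {
    downloadRepo(meta, next); // handle one metadata document, then continue
  }, function (err) {
    if (err) console.error(err);
    // every metadata document has now been handled
  });
});
```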
- Add section for outcome and purpose of application as a whole
- Add section for backend
- Add section for frontend