`app.js` is the file responsible for coordinating execution of all the services. `basic-server.js` is the file responsible for serving web pages. For this application to be fully functional, two distinct Node.js processes need to run simultaneously: one process runs the web server, and another runs the GitHub API key detection services. In other words, both `app.js` and `basic-server.js` must be running at the same time for the application to be fully functional.
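For example, to start both processes (a minimal sketch, assuming both files live in the project root and each command is run in its own terminal):

```
node app.js           # GitHub API key detection services
node basic-server.js  # web server
```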
- Scrape Git repository metadata
- Download Git repositories based on meta data
- Parse Git repository content to detect security flaws
The scraping service is responsible for downloading metadata for the most recently updated GitHub repositories using the GitHub API. It then persists that repository metadata to a MongoDB data store. After the scraping service finishes writing repository metadata to MongoDB, the downloading service takes over and downloads the actual repositories associated with the metadata that the scraping service acquired.
The current query for this API call is:
```
https://api.github.com/search/repositories?q=pushed:>=' + dateString + '&order=desc&per_page=100
```
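For illustration, a minimal sketch of issuing that query with Node's built-in `https` module. The one-day lookback window for `dateString` and the persistence step are assumptions, not the service's actual implementation:

```js
var https = require('https');

// Assumed window: repositories pushed within the last 24 hours.
var date = new Date(Date.now() - 24 * 60 * 60 * 1000);
var dateString = date.toISOString().split('T')[0]; // e.g. "2016-03-01"

https.get({
  hostname: 'api.github.com',
  path: '/search/repositories?q=' + encodeURIComponent('pushed:>=' + dateString) +
        '&order=desc&per_page=100',
  headers: { 'User-Agent': 'api-key-detection' } // GitHub's API requires a User-Agent
}, function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    var items = JSON.parse(body).items; // one metadata object per repository
    // ...persist items to the MongoDB metadata collection here
  });
});
```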
The downloading service is responsible for downloading the GitHub repositories whose metadata was retrieved by the scraping service. It pulls information from the metadata MongoDB collection. Once the downloading service has the metadata collection, it gets the `git_url` property from each metadata document. It then uses the nodegit module to download the contents of each `git_url` from GitHub. Downloaded repositories are stored in the `git_data` directory. After the downloading service finishes downloading repositories to the `git_data` directory, the parsing service is activated and API key detection begins.
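For illustration, a minimal sketch of cloning a single repository with nodegit. The `downloadRepo` helper and the destination-directory naming are assumptions:

```js
var Git = require('nodegit');
var path = require('path');

// Clone one repository from its metadata document into git_data.
// Deriving the local directory from the repo name is an assumption.
function downloadRepo(meta, done) {
  var dest = path.join(__dirname, 'git_data', meta.name);
  Git.Clone(meta.git_url, dest)
    .then(function (repo) { done(null, repo); })
    .catch(function (err) { done(err); });
}
```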
The parsing service initiates after the downloading service finishes acquiring repositories. When this occurs, the parsing service pulls repositories from the database and uses bash and regex to scan them for API keys. As repositories are downloaded to the database, they are given a `processed` property which is initialized to `false`. Once the parsing service pulls a repository from the database, it immediately sets its `processed` property to `true`. This guarantees that the parsing service never processes any single document more than once.
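A minimal sketch of that claim-then-scan step using the MongoDB driver's `findOneAndUpdate`. The `repodata` collection name and the `scanRepository` helper are assumptions:

```js
// Claim the next unprocessed repository so no other pass ever scans it again.
GLOBAL.db.collection('repodata').findOneAndUpdate(
  { processed: false },           // only repositories not yet scanned
  { $set: { processed: true } },  // mark it processed immediately
  function (err, result) {
    if (err || !result.value) return; // nothing left to process
    scanRepository(result.value);     // hypothetical scanning entry point
  }
);
```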
If the parsing service detects an API key, it records the violation in the `hitdata` MongoDB collection. Once all repositories are scanned, garbage collection occurs via the `fileSystem` subservice, and then the entire cycle of scraping, downloading, and parsing is restarted from the beginning.
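A minimal sketch of the bash + regex scan via `child_process`. The 40-character hex token pattern is an assumption for illustration, not the pattern the service actually uses:

```js
var exec = require('child_process').exec;
var path = require('path');

// Scan one downloaded repository for strings that look like API keys.
function scanRepository(meta, done) {
  var repoDir = path.join(__dirname, 'git_data', meta.name);
  exec('grep -rEn "[0-9a-f]{40}" ' + repoDir, function (err, stdout) {
    // grep exits non-zero when nothing matches, so an "error" with no
    // output simply means the repository is clean.
    if (stdout) {
      GLOBAL.db.collection('hitdata').insertOne({
        repo: meta.name,
        hits: stdout.trim().split('\n') // one file:line:match entry per hit
      });
    }
    if (done) done();
  });
}
```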
Rather than each service creating its own database connection, all services share a single connection. This single connection is established in `app.js` and is accessible to all services via `GLOBAL.db`.
Currently, the services are architected so that MongoDB is the first thing to be initialized, and every service is started inside MongoDB's connection callback. Basing the service architecture on callbacks is not ideal, but it is stable for the current data load and necessary to reach MVP.
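A minimal sketch of that initialization order; the database name and the service entry points are assumptions:

```js
var MongoClient = require('mongodb').MongoClient;

// app.js: establish the single shared connection, then start every
// service inside the connection callback.
MongoClient.connect('mongodb://localhost:27017/keydetector', function (err, db) {
  if (err) throw err;
  GLOBAL.db = db; // one connection, shared by all services
  startScraping(function () {        // hypothetical entry points
    startDownloading(function () {
      startParsing();
    });
  });
});
```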
Ideally, services would use EventEmitters instead of callbacks to communicate with each other asynchronously. Using EventEmitters instead of callbacks would decouple the services and allow for much greater code flexibility. Until MVP is reached, however, the current service architecture will be used.
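For illustration, a minimal sketch of what EventEmitter-based coordination could look like; the event bus, event names, and service stubs are assumptions:

```js
var EventEmitter = require('events').EventEmitter;
var bus = new EventEmitter();

// Hypothetical service stubs; each emits its "...Done" event when finished.
function startScraping()    { /* ...scrape... */   bus.emit('scrapingDone'); }
function startDownloading() { /* ...download... */ bus.emit('downloadingDone'); }
function startParsing()     { /* ...parse... */    bus.emit('parsingDone'); }

// Each service reacts to the previous service's completion event instead
// of being nested inside its callback.
bus.on('scrapingDone', startDownloading);
bus.on('downloadingDone', startParsing);
bus.on('parsingDone', function () {
  // garbage-collect git_data, then restart the whole cycle
  setImmediate(startScraping); // defer so the call stack never grows
});

startScraping();
```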
The downloading and parsing services use the `async` module to run an asynchronous `for` loop over the metadata retrieved from MongoDB.
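A minimal sketch using `async.eachSeries`, reusing the hypothetical `downloadRepo` helper from the sketch above; the `repodata` collection name is an assumption:

```js
var async = require('async');

GLOBAL.db.collection('repodata').find({ processed: false }).toArray(function (err, docs) {
  if (err) throw err;
  async.eachSeries(docs, function (meta, next) {
    downloadRepo(meta, next); // handle one metadata document, then continue
  }, function (err) {
    if (err) console.error(err);
    // every metadata document has now been handled
  });
});
```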
- Add section for outcome and purpose of application as a whole
- Add section for backend
- Add section for frontend