/cve-database-ingestion

A take-home exercise for ingesting data from the NVD feeds and the OSV vulnerability database.

Primary language: Go

Take-home exercise - konvu

In this take-home exercise, I was asked to write a program that can create and update a database of vulnerabilities and their corresponding affected Java packages (so, primarily, packages from the Maven ecosystem). The CVE feeds are updated here, and those vulnerabilities may also be indexed in the OSV database.

Methodology

I will briefly explain my thought process for this project.

  • First, I had to get the NVD feeds. I was explicitly asked to analyze the feeds for the years 2023 and 2024, so I downloaded the archives from the link given above. This was not my first approach, though: since I saw that the feeds were going to be migrated to an API, I tried to implement a search-after / pagination request to gather all the feeds from a start date to an end date. I soon realized this would take a lot of time, and I had been told the test shouldn't take more than 3 hours, so I went ahead and just downloaded the feeds.

  • The feeds from NVD cover not only packages but also other applications, which do not concern us. The only fields common to the NVD database and the OSV database are either the IDs, which can be a GHSA or OSV id, or even a CVE id if it exists among the aliases, or the commit SHA of the affected package; an entry can have either one of them or both. (I'm guessing OSV is backed by some sort of Elasticsearch/OpenSearch-ish database.) The problem is that the JSON from the feeds is very nested, and these fields sometimes appear in different parts of the JSON. So, to simplify extracting information from the JSON, I used an LLM, namely Llama-3 (8B parameters); a sketch of that extraction call is included after this list. In general, when parsing very dense JSON, an LLM seems like a good choice, as they have become quite good at understanding JSON.

  • Luckily, the OSV database API is not rate limited. So the moment we receive the extracted information, we can pass it on to the OSV API, search whether that package exists, extract the affected package ecosystem, version ranges, etc., and filter on the desired ecosystem, in this case Maven (a lookup sketch is included after this list).

  • Once that information is gathered, we will ideally have a CVE id and its corresponding affected Maven packages and versions in a JSON file. Note: I did not use an actual database, just a JSON file called pkg_info.json (an illustrative record shape is sketched after this list).
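
To make the extraction step concrete, here is a minimal sketch of how one raw feed entry could be sent to Groq's OpenAI-compatible chat completions endpoint and the structured fields read back. The package name, model id, prompt, and the ExtractedInfo shape are illustrative assumptions, not the exact code in this repository.

```go
// Hedged sketch of extracting IDs / commit SHAs from one NVD feed entry with
// an LLM. The endpoint and request shape follow Groq's OpenAI-compatible chat
// completions API; the model id and prompt are illustrative assumptions.
package ingest

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// ExtractedInfo is a hypothetical shape for what we ask the model to return.
type ExtractedInfo struct {
	CVEID      string   `json:"cve_id"`
	Aliases    []string `json:"aliases"`     // GHSA / OSV ids, if present
	CommitSHAs []string `json:"commit_shas"` // commits referenced by the entry
}

func extractWithLLM(rawEntry json.RawMessage) (*ExtractedInfo, error) {
	// Error from Marshal ignored for brevity; the inputs here are always serializable.
	reqBody, _ := json.Marshal(map[string]any{
		"model": "llama3-8b-8192", // assumed Groq model id for Llama-3 8B
		"messages": []map[string]string{
			{"role": "system", "content": "Extract cve_id, aliases and commit_shas from the given NVD JSON. Reply with JSON only."},
			{"role": "user", "content": string(rawEntry)},
		},
	})

	req, err := http.NewRequest("POST", "https://api.groq.com/openai/v1/chat/completions", bytes.NewReader(reqBody))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GROQ_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Only the first choice's message content is needed here.
	var out struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	if len(out.Choices) == 0 {
		return nil, fmt.Errorf("no choices in response")
	}

	var info ExtractedInfo
	if err := json.Unmarshal([]byte(out.Choices[0].Message.Content), &info); err != nil {
		return nil, err
	}
	return &info, nil
}
```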
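
Once an id or commit SHA is extracted, it can be resolved against OSV. The sketch below follows the documented OSV v1 endpoints (POST /v1/query for a commit, GET /v1/vulns/{id} for a GHSA/OSV id) and then keeps only Maven packages; struct and function names are illustrative, not the repository's actual code.

```go
// Hedged sketch of resolving extracted identifiers against the OSV API and
// keeping only Maven packages.
package ingest

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// OSVVuln mirrors the parts of an OSV record this pipeline cares about.
type OSVVuln struct {
	ID       string   `json:"id"`
	Aliases  []string `json:"aliases"`
	Affected []struct {
		Package struct {
			Ecosystem string `json:"ecosystem"`
			Name      string `json:"name"`
		} `json:"package"`
		Ranges []struct {
			Type   string              `json:"type"`
			Events []map[string]string `json:"events"` // introduced / fixed
		} `json:"ranges"`
		Versions []string `json:"versions"`
	} `json:"affected"`
}

// queryByCommit asks OSV which vulnerabilities reference a given commit SHA.
func queryByCommit(sha string) ([]OSVVuln, error) {
	body, _ := json.Marshal(map[string]string{"commit": sha})
	resp, err := http.Post("https://api.osv.dev/v1/query", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out struct {
		Vulns []OSVVuln `json:"vulns"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Vulns, nil
}

// getByID fetches a single OSV record by its id (e.g. a GHSA id).
func getByID(id string) (*OSVVuln, error) {
	resp, err := http.Get("https://api.osv.dev/v1/vulns/" + id)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var v OSVVuln
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		return nil, err
	}
	return &v, nil
}

// mavenPackages keeps only the affected entries that belong to the Maven ecosystem.
func mavenPackages(v OSVVuln) []string {
	var names []string
	for _, aff := range v.Affected {
		if aff.Package.Ecosystem == "Maven" {
			names = append(names, aff.Package.Name)
		}
	}
	return names
}
```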
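
The actual schema of pkg_info.json is not documented here, so the record below is only a guess at what one entry might look like; the only part taken from the description above is the idea of a flat JSON file standing in for a database.

```go
// Hedged sketch of the flat-file "database". PackageRecord is an illustrative
// guess, not the repository's real pkg_info.json format.
package ingest

import (
	"encoding/json"
	"os"
)

// PackageRecord ties one CVE to the Maven packages and versions OSV reports as affected.
type PackageRecord struct {
	CVEID    string   `json:"cve_id"`
	Packages []string `json:"packages"`          // e.g. "com.fasterxml.jackson.core:jackson-databind"
	Versions []string `json:"affected_versions"` // or version ranges, depending on the OSV record
}

// appendRecord rewrites pkg_info.json with the new record added.
// A real database would avoid re-reading the whole file on every write.
func appendRecord(path string, rec PackageRecord) error {
	var records []PackageRecord
	if data, err := os.ReadFile(path); err == nil {
		_ = json.Unmarshal(data, &records) // start fresh if the file is empty or malformed
	}
	records = append(records, rec)

	out, err := json.MarshalIndent(records, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, out, 0o644)
}
```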

Tech choices

  • I made this project in Go because of its concurrency patterns.
  • The LLM is served by an inference engine called Groq, since they have very generous rate limits in the free tier and it is extremely fast: ~0.26 seconds per request on average, which for a model like Llama-3 is extraordinary!

Special tools and libraries

  • I mostly stuck to the Go standard library, except for a scheduler package called gocron, used to start an update job every 2 hours, since that is roughly the period in which the NVD database gets updated (a minimal sketch follows).
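
As a rough illustration of the update job, here is a minimal sketch assuming the v1 API of github.com/go-co-op/gocron (the README does not pin the exact gocron package or version); updateFeeds is a stand-in for the real update logic.

```go
// Hedged sketch of the 2-hour update job, assuming the v1 API of
// github.com/go-co-op/gocron.
package main

import (
	"log"
	"time"

	"github.com/go-co-op/gocron"
)

// updateFeeds is a stand-in for the real "fetch recently modified NVD entries" job.
func updateFeeds() {
	log.Println("pulling recently modified CVEs from NVD...")
}

func main() {
	s := gocron.NewScheduler(time.UTC)

	// NVD republishes its data roughly every two hours, so poll on the same cadence.
	if _, err := s.Every(2).Hours().Do(updateFeeds); err != nil {
		log.Fatal(err)
	}

	// Block forever; the binary is meant to run as a long-lived updater.
	s.StartBlocking()
}
```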

Installation

Make sure to have Go installed

Then run:

go build

Get your Groq API key here. It should be very straightforward; creating an account should do the job. However, if you want to use my API key, please shoot me an email and I'll send it over :)

Check the .env-dist file and fill in the necessary variables in a .env file:

DATA_PATH=
GROQ_API_KEY=

The DATA_PATH is the folder containing the feeds from 2023 and 2024 in JSON format (they have to be downloaded in order to test the code, since they are relatively large files for GitHub), and the GROQ_API_KEY is, well, the Groq API key.
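
For reference, here is a minimal sketch of how the yearly feed files under DATA_PATH could be read, assuming the NVD JSON 1.1 feed layout with a top-level CVE_Items array; the package and function names are illustrative.

```go
// Hedged sketch of reading the downloaded yearly feeds from DATA_PATH,
// assuming the NVD 1.1 feed layout where each file carries a top-level
// "CVE_Items" array.
package ingest

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// loadFeedItems returns every CVE entry found in *.json files under dataPath,
// kept as raw JSON so a later step (e.g. the LLM extraction) can parse them.
func loadFeedItems(dataPath string) ([]json.RawMessage, error) {
	files, err := filepath.Glob(filepath.Join(dataPath, "*.json"))
	if err != nil {
		return nil, err
	}

	var items []json.RawMessage
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			return nil, err
		}
		var feed struct {
			CVEItems []json.RawMessage `json:"CVE_Items"`
		}
		if err := json.Unmarshal(data, &feed); err != nil {
			return nil, err
		}
		items = append(items, feed.CVEItems...)
	}
	return items, nil
}
```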

To ingest current data from 2023 - 2024:

./nvdbase -c ingest

To launch a cron job that updates the last modified feeds from NVD:

./nvdbase -c update

Challenges faced

  • The Groq API, in spite of having a generous free tier, is very limiting for a large and continuous database feed like the NVD, so I was rate limited quite a few times and had to wait.
  • There are significantly fewer Maven packages compared to, say, PyPI, so even to test whether my code was working, I first had to run it against the PyPI ecosystem and check that the JSON file was being written properly (I included some test results for the PyPI package ecosystem in a file called pkg_info_pypi_test.json), as I was rate limited before I even hit my first Maven package search.
  • Since I initially spent time exploring the NVD feeds API, I lost some time before I started coding and could not finish this test in the given time frame of 2-3 hours (it took me roughly 5 hours and 20 minutes in total).

Potential improvements

This code is by no means perfect and could use some improvements if this project scales up. Here are some things that I think would be nice:

  • If a Mistral/OpenAI API key is available, integrating the tool-calling / function-calling capability of LLMs to directly call the OSV endpoint automatically would be a plus!
  • Actually implementing the pagination search against the NVD feeds API so the ingestion is not solely based on the downloaded webpage archives.
  • Parameterizing the ecosystem variable to run this ingestion pipeline against many different types of packages.

Final thoughts

I really enjoyed doing this take-home exercise, and I would definitely love to hear your feedback on how I did and where I can improve.