/catalog-data

Repo which holds some panda solutions and papers

Primary LanguageJupyter Notebook

Catalog Data

This repo contains the source JSON files used to build the CAIDA resource catalog ( catalog.caida.org)'s database.

The Makefile supports the following commands:

  • run (default): this will build the data files with up-to-date dates and un-indented output
  • fast : this will build the data files without up-to-date dates and un-indented output
  • read,readable : this will build the data files without up-to-date dates and indented output this version is more human readable output
  • clean : this will remove all the placeholder files

how to links

code

Generate the database json files by typing make on the command line.
If you would like to contribute to the catalog:

Overview of these files

The catalog is built from three sources: CAIDA's pubdb database, local json files in sources dirctory, and markdown files stored in the catalog-data-caida repository. The pubdb is a seperate database and we parse a static dump here PANDA-Papers-json.pl.json and PANDA-Presentations-json.pl.json. The raw caida metadata files are kept in catalog-data-caida to allow a different set of permissions.

The original and convert JSON files are parsed by data-build.py are then used to generate:

  • id_object.json : a id to object dictionary, it has all the object data execpt for links
  • id_id_link.json : stores the link files in {from,to,label} nested dictionaries
  • word_score_id.json : dictionary for each word of scores and id pairs.
  • types_ids.json : mapps the object types and the ids of that type
  • suggestions.json : a list of suggestions for 0 results queries copied from data/suggestions.json The order of suggestions in the file, is the 'default order'
  • status_word_id.json : a status to id (of object) dictionary.

These are made by two scripts:

  • scripts/pubdb_placeholder.py : creates the pubdb objects
  • scripts/data-build.py : creates the id_object.json, id_id_link.json, and word_score.id objects

Build the catalog data

1. Start your Environment

Windows

If you are on a Windows system, you will need to install a Linux terminal for Windows.

Then run the following commands:

sudo apt install make 
sudo apt install python3-pip
pip3 install -r requirements.txt 
python3 -m nltk.downloader all

Unix/Mac

You will need to have b4 installed in python3. You can do this using virtualenv.

python3 -m venv env
source env/bin/activate
pip3 install -r requirements.txt
download NLTK

https://www.nltk.org/data.html (own-1.4)

2. Get catalog-data-caida (if you have access, else skip to 3)

All the CAIDA datasets are stored in catalog-data-caida, which has restricted access. If you have access to this repo you should clone it, if not then its IDs will be supplied by data/data_id___caida.json.

a. Make a new clone

## clone the repo if you do not have it (inside of catalog-data, as a sub directory)
clone git@github.com:CAIDA/catalog-data-caida.git

b. Or if you already have the catalog-data-caida repo cloned, update the catalog-data-caida

cd catalog-data-caida
git checkout v1
git pull

3. Run make

This will be handled by the Makefile.

make
# If you haven't activated virtualenv yet
# You need only do this once per shell
source env/bin/activate 

make

You can also do make clean to remove the pubdb files and id_* files.

Notes on catalog-data-caida usage

If you do have access to catalog-data-caida and may often update both repos, try making a symlink between your catalog-data and catalog-data-caida repositories so that you only keep 1 copy of each repository. For example, your directory structure may look like this:

catalog-data
 |-> catalog-data-caida -> ../catalog-data-caida
catalog-data-caida