bsbang-crawler

Alpha project for crawling Bioschemas JSON-LD




This is the crawler component of Buzzbang search.

Setup

These instructions are for Linux. Windows is not supported.

1. Create the intermediate crawl database

./setup/bsbang-setup-sqlite.py <path-to-crawl-db>

Example:

./setup/bsbang-setup-sqlite.py data/crawl.db
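
To confirm the database was created, you can list its tables from Python. This is just a sanity-check sketch; the exact table names depend on the schema that bsbang-setup-sqlite.py defines, so it simply enumerates whatever is there:

import sqlite3

# Open the intermediate crawl database created above.
conn = sqlite3.connect('data/crawl.db')

# Enumerate the tables the setup script created.
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)

conn.close()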

2. Queue URLs for Bioschemas JSON-LD extraction, either by adding them directly or by crawling sitemaps

./bsbang-crawl.py <path-to-crawl-db> <location>

The location can be:

  • a sitemap (e.g. http://beta.synbiomine.org/synbiomine/sitemap.xml)
  • a webpage (e.g. http://identifiers.org or file://test/examples/FAIRsharing.html)
  • a path to a file of locations (e.g. conf/default-targets.txt), in which case every location listed in that file is crawled

Example:

./bsbang-crawl.py data/crawl.db conf/default-targets.txt
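
For illustration, crawling a sitemap boils down to fetching the XML and queueing every <loc> URL it lists. The sketch below shows the idea using only the standard library; it is not the crawler's actual code, and the sitemap URL is just the example from above:

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

# Fetch the sitemap and collect the page URLs it lists, roughly what
# bsbang-crawl.py does before queueing pages for extraction.
with urllib.request.urlopen('http://beta.synbiomine.org/synbiomine/sitemap.xml') as resp:
    root = ET.fromstring(resp.read())

urls = [loc.text for loc in root.iter(SITEMAP_NS + 'loc')]
print(len(urls), 'URLs found')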

3. Extract Bioschemas JSON-LD from the queued webpages and insert it into the crawl database.

./bsbang-extract.py <path-to-crawl-db>
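
Conceptually, this step pulls <script type="application/ld+json"> blocks out of each queued page. The sketch below illustrates that idea; it is not bsbang-extract.py's actual implementation, and it assumes the beautifulsoup4 package is installed:

import json
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def extract_jsonld(html):
    """Return the JSON-LD blocks embedded in an HTML page."""
    soup = BeautifulSoup(html, 'html.parser')
    blocks = []
    for tag in soup.find_all('script', type='application/ld+json'):
        try:
            blocks.append(json.loads(tag.string))
        except (TypeError, ValueError):
            pass  # skip empty or malformed script blocks
    return blocks

with open('test/examples/FAIRsharing.html') as f:
    for block in extract_jsonld(f.read()):
        # Bioschemas markup is normally a JSON object with an @type.
        if isinstance(block, dict):
            print(block.get('@type'))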

4. Install Solr.

5. Create a Solr core named 'bsbang' ($SOLR below is your Solr installation directory)

cd $SOLR/bin
./solr create -c bsbang
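
Before continuing, you can check that the core is up by hitting Solr's standard ping handler. A quick sketch using the requests package; adjust the host and port if your Solr is not on the default http://localhost:8983:

import requests  # assumes the requests package is installed

# Solr's standard ping handler; 'bsbang' is the core created above.
resp = requests.get('http://localhost:8983/solr/bsbang/admin/ping', params={'wt': 'json'})
resp.raise_for_status()
print(resp.json()['status'])  # expect 'OK'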

6. Run Solr setup ($BSBANG below is the root of this repository)

cd $BSBANG
./setup/bsbang-setup-solr.py <path-to-bsbang-config-file> --solr-core-url <URL-of-solr-endpoint>

Example:

./setup/bsbang-setup-solr.py conf/bsbang-solr-setup.xml --solr-core-url http://localhost:8983/solr/bsbang/
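
To confirm the setup script registered its fields, Solr's Schema API can list everything defined on the core. A sketch; the bsbang-specific field names come from the config file, so this just prints whatever is defined:

import requests

# The Schema API lists the fields defined on the core, including any
# the bsbang setup script added on top of Solr's defaults.
resp = requests.get('http://localhost:8983/solr/bsbang/schema/fields')
resp.raise_for_status()
for field in resp.json()['fields']:
    print(field['name'], field.get('type'))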

7. Index the extracted Bioschemas JSON-LD in Solr

./bsbang-index.py <path-to-crawl-db> --solr-core-url <URL-of-solr-endpoint>

Example:

./bsbang-index.py data/crawl.db --solr-core-url http://localhost:8983/solr/bsbang/
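
Once indexing finishes, a quick sanity check is to ask the core how many documents it now holds. A sketch, again assuming the default local Solr URL:

import requests

# Query for everything but return no rows; numFound gives the total.
resp = requests.get('http://localhost:8983/solr/bsbang/select',
                    params={'q': '*:*', 'rows': 0, 'wt': 'json'})
resp.raise_for_status()
print(resp.json()['response']['numFound'], 'documents indexed')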

Frontend

See https://github.com/justinccdev/bsbang-frontend for a frontend that searches this index.

Tests

$ python3 -m unittest discover

TODO

Future possibilities include:

  • Switch to a third-party crawler, or third-party components, rather than this custom-built one. Please see buzzbangorg#5
  • Make crawler periodically re-crawl.
  • Understand much more structure (e.g. Dataset elements within DataCatalog).
  • Parse other Bioschemas and schema.org types used by life sciences websites (e.g. Organization, Service, Product)
  • Instead of using SQLite as the intermediate crawl store, use something more scalable (perhaps MongoDB, Cassandra, etc.). But see also the item above about replacing much of the crawling infrastructure with a third-party project, which will already have solved some, if not all, of these scalability issues.
  • Crawl and understand PhysicalEntity/BioChemEntity/ResearchEntity once these types mature further.

Any other suggestions are welcome, either as GitHub issues for discussion or as pull requests.

Hacking

Contributions welcome! Please:

  • Make pull requests to the dev branch.
  • Conform to the PEP 8 style guide.

Thanks!