PublSpider: A Python repository from dracit7

PublSpider

PublSpider is a web crawler based on scrapy. It is designed to gather the information about academic publications.

To start using PublSpider

It's necessary to install the scrapy library of python in advance.

pip install scrapy

How to use

Firstly, ensure that your working directory is the root directory of this project. Then, edit targets.json and add some conferences you like to it.

Attention: name of conferences in targets.json should be consistent with its respective name in dblp.

For example: I want to gather information about publications published in USENIX ATC, and the corresponding page on dblp is https://dblp.org/db/conf/usenix/, then I should add an "usenix" to targets.json.

After that, execute scrapy crawl <media> to start gathering information.

<media> is the media for storage. PublSpider currently supports two kinds of media:

sqlite

sqlite is a library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.

After executing scrapy crawl sqlite, a file names data.db should be generated in your working directory. feel free to browse it with any sqlite browser you like.

json

json is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.

After executing scrapy crawl json, a file names data.json should be generated in your working directory. JSON files could be viewed or edited by any text editor.

Configuring

Some advanced features will not be available until you enable them. Edit settings.py and you can see some extra settings: