We aim to collect data about a specific domain from various sources, such as Wikipedia, Wikidata, web pages, and documents. Our final objective is to provide a database (JSON or CSV) that contains information from all these sources in a coherent and concise manner. This database can be used by anyone who plans to work in a particular domain without worrying about data collection and data quality.
The domain we have chosen is minerals.
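For illustration, a single record in the merged database could look like the following. The schema here is an assumption, not the project's fixed format; only the mineral facts themselves (quartz is SiO2, trigonal, Mohs hardness 7) are real.

```json
{
  "name": "Quartz",
  "formula": "SiO2",
  "crystal_system": "Trigonal",
  "mohs_hardness": 7,
  "sources": ["wikipedia", "wikidata"]
}
```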
- Ensure that you have Python 3.8+ installed on your system.
- Clone or download the repository.
- Run the following commands from the repository root to create and activate a Python virtual environment and install the dependencies.
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
The directory contains a single file, which we used to obtain the links to the Wikipedia pages of various minerals. Further details about the code can be found in the later sections.
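As a rough sketch of what such a link collector can look like (the actual script may differ; the list page URL and the filtering rule here are assumptions):

```python
import requests
from bs4 import BeautifulSoup

# Fetch a Wikipedia list page of minerals as an illustrative starting point.
URL = "https://en.wikipedia.org/wiki/List_of_minerals"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")

# Collect links to individual mineral articles from the page body.
links = {
    "https://en.wikipedia.org" + a["href"]
    for a in soup.select("div#mw-content-text a[href^='/wiki/']")
    if ":" not in a["href"]  # skip File:, Category:, and similar namespaces
}
```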
The directory contains all the code used to scrape the data sources listed above. Each file is responsible for scraping a single data source.
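As an illustration of the pattern these scrapers follow, the sketch below pulls label/value pairs out of a Wikipedia mineral infobox; this function is hypothetical, and the real files handle many more edge cases.

```python
import requests
from bs4 import BeautifulSoup

def scrape_infobox(url: str) -> dict:
    """Extract label/value pairs from a Wikipedia infobox table."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    infobox = soup.find("table", class_="infobox")
    data = {}
    if infobox:
        for row in infobox.find_all("tr"):
            label, value = row.find("th"), row.find("td")
            if label and value:
                data[label.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
    return data
```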
The data folder is where all the scraped, processed, and merged data is stored. It is further divided into subdirectories, one for the data associated with each source; a rough layout is sketched below.
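```
data/
├── wikipedia/   # scraped Wikipedia data
├── ima/         # data extracted from the IMA PDF
└── merged/      # cleaned and merged output (name is illustrative)
```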
This directory is responsible for merging and cleaning the data obtained from the sources. The code is split into multiple files by use case. It also contains a utils file with helper functions used in the cleaning and merging process.
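As a sketch of the kind of helpers involved (the function names here are hypothetical, not the actual utils API), merging typically keys records on a normalized mineral name:

```python
def normalize_name(name: str) -> str:
    """Normalize a mineral name so records from different sources match."""
    return name.strip().lower().replace("-", " ")

def merge_records(*sources: dict) -> dict:
    """Merge per-source dicts of {mineral_name: attributes} into one dict."""
    merged = {}
    for source in sources:
        for name, attrs in source.items():
            merged.setdefault(normalize_name(name), {}).update(attrs)
    return merged
```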
This folder contains the notebooks used to analyze the data and plot its distribution. The primary plotting libraries are matplotlib, seaborn, and plotly.
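For example, a quick distribution plot over the merged data might look like this (the CSV path and column name are assumptions):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data/merged/minerals.csv")  # hypothetical merged output path
sns.histplot(df["mohs_hardness"].dropna())    # column name is illustrative
plt.xlabel("Mohs hardness")
plt.show()
```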
This is a sample directory used to save the logs produced when scraping data from Wikipedia. Scraping takes a long time to finish, so the output is logged. A sample log is present in the directory.
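A minimal sketch of such a logging setup, with an illustrative file name and format:

```python
import logging

# Log scraper progress to a file under logs/ (file name is illustrative).
logging.basicConfig(
    filename="logs/wikipedia_scrape.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logging.info("Scraped %d of %d minerals", 100, 5000)
```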
The Jupyter notebooks can be run cell by cell to obtain the output. Any important information is described below.
The webmineral notebook uses Selenium to retrieve data and expects ChromeDriver to be installed on your system. You can download ChromeDriver from https://chromedriver.chromium.org/.
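A minimal sketch of driving headless Chrome with Selenium, assuming `chromedriver` is available on your system:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # no browser window needed for scraping
driver = webdriver.Chrome(options=options)  # uses the installed ChromeDriver
driver.get("http://webmineral.com/")        # site scraped by the notebook
print(driver.title)
driver.quit()
```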
Some cleaning notebooks depend on `nltk.corpus.wordnet`. Ensure that you have downloaded the NLTK WordNet corpus before running them. To download it, run the following in a Python interpreter.
import nltk
nltk.download('wordnet')
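WordNet can then be used in the cleaning notebooks, for example, to check whether a token is a recognized English word (illustrative usage, not the project's exact code):

```python
from nltk.corpus import wordnet

# A token with no synsets is likely not a common English word.
print(wordnet.synsets("quartz"))   # non-empty: a real word
print(wordnet.synsets("xyzzyq"))   # empty list: likely noise
```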
The sections below describe how to run the `.py` scraping files directly.
Run the following commands from the project root to avoid any errors.
$ mkdir -p logs
$ mkdir -p data/wikipedia
$ python scrapers/wikipedia.py <input_file_path>
The output will be present in the `data/wikipedia` directory. The corresponding log will be present in the `logs` directory.
Ensure that you have Java 8+ installed on your system. The `tabula-py` library requires a Java runtime environment.
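Once Java is available, extracting tables with tabula-py is straightforward (the file path below is a placeholder):

```python
import tabula

# Read every table in the PDF into a list of pandas DataFrames.
tables = tabula.read_pdf("path/to/ima_list.pdf", pages="all")
print(len(tables), "tables extracted")
```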
$ mkdir -p data/ima
$ python scrapers/ima_pdf.py <input_file_path>
The output will be present in the `data/ima` directory.