Team members:
- Avik Kuthiala (101803116)
- Naman Tuli (101983040)
Video Link:
Presentation File Here.
Make sure to use venv for installing dependencies. To create and activate a virtual environment (on Windows), run:
python -m venv myenv
myenv\Scripts\activate.bat
Then install all the required dependencies in the activated virtual environment by running the following in your terminal:
pip install -r requirements.txt
Note: If we missed any dependency, please pip install whichever library is reported as not found.
First navigate to /Bicameral-Minds, the repository you cloned.
After installing the dependencies, navigate to the directory Code/NER Models/, go into each folder, and select Extract Here for all 4 zip files.
Setup complete.
Navigate to directory Code/
To run the crawler with default settings, just use:
scrapy crawl mygovscraper
The default running time for the crawler is 3 minutes. To run the crawler for a specific amount of time, use:
scrapy crawl mygovscraper -s CLOSESPIDER_TIMEOUT=<time in secs>
Example (To run for 1800 seconds):
scrapy crawl mygovscraper -s CLOSESPIDER_TIMEOUT=1800
Note: Do not put a space after CLOSESPIDER_TIMEOUT=, as it will cause an error.
Other settings for configuring the crawler can be found in the official Scrapy documentation.
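If you prefer to launch the crawler from a script, the commands above can be assembled programmatically. The following is only a minimal sketch: the helper names `build_crawl_command` and `run_crawl` are hypothetical and not part of this repository, and it assumes `scrapy` is on the PATH of the activated environment and that the script is started from the repository root.

```python
import subprocess


def build_crawl_command(timeout_secs=None, spider="mygovscraper"):
    """Return the argument list for the crawl command shown above.

    Hypothetical helper, not part of the repository.
    """
    cmd = ["scrapy", "crawl", spider]
    if timeout_secs is not None:
        timeout_secs = int(timeout_secs)
        if timeout_secs <= 0:
            raise ValueError("timeout_secs must be a positive number of seconds")
        # The setting name and its value form one token: no space after '='.
        cmd += ["-s", f"CLOSESPIDER_TIMEOUT={timeout_secs}"]
    return cmd


def run_crawl(timeout_secs=None, cwd="Code"):
    """Run the crawler from the Code/ directory (assumed relative to the repo root)."""
    return subprocess.run(build_crawl_command(timeout_secs), cwd=cwd, check=True)
```

For example, `run_crawl(1800)` would be equivalent to the 1800-second example above, and `run_crawl()` would use the crawler's default settings.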
A database will be created:
database.csv
Navigate to directory Code/
and run in terminal:
python postprocess.py
New database will be created:
clean_database.csv
- The problem statement mentioned that 15 countries would be considered for the hackathon. We therefore trained our NLP models on only 3 countries, since training on all 15 countries and then producing results on those same countries would not make sense. However, the FAQs in the e-mail sent by MSC on 17-09-2020 (for the deadline extension) said, "Try to train with as many countries' govt websites' HTML structures as possible". Since we trained with data for 3 countries and scaled the solution to 15, we believe we could train on 15 countries and scale the solution to 70-80 countries. However, it was not feasible to carry out this upscaling in under 4 days.
- The crawler visits all the sites, but not in any particular order. You may have to wait a while before meaningful websites start appearing in the logs.
- After the crawler starts running, you can see the sites being crawled in the Code/log.txt file. It has currently been emptied out.
- The sites to be crawled must be listed in starter_sites.txt. Currently, the file contains all 14 sites to be considered. We strongly advise testing with only 1 site, since crawling govt sites is a very computationally costly process.
- The sample database that we created was produced by running the crawler for a total of 16 hours.
- Results on news article pages:
The above image is an example of a news article webpage from which the Prefix, Name and Position held were correctly extracted. The scraper is not designed to extract profiles from long paragraphs, yet we noticed a few profiles being successfully extracted from news articles as well.
The API has been built using Node.js and Express, and works by fetching data from a MongoDB database.
Ensure that you have a functioning MongoDB setup before testing the API with Postman.
Navigate to the db_api
directory and install the required modules using npm:
npm install express node body-parser mongoose nodemon
First start the database server using -
mongod
To import the CSV database into MongoDB, run the following command -
mongoimport --type csv -d record_db -c records --headerline --drop final_db.csv
Then make sure you are in db_api/
and run the command -
npm run start
Now the setup is running and you can use Postman to test your API. The parameters need to be passed as x-www-form-urlencoded in the Body.
The API will be hosted on http://localhost:3000/
To get all records, use - http://localhost:3000/records
To get the details of a single person, use -
http://localhost:3000/findrecord
and pass the parameter name in the request body, as explained above and in the accompanying video.
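The same request can also be made without Postman. The sketch below uses only the Python standard library to send the name parameter as an x-www-form-urlencoded body. It rests on assumptions not stated in this README: that /findrecord accepts a POST request and that the API responds with JSON. The helper names `build_findrecord_request` and `fetch_json` are hypothetical; adjust the method if the API expects something else.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:3000"


def build_findrecord_request(name, method="POST"):
    """Build a request to /findrecord with a form-encoded 'name' parameter.

    The HTTP method is an assumption; this README does not state it.
    """
    body = urllib.parse.urlencode({"name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/findrecord",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method=method,
    )


def fetch_json(request):
    """Send the request and decode the response, assuming a JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, `fetch_json(build_findrecord_request("<name>"))` would return the matching record, and `fetch_json(urllib.request.Request(BASE_URL + "/records"))` would fetch all records.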