A Wikipedia scraper bot written in Python. Developed with Python 3.6.0 using Spyder.
Necessary Modules
- BeautifulSoup
- Requests
- Html2Text
- Validators
- OS (os, part of the Python standard library)
- Regex (re, part of the Python standard library)
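
As a quick sanity check before running the bot, the short sketch below reports which of the modules listed above cannot be imported. The import names (bs4 for BeautifulSoup, html2text, validators) are assumptions based on the usual PyPI packages, and find_missing is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Import names assumed for the modules listed above (bs4 is BeautifulSoup).
REQUIRED_MODULES = ["bs4", "requests", "html2text", "validators", "os", "re"]


def find_missing(modules):
    """Return the names in `modules` that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Any name it prints is a third-party module that still has to be installed; os and re ship with Python and should never show up.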
Installation
First, install Python 3.6.0 and the third-party modules listed above.
Clone the repository to your desktop, then run Main.py from CMD or a terminal with the command python Main.py
To do
- Create a very basic Wikipedia scraper that scrapes the title and the first few paragraphs. Given a URL, it stores the scraped text in a text file in an output folder.
- Create a separate file for the downloaded text
- Divide the code among various modules
- Clean the scraped data
- Add comments to the code
- Handle the exceptions that could occur
- Make a Reddit bot out of it
- The structure will look like: Main.py, Scraper.py, Cleaner.py, TxtToFile.py, and a downloadedTxt folder
Known issue: the scraper doesn't work correctly on pages (such as Illuminati) that contain quoted text. This needs to be fixed.