I implemented the crawler using the Breadth-First Search (BFS) algorithm; a sketch of the approach follows. Code can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler.py
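As a minimal illustration of the BFS idea (not the exact code in crawler.py; the function name bfs_crawl, the max_pages budget, and the seed URL are placeholders), the crawl is driven by a FIFO queue plus a visited set:

from collections import deque
from urllib.request import urlopen
from bs4 import BeautifulSoup

def bfs_crawl(root_url, max_pages=100):
    visited = {root_url}        # URLs already discovered
    queue = deque([root_url])   # FIFO frontier gives breadth-first order
    while queue:
        url = queue.popleft()   # dequeue the oldest URL first
        soup = BeautifulSoup(urlopen(url), "html.parser")
        for link in soup.find_all("a"):
            href = link.get("href")
            # Enqueue only absolute, unseen URLs, up to the page budget
            if href and href.startswith("http") and href not in visited and len(visited) < max_pages:
                visited.add(href)
                queue.append(href)
    return visited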
For dealing with hierarchical data in MySQL, I used the adjacency list model. In the adjacency list model, each item in the table contains a pointer to its parent (as shown in the image below). The topmost URL (the root node) has a NULL value for its parent. The table that stores this data is created as follows:
CREATE TABLE links(node_id INT AUTO_INCREMENT PRIMARY KEY, URL VARCHAR(2083) NOT NULL, Parent INT DEFAULT NULL);
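For illustration, assuming node_id values are assigned 1, 2, 3 in insertion order, a root URL and two descendants would be stored like this (the URLs are placeholders):

-- Root node: Parent is NULL
INSERT INTO links (URL, Parent) VALUES ('https://example.com', NULL);    -- node_id = 1
-- Each child row stores the node_id of its parent
INSERT INTO links (URL, Parent) VALUES ('https://example.com/a', 1);     -- node_id = 2
INSERT INTO links (URL, Parent) VALUES ('https://example.com/a/b', 2);   -- node_id = 3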
Implementation can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler_DB.py
A snapshot of the database schema:
Before running crawler_DB.py, make sure of the following points (a setup sketch follows the list):
a. You have created a database named crawlerdb.
b. A user ID and password are set up to access crawlerdb.
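A minimal setup sketch in the MySQL shell, assuming a local server (the user name and password below are placeholders, not the project's actual credentials):

-- Create the database the crawler writes to
CREATE DATABASE crawlerdb;
-- Create a user and grant it access to crawlerdb
CREATE USER 'crawler_user'@'localhost' IDENTIFIED BY 'crawler_password';
GRANT ALL PRIVILEGES ON crawlerdb.* TO 'crawler_user'@'localhost';
FLUSH PRIVILEGES;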
In order to implement multithreading, I used the threading module. Code can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler_multithreading.py
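As a minimal sketch of the pattern (illustrative, not the exact code in crawler_multithreading.py; crawl() stands in for the real per-URL work), worker threads pull URLs from a shared, thread-safe queue so network waits overlap:

import threading
from queue import Queue

url_queue = Queue()               # thread-safe FIFO shared by all workers

def crawl(url):
    # Placeholder for the real per-URL work (fetch page, extract links, store in DB)
    print("crawling", url)

def worker():
    while True:
        url = url_queue.get()     # blocks until a URL is available
        try:
            crawl(url)
        finally:
            url_queue.task_done() # mark this URL as processed

for _ in range(4):                # start four daemon worker threads
    threading.Thread(target=worker, daemon=True).start()

url_queue.put("https://example.com")  # seed the crawl (placeholder URL)
url_queue.join()                      # wait until every queued URL is handled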
For the multiprocessing implementation, I used the multiprocessing module. Code can be found at https://github.com/venukarnati92/crawler/blob/master/crawler/crawler_multiprocessing.py
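Again as an illustrative sketch (not the exact code in crawler_multiprocessing.py; the pool size, seed URL, and crawl() function are placeholders), a process pool can map the per-URL work over one BFS level at a time:

from multiprocessing import Pool

def crawl(url):
    # Placeholder for the real per-URL work (fetch page, extract links, store in DB)
    return url

if __name__ == "__main__":
    frontier = ["https://example.com"]     # current BFS level (placeholder seed)
    with Pool(processes=4) as pool:
        # Each worker process handles one URL; the results form the next level
        results = pool.map(crawl, frontier)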
To extend the code to a distributed setting, we can use Ray, an open-source library for writing parallel and distributed Python.
To turn the Python function getLinks() in the code into a “remote function”, declare it with the @ray.remote decorator. Invocations via getLinks.remote() will then immediately return futures (a future is a reference to the eventual output), and the actual function execution will take place in the background (we refer to this execution as a task).
import ray
from urllib.request import urlopen
from bs4 import BeautifulSoup

@ray.remote
def getLinks(url):
    # Get all the href's from the URL
    html_page = urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    return [link.get("href") for link in soup.find_all("a")]
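A hypothetical driver for the remote function (the seed URLs are placeholders; ray.init() connects to, or starts, a local Ray runtime):

import ray

ray.init()

urls = ["https://example.com", "https://example.org"]  # placeholder seed URLs
futures = [getLinks.remote(u) for u in urls]           # returns futures immediately
results = ray.get(futures)                             # blocks until the tasks finish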