
EasyCrawl

An easy tool for crawling resources from URLs.

Install

conda create -n easycrawl -y python=3.11
conda activate easycrawl

pip install requests
pip install beautifulsoup4
pip install git+https://github.com/guanhuankang/easyBucket.git
pip install git+https://github.com/guanhuankang/easyCrawl.git
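
To check that everything is importable, a quick smoke test (the import names are the ones used in the tutorial below):

python -c "from easycrawl import EasyCrawl, EasyQueue, md5; print('easycrawl OK')"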

Run Toy Code

rm -rf easybucketdatabase
python crawl.py

Tutorial

from easycrawl import EasyCrawl
from easycrawl import defaultHanlder, md5

if __name__=="__main__":
    urls = ["https://..."]
    entrances = [(md5(str(x)), x) for x in urls]
    
    easyCrawl = EasyCrawl(entrances=entrances, handler=defaultHanlder, n_threads=16)
    easyCrawl.start()
    easyCrawl.join()

where "defaultHanlder" is the page handler function, and usually you need to write this function by yourself to meet your personal requirements. "defaultHanlder" has the following interface:

import requests
from bs4 import BeautifulSoup
from easycrawl import md5

def defaultHanlder(hash, url, queue):
    '''
    # hash:str is the unique identifier that refers to the url:any
    # url:any is the URL resource; you can define any type of URL resource, such as https://...
    # queue:EasyQueue (from easycrawl import EasyQueue) is a thread-safe FIFO queue. We use this queue to record all urls in FIFO order.
    '''
    ## Code Here
    ## Toy Code: find all <img> tags until no more resources are available.
    res = requests.get(url)
    html = BeautifulSoup(res.content, 'html.parser')
    print(html.find_all('img'))  ## record all <img> tags

    ## append more urls to the queue for future visits (BFS)
    for link in html.find_all('a'):
        href = link.get("href", "#")
        if href.startswith("http") and not queue.visited(md5(href)):
            queue.push(md5(href), href)  ## push href to the queue so that we can visit it later

    print(f"Queue: {queue.size()}", end="\r")  ## print the number of remaining links

EasyQueue

We also include a thread-safe FIFO queue, EasyQueue, in this repo. Its advantages:

[x] Thread-safe
[x] Simple to use: it supports push, pop, size, has, and visited.
[x] Memory-efficient: it adopts a paging mechanism that dumps part of the queue to storage, making it friendly to memory-limited machines such as most personal VPSs.

from easycrawl import EasyQueue, md5

data = {"str": "Hello World", "int": 666}
hash = md5(data["str"])  ## we use the md5 value as the unique identifier; you can choose any one you like

easyQueue = EasyQueue(name="anyname")

easyQueue.push(hash, data)   ## push an item into the queue
print("size:", easyQueue.size())  ## size of the queue
print("has:", easyQueue.has(hash))  ## whether the queue has this item, whose unique identifier is "hash"
print("visited:", easyQueue.visited(hash))  ## whether the queue has visited this item (regardless of whether it is still in the queue)
print("Advanced usage# setVisitedData:", easyQueue.setVisitedData(hash, data={"status": "in queue"}))  ## we can store some additional data in the visiting tree
print("Advanced usage# info:", easyQueue.info(hash))  ## retrieve the push/pop counters and the stored additional data
print(easyQueue.pop())  ## pop the front item: (hash, data)

'''
size: 1
has: True
visited: True
Advanced usage# setVisitedData: None
Advanced usage# info: {'push': 1, 'pop': 0, 'data': {'status': 'in queue'}}
('b10a8db164e0754105b7a99be72e3fe5', {'str': 'Hello World', 'int': 666})
'''
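
Because visited() remains True once a hash has been pushed (as shown above), EasyQueue can also serve as a deduplicating BFS frontier on its own, without EasyCrawl. The toy graph below is purely illustrative; only push, pop, size, visited, and md5 come from the examples above.

from easycrawl import EasyQueue, md5

## toy adjacency list standing in for "links found on a page"
graph = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
}

frontier = EasyQueue(name="bfs-demo")
frontier.push(md5("A"), "A")

while frontier.size() > 0:
    h, node = frontier.pop()  ## FIFO: the oldest node comes out first
    print("visiting", node)
    for nxt in graph[node]:
        if not frontier.visited(md5(nxt)):  ## skip anything already pushed
            frontier.push(md5(nxt), nxt)

This visits A, B, C, D exactly once, in breadth-first order.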