An easy tool for crawling resources from URLs.
conda create -n easycrawl -y python=3.11
conda activate easycrawl
pip install requests
pip install beautifulsoup4
pip install git+https://github.com/guanhuankang/easyBucket.git
pip install git+https://github.com/guanhuankang/easyCrawl.git
rm -rf easybucketdatabase
python crawl.py
from easycrawl import EasyCrawl
from easycrawl import defaultHanlder, md5
if __name__ == "__main__":
    urls = ["https://..."]
    ## each entrance is a (hash, url) pair; the md5 of the URL serves as the hash
    entrances = [(md5(str(x)), x) for x in urls]
    easyCrawl = EasyCrawl(entrances=entrances, handler=defaultHanlder, n_threads=16)
    easyCrawl.start()
    easyCrawl.join()
Here, "defaultHanlder" is the page handler function. You will usually write your own handler to suit your needs. A handler has the following interface:
import requests
from bs4 import BeautifulSoup
from easycrawl import md5

def defaultHanlder(hash, url, queue):
    '''
    # hash:str is the unique identifier that refers to the url:any
    # url:any is the URL resource; you can define any type of URL resource, such as https://...
    # queue:EasyQueue (from easycrawl import EasyQueue) is a thread-safe FIFO queue. It records all URLs in FIFO order.
    '''
    ## Code Here
    ## Toy Code: find all <img> tags until no more resources are available.
    res = requests.get(url)
    html = BeautifulSoup(res.content, 'html.parser')
    print(html.find_all('img')) ## list all <img> tags on the page
    ## append more urls to queue for future visiting (BFS)
    for link in html.find_all('a'):
        ## visited() is looked up by hash (see the EasyQueue example), so hash the href first
        if link.get("href", "#").startswith("http") and not queue.visited(md5(link["href"])):
            queue.push(md5(link["href"]), link["href"]) ## push href to queue so that we can visit it later.
    print(f"Queue: {queue.size()}", end="\r") ## print the number of remaining links
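The toy handler above only follows hrefs that already start with "http", so relative links such as "/about" are skipped. The sketch below shows one way a custom handler could resolve relative links before pushing them to the queue. It uses only the Python standard library; `LinkCollector` and `extract_links` are hypothetical helper names and are not part of easycrawl.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html_text):
    """Return absolute http(s) links from html_text, de-duplicated, in page order."""
    parser = LinkCollector()
    parser.feed(html_text)
    links = []
    for href in parser.hrefs:
        # Resolve relative hrefs against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        # Keep only web links; this filters out mailto:, javascript:, etc.
        if urlparse(absolute).scheme in ("http", "https") and absolute not in links:
            links.append(absolute)
    return links


page = (
    '<a href="/about">About</a> '
    '<a href="https://example.org/x#top">X</a> '
    '<a href="mailto:me@example.com">Mail</a>'
)
print(extract_links("https://example.com/docs/", page))
```

Inside a handler, each returned link could then be hashed with md5 and pushed to the queue in the same way as in the toy code.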
This repo also includes a thread-safe FIFO queue, EasyQueue. Its advantages are:
[x] Thread-Safe
[x] Simple to use: it supports push, pop, size, has, visited.
[x] Memory-efficient: it uses a paging mechanism to dump part of the queue to storage, which saves memory and makes it friendly to memory-limited machines, such as most personal VPSs.
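To illustrate the paging idea in general terms, here is a minimal sketch of a FIFO queue that writes full pages to disk and reloads them when needed. It is not EasyQueue's actual implementation. The class name `PagedFifo`, the `page_size` parameter, and the pickle-based file layout are all assumptions made for this example.

```python
import os
import pickle
import tempfile
import threading
from collections import deque


class PagedFifo:
    """Toy FIFO that spills full pages to disk (illustrative only, not EasyQueue)."""

    def __init__(self, page_size=1000, directory=None):
        self.page_size = page_size
        self.directory = directory or tempfile.mkdtemp(prefix="pagedfifo_")
        self.lock = threading.Lock()
        self.head = deque()   # oldest items, popped first
        self.tail = deque()   # newest items, not yet a full page
        self.pages = deque()  # paths of full pages on disk, oldest first
        self.counter = 0      # used to name page files
        self.count = 0        # total number of queued items

    def push(self, item):
        with self.lock:
            self.tail.append(item)
            self.count += 1
            if len(self.tail) >= self.page_size:
                # The tail page is full: dump it to disk and free the memory.
                path = os.path.join(self.directory, f"page_{self.counter}.pkl")
                self.counter += 1
                with open(path, "wb") as f:
                    pickle.dump(list(self.tail), f)
                self.pages.append(path)
                self.tail.clear()

    def pop(self):
        with self.lock:
            if not self.head:
                if self.pages:
                    # Load the oldest page from disk back into memory.
                    path = self.pages.popleft()
                    with open(path, "rb") as f:
                        self.head.extend(pickle.load(f))
                    os.remove(path)
                else:
                    # Nothing on disk: the tail holds the oldest remaining items.
                    self.head, self.tail = self.tail, deque()
            if not self.head:
                raise IndexError("pop from empty queue")
            self.count -= 1
            return self.head.popleft()

    def size(self):
        with self.lock:
            return self.count


q = PagedFifo(page_size=3)
for i in range(10):
    q.push(i)
print(q.size())
print([q.pop() for _ in range(10)])
```

With this layout, at most about two pages of items are held in memory at once, while items still come out in the order they were pushed.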
from easycrawl import EasyQueue, md5
data = {"str": "Hello World", "int": 666}
hash = md5(data["str"]) ## we use the md5 value as the unique identifier; you can choose any identifier you like
easyQueue = EasyQueue(name="anyname")
easyQueue.push(hash, data) ## push an item into the queue
print("size:", easyQueue.size()) ## size of the queue
print("has:", easyQueue.has(hash)) ## whether the queue currently holds the item whose unique identifier is "hash"
print("visited:", easyQueue.visited(hash)) ## whether the queue has ever seen this item (whether or not it is still in the queue)
print("Advantage usage# setVisitedData:", easyQueue.setVisitedData(hash, data={"status": "in queue"})) ## store some additional data in the visiting tree
print("Advantage usage# info:", easyQueue.info(hash)) ## read back the record kept in the visiting tree (push/pop counts plus the stored data)
print(easyQueue.pop()) ## pop the front item: (hash, data)
'''
size: 1
has: True
visited: True
Advantage usage# setVisitedData: None
Advantage usage# info: {'push': 1, 'pop': 0, 'data': {'status': 'in queue'}}
('b10a8db164e0754105b7a99be72e3fe5', {'str': 'Hello World', 'int': 666})
'''
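The hash in the last output line is the hexadecimal MD5 digest of "Hello World". The sketch below reproduces it with the standard library, assuming that easycrawl's md5 helper hashes the UTF-8 bytes of its string argument. `md5_hex` is a hypothetical stand-in written for this example, not the library function itself.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex MD5 digest of a UTF-8 encoded string (hypothetical stand-in for easycrawl.md5)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


print(md5_hex("Hello World"))
```

Any stable, unique string works as an identifier; MD5 is convenient here because it maps URLs of any length to a fixed-length key.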