/discogs_aio_spider

基于httpx的一个大型项目 ,爬取黑胶唱片网站 Discogs

Primary LanguagePython

Crawl a foreign vinyl record information website(Discogs), use Asyncio+Httpx+Motor+Aio-Pika+Aioredis in Python3.8

Project flow chart

Project flow chart

How to run?

install poetry

python3 -m pip install poetry

init environment

poetry init

install dependencies

poetry install 

Run

poetry run python main.py

Attention: Step1 must generate task queue first, otherwise step2 will go wrong.

Similarly, step3 must wait for step2 to generate a task queue. The call between files can be optimized according to actual business conditions.(Where step1, step2, step3 are aliases of functions)

About Code

Using some of the new features of python, you can learn some new usages, such as: typing, fstring, pydantic, dataclass How to develop asynchronous programming, the scheduling relationship between codes,the use of decorators, etc.

Introduction to the use of storage tools

  • redis: remove duplication
  • rabbitmq: Conducive to its message confirmation mechanism
  • mongo: data persistence

Bug?

  • In step1, an error will be reported when using step2 and step3, so multiple processes are used, and the optimization will be continued later to find solutions.
  • pydantic can't completely replace dataclass, there will be errors in some places, whether follow-up investigation can be solved.
  • There are some typing that has not been added and will be completed later.

Of course, the level is limited, and there are many areas that need to be optimized. Welcome to discuss with me.

Contact Me