
A simple crawler to get air quality data in China from https://www.aqistudy.cn/historydata/ using requests-html

Primary language: Python

AQIseeker

A simple object-oriented crawler that acquires air quality data for cities in China from
https://www.aqistudy.cn/historydata/ (空气质量历史数据查询)
Because the crawler is a class, users can fetch AQI data flexibly and embed the crawler in their own code.

Dependency

requests-html

Install requests-html with pip:
pip install requests-html

How to use

Acquire the data

Import the crawler class
from crawler import AQIseeker

Acquire the data from the website
this_page = AQIseeker('南京', '201703', 5)  # city name, month as 'yyyymm', maximum request attempts (default: 5)
this_page.getData()  # fetch the data for this city and month
this_page.metadict  # the dict that holds the retrieved data
# Note: one crawler instance retrieves the data of ONE city for ONE month only

Acquire the data of multiple cities and months

The crawler class accepts any valid city name and month and attempts to fetch the corresponding data from the website. Users are free to drive the class however they like; for the common case of several cities and months, a simple parser is provided.

Create a text file (for example 'some_cities.txt'; any file name works) with contents like the following:
南京 201701-201706
上海 201609-201703
北京 201610-201705
Each line must follow the format city_name yyyymm-yyyymm.
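To make the line format concrete, here is a minimal sketch of how one 'city_name yyyymm-yyyymm' line could be expanded into a city and a list of months. It is not the code in setting_parser; the helper names expand_months and parse_line are hypothetical, and the sketch assumes both ends of the range are inclusive.

```python
def expand_months(span):
    """Expand 'yyyymm-yyyymm' into a list of 'yyyymm' strings (assumed inclusive)."""
    start, end = span.split("-")
    year, month = int(start[:4]), int(start[4:])
    end_year, end_month = int(end[:4]), int(end[4:])
    months = []
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return months


def parse_line(line):
    """Split one 'city_name yyyymm-yyyymm' line into (city, list of months)."""
    city, span = line.split()
    return city, expand_months(span)


print(parse_line("上海 201609-201703"))
# ('上海', ['201609', '201610', '201611', '201612', '201701', '201702', '201703'])
```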

Import the parser
from setting_parser import getCityTime
city_time_dict = getCityTime('some_cities.txt')  # returns a dict
The parser returns a dictionary that maps each city name (key) to its list of months (value). See 'example_front.py' for how the parser and the crawler are used together.
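As a rough illustration of combining the two pieces, the sketch below loops over such a dictionary and creates one crawler instance per city and month. It is not taken from 'example_front.py'. The function name collect and the stand-in class FakeSeeker are hypothetical, and the sketch assumes the dictionary values are lists of 'yyyymm' strings and that metadict is filled once getData() returns.

```python
def collect(city_time_dict, seeker_cls, max_attempts=5):
    """Fetch every (city, month) pair and return {(city, month): metadict}."""
    results = {}
    for city, months in city_time_dict.items():
        for month in months:
            # One instance handles exactly one city in one month
            page = seeker_cls(city, month, max_attempts)
            page.getData()
            results[(city, month)] = page.metadict
    return results


class FakeSeeker:
    """Offline stand-in (hypothetical) mimicking the AQIseeker calls shown above."""

    def __init__(self, city, month, max_attempts=5):
        self.city, self.month = city, month
        self.metadict = {}

    def getData(self):
        # A real crawler would query the website here
        self.metadict = {"city": self.city, "month": self.month}


demo = collect({"南京": ["201701", "201702"]}, FakeSeeker)
print(sorted(demo))
```

With the real modules, the call would presumably look like collect(getCityTime('some_cities.txt'), AQIseeker), which sends one request sequence per city and month.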

Planned updates

  • Base class of the crawler
  • A parser for formatted txt files
  • Support for English city names
  • An alternative way to request data for multiple cities and months from a dict/str
  • Better performance through parallel operation
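For the parallel item, one possible direction is a thread pool, since the work is network-bound. The sketch below is only an assumption about how this could look, not the planned implementation: fetch_one and collect_parallel are hypothetical names, seeker_cls stands for a class with the AQIseeker interface shown earlier, and whether the crawler is safe to run from several threads has not been verified.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_one(seeker_cls, city, month, max_attempts=5):
    """Run one crawler instance and return ((city, month), metadict)."""
    page = seeker_cls(city, month, max_attempts)
    page.getData()
    return (city, month), page.metadict


def collect_parallel(city_time_dict, seeker_cls, max_workers=4):
    """Fetch all (city, month) pairs with a small pool of worker threads."""
    jobs = [(city, month)
            for city, months in city_time_dict.items()
            for month in months]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the submitted jobs
        pairs = pool.map(lambda job: fetch_one(seeker_cls, *job), jobs)
        return dict(pairs)
```

A small max_workers value keeps the load on the website modest; the right value is a judgment call.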

---Chinese version (translated into English)---

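Introduction: AQIseeker

A simple object-oriented crawler for scraping air quality data from the following web page
https://www.aqistudy.cn/historydata/ (空气质量历史数据查询)
Being object-oriented, the crawler lets users fetch air quality data for a specific city and period more freely, and is easier to plug into their own code.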

Dependency

requests-html

Install requests-html
pip install requests-html

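How to use

Acquire the data

Import the crawler class
from crawler import AQIseeker

Acquire the data from the website
this_page = AQIseeker('南京', '201703', 5)  # '5' is the maximum number of request attempts (default: 5)
this_page.getData()
this_page.metadict  # the retrieved data is stored in the dict metadict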

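Acquire the data of multiple cities and months

The crawler can handle any city name and time expression that match the expected format. Users can request multiple datasets in whatever way suits them, or use the fixed text format and text-processing tool provided here to define several cities and periods at once.

Create a txt file 'some_cities.txt' (any file name works) with contents like the following
南京 201701-201706
上海 201609-201703
北京 201610-201705
The text format should be city_name yyyymm-yyyymm

Import the provided text-processing tool
from setting_parser import getCityTime
city_time_dict = getCityTime('some_cities.txt')  # returns a dict
The tool returns a dictionary containing the city names (as keys) and the month lists (as values), which can then be used with the crawler to fetch the data. See 'example_front.py' for reference.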

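Planned updates

  • Base class of the crawler
  • A tool for processing formatted text, used to fetch data in batches
  • English support for Chinese city names
  • Allow fetching data for multiple cities/months from a dict/string
  • Improve crawler performance with parallel operation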