This repository shares official data released by the Chinese government from November 24, 2022 to December 23, 2022 on places designated as "high-risk" for COVID-19. "High-risk" means residents at those locations must quarantine at home (more on that below).
The data was scraped from the following websites:
- http://bmfw.www.gov.cn/yqfxdjcx/index.html (retired around December 18th)
- http://bmfw.www.gov.cn/yqfxdjcx/risk.html (retired after December 23rd)
Both are hosted by China's State Council, but the data comes from the National Health Commission (NHC, 中华人民共和国国家卫生健康委员会). I started scraping the data on November 24th and had to stop on December 23rd, because the NHC stopped publishing high-risk areas as the country lifted its zero-COVID policy.
High-risk COVID-19 areas are places where more than 10 cases are discovered within a 2-week period (source in Chinese). In such areas, residents cannot leave their homes and necessary services are delivered to their door, such as grocery deliveries (source in Chinese).
Since the word "lockdown" can mean a variety of things, I decided to focus on high-risk areas because they are the strictest level of quarantine and lockdown in China.
The "raw_data" directory includes daily lockdown data that I scraped without any post-processing or data cleaning. The "tidy_data" directory includes the cleaned and reformatted version of the dataset, with the following columns:
The largest administrative unit under which lockdown data was categorized. It includes China's major municipalities like Beijing and Chongqing, and all provinces and regions, including Xinjiang and Tibet. Note: Hong Kong and Macau are not included in this dataset.
This is the second-largest administrative unit. It includes city districts, but also more local units at the subdistrict (街道) level. It can also refer to large swaths of land, like Karakax County (墨玉县) in Hotan, or Xinjiang Production and Construction Corps land units (ex: **生产建设兵团第八师134团).
I chose to format this column as Municipality/Province + District (ex: 北京市昌平区天通苑北街道) to make it easy to identify and match which districts belonged where. It also made it easier to map the data.
This is the most specific location data from the National Health Commission. I added up the number of high-risk addresses published to calculate the number of places under strict lockdown in China every day.
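As a rough illustration of that daily tally, here is a minimal Python sketch. The column names `date` and `address` are hypothetical stand-ins, not necessarily the names used in tidy_data, and the sample rows are only there to make the snippet runnable.

```python
from collections import Counter


def count_addresses_per_day(rows):
    """Count how many high-risk addresses were listed on each date.

    `rows` is an iterable of dicts. The keys "date" and "address" are
    hypothetical column names chosen for this example.
    """
    counts = Counter()
    for row in rows:
        counts[row["date"]] += 1
    return dict(counts)


# Tiny illustrative sample, not real published counts.
sample_rows = [
    {"date": "2022-12-01", "address": "战备路社区人行小区3号楼"},
    {"date": "2022-12-01", "address": "新源道街道和平丽景社区盛德金地小区C座"},
    {"date": "2022-12-02", "address": "战备路社区人行小区3号楼"},
]
daily_counts = count_addresses_per_day(sample_rows)
```

Counting rows only gives a meaningful number once duplicates have been removed, so this step belongs after the cleaning described below.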
Here's some stuff to look out for if you decide to process the raw data yourself, or if you are curious about what was changed in the tidy_data versions:
No address should appear more than once on a given day, but some do, perhaps due to human error when the National Health Commission compiles the lockdown data.
There are also duplicate addresses that appear to be different but are actually identical. This happened a lot and it usually meant there was district-level location data included in the address, for example: 回民区战备路社区人行小区3号楼 vs. 战备路社区人行小区3号楼.
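One way to catch both kinds of duplicates is to strip a leading district name from each address before comparing. The sketch below is written in Python for illustration only (the cleaning for this repository was done in R), and the function names and the `(date, district, address)` tuple layout are assumptions.

```python
def strip_district_prefix(address, district):
    """Remove a leading district name from an address, if present."""
    if district and address.startswith(district):
        return address[len(district):]
    return address


def dedupe_addresses(rows):
    """Keep the first occurrence of each (date, district, address) triple.

    Addresses are compared after the district prefix is stripped, so
    "回民区战备路社区..." and "战备路社区..." count as the same place.
    """
    seen = set()
    unique = []
    for date, district, address in rows:
        normalized = strip_district_prefix(address, district)
        key = (date, district, normalized)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


rows = [
    ("2022-12-01", "回民区", "回民区战备路社区人行小区3号楼"),
    ("2022-12-01", "回民区", "战备路社区人行小区3号楼"),
    ("2022-12-01", "回民区", "战备路社区人行小区3号楼"),
]
deduped = dedupe_addresses(rows)
```

Here all three sample rows collapse into a single entry. The date is part of the key on purpose: the same address listed on two different days is not a duplicate.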
When the first website I scraped was taken down, I switched to the second URL, which wasn't formatted in the same way. The names of some municipalities and provinces were different, such as 云南 vs. 云南省.
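A small lookup table is one way to reconcile the two naming styles. This is a Python sketch under assumptions: the mapping below is deliberately incomplete, and choosing the full form (云南省 rather than 云南) as the canonical one is my guess based on examples like 北京市昌平区天通苑北街道. An explicit table is used instead of appending a suffix because the suffix varies (省, 市, 自治区).

```python
# Hypothetical, partial mapping from short names to full official names.
FULL_NAMES = {
    "云南": "云南省",
    "北京": "北京市",
    "西藏": "西藏自治区",
}


def normalize_province(name):
    """Return the full official name for a province-level unit.

    Names that are already in full form, or that are missing from the
    table, are returned unchanged (apart from surrounding whitespace).
    """
    name = name.strip()
    return FULL_NAMES.get(name, name)


normalized = [normalize_province(n) for n in ["云南", "云南省", " 北京 "]]
```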
I found stray punctuation at the end of some lockdown addresses, like: "新源道街道和平丽景社区盛德金地小区C座。"
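Removing that kind of trailing punctuation can be sketched in a few lines of Python. The set of characters to strip is an assumption. Extend it if other stray marks show up in the raw files.

```python
# Characters treated as stray trailing punctuation (an assumed list).
TRAILING_PUNCTUATION = "。，、；;,. "


def strip_trailing_punctuation(address):
    """Trim whitespace, then drop punctuation from the end of an address."""
    return address.strip().rstrip(TRAILING_PUNCTUATION)


cleaned = strip_trailing_punctuation("新源道街道和平丽景社区盛德金地小区C座。")
```

Only the end of the string is touched, so punctuation inside an address is left alone.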
Also, some addresses were bundled together under the same district, separated by semicolons instead of being listed on their own rows (note: this is usually the full-width semicolon ； from the Chinese keyboard, not the ASCII ;). So I separated those addresses out by splitting on the semicolon.
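Here is a minimal Python sketch of that split. It handles both the full-width semicolon (；) and the ASCII one (;), and the sample addresses are made-up placeholders rather than entries from the dataset.

```python
import re

# Matches either the ASCII semicolon or the full-width Chinese one.
SEMICOLONS = re.compile(r"[;；]")


def split_addresses(bundled):
    """Split a bundled address string into individual addresses.

    Empty pieces (for example from a trailing semicolon) are dropped.
    """
    parts = (part.strip() for part in SEMICOLONS.split(bundled))
    return [part for part in parts if part]


# Placeholder addresses for illustration only.
addresses = split_addresses("A小区1号楼；B小区2号楼; C小区3号楼；")
```

Each resulting address would then become its own row, with the date and district values copied from the original row.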
I used Selenium to scrape the two State Council websites. I tried to find historical data on the Wayback Machine but it was very inconsistent. The Wayback Machine also struggled to take snapshots of the website even while it was still up.
If you're curious about what the two different websites looked like, I have some PDF copies stored in the backup_PDFs directory. Sadly, I wasn't able to save PDF copies of all the data for each day.
Note: I had some issues with my computer December 15th-17th, so I saved PDFs of the website for those days and extracted the data with Tabula later. I then formatted the data to resemble the raw data files I've shared in the raw_data directory. I've included PDFs from those days in the backup_PDFs directory for reference.
To map lockdowns on the district level, such as in the animation up top, I used Amap's Point of Interest Web API, which you can find here.
As you can see in the chart above, there's a huge spike in high-risk areas on December 2nd. The change mainly comes from Guangdong province, which went from 1,969 high-risk areas on December 1st to 7,659 on December 2nd. This was at a time when China was starting to loosen its COVID-19 policies, though. So what happened?
Instead of listing an entire building as being high-risk, a lot of places in Guangdong began designating specific apartments or stores. So while there were probably fewer places under lockdown, more addresses were being listed as high-risk. I couldn't think of a good way to reflect this in the data, but if you have any ideas, I would love to hear them!
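If you want to find jumps like this one yourself, one option is to compare each province's count with the previous day's. The Python sketch below assumes counts are already aggregated into a `{(date, province): count}` dictionary, which is a layout I made up for the example. The Guangdong figures are the ones quoted above, and the "Example Province" rows are placeholders.

```python
def daily_jumps(counts, previous_day, current_day):
    """Return (province, change) pairs sorted from largest increase down.

    `counts` maps (date, province) to the number of high-risk addresses.
    A province missing on the previous day is treated as having 0.
    """
    jumps = {}
    for (date, province), n in counts.items():
        if date == current_day:
            before = counts.get((previous_day, province), 0)
            jumps[province] = n - before
    return sorted(jumps.items(), key=lambda item: item[1], reverse=True)


counts = {
    ("2022-12-01", "广东省"): 1969,
    ("2022-12-02", "广东省"): 7659,
    ("2022-12-01", "Example Province"): 100,  # placeholder, not real data
    ("2022-12-02", "Example Province"): 90,   # placeholder, not real data
}
jumps = daily_jumps(counts, "2022-12-01", "2022-12-02")
```

For Guangdong the change is 7,659 - 1,969 = 5,690 addresses in one day. A jump of that size flags a day worth checking by hand, but it can't tell you whether more places were locked down or the same places were just listed at a finer level.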
Here are some of the tools I used to scrape, clean, and visualize:
- Selenium
- QGIS
- tidyverse, lubridate & ggthemr (R packages)
- ImageMagick
- Tabula