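Twitter Data Scraping

Environment Setup

To avoid incompatibilities caused by mismatched Python versions, we recommend running in a Python 3.10 environment.
The twscrape library must be installed: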
```
pip3 install twscrape
```

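Running

This repository contains the source code for two modules, src and twsrc. src is scraping code built on snscrape; since Twitter's policy update in July 2023, data can no longer be scraped as a guest, so this module no longer works. twsrc is scraping code built on twscrape; it requires purchased accounts to log in before it can scrape.

Running the `src` module (deprecated)

The launch script is run-pooled.sh. Set the following variables before running it:

DIVIDE_DIR should be a folder containing several txt files, where each line of each txt file is one keyword to scrape.
DOWNLOAD_DIR is the folder where the downloaded data and log files are stored.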
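
As a rough illustration of the DIVIDE_DIR layout described above, the following sketch collects the keywords from every txt file in a folder. The function name `load_keywords` and the decision to skip blank lines are assumptions made for this example; this is not code taken from the repository.

```python
from pathlib import Path
from typing import List


def load_keywords(divide_dir: str) -> List[str]:
    """Collect one keyword per non-empty line from every .txt file in divide_dir.

    Hypothetical helper for illustration; the repository's own loader may differ.
    """
    keywords: List[str] = []
    # Sort the files so the resulting order is reproducible across runs.
    for txt_path in sorted(Path(divide_dir).glob("*.txt")):
        with open(txt_path, encoding="utf-8") as handle:
            for line in handle:
                keyword = line.strip()
                if keyword:  # assumption: blank lines carry no keyword
                    keywords.append(keyword)
    return keywords
```

Calling `load_keywords` on a folder laid out this way would return a flat list that can be inspected before starting a long download.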

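In addition, the time range to scrape has to be changed in the code. Specifically, if the scraping mode is multi-process or multi-threaded download, modify the following part of code/pooled_download.py: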

def pooled_download(args: Namespace):

    ...

    def _get_configs(counter_callback = lambda: None):
        temp_count: int = 0
        used_keys = get_used_keys(args)
        download_configs: List[Tuple[tuple, int]] = []
        for key in keys:
            for month in range(12 * 12):
                if (key, month) not in used_keys:
                    until = datetime.datetime(*date_minus((2022, 12), month), day = 1)
                    since = datetime.datetime(*date_minus((2022, 12), month + 1), day = 1)
                    temp_file_path = os.path.join(temp_dir, f'{temp_count :06d}.jsonl')
                    download_configs.append(((key, temp_file_path, until, since), month))
                    temp_count += 1
                    counter_callback()
        return download_configs

    ...

Once everything is set, run bash run-pooled.sh to start.
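
To make the time range in `_get_configs` easier to reason about, here is a minimal sketch of how the monthly `since`/`until` windows come out. `date_minus` is not shown in this document, so the version below is a hypothetical stand-in that assumes it returns a `(year, month)` tuple shifted back by the given number of months; `monthly_windows` is likewise a name invented for this example.

```python
import datetime
from typing import List, Tuple


def date_minus(year_month: Tuple[int, int], months: int) -> Tuple[int, int]:
    """Hypothetical stand-in: step back `months` months from (year, month)."""
    year, month = year_month
    # Count months from year 0 so the subtraction carries across year boundaries.
    index = year * 12 + (month - 1) - months
    return index // 12, index % 12 + 1


def monthly_windows(
    end: Tuple[int, int], count: int
) -> List[Tuple[datetime.datetime, datetime.datetime]]:
    """Build `count` one-month (since, until) windows going backwards from `end`."""
    windows = []
    for offset in range(count):
        until = datetime.datetime(*date_minus(end, offset), day=1)
        since = datetime.datetime(*date_minus(end, offset + 1), day=1)
        windows.append((since, until))
    return windows
```

Under that assumption, `monthly_windows((2022, 12), 12 * 12)` would cover 2010-12-01 through 2022-12-01, so the `(2022, 12)` anchor and the `range(12 * 12)` bound in the original code are the two values that move the scraped period.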

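Running the `twsrc` module

This module is built on the twscrape library, a Twitter scraping library that works through user accounts. When the library is used, account login information is stored in accounts.db in the current directory (a sqlite3 database). After installing twscrape, you can call twscrape directly from the terminal to check that it is working, for example: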

twscrape search china --limit 5

Note: the command above only works once accounts have been registered in accounts.db in the current directory.
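
Since accounts.db is an ordinary sqlite3 file, one quick way to see whether anything has been registered is to count the rows in each of its tables. The sketch below makes no assumption about twscrape's schema and simply reports whatever tables it finds; the name `table_row_counts` is invented for this example.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import Dict


def table_row_counts(db_path: str = "accounts.db") -> Dict[str, int]:
    """Return {table name: row count} for a sqlite3 file, or {} if it is missing."""
    if not Path(db_path).is_file():
        return {}
    counts: Dict[str, int] = {}
    with closing(sqlite3.connect(db_path)) as conn:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        for name in names:
            # Table names come from sqlite_master itself; quote them defensively.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts
```

An empty result, or a table with zero rows, would suggest that no account has been added in the current directory yet.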

The twsrc module can be invoked as a Python module:

python3 -m twsrc --help

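This command prints the basic usage of the module. Before running the code any further, we recommend completing the steps in the notes section later in this document.

The repository already includes a few default configuration files: twsrc/accounts.txt, which stores account information for testing, and twsrc/keywords-demo.txt, which stores test keywords to scrape. Run the following to scrape those keywords: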

python3 -m twsrc --keywords twsrc/keywords-demo.txt

After the accounts have logged in, the program should print the following:

totally 1 keywords
starting download for keyword "deepin since:2017-08-09 until:2017-08-10"
scraping end, keyword = "deepin since:2017-08-09 until:2017-08-10", count = 71 (1 / 1 = 100.00%), time = 5.18s, total time = 0:00:05, average time = 5.18s, average data = 71.00

Then check the downloads folder in the current directory; the scraped data should be stored there in jsonl format.
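
A quick sanity check on the result is to count the records in each output file. The sketch below assumes the files end in `.jsonl` and hold one JSON object per line; the exact file naming inside downloads is not described here, so the code just walks the whole folder. `count_records` is a name invented for this example.

```python
import json
from pathlib import Path
from typing import Dict


def count_records(download_dir: str = "downloads") -> Dict[str, int]:
    """Map each .jsonl file under download_dir to its number of JSON records."""
    counts: Dict[str, int] = {}
    for path in sorted(Path(download_dir).rglob("*.jsonl")):
        total = 0
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    json.loads(line)  # raises ValueError if a line is not valid JSON
                    total += 1
        counts[str(path.relative_to(download_dir))] = total
    return counts
```

The per-file totals can then be compared against the `count = ...` values printed in the scraper's log.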

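Notes

Patch the twscrape source so that it supports Twitter's new domain (x.com).

See the GitHub issue for details. After Twitter's rebranding, the domain used in its verification emails changed, which breaks twscrape's email verification module. The version available through pip does not appear to include the latest twscrape changes yet, so you can patch the source by hand in your local pip installation folder. Specifically, the function twscrape.imap._wait_email_code contains the following: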

...

if min_t is not None and msg_time < min_t:
    return None

if "info@twitter.com" in msg_from and "confirmation code is" in msg_subj:
    # eg. Your Twitter confirmation code is XXX
    return msg_subj.split(" ")[-1].strip()

...

Change it to:

...

# if min_t is not None and msg_time < min_t:
#     return None

if "info@x.com" in msg_from and "confirmation code is" in msg_subj:
    # eg. Your Twitter confirmation code is XXX
    return msg_subj.split(" ")[-1].strip()

...
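
The sender and subject check at the heart of this patch can be tried out in isolation. The following is a standalone sketch, not twscrape's actual function: `extract_confirmation_code` and `KNOWN_SENDERS` are names invented here, and accepting both the old and the new sender address is an assumption of the example rather than what the patch above does.

```python
from typing import Optional

# Sender addresses to accept: the old one from the original code and the new
# one from the patch. Accepting both at once is this sketch's own choice.
KNOWN_SENDERS = ("info@twitter.com", "info@x.com")


def extract_confirmation_code(msg_from: str, msg_subj: str) -> Optional[str]:
    """Return the trailing code from a confirmation email subject, else None."""
    from_known_sender = any(sender in msg_from for sender in KNOWN_SENDERS)
    if from_known_sender and "confirmation code is" in msg_subj:
        # e.g. "Your Twitter confirmation code is XXX" -> "XXX"
        return msg_subj.split(" ")[-1].strip()
    return None
```

Feeding it a few sample sender and subject strings is an easy way to confirm that a patched condition still matches the emails you actually receive.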

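[Optional] Patch the twscrape source to shorten its rate-limit wait time.

Because of Twitter's policy changes, each account is subject to a scraping rate limit. Specifically, requests start returning a (88) Rate limit exceeded error. When that happens, the account can be paused and put back into use after some time. In twscrape this wait defaults to 4 hours, which is rather conservative; in our tests, 15 minutes appeared to be enough. Specifically, the function twscrape.queue_client.QueueClient._check_rep contains the following: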
```
...

# possible new limits for tweets view per account
if msg.startswith("(88) Rate limit exceeded") or rep.status_code == 429:
    await self._close_ctx(utc_ts() + 60 * 60 * 4)  # lock for 4 hours
    raise RateLimitError(msg)

...
```
Change it to:
```
...

# possible new limits for tweets view per account
if msg.startswith("(88) Rate limit exceeded") or rep.status_code == 429:
    await self._close_ctx(utc_ts() + 60 * 15)  # lock for 15 minutes
    raise RateLimitError(msg)

...
```
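
Both patches in this section edit files inside the installed twscrape package, so it helps to know where those files live. The sketch below uses the standard library's `importlib.util.find_spec` to look up a module's location; `module_source_path` is a name invented for this example.

```python
import importlib.util
from typing import Optional


def module_source_path(module_name: str) -> Optional[str]:
    """Return where an installed module is loaded from (normally its .py file).

    Returns None when the module, or one of its parent packages, is missing.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "twscrape") is not installed.
        return None
    if spec is None or spec.origin is None:
        return None
    return spec.origin
```

For example, `module_source_path("twscrape.queue_client")` should point at the file that contains `_check_rep`, and `module_source_path("twscrape.imap")` at the one that contains `_wait_email_code`, assuming twscrape is installed in the active environment.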

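Running the `twsrc` module in parallel

Because twscrape is written around asynchronous coroutines, starting multiple processes from inside a Python script with libraries such as multiprocessing tends to crash the program. We therefore recommend using a bash script to launch several scraper programs at once to achieve multi-process scraping.

Reference code can be found in tw-multi/twsrc-multi.sh. The general idea is to first decide on the number of concurrent processes, then use the split.py script to divide the original account, proxy, and keyword files evenly across several subfolders, and finally invoke the twsrc module in each subfolder.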
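
The even split at the core of this approach can be sketched in a few lines. This is not the repository's split.py, whose contents are not shown here; `split_round_robin` is a hypothetical helper that deals items out in turn so that the parts differ in size by at most one.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_round_robin(items: Sequence[T], parts: int) -> List[List[T]]:
    """Deal items into `parts` lists whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    # Part k receives items k, k + parts, k + 2 * parts, ...
    return [list(items[start::parts]) for start in range(parts)]
```

The same function could be applied to the account, proxy, and keyword lists in turn, with each resulting part written to its own subfolder.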

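Note that this script was written for scraping ** technology-entity data and may not suit other use cases.

Alternatively, building on the idea above, a Python main process could manage the scraper subprocesses and hand each one its next piece of work dynamically, which avoids an unbalanced split of keywords. (No such script has been written.)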
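
One possible shape for such a manager is sketched below; it is not part of the repository. Threads in the main process only wait on subprocesses, so the scraping itself still runs in separate processes. Several points are assumptions: each worker gets its own directory (since accounts.db lives in the current directory), keyword file paths are absolute, twsrc is importable from every worker directory, and `--keywords` is the only flag used because it is the only one documented above. `run_dynamic` and `twsrc_command` are invented names.

```python
import queue
import subprocess
import sys
import threading
from typing import Callable, Dict, List, Sequence


def twsrc_command(keyword_file: str) -> List[str]:
    """Command line for one scraper run; only --keywords is documented above."""
    return [sys.executable, "-m", "twsrc", "--keywords", keyword_file]


def run_dynamic(
    keyword_files: Sequence[str],
    worker_dirs: Sequence[str],
    build_command: Callable[[str], List[str]] = twsrc_command,
) -> Dict[str, int]:
    """Run one subprocess per keyword file, giving the next file to whichever
    worker becomes free. Returns {keyword file: exit code}."""
    pending: "queue.Queue[str]" = queue.Queue()
    for path in keyword_files:
        pending.put(path)
    exit_codes: Dict[str, int] = {}
    lock = threading.Lock()

    def worker(cwd: str) -> None:
        while True:
            try:
                path = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to hand out
            # Assumption: cwd holds this worker's own accounts.db.
            result = subprocess.run(build_command(path), cwd=cwd)
            with lock:
                exit_codes[path] = result.returncode

    threads = [threading.Thread(target=worker, args=(d,)) for d in worker_dirs]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return exit_codes
```

Splitting the keywords into many small files, rather than one file per worker, is what lets a fast worker pick up extra work instead of sitting idle.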

