This is a simple crawler design for bilibili that can collect basic data associate with each video. For example, it will collect the number of comment
, dannmaku
, coin
of each video.
After tuning parameters in configuration file and run it in my server, it can collect approximately 300k video data per hour. However, this number is trivial compared with the total video number in bilibili. By the time 2018/11/14, the max av
number is larger than 35 million, which means it will cost this program several days to collect all video data, hopefully. And this number increases over 50000 daily.
The speed of Internet.
As you can see, outward internet flow reaches upper limit as soon as I running this program
create databsase and tables with the following command
mysql> source table.sql
rename config-default.json
to config.json
and edit it.
If you want to use proxy, change GetProxy()
function in proxy/proxy.go
to provide your proxy ip addresses. The quality of your proxy server has a great influence on the overall performance
go build
./bilibiliCrawler [-v[d] startAid endAid][-o[d]][-b[d]]
This parameter determine the max number of crawler go routines. Although a go routine is much lighter than a thread, this number cannot be too large. Because the bottleneck is the speed of internet. In my server, 300
seems reasonable.
This parameter determine the max number a go routine failed to crawl a video data until it gives up. It highly depends on your internet reliability and your proxy server's reliability if you choose to use proxy. No one knows why some data isn't properly transmitted. If you're not confident with your internet, try a higher value. In my case, I use 10
, 20
or even 30
.
This parameter determine the max number of connection that mysql connection pool holds. Mysql server limits the number of connection to 100
by default, so I set this parameter to 95
in case some emergency may occur.
A crawler program is actually pretty simple, however, when requiring the program running several hours or even forever; it indeed becomes difficult. A simple bug may not matter for a small program running a few seconds, but is fatal to a long-staying service program. It took me 4 days finally fix all bugs. In this process, I met lots of strange bugs that I even never think about. Some relate to mysql
, some relate to network knowledge. Here I want to explain the most challenging bug in detail.
After I finish main part of my crawler, I run it in my server. A strange situation occurred. After running smoothly for several hours, my program cashes suddenly. I checked mysql
error log and my crawler's error log, but found nothing. I took me a whole day to realize that it results from memory leaking. After several hours running my program, my program occupied almost all available memory. Then linux system kill this process because of running out of memory, this mechanism also referred as OOM Killer
or Out Of Memory Killer
. The following link explains the detail of this leak and how to fix it.