Crawler mode speed
Zibri opened this issue · 13 comments
Zibri commented
Even after increasing the connection limits, I notice that crawler mode gets only 60 peers/minute.
Is there a setting to increase the speed?
With another crawler I have I can get 100000/hour!
shiyanhui commented
Did you run it on a local network? This crawler can't run behind NAT at the moment.
Zibri commented
Sure. I read the documentation.
Throughput seems to be 60-100 per minute. With p2pspider I get 3200/minute after an hour.
shiyanhui commented
Is it the sample that you are running?
- There are two kinds of peer messages in the DHT protocol, `get_peers` and `announce_peer`. `get_peers` messages far outnumber `announce_peer` messages. Only `announce_peer` is what we want. The example only prints BT seeds that were fetched successfully. I don't know what `p2pspider` prints.
- We get an `announce_peer` message, and then we fetch the BT seed. If that fails, the `ip:port` is put on a blacklist, and the DHT crawler will not fetch from it again until some time in the future.
Zibri commented
Yes, I am running the sample.
I understand what you are saying, but on the same PC, with the same bootstrap servers, the two programs behave differently.
Yours starts at about 60 peers/minute and stays there even after 3 hours.
p2pspider (which I suggest you test, just for comparison) also starts at 60, then in about 2 hours it is at full speed and consumes almost half of my bandwidth,
and also 100% CPU time...
p2pspider runs on nodejs because it is written in JavaScript. I just wanted to test yours to see if I get similar results with less CPU time (the bandwidth comes with it, I think).
Or maybe there is something I need to set?
Regards,
Zibri
http://www.zibri.org
https://twitter.com/Zibri
shiyanhui commented
OK, I'll figure it out.
ilcn commented
I have the same question as Zibri. Intrinsically, golang should be faster than nodejs, and I am listening for announce_peer right now. Please let us know if there is anything in the config we can tweak to make the spider mode go faster.
Thanks
fanpei91 commented
I have rewritten p2pspider (https://github.com/fanpei91/p2pspider) from node to golang recently. Same efficiency as before, but higher performance.
Zibri commented
At the moment I am using simdht with pypy... check it out. But I still think C is the way...
fanpei91 commented
A golang version of simdht is here: godht
I don't think C is the way. You should look into the performance of the golang 1.9 runtime.
Zibri commented
I am checking dht in go…
I see the output:
link: magnet:?xt=urn:btih:49a2afaa0a3bb5e1eb45cb2cc598c7ed6cd9c2c5
node: 2.136.205.155:58236
peer: 2.136.205.155:58236
How can I include the announced file name (if present in the announcement)?
What I need is just the hash and the name.
Zibri commented
Hmm
Even though I have never coded in Go, I did this:
for announce := range dht.Announce {
    // "a" holds the arguments of the announce_peer query.
    rawa, ok := announce.Raw["a"].(map[string]interface{})
    if !ok {
        continue
    }
    // "name" is only set when the announcing client sends it.
    fmt.Printf("link: magnet:?xt=urn:btih:%v\nraw: %v\n\n",
        announce.InfohashHex, rawa["name"])
}
And now it writes the hash and the name.
Hmm.. I wonder how the speed compares to simDHT with pypy.
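For readers who want to try the same idea outside the crawler loop, here is a self-contained sketch of pulling the optional name out of a decoded announce message. The helper names `announceName` and `describe` are hypothetical, and the sketch assumes the bencode decoder yields Go `string` values; a decoder that yields `[]byte` would need a different type assertion.

```go
package main

import "fmt"

// announceName returns the optional "name" entry from the arguments ("a")
// of a decoded announce_peer message. Hypothetical helper; it assumes
// strings are decoded as Go string values.
func announceName(raw map[string]interface{}) (string, bool) {
	args, ok := raw["a"].(map[string]interface{})
	if !ok {
		return "", false
	}
	name, ok := args["name"].(string)
	return name, ok
}

// describe formats the magnet link and, when present, the announced name.
func describe(infohashHex string, raw map[string]interface{}) string {
	link := "link: magnet:?xt=urn:btih:" + infohashHex
	if name, ok := announceName(raw); ok {
		return link + "\nname: " + name
	}
	return link
}

func main() {
	withName := map[string]interface{}{
		"a": map[string]interface{}{"name": "example.iso"},
	}
	withoutName := map[string]interface{}{
		"a": map[string]interface{}{},
	}
	hash := "49a2afaa0a3bb5e1eb45cb2cc598c7ed6cd9c2c5"
	fmt.Println(describe(hash, withName))
	fmt.Println(describe(hash, withoutName))
}
```

Checking both type assertions avoids a panic when a client omits the name or sends a malformed message.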
Zibri commented
By the way, as of now my crawler (a modified simDHT running multithreaded) is doing:
(20494 hashes / min) (8019 unique hashes /min)
Bandwidth: 11834.47 / 6569.03 Kbit/s
But I think that with the right coding it could go much higher than that!
Zibri commented
I did some testing.
Even with 1500 friends/sec, the speed is ridiculously low:
pid 1217 speed 27 running time 29:51
Total speed: 27
Top speed: 75
Top speed 75? After half an hour? I get that speed in 1 minute with the modified simdht.
And look at its speed now:
pid 3687 speed 12561 running time 19-19:34:17
pid 3686 speed 6662 running time 19-19:34:17
Total speed: 19223
TOP SPEED
Top speed: 19223
That’s 20K hashes per minute!