Need help to identify the issue in the implementation.
yogesh-desai opened this issue · 3 comments
Hello,
I am using example/full/main.go in my crawler and scraper. The link has the implementation. When I run the code, it crawls and scrapes as expected, but it consumes too much memory and creates too many goroutines, which causes the program to exit after some time (approximately an hour in my case).
Please help me understand where I'm going wrong and what I need to take care of. I believe I'm missing or misunderstanding something, which is causing this. I am a newbie, so please feel free to ask for more explanation/clarification if needed.
Thank you.
Understood the issue. My implementation was wrong. Corrected it. Thank you.
Hello Yogesh,
Glad you got it figured out! However, what you described is indeed a possible issue with fetchbot - that is, it will happily let you use too much memory and create many goroutines if you don't constrain your crawl (i.e. if you call q.Send with many different hosts, and faster than it can process the URLs).
For the benefit of others who may stumble onto this page: when crawling an uncontrolled number of URLs (i.e. not limited to a given website or a known small set of URLs), it is highly recommended to:
- Store the URLs in a database instead of calling Queue.Send from a Handler
- Use a semaphore-like channel to rate-limit the producer, signaling availability in the Handler (i.e. <- semaphore before calling Queue.Send, semaphore <- 1 when exiting the Handler); a minimal sketch of this pattern follows below
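A rough sketch of that semaphore pattern, assuming the fetchbot API shown in its README (nextURL is a hypothetical stand-in for reading pending URLs out of a database; it is not part of fetchbot):

```go
package main

import (
	"log"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

const maxInFlight = 10 // at most this many URLs queued but not yet handled

func main() {
	// Buffered channel used as a counting semaphore. It starts full, so the
	// producer may enqueue up to maxInFlight URLs before it has to wait.
	semaphore := make(chan int, maxInFlight)
	for i := 0; i < maxInFlight; i++ {
		semaphore <- 1
	}

	h := fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
		// ... process res here; store any newly discovered URLs in a
		// database instead of calling ctx.Q.SendString ...

		// Signal the producer that one slot is free again.
		semaphore <- 1
	})

	f := fetchbot.New(h)
	q := f.Start()

	// Producer: reads pending URLs from storage and rate-limits itself with
	// the semaphore so the fetch queue never grows without bound.
	go func() {
		for {
			u, ok := nextURL() // hypothetical: next pending URL from the database
			if !ok {
				break
			}
			<-semaphore // wait for a free slot before enqueuing
			if _, err := q.SendStringGet(u); err != nil {
				log.Printf("enqueue %s: %v", u, err)
			}
		}
		q.Close()
	}()

	q.Block()
}

// nextURL is a placeholder for reading the next pending URL from persistent storage.
func nextURL() (string, bool) { return "", false }
```

The invariant is that tokens in the channel plus URLs in flight always equal maxInFlight, so the producer blocks whenever maxInFlight URLs are queued but not yet handled, and each finished Handler call returns one token.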
Have fun!
Martin
Hello Martin,
I see; that is useful information. In my case, I wanted to crawl the entire website "http://www.tokopedia.com".
Also, I currently want to use a channel to enqueue the links and then process them further through an extractor function. Here is my code. I am trying to send the URLs to the channel wherever ctx.Q.SendString is called, but this is not working. Please help me understand how to use the channel and send the URLs to it.
Please check lines 115 and 267 in my implementation.
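For illustration, a minimal sketch of that channel-based pattern (this is not the linked implementation; pageLinks is a hypothetical helper standing in for real link extraction):

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/fetchbot"
)

func main() {
	// Buffered channel carrying discovered links from the handler to the
	// extractor goroutine.
	links := make(chan string, 100)

	// Extractor: consumes the URLs pushed by the handler.
	go func() {
		for u := range links {
			fmt.Println("extract:", u) // placeholder for the real extractor logic
		}
	}()

	h := fetchbot.HandlerFunc(func(ctx *fetchbot.Context, res *http.Response, err error) {
		if err != nil {
			return
		}
		// pageLinks is a hypothetical helper that parses res.Body and returns
		// the absolute URLs found on the page.
		for _, u := range pageLinks(res) {
			// Wherever ctx.Q.SendString was called before, push the URL onto
			// the channel instead and let the extractor decide what to do
			// with it (filter, store, or re-enqueue).
			links <- u
		}
	})

	f := fetchbot.New(h)
	q := f.Start()
	if _, err := q.SendStringGet("http://www.tokopedia.com"); err != nil {
		fmt.Println("enqueue error:", err)
	}
	q.Block()
}

// pageLinks is a stand-in for real link extraction from the response body.
func pageLinks(res *http.Response) []string { return nil }
```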
Thank you.