lgraubner/sitemap-generator-cli

Large site issue

bartleman opened this issue · 6 comments

I increased Node's memory limit to 32 GB so I could crawl a site with 2+ million links. When it eventually finished, it had only generated files 27-44, without displaying any errors.

So files 1-26 are missing? Memory should not be a problem if you are using v6: the sitemap data is streamed to files, which keeps memory usage low.
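
For illustration only (a rough sketch, not the library's actual internals), streaming means each URL is appended to the sitemap file as soon as it is discovered, so memory stays roughly constant no matter how large the site is:

const fs = require('fs')

// open one sitemap part and write the XML header up front
const out = fs.createWriteStream('sitemap-part.xml')
out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')

// each discovered URL goes straight to disk instead of into a giant in-memory array
function addUrl (url) {
  out.write('  <url><loc>' + url + '</loc></url>\n')
}

// close the document when the crawl finishes
function finish () {
  out.end('</urlset>\n')
}

addUrl('https://example.com/')
finish()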

Currently it's limited to five requests per second. This option could easily be exposed. Hopefully I will have some spare time this weekend to check why the files are missing.
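
(For a sense of scale: at five requests per second, 2 million URLs work out to 2,000,000 / 5 = 400,000 seconds, roughly 4.6 days of crawling, so being able to raise the rate matters for a site this large.)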

I could not reproduce the issue. I tried generating more sitemap files by lowering the maximum number of URLs per file (see the option), and it seems to work fine. Maybe there is indeed a difference when the files are bigger. Anyway, I suspect it might have to do with some async iteration, so I changed it to serial execution instead.
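
A rough sketch of what that change amounts to (hypothetical code, not the actual implementation): if the per-file writes run concurrently, a dropped promise or a race between writes can fail silently, whereas awaiting them one at a time surfaces each error in order:

// Hypothetical before: writes fired in parallel, returned promises ignored,
// so a failure in one file can go unnoticed.
async function writeAllParallel (files, writeFile) {
  files.forEach(file => writeFile(file))
}

// Hypothetical after: serial execution, each write completes (or throws)
// before the next one starts.
async function writeAllSerial (files, writeFile) {
  for (const file of files) {
    await writeFile(file)
  }
}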

Any chance you could check out the branch linked above and test? It's not the CLI, but you can test it easily with the following code:

// load the library from the locally checked-out branch
const SitemapGenerator = require('./lib')

const gen = SitemapGenerator('https://example.com', {
  maxConcurrency: 20 // number of crawler workers
})

gen.start()

Simply run it with node index.js.

I also exposed the maxConcurrency option, which specifies the number of workers used.

Testing with concurrency set to 100, it appears to run at the same rate; increasing the concurrency doesn't increase the speed.

I also get a "JavaScript heap out of memory" error after about an hour of running unless I increase the heap size. I'm currently using: --max_old_space_size=16384
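
For reference, that flag is passed to node itself when running the script, e.g.:

node --max_old_space_size=16384 index.js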

RayBB commented

I think it would be quite handy if you added the max concurrency as a flag.
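
A minimal sketch of how such a flag could be wired up (the flag name and the parsing here are assumptions for illustration, not the CLI's actual interface):

// Hypothetical flag wiring; the real CLI may use a different option parser.
const SitemapGenerator = require('sitemap-generator')

const args = process.argv.slice(2)
const url = args[0]

const options = {}
const flagIndex = args.indexOf('--max-concurrency')
if (flagIndex !== -1) {
  options.maxConcurrency = Number(args[flagIndex + 1])
}

const gen = SitemapGenerator(url, options)
gen.start()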