Missing reports from large sites
emersonthis opened this issue · 7 comments
Describe the bug
I'm trying to generate reports for a large site (~6760 pages), but it only produces 437 report files.
To Reproduce
Steps to reproduce the behavior:
git clone ...
npm install
cd ./auto-lighthouse
npm run start -- -u https://example.com --format=csv --respectRobots=false
Expected behavior
I expect the crawler to find ~6760 pages, then generate 13522 report files (two per page, plus two extra for the aggregated reports).
Instead, I find ~437 report files and an error in the console.
- The terminal shows `Pushed: ...` 6760 times.
- It then says `Generating 13522 reports!` (so far so good).
- Then I see `Wrote ...` 437 times, followed by an error.
- There are 437 files in the expected directory, including the two aggregate reports.
It appears that the script is choking on something before it finishes writing all the files. It may have something to do with the race condition mentioned in this unmerged PR.
Here's an abridged version of the full transcript:
~/Code/auto-lighthouse[master]$ npm run start -- -u https://example.com --format=csv --respectRobots=false
> auto-lighthouse@1.3.0 start /Users/emerson/Code/auto-lighthouse
> node cli "-u" "https://example.com" "--format=csv" "--respectRobots=false"
Not automatically opening reports when done!
Starting simple crawler on https://example.com!
Pushed: https://example.com/page1
Pushed: https://example.com/page2
...
Generating 13522 reports!
Wrote desktop report: https://example.com/ at: /Users/emerson/Code/auto-lighthouse/lighthouse/7_22_2020_3_59_16PM
...
Wrote desktop report: https://example.com/page1 at: /Users/emerson/Code/auto-lighthouse/lighthouse/7_22_2020_3_59_16PM
Error: not opened
at WebSocket.send (/Users/emerson/Code/auto-lighthouse/node_modules/ws/lib/WebSocket.js:344:18)
at CriConnection.sendRawMessage (/Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/gather/connections/cri.js:167:14)
at CriConnection.sendCommand (/Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/gather/connections/connection.js:66:10)
at Driver._innerSendCommand (/Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/gather/driver.js:397:29)
at /Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/gather/driver.js:359:35
at new Promise (<anonymous>)
at Driver.sendCommandToSession (/Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/gather/driver.js:350:12)
at Driver.sendCommand (/Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/gather/driver.js:377:17)
at /Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/gather/driver.js:983:22
at async Driver.gotoURL (/Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/gather/driver.js:1134:26) {
friendlyMessage: undefined
}
TypeError: Cannot read property 'categories' of undefined
at Function.generateReportCSV (/Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/report/report-generator.js:72:37)
at /Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/report/report-generator.js:105:32
at Array.map (<anonymous>)
at Function.generateReport (/Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/report/report-generator.js:98:32)
at processResults (/Users/emerson/Code/auto-lighthouse/lighthouse_runner.js:96:33)
at /Users/emerson/Code/auto-lighthouse/lighthouse_runner.js:76:21
at async processReports (/Users/emerson/Code/auto-lighthouse/lighthouse_runner.js:68:13)
at async /Users/emerson/Code/auto-lighthouse/lighthouse_runner.js:131:9
at async Promise.all (index 0)
at async /Users/emerson/Code/auto-lighthouse/lighthouse_runner.js:213:13
TypeError: Cannot read property 'categories' of undefined
at Function.generateReportCSV (/Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/report/report-generator.js:72:37)
at /Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/report/report-generator.js:105:32
at Array.map (<anonymous>)
at Function.generateReport (/Users/emerson/Code/auto-lighthouse/node_modules/lighthouse/lighthouse-core/report/report-generator.js:98:32)
at processResults (/Users/emerson/Code/auto-lighthouse/lighthouse_runner.js:96:33)
at /Users/emerson/Code/auto-lighthouse/lighthouse_runner.js:76:21
at async processReports (/Users/emerson/Code/auto-lighthouse/lighthouse_runner.js:68:13)
at async /Users/emerson/Code/auto-lighthouse/lighthouse_runner.js:131:9
at async Promise.all (index 0)
at async /Users/emerson/Code/auto-lighthouse/lighthouse_runner.js:213:13
Done with reports!
My gut intuition says this is because the crawler is finding items on the internet that it shouldn't be passing to Lighthouse (but is, because of the `respectRobots=false` flag). I can take a look at this, though I'm not sure when. My gut debug check is to `console.log` the `queueItem.uriPath` when the crawler adds URLs to the queue.
However, it could also be the size of the queue or something else entirely; I'm not sure where to look for the root cause.
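The gut check described above could look something like the sketch below: log every URL as it enters the queue and skip resources that obviously aren't HTML pages. This is a hypothetical illustration, not the tool's actual code; the extension list and the `shouldAudit` name are my own inventions.

```javascript
// Hypothetical pre-queue filter: log each queued URL (the debug check
// mentioned above) and reject obvious non-HTML resources that Lighthouse
// shouldn't audit. The extension list is illustrative, not exhaustive.
const SKIP_EXTENSIONS = ['.pdf', '.jpg', '.jpeg', '.png', '.gif', '.zip', '.css', '.js'];

function shouldAudit(uriPath) {
  console.log('Queued:', uriPath);
  const lower = uriPath.toLowerCase();
  return !SKIP_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

module.exports = { shouldAudit };
```

If the log shows PDFs or images being pushed, that would support the theory that Lighthouse is being handed non-page URLs.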
I'm unable to reproduce this error. When running auto-lighthouse on example.com, my crawler only finds two pages. Can you provide more details? @emersonthis
Sorry for the confusion. example.com isn't the real site I'm testing; it's just a placeholder. The real site belongs to a client, so I can't list it here for privacy reasons. I suspect you'll be able to reproduce this with any big site. Maybe try Wikipedia? Or Amazon?
As you mentioned in PR #133, it seems possible that this issue is the result of this upstream Lighthouse issue. (Or maybe I'm misunderstanding?)
Having discovered this, I'm curious what the implications are for the current architecture of this tool. It sounds like we're not supposed to run Lighthouse many times inside a `for` loop... maybe a child process resolves this. I'm less sure about how to address what they say about not running Lighthouse concurrently on the same machine...
From my understanding, and based on this comment from Patrick, it sounds like running Lighthouse in parallel is a valid use case if you're okay with a loss of accuracy in the performance metrics. I don't know your or your company's use case, but if you're using Lighthouse to audit the other metrics, maybe I can create some way to handle that.
I'd have to do some timing tests to see how fast Lighthouse can run when auditing only the performance metrics before I could justify my first thought at a solution, though. For context, I'm thinking of a parallel run of the non-performance categories followed by a sequential run of the performance category. However, that means running Lighthouse four times on each page, which is why I'd need a quick check of how fast auditing can be done with different categories.
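The two-phase scheduling described here could be sketched roughly as follows. `runAudit` is a hypothetical stand-in for a call into Lighthouse with a restricted category list (Lighthouse's `onlyCategories` setting); it is passed in as a parameter so only the scheduling logic is shown, not any real Lighthouse invocation.

```javascript
// Sketch of the proposed scheduling: non-performance categories audited in
// parallel, then the performance category audited one page at a time so the
// metrics aren't skewed by concurrent Chrome instances.
// `runAudit(url, categories)` is a hypothetical stand-in for Lighthouse.
async function auditAll(urls, runAudit) {
  // Phase 1: parallel runs for categories that tolerate resource contention.
  const nonPerf = ['accessibility', 'best-practices', 'seo'];
  const parallelResults = await Promise.all(
    urls.map((url) => runAudit(url, nonPerf))
  );

  // Phase 2: strictly sequential performance runs.
  const perfResults = [];
  for (const url of urls) {
    perfResults.push(await runAudit(url, ['performance']));
  }
  return { parallelResults, perfResults };
}

module.exports = { auditAll };
```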
I support the idea of just adding the child process. One way to offset the potential inaccuracy (resulting from resource limitations) might be to add an option to control the amount of concurrency. The user could then choose the balance between accuracy and performance. With the concurrency set to 1, I don't think we should expect to hit resource limits, because that is exactly the use case the tool was designed for. Users with more horsepower, or less concern for accuracy, could turn up the concurrency to run more tests in parallel. Based on my limited understanding of the relevant code, both of these seem pretty straightforward to do.
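A minimal sketch of such a user-configurable concurrency limit is below (the names are hypothetical, not the tool's actual API). With a limit of 1 the runs are strictly sequential, matching Lighthouse's recommended usage; higher values trade performance-metric accuracy for throughput.

```javascript
// Run an array of async task functions with at most `concurrency` of them
// in flight at once. Results are returned in the original task order.
async function runWithConcurrency(tasks, concurrency) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed task index.
  // This is safe without locks because Node is single-threaded: there is
  // no await between reading and incrementing `next`.
  async function worker() {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }

  const workers = Array.from(
    { length: Math.min(concurrency, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}

module.exports = { runWithConcurrency };
```

Each task would wrap one Lighthouse run (or one forked child process), and the CLI flag would simply feed the user's chosen number into `concurrency`.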
Issue is looking mighty stale