clj-commons/etaoin

Sporadic but Frequent Firefox Errors on Linux

lread opened this issue · 2 comments

lread commented

Version
master, f8c4944 as of this writing.

Platform
Operating System: PopOS Linux
Clojure version: 1.11.1.1273
Babashka version: 1.3.179
JDK vendor and version: 17.0.7 temurin

Browser vendor: firefox
Browser version: 112.0.1
WebDriver version: geckodriver 0.33.0

Symptom
When running tests under Firefox I'll sometimes get a java.net.SocketTimeoutException: Read timed out. The exception shows an attempt to disconnect is occurring:

...
	at slingshot.support$stack_trace.invoke(support.clj:201)
	at etaoin.impl.client$call.invokeStatic(client.cljc:131)
	at etaoin.impl.client$call.invoke(client.cljc:95)
	at etaoin.api$execute.invokeStatic(api.clj:235)
	at etaoin.api$execute.invoke(api.clj:209)
	at etaoin.api$delete_session.invokeStatic(api.clj:275)
	at etaoin.api$delete_session.invoke(api.clj:269)
	at etaoin.api$disconnect_driver$fn__8617.invoke(api.clj:3631)
	at etaoin.api$disconnect_driver.invokeStatic(api.clj:3631)
	at etaoin.api$disconnect_driver.invoke(api.clj:3625)
	at etaoin.api$quit.invokeStatic(api.clj:3650)
	at etaoin.api$quit.invoke(api.clj:3645)
	at etaoin.api_test$fixture_browsers$fn__8724.invoke(api_test.clj:57)
	at etaoin.api_test$fixture_browsers.invokeStatic(api_test.clj:57)
	at etaoin.api_test$fixture_browsers.invoke(api_test.clj:53)
...

Reproduction
Happens sporadically but frequently.

`bb test-doc`

Or

`bb test:jvm --browsers firefox --suite api

Actual behavior
Read timeout exception is thrown.

Expected behavior
Tests should pass.

Diagnosis
I don't think this is an Etaoin problem.
It seems to me like a Firefox and/or GeckoDriver problem.

If I turn on detailed logging for firefox via options:

{:driver-log-level "trace" :log-stdout :inherit :log-stderr :inherit}

When things go awry I see geckodriver logs like this:

WARNING: A blocker encountered an error while we were waiting.
          Blocker:  TelemetryController: shutting down
          Phase: profile-before-change-telemetry
          State: Error getting state: Error: Phase "profile-before-change" is finished, it is too late to register completion condition "OS.File: flush I/O queued before profileBeforeChange" at addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:727:15
addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:523:26
addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:458:15
setupShutdown@resource://gre/modules/osfile/osfile_async_front.jsm:1548:28
@resource://gre/modules/osfile/osfile_async_front.jsm:1568:16
@resource://gre/modules/osfile.jsm:12:30
@resource://gre/modules/TelemetryStorage.sys.mjs:10:28
_checkPendingPings@resource://gre/modules/TelemetrySend.sys.mjs:863:17
setup@resource://gre/modules/TelemetrySend.sys.mjs:803:18
setup@resource://gre/modules/TelemetrySend.sys.mjs:241:30
setupTelemetry/this._delayedInitTask<@resource://gre/modules/TelemetryControllerParent.sys.mjs:824:36
observe@resource://gre/modules/AsyncShutdown.sys.mjs:576:16

WARNING: Error: Phase "profile-before-change" is finished, it is too late to register completion condition "OS.File: flush I/O queued before profileBeforeChange"
WARNING: addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:727:15
addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:523:26
addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:458:15
setupShutdown@resource://gre/modules/osfile/osfile_async_front.jsm:1548:28
@resource://gre/modules/osfile/osfile_async_front.jsm:1568:16
@resource://gre/modules/osfile.jsm:12:30
@resource://gre/modules/TelemetryStorage.sys.mjs:10:28
_checkPendingPings@resource://gre/modules/TelemetrySend.sys.mjs:863:17
setup@resource://gre/modules/TelemetrySend.sys.mjs:803:18
setup@resource://gre/modules/TelemetrySend.sys.mjs:241:30
setupTelemetry/this._delayedInitTask<@resource://gre/modules/TelemetryControllerParent.sys.mjs:824:36
observe@resource://gre/modules/AsyncShutdown.sys.mjs:576:16

JavaScript error: resource://gre/modules/AsyncShutdown.sys.mjs, line 727: Error: Phase "profile-before-change" is finished, it is too late to register completion condition "OS.File: flush I/O queued before profileBeforeChange"
JavaScript error: resource://gre/modules/AsyncShutdown.sys.mjs, line 727: Error: Phase "profile-before-change" is finished, it is too late to register completion condition "OS.File: flush I/O queued before profileBeforeChange"
JavaScript error: resource://gre/modules/AsyncShutdown.sys.mjs, line 727: Error: Phase "profile-before-change" is finished, it is too late to register completion condition "OS.File: flush I/O queued before profileBeforeChange"
JavaScript error: resource://gre/modules/AsyncShutdown.sys.mjs, line 727: Error: Phase "profile-before-change" is finished, it is too late to register completion condition "OS.File: flush I/O queued before profileBeforeChange"
JavaScript error: resource://gre/modules/AsyncShutdown.sys.mjs, line 727: Error: Phase "profile-before-change" is finished, it is too late to register completion condition "OS.File: flush I/O queued before profileBeforeChange"
JavaScript error: resource://gre/modules/AsyncShutdown.sys.mjs, line 727: Error: Phase "profile-before-change" is finished, it is too late to register completion condition "OS.File: flush I/O queued before profileBeforeChange"
console.error: (new Error("SessionFile is closed", "resource:///modules/sessionstore/SessionFile.sys.mjs", 370))
console.error: (new AbortError("IOUtils: Shutting down and refusing additional I/O tasks", (void 0), 890))
JavaScript error: resource://gre/modules/CrashManager.jsm, line 927: AbortError: IOUtils: Shutting down and refusing additional I/O tasks

Best guess: It feels like a race condition, something that is starting up is maybe still starting up while we are shutting down, and then things go kaboom.

I do see a bug reported around this for firefox, but it is reported as fixed. But maybe not for Firefox v112.0.1?

If I'm right about the race condition, this scenario probably won't happen much in typical use.
Etaoin tests sometimes do very little and are therefore generate very short lived session.
If a session were allowed to live longer than these internal firefox services take to startup, my guess is that we'd be fine.

This might also be an issue on macOS and Windows.
My dev box is PopOS Linux.

This might not occur as much on CI because CI is pretty slow and this slowness might give things the time they need to avoid this condition.

Action
Not sure yet.
Thought I'd raise the issue to maybe see if others are suffering it too.
Maybe wait until the next release of Firefox before taking action.
If we need to compensate we could try to ensure that internal services are started before we shutdown.

rgri commented

I am also getting infrequent Read timed out errors, also without pretty much any reproducibility. It's occuring in the context of a call to pmap. I'm trying to perform some actions on each row of a table. To make this faster, I'm trying to:

(pmap #(e/with-firefox-headless driver (doStuff driver %)) rows)

This has been a real headache. I can run my script 3 times and 2/3 tries run wtihout issue. Unsure how to proceed.

@rgri, you might want to stick with a single WebDriver process and not go parallel at all.
If you are still seeing problems, please open a new issue.

I'm not seeing this on CI anymore, so gonna close.