ulixee/secret-agent

problems with IPuppetFrame.waitForLoad method

andynuss opened this issue · 2 comments

For all of my testing, as soon as I have awaited the agent.goto() method to visit a requestUrl that I am scraping,
I then do the following:

await agent.activeTab.waitForLoad(LocationStatus.DomContentLoaded, { timeoutMs: my_longish_timeout });

This means that when I get to the point of using my own plugin code to make any necessary calls to
IPuppetFrame.waitForLoad() before calling IPuppetFrame.evaluate(), I can be sure that the main frame
is loaded successfully, but cannot depend on other frames having been loaded without calling waitForLoad.

The first thing I noticed, while testing two of my own localhost pages (pure html, each referencing
simple url-based iframes), is a strange condition. Immediately after calling
agent.activeTab.waitForLoad (above), when I reference a url-based child frame of the main page:
(1) the child frame's isLoaded property is true, but
(2) the child frame's url property is undefined! Given that I know this iframe's src attribute was set in
the original html of the main page, not by javascript, this has to mean that frame.isLoaded is not telling
the truth! Thus I call frame.waitForLoad anyway, and when it resolves successfully, I do see the
url that I expect based on the frameEl's "src" attribute.

Thus my current code only skips calling IPuppetFrame.waitForLoad when (1) frame.isLoaded is true,
AND (2) frame.url is "truthy".
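
The skip rule above can be written as a small predicate. This is only a sketch: FrameLike is a hypothetical stand-in for the two properties discussed here, not the real IPuppetFrame interface, and needsWaitForLoad is a name made up for illustration.

```typescript
// Hypothetical minimal shape: only the two properties discussed in this issue.
// This is NOT the real IPuppetFrame interface.
interface FrameLike {
  isLoaded: boolean;
  url?: string;
}

// Returns true when waitForLoad should still be called for this frame.
// The wait is skipped only when isLoaded is true AND url is truthy.
function needsWaitForLoad(frame: FrameLike): boolean {
  return !(frame.isLoaded && Boolean(frame.url));
}
```

With this rule, a frame that reports isLoaded = true but still has an undefined url is treated as not yet loaded.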

That is the first problem.


The second problem is that I wondered what happens if I called frame.waitForLoad exactly one time
for every frame that I see for the first time, regardless of what the frame.isLoaded boolean says
(given that from the above, we can see that it is not always accurate). I decided to do this at two
levels, for the mainFrame and for the child frames, and found that I got LOTS of 30 second timeouts
from these unnecessary waitForLoad calls on the main page! When I turned off
this "extra work" for the main page (as I should, given I call agent.activeTab.waitForLoad as indicated above),
I still got quite a few 30 second timeouts from unnecessary waits on child frames.

So the second problem is that calling waitForLoad on a frame (main or child) that already has a
truthy url should not cause such failures, yet it does.


The third problem is related to the fact that I cannot trust isLoaded, as stated at the top. This means
that I do in fact call waitForLoad when isLoaded is true but frame.url is undefined. Unfortunately,
this does not make ALL the 30 second timeouts disappear. There are still a few.


The fourth problem is that I feel I could do a better job debugging this if I had control over the
30 second timeout, and could adjust it up or down. It seems fixed.
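
As a stopgap while debugging, a caller-side deadline can be raced against any promise. The sketch below uses a hypothetical helper, withTimeout, which is not part of secret-agent. It can only make a wait fail sooner than the internal 30 second limit; it cannot lengthen that limit.

```typescript
// Hypothetical helper, not part of secret-agent: rejects if `work` has not
// settled within `timeoutMs`. It cannot extend a timeout enforced inside `work`.
function withTimeout<T>(
  work: Promise<T>,
  timeoutMs: number,
  label: string = 'operation',
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  // Stop the timer once either side has settled, so it cannot fire later.
  const clear = (): void => {
    if (timer !== undefined) clearTimeout(timer);
  };
  // Whichever promise settles first decides the outcome.
  return Promise.race([work, deadline]).then(
    (value) => {
      clear();
      return value;
    },
    (error) => {
      clear();
      throw error;
    },
  );
}
```

Wrapping the promise returned by frame.waitForLoad in such a helper would at least let a test fail after a few seconds instead of thirty.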


And finally, wondering about the 3 types of injected-by-javascript doms (other than rewriting the
document itself), I found that for iframes that come from the 'srcdoc' attribute, the IPuppetFrame.url
does reach the 'about:srcdoc' state.

Likewise for iframes injected by dynamic javascript on top of an initial 'about:blank' src attribute,
or no src attribute at all, all these IPuppetFrame.url values reach the 'about:blank' state.

But for iframes whose 'src' is initially set to a string that loads the entire page and its javascript
by escaping the html into a url that begins with 'javascript:', frame.url is STILL undefined
after waitForLoad resolves successfully.
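
The observations above can be summarized as a lookup from how an iframe is declared to the frame.url value seen after waitForLoad resolves. This is a sketch of what was observed in this report, not documented behavior; IframeSource and observedFrameUrl are hypothetical names.

```typescript
// Hypothetical description of how an iframe was declared in the page.
interface IframeSource {
  src?: string; // value of the src attribute, if any
  srcdoc?: string; // value of the srcdoc attribute, if any
}

// Returns the frame.url value observed in this report after a successful
// waitForLoad, or undefined where the report saw url stay undefined.
function observedFrameUrl(source: IframeSource): string | undefined {
  if (source.srcdoc !== undefined) return 'about:srcdoc';
  const src = (source.src || '').trim();
  // No src, or an explicit about:blank that javascript fills in later.
  if (src === '' || src === 'about:blank') return 'about:blank';
  // Whole page escaped into a javascript: url: url was never set.
  if (/^javascript:/i.test(src)) return undefined;
  // Url-based iframe. A relative src would resolve against the parent
  // page, which this sketch does not attempt.
  return src;
}
```

A check that treats an undefined url as "not loaded" would therefore wait forever on the 'javascript:' case, so that case needs separate handling.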


In view of trying to figure out the best approach to minimize errors, I was wondering:
what exactly does IPuppetFrame.waitForLoad do?

BTW, I do not see problems with standalone tests of any single url. It seems that this happens under
load, and randomly and in a non-reproducible way, and the ec2 instance I am using currently
is an m3.medium.

@andynuss Thank you so much for your work on your plugins. Unfortunately, you've uncovered that we unintentionally exposed undocumented "internal APIs" that were never intended for developer consumption. That isLoaded variable is not related to the "load" event of the page. It's about loading internal state. You need to translate these puppetFrames into FrameEnvironments on the client and use the APIs provided by those. If you encounter these load issues with the FrameEnvironments on the client, please log those. I'm going to close this for now, as these events are internal and we will likely close them off from plugins at some point to reduce confusion.