ulixee/secret-agent

Possible small bug in the resource returned by reload()

A-Posthuman opened this issue · 12 comments

When I try to use the text() method on a resource, it works if the resource was returned from a goto(), but the resource returned by reload() results in:

resource.text is not a function

Is there a way to get the page body text from a reload() resource?
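
Roughly what I'm doing, as a simplified sketch (example.org stands in for the real site):

    const gotoResource = await agent.goto('https://example.org');
    const htmlFromGoto = await gotoResource.text();      // works

    const reloadResource = await agent.reload();
    const htmlFromReload = await reloadResource.text();  // throws "resource.text is not a function"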

Also, a separate quick question: what is the quickest way, performance-wise, to get a copy of the page body text?

I've noticed, while doing some basic performance timing, that resource.text() gets me a copy of the body text 2 to 3 times faster than using something from the awaited DOM like agent.document.body.outerHTML. Is the awaited DOM inherently always going to be slower, and is there an even faster way than resource.text()?

Thanks @A-Posthuman. Looks like you uncovered a bug for reload. It should be giving you the resource text.

Resource.text is going to give you back the HTML from the HTTP response. It will likely be different from the HTML you get via outerHTML, which comes from your live page, i.e. the rendered state of the page. outerHTML will be slower because it needs to evaluate the current state of the page.
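
For example, roughly (a sketch):

    // HTML exactly as it came over the wire, before any scripts ran:
    const resource = await agent.goto('https://example.org');
    const rawHtml = await resource.text();

    // HTML of the live, rendered page, including changes made by javascript:
    const renderedHtml = await agent.document.body.outerHTML;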

Thanks for explaining the difference. So, depending on which version of the page HTML I want, are these 2 properties generally the fastest way, performance-wise, to get the HTML? I'm looking to get it all and hand it off to a faster HTML parser library for speed... I'm finding that for longer, more in-depth page scrapes of many different elements, the SA awaited DOM is too slow.

Your fastest option is probably to detach a Frozen Tab (https://github.com/ulixee/secret-agent/blob/main/examples/detach.ts). It learns all the awaited commands you use on the first run and then prefetches them all upfront on subsequent runs.
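
A rough sketch along the lines of that example (the URL and selector are placeholders, and it assumes the frozen tab exposes the same awaited document shown in the example):

    import agent from 'secret-agent';

    await agent.goto('https://example.org');

    // Detach a frozen copy of the tab. Awaited-DOM queries run against it are
    // recorded, so on later runs they can all be prefetched upfront.
    const frozenTab = await agent.detach(agent.activeTab);

    const titles = await frozenTab.document.querySelectorAll('h2');
    for (const title of titles) {
      console.log(await title.textContent);
    }
    await agent.close();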

Ok, yes I had experimented with that a while back, but actually forgot about it, so thank you for reminding me. After working with it more tonight it does appear to speed things up, but it looks like overall my extraction library functions will still end up being faster using a separate HTML parser library.

One thing about frozen tabs I'm not 100% clear on from the documentation and example: if I'm visiting a series of different URLs but using the same extraction queries on each one, does the detach method somehow know to keep re-using the previously learned queries, or is this a case where I need to supply a key to clue it in? In other words, does a set of detached learned queries only apply to a specific URL, i.e. is it URL-based?

Also, when are the learned queries lost or cleared - are they in a /tmp db?

I also ran into a crash while using frozen tabs, I guess I'll open a separate issue for that.

The learned queries go into the "sessions.db" database in your sessions directory. They go away whenever that gets cleared out (it's in a tmp directory, so usually on restart).

Regarding your question: it will run the exact same queries against each URL. If you're branching and doing if/else, that can cause issues where it doesn't know about some of the branches. The second parameter is the key that will do the branching for you, but you only need it if you have slightly different selectors/logic for each. Does that make sense?
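
For example (a sketch; the key names here are made up):

    // Same selectors/logic for every URL: no key needed
    const frozenTab = await agent.detach(agent.activeTab);

    // Slightly different selectors/logic per page type: pass a key so each
    // variant learns and prefetches its own set of queries
    const frozenProduct = await agent.detach(agent.activeTab, 'product-page');
    const frozenSearch = await agent.detach(agent.activeTab, 'search-page');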

Would love to see your performance example if you can make it shareable. We haven't done everything we could to make it as fast as possible, so a comparison and/or example would be really nice to see, i.e. how much slower is it right now?

There is some branching in my extraction code; however, it isn't based on the URL but comes into play dynamically depending on what content is found in the DOM of each page. So I don't think I could know which 2nd-parameter key to use with detach() prior to detaching, at least not without first slowly testing some of those decision points in the page, which would defeat much of the purpose of freezing the page.

If I don't use a 2nd-parameter key and just keep detaching each URL I visit, would the db learn, over time, all the various queries my code "might" run, and be prepared for them all once every possible if/then branch has been seen at least once?

Once I get some more comprehensive timing data on this vs the other HTML parsing library, I'll provide some info. May take several more days.

An update on performance. I've made an extraction library of 21 functions to scrape all the various bits from some particular pages I'm interested in. Here's how the runtimes compare between the Awaited DOM version and a version ported to use a popular JavaScript HTML parsing module, on 2 random pages with no resource types blocked (page 1 has roughly double the HTML size, around 2 MB, compared to page 2). Times are in ms:

Result without using detached/frozen tabs:

Example page 1:
Awaited DOM total time to run my 21 functions: 9326.23
HTML parser: time to copy await document.body.outerHTML from SA into a variable: 291.37
             time to parse that HTML into the module's root object: 232.76
             time to run my 21 functions: 491.58
             total: 1015.73

Example page 2:
Awaited DOM total time to run my 21 functions: 6813.71
HTML parser: time to copy await document.body.outerHTML from SA into a variable: 178.57
             time to parse that HTML into the module's root object: 137.43
             time to run my 21 functions: 126.53
             total: 442.53

Result of 1st run using detached/frozen tabs:

Example page 1:
Awaited DOM total time to run my 21 functions: 920.92
HTML parser: time to copy await document.body.outerHTML from SA into a variable: 245.91
             time to parse that HTML into the module's root object: 83.90
             time to run my 21 functions: 457.43
             total: 787.24

Example page 2:
Awaited DOM total time to run my 21 functions: 446.73
HTML parser: time to copy await document.body.outerHTML from SA into a variable: 118.17
             time to parse that HTML into the module's root object: 48.76
             time to run my 21 functions: 88.80
             total: 255.73

Result of 2nd run using detached/frozen tabs:

Example page 1: SA usually crashes when I try doing 2nd runs of frozen sessions, but I managed to get 1 result:
Awaited DOM total time to run my 21 functions: 241.26
HTML parser: time to copy await document.body.outerHTML from SA into a variable: 0.24
             time to parse that HTML into the module's root object: 104.61
             time to run my 21 functions: 301.53
             total: 406.38

Example page 2: SA crashed during this
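
(For reference, the HTML-parser path being timed above has roughly this shape; the parser module isn't named here, so cheerio is shown as a stand-in, and runMy21Functions is a placeholder for my extraction functions.)

    import { performance } from 'perf_hooks';
    import * as cheerio from 'cheerio';

    const t0 = performance.now();
    const html = await agent.document.body.outerHTML;  // copy the HTML out of SA
    const t1 = performance.now();
    const root = cheerio.load(html);                    // parse into the module's root object
    const t2 = performance.now();
    runMy21Functions(root);                             // the 21 extraction functions
    const t3 = performance.now();
    console.log({ copyMs: t1 - t0, parseMs: t2 - t1, runMs: t3 - t2 });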

Conclusions:

Detached/frozen tabs are definitely way more competitive than non-frozen tabs; however, I continue to have stability issues/crashes while using them, particularly when I try to run my script a 2nd time after the 1st run.

In terms of the direct comparison of the time to run my functions, excluding the time required to copy resource.response.text() into a variable and parse it, my functions typically run 2 to 4x quicker using this JS library vs. the awaited DOM frozen tab, although the single 2nd frozen-tab run I managed to complete without a crash hints that SA may improve even further on repeat runs.

The main thing I'd like to figure out is how I can get the current/"final" page HTML more quickly, either from await document.body.outerHTML or some other way, since that is taking a significant amount of time on non-frozen runs and also on the 1st frozen run. Alternatively, is there something I can do to get the 2nd-and-beyond frozen runs to work without crashing? await document.body.outerHTML was super fast on the one final run where I got the frozen tab working a 2nd time; it would be ideal if it could always be that quick without crashing.

Thanks for all the analysis!! Yes, definitely need to fix these bugs. Are there more crashes than the ones you sent over?

At least one of them is due to an attribute with the value "{}". Something is getting tripped up trying to rebuild the DOM with that value. It's possible that issue is also causing the "Cannot read property 'id' of undefined" error. Will dig further this week.

If you can avoid doing the full outerHTML, that's going to be ideal. The browser is where all that time goes: from my understanding, the DOM engine has to wait for a rendering cycle to complete (which is likely the source of your timing variation), and then it has to recurse the full tree.

So far, those examples of frozen tab bugs I sent in the other issue seem to be the primary offenders; I haven't seen any others.

You mention avoiding the full outerHTML, but if my current goal is simply to get the rendered HTML (post-JS execution) and dump it into my HTML parser helper module, is there a faster method or technique I can use?

If you need the full DOM, I'm not sure of a faster way. If there's some way you can grab chunks of the DOM (e.g. query a selector and load the DOM for only that element), you might see better performance.
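
Something along these lines (the selector is just a placeholder):

    // Serialize only the container you care about instead of the whole body:
    const container = agent.document.querySelector('#main-content');
    const chunkHtml = await container.outerHTML;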
