Extract web text directly instead of OCR
Opened this issue · 6 comments
I'm working on something pretty similar to what you guys are doing and had a thought: why not grab text directly from the web page instead of using OCR? LangChain and LlamaIndex both have tools for this, and there are also some repos for converting HTML to Markdown.
Just a thought. Would love to know what you think!
Seconding the ask for a Motivations section that discusses when to use this in lieu of parsing the DOM.
That's a good question. I'd be curious to see their approaches and what the performance is like.
For us, it's very important to preserve as much of the visual structure of the page as possible. This includes the positions of the text on the 2D plane. Using just the HTML and skipping the actual rendering of the page, you lose a lot of this information. We need this because a) we want our agents to reason about and take actions on the page just as we would, and b) visibility of elements on screen is required for automation frameworks to actually take actions (you cannot "click" on elements that don't actually appear on the page).
For example, suppose you had a scrollable container element containing 10 child elements total, with 5 elements overflowing and requiring scrolling the parent container to view. I would imagine the other approaches would display the overflowed elements in the final representation, while we want to avoid doing this (because if an agent were to try to click on these elements, it would cause an element_not_found error).
Hope this makes sense, happy to elaborate further @eshoyuan. (And apologies for the late response) If @will-holley or anyone wants to add this to the README, happy to take a PR!
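To make the scrollable-container example concrete, here is a minimal sketch of the geometry involved. It is not how this project decides visibility; `Rect` and `visible_fraction` are hypothetical helpers, and the pixel sizes are made up so that exactly 5 of the 10 children fit inside the container.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box in page pixels (hypothetical helper type)."""
    x: float
    y: float
    width: float
    height: float


def visible_fraction(child: Rect, container: Rect) -> float:
    """Fraction of the child's area that lies inside the container's box."""
    overlap_w = min(child.x + child.width, container.x + container.width) - max(child.x, container.x)
    overlap_h = min(child.y + child.height, container.y + container.height) - max(child.y, container.y)
    if overlap_w <= 0 or overlap_h <= 0 or child.width <= 0 or child.height <= 0:
        return 0.0
    return (overlap_w * overlap_h) / (child.width * child.height)


# A 100px-tall container holding ten 20px-tall rows (assumed sizes):
# rows 0-4 fit inside the container, rows 5-9 overflow below it.
container = Rect(x=0, y=0, width=200, height=100)
rows = [Rect(x=0, y=20 * i, width=200, height=20) for i in range(10)]
visible_rows = [i for i, row in enumerate(rows) if visible_fraction(row, container) > 0]
```

A DOM-to-text conversion would list all ten rows, whereas a screenshot-based representation would only ever contain the rows in `visible_rows`.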
I agree with the motivation. One issue I see with the approach is when there are images embedded in a webpage that contain text but are not really actionable by the LLM. E.g., on https://app.sequence-erp.com/login, the only things that matter are the login elements on the left-hand side, but the OCR algo fails to recognize this:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sequence-erp.com
sequence
sequence Recherche par mot - clé dans " Active section "
Dashboard Ventes Achats Clients avec factu
Total factures impayées 15,000.00 CHF Total factures impayées 20,000.00 CHF Les 5 avec le plus de volu
& Mon espace En temps En retard En temps En retard Client
5,000.00 CHF 10,000.00 CHF 10,000.00 CHF 10,000.00 CHF Green Line
Projets Helio
Français Banking
Ventes Jours depuis la dernière importation bancaire 22 jours ( 01/10/2022 )
Bonjour Sequencer ! Dernière période importée 01 / 09 / 2022-01 / 10 / 2022
Achats Transactions en attente de réconciliation 16 transactions
Vous nous manquiez déjà ! Ressources humaines FOL
Profit & Loss Les !
6 derniers mois Clients
Banking Four Les 5 avec
Email *
Comptabilité Client
[ # 0 ]
Rapports GR Greem
I. 1. I.
Helio
Mot de passe [ 1 ] Mot de passe oublié ?
Sata A
[ # 2 ] ] [ $ 3 ] Ventes Clie
6 derniers mois M & L
MA Masa
[ $ 4 ] Se connecter
Clients
attente
Pas de compte sur Sequence ? Créer mon compte Mois en co
4.8 / 5 sur Google Avis G
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
@tvatter the thing is that there are some cases where the image information might actually be important. Perhaps those cases are few and far between, though.
What's happening specifically is that the smaller text of the image hurts the rendering of the page. We could maybe pass in an option to ignore images in the OCR (turning them invisible before taking a screenshot)
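A rough sketch of what such an option might look like, assuming a Playwright `Page` from the sync Python API. `screenshot_without_images` and `HIDE_IMAGES_CSS` are hypothetical names, not part of this project. Using `visibility: hidden` rather than `display: none` keeps the layout intact, so the remaining text stays where it was.

```python
# CSS injected before the screenshot (assumption: hiding <img> elements is enough;
# CSS background images are NOT covered by this rule).
HIDE_IMAGES_CSS = "img { visibility: hidden !important; }"


def screenshot_without_images(page, path: str) -> None:
    """Hide <img> elements on the page, then save a screenshot to `path`.

    `page` is assumed to be a Playwright Page (sync API); any object that
    exposes `add_style_tag(content=...)` and `screenshot(path=...)` works.
    """
    page.add_style_tag(content=HIDE_IMAGES_CSS)
    page.screenshot(path=path)
```

The screenshot produced this way could then be passed to OCR as usual. Whether this helps on the login page above depends on how that page embeds its image, which I have not checked.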
> For example, suppose you had a scrollable container element containing 10 child elements total, with 5 elements overflowing and requiring scrolling the parent container to view. I would imagine the other approaches would display the overflowed elements in the ultimate representation, while we want to avoid doing this (Because if an agent were to try and click on these elements, it would cause an element_not_found error)
For this example, how does the agent know that the container is scrollable, e.g. if it needed to scroll down to find a specific element? Have you found that agents can handle this case? I'd imagine it's necessary to add a scrollable tag for OCR.
Separately, what do you mean by an element_not_found error? AFAIK, if an element is not visible due to overflow you can still trigger any event listener like click; it's only if that element wasn't in the DOM at all that you'd get an element_not_found error, right?
Hey @craigmulligan, yeah, a scrollable tag would be of interest. We should make a ticket!
Playwright will typically require the element to be visible on screen or somewhat clickable (at least with defaults).
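As a starting point for that ticket, here is a minimal sketch of how a scrollable tag could be derived. It assumes the three inputs are read from the DOM for each container (`element.scrollHeight`, `element.clientHeight`, and `getComputedStyle(element).overflowY`). The function names and the `[SCROLLABLE]` marker are hypothetical, not an existing tag format in this project.

```python
def is_scrollable(scroll_height: int, client_height: int, overflow_y: str) -> bool:
    """True if content overflows vertically and the computed overflow-y allows scrolling."""
    return scroll_height > client_height and overflow_y in ("auto", "scroll")


def tag_scrollable(text: str, scrollable: bool) -> str:
    """Prefix a container's text with a marker the agent can key on (hypothetical format)."""
    return f"[SCROLLABLE] {text}" if scrollable else text


# Example: a 200px-tall container whose content is 500px tall.
tagged = tag_scrollable("Clients", is_scrollable(500, 200, "auto"))
```

This only covers vertical scrolling; horizontal overflow would need the same check with `scrollWidth`, `clientWidth`, and `overflowX`.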