Extract web text directly instead of OCR

Question

Opened this issue 7 months ago · 6 comments

I'm working on something pretty similar to what you guys are doing and had a thought. Why not grab text directly from the web page instead of using OCR? LangChain and LlamaIndex both have tools for this, and there are also some repos for converting HTML to Markdown.

Just a thought. Would love to know what you think!

Answer 1 · 2024-03-26T20:32:01.000Z

Seconding the ask for a Motivations section that discusses when to use this in lieu of parsing the DOM.

Answer 2 · 2024-05-16T17:08:41.000Z

That's a good question. I would be curious to see their approaches and what their performance is like.

For us, it's very important to retain as much of the visual structure of the page as possible, including the positions of the text on the 2D plane. If you use just the HTML and skip the actual rendering of the page, you lose a lot of this information. We need it because a) we want our agents to reason about and take actions on the page just as we would, and b) automation frameworks require elements to be visible on screen before they can act on them (you cannot "click" on elements that don't actually appear on the page).

For example, suppose a scrollable container holds 10 child elements, 5 of which overflow and can only be seen by scrolling the parent container. I would imagine the other approaches would include the overflowed elements in the final representation, while we want to avoid this (because if an agent tried to click on those elements, it would cause an element_not_found error).
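
To make the scrollable-container example concrete, here is a minimal sketch of the kind of visibility filter described above. It is not the project's actual implementation: the `Rect` type, the `visible_fraction` and `visible_children` helpers, and the 50% threshold are all assumptions made for illustration. It clips each child's bounding box against the container's box and keeps only the children that are mostly on screen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned bounding box in page pixels (hypothetical helper type)."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def visible_fraction(child: Rect, container: Rect) -> float:
    """Fraction of the child's area that lies inside the container's box."""
    overlap_w = min(child.right, container.right) - max(child.left, container.left)
    overlap_h = min(child.bottom, container.bottom) - max(child.top, container.top)
    area = child.width * child.height
    if overlap_w <= 0 or overlap_h <= 0 or area == 0:
        return 0.0
    return (overlap_w * overlap_h) / area


def visible_children(children, container, min_fraction=0.5):
    """Indices of children that are at least min_fraction visible (assumed threshold)."""
    return [
        i for i, child in enumerate(children)
        if visible_fraction(child, container) >= min_fraction
    ]


# A 250px-tall container with ten 50px-tall rows: only the first five fit.
container = Rect(left=0, top=0, width=200, height=250)
children = [Rect(left=0, top=i * 50, width=200, height=50) for i in range(10)]
print(visible_children(children, container))  # [0, 1, 2, 3, 4]
```

A DOM-only extraction has no equivalent of `container` to clip against, which is why it would report all ten rows.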

Hope this makes sense, happy to elaborate further @eshoyuan. (And apologies for the late response) If @will-holley or anyone wants to add this to the README, happy to take a PR!

Answer 3 · 2024-07-11T08:48:07.000Z

I agree with the motivation. One issue I see with the approach arises when images embedded in a webpage contain text that is not really actionable by the LLM. For example, on https://app.sequence-erp.com/login the only things that matter are the login elements on the left-hand side, but the OCR algorithm fails to recognize this:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sequence-erp.com
						 sequence
																																												       sequence								 Recherche	 par    mot    - clé dans     " Active     section     "
																																													Dashboard						Ventes										    Achats										    Clients	  avec	 factu
																																																				 Total   factures      impayées	  15,000.00       CHF			       Total   factures       impayées	 20,000.00       CHF			       Les   5  avec     le  plus   de   volu
																																												 &     Mon       espace					  En   temps				  En    retard				  En   temps			     En    retard				      Client
																																																				 5,000.00	 CHF			 10,000.00	  CHF		       10,000.00	  CHF		  10,000.00	 CHF				      Green       Line
																																													Projets																													       Helio
				     Français																																														    Banking
																																													Ventes						  Jours       depuis       la   dernière	importation	     bancaire					   22    jours       ( 01/10/2022	  )
				      Bonjour						Sequencer								!																															 Dernière	  période	 importée									       01    / 09 / 2022-01	   / 10  / 2022
																																													Achats						   Transactions	   en    attente	 de   réconciliation							 16   transactions
				    Vous	      nous	     manquiez			déjà	    !																														Ressources	 humaines																										  FOL
																																																				 Profit	&   Loss																			      Les   !
																																																				 6 derniers       mois																						 Clients
																																													Banking																												      Four	      Les     5   avec
				     Email	      *
																																													Comptabilité																														  Client
					      [  #   0   ]
																																													Rapports																															GR	Greem
																																																											       I.			   1.			 I.
																																																																														   Helio
				     Mot	    de	 passe				      [	1  ]      Mot	 de       passe	   oublié	      ?
																																																																														  Sata	A
					      [  #   2    ] ]											  [ $    3   ]																															      Ventes																					      Clie
																																																				 6 derniers       mois																							     M   &   L
																																																																													MA	Masa
										 [ $   4   ]	Se	 connecter
																																																																												       Clients
																																																																												       attente
				     Pas	de       compte		sur      Sequence		    ?    Créer	     mon	  compte																																																									 Mois	en     co
																																																												 4.8     /  5    sur       Google	       Avis	    G
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Answer 4 · 2024-07-22T19:53:08.000Z

@tvatter the thing is that there are some cases where the image information might actually be important. Perhaps those cases are few and far between, though.

What's happening specifically is that the smaller text in the image hurts the rendering of the page. We could maybe add an option to ignore images in the OCR (making them invisible before taking the screenshot).
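
A rough sketch of what such an option could look like, assuming `page` is a Playwright sync-API `Page` (only its `add_style_tag` and `screenshot` methods are used). The helper names and the selector list are hypothetical, not part of the project:

```python
def build_hide_css(selectors=("img", "picture", "video")):
    """Build a CSS rule that hides the given elements without removing them."""
    return ", ".join(selectors) + " { visibility: hidden !important; }"


def screenshot_without_images(page, path=None):
    """Inject the hiding rule, then take the screenshot that would be fed to OCR.

    `page` is assumed to be a Playwright sync-API Page object.
    """
    page.add_style_tag(content=build_hide_css())
    return page.screenshot(path=path)
```

Using `visibility: hidden` rather than `display: none` keeps each image's box in the layout, so the remaining text should stay at its original 2D position. This sketch would not catch text baked into CSS background images.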

Answer 5 · 2024-07-23T19:13:20.000Z

> For example, suppose a scrollable container holds 10 child elements, 5 of which overflow and can only be seen by scrolling the parent container. I would imagine the other approaches would include the overflowed elements in the final representation, while we want to avoid this (because if an agent tried to click on those elements, it would cause an element_not_found error).

For this example, how does the agent know that the container is scrollable, for instance if it needed to scroll down to find a specific element? Have you found that agents can handle this case? I'd imagine it would be necessary to add a scrollable tag to the OCR output.

Separately, what do you mean by an element_not_found error? AFAIK, if an element is not visible due to overflow you can still trigger any event listener such as click; it's only if the element weren't in the DOM at all that you'd get an element_not_found error, right?

Answer 6 · 2024-08-14T05:23:56.000Z

Hey @craigmulligan, yeah, a scrollable tag would be of interest. We should make a ticket!
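
As a starting point for that ticket, a scrollable tag could be derived from three values a browser exposes for every element: `scrollHeight`, `clientHeight`, and the computed `overflow-y` style. The sketch below assumes those values have already been read from the page; the `ElementMetrics` type and the `[scrollable]` suffix are hypothetical, not an existing output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementMetrics:
    """Values assumed to be collected from the rendered page for one element."""
    label: str
    scroll_height: int   # element.scrollHeight
    client_height: int   # element.clientHeight
    overflow_y: str      # getComputedStyle(element).overflowY


def is_vertically_scrollable(m: ElementMetrics) -> bool:
    """True when the content is taller than the box and the style allows scrolling."""
    return m.overflow_y in ("auto", "scroll") and m.scroll_height > m.client_height


def annotate(m: ElementMetrics) -> str:
    """Append a hypothetical [scrollable] marker to the element's label."""
    return f"{m.label} [scrollable]" if is_vertically_scrollable(m) else m.label


print(annotate(ElementMetrics("Results", 500, 250, "auto")))     # Results [scrollable]
print(annotate(ElementMetrics("Sidebar", 500, 250, "visible")))  # Sidebar
```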

Playwright will typically require the element to be visible on screen and reasonably clickable (at least with the defaults).
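
To illustrate the difference being discussed, here is a small sketch of three ways to trigger a click, assuming `locator` is a Playwright `Locator` from the sync API. The `click_element` wrapper and its `mode` parameter are hypothetical; `click`, `click(force=True)`, and `dispatch_event` are the Playwright calls being compared.

```python
def click_element(locator, mode="default"):
    """Trigger a click on a Playwright Locator in one of three ways.

    - "default":  locator.click() runs Playwright's actionability checks first.
    - "force":    locator.click(force=True) skips those checks.
    - "dispatch": locator.dispatch_event("click") fires the DOM event directly,
                  so listeners run even if the element is hidden by overflow.
    """
    if mode == "default":
        locator.click()
    elif mode == "force":
        locator.click(force=True)
    elif mode == "dispatch":
        locator.dispatch_event("click")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
```

The "dispatch" mode matches the behavior @craigmulligan describes, while the "default" mode is closer to how a user would interact with the page.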