Cleaning and compression of raw html

Question

Dear authors, I would like to ask how I can utilize part of your work to clean and compress a raw html file to get a new compressed html file. After reading the official website, I tried using the Python library functions you provide to do this. Is that the right approach?
Another question is: if those python libraries are utilized, would this involve the selection of candidates, that is, keeping the content corresponding to the candidate? Wouldn't that require artificially setting the candidate in advance?
One of my ideal effects would be to input the html file, then process it through, and the output would be the html file after the operation, so if I need to set the candidate ahead of time wouldn't I be missing automation?
Looking forward to your reply, and thank you for your insights!

Answer 1 · 2024-09-27T15:53:38.000Z

I would like to ask how I can utilize part of your work to clean and compress a raw html file to get a new compressed html file.

Although you may find the library useful for cleaning HTML files, that is not its primary goal; rather, our goal is to process HTML files so they can be ingested by LLMs for predicting web actions.
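
If plain cleaning and compression is all that is needed, with no candidate selection, it can be done without this library. The following is a minimal sketch using only the Python standard library, not anything from webllama. The names `HtmlCleaner`, `clean_html`, `DROP_TAGS` and `KEEP_ATTRS`, as well as the choice of which tags and attributes to keep, are assumptions made for illustration and should be adapted to your pages.

```python
import html
import re
from html.parser import HTMLParser

# Assumption: these elements carry nothing worth keeping, so they are dropped with their content.
DROP_TAGS = {"script", "style", "noscript", "template"}
# Void elements have no closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Assumption: a small allow-list of attributes; everything else is discarded.
KEEP_ATTRS = {"id", "class", "href", "src", "alt", "title", "type", "name", "value"}


class HtmlCleaner(HTMLParser):
    """Re-emits HTML without comments, dropped tags, or non-allow-listed attributes."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_decl(self, decl):
        # Keep declarations such as <!DOCTYPE html>.
        if not self.skip_depth:
            self.out.append(f"<!{decl}>")

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = ""
        for name, value in attrs:
            if name in KEEP_ATTRS:
                # Boolean attributes have value None.
                kept += f" {name}" if value is None else f' {name}="{html.escape(value)}"'
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Collapse whitespace runs; whitespace-only text nodes are dropped.
        # This is lossy for <pre> blocks and for spaces between inline elements.
        text = re.sub(r"\s+", " ", data)
        if text.strip():
            self.out.append(html.escape(text, quote=False))

    # Comments are dropped because the default handle_comment does nothing.


def clean_html(raw: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(raw)
    cleaner.close()
    return "".join(cleaner.out)
```

Usage is `compressed = clean_html(open("page.html", encoding="utf-8").read())`. Because this approach only strips markup and never selects candidates, it runs without any input besides the HTML itself.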

if those python libraries are utilized, would this involve the selection of candidates, that is, keeping the content corresponding to the candidate? Wouldn't that require artificially setting the candidate in advance?

You can use the DMR retriever to dynamically find relevant candidates, given a context (an action history similar to those in WebLINX). For a concrete example, see: https://github.com/McGill-NLP/webllama/blob/main/examples/complete/run_all.py

One of my ideal effects would be to input the html file, then process it through, and the output would be the html file after the operation, so if I need to set the candidate ahead of time wouldn't I be missing automation?

If I understand your question correctly: you can use the DMR model as part of your automation pipeline, so you can get candidates automatically given the raw HTML, bounding box coordinates, and action/dialogue history.
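
To make the shape of such a pipeline concrete, here is a rough sketch of the "score candidates against the history, keep the top k" step. It does not use DMR or any webllama API: the scorer is a simple token-overlap (Jaccard) stand-in, and the `Candidate` fields, the bounding-box visibility filter, and the function names are all hypothetical. In a real pipeline the scoring function would be replaced by the retriever shown in the linked example.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    uid: str      # hypothetical element identifier
    html: str     # outer HTML of the element
    bbox: tuple   # hypothetical layout: (x, y, width, height)


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, cand: Candidate) -> float:
    """Jaccard overlap between query and element tokens (stand-in for a learned ranker)."""
    q, c = tokens(query), tokens(cand.html)
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def select_candidates(query: str, candidates: list, top_k: int = 10) -> list:
    """Keep the top_k highest-scoring candidates, with no manual selection."""
    # Assumption: elements with an empty bounding box are not visible, so skip them.
    visible = [c for c in candidates if c.bbox[2] > 0 and c.bbox[3] > 0]
    ranked = sorted(visible, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:top_k]


candidates = [
    Candidate("a", "<button>Search flights</button>", (0, 0, 80, 20)),
    Candidate("b", '<a href="/about">About us</a>', (0, 30, 60, 20)),
    Candidate("c", '<input name="search">', (0, 0, 0, 0)),
]
history = "user: please search for flights"
selected = select_candidates(history, candidates, top_k=1)
print([c.uid for c in selected])
```

The point of the sketch is only that the candidate set is computed from the page and the history at run time, so nothing has to be set by hand in advance.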