This repository is a starting point to test strategies for extracting structured information from a given URL using a language model.
You can visit this page to see an example dataset that is used for evaluation of this agent together with an experimental run using the default agent.
- A bare minimum implementation that can be used to extract structured information from a given URL (`src/agent` folder).
- A dataset and evaluation script to evaluate the performance of the agent doing the extraction (`src/eval` folder).
The agent is a LangGraph agent that uses a language model to extract structured information from a given URL. The agent is implemented in the `src/agent` folder.
The agent does the following:
- Accepts a URL and a JSON schema as input from a user.
- Fetches the HTML content of a given URL.
- Parses the HTML content into text.
- Uses a vanilla chat model capable of tool calling to extract structured information from the text that matches the schema.
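The parsing step above can be sketched with only the standard library. This is an illustrative stand-in, not the repository's actual implementation in `src/agent`; the `html_to_text` name is hypothetical.

```python
# Minimal sketch of the HTML-to-text parsing step, using only the
# standard library. The real agent's parser lives in src/agent and
# may differ; html_to_text is an illustrative name.
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    return "\n".join(parser._chunks)
```

The resulting text is what gets handed to the chat model along with the user's JSON schema.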
Install the langgraph CLI:

```shell
pip install "langgraph-cli[inmem]==0.1.61"
```

Install dependencies:

```shell
pip install -e .
```

Load API keys into the environment for the LangSmith SDK and OpenAI API:

```shell
export LANGSMITH_API_KEY=<your_langsmith_api_key>
# Or configure another chat model
export OPENAI_API_KEY=<your_openai_api_key>
```

Launch the agent:

```shell
langgraph dev
```

If all is well, you should see the following output:

```
Ready!
Docs: http://127.0.0.1:2024/docs
LangGraph Studio Web UI: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024
```
You can try to improve the extraction strategy in a variety of ways. For example,
- Improving the HTML parsing strategy.
- Adding handling of large HTML documents and deduplication of extracted information.
- Adding reflection steps.
- Extending this to work with data URLs and accepting other file formats like PDFs. (The `src/agent/parsing` module already has functionality to parse PDFs; you just need to hook it up.)
Before engaging in any optimization, it is important to establish baseline performance. This repository includes:
- A dataset consisting of a list of URLs and the expected structured information to be extracted from each URL.
- An evaluation script that can be used to evaluate the agent on this dataset.
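A dataset record pairs a URL and extraction schema with the expected output. The field names below are hypothetical, shown only to illustrate the shape; see `eval/create_dataset.py` for the actual format.

```python
# Illustrative shape of one evaluation dataset record. Field names
# ("inputs", "outputs", "url", "schema") are hypothetical here; the
# actual format is defined in eval/create_dataset.py.
example_record = {
    "inputs": {
        "url": "https://example.com/some-page",
        "schema": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
        },
    },
    "outputs": {"title": "Expected page title"},
}
```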
Make sure you have the LangSmith SDK installed:

```shell
pip install langsmith
```

And set your API keys:

```shell
export LANGSMITH_API_KEY=<your_langsmith_api_key>
# We're using an LLM as a judge, so we will need an API key
export OPENAI_API_KEY=<your_openai_api_key>
```

A score between 0 and 1 is assigned to each extraction result by an LLM model that acts as a judge.
The model assigns the score based on how closely the extracted information matches the expected information.
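To make the scoring idea concrete, here is a crude heuristic stand-in for the judge: the fraction of expected key/value pairs reproduced exactly in the extraction. This is not the repository's judge, which is an LLM prompted to compare the two objects; `naive_match_score` is purely illustrative.

```python
# Crude heuristic stand-in for the LLM judge, only to illustrate what a
# score between 0 and 1 means. The real judge is an LLM that compares
# extracted output to the expected output; this function is hypothetical.
def naive_match_score(expected: dict, extracted: dict) -> float:
    """Fraction of expected key/value pairs matched exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for k, v in expected.items() if extracted.get(k) == v)
    return hits / len(expected)
```

An LLM judge is preferred over exact matching because it tolerates paraphrases and formatting differences that this heuristic would penalize.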
Create a new dataset in LangSmith using the code in the eval folder:

```shell
python eval/create_dataset.py
```

To run the evaluation, you can use the run_eval.py script in the eval folder. This will create a new experiment in LangSmith for the dataset you created in the previous step.

```shell
python eval/run_eval.py --experiment-prefix "My custom prefix" --agent-url http://localhost:2024
```

- You can deploy it using LangGraph Platform.
- If you're deploying this agent yourself and the container is not network isolated (e.g., it can access other network resources), you should configure a proxy for use in web requests.