This repository is a starting point to test strategies for extracting structured information from a given URL using a language model.
You can visit this page to see an example dataset that is used for evaluation of this agent together with an experimental run using the default agent.
- A bare minimum implementation that can be used to extract structured information from a given URL (`src/agent` folder).
- A dataset and evaluation script to evaluate the performance of the agent doing the extraction (`src/eval` folder).
The agent is a LangGraph agent that uses a language model to extract structured information from a given URL. The agent is implemented in the `src/agent` folder.
The agent does the following:
- Accepts a URL and a JSON schema as input from a user.
- Fetches the HTML content of a given URL.
- Parses the HTML content into text.
- Uses a vanilla chat model capable of tool calling to extract structured information from the text that matches the schema.
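The parsing step above can be sketched with only the standard library. This is an illustrative stand-in, not the repository's actual implementation in `src/agent`; the `html_to_text` name is hypothetical.

```python
# Minimal sketch of the HTML-to-text parsing step, using only the
# standard library. The real agent's parser lives in src/agent and
# may differ; html_to_text is an illustrative name.
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    return "\n".join(parser._chunks)
```

The resulting text is what gets handed to the chat model along with the user's JSON schema.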
Install the langgraph CLI:

```shell
pip install "langgraph-cli[inmem]==0.1.61"
```

Install dependencies:

```shell
pip install -e .
```

Load API keys into the environment for the LangSmith SDK and OpenAI API:

```shell
export LANGSMITH_API_KEY=<your_langsmith_api_key>
# Or configure another chat model
export OPENAI_API_KEY=<your_openai_api_key>
```

Launch the agent:

```shell
langgraph dev
```

If all is well, you should see the following output:

```
Ready!
Docs: http://127.0.0.1:2024/docs
LangGraph Studio Web UI: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024
```
You can try to improve the extraction strategy in a variety of ways. For example,
- Improving the HTML parsing strategy.
- Adding handling of large HTML documents and deduplication of extracted information.
- Adding reflection steps.
- Extending this to work with data URLs and accepting other file formats like PDFs. (The `src/agent/parsing` module already has functionality to parse PDFs; you just need to hook it up.)
Before engaging in any optimization, it is important to establish baseline performance. This repository includes:
- A dataset consisting of a list of URLs and the expected structured information to be extracted from each URL.
- An evaluation script that can be used to evaluate the agent on this dataset.
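A dataset record pairs a URL and extraction schema with the expected output. The field names below are hypothetical, shown only to illustrate the shape; see `eval/create_dataset.py` for the actual format.

```python
# Illustrative shape of one evaluation dataset record. Field names
# ("inputs", "outputs", "url", "schema") are hypothetical here; the
# actual format is defined in eval/create_dataset.py.
example_record = {
    "inputs": {
        "url": "https://example.com/some-page",
        "schema": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
        },
    },
    "outputs": {"title": "Expected page title"},
}
```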
Make sure you have the LangSmith SDK installed:

```shell
pip install langsmith
```

And set your API keys:

```shell
export LANGSMITH_API_KEY=<your_langsmith_api_key>
# We're using an LLM as a judge, so we will need an API key
export OPENAI_API_KEY=<your_openai_api_key>
```

A score between 0 and 1 is assigned to each extraction result by an LLM model that acts as a judge.
The model assigns the score based on how closely the extracted information matches the expected information.
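To make the scoring idea concrete, here is a crude heuristic stand-in for the judge: the fraction of expected key/value pairs reproduced exactly in the extraction. This is not the repository's judge, which is an LLM prompted to compare the two objects; `naive_match_score` is purely illustrative.

```python
# Crude heuristic stand-in for the LLM judge, only to illustrate what a
# score between 0 and 1 means. The real judge is an LLM that compares
# extracted output to the expected output; this function is hypothetical.
def naive_match_score(expected: dict, extracted: dict) -> float:
    """Fraction of expected key/value pairs matched exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for k, v in expected.items() if extracted.get(k) == v)
    return hits / len(expected)
```

An LLM judge is preferred over exact matching because it tolerates paraphrases and formatting differences that this heuristic would penalize.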
Create a new dataset in LangSmith using the code in the eval folder:

```shell
python eval/create_dataset.py
```

To run the evaluation, you can use the run_eval.py script in the eval folder. This will create a new experiment in LangSmith for the dataset you created in the previous step.

```shell
python eval/run_eval.py --experiment-prefix "My custom prefix" --agent-url http://localhost:2024
```

- You can deploy it using LangGraph Platform.
- If you're deploying this agent yourself and the container is not network isolated (e.g., it can access other network resources), you should configure a proxy for use in web requests.