This project uses generative AI agents to turn food images into recipes. Built with LangGraph, LLM-powered tools, and conditional workflows, the application can extract ingredients, retrieve relevant documents, generate recipes, and run self-correcting workflows that catch and fix mistakes in generation.
- Routing: Adaptive RAG (paper). Route questions to different types of retrieval.
- Self-correction: Self-RAG (paper). Fix answers that contain hallucinations or don't answer the question.
- LLM Critics Help Catch LLM Bugs (paper). This research trains AI "critics" to assist humans in evaluating code written by other AI models, for more accurate evaluations.
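The Self-RAG-style correction loop above (grade the generation, regenerate when it is hallucinated or off-topic) can be sketched in plain Python. The function and grader names here are hypothetical stand-ins for the LLM-backed tools, not the project's actual API:

```python
def self_correcting_generate(question, docs, generate, is_grounded,
                             answers_question, max_retries=2):
    """Generate an answer, retrying while the grade checks fail (Self-RAG style)."""
    answer = generate(question, docs)
    for _ in range(max_retries):
        if is_grounded(answer, docs) and answers_question(answer, question):
            return answer                      # passed both grades
        answer = generate(question, docs)      # regenerate and re-grade
    return answer                              # best effort after retries
```

In the real pipeline, `generate` would be the RAG writer agent and the two grader callbacks would be LLM-backed hallucination and answer graders.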
- NVIDIA/GenerativeAIExamples
- LangGraph_HandlingAgent_IntermediateSteps
- Agent_use_tools_leveraging_NVIDIA_AI_endpoints.ipynb
- LangChain NVIDIA Integration
- Scenario for Image Assets generation
- ElevenLabs for audio in the demo video
The project is built with LangChain/LangGraph and runs with Docker Compose. Follow the steps below to get started.
- An NVIDIA API key, provided through a .env file
- Docker and Docker Compose installed on your machine
- Clone the repository:

```shell
git clone git@github.com:ttback/photo-to-recipe.git
cd photo-to-recipe
```

- Set up the NVIDIA_API_KEY key in a .env file (see .env.example).
- Build and run the Docker containers:

```shell
docker compose up
```
- Open the app in your browser: localhost:7860
The images in the images folder can be used to test the basic workflow with the burger, sushi, and non-food photos from the NVIDIA image-caption example. The vector DB contains burger recipes only, so the sushi photo exercises the most complete workflow: the initial RAG-based generation is rejected, and the ADDA team re-generates the recipe with the non-RAG process.
- Unsupervised image type detection: Handle food vs. non-food images without user interaction
- Automatic ingredient extraction from food photos: Use a recent multi-modal SLM (microsoft/phi-3-vision-128k-instruct) to extract ingredients from the food image
- Document retrieval: Transform online web pages into a vector store via LangChain and NVIDIA's embedding model, NV-Embed-QA
- Conditional (RAG or no-RAG) generation: Check whether the retrieved documents are relevant to the recipe generation process before proceeding with RAG-based generation. If the web URLs' content has changed, or the pages are unavailable, the ADDA team is smart enough to avoid RAG-based generation
- RAG-based recipe generation: Using the retrieved documents, the writer agent generates the recipe
- Automated hallucination checker: Agents check whether the generated recipe is grounded in the documents and matches the food and ingredients detected in the input image
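The conditional-generation feature reduces to a simple decision: grade each retrieved document and take the RAG path only when enough of them are relevant. A minimal sketch, where the grader callback and the threshold are illustrative assumptions rather than the project's exact logic:

```python
def choose_generation_path(docs, grade_relevance, min_relevant_ratio=0.5):
    """Return 'rag' when enough retrieved docs are graded relevant, else 'no_rag'."""
    if not docs:
        return "no_rag"  # nothing retrieved, e.g. the source URLs changed
    relevant = sum(1 for d in docs if grade_relevance(d))
    return "rag" if relevant / len(docs) >= min_relevant_ratio else "no_rag"
```

In the application, `grade_relevance` would be an LLM call (the relevance grader), and the returned label would drive a conditional edge in the graph.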
| Tool | Description | Model |
|---|---|---|
| `image_router` | Routes the image to the appropriate processing path based on its content. | microsoft/phi-3-vision-128k-instruct |
| `ingredients_recognizer` | Extracts ingredients from the image. | microsoft/phi-3-vision-128k-instruct |
| `image_caption` | Generates a caption for the image. | microsoft/phi-3-vision-128k-instruct |
| `doc_retriever` | Retrieves documents (downloaded from food.com) from a vector store based on the question. | NV-Embed-QA |
| `relevance_grader` | Grades the relevance of retrieved documents to the question. | meta/llama3-70b-instruct |
| `rag_recipe_generator` | Generates a recipe using RAG on retrieved documents. | meta/llama3-70b-instruct |
| `recipe_generator` | Generates a recipe without using RAG. | mistralai/mixtral-8x7b-instruct-v0.1 |
| `hallucination_grader` | Checks the generated recipe for hallucinations. | meta/llama3-70b-instruct |
| `answer_grader` | Grades the generated recipe against the documents and question. | meta/llama3-70b-instruct |
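The branching done by the `image_router` tool can be sketched as a plain function: the vision model classifies the image, and the label selects the downstream tool. The classifier callback and the label strings here are illustrative assumptions:

```python
def route_image(classify, image):
    """Route food images to ingredient extraction, everything else to captioning."""
    label = classify(image)  # e.g. phi-3-vision prompted to answer "food" or "not food"
    return "ingredients_recognizer" if label == "food" else "image_caption"
```

In the graph, this return value would be the key of a conditional edge that selects the next node.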
```mermaid
graph TD
A[Start] --> B{Is it a food image?}
B -->|Yes| C[Extract Ingredients]
B -->|No| D[Image Caption]
C --> E[Retrieve Recipe Documents]
E --> F{Are most recipe documents relevant?}
F -->|Yes| G[Generate Recipe using RAG]
F -->|No| H[Generate Recipe without RAG]
G --> I{Is the RAG generation grounded in documents?}
I -->|Yes| J{Does the RAG generation address the question?}
I -->|No| H
J -->|Yes| K[End]
J -->|No| H
D --> L[End]
H --> K
```
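The control flow in the diagram can be mimicked without LangGraph as a single function over boolean grades; each flag stands in for one LLM-backed decision node, and the returned path label names the branch that was taken. This is a stdlib-only sketch of the flow, not the project's implementation:

```python
def run_recipe_graph(image, is_food, docs_relevant, grounded, on_topic):
    """Walk the flow from the diagram; the boolean flags stand in for LLM grades."""
    if not is_food:
        return {"path": "caption", "output": "image caption"}
    if docs_relevant:
        recipe = "rag recipe"
        if grounded and on_topic:
            return {"path": "rag", "output": recipe}
    # Fall through on irrelevant docs, a hallucinated answer, or an off-topic answer.
    return {"path": "no_rag", "output": "non-rag recipe"}
```

In the actual application, each branch point is a conditional edge in the LangGraph state graph rather than an `if` statement, but the reachable outcomes are the same three: caption, RAG recipe, or non-RAG recipe.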