A pipeline that implements Chain-of-Thought (CoT) Decoding using the Ollama API within Open-WebUI.
Waiting on the Ollama team to implement logprobs before an original implementation is possible. Currently, this pipeline prompts the model to use the options as an inner monologue.
This pipeline leverages the Ollama API to perform Chain-of-Thought decoding, which involves:
- Extracting alternative top-`k` decoding paths.
- Calculating confidence metrics for each decoding path.
- Selecting the most reliable path based on confidence.
By generating multiple responses to a given prompt and selecting the most confident one, the pipeline aims to improve the accuracy and reliability of the generated answers.
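Conceptually, the selection step is just an argmax over scored candidates. A minimal illustration (the candidate answers and scores here are made up for demonstration):

```python
# Illustrative only: each candidate pairs a generated answer with its confidence score.
candidates = [
    ("The capital of France is Paris.", 0.000023),
    ("Paris is the capital city of France.", 0.000025),
]

best_answer, best_confidence = max(candidates, key=lambda c: c[1])
print(best_answer)  # -> "Paris is the capital city of France."
```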
- Chain-of-Thought Decoding: Implements CoT decoding to enhance reasoning capabilities without explicit prompting.
- Customizable Parameters: Adjust `k`, `temperature`, `max_tokens`, and other parameters via configuration.
- Debugging Output: Optionally enable detailed debugging information to see generated responses and confidence scores.
- Python Libraries:
  - `requests` (install via `pip install requests`)
  - `pydantic` (should be available in the Open-WebUI environment)
- Ollama API:
  - Ensure the Ollama API is running and accessible (a quick connectivity check is sketched below).
  - Default API URL: `http://localhost:11434/api/generate`
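If you want to confirm the API is reachable before wiring up the pipeline, a minimal check along these lines can help (the model name is a placeholder; substitute one you have pulled locally):

```python
import requests

OLLAMA_API_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "your_model_name_here",  # placeholder; use a locally pulled model
    "prompt": "Q: What is 2 + 2?\nA:",
    "stream": False,  # ask for a single JSON response instead of a stream
}

response = requests.post(OLLAMA_API_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json().get("response", ""))
```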
- Place the Script:
  - Save the script as `cot_decoding_pipeline.py` (or any appropriate name) in the pipelines directory of Open-WebUI.
- Install Required Libraries:
  - `pip install requests`
The pipeline uses a `Valves` class for configuration, allowing you to adjust parameters without modifying the code.
- pipelines: List of pipelines to connect to (default: `["*"]`).
- priority: Pipeline execution priority (default: `0`).
- ollama_api_url: URL of the Ollama API (default: `http://localhost:11434/api/generate`).
- model: Name of the model to use (must be set).
- k: Number of top-k alternatives to consider (default: `10`).
- temperature: Sampling temperature for response generation (default: `0.7`).
- max_tokens: Maximum number of tokens to generate (default: `256`).
- debug: Enable debugging output (default: `True`).
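As a rough sketch of how these parameters map onto the `Valves` class (field names follow the list above; the exact class in the script may differ slightly):

```python
from typing import List
from pydantic import BaseModel


class Pipeline:
    class Valves(BaseModel):
        # Pipelines to connect to; "*" connects to all of them.
        pipelines: List[str] = ["*"]
        # Execution priority relative to other pipelines.
        priority: int = 0
        # Ollama generate endpoint.
        ollama_api_url: str = "http://localhost:11434/api/generate"
        # Must be set to a model you have pulled locally.
        model: str = ""
        # Number of alternative responses to generate.
        k: int = 10
        # Sampling temperature and per-response token limit.
        temperature: float = 0.7
        max_tokens: int = 256
        # Print prompts, responses, and confidence scores when True.
        debug: bool = True
```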
You can also set configuration parameters using environment variables:
- OLLAMA_API_URL: Overrides `ollama_api_url`.
- COT_DECODING_K: Overrides `k`.
- COT_DECODING_TEMPERATURE: Overrides `temperature`.
- COT_DECODING_MAX_TOKENS: Overrides `max_tokens`.
- COT_DECODING_DEBUG: Set to `"True"` or `"False"` to enable or disable debugging output.
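A sketch of how such overrides are typically read (the helper name is illustrative; the script may wire this up differently):

```python
import os


def load_env_overrides() -> dict:
    """Collect environment-variable overrides for the Valves defaults."""
    overrides = {}
    if os.getenv("OLLAMA_API_URL"):
        overrides["ollama_api_url"] = os.getenv("OLLAMA_API_URL")
    if os.getenv("COT_DECODING_K"):
        overrides["k"] = int(os.getenv("COT_DECODING_K"))
    if os.getenv("COT_DECODING_TEMPERATURE"):
        overrides["temperature"] = float(os.getenv("COT_DECODING_TEMPERATURE"))
    if os.getenv("COT_DECODING_MAX_TOKENS"):
        overrides["max_tokens"] = int(os.getenv("COT_DECODING_MAX_TOKENS"))
    if os.getenv("COT_DECODING_DEBUG"):
        overrides["debug"] = os.getenv("COT_DECODING_DEBUG") == "True"
    return overrides
```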
Ensure that the `model` parameter in the `Valves` configuration is set to the name of the model you wish to use:

```python
self.valves = self.Valves(
    **{
        "model": "your_model_name_here",
        # other parameters...
    }
)
```
The pipeline works within Open-WebUI and interacts with user inputs to generate responses. Here's how it operates (a condensed sketch of the flow follows this list):
- User Input: The user provides a message or question.
- Prompt Formatting: The pipeline formats the conversation history into a prompt in the standard Q&A format.
- Generating Responses:
  - It generates `k` alternative responses using the Ollama API.
  - Each response is generated with a different random seed to ensure diversity.
- Calculating Confidence:
  - For each response, a confidence score is calculated.
  - The confidence metric used is tokens generated per unit of evaluation time (`eval_count / eval_duration`).
- Selecting the Best Response:
  - The response with the highest confidence score is selected as the final answer.
- Debugging Output (if enabled):
  - The pipeline prints the formatted prompt, all generated responses, their confidence scores, and the selected best response.
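A condensed sketch of this flow, assuming the `requests`-based call and the `eval_count / eval_duration` proxy described above (the function and variable names here are illustrative, not necessarily those used in the script):

```python
import requests


def generate_best_response(valves, prompt: str) -> str:
    """Illustrative sketch: generate k candidates and keep the most confident one."""
    candidates = []
    for seed in range(valves.k):
        payload = {
            "model": valves.model,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": valves.temperature,
                "num_predict": valves.max_tokens,
                "seed": seed,  # vary the seed so the k samples differ
            },
        }
        data = requests.post(valves.ollama_api_url, json=payload, timeout=300).json()
        content = data.get("response", "")
        # Simple proxy confidence: tokens generated per unit of eval time.
        eval_count = data.get("eval_count", 0)
        eval_duration = data.get("eval_duration", 1)  # guard against division by zero
        confidence = eval_count / eval_duration
        candidates.append((confidence, content))
        if valves.debug:
            print(f"Response {seed + 1} confidence: {confidence:.6f}\n{content}\n")

    best_confidence, best_content = max(candidates, key=lambda c: c[0])
    if valves.debug:
        print(f"Selected best response (confidence: {best_confidence:.6f})")
    return best_content
```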
Example debugging output:

```
Formatted Prompt:
Q: What is the capital of France?
A:

Generated Responses and Confidence Scores:

Response 1:
Content: The capital of France is Paris.
Confidence: 0.000023

Response 2:
Content: Paris is the capital city of France.
Confidence: 0.000025

... (additional responses)

Selected Best Response:
Content: Paris is the capital city of France.
Confidence: 0.000025
```
- Confidence Metric:
  - The current confidence calculation is simplified; a fully working version is waiting on the Ollama team to implement logprobs passthrough.
  - For more accurate confidence metrics, consider implementing a method based on token probabilities if they become available from the API (one possibility is sketched below).
- Debugging:
  - Debugging output is helpful during development and testing.
  - In a production environment, consider setting `debug` to `False` to avoid exposing internal details.
- Error Handling:
  - The pipeline includes basic error handling for API calls.
  - You may want to enhance error handling based on your specific use case.
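If token probabilities do become available through the API, one option is the margin-based confidence used in the original CoT-decoding paper: average the gap between the top-1 and top-2 token probabilities over the generated tokens. A purely illustrative helper, which assumes you can obtain per-token top-2 probabilities (Ollama does not yet expose these):

```python
from typing import List, Tuple


def logprob_confidence(top2_probs: List[Tuple[float, float]]) -> float:
    """Hypothetical metric: mean margin between top-1 and top-2 token probabilities."""
    if not top2_probs:
        return 0.0
    margins = [p1 - p2 for p1, p2 in top2_probs]
    return sum(margins) / len(margins)


# Example: three tokens where the model strongly prefers its top choice.
print(logprob_confidence([(0.9, 0.05), (0.8, 0.1), (0.95, 0.02)]))  # ≈ 0.83
```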
- No Response Generated:
  - Ensure the `model` parameter is correctly set.
  - Verify that the Ollama API is running and accessible.
- API Errors:
  - Check the API URL and network connectivity.
  - Review error messages printed in the console or logs.
- Debug Information Not Displayed:
  - Confirm that `debug` is set to `True` in the configuration.
  - Ensure that the environment variable `COT_DECODING_DEBUG` is not overriding the `debug` setting.
Contributions and improvements to this pipeline are welcome. Feel free to fork the repository and submit pull requests.
This project is licensed under the MIT License.
- Open-WebUI: For providing the framework to build and integrate this pipeline.
- Ollama: For the API used to generate model responses.
If you have any questions or need further assistance, please feel free to reach out.