This is a basic script you can use to run CogVLM2 locally. It uses 4-bit quantization by default (on CUDA devices) to minimize VRAM requirements and provides several modes of interaction, with interactive chat as the default mode.
Chat About Your Images! 🗣️🖼️
- Interactive Conversation: Talk back and forth with the AI about your pictures.
- Context Aware: It remembers the chat history for follow-up questions.
- Load Images Easily: Add pictures from your computer or the web during the chat.
Get Instant Image Descriptions ✍️
- Automatic Captions: Instantly generates text descriptions for your images.
- Practical Use: Great for summaries, alt text, and quick understanding.
Extract Specific Details into JSON 📊
- Structured JSON Output: Get organized information (like lists of objects or text) from images formatted as JSON, which is easy for computers to read and use.
- Targeted Recognition: Ask it to find specific things like text or particular objects to include in the JSON.
- Format Correction: Automatically tries to fix the JSON output if it's not valid or doesn't match requirements.
- Clone GitHub repo:
  `git clone https://github.com/Spinnernicholas/CogVLM.git`
- Navigate to directory:
  `cd CogVLM`
- Create Python environment:
  `python -m venv .venv`
- Activate environment:
  `source .venv/bin/activate`
- Install Python requirements:
  `pip install -r requirements.txt`
- Run examples:
  - Interactive chat (default):
    `python CogVLM.py`
  - Interactive chat starting with an image:
    `python CogVLM.py --image <path_or_url_to_image>`
  - JSON demo:
    `python CogVLM.py --json-demo`
  - JSON demo with a specific image and 1 retry:
    `python CogVLM.py --json-demo --image <path_or_url_to_image> --retries 1`
  - Generate caption:
    `python CogVLM.py --caption --image <path_or_url_to_image>`
Modifications to the base model classes in `transformers>=4.49.0` broke the CogVLM2 model classes. This script has been tested and works with `transformers==4.48.3`; ensure you have that version installed (as specified in `requirements.txt`).
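If you want to verify the pin at runtime, a quick check like this works (a convenience, not part of the script itself):

```python
# Sanity check: CogVLM2's model classes break on transformers >= 4.49.0
import transformers

assert transformers.__version__ == "4.48.3", (
    f"Expected transformers 4.48.3, found {transformers.__version__}"
)
```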
This script provides utilities for interacting with the CogVLM (Cognitive Visual Language Model), including image loading, chat functionalities, JSON extraction/validation, model inference, and a command-line interface.
Dependencies: `logging`, `random`, `argparse`, `json`, `jsonschema`, `PIL` (Pillow), `io`, `typing`, `numpy`, `requests`, `torch`, and `transformers` (`AutoModelForCausalLM`, `AutoTokenizer`, `BitsAndBytesConfig`).
`extract_json(response)` extracts a JSON object from a string response. It first tries to parse the entire string as JSON. If that fails, it searches for the first `{` and the last `}` and attempts to parse the content between them.
- Arguments:
  - `response` (str): The string potentially containing a JSON object.
- Returns:
  - `dict` or `list`: The parsed JSON data, if found and valid.
  - `None`: If no JSON is found or decoding fails. Logs warnings/errors.
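The behavior above suggests an implementation along these lines (a minimal sketch; the actual code in `CogVLM.py` may differ in detail):

```python
import json
import logging

logger = logging.getLogger(__name__)

def extract_json(response):
    """Parse a JSON object out of a model response string (sketch)."""
    try:
        return json.loads(response)  # the whole string may already be JSON
    except json.JSONDecodeError:
        pass
    # Fall back to the span between the first '{' and the last '}'
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end == -1 or end <= start:
        logger.warning("No JSON object found in response")
        return None
    try:
        return json.loads(response[start:end + 1])
    except json.JSONDecodeError as e:
        logger.error("Failed to decode extracted JSON: %s", e)
        return None
```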
`validate_json_schema(json_data, schema)` validates a Python dictionary (representing JSON data) against a given JSON schema dictionary using the `jsonschema` library.
- Arguments:
  - `json_data` (dict): The JSON data (as a Python dictionary) to validate.
  - `schema` (dict): The JSON schema (as a Python dictionary) to validate against.
- Returns:
  - `tuple`: `(bool, str)`
    - `(True, None)`: If the `json_data` is valid according to the `schema`.
    - `(False, error_message)`: If validation fails, containing the error message string from `ValidationError`. Logs errors on failure.
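A minimal sketch of this wrapper, assuming the `(bool, str)` return contract described above:

```python
import logging
from jsonschema import validate, ValidationError

logger = logging.getLogger(__name__)

def validate_json_schema(json_data, schema):
    """Validate json_data against schema; return (is_valid, error_message)."""
    try:
        validate(instance=json_data, schema=schema)
        return True, None
    except ValidationError as e:
        logger.error("Schema validation failed: %s", e.message)
        return False, e.message
```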
Selects a specified number of random examples from a list. (Note: Currently not used by the main script modes).
- Arguments:
  - `data` (list): The list of items to sample from.
  - `num_examples` (int, optional): The number of random examples to return. Defaults to `3`.
- Returns:
  - `list`: A list containing `num_examples` randomly selected items from `data`. Returns an empty list and logs a warning if `num_examples` is greater than the length of `data`.
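A plausible implementation (the name `get_random_examples` is illustrative; the script's actual name for this helper isn't shown here):

```python
import logging
import random

logger = logging.getLogger(__name__)

def get_random_examples(data, num_examples=3):  # name is illustrative
    if num_examples > len(data):
        logger.warning("Requested %d examples but only %d available",
                       num_examples, len(data))
        return []
    return random.sample(data, num_examples)
```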
`load_image(image)` loads an image from various sources into a PIL Image object.
- Arguments:
  - `image` (Union[str, PIL.Image.Image]): The image source. Can be:
    - A local file path (string).
    - A URL (string).
    - An existing `PIL.Image.Image` object.
- Returns:
  - `PIL.Image.Image`: The loaded image object.
- Raises:
  - `FileNotFoundError`: If the image cannot be loaded from the given path or URL after trying both.
  - `TypeError`: If the input `image` is not a string or a `PIL.Image.Image` object.
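A sketch of the loading logic, assuming the path-then-URL fallback described above (the timeout value is an assumption):

```python
import io
import requests
from PIL import Image

def load_image(image):
    """Return a PIL image from a path, URL, or existing Image (sketch)."""
    if isinstance(image, Image.Image):
        return image
    if not isinstance(image, str):
        raise TypeError(f"Unsupported image source type: {type(image)}")
    try:
        return Image.open(image)  # try a local file path first
    except (FileNotFoundError, OSError):
        pass
    try:  # fall back to treating the string as a URL
        resp = requests.get(image, timeout=30)
        resp.raise_for_status()
        return Image.open(io.BytesIO(resp.content))
    except requests.exceptions.RequestException as e:
        raise FileNotFoundError(f"Could not load image from {image!r}") from e
```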
The `CogVLMChat` class provides a stateful chat interface wrapper around a `CogVLM` instance, managing conversation history and image context.
Initializes the chat session.
- Arguments:
  - `model` (CogVLM): An instance of the `CogVLM` class to use for inference.
  - `user_name` (str, optional): The name representing the user in the chat history. Defaults to `'USER'`.
  - `assistant_name` (str, optional): The name representing the assistant in the chat history. Defaults to `'ASSISTANT'`.
- Attributes:
  - `model`: The associated `CogVLM` instance.
  - `user_name`: User's name tag.
  - `assistant_name`: Assistant's name tag.
  - `history`: A list storing the conversation history as `(name, message)` tuples.
  - `image`: The current `PIL.Image.Image` context for the chat (or `None`).
  - `image_path`: The path or URL of the currently loaded image (or `None`).
`chat(query)` sends a user query to the model via the `inference` method, incorporating the current history and image context (if any), and updates the internal history with the query and the model's response.
- Arguments:
  - `query` (str): The user's message.
- Returns:
  - `str`: The model's response.
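Based on the description, the method likely looks something like this (a sketch, not the script's exact code):

```python
class CogVLMChat:  # abridged: only the chat method is sketched here
    def chat(self, query):
        images = [self.image] if self.image is not None else None
        response, self.history = self.model.inference(
            query,
            images=images,
            history=self.history,
            user_name=self.user_name,
            assistant_name=self.assistant_name,
        )
        return response
```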
Opens an image from a file path or URL using `load_image`, converts it to RGB, sets it as the current image context for the chat, and stores its path.
- Arguments:
  - `image_path` (str): Path or URL to the image.
- Returns:
  - `tuple`: `(bool, str)`, where the boolean indicates success and the string provides a status message (including image dimensions). Handles `FileNotFoundError`, `requests.exceptions.RequestException`, and other potential errors during loading/processing.
Returns information about the currently loaded image, including its dimensions, mode, and path/URL.
- Arguments: None.
- Returns:
  - `str`: A string containing image information, or "No image is currently loaded".
Clears the chat conversation history.
- Arguments: None.
- Returns:
  - `str`: A confirmation message: "Chat history cleared".
Resets the chat session by clearing both the conversation history and the currently loaded image context (including its path).
- Arguments: None.
- Returns:
  - `str`: A confirmation message: "Chat history and image have been reset".
`start_cmd_chat()` starts an interactive command-line chat session. This is the default mode of operation if no other mode is specified. It handles user input, commands, and model interaction within a loop.
- Arguments: None.
- Returns: None.
- Behavior:
  - Prompts the user for input using `user_name`.
  - Parses input starting with `/` as commands.
  - Sends non-command input to the `chat` method and prints the response.
  - Handles `EOFError` (e.g., Ctrl+D) to exit gracefully.
  - Prints error messages for unknown commands or inference errors.
- Commands:
  - `/help`: Show available commands.
  - `/exit`: Exit the chat session.
  - `/open [path_or_url]`: Load an image from a local path or URL.
  - `/clear`: Clear the conversation history.
  - `/image`: Show information about the currently loaded image.
  - `/reset`: Clear history and unload the current image.
`CogVLM` is the main class for loading and interacting with the CogVLM model.
Initializes the `CogVLM` instance, loading the specified model and tokenizer. Configures the device (CUDA if available, else CPU) and data type (`torch.bfloat16` on compute capability >= 8, else `torch.float16`). Uses 4-bit quantization via `BitsAndBytesConfig` by default if CUDA is available. Logs the device, dtype, and loading progress.
- Arguments:
  - `model_path` (str, optional): The Hugging Face model identifier or local path to the CogVLM model. Defaults to `'THUDM/cogvlm2-llama3-chat-19B'`.
- Attributes:
  - `model_path`: The path/identifier used.
  - `logger`: Logger instance.
  - `DEVICE`: The device (`'cuda'` or `'cpu'`).
  - `TORCH_TYPE`: The torch data type (`torch.bfloat16` or `torch.float16`).
  - `tokenizer`: The loaded `AutoTokenizer` instance.
  - `model`: The loaded `AutoModelForCausalLM` instance (potentially quantized).
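A sketch of the device/dtype/quantization setup described above (the exact loading arguments in the script may differ; `trust_remote_code=True` is assumed since CogVLM2 ships custom model code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Device and dtype selection, per the description above
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
TORCH_TYPE = (
    torch.bfloat16
    if DEVICE == "cuda" and torch.cuda.get_device_capability()[0] >= 8
    else torch.float16
)

model_path = "THUDM/cogvlm2-llama3-chat-19B"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    # 4-bit quantization by default on CUDA, per the description above
    quantization_config=(
        BitsAndBytesConfig(load_in_4bit=True) if DEVICE == "cuda" else None
    ),
)
```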
`inference(self, query, system_prmpt=None, images=None, history=None, max_new_tokens=2048, pad_token_id=128002, top_k=1, user_name='USER', assistant_name='ASSISTANT', seed_response="")`
Performs inference using the loaded CogVLM model, handling text, optional images, and conversation history.
- Arguments:
  - `query` (str): The main text query or prompt for the current turn.
  - `system_prmpt` (str, optional): A system prompt to prepend to the conversation context. Defaults to `None`.
  - `images` (list, optional): A list containing image sources (paths, URLs, or PIL Images). Only the first image is used if multiple are provided. Defaults to `None`.
  - `history` (list, optional): A list of `(name, message)` tuples representing the conversation history before the current turn. If `None`, a new history is started. The provided list is modified in place by appending the current user query and the assistant's response. Defaults to `None`.
  - `max_new_tokens` (int, optional): Maximum number of new tokens to generate in the response. Defaults to `2048`.
  - `pad_token_id` (int, optional): Token ID for padding during generation. Defaults to `128002`.
  - `top_k` (int, optional): The number of highest-probability vocabulary tokens to keep for top-k filtering during generation. Defaults to `1`.
  - `user_name` (str, optional): Name tag for the user in the current turn. Defaults to `'USER'`.
  - `assistant_name` (str, optional): Name tag for the assistant in the current turn. Defaults to `'ASSISTANT'`.
  - `seed_response` (str, optional): A string to prepend to the model's generated output. This string is also included after the `ASSISTANT:` tag when building the input prompt. Defaults to `""`.
- Returns:
  - `tuple`: `(response, history)`
    - `response` (str): The generated text response from the model (with `seed_response` prepended), stripped of EOS tokens, or an error message string on failure.
    - `history` (list): The updated conversation history including the latest user query and the assistant's response (or the history before the failed turn if an exception occurred).
- Notes:
  - Handles image loading (`load_image`) and RGB conversion if `images` are provided. Logs a warning if multiple images are given.
  - Appends the current `(user_name, query)` to the `history` list before inference.
  - Formats the input prompt string including the system prompt (if any), all history turns, and the current assistant tag (`f"{assistant_name}:{seed_response}"`).
  - Builds model inputs using `model.build_conversation_input_ids` (handles text and image token interleaving).
  - Runs generation using `model.generate` within `torch.no_grad()`.
  - Decodes the generated output tokens, prepends `seed_response`, and cleans the result.
  - Appends the successful `(assistant_name, response)` to the `history` list after successful inference.
  - Includes error handling and logging for image processing and model inference steps.
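To make the prompt format concrete, here is a hedged sketch of the assembly described in the notes; the turn separator and `name: message` spacing are assumptions, while the trailing `f"{assistant_name}:{seed_response}"` tag is taken from the notes above:

```python
def build_prompt(query, history=None, system_prmpt=None,
                 user_name='USER', assistant_name='ASSISTANT',
                 seed_response=""):
    """Assemble the text prompt for one inference turn (illustrative sketch)."""
    history = history if history is not None else []
    history.append((user_name, query))  # per the notes above
    parts = [system_prmpt] if system_prmpt else []
    # Exact turn formatting is an assumption; 'NAME: message' is typical.
    parts += [f"{name}: {message}" for name, message in history]
    parts.append(f"{assistant_name}:{seed_response}")  # taken from the notes
    return "\n".join(parts), history
```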
Factory method to create a `CogVLMChat` instance associated with this `CogVLM` model.
- Arguments:
  - `user_name` (str, optional): User name for the chat session. Defaults to `'USER'`.
  - `assistant_name` (str, optional): Assistant name for the chat session. Defaults to `'ASSISTANT'`.
- Returns:
  - `CogVLMChat`: A new chat session instance.
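Hypothetical usage (`create_chat` is an assumed name for this factory method; the import path follows from the script being `CogVLM.py`):

```python
from CogVLM import CogVLM

model = CogVLM()
chat = model.create_chat(user_name="USER", assistant_name="ASSISTANT")
print(chat.chat("Hello! What can you do?"))
```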
`generate_caption(self, image, query='Describe what you see in the image below. Write a concise, descriptive caption at least 10 words long.')`
A convenience method to generate a caption for a single image by calling the `inference` method with no history (used by `--caption` mode).
- Arguments:
  - `image` (Union[str, PIL.Image.Image]): The image to caption.
  - `query` (str, optional): The prompt used to request the caption. Defaults to a descriptive prompt asking for at least 10 words.
- Returns:
  - `str`: The generated caption.
  - `None`: If the input `image` is `None` or an error occurs during inference.
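Hypothetical usage of the caption helper (the image path is illustrative):

```python
from CogVLM import CogVLM

model = CogVLM()
caption = model.generate_caption("path/to/photo.jpg")  # illustrative path
if caption is not None:
    print(caption)
```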
`request_json(self, query, image=None, extract=False, schema=None, validate_schema=False, max_retries=0)`
Requests a response from the model, specifically aiming for JSON output. Optionally extracts the JSON, validates it against a schema, and retries with feedback on failure (used by `--json-demo` mode).
- Arguments:
  - `query` (str): The user's query, intended to elicit a JSON response.
  - `image` (Union[str, PIL.Image.Image], optional): An image to provide context. Defaults to `None`.
  - `extract` (bool, optional): If `True`, attempts to extract JSON from the response using `extract_json`. Defaults to `False`.
  - `schema` (dict or str, optional): A JSON schema (as a dictionary or JSON string) to validate against if `validate_schema` is `True`. If provided, it is also added to the system prompt to guide the model. Defaults to `None`.
  - `validate_schema` (bool, optional): If `True` (and `extract` is `True` and `schema` is provided), validates the extracted JSON against the schema using `validate_json_schema`. Automatically disabled with a warning if `extract` is `False` or `schema` is not provided. Defaults to `False`.
  - `max_retries` (int, optional): The number of times to retry if JSON extraction or validation fails. Defaults to `0`.
- Returns:
  - `tuple`: `(result, raw_response)`
    - `result`:
      - If `extract` is `False`: the raw string response from the model.
      - If `extract` is `True` and successful (and validation passes, if enabled): the extracted JSON data (dict or list).
      - If `extract` is `True` but extraction fails after `max_retries`: `None`.
    - `raw_response` (str): The final raw string response received from the model during the last attempt.
- Notes:
  - Sets a system prompt: "You are a helpful assistant that responds in a valid JSON format."
  - If a valid `schema` is provided, it is loaded (if a string) and appended to the system prompt within a JSON code block. Invalid schemas disable validation.
  - Calls `inference` with the constructed system prompt and ````seed_response="\n```json\n{"```` to encourage JSON output.
  - Maintains conversation history across retries.
  - Implements a retry loop:
    - If extraction fails, retries with a modified query asking for strict JSON format, potentially including the start of the previous invalid response.
    - If validation fails, retries with a modified query including the schema validation error message, asking for correction.
  - Logs warnings/errors during extraction, validation, and retries.
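Hypothetical usage with a toy schema (the schema, query, and image path are illustrative only; the keyword arguments follow the signature above):

```python
from CogVLM import CogVLM

model = CogVLM()
schema = {  # toy schema, illustrative only
    "type": "object",
    "properties": {"objects": {"type": "array", "items": {"type": "string"}}},
    "required": ["objects"],
}
result, raw_response = model.request_json(
    "List the objects visible in the image as JSON.",
    image="path/to/photo.jpg",  # illustrative path
    extract=True,
    schema=schema,
    validate_schema=True,
    max_retries=2,
)
print(result)  # dict on success, None if extraction/validation failed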
The script uses argparse to handle command-line arguments, allowing users to select different modes of operation and configure settings.
`parse_arguments()` parses the command-line arguments using `argparse`.
- Returns:
  - `argparse.Namespace`: An object containing the parsed arguments.
- Arguments:
  - Mode selection (mutually exclusive):
    - `--json-demo`: Run the JSON demo.
    - `--interactive`: Start interactive chat mode. This is the default behavior if no mode is specified.
    - `--caption`: Generate a caption for an image (requires `--image`).
  - Configuration:
    - `--model-path`: Path or Hugging Face ID of the model (default: `THUDM/cogvlm2-llama3-chat-19B`).
    - `--image`: Path or URL to an image (used by `--caption`, `--interactive`, `--json-demo`).
    - `--schema`: Path to a JSON schema file (used by `--json-demo`).
    - `--query`: Query to send to the model (used by `--caption`, `--json-demo`).
    - `--retries`: Number of retries for JSON extraction/validation (used by `--json-demo`; default: `0`).
    - `--verbose` / `-v`: Increase logging verbosity (0: WARNING, 1: INFO, 2: DEBUG; default: `0`).
    - `--user-name`: Name for the user in chat (default: `USER`).
    - `--assistant-name`: Name for the assistant in chat (default: `ASSISTANT`).
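A sketch of how such a parser might be wired up (flag names and defaults are taken from the list above; help strings and other details are omitted and may differ from the script):

```python
import argparse

def parse_arguments():
    parser = argparse.ArgumentParser(description="Run CogVLM2 locally")
    # The three modes are mutually exclusive, per the description above
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--json-demo", action="store_true")
    mode.add_argument("--interactive", action="store_true")
    mode.add_argument("--caption", action="store_true")
    parser.add_argument("--model-path", default="THUDM/cogvlm2-llama3-chat-19B")
    parser.add_argument("--image")
    parser.add_argument("--schema")
    parser.add_argument("--query")
    parser.add_argument("--retries", type=int, default=0)
    parser.add_argument("--verbose", "-v", action="count", default=0)
    parser.add_argument("--user-name", default="USER")
    parser.add_argument("--assistant-name", default="ASSISTANT")
    return parser.parse_args()
```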
`setup_logging(verbosity)` configures the root logger based on the verbosity level provided by the command-line arguments (`--verbose`). It suppresses overly verbose logs from dependencies unless verbosity is set to DEBUG (`-vv`).
- Arguments:
  - `verbosity` (int): The level of verbosity (0: WARNING, 1: INFO, 2: DEBUG).
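A minimal sketch of this behavior (the list of silenced dependency loggers is an assumption):

```python
import logging

def setup_logging(verbosity):
    """Map the -v count to a log level, per the description above."""
    level = {0: logging.WARNING, 1: logging.INFO}.get(verbosity, logging.DEBUG)
    logging.basicConfig(level=level)
    if verbosity < 2:
        # Quiet down chatty dependencies unless -vv was given
        for name in ("transformers", "urllib3", "PIL"):
            logging.getLogger(name).setLevel(logging.WARNING)
```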
The main entry point of the script.
- Parses arguments using `parse_arguments()`.
- Sets up logging using `setup_logging()`.
- Initializes the `CogVLM` model.
- Runs the selected mode:
  - If `args.json_demo`: loads the schema (if provided), then calls `run_json_demo()`. Handles schema file loading errors.
  - If `args.caption`: checks for the required `--image` argument, then calls `cogVLM.generate_caption()` and prints the result. Handles image loading errors.
  - If `args.interactive`, or if no other mode was specified: creates a `CogVLMChat` instance, optionally loads the initial image specified by `--image`, and starts the interactive loop via `chat.start_cmd_chat()`.
- Includes top-level exception handling to catch and log unexpected errors.
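A condensed, hypothetical sketch of this dispatch logic (`create_chat` and `open_image` are assumed method names; error handling is omitted):

```python
import json

def main():
    args = parse_arguments()
    setup_logging(args.verbose)
    cogVLM = CogVLM(model_path=args.model_path)
    if args.json_demo:
        schema = None
        if args.schema:
            with open(args.schema) as f:  # schema file loading, per above
                schema = json.load(f)
        run_json_demo(cogVLM, image=args.image, schema=schema,
                      query=args.query, retries=args.retries)
    elif args.caption:
        print(cogVLM.generate_caption(args.image))  # requires --image
    else:  # --interactive, or the default when no mode flag is given
        chat = cogVLM.create_chat(user_name=args.user_name,
                                  assistant_name=args.assistant_name)
        if args.image:
            chat.open_image(args.image)  # method name assumed
        chat.start_cmd_chat()
```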
`run_json_demo(...)` runs the JSON demonstration mode when explicitly requested via the `--json-demo` argument. It falls back to default values for the image, schema, and query if none are provided via the command line or if loading fails. It calls `cogVLM.request_json` with extraction and validation enabled, using the specified number of retries, then prints the raw response and the final extracted/validated JSON (or `None` on failure).
- Arguments:
  - `cogVLM` (CogVLM): The initialized CogVLM model instance.
  - `image` (str, optional): Path or URL to the image. Uses a default URL if not provided.
  - `schema` (dict, optional): The JSON schema to use for validation. Uses a default schema if not provided.
  - `query` (str, optional): The query to send to the model. Uses a default query if not provided.
  - `retries` (int, optional): Number of retries for JSON extraction/validation. Defaults to `0`.