candle-llava

implement LLaVA using candle

The code is based on https://github.com/haotian-liu/LLaVA, so the llava-hf version of the config may behave slightly differently.

The llava-hf models ship with a tokenizer.json, so if you want a pure-Rust experience, I suggest using the llava-hf versions.
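
For illustration, here is a minimal sketch of that pure-Rust path using the tokenizers crate; the file path and prompt are placeholders, not values this repo actually uses:

use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // tokenizer.json ships with the llava-hf checkpoints, so no Python
    // conversion step is needed for these models.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;
    let encoding = tokenizer.encode("is this a cat?", true)?;
    println!("token ids: {:?}", encoding.get_ids());
    Ok(())
}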

model zoo

So far I have tested liuhaotian/llava-v1.6-vicuna-7b and llava-hf/llava-v1.6-vicuna-7b-hf. Memory usage likely still has room for optimization.
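
Both checkpoints can be fetched programmatically with the hf-hub crate (the crate candle's own examples use for downloads); a minimal sketch, where the file names are assumptions based on the usual Hugging Face repo layout:

use hf_hub::api::sync::Api;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api = Api::new()?;
    let repo = api.model("llava-hf/llava-v1.6-vicuna-7b-hf".to_string());
    // File names follow the usual Hugging Face layout; adjust them to the
    // files actually listed in the repo you are downloading.
    let config = repo.get("config.json")?;
    let tokenizer = repo.get("tokenizer.json")?;
    println!("config: {config:?}\ntokenizer: {tokenizer:?}");
    Ok(())
}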

eval

single-image

cargo run # default args: model liuhaotian/llava-v1.6-vicuna-7b, image images/llava_logo.png, prompt "is this a cat?"
cargo run -- --image-file "images/llava_v1_5_radar.jpg" --prompt "what does this picture show?"
cargo run -- --model-path "llava-hf/llava-v1.6-vicuna-7b-hf" # use the llava-hf model

task

  • Download the corresponding weights from Hugging Face

  • Load the model weights and configs

    • general LLaVA config (needs a rethink of what is actually necessary)
    • Vision tower (CLIP)
      • image processor (partial; the format of 'size' and 'crop_size' is not fully compatible with the Python transformers implementation)
    • LLM
      • llama/vicuna
      • mistral
  • image preprocessing

    • CLIP image processor
    • 'anyres' image preprocessing (see the sketch after this list)
    • 'pad' image preprocessing
  • conv template (partial; only conv_llava_v1 and conv_chatml_direct are implemented, which is enough for LLaVA v1.6)

  • Model structure implementation

    • Vision tower
    • LLM
      • modifications to the llama code
        • output the embedding result
        • generate from embedding tensors
  • model forward

    • Vision tower
      • feature selection (see the sketch after this list)
    • LLM
    • processing of multiple images
      • read multiple images
      • patch processing for multiple images
    • concatenation of image features and text features (see the sketch after this list)
    • truncation of the concatenated features
  • main process

    • load model
    • load image
    • load text
    • tokenize text
    • forward
      • single image
    • output
    • KV cache
    • conversation mode
    • (long term) web?
  • quantization

    • 4-bit
    • 8-bit
  • (long term) Expand candle's operators (a possible nonzero fallback is sketched after this list), including:

    • split
    • nonzero
    • where
  • (top priority) migrate to support the llava-hf series of models

    • determine whether a model is a llava-hf model
    • translate the config
    • translate the model
    • take care of constants such as image_token_index
    • modify the image processor config
  • LoRA

  • contributions to other projects

  • memory optimization for LLaVA v1.6

  • (long term) model training
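
As referenced in the 'anyres' item above: upstream LLaVA first picks, from a list of candidate grid resolutions, the one that keeps the most of the original image while wasting the least area. Below is a rough Rust transcription of that selection step; the tuple layout and the example candidates are my own choices, not the repo's actual types:

fn select_best_resolution(original: (u32, u32), candidates: &[(u32, u32)]) -> (u32, u32) {
    let (ow, oh) = (original.0 as f64, original.1 as f64);
    let orig_area = original.0 as u64 * original.1 as u64;
    let mut best = candidates[0];
    let (mut best_effective, mut best_waste) = (0u64, u64::MAX);
    for &(w, h) in candidates {
        // Scale to fit inside the candidate grid while keeping aspect ratio.
        let scale = (w as f64 / ow).min(h as f64 / oh);
        let (dw, dh) = ((ow * scale) as u64, (oh * scale) as u64);
        // Effective resolution is capped at the original; the rest is wasted area.
        let effective = (dw * dh).min(orig_area);
        let waste = w as u64 * h as u64 - effective;
        if effective > best_effective || (effective == best_effective && waste < best_waste) {
            best = (w, h);
            best_effective = effective;
            best_waste = waste;
        }
    }
    best
}

For example, with candidates (672, 672) and (336, 1008), a 1000x800 image selects (672, 672), since it preserves far more of the original resolution.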
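
As referenced in the model-forward items above, the feature-select and concat steps reduce to tensor slicing. A minimal candle sketch, assuming 'patch'-style feature selection and a single image placeholder token at a known position; the function names are illustrative, not the repo's:

use candle_core::{Result, Tensor};

// "patch" feature selection: drop the CLS token from the vision tower's
// hidden state, [batch, 1 + num_patches, dim] -> [batch, num_patches, dim].
fn feature_select(hidden: &Tensor) -> Result<Tensor> {
    let seq = hidden.dim(1)?;
    hidden.narrow(1, 1, seq - 1)
}

// Splice image features into the text embeddings in place of the image
// placeholder token at `pos` (embeds: [1, seq, dim], image: [1, n_img, dim]).
fn merge_features(embeds: &Tensor, image: &Tensor, pos: usize) -> Result<Tensor> {
    let seq = embeds.dim(1)?;
    let before = embeds.narrow(1, 0, pos)?;
    let after = embeds.narrow(1, pos + 1, seq - pos - 1)?;
    Tensor::cat(&[&before, image, &after], 1)
}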
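
As for the operator list above: where can likely be covered by candle's existing Tensor::where_cond, and nonzero can be emulated on the host until a native kernel exists. A sketch assuming a 1-D u8 mask:

use candle_core::{Device, Result, Tensor};

// Host-side fallback for a 1-D nonzero: copy the u8 mask to host memory and
// collect the indices of the non-zero entries into a new tensor.
fn nonzero_1d(mask: &Tensor) -> Result<Tensor> {
    let values = mask.to_vec1::<u8>()?;
    let indices: Vec<u32> = values
        .iter()
        .enumerate()
        .filter_map(|(i, &v)| (v != 0).then_some(i as u32))
        .collect();
    let n = indices.len();
    Tensor::from_vec(indices, n, &Device::Cpu)
}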

Tokenizer Setup

The liuhaotian checkpoints do not ship a tokenizer.json, so a small Python environment with transformers is needed to generate one:

conda create -n llava python=3.10
pip install transformers protobuf

Download using a mirror (for users in China)

pip install -U huggingface_hub  
export HF_ENDPOINT=https://hf-mirror.com  
huggingface-cli download --resume-download liuhaotian/llava-v1.6-vicuna-7b

Limitations

  • Tested only on the llava-v1.6-vicuna-7b models (liuhaotian and llava-hf versions)