Phi2_multimodal


ERA-CAPSTONE

🤗Space Link

Tasks:

  1. Make a multi-modal LLM that can take these inputs:

    • ✔️ Text
    • ✔️ Image
    • ✔️ Audio
  2. Training:

    • Image:

      ✔️ Use the original Instruct 150k dataset, and use CLIP to get the image embeddings.

      ✔️ Add a projection layer that maps these CLIP embeddings into something that can be fed to the Phi model (a minimal projection-layer sketch is shown after this task list).

      ✔️ Add an adapter and train it with QLoRA on the Instruct 150k dataset.

    • Audio:

      ✔️ Use Whisper to perform ASR (speech-to-text).

      ✔️ Add a projection layer for the Whisper output.

    • Text:

      ✔️ Accept any text prompt and generate the related details.

  3. ✔️ The output remains text, based on multimodal inputs - text, image, and audio.

  4. ✔️ The deployment page should work just like ChatGPT, allowing us to send text, images, or audio (live recording or file upload).
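The core of the image path is a small projection network that maps CLIP vision features into Phi2's embedding space. Below is a minimal sketch of that idea, assuming the public `openai/clip-vit-large-patch14-336` checkpoint and Phi2's 2560-dimensional hidden size; the class name, MLP design, and file path are illustrative, not the repository's actual code.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

class ClipToPhiProjector(nn.Module):
    """Hypothetical projector: CLIP patch features -> Phi2 embedding space."""
    def __init__(self, clip_dim=1024, phi_dim=2560):
        super().__init__()
        # A simple two-layer MLP; the actual projector design may differ.
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, phi_dim),
            nn.GELU(),
            nn.Linear(phi_dim, phi_dim),
        )

    def forward(self, image_features):
        return self.proj(image_features)

# Usage sketch: encode an image and project its patch features.
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
projector = ClipToPhiProjector()

pixel_values = image_processor(Image.open("example.jpg"), return_tensors="pt").pixel_values
with torch.no_grad():
    patch_features = vision_tower(pixel_values).last_hidden_state   # (1, 577, 1024)
image_tokens = projector(patch_features)                            # (1, 577, 2560)
# These projected "image tokens" are concatenated with Phi2's text token embeddings;
# only the projector and the QLoRA adapters are trained.
```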

Phi2 : Pretraining LLM from Scratch

Details

  1. Model used: Microsoft Phi2
  2. Datasets used: TinyStories dataset (100k samples) & real-time data (100k samples) generated by a finetuned Phi2 model via Ollama
  3. Pretraining approach: QLoRA
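For reference, a minimal QLoRA setup for the Phi2 backbone could look like the sketch below, using Hugging Face transformers + peft + bitsandbytes. The rank, alpha, dropout, and target module names are assumptions (they follow the transformers-native Phi architecture), not the exact values used in this training run.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load Phi2 in 4-bit NF4 and attach LoRA adapters (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                 # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Target names assume the transformers-native Phi attention projections.
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```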

Training Loss Curve

Training Logs


Phi2 : Multimodal Finetuning

Details

  1. LLM Backbone: Phi2
  2. Vision Tower: clip-vit-large-patch14-336
  3. Audio Model: Whisper Tiny
  4. Pretraining Dataset: LAION-CC-SBU dataset with BLIP captions (200k samples)
  5. Finetuning Dataset: Instruct 150k dataset based on COCO
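A minimal sketch of loading the three backbones named above from their public Hugging Face checkpoints (the checkpoint ids are inferred from the component names, not taken from the repository code):

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer, CLIPVisionModel,
                          WhisperForConditionalGeneration, WhisperProcessor)

# LLM backbone
phi2 = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)
phi2_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# Vision tower
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

# Audio model
whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
```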
class AudioLanguageConnector:
  • This class prepares and tokenizes audio-related text data using the "microsoft/phi-2" tokenizer. The <audio_start> and <audio_end> tokens are added to the input text to provide context for audio-related processing, and the tokenized output is returned as tensors. The class acts as a connector that brings audio-derived text into a format the model can consume.
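A minimal sketch of such a connector, assuming the boundary tokens are registered as extra special tokens (method names and defaults are illustrative; if new tokens are added, the language model's embedding matrix must also be resized):

```python
from transformers import AutoTokenizer

class AudioLanguageConnector:
    """Sketch: wrap audio-derived text with boundary tokens and tokenize for Phi2."""
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
        # Assumed: <audio_start>/<audio_end> are registered as special tokens.
        self.tokenizer.add_tokens(["<audio_start>", "<audio_end>"], special_tokens=True)

    def __call__(self, text):
        prompt = f"<audio_start> {text} <audio_end>"
        # Returns input_ids / attention_mask as PyTorch tensors.
        return self.tokenizer(prompt, return_tensors="pt")
```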
class WhisperWithProjection:
  • This class encapsulates the steps needed to transcribe audio. It uses the pre-trained "openai/whisper-tiny" model to convert audio files into text transcriptions.
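A sketch of the transcription step with the transformers Whisper API; the class shape and call signature here are illustrative:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

class WhisperWithProjection:
    """Sketch: transcribe 16 kHz mono audio with openai/whisper-tiny."""
    def __init__(self, device="cpu"):
        self.device = device
        self.processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
        self.model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").to(device)

    def __call__(self, audio_array, sampling_rate=16000):
        # The processor converts raw audio into log-mel input features.
        inputs = self.processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
        with torch.no_grad():
            predicted_ids = self.model.generate(inputs.input_features.to(self.device))
        return self.processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```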
class MultiModalPhi2:
  • This class takes input text, audio, and images and constructs a conversation prompt with appropriate formatting for the model. It tokenizes the prompt, preprocesses the image, concatenates audio embeddings when available, and generates new tokens with the pre-trained model conditioned on all available modalities. Finally, it decodes and returns the generated output, handling special tokens and potential length mismatches.
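A hypothetical usage example for the wrapper described above; the constructor and call signature are assumptions based on the description, not the actual interface:

```python
import librosa
from PIL import Image

# Hypothetical interface: the real class may take different arguments.
multimodal = MultiModalPhi2()                       # loads Phi2 + CLIP + Whisper internally
image = Image.open("example.jpg")
audio, sr = librosa.load("question.wav", sr=16000)  # Whisper expects 16 kHz mono audio
answer = multimodal(text="What is this image about?", audio=audio, image=image)
print(answer)
```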

Pretraining

Training Loss Curve and Learning Rate

Training Logs


Finetuning

Training Loss Curve and Learning Rate

Training Logs


Results


Text & Image:


Audio & Image:

Question Asked: What is this image about?


Future Scope:

  • Finetuning on the full BLIP caption set (558k samples) used by the original LLaVA model could lead to significant improvements.
  • Quantizing with GPTQ or AWQ can reduce latency and memory usage, making the model more efficient to serve; a rough GPTQ sketch is shown below.
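As a rough illustration of the GPTQ route for the Phi2 backbone, using the transformers GPTQ integration (requires optimum and auto-gptq to be installed); the bit width and calibration dataset below are assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
# 4-bit weights calibrated on C4 (assumed settings, not measured for this project).
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=gptq_config,
    device_map="auto",
    trust_remote_code=True,
)
quantized.save_pretrained("phi2-gptq-4bit")
```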