OptimalScale/LMFlow

[Roadmap] LMFlow Roadmap


This document lists the features on LMFlow's roadmap. We welcome discussion of, and contributions to, specific features in the related Issues/PRs. 🤗

Main Features

  • Data
    • DPO dataset format #867
    • Conversation template in DPO #883
    • Jinja template
    • Tools in conversation dataset #884 #892
    • Packing with block diagonal attention (see the sketch after this list)
  • Model
    • Backend
      • 🏗️ Accelerate support
    • Tokenization
      • Tokenization update, using hf method
  • Pipeline
    • Train/Finetune/Align
      • DPO (multi-gpu) #867
      • Iterative DPO #867 #883
      • PPO
      • LISA (multi-gpu, qwen2, chatglm) #899
      • Batch size and learning rate recommendation (arxiv)
      • No trainer version pipelines, allowing users to customize/modify based on their needs
      • Sparse training for moe models #879
    • Inference
      • vllm inference #860 #863
      • Reward model scoring #867
      • Multiple instances inference (vllm, rm, others) #883
      • Inference checkpointing and resume from checkpoints
      • Inference acceleration with EAGLE
      • Inferencer for chat/instruction models, and chatbot.py upgrade #917
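
For the packing item above, here is a minimal sketch of a block-diagonal attention mask; `block_diagonal_mask` is an illustrative helper, not LMFlow's implementation, and assumes packed sequences are laid out back-to-back in one row.

```python
import torch

def block_diagonal_mask(seq_lens):
    """Boolean (total, total) mask; True = attention allowed.

    Tokens may only attend within their own packed sequence, so packed
    sequences do not leak attention into each other.
    """
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    offset = 0
    for n in seq_lens:
        mask[offset:offset + n, offset:offset + n] = True
        offset += n
    return mask

# Two sequences of lengths 3 and 2 packed into a single row of length 5.
print(block_diagonal_mask([3, 2]).int())
```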

Usability

  • Make some packages/functions (gradio, vllm, ray, etc.) optional, add conditional import. #905
  • Inference method auto-downgrading (vLLM > DeepSpeed, etc.), and make the vllm package optional (see the first sketch after this list)
  • Merging similar model methods into hf_model_mixin
  • Set torch_dtype='bfloat16' when bf16 is specified, etc. (bf16 lives in FinetunerArguments but torch_dtype lives in ModelArguments, so this cannot be handled in __post_init__(); see the second sketch after this list.)
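
For the optional-import and auto-downgrading items, a minimal sketch (assumed backend names and downgrade order, not LMFlow's actual API) could look like:

```python
import importlib.util

def pick_inference_backend(preferred: str = "vllm") -> str:
    """Return the first installed backend, downgrading vllm > deepspeed > transformers."""
    for name in (preferred, "deepspeed", "transformers"):
        if importlib.util.find_spec(name) is not None:
            return name
    raise RuntimeError("No supported inference backend is installed.")

backend = pick_inference_backend()
if backend == "vllm":
    from vllm import LLM  # heavy import happens only when vllm is actually available
```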
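For the torch_dtype/bf16 item, since the two flags live in different dataclasses, one option is to reconcile them after parsing instead of in __post_init__(); the dataclasses below are simplified stand-ins for the real ModelArguments/FinetunerArguments:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelArguments:          # simplified stand-in for the real ModelArguments
    torch_dtype: Optional[str] = None

@dataclass
class FinetunerArguments:      # simplified stand-in for the real FinetunerArguments
    bf16: bool = False

def reconcile_dtype(model_args, finetuner_args):
    # Neither dataclass's __post_init__ can see the other, so do it after parsing.
    if finetuner_args.bf16 and model_args.torch_dtype is None:
        model_args.torch_dtype = "bfloat16"
    return model_args

print(reconcile_dtype(ModelArguments(), FinetunerArguments(bf16=True)).torch_dtype)  # bfloat16
```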

Bug fixes

  • model.generate() with DeepSpeed ZeRO-3 (dsz3) #861
  • merge_lora: support merging LoRA weights specified by absolute paths
  • load_dataset long data fix #878
  • src/lmflow/utils/common.py create_copied_dataclass compatibility when Python version >= 3.10 (kw_only issue; see the sketch after this list) #903 #905
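
A minimal reproduction of the kw_only pitfall is sketched below (copy_dataclass_fields is an illustrative helper, not the code in src/lmflow/utils/common.py): on Python >= 3.10 each Field carries a kw_only attribute that a field copy must forward, and that attribute does not exist on older versions.

```python
import dataclasses
import sys

def copy_dataclass_fields(cls, suffix="Copy"):
    """Re-create a dataclass with the same fields, preserving kw_only on 3.10+."""
    specs = []
    for f in dataclasses.fields(cls):
        kwargs = {}
        if f.default is not dataclasses.MISSING:
            kwargs["default"] = f.default
        if sys.version_info >= (3, 10):
            kwargs["kw_only"] = f.kw_only  # Field.kw_only only exists on Python >= 3.10
        specs.append((f.name, f.type, dataclasses.field(**kwargs)))
    return dataclasses.make_dataclass(cls.__name__ + suffix, specs)
```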

Issues left over from history

  • use_accelerator -> use_accelerate typo fix (with Accelerate support PR)
  • model_args.use_lora leads to truncation of the sequence, mentioned in #867
  • Make ports, addresses, and all other settings in distributed training tidy and clear (with Accelerate support PR)

Documentation

  • Approximate GPU memory requirements w.r.t. model size & pipeline
  • Dev handbook, indicating styles, test list, etc.

Note on multiple-instance inference:
In vLLM inference, the number of attention heads must be divisible by the vLLM tensor-parallel size. For a 14-head LLM, the options for tp are 1 and 2 (7 causes another division issue, but I forget exactly which one).
Say we have 8 GPUs; to utilize all of them, multi-instance vLLM inference is necessary (tp=1 -> 8 instances, tp=2 -> 4 instances); see the sketch below.
The same applies to reward model (rm) inference and any other inference pipelines.
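
As a quick illustration of the arithmetic above (plan_instances is just an illustrative helper; the numbers are the example from this note):

```python
def plan_instances(num_attn_heads, num_gpus):
    """List tensor-parallel sizes that divide both the head count and the GPU count."""
    return [
        {"tensor_parallel_size": tp, "instances": num_gpus // tp}
        for tp in range(1, num_gpus + 1)
        if num_attn_heads % tp == 0 and num_gpus % tp == 0
    ]

# A 14-head model on 8 GPUs: tp=1 -> 8 instances, tp=2 -> 4 instances.
# (tp=7 divides 14 but not 8; it also hits the separate vLLM issue mentioned above.)
print(plan_instances(num_attn_heads=14, num_gpus=8))
```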

Update: Iterative DPO is now supported (#883).