nebuly-ai/optimate

[Chatllama] Support Inference for trained models.

PierpaoloSorbellini opened this issue · 1 comment

Description

Currently, to run inference with a trained model, the user has to write a small Python script by hand that loads the resulting checkpoint or saved model according to how the library saves it after training.
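
For reference, a minimal sketch of such a script, assuming the actor was saved as a Hugging Face causal LM directory (the checkpoint path and generation settings below are placeholders, not chatllama defaults):

```python
# Minimal sketch: load a trained actor checkpoint and answer one prompt.
# Assumes the checkpoint is a Hugging Face causal LM directory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "./checkpoints/actor"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
model.eval()

prompt = "Explain what RLHF is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```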

Moreover, several optimizations could be integrated to speed up inference (see the sketch after this list), such as:

  • CPU offloading.
  • llama.cpp implementation.
  • accelerate / deepspeed distributed inference.
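
As a rough illustration of the CPU offloading and accelerate points, here is a hedged sketch using accelerate-backed loading via `device_map="auto"`; the checkpoint path, dtype, and offload folder are assumptions, not chatllama settings:

```python
# Sketch: accelerate places layers on the available GPUs and offloads
# whatever does not fit to CPU/disk.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "./checkpoints/actor"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    torch_dtype=torch.float16,
    device_map="auto",           # shard across available GPUs
    offload_folder="./offload",  # spill remaining weights to CPU/disk
)
```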

TODO

  • Implement an Inference class that makes inference easy, including from the CLI.
  • Implement inference with the optimizations available from deepspeed.
  • Implement inference with the optimizations available from accelerate.
  • Implement fast LLaMA inference with the well-known llama.cpp implementation.
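
A hypothetical sketch of what the proposed Inference class with a CLI entry point could look like; the class name, arguments, and checkpoint layout are assumptions, not the library's actual API:

```python
# Hypothetical Inference class: load a trained actor checkpoint and
# answer prompts, callable from the command line.
import argparse

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class Inference:
    """Load a trained actor checkpoint and generate answers to prompts."""

    def __init__(self, checkpoint_dir: str, device: str = "cuda"):
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
        self.model = AutoModelForCausalLM.from_pretrained(checkpoint_dir).to(device)
        self.model.eval()
        self.device = device

    @torch.no_grad()
    def generate(self, prompt: str, max_new_tokens: int = 128) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run inference on a trained model")
    parser.add_argument("--checkpoint", required=True, help="path to the saved actor")
    parser.add_argument("--prompt", required=True, help="input prompt")
    args = parser.parse_args()

    print(Inference(args.checkpoint).generate(args.prompt))
```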

@PierpaoloSorbellini The inference section is tagged as WIP. Do we have any basic inference code available in chatllama to load the actor_rl model and run a few queries?