[Chatllama] Support Inference for trained models.
PierpaoloSorbellini opened this issue · 1 comment
PierpaoloSorbellini commented
Description
Currently, to run inference with a trained model, the user has to write a small Python script by hand, loading the checkpoint or saved model produced after training according to how the library stores it.
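As a minimal sketch of what such a script looks like today, assuming the trained actor was exported as a standard Hugging Face checkpoint directory (the path below is a placeholder, not a path produced by chatllama itself):

```python
# Minimal hand-written inference script (not a chatllama API):
# load a trained actor checkpoint and generate a single completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "./checkpoints/actor_rl"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir, torch_dtype=torch.float16)
model.to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()

prompt = "Explain reinforcement learning from human feedback in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```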
Moreover, several optimizations could be integrated to speed up inference, such as:
- CPU offloading (see the accelerate-based sketch after this list).
- llama.cpp implementation.
- accelerate / deepspeed distributed inference.
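As a hedged illustration of the CPU-offloading idea, here is a sketch using Hugging Face accelerate's `device_map="auto"` dispatch, which places layers on GPU, CPU, or disk when VRAM is insufficient; the checkpoint path is again a placeholder:

```python
# Sketch of CPU offloading via accelerate's automatic device placement.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "./checkpoints/actor_rl"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    torch_dtype=torch.float16,
    device_map="auto",          # accelerate decides GPU/CPU placement per layer
    offload_folder="offload",   # spill weights to disk if CPU RAM is also short
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```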
TODO
- Implement an Inference class that makes inference straightforward, including from the CLI (a hypothetical sketch follows this list).
- Implement inference with the optimizations available in deepspeed.
- Implement inference with the optimizations available in accelerate.
- Implement fast LLaMA inference using the well-known llama.cpp library.
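The sketch below is one possible shape for such an Inference class plus a CLI entry point; the class name, arguments, and checkpoint layout are assumptions for illustration, not chatllama's actual API.

```python
# Hypothetical Inference class with a minimal CLI wrapper.
import argparse

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class Inference:
    """Wraps a trained actor checkpoint for simple text generation."""

    def __init__(self, checkpoint_dir: str, device: str = "cuda"):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
        self.model = (
            AutoModelForCausalLM.from_pretrained(checkpoint_dir, torch_dtype=torch.float16)
            .to(device)
            .eval()
        )

    @torch.no_grad()
    def generate(self, prompt: str, max_new_tokens: int = 128) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run inference on a trained model")
    parser.add_argument("--checkpoint", required=True, help="path to the saved checkpoint")
    parser.add_argument("--prompt", required=True, help="prompt to complete")
    args = parser.parse_args()
    print(Inference(args.checkpoint).generate(args.prompt))
```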
shrinath-suresh commented
@PierpaoloSorbellini The inference section is tagged as WIP. Do we have any basic inference code available in chatllama to load the actor_rl model and run a few queries?