karpathy/llama2.c

llama2.c text generation inference server in C

LVivona opened this issue · 0 comments

TL;DR: I built this C server on top of llama2.c to host small models on my MacBook for text generation apps on my computer. I wanted to share it to see if it's something people would want to use.

The first thing I realized building this is that I take higher-level languages for granted, mainly because strings are such a pain in C. This was also my first attempt at socket programming and at building a server, in C or in general, so forgive the messy code if you dare open Pandora's box, lol.

A lot of the server code is borrowed from Hugging Face's text-generation-inference, and some of the C files come from the libeom and parson repos. From text-generation-inference I also adapted most of their Python client code for a simple Python API, and the objects returned by the streaming and non-streaming generate calls are roughly the same, i.e.

token=Token(id=9038, text='Once', logprob=0.7856646180152893, special=True) generated_text=None details=None
token=Token(id=2501, text=' upon', logprob=0.9756098985671997, special=True) generated_text=None details=None
token=Token(id=263, text=' a', logprob=0.9996739625930786, special=True) generated_text=None details=None
...
token=Token(id=1, text='<EOS>', logprob=0.9589762091636658, special=True) 
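For context, this is roughly what those response objects look like on the Python side. This is only a sketch inferred from the printed output above (using pydantic, since the client prints BaseModel objects); the actual model definitions in the client may differ:

from typing import Any, Optional
from pydantic import BaseModel

# Sketch of the response models, inferred from the printed output above;
# the real definitions in the Python client may differ.
class Token(BaseModel):
    id: int          # token id from the tokenizer
    text: str        # decoded token text
    logprob: float   # (log-)probability the server reports for the token
    special: bool    # whether this is a special token (e.g. BOS/EOS)

class StreamResponse(BaseModel):
    token: Token
    generated_text: Optional[str] = None  # None on intermediate chunks in the output above
    details: Optional[Any] = None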

(Screen recording: GIF demo of the server generating text, 2023-08-28)
Note: in the GIF the tok/s are slower, approx. ~381.6 tok/s, due to what I assume is my network, but I'm not 100% sure that's the reason. I have reached speeds of 500-700 tok/s when generating <200 tokens on my MacBook, and I assume with more testing I'll be able to give a more definitive answer on why the generation speed is affected.
(Screenshot: faster tok/s results from later testing)
(Update) With some basic stress testing of the server, and testing at a time when there is less network traffic, it seems I'm getting faster results, as shown above.
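If anyone wants to sanity-check the tok/s numbers on their own machine, a quick-and-dirty timing loop over the streaming client (shown further down) is enough: just divide tokens received by wall-clock time. The Client import path here is made up for the example:

import time
from llama2 import Client  # hypothetical import path; use however you import the client

client = Client(base_url="http://127.0.0.1:9090")

start = time.time()
n_tokens = 0
for response in client.generate_stream("Once upon a time,", steps=200, seed=42):
    n_tokens += 1  # one StreamResponse per generated token
elapsed = time.time() - start
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/s")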

Some Questions You Might Be Asking

Where can you find it?

  • https://github.com/LVivona/llama2.c
  • From the contribution guidelines, I assume this code would be considered too large to just merge over, so rather than opening a pull request I thought I'd post it as an issue. If it's popular enough I'll probably move it into its own repo.

Can I use this on my public server?

  • I would eventually like that, but currently my suggestion is to not run it on any public server. I haven't done much testing on the network side, so there could be exploits that leave you vulnerable to attacks. My suggestion is to use this to serve and access your own llama2.c models locally on your own computer.

Can I run more than one model?

  • Currently no. I want to add a feature where you're able to route to a specific model listed on the command line when serving.
  • If you wanted to serve multiple models right now, you would need to spin them up on multiple ports, which is not ideal (rough sketch of that workaround below).
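To illustrate the multi-port workaround: the second model path, the second port, and the Client import path below are all made up for the example.

# Multi-port workaround sketch (hypothetical paths, ports, and import path).
# Start two servers first, e.g.:
#   ./serve ../stories15M.bin  -z ../tokenizer.bin -p 9090 -a 127.0.0.1
#   ./serve ../stories110M.bin -z ../tokenizer.bin -p 9091 -a 127.0.0.1
# Then point one Client at each port:
from llama2 import Client  # hypothetical import path for the Python client

small = Client(base_url="http://127.0.0.1:9090")
big = Client(base_url="http://127.0.0.1:9091")

print(small.generate("Once upon a time,", steps=100, seed=42))
print(big.generate("Once upon a time,", steps=100, seed=42))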

Does it run on Windows?

  • From my understanding, no. Unfortunately pthreads only runs on Unix machines, as Windows has its own way of threading and requires you to go through windows.h. I'm currently in the process of making a Windows thread pool so it can be done.

What is your current hardware?

  • MacBook Pro 8GB Memory, Apple M2 Chip

What is this UI?

  • A clone of ChatGPT I have been building for these small models, with my own plugin features integrated to better manage tasks. It's built in React, with a Python FastAPI backend to store information. I have it private on my GitHub, as I'm not quite ready to release it, but there are definitely other options you can use if you want your own GUI.

If you have any more questions let me know, I would be happy to answer them if I can.

Python API

  • Similar to Hugging Face text-generation-inference, it's simple:
# streaming generation: yields one StreamResponse per token
client = Client(base_url="http://127.0.0.1:9090")
for response in client.generate_stream("Once upon a time,", steps=100, seed=42):
    print(response)  # BaseModel object StreamResponse

# one-shot generation: returns the full response at once
client = Client(base_url="http://127.0.0.1:9090")
print(client.generate("Once upon a time,", steps=100, seed=42))

I'll probably add some async functions similar to the text-generation-inference repo as well, but I wanted to get the base done first (rough sketch of what that might look like below).
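For what it's worth, here is a sketch of what an async streaming helper could look like. This assumes aiohttp and assumes the server streams newline-delimited JSON to the same /v1/stream/generate endpoint used in the curl examples below; the actual wire format may differ:

import asyncio
import json
import aiohttp

async def generate_stream(prompt, steps=100, seed=42, base_url="http://127.0.0.1:9090"):
    payload = {"inputs": prompt, "parameters": {"steps": steps, "seed": seed}}
    async with aiohttp.ClientSession() as session:
        async with session.post(f"{base_url}/v1/stream/generate", json=payload) as resp:
            # assumes one JSON object per line; adjust if the server uses SSE instead
            async for line in resp.content:
                line = line.strip()
                if line:
                    yield json.loads(line)

async def main():
    async for chunk in generate_stream("Once upon a time,"):
        print(chunk)

asyncio.run(main())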

curl

# streaming endpoint
curl http://localhost:9090/v1/stream/generate \
-X POST \
-d "{\"inputs\":\"Once upon a time,\",\"parameters\":{\"steps\":100,\"seed\":42}}" \
-H "Content-Type: application/json"

# non-streaming endpoint
curl http://localhost:9090/v1/generate \
-X POST \
-d "{\"inputs\":\"Once upon a time,\",\"parameters\":{\"steps\":100,\"seed\":42}}" \
-H "Content-Type: application/json"

How Can I Run it?

  • I have not tried running it on other computers, so forgive me in advance if it doesn't work right away.
>>> make all  # to build the executable ./serve
>>> ./serve ../stories15M.bin -z ../tokenizer.bin -p 9090 -a 127.0.0.1