LLM-Runner

A basic getting-started guide to running LLMs locally on Android, or through an external Ollama server


STEPS

Video of text generation: https://github.com/santoshdkolur/LLM-Runner/assets/48786464/cb9a881e-a9c4-4a40-9a45-b6e1e013c5f6

What this guide covers:

  1. How to set up an Android device to run an LLM locally.
  2. How to use my basic Flutter application to interact with the model.
  3. How to connect to a public Ollama runtime, hosted on your very own Colab notebook, to try out the models.

HOW TO SET UP YOUR ANDROID DEVICE TO RUN AN LLM MODEL LOCALLY

  1. We will be using the Termux application as our base. You can download the latest version of Termux from here: Termux Releases.
  2. Once Termux is installed, open it up and install a basic Ubuntu environment using the guide from the AnLinux application (AnLinux App).
  3. You can either use the app to get the Ubuntu install code and follow its steps, or follow the guide below.
  4. On Termux, run the following command:
pkg install wget openssl-tool proot -y && hash -r && wget https://raw.githubusercontent.com/EXALAB/AnLinux-Resources/master/Scripts/Installer/Ubuntu/ubuntu.sh && bash ubuntu.sh
  5. This installs Ubuntu on your device. To start Ubuntu, type ./start-ubuntu.sh in your terminal.
  6. From here on, we can follow the normal guide to install Ollama on Linux. Run the command below to download and install Ollama on your device:
curl https://ollama.ai/install.sh | sh
  7. Once Ollama is installed, start the Ollama server in the background by running:
ollama serve &
  8. You can see in the output that the Ollama server is now running on your phone. By default it listens on the endpoint "localhost:11434".
  9. To verify that the server is running, open any browser on your phone and visit "localhost:11434" (without quotes). You should see the message "Ollama is running".
  10. You can now pull any LLM that you would like to run on your phone. Please choose a model based on your phone's specifications. For this example, let's run the "tinyllama:chat" model.
  11. Open the Ollama website in your browser and click the Models option in the top left corner.
  12. Use the search bar to search for tinyllama and select it. Under the Tags option, copy the command shown next to the "chat" tag.
  13. Paste that command into your terminal to run the model yourself and test it out. 😊 (A short command-line recap of this whole section follows this list.)
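For reference, once Ubuntu is up, the whole local setup from the terminal looks roughly like this. tinyllama:chat is just the example model from this guide; substitute whichever model's command you copied from the Ollama website:

ollama serve &                  # start the server in the background
curl http://localhost:11434     # should print "Ollama is running"
ollama pull tinyllama:chat      # download the example model
ollama list                     # confirm the model is available locally
ollama run tinyllama:chat       # chat with it directly in the terminal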

HOW TO USE THE BASIC FLUTTER APP TO INTERACT WITH THE MODEL

  1. Follow the previous guide to set up the device up to the point where the Ollama server is running on port 11434.
  2. You can download and store as many models as you want by copying the model commands from the Ollama website as shown earlier and replacing the word 'run' with 'pull'. For example, if the command you copied is "ollama run tinyllama:chat", open your terminal and run "ollama pull tinyllama:chat".
  3. Download my Flutter application from the GitHub repo: LLM Runner.
  4. On opening the application, enter the Ollama endpoint; in this case it is http://localhost:11434.
  5. In the top right corner you should see a dropdown listing all the models currently downloaded to Ollama. Choose the model you would like to run; since we only have one for now, select tinyllama:chat. (A curl sketch of the API calls behind this dropdown and the chat box follows this list.)
  6. Type your message at the bottom of the screen and hit send. Since the model is running locally on your phone, inference will be much slower than on, say, a computer; it also depends on the size of the model and the available RAM on your smartphone.
  7. Your chat history is saved in the application sidebar, where you can manage sessions. Swipe to delete old sessions.
  8. You can start a new session by tapping the "+" icon in the top right corner.
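Under the hood, the app is a client for Ollama's standard HTTP API, so you can reproduce (and debug) the two operations it relies on with curl from the Ubuntu shell. Whether the app calls /api/generate or /api/chat internally is an implementation detail I have not verified; both endpoints exist in Ollama, and either works for a quick test:

curl http://localhost:11434/api/tags        # lists downloaded models, i.e. what the dropdown shows
curl http://localhost:11434/api/generate -d '{
  "model": "tinyllama:chat",
  "prompt": "Hello! Who are you?",
  "stream": false
}'                                          # returns one JSON response instead of a stream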

CONNECT TO AN OLLAMA ENDPOINT RUNNING ON COLAB

  1. Here is the link to the Colab notebook: Colab Notebook. Please save a copy to your Drive before running it.
  2. Next, let's create an ngrok account. We need it to make the Ollama server endpoint accessible over the internet.
  3. Head to Ngrok and create a free account. Once done, click the Auth-Token option in the sidebar and copy your token.
  4. Go back to your Colab file and paste the auth token into the second cell of the notebook, replacing <ngrok authtoken> with your token.
  5. You can replace tinyllama:chat in the command "run_process(['ollama', 'pull','tinyllama:chat'])" in cell 3 with the LLM that you wish to run.
  6. Once all the changes are made, make sure your runtime is set to T4 GPU (top right corner).
  7. Run the cells one by one.
  8. The last cell should print an ngrok link in its output; copy it. Example: https://9f5f-35-233-183-148.ngrok-free.app (do not end the URL with '/').
  9. Now, download the LLM Runner application from LLM Runner.
  10. When you open the application, it will ask for the Ollama endpoint URL. Paste the URL you copied from Colab, e.g. https://9f5f-35-233-183-148.ngrok-free.app (do not end the URL with '/'). A quick way to sanity-check the tunnel first is shown after this list.
  11. In the top right corner you should see a dropdown listing all the models currently downloaded to Ollama. Choose the model you would like to run; since we only have one for now, select tinyllama:chat.
  12. Type your message at the bottom of the screen and hit send. Since the model is now running on the Colab GPU rather than on your phone, inference should be much faster than in the local setup; speed still depends on the size of the model and your network connection.
  13. Your chat history is saved in the application sidebar, where you can manage sessions. Swipe to delete old sessions.
  14. You can start a new session by tapping the "+" icon in the top right corner.
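Before pasting the ngrok URL into the app (step 10), it can be worth sanity-checking the tunnel from any machine. The URL below is the example from step 8; substitute your own:

curl https://9f5f-35-233-183-148.ngrok-free.app            # should print "Ollama is running"
curl https://9f5f-35-233-183-148.ngrok-free.app/api/tags   # should list the model pulled in cell 3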
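The notebook handles the tunnelling for you, but if you ever want to expose a locally running Ollama server the same way without Colab, the plain ngrok CLI equivalent is roughly the following (using the auth token from step 3; the --host-header flag rewrites the Host header, which Ollama may otherwise reject):

ngrok config add-authtoken <ngrok authtoken>        # one-time setup
ollama serve &                                      # Ollama listening locally on port 11434
ngrok http 11434 --host-header="localhost:11434"    # prints a public ngrok-free.app URL you can paste into the app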