A GPU-poor developer’s not-so-detailed journey so far on building apps that integrate local LLMs
When I first discovered Hugging Face back in the first quarter of 2023, I was excited that I could just download and run pre-trained models on my machine. I remember copying the example sentiment-classification script and running it on my Hetzner VPS. From there grew my interest in building side projects that integrate AI models. But I don’t have a GPU.
Early in my journey with locally running Large Language Models (LLMs), I played with Ollama and had a bit of fun chatting with these LLMs without needing a GPU. Eventually, when the relatively small Phi 3 Mini and Gemma 2 2B models came out, I found them interesting and impressive, my favorite being gemma-2-2b-it-abliterated. Thanks to these small models, I could finally build something useful that my hardware can handle within tolerable latency. I set out to build apps with the llama.cpp server for inference. Most of my projects don’t use LLM-related libraries or packages such as the popular LangChain and Instructor, which have a lot of features. All I need is structured outputs, and llama.cpp’s JSON-schema-to-GBNF grammar conversion is good enough for me. This keeps container images relatively lightweight for deployment, makes things easy to extend to my liking, and leaves fewer frameworks to learn and depend on.
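To make the structured-output setup concrete, here is a minimal sketch of how a request to a llama.cpp server (`llama-server`) can constrain generation with a JSON schema via its `/completion` endpoint, which converts the schema to a GBNF grammar internally. The URL, prompt, and schema fields are illustrative assumptions, not taken from my actual pipeline.

```python
import json
import urllib.request

def build_extraction_request(text: str) -> dict:
    """Build a /completion payload that forces JSON output matching a schema."""
    schema = {
        "type": "object",
        "properties": {
            "customer": {"type": "string"},
            "total_price": {"type": "number"},
        },
        "required": ["customer", "total_price"],
    }
    return {
        "prompt": f"Extract the customer name and order total:\n{text}\n",
        "json_schema": schema,  # llama-server turns this into a GBNF grammar
        "n_predict": 128,
        "temperature": 0.0,     # keep extraction as deterministic as possible
    }

def post_completion(payload: dict,
                    url: str = "http://localhost:8080/completion") -> dict:
    """POST the payload to a locally running llama-server (assumed address)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_extraction_request("Ana ordered two lattes for $9.50 total.")
print(json.dumps(payload["json_schema"], indent=2))
```

With no extra framework in between, the whole "structured outputs" feature is just one field in the request body, which is exactly why the container images stay small.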
As expected, without a GPU, I have to wait patiently while the CPU works very hard to produce the output I expect, only to pass a supposed-to-be-structured-but-now-broken output to the next component of the pipeline. It’s a mix of fun and frustration, but the experience is rewarding overall. LLMs are probabilistic in nature; they need to be put to use in the proper places, and finding where those places are is the LLM user’s responsibility.
These days, the LLM-integrated applications I run in production are powered by serverless GPUs, but I mostly develop using my local setup without any GPU. The best thing I’ve built so far is an information-extraction pipeline that my girlfriend uses for work. It used to run only on my server’s CPU and took at least 3-4 minutes to finish, but it now completes within 2 minutes using a serverless GPU service. There’s also a website that only the two of us can access to generate anime-style stickers using FLUX.1-schnell. Another project I really like is a pipeline integrating faster-whisper and Qwen2.5-Math-Instruct in Tool-Integrated Reasoning (TIR) mode to compute how much a customer’s coffee order costs.
Though I’m GPU-poor, I’m excited about the current trend where small models are getting better and more options are available. The table below shows my current go-to small models and how I use them.
| Name | Parameters (B) | Use case |
|---|---|---|
| gemma-2-2b-it-abliterated | 2.61 | General purpose |
| NuExtract-v1.5 | 3.82 | Information extraction |
| SmolLM2-1.7B-Instruct | 1.71 | Chat |
| Josiefied-Qwen2.5-1.5B-Instruct-abliterated-v1 | 1.78 | Information extraction |