Local AI is Hitting the Sweet Spot

Local LLMs are finally starting to feel effective as a daily driver.

Six months ago, my use of language models for chat and coding was split roughly 10% local, 90% cloud, and that local use was mostly a quick check of “are we there yet?”

Now, with Qwen3.5 and Gemma 4, I’m spending more than half of my chat-time in a local web UI.

Between the data privacy, the model diversity, and the absence of advertisements, local has become my default for most tasks.

The cloud still wins on huge reasoning models, fast inference speed, and massive context, but most of my day-to-day tasks don’t need any of those strengths.

Plus, GPU hardware continues to see supply scarcity and obscene prices. I got lucky with a used RTX 3090 for $600 USD in Fall 2025, but for most people, a new local rig isn’t feasible in 2026.

The main purpose of this post is to highlight a few QOL improvements in local LLMs that have hyped me up for the future of local compute. We’re rapidly moving away from a simple chat-bot interface to a much more complex personal assistant toolkit.

Hardware & Engine

As mentioned, my hardware is nothing complicated: a single dedicated RTX 3090 with 24 GB of VRAM. It’s a five-year-old card, and it’s still plenty fast for my needs (~35 t/s with Gemma 4).

I have recently simplified my software setup from OpenWebUI to llama-server’s latest WebUI, and I’m loving it.

The interface is lightweight, it allows dynamic model loading/unloading, and it recently gained MCP server support, which brings me to the first two major QOL improvements: get_current_time and SearXNG.

Bridging Small Models with Live Search Data

I run a simple MCP server in Python that exposes two primary tools:

  • A lightweight “get_current_time” tool, and
  • A self-hosted SearXNG search engine, connected via Ihor-sokoliuk’s Searxng_mcp bridge

These two tools, paired with a basic system prompt like “Always check the current time and use the Search tool when relevant,” have been a game changer for me.
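For a sense of scale, the time tool is tiny. Here’s a minimal sketch in plain Python — the MCP plumbing is left out (in practice you’d register these with an MCP SDK), and the `TOOLS` registry and `call_tool` dispatcher are illustrative names, not part of any real framework:

```python
from datetime import datetime, timezone

def get_current_time() -> str:
    """Return the current UTC time as an ISO-8601 string.

    Exposed to the model as a tool so it can anchor questions like
    "what's the latest..." to a real date instead of its training cutoff.
    """
    return datetime.now(timezone.utc).isoformat(timespec="seconds")

# Hypothetical registry mapping tool names to callables; an MCP server
# would advertise these names and docstrings to the model.
TOOLS = {"get_current_time": get_current_time}

def call_tool(name: str, **kwargs):
    # Dispatch a model-issued tool call by name.
    return TOOLS[name](**kwargs)
```

The search tool works the same way conceptually; it just forwards a query string to the SearXNG HTTP API and returns the result snippets.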

My normal questions of “what’s up with [news event]” or “how is the new open source model XYZ compared to the previous generation” used to yield obnoxious responses like “Those things aren’t real.”

With the addition of search, I’m now using local models easily five times as often.

And once I realized how easy it was to add, I was reminded that MCP (and Skills) provide an extremely powerful connection between models and code.

Experimenting with MCP and Skills

One downside to the integration of MCP servers with a local platform like llama-server is you have to iteratively set up each MCP connection.

Also, the MCP docstring that tells the model how every tool is supposed to be used is loaded into context during each conversation.

For folks like me with limited VRAM, that context cost can be unnecessarily restrictive.

Some community members recommend enabling/disabling MCP servers on the fly, which most of the local web UI tools support, but this too is an extra step that requires the user to pay careful attention to which tools are loaded or unloaded. Even knowing which tools you need in a conversation ahead of time can be a challenge.

I’ve chosen to break the normal MCP implementation by giving my local models access to an MCP “hub” that uses traditional “list” and “help <tool>” tools.

This “router” style for MCP isn’t a unique idea, but the reason I’m writing about it is because it highlights just how diverse the MCP interface can be.

Here are two other projects doing the same thing:

As an educational exercise, I’ve created a similar project for myself with a registry of commands. This has been a huge QOL improvement for me because it allows the model to choose when to hot-swap MCP tools without taking up as much context.
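The router pattern is simple enough to sketch in a few lines. This is a hedged illustration, not my actual implementation — the registry contents and function names are placeholders. The model’s context only ever carries the two meta-tools; per-tool documentation is fetched on demand:

```python
# Sketch of a "router"-style MCP hub: instead of loading every tool's
# docstring into context up front, the model sees only two tools --
# list and help -- and pulls documentation for a tool when it needs it.

REGISTRY = {
    "get_current_time": "Return the current UTC time as an ISO-8601 string.",
    "web_search": "Query a self-hosted SearXNG instance; args: query (str).",
}

def list_tools() -> list[str]:
    """Cheap to keep in context: just the tool names, no docs."""
    return sorted(REGISTRY)

def help_tool(name: str) -> str:
    """Fetch one tool's documentation only when the model asks for it."""
    return REGISTRY.get(name, f"unknown tool: {name}")
```

The trade-off is an extra round trip (the model has to call help before it can use an unfamiliar tool), but for a VRAM-constrained setup that’s usually a fair price for the reclaimed context.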

Sandboxing for Safety

For agentic coding, the Docker team has been working diligently on the Docker Sandbox toolkit and I can’t recommend it enough: https://www.docker.com/products/docker-sandboxes/

For other LLM-code interfaces like MCP servers, I highly recommend containerizing them as well. You can see an extremely brief example setup I run here: https://github.com/twwhite/twio-cairn/
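As a rough sketch of what “containerized” means here (the image name and mount path are placeholders; the flags themselves are standard docker run options), a locked-down MCP server launch might look like:

```shell
# Run a hypothetical MCP server image with no network access,
# a read-only root filesystem, and only a tmpfs scratch space.
# Inputs are bind-mounted read-only.
docker run --rm \
  --network=none \
  --read-only \
  --tmpfs /tmp \
  -v "$PWD/data:/data:ro" \
  my-mcp-server:latest
```

Tools that genuinely need network access (like the SearXNG bridge) obviously can’t use `--network=none`, but they can still be restricted to a dedicated Docker network that only reaches the search container.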

Looking Forward

A huge shout-out to the llama.cpp devs, the /r/localllama community, the Unsloth team, the Qwen team, the MLX team, the community members who contribute day-one quants on Hugging Face, build verification, jailbreaking, and all the other vital contributions to this wonderful ecosystem.

There are so many fascinating areas where local LLMs are being continuously developed and improved.

Some specific projects I’m keeping my eye on:

  • Google’s recent TurboQuant paper and how polar relations may enable smaller lossless quants
  • Better RAM/Disk offloading strategies for MoE models
  • Improvements to the Home Assistant local LLM integration
  • Better Omni-supported local models (shame about Gemma native voice being bound to the E2B and E4B models)
  • Power efficiency trends for unified memory systems

What do you think is going to be the next big breakthrough?