Explorations & Thoughts
Log of select projects, experiments, and thoughts. I began documenting these very late in the process and there's a backlog from pre-2024 I hope to catch up on sometime; feel free to check out my GitHub repos, gists, or X feed for more in the meantime. -Rishi
Apr 2026
Precision != Information Density?
It's fascinating how we treat "full precision" in LLMs: 32-bit floats as the ground truth of intelligence, with quantization often viewed as just a tool to save VRAM. But after tinkering with signal processing techniques, learned projections, etc., I'm more convinced that precision is just a property of the data container, whereas information density and structure are what matter more. Back in January, I had this vague notion that the real breakthrough in efficiency wouldn't come from making operations faster, but from figuring out which computations to ignore or reuse. Today, it feels like some of "efficiency engineering" could really just be repurposing old-school signal processing techniques to strip away the floating-point "jitter", i.e., bits that represent noise rather than actual semantic meaning.
I ran an embedding/retrieval experiment with 50k documents, comparing a standard fp32 index against one built on a 2-bit learned quantizer that borrows DSP techniques. At k=1, the 2-bit model is "blurry" (expected), but as k increases to 10 and 20, the 2-bit learned signal actually outperforms fp32. My intuition is that this is a denoising effect: by quantizing intelligently, we're applying a semantic low-pass filter that ignores the high-frequency noise distracting the distance calculations in high-dimensional space. The graph in the linked post shows the crossover clearly: as the search space widens, the "lower-resolution" structured signal becomes a more reliable map of the global topology than the raw, unfiltered floats.
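To make the low-pass intuition concrete, here's a minimal sketch on synthetic data. A crude per-dimension quartile quantizer stands in for the learned one (the corpus, dimensions, and thresholds are all illustrative): each float collapses to one of four buckets, and the scan runs purely on the 2-bit codes.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64)).astype(np.float32)        # synthetic corpus
query = docs[42] + 0.1 * rng.normal(size=64).astype(np.float32)

# Per-dimension quartile thresholds -> 2-bit codes (values 0..3).
# A learned quantizer would place these cuts to preserve semantic
# structure; quartiles are the crude stand-in here.
t = np.quantile(docs, [0.25, 0.5, 0.75], axis=0)             # (3, 64)
codes = (docs[None] > t[:, None, :]).sum(axis=0)             # (1000, 64)
qcode = (query[None, None] > t[:, None, :]).sum(axis=0)[0]   # (64,)

# "Blurry" scan: L1 distance in code space ignores sub-bucket jitter,
# which is exactly the low-pass behavior described above.
coarse = np.abs(codes - qcode).sum(axis=1)
top20 = np.argsort(coarse)[:20]
print("true neighbor in top-20:", 42 in top20)
```

The noisy copy of document 42 still lands in the coarse top-20 because the per-dimension jitter rarely crosses a bucket boundary, while unrelated documents differ by whole buckets in most dimensions.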
I also noticed another approach performed better at low k, which led me to a hybrid, adaptive quantizer. If we structure the information so the first 2 bits carry the core topology and the subsequent bits carry the residual error, we effectively create a dynamic filter: use the blurry 2-bit base for a massive, efficient scan to find the general neighborhood, then layer in the residuals for a sharp 4-bit refinement at the very top. This yields a system that is 8-16x smaller yet beats fp32 in every scenario in this specific embedding-and-retrieval experiment. The other advantage of such an encoding is that the low-bit scan can use SIMD-accelerated computation and similar tricks, making it far faster, while the expensive refinement work is deferred to the CPU/GPU on demand.
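Under the same toy assumptions as above, the hybrid scheme can be sketched as a two-stage search: a cheap scan over the 2-bit base codes picks a shortlist, and only the shortlist gets the residuals layered back in. (Here the residuals stay in float just to show the layering reconstructs the original exactly; the real scheme would quantize them too.)

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(2000, 32)).astype(np.float32)
query = docs[7] + 0.05 * rng.normal(size=32).astype(np.float32)
d = docs.shape[1]

# 2-bit base: quartile codes plus per-bucket centers for dequantization.
t = np.quantile(docs, [0.25, 0.5, 0.75], axis=0)
codes = (docs[None] > t[:, None, :]).sum(axis=0)
centers = np.quantile(docs, [0.125, 0.375, 0.625, 0.875], axis=0)
base = centers[codes, np.arange(d)]          # blurry reconstruction
resid = docs - base                          # the "subsequent bits"

qcode = (query[None, None] > t[:, None, :]).sum(axis=0)[0]

# Stage 1: massive, cheap scan in code space (SIMD/popcount territory).
shortlist = np.argsort(np.abs(codes - qcode).sum(axis=1))[:64]
# Stage 2: layer residuals back in and rank only the shortlist exactly.
refined = base[shortlist] + resid[shortlist]
dists = np.linalg.norm(refined - query, axis=1)
best = shortlist[np.argsort(dists)][0]
```

The design point is that the expensive per-element math only ever touches the shortlist, so the scan cost is governed by the tiny code array, not the full-precision vectors.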
This experiment is definitely promising and has real-world utility, but the real stress test will be how much of this transfers beyond embeddings and into dynamic state like the KV cache, where the "signal" is constantly evolving. The real takeaway for me is that precision alone is neither signal nor information density, and there's a lot to explore in efficiency engineering and training rather than just throwing compute or memory at noise.
Mar 2026
Edge AI is a Fundamentally Different Optimization Problem — and GPU-first Playbooks are Often Wrong
We've grown lazy with standard PyTorch and MLX defaults for local/edge AI. For edge AI, the "best" compute unit isn't a constant; it's a function of workload, batch size, cache residency, hardware arithmetic intensity, and more. I've been hearing a lot about Apple's ANE recently, and while the ANE is finally getting attention, the AMX (Apple's matrix coprocessor) remains under-explored and under-exploited. In basic tests, I hit 1.4 TFLOPS on an M2 laptop, making AMX ~481x faster than naive C++ and orders of magnitude faster than even SIMD/NEON.
Skip the popular AI abstractions, write custom kernels that target the AMX (via Accelerate), use CoreML to target the ANE, and the performance gap is staggering. On an M2 (a 4-year-old device), a C++/Accelerate (AMX) MiniLM cross-reranker hits a 47ms cold start, nearly 150x faster than the ~7s of Python/MPS, and query times are great too. While the industry fixates on GPUs, I found that for single-query latency, the CPU/AMX crushes the GPU by eliminating dispatch overhead and the memory-move tax. The key is data structuring: by staying resident in the tiny L-cache/SLC/SRAM, one can get outsized latency and throughput efficiencies for many tasks. It's work, but the payoff is huge for certain tasks.
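You don't need exotic tooling to reach the AMX: Accelerate's BLAS routes large GEMMs through it, and anything linked against Accelerate inherits that, including a NumPy built that way. A rough, hypothetical harness (sizes and the GFLOPS arithmetic are illustrative; on non-Apple platforms the same call just hits whatever BLAS is linked):

```python
import time
import numpy as np

# On an Apple Silicon Mac with NumPy linked against Accelerate, this GEMM
# is dispatched to the AMX blocks; elsewhere it uses the local BLAS. The
# point stands either way: the matrix hardware is reached through a plain
# BLAS call, no ML framework required.
n = 512
rng = np.random.default_rng(0)
a = rng.normal(size=(n, n)).astype(np.float32)
b = rng.normal(size=(n, n)).astype(np.float32)

start = time.perf_counter()
c = a @ b                                      # sgemm under the hood
elapsed = time.perf_counter() - start
gflops = 2 * n ** 3 / elapsed / 1e9            # matmul costs ~2*n^3 flops
print(f"{gflops:.1f} GFLOPS")
```

In C++ the equivalent is a direct `cblas_sgemm` call against the Accelerate framework, which is how the reranker above stays framework-free.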
Some of the trade-offs are also interesting: ANE+GPU dominates throughput for batch processing (e.g., embedding 7,500+ docs/sec) but degrades once you hit ANE tiling limits, while CPU-only AMX scales more steadily and wins at low batch sizes. Edge AI isn't about the biggest model, the most VRAM, or simply running quantized models. It's about building hardware-accelerated, cache-optimized, and workload-aware pipelines. It's about finding the optimal path to let users multitask while delivering intelligence without the wait.
Mar 2026
Intent Classification with Margin Loss
Intent classification (query → intent) on Banking77 (MTEB) with a simple margin-loss head is surprisingly strong. Freeze a SentenceTransformer encoder (e.g., MiniLM-L6-v2 or BGE-Small), L2-normalize embeddings, and train a lightweight MLP using a combined margin loss + label-smoothed cross-entropy (90/10 weight split; label smoothing=0.1). With margin=0.5, dropout=0.3, lr=1e-3 and early stopping, a "dumb" MiniLM-L6-v2 jumped ~32% in accuracy — achieving top-1 93.5% and top-3 98.3% — while keeping deployment cost near-zero since only the MLP is trained. If you can generate labeled data easily, this is a low-resource, high-accuracy option for intent classification.
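The combined objective is easy to write down. Here's a minimal NumPy sketch under my reading of the recipe, using a multi-class hinge as the margin term blended 90/10 with label-smoothed cross-entropy; in the actual setup this sits on top of frozen, L2-normalized SentenceTransformer embeddings, and the exact margin formulation is an assumption:

```python
import numpy as np

def combined_loss(logits, y, margin=0.5, smooth=0.1, w_margin=0.9):
    """0.9 * margin loss + 0.1 * label-smoothed CE, per the recipe above.
    The margin term here is a multi-class hinge: every wrong class must
    trail the true class by at least `margin`."""
    n, c = logits.shape
    true = logits[np.arange(n), y][:, None]
    hinge = np.maximum(0.0, margin - (true - logits))
    hinge[np.arange(n), y] = 0.0                 # don't penalize the true class
    margin_loss = hinge.sum(axis=1).mean()

    m = logits.max(axis=1, keepdims=True)        # numerically stable log-softmax
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    target = np.full((n, c), smooth / c)         # smoothed one-hot target
    target[np.arange(n), y] += 1.0 - smooth
    ce = -(target * logp).sum(axis=1).mean()
    return w_margin * margin_loss + (1.0 - w_margin) * ce

# Confident, correct logits should score far lower than wrong ones.
good = combined_loss(np.array([[4.0, 0.0, 0.0]]), np.array([0]))
bad = combined_loss(np.array([[4.0, 0.0, 0.0]]), np.array([1]))
```

Training then just means backpropagating this through the small MLP head; the encoder never sees a gradient.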
Dec 2025
Self-Correcting Smol Models via Token Engineering
Proof of concept showing how monitoring token probabilities during inference lets small models detect and correct bad outputs in real time. The system intercepts likely errors, triggers verification loops with a code runtime, and returns control, with no multi-turn prompting needed. Surprisingly, it outperforms larger RL-trained models that second-guess themselves despite built-in verification. Small models have improved dramatically over three years, but tool-calling reliability remains the critical bottleneck; solving it unlocks edge AI at scale.
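The monitoring loop itself is small. A sketch with a scripted stand-in for the decode step (`step_fn`, the threshold, and the flagging policy are all illustrative, not the actual system):

```python
CONF_THRESHOLD = 0.55  # assumption: tuned per model/task in practice

def generate_with_watchdog(step_fn, max_tokens=32):
    """Decode while watching the chosen token's probability. Positions
    where confidence dips below the threshold get flagged so a
    verification pass (e.g. executing the emitted code) can run before
    the output is returned."""
    tokens, flags = [], []
    for i in range(max_tokens):
        token, probs = step_fn(tokens)
        if token is None:                    # end of sequence
            break
        tokens.append(token)
        if probs[token] < CONF_THRESHOLD:    # low confidence -> verify later
            flags.append(i)
    return tokens, flags

# Scripted decode: token 7 is emitted with low confidence and gets flagged.
script = [(3, {3: 0.91}), (7, {7: 0.38}), (None, {})]
tokens, flags = generate_with_watchdog(lambda prefix: script[len(prefix)])
```

The key property is that flagging happens inline during generation, so the verification loop fires at the exact position of the suspect token instead of after a full multi-turn round trip.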
Dec 2025
AI Discourse vs. Reality (Analysis)
Today, 80% of active consumer AI users send fewer than 4 messages per day — behavior that resembles search (occasional queries) rather than true assistant use (continuous engagement). A handful of power users dominate the conversation, skewing perceptions. The future is here, just not evenly distributed yet.
Dec 2025
Rethinking Function Calling with Small Language Models
The bottleneck in reliable function calling with small models isn't model size — it's probability collapse during unconstrained generation. By observing token-level confidence and decomposing function calling into staged classification and extraction steps with constrained decoding, even 350-700MB models achieve reliable results.
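Constrained decoding at each stage is essentially logit masking. A sketch with synthetic logits (the staged vocabularies here are placeholders, e.g. first the set of valid function names, then the set of valid argument tokens):

```python
import numpy as np

def constrained_pick(logits, allowed):
    """Mask everything outside the current stage's vocabulary so
    probability mass cannot collapse onto an invalid token."""
    masked = np.full_like(logits, -np.inf)
    masked[list(allowed)] = logits[list(allowed)]
    return int(np.argmax(masked))

logits = np.array([0.1, 1.2, 0.3, 2.9, 0.7])   # unconstrained argmax is 3
choice = constrained_pick(logits, allowed={1, 2})   # but 3 is invalid here
```

Decomposing the call into stages means each mask is tiny and exact, which is what keeps a sub-GB model from drifting into malformed output.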
Dec 2025
The AI Recalibration: Cloud-first to Edge-first (Analysis)
Is the cloud-first AI era already peaking? While the industry fixates on massive infrastructure, model efficiency has surged 50-100x in just five years. We are reaching a tipping point where 3B-parameter local models are outperforming the 175B-parameter giants of 2020. This analysis explores the parameter and intelligence efficiency, and why the next 2-3 years could see true personal AI workloads move off the server for common tasks and onto your device.
Nov 2025
Instant TTS on the Edge
Building a custom, native C++ inference runtime for Kokoro TTS makes high-quality audio generation possible entirely on-device, on the CPU. This optimized implementation delivers a Real-Time Factor (RTF) of 0.2, generating speech roughly five times faster than real time, and provides up to a 10x reduction in cold-start times and a 3x reduction in memory usage compared to standard inference frameworks.
Nov 2025
Sub-150ms Cold Starts and Instant Semantic Retrieval
Most local embedding setups are weighed down by gigabytes of RAM and long cold-start latencies (and thus people believe they need cloud setups). By building a standalone C runtime on ONNX Runtime (C + ORT), I was able to achieve a 130ms cold start and 10ms inference. This implementation delivers an 80x faster startup and a 10x lower memory footprint (~170MB) than standard PyTorch/Hugging Face configurations, enabling instant, private semantic search entirely on the CPU.
Nov 2025
Make Lynx Popular Again: Semantic 'AI Browser' for High-Recall Search
Built on top of the ultra-lightweight Lynx engine, this "AI browser" prototype replaces SEO-driven rankings with semantic relevance. By stripping web bloat and focusing on information density, it delivers high-recall search results optimized for knowledge retrieval rather than human sorting. Packaged as a Model Context Protocol (MCP) server, it provides a faster, cleaner alternative to traditional web search or fetch APIs for AI agents and CLI-based workflows. The challenge, however, is that Lynx doesn't have a JS runtime, so it needs more work.
Oct 2025
MCP Tool Disambiguation: Tool Descriptor Tuning via Simulations
ML techniques first spot confusing tool pairs (e.g. reset_user_password vs reset_system_password at high similarity). Then: use an LLM as the simulation engine to generate realistic but ambiguous user queries against those pairs → measure mis-selection risk → iteratively rewrite descriptions until the problematic similarity collapses. A lightweight, repeatable way to make tool-calling more reliable without over-engineering the model.
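The first step can be sketched without any model at all. Token Jaccard stands in for embedding cosine similarity here, and the tool descriptions and threshold are made up for illustration:

```python
def jaccard(a, b):
    """Token-overlap similarity: a dependency-free stand-in for the
    embedding cosine similarity the real pipeline would use."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def confusing_pairs(tools, threshold=0.5):
    """Flag tool pairs whose descriptions are suspiciously similar; these
    are the pairs the LLM simulation then probes with ambiguous queries."""
    names = list(tools)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = jaccard(tools[a], tools[b])
            if score >= threshold:
                flagged.append((a, b, round(score, 2)))
    return flagged

tools = {
    "reset_user_password": "Reset the password for a user account",
    "reset_system_password": "Reset the password for the system admin account",
    "get_weather": "Fetch current weather for a city",
}
pairs = confusing_pairs(tools)
```

After each description rewrite, re-running this check closes the loop: the iteration stops once the flagged pair's similarity drops below the threshold.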
Oct 2025
MCP Tool Disambiguation: Fixing Agentic Tool-Call Misfires
LLM agents often fail when faced with overlapping tool descriptions or similar function names across MCP servers. I built a middleware layer that intercepts these calls, using a custom hybrid similarity score — rather than standard cosine similarity alone — to rank tool intent. By identifying conflicts and returning structured hints for self-correction, this system eliminates execution loops and significantly improves reliability in complex, multi-tool agentic workflows.
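One possible shape of that middleware, sketched with stand-in components: fuzzy name similarity blended with description token overlap instead of cosine similarity alone, and a structured hint returned whenever the top two candidates are too close to call. The weights, the ambiguity gap, and both similarity components are illustrative, not the actual hybrid score.

```python
from difflib import SequenceMatcher

def hybrid_score(query, name, desc, w_name=0.3):
    """Blend fuzzy tool-name similarity with description token overlap
    rather than trusting a single similarity signal."""
    name_sim = SequenceMatcher(None, query.lower(), name.lower()).ratio()
    q, d = set(query.lower().split()), set(desc.lower().split())
    desc_sim = len(q & d) / len(q | d) if q | d else 0.0
    return w_name * name_sim + (1 - w_name) * desc_sim

def route(query, tools, ambiguity_gap=0.05):
    """Return the winning tool, or a structured hint the agent can use
    to self-correct instead of looping on a bad call."""
    scored = sorted(((hybrid_score(query, n, d), n) for n, d in tools.items()),
                    reverse=True)
    if scored[0][0] - scored[1][0] < ambiguity_gap:
        return {"status": "ambiguous", "candidates": [n for _, n in scored[:2]]}
    return {"status": "ok", "tool": scored[0][1]}

tools = {
    "reset_user_password": "Reset the password for a user account",
    "reset_system_password": "Reset the password for the system admin account",
}
result = route("reset my user password", tools)
```

Returning a hint object rather than silently picking a winner is what breaks the execution loop: the agent sees the ambiguity and can ask or re-plan.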
Oct 2025
Launch: Flik (The Contextual, Ambient AI)
Flik is a new system-wide, multimodal interaction layer that lets you point or draw on any app to trigger AI actions instantly. Starting with native Linear and Jira integrations, Flik lets you highlight a bug or insight and use it as a visual prompt to AI. It bridges the gap between humans and apps, enabling seamless, in-place AI interactions without ever leaving your current window.
Oct 2025
Layer Streaming: Running Full-Precision LLMs on Small Devices
Most edge/consumer devices struggle to run full-precision LLMs without aggressive quantization or massive RAM. I developed a custom inference strategy that streams model layers on demand, loading only what's needed for a single forward pass before releasing it. This "layer streaming" approach enabled a 1.7B parameter model to run with just a 250MB memory footprint—an achievement that won "Best Memory Hack" at MongoDB.local London and demonstrates that high-fidelity AI doesn't require high-end hardware. The trade-off is throughput versus memory, but it unlocks unique use cases like background jobs and ongoing, user-specific learning for preferences and personalization.
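The mechanic is easy to demonstrate with a toy model: each layer lives in its own file and is resident only for the instant it's needed. This NumPy stand-in is a sketch; the real implementation streams the actual checkpoint shards and is where the throughput cost comes from.

```python
import os
import tempfile
import numpy as np

tmp = tempfile.mkdtemp()
rng = np.random.default_rng(0)
dim, n_layers = 256, 8

# Shard the "checkpoint": one file per layer, so peak weight memory
# during a forward pass is a single layer, not the whole model.
for i in range(n_layers):
    w = rng.normal(scale=dim ** -0.5, size=(dim, dim)).astype(np.float32)
    np.save(os.path.join(tmp, f"layer_{i}.npy"), w)

def forward(x):
    for i in range(n_layers):
        w = np.load(os.path.join(tmp, f"layer_{i}.npy"))   # load on demand
        x = np.tanh(x @ w)                                 # use it
        del w                                              # release it
    return x

out = forward(np.ones(dim, dtype=np.float32))
```

Here peak weight residency is one 256KB layer instead of 2MB of model; at 1.7B parameters the same idea keeps one transformer block resident instead of gigabytes, which is exactly the throughput-for-memory trade described above.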
Sep 2025
Exploring Custom Lossless Compression of Weights: Great Ratios, Not Ready for Prime Time
The original goal was on-the-fly decompression of tensors in memory to shrink footprint during inference. Built and tested a custom compressor first on PNGs and semi-structured NASA logs (beats gzip and LZMA down to ~5.8% original size), then applied it to GPT-2-medium tensors. Full-precision weights only shrink ~7%, but int8 quantization unlocks dramatic gains — up to 85% size reduction and ~6.65× overall ratios. The catch: decompression time is far too slow to justify the savings for runtime use. Still, the extreme compressibility of quantized models is the real signal — worth exploring further for offline/on-disk storage where slow decompression is acceptable.
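The fp32-vs-int8 asymmetry is easy to reproduce with a stock codec. This uses zlib, not the custom compressor, and synthetic heavy-tailed "weights" (real weight matrices have outliers, which is what makes max-scaled int8 values cluster near zero), so the exact ratios are illustrative:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Heavy-tailed draws mimic weight matrices with outliers.
w = (rng.standard_t(df=2, size=50_000) * 0.02).astype(np.float32)

# fp32 barely compresses: mantissa bits look like noise to any codec.
fp32 = w.tobytes()
fp32_ratio = len(zlib.compress(fp32, 9)) / len(fp32)

# Max-scaled int8 concentrates most symbols near zero, so entropy
# coding bites hard on top of the 4x dtype shrink.
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).astype(np.int8)
q_bytes = q.tobytes()
q_ratio = len(zlib.compress(q_bytes, 9)) / len(q_bytes)

# "Lossless" means the stored representation round-trips exactly.
assert zlib.decompress(zlib.compress(q_bytes, 9)) == q_bytes
print(f"fp32 ratio {fp32_ratio:.2f}, int8 ratio {q_ratio:.2f}")
```

The same effect, applied with a better codec to real GPT-2 tensors, is where the ~85% reductions above come from.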
Sep 2025
Multimodal Interactions: Need For Better I/O Devices
Most computers still rely on the keyboard as the primary input, limiting how we use voice and visual modalities. To unlock richer multimodal workflows, we need new I/O primitives: pens, programmable remotes, and hybrid devices that let you seamlessly talk and act. I repurposed a pair of old earbuds — originally designed to launch Siri or Music — by intercepting Bluetooth and low-level events, turning them into a programmable remote for my Mac. Now, a double tap can take a screenshot, paste it in VS Code, and trigger voice input. Small, custom I/O upgrades like this can transform everyday computer interaction and enable entirely new workflows.
Aug 2025
Rich Interleaved Transcripts: Signal Processing + ML Ensemble for Voice/Background Audio
Idea: use Digital Signal Processing (DSP) to separate voice from background audio in realtime, process vocals with Apple's SpeechAnalyzer (or similar STT), run background stream through an ML ensemble for sound/music classification, then merge with timestamp alignment for richly annotated transcripts. Results on movie clips (Top Gun radio chatter + jet sounds, Iron Man scenes) and casual standup audio look promising — adds context or background music cues without losing dialogue accuracy. Proves hybrid classical/ML pipelines still have legs even in 2025 — and we can do a lot more with transcripts generated today.
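The separation stage's interface can be sketched with a crude FFT band-split: energy in the telephone-speech band goes to the "voice" stream, the rest to "background". The real pipeline uses proper DSP (filters, masks) rather than a hard spectral cut, and the test signals here are synthetic tones, but the shape is the same: one input, two time-aligned streams.

```python
import numpy as np

def split_voice_background(signal, sr, lo=300.0, hi=3400.0):
    """Hard spectral split: keep 300-3400 Hz (roughly speech) in one
    stream and everything else in the other. The two streams sum back
    to the original, so nothing is lost before classification."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sr)
    voice_band = (freqs >= lo) & (freqs <= hi)
    voice = np.fft.irfft(np.where(voice_band, spec, 0), n=len(signal))
    background = np.fft.irfft(np.where(voice_band, 0, spec), n=len(signal))
    return voice, background

sr = 16_000
t = np.arange(sr) / sr
speech_like = np.sin(2 * np.pi * 1000 * t)   # tone inside the voice band
rumble = np.sin(2 * np.pi * 60 * t)          # low-frequency background
voice, background = split_voice_background(speech_like + rumble, sr)
```

Because both streams share the original timeline, timestamp alignment for the merged transcript is free.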
Aug 2025
Personalization with LoRAs: Promising, but Edge Constraints Loom
LoRA experiments on my chat conversations capture ~90% of my writing style (casual, emojis included), beating homogenized generic LLM outputs. This worked well on 8B+ param models, but models smaller than 4B params look iffy for real personalization depth, and average consumers can't run big models on-device while multitasking. Eyeing what Gemini Nano / Apple Intelligence do with edge AI personalization. Bottom line: for on-device viability, focus on custom inference tricks to shrink the memory footprint (LoRA training does not have to be realtime).
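The memory argument falls out of the LoRA math itself. A minimal sketch (dimensions, rank, and alpha are illustrative): the frozen base weight W is augmented by a low-rank update B @ A scaled by alpha / r, and only A and B are trained, which is why a per-user adapter is tiny compared to the model.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x @ (W + (alpha/r) * B @ A)^T, with W frozen and only the
    rank-r factors A, B trainable."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8
W = rng.normal(size=(d_out, d_in)) * 0.02    # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01        # trainable down-projection
B = np.zeros((d_out, r))                     # standard init: update starts at 0
x = rng.normal(size=(1, d_in))
y = lora_forward(x, W, A, B)
```

With rank 8 the adapter is ~1.6% of the layer's parameters, which is the property that makes shipping and hot-swapping per-user style adapters plausible on-device.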
Aug 2025
The Missing HCI Primitive: Why AI Needs System-Wide Drawing
We've given computers "eyes" through multimodal AI, but we still have no universal way to show them where to look. This post explores why freehand drawing should be a first-class HCI primitive — as fundamental as typing or clicking — to unlock true spatial grounding. By enabling drawing and unlocking input parallelism — the ability to sketch and speak simultaneously — we move past the "screenshot-and-prompt" ritual toward a higher-signal, Jarvis-like experience that matches the way we actually think and communicate.
Dec 2024
UX Exploration: Highlights for Email
A Chrome extension that identifies the most important snippets in an email and visually highlights them for quick triage, instead of the commonly used AI summary as the default AI interaction. Built for a hackathon targeting apps built with Gemini Nano (Google's local model in Chrome).
Jun 2024
Launch: Spacefold ('Computer Use' and Visual Prompting)
Spacefold is a functional prototype of true multimodal AI interaction for macOS. It explores autonomous control of apps (update: also known as Computer Use since Anthropic's launch in Oct 2024) and visual prompting. In beta, it is an ongoing exploration of what works and what doesn't with real users. After launch, I learned that even though the technology works, simulating mouse clicks or typing in front of users felt jarring and out of their control. What we really need is a new protocol (and new APIs) that lets people talk to an app directly, with the app itself handling the orchestration of actions; for user-facing experiences, simulated mouse movements and keystrokes are a transient phase. Testing also revealed all the failure modes and the realization that screenshots alone are not enough to provide true context for AI interaction. What works? Visual prompting. Next steps: deeper OS-level integration to prevent the context loss that comes from relying on screenshots alone.
Apr 2024
Model Capability Hack: Animations from Text Models
Vignette is a text-to-image and animation generator built on top of the Claude API, created for a hackathon by Anthropic. It was one of the first demonstrations of using text-to-text models to generate SVG animations, before that became common.
Apr 2024
Launch: Snapsked
Snapsked turns screenshots and photos into structured action items that can be sent directly to a calendar, reminders, etc. Update: One of the key lessons from building Snapsked was that screenshots alone were not enough for robust model capabilities at the time. For example, a screenshot of a chat can't capture the reference date needed to calculate when "in 3 days" actually is. The ease of sending screenshots to a bot is great, but if reliability struggles, users quickly churn.