Explorations & Analyses

A selected log of projects, experiments, and writings. Hopefully it gives an idea of how they fit together into a cohesive whole.

Self-Correcting Smol Models Through Token Engineering

Proof of concept showing how monitoring token probabilities during inference lets small models detect and correct bad outputs in real time. The system intercepts likely errors, triggers verification loops against a code runtime, and returns control—no multi-turn prompting needed. Surprisingly, it outperforms larger RL-trained models, which second-guess themselves despite their built-in verification.
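A minimal sketch of the core loop, using the Hugging Face transformers generate API to expose per-token scores. The model name, the 0.35 confidence threshold, and the `verify` callback are illustrative stand-ins, not the actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"  # illustrative small model choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def generate_with_confidence(prompt, max_new_tokens=128, threshold=0.35):
    """Generate greedily and flag positions where the chosen token's
    probability drops below `threshold` (likely error sites)."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )
    new_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    flagged = []
    for step, (token_id, scores) in enumerate(zip(new_tokens, out.scores)):
        prob = torch.softmax(scores[0], dim=-1)[token_id].item()
        if prob < threshold:
            flagged.append((step, tok.decode(token_id), prob))
    return tok.decode(new_tokens, skip_special_tokens=True), flagged

def self_correct(prompt, verify):
    """If low-confidence tokens were flagged, check the draft against a code
    runtime (the `verify` callback) and regenerate once with the feedback."""
    draft, flagged = generate_with_confidence(prompt)
    if flagged and not verify(draft):
        retry = f"{prompt}\n# Previous attempt failed verification:\n{draft}\n# Corrected version:\n"
        draft, _ = generate_with_confidence(retry)
    return draft
```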

read more

AI Discourse vs. Reality

80% of *active* consumer AI users send fewer than 4 messages/day on average. That's search-like behavior (occasional queries), not assistant-like behavior (continuous engagement).

read more

Rethinking Function Calling with Small Language Models

The bottleneck in reliable function calling with small models isn't model size — it's probability collapse during unconstrained generation. By observing token-level confidence and decomposing function calling into staged classification and extraction steps with constrained decoding, even 350-700MB models achieve reliable results.
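A rough sketch of the staged approach, again on top of transformers: stage one classifies the function by scoring each known tool name rather than generating freely, and stage two extracts an argument and validates it against a pattern as a cheap stand-in for full constrained decoding. The model name, prompts, and regex are assumptions for illustration.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"  # illustrative small-model choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def score_choice(prompt, choice):
    """Sum the log-probs the model assigns to `choice` when forced after
    `prompt` -- classification by scoring, not unconstrained generation."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    return sum(
        logprobs[pos, full_ids[0, pos + 1]].item()
        for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

def pick_tool(user_msg, tool_names):
    """Stage 1: pick the function by scoring the closed set of names."""
    prompt = f"User request: {user_msg}\nBest tool:"
    return max(tool_names, key=lambda name: score_choice(prompt, " " + name))

def extract_arg(user_msg, tool, arg, pattern):
    """Stage 2: extract one argument, accepting it only if it matches the
    expected pattern."""
    prompt = f"User request: {user_msg}\nTool: {tool}\nValue for '{arg}':"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=16, do_sample=False)
    text = tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
    match = re.search(pattern, text)
    return match.group(0) if match else None
```

Because stage one can only ever return one of the known names, the small model has nowhere to drift, which is the point of the decomposition.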

read more

The AI Recalibration: Cloud-first to Edge-first

Is the cloud-first AI era already peaking? While the industry fixates on massive infrastructure, model efficiency has surged 50-100x in just five years. We are reaching a tipping point where 3B-parameter local models outperform the 175B-parameter giants of 2020. This analysis explores parameter and intelligence efficiency, and why the next 2-3 years could see personal AI workloads for common tasks move off the server and onto your device.

read more

Instant TTS on the edge

A custom, native C++ inference runtime for Kokoro TTS makes high-quality audio generation possible entirely on-device, on the CPU. This optimized implementation delivers a Real-Time Factor (RTF) of 0.2—generating speech roughly five times faster than real-time—and provides up to a 10x reduction in cold-start times and a 3x reduction in memory usage compared to standard inference frameworks.
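RTF is simply generation time divided by the duration of the audio produced, so a small harness like the sketch below is enough to reproduce the measurement. The `synthesize` callback is a stand-in for whatever TTS backend is under test.

```python
import time
import wave

def real_time_factor(synthesize, text, wav_path="out.wav"):
    """RTF = time spent generating / duration of the audio produced.
    RTF 0.2 means five seconds of speech are generated in one second."""
    start = time.perf_counter()
    synthesize(text, wav_path)          # stand-in for the TTS backend call
    elapsed = time.perf_counter() - start

    with wave.open(wav_path, "rb") as wav:
        audio_seconds = wav.getnframes() / wav.getframerate()

    return elapsed / audio_seconds

# Example: rtf = real_time_factor(my_tts_engine, "Hello from the edge.")
# rtf < 1.0 means faster than real time; 0.2 is roughly 5x real time.
```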

read more

Sub-150ms Cold Starts and Instant Semantic Retrieval

Most local embedding setups are weighed down by gigabytes of RAM and massive cold-start latencies (and so people assume they need the cloud). By building a standalone C runtime on top of ONNX Runtime (C + ORT), I was able to achieve a 130ms cold start and 10ms inference times. This implementation provides an 80x faster startup and a 10x lower memory footprint (~170MB) than standard PyTorch/Hugging Face configurations—enabling instant, private semantic search entirely on the CPU.
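For context on what those numbers are up against, this is roughly the standard Python path being compared: onnxruntime plus a Hugging Face tokenizer, with mean pooling on top. The model path, input names, and pooling are assumptions about a MiniLM-style export; the C runtime itself does not go through Python at all.

```python
import time
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

MODEL_ONNX = "model.onnx"  # assumed: an exported MiniLM-style encoder
TOKENIZER = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative

t0 = time.perf_counter()
session = ort.InferenceSession(MODEL_ONNX, providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER)
print(f"cold start: {(time.perf_counter() - t0) * 1000:.0f} ms")

def embed(texts):
    """Tokenize, run the ONNX session, then mean-pool over the token axis."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    hidden = session.run(None, {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0]                                       # (batch, seq, dim)
    mask = enc["attention_mask"][..., None]
    pooled = (hidden * mask).sum(axis=1) / mask.sum(axis=1)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

t0 = time.perf_counter()
vec = embed(["instant, private semantic search on the CPU"])
print(f"inference: {(time.perf_counter() - t0) * 1000:.0f} ms, dim {vec.shape[1]}")
```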

read more

Make Lynx Popular Again: Semantic 'AI Browser' for High-Recall Search

Built on top of the ultra-lightweight Lynx engine, this "AI browser" prototype replaces SEO-driven rankings with semantic relevance. By stripping web bloat and focusing on information density, it delivers high-recall search results that are optimized for knowledge retrieval rather than human sorting. When packaged as a Model Context Protocol (MCP) server, it provides a faster, cleaner alternative to traditional web search or fetch APIs for AI agents and CLI-based workflows. The challenge, however, is that Lynx has no JS runtime, so JavaScript-heavy sites still need some work.
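The retrieval side is simple at its core; a sketch of the idea, with `lynx -dump` doing the rendering and an off-the-shelf sentence-transformers model standing in for the actual embedding backend.

```python
import subprocess
import numpy as np
from sentence_transformers import SentenceTransformer  # illustrative embedder

model = SentenceTransformer("all-MiniLM-L6-v2")

def fetch_text(url):
    """Render a page to plain text with Lynx: no JS, no ads, just content."""
    return subprocess.run(
        ["lynx", "-dump", "-nolist", url],
        capture_output=True, text=True, timeout=30,
    ).stdout

def rank_pages(query, urls, top_k=5):
    """Score each dumped page by semantic similarity to the query instead of
    relying on the search engine's own ordering."""
    docs = [fetch_text(u) for u in urls]
    q = model.encode([query], normalize_embeddings=True)
    d = model.encode(docs, normalize_embeddings=True)
    scores = (d @ q.T).ravel()
    order = np.argsort(-scores)[:top_k]
    return [(urls[i], float(scores[i])) for i in order]
```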

read more + why lynx

MCP Tool Disambiguation: Fixing Agentic Tool-Call Misfires

LLM agents often fail when faced with overlapping tool descriptions or similar function names across MCP servers. I built a middleware layer that intercepts these calls, using a custom hybrid similarity score—rather than standard cosine similarity—to rank tool intent. By identifying conflicts and returning structured hints for self-correction, this system eliminates execution loops and significantly improves reliability in complex, multi-tool agentic workflows.
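A simplified sketch of the scoring middleware. The 0.7/0.3 blend, the 0.05 ambiguity margin, and the sentence-transformers embedder are placeholders, not the actual hybrid score described in the post.

```python
from sentence_transformers import SentenceTransformer  # illustrative embedder

model = SentenceTransformer("all-MiniLM-L6-v2")

def lexical_overlap(a, b):
    """Jaccard overlap on snake_case / kebab-case name parts."""
    ta = set(a.lower().replace("-", "_").split("_"))
    tb = set(b.lower().replace("-", "_").split("_"))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def hybrid_score(intent, tool, alpha=0.7):
    """Blend embedding similarity of the description with name overlap.
    The weighting here is illustrative, not the production value."""
    vecs = model.encode([intent, tool["description"]], normalize_embeddings=True)
    semantic = float(vecs[0] @ vecs[1])
    return alpha * semantic + (1 - alpha) * lexical_overlap(intent, tool["name"])

def route(intent, tools, margin=0.05):
    """Return the winning tool, or a structured hint when the top two
    candidates are too close to call (so the agent can self-correct)."""
    scored = sorted(((hybrid_score(intent, t), t) for t in tools),
                    key=lambda pair: pair[0], reverse=True)
    (best_s, best), (second_s, second) = scored[0], scored[1]
    if best_s - second_s < margin:
        return {"ambiguous": True,
                "candidates": [best["name"], second["name"]],
                "hint": "State which capability you need, then call again."}
    return {"ambiguous": False, "tool": best["name"]}
```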

read more + some more

Introducing Flik: The Contextual, Ambient AI (Beyond Screenshots and Prompts)

Flik is a new multimodal interface that lets you point at or draw on any app to trigger AI actions instantly. Starting with native Linear and Jira integrations, Flik allows you to highlight a bug or insight and capture it directly into your workspace. It bridges the gap between humans and apps, enabling seamless, in-place AI interactions without ever leaving your current window.

get access

Layer Streaming: Running Full-Precision LLMs on Small Devices

Most edge devices struggle to run full-precision LLMs without aggressive quantization or massive RAM. I developed a custom inference strategy that streams model layers on demand, loading only what is needed for a single forward pass before releasing it. This "layer streaming" approach allowed a 1.7B parameter model to run with just a 250MB memory footprint—an achievement that won "Best Memory Hack" at MongoDB.local London and proves that high-fidelity AI doesn't require high-end hardware.
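The core loop is small; a sketch under the assumption that the checkpoint has been split into per-layer state_dict shards on disk and that `build_layer` can construct an empty layer module. Both are simplifications of the actual runtime.

```python
import gc
import torch

def stream_forward(hidden, layer_files, build_layer):
    """Run one forward pass while holding only a single transformer layer in
    memory: load its weights, apply it, free it, move on. `layer_files` are
    per-layer state_dict shards on disk; `build_layer` returns an empty layer
    module matching those shards (both assumed here for illustration)."""
    for path in layer_files:
        layer = build_layer()                        # empty module, no weights yet
        layer.load_state_dict(torch.load(path, map_location="cpu"))
        with torch.no_grad():
            hidden = layer(hidden)
        del layer                                    # release before the next shard
        gc.collect()
    return hidden
```

Peak memory is then bounded by the largest single layer plus the activations, rather than by the whole model.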

read more

The Missing HCI Primitive: Why AI Needs System-Wide Drawing

We’ve given computers "eyes" through multimodal AI, but we still have no universal way to show them where to look. This post explores why freehand drawing should be a first-class HCI primitive—as fundamental as typing or clicking—to unlock true spatial grounding. By enabling drawing and unlocking input parallelism—the ability to sketch and speak simultaneously—we move past the "screenshot-and-prompt" ritual toward a higher-signal, Jarvis-like experience that matches the way we actually think and communicate.

read more