Explorations & Analyses
Select log of projects, experiments, and writings, meant to give an idea of how they fit together into a cohesive whole.
Dec 2025
Self-Correcting Smol Models Through Token Engineering
Proof of concept showing how monitoring token probabilities during inference lets small models detect and correct bad outputs in real time. The system intercepts likely errors, triggers verification loops against a code runtime, and returns control to generation, with no multi-turn prompting needed. Surprisingly, it outperforms larger RL-trained models, which second-guess themselves despite built-in verification.
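The core mechanism is simple enough to sketch: a greedy decode loop that watches the probability of each chosen token and flags low-confidence spans for verification. This is a minimal, illustrative version, not the actual system; the model name, threshold, and verification hook are assumptions.

```python
# Minimal sketch of confidence-gated decoding. Model name, threshold,
# and the flag-handling policy are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM2-135M-Instruct"  # assumed: any small causal LM
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def generate_with_guard(prompt: str, threshold: float = 0.4, max_new: int = 64):
    ids = tok(prompt, return_tensors="pt").input_ids
    flagged = []  # (position, token, confidence) spans to hand to a verifier
    for step in range(max_new):
        with torch.no_grad():
            logits = model(input_ids=ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        conf, next_id = probs.max(dim=-1)
        if conf.item() < threshold:
            # Low confidence: mark this span so a verifier (e.g. executing
            # the generated code) can check it before generation proceeds.
            flagged.append((step, tok.decode([next_id.item()]), conf.item()))
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True), flagged
```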
Dec 2025
AI Discourse vs. Reality
80% of *active* consumer AI users send fewer than 4 messages per day on average. That’s search-like behavior (occasional queries), not assistant-like behavior (continuous engagement).
Dec 2025
Rethinking Function Calling with Small Language Models
The bottleneck in reliable function calling with small models isn't model size — it's probability collapse during unconstrained generation. By observing token-level confidence and decomposing function calling into staged classification and extraction steps with constrained decoding, even 350-700MB models achieve reliable results.
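A rough sketch of the staged idea: rather than letting the model free-generate a call, each stage scores a closed set of options and renormalizes over only those, so probability mass cannot drift onto unrelated tokens. The model, tool names, and the last-n-token scoring shortcut below are illustrative assumptions.

```python
# Sketch of staged, constrained function calling. Model and tool set are
# assumptions; scoring is approximate (summed logprobs favor short options).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM2-135M-Instruct"  # assumed small model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def choose(prompt: str, options: list[str]) -> tuple[str, float]:
    """Score each allowed continuation, renormalize over the option set."""
    scores = []
    for opt in options:
        ids = tok(prompt + opt, return_tensors="pt").input_ids
        n_opt = len(tok(opt, add_special_tokens=False).input_ids)
        with torch.no_grad():
            logp = torch.log_softmax(model(input_ids=ids).logits[0, :-1], dim=-1)
        tgt = ids[0, 1:]
        per_tok = logp[torch.arange(len(tgt)), tgt]
        scores.append(per_tok[-n_opt:].sum())  # approx. log P(option | prompt)
    probs = torch.softmax(torch.stack(scores), dim=-1)
    best = int(probs.argmax())
    return options[best], float(probs[best])

# Stage 1: classify the function. Stage 2 extracts each argument the same way.
fn, conf = choose("User: weather in Paris?\nCall: ",
                  ["get_weather", "web_search", "no_tool"])
```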
Dec 2025
The AI Recalibration: Cloud-first to Edge-first
Is the cloud-first AI era already peaking? While the industry fixates on massive infrastructure, model efficiency has surged 50-100x in just five years. We are reaching a tipping point where 3B-parameter local models outperform the 175B-parameter giants of 2020. This analysis explores those gains in parameter and intelligence efficiency, and why the next 2-3 years could see common personal AI workloads move off the server and onto your device.
Nov 2025
Instant TTS on the edge
A custom, native C++ inference runtime for Kokoro TTS makes high-quality audio generation possible entirely on-device, on the CPU. The optimized implementation delivers a Real-Time Factor (RTF) of 0.2, generating speech roughly five times faster than real time, with up to a 10x reduction in cold-start time and a 3x reduction in memory usage compared to standard inference frameworks.
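For reference, RTF is just synthesis wall time divided by the duration of the audio produced. A quick sketch of the arithmetic, where `synthesize()` stands in for the runtime's entry point and 24 kHz is assumed as Kokoro's output rate:

```python
# RTF = wall time / audio duration. synthesize() is a placeholder for the
# C++ runtime's entry point; the 24 kHz sample rate is an assumption.
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 24_000) -> float:
    t0 = time.perf_counter()
    samples = synthesize(text)          # placeholder: returns raw PCM samples
    elapsed = time.perf_counter() - t0
    audio_seconds = len(samples) / sample_rate
    return elapsed / audio_seconds      # RTF 0.2 => 5x faster than real time
```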
Nov 2025
Sub-150ms Cold Starts and Instant Semantic Retrieval
Most local embedding setups are weighed down by gigabytes of RAM and long cold-start latencies, which is why people assume they need the cloud. By building a standalone C runtime on ONNX Runtime (C + ORT), I was able to achieve a 130ms cold start and 10ms inference. That's an 80x faster startup and a 10x lower memory footprint (~170MB) than standard PyTorch/Hugging Face configurations, enabling instant, private semantic search entirely on the CPU.
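The pipeline itself is small. Here it is sketched in Python with onnxruntime to show the shape of the approach (the real project is a standalone C runtime; the model path, input names, and mean pooling are assumptions typical of MiniLM-style encoders):

```python
# Python sketch of the C + ORT pipeline. Model path, input names, and
# pooling are assumptions; some encoders also expect "token_type_ids".
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
tok = Tokenizer.from_file("tokenizer.json")

def embed(text: str) -> np.ndarray:
    enc = tok.encode(text)
    ids = np.array([enc.ids], dtype=np.int64)
    mask = np.array([enc.attention_mask], dtype=np.int64)
    hidden = sess.run(None, {"input_ids": ids, "attention_mask": mask})[0]
    # Mean-pool token embeddings, then L2-normalize for cosine similarity.
    vec = (hidden[0] * mask[0][:, None]).sum(0) / mask[0].sum()
    return vec / np.linalg.norm(vec)
```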
Nov 2025
Make Lynx Popular Again: Semantic 'AI Browser' for High-Recall Search
Built on top of the ultra-lightweight Lynx engine, this "AI browser" prototype replaces SEO-driven rankings with semantic relevance. By stripping web bloat and focusing on information density, it delivers high-recall search results that are optimized for knowledge retrieval rather than human sorting. When packaged as a Model Context Protocol (MCP) server, it provides a faster, cleaner alternative to traditional web search or fetch APIs for AI agents and CLI-based workflows. The challenge, however, is that Lynx has no JS runtime, so JavaScript-dependent sites still need work.
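The core loop, sketched: Lynx renders a page to dense plain text, then an embedder ranks chunks against the query. `embed` is a stand-in for any local embedding model (e.g. the runtime above); `lynx -dump` is the real flag doing the heavy lifting.

```python
# Sketch of the fetch-and-rank loop. embed() is an assumed local embedding
# function returning L2-normalized numpy vectors.
import subprocess

def fetch_text(url: str) -> str:
    # -dump renders the page to plain text; -nolist drops the link footer
    out = subprocess.run(["lynx", "-dump", "-nolist", url],
                         capture_output=True, text=True, timeout=30)
    return out.stdout

def rank_chunks(query: str, url: str, embed, top_k: int = 5):
    chunks = [c.strip() for c in fetch_text(url).split("\n\n") if c.strip()]
    q = embed(query)
    return sorted(chunks, key=lambda c: -(q @ embed(c)))[:top_k]
```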
Oct 2025
MCP Tool Disambiguation: Fixing Agentic Tool-Call Misfires
LLM agents often fail when faced with overlapping tool descriptions or similar function names across MCP servers. I built a middleware layer that intercepts these calls, using a custom hybrid similarity score—rather than standard cosine similarity—to rank tool intent. By identifying conflicts and returning structured hints for self-correction, this system eliminates execution loops and significantly improves reliability in complex, multi-tool agentic workflows.
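Sketched below: a hybrid score mixing embedding cosine with a lexical term over tool names, plus an ambiguity check that returns a structured hint instead of guessing. The weights and features here are illustrative assumptions, not the production scorer.

```python
# Sketch of hybrid tool scoring and the ambiguity hint. alpha, the Jaccard
# term, and the margin are assumptions for illustration.
import numpy as np

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    return len(ta & tb) / len(ta | tb)

def hybrid_score(query_vec, tool_vec, query_name: str, tool_name: str,
                 alpha: float = 0.7) -> float:
    cos = float(query_vec @ tool_vec)  # vectors assumed L2-normalized
    return alpha * cos + (1 - alpha) * jaccard(query_name, tool_name)

def disambiguate(scores: dict[str, float], margin: float = 0.05):
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        # Ambiguous: return a structured hint for self-correction.
        return {"status": "ambiguous", "candidates": ranked[:2]}
    return {"status": "ok", "tool": ranked[0][0]}
```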
Oct 2025
Introducing Flik: The Contextual, Ambient AI (Beyond Screenshots and Prompts)
Flik is a new multimodal interaction model that lets you point at or draw on any app to trigger AI actions instantly. Starting with native Linear and Jira integrations, Flik allows you to highlight a bug or insight and capture it directly into your workspace. It bridges the gap between humans and apps, enabling seamless, in-place AI interactions without ever leaving your current window.
Oct 2025
Layer Streaming: Running Full-Precision LLMs on Small Devices
Most edge devices struggle to run full-precision LLMs without aggressive quantization or massive RAM. I developed a custom inference strategy that streams model layers on demand, loading only what is needed for a single forward pass before releasing it. This "layer streaming" approach allowed a 1.7B parameter model to run with just a 250MB memory footprint—an achievement that won "Best Memory Hack" at MongoDB.local London and proves that high-fidelity AI doesn't require high-end hardware.
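The gist, sketched with safetensors' lazy loading: only one transformer layer's weights are resident at a time. The weight-file layout and the `layer_forward` callable are assumptions; the real runtime manages this in native code.

```python
# Sketch of layer streaming. File layout ("model.layers.{i}.") and
# layer_forward are assumptions; nothing is loaded until get_tensor().
import torch
from safetensors import safe_open

def stream_forward(hidden: torch.Tensor, n_layers: int, path: str,
                   layer_forward) -> torch.Tensor:
    with safe_open(path, framework="pt") as f:   # lazy: no weights in RAM yet
        for i in range(n_layers):
            prefix = f"model.layers.{i}."
            weights = {k[len(prefix):]: f.get_tensor(k)  # load this layer only
                       for k in f.keys() if k.startswith(prefix)}
            hidden = layer_forward(hidden, weights)
            del weights                          # release before the next layer
    return hidden
```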
Aug 2025
The Missing HCI Primitive: Why AI Needs System-Wide Drawing
We’ve given computers "eyes" through multimodal AI, but we still have no universal way to show them where to look. This post explores why freehand drawing should be a first-class HCI primitive—as fundamental as typing or clicking—to unlock true spatial grounding. By enabling drawing and unlocking input parallelism—the ability to sketch and speak simultaneously—we move past the "screenshot-and-prompt" ritual toward a higher-signal, Jarvis-like experience that matches the way we actually think and communicate.