Explorations & Thoughts
Log of select projects, experiments, and thoughts. I began documenting these very late in the process, and there's a backlog from pre-2024 that I hope to catch up on sometime; meanwhile, feel free to check out my GitHub repos, gists, or X feed for more. -Rishi
Dec 2025
Self-Correcting Smol Models via Token Engineering
Proof of concept showing how monitoring token probabilities during inference enables small models to detect and correct bad outputs in real-time. The system intercepts likely errors, triggers verification loops with a code runtime, and returns control — no multi-turn prompting needed. Surprisingly, it outperforms larger RL-trained models that second-guess themselves despite built-in verification. Small models have improved dramatically over three years, but tool-calling reliability remains the critical bottleneck — solving it unlocks edge AI at scale.
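The core loop can be sketched in a few lines. This is a hypothetical illustration, not the actual system: the model is stubbed as a function returning (token, log-probability) pairs, and the confidence floor of 0.35 is an assumed, tunable value.

```python
import math

# Assumed threshold below which a token is treated as a likely error.
CONFIDENCE_FLOOR = math.log(0.35)

def generate_with_self_check(step_fn, verify_fn, max_tokens=64):
    """step_fn() -> (token, logprob); verify_fn(tokens) -> corrected tokens."""
    tokens = []
    for _ in range(max_tokens):
        token, logprob = step_fn()
        tokens.append(token)
        if logprob < CONFIDENCE_FLOOR:
            # Low-confidence token: hand the draft to a verifier
            # (e.g. a code runtime) and resume from its corrected output.
            tokens = verify_fn(tokens)
        if token == "<eos>":
            break
    return tokens

# Toy demo: the "model" emits one shaky token that the verifier fixes.
stream = iter([("print(", -0.1), ("resutl", -2.0), (")", -0.2), ("<eos>", -0.1)])
fixed = generate_with_self_check(
    lambda: next(stream),
    lambda toks: toks[:-1] + ["result"],
)
```

The key design point is that correction happens mid-generation, inside the decoding loop, rather than via a second round-trip prompt.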
Dec 2025
AI Discourse vs. Reality (Analysis)
Today, 80% of active consumer AI users send fewer than 4 messages per day — behavior that resembles search (occasional queries) rather than true assistant use (continuous engagement). A handful of power users dominate the conversation, skewing perceptions. The future is here, just not evenly distributed yet.
Dec 2025
Rethinking Function Calling with Small Language Models
The bottleneck in reliable function calling with small models isn't model size — it's probability collapse during unconstrained generation. By observing token-level confidence and decomposing function calling into staged classification and extraction steps with constrained decoding, even 350-700MB models achieve reliable results.
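A minimal sketch of the decomposition idea, with made-up tool names and crude stand-ins: a constrained classification stage that can only emit a known tool name, followed by per-argument extraction (regexes here play the role of grammar-constrained decoding).

```python
import re

# Hypothetical tool registry; each argument has an extraction pattern.
TOOLS = {
    "get_weather": {"args": {"city": r"in ([A-Z][a-z]+)"}},
    "set_alarm":   {"args": {"time": r"at (\d{1,2}:\d{2})"}},
}

def classify_tool(query, score_fn):
    # Constrained decoding stand-in: argmax over the fixed tool vocabulary,
    # so probability mass cannot leak to malformed names.
    return max(TOOLS, key=lambda name: score_fn(query, name))

def extract_args(query, tool):
    # One extraction step per argument instead of free-form JSON generation.
    out = {}
    for arg, pattern in TOOLS[tool]["args"].items():
        m = re.search(pattern, query)
        if m:
            out[arg] = m.group(1)
    return out

def toy_score(query, name):
    # Crude lexical overlap as a stand-in for model logits.
    return sum(w in query.lower() for w in name.split("_"))

query = "What's the weather in Paris?"
tool = classify_tool(query, toy_score)
args = extract_args(query, tool)
```

Because each stage only ever produces values from a closed set, probability collapse during long unconstrained generation never gets a chance to occur.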
Dec 2025
The AI Recalibration: Cloud-first to Edge-first (Analysis)
Is the cloud-first AI era already peaking? While the industry fixates on massive infrastructure, model efficiency has surged 50-100x in just five years. We are reaching a tipping point where 3B-parameter local models outperform the 175B-parameter giants of 2020. This analysis explores parameter and intelligence efficiency, and why the next 2-3 years could see personal AI workloads for common tasks move off the server and onto your device.
Nov 2025
Instant TTS on the Edge
Building a custom, native C++ inference runtime for Kokoro TTS makes high-quality audio generation possible entirely on-device, on the CPU. This optimized implementation delivers a Real-Time Factor (RTF) of 0.2 — generating speech roughly five times faster than real-time — and provides up to a 10x reduction in cold-start times and a 3x reduction in memory usage compared to standard inference frameworks.
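For readers unfamiliar with the metric, RTF is just synthesis time divided by the duration of the audio produced; the numbers below are illustrative, not measurements from this runtime.

```python
# Real-Time Factor: wall-clock synthesis time / duration of generated audio.
# RTF < 1 means faster than real time; RTF 0.2 means 5x real-time speed.
def real_time_factor(synthesis_seconds, audio_seconds):
    return synthesis_seconds / audio_seconds

rtf = real_time_factor(2.0, 10.0)  # 10s of speech generated in 2s
speedup = 1 / rtf
```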
Nov 2025
Sub-150ms Cold Starts and Instant Semantic Retrieval
Most local embedding setups are weighed down by gigabytes of RAM and massive cold-start latencies (which is why many assume they need cloud setups). By building a standalone C runtime with ONNX Runtime (C + ORT), I was able to achieve a 130ms cold start and 10ms inference latency. This implementation provides an 80x faster startup and 10x lower memory footprint (~170MB) than standard PyTorch/Hugging Face configurations — enabling instant, private semantic search entirely on the CPU.
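The retrieval logic itself is tiny once embeddings exist. A minimal sketch, with hand-rolled bag-of-words vectors standing in for the real embedding model (the actual build uses C + ONNX Runtime) so the ranking step is runnable:

```python
import math

def embed(text, vocab):
    # Stand-in embedding: word counts over a fixed vocabulary.
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query, docs, vocab):
    # Rank documents by cosine similarity to the query vector.
    q = embed(query, vocab)
    return max(docs, key=lambda d: cosine(q, embed(d, vocab)))

vocab = ["semantic", "search", "cpu", "gpu", "latency", "startup"]
docs = ["fast cpu startup latency", "gpu training throughput"]
best = search("cpu startup", docs, vocab)
```

With a fast runtime underneath, this whole path (embed query, score corpus, return best match) fits comfortably inside the latencies quoted above.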
Nov 2025
Make Lynx Popular Again: Semantic 'AI Browser' for High-Recall Search
Built on top of the ultra-lightweight Lynx engine, this "AI browser" prototype replaces SEO-driven rankings with semantic relevance. By stripping web bloat and focusing on information density, it delivers high-recall search results that are optimized for knowledge retrieval rather than human sorting. When packaged as a Model Context Protocol (MCP) server, it provides a faster, cleaner alternative to traditional web search or fetch APIs for AI agents and CLI-based workflows. The challenge, however, is that Lynx lacks a JavaScript runtime, so JS-heavy sites still need more work.
Oct 2025
MCP Tool Disambiguation: Tool Descriptor Tuning via Simulations
ML techniques first spot confusing tool pairs (e.g. reset_user_password vs reset_system_password at high similarity). Then: use an LLM as the simulation engine to generate realistic but ambiguous user queries against those pairs → measure mis-selection risk → iteratively rewrite descriptions until the problematic similarity collapses. A lightweight, repeatable way to make tool-calling more reliable without over-engineering the model.
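The first step (spotting confusing pairs) can be sketched simply. Token-set Jaccard similarity stands in here for embedding cosine similarity, and the 0.5 threshold is an illustrative assumption:

```python
from itertools import combinations

def jaccard(a, b):
    # Overlap of description token sets, in [0, 1].
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def confusable_pairs(tools, threshold=0.5):
    # Flag every pair of tools whose descriptions are suspiciously similar.
    return [
        (n1, n2)
        for (n1, d1), (n2, d2) in combinations(tools.items(), 2)
        if jaccard(d1, d2) >= threshold
    ]

tools = {
    "reset_user_password":   "reset the password for a user account",
    "reset_system_password": "reset the password for the system account",
    "list_invoices":         "list all invoices for a customer",
}
risky = confusable_pairs(tools)
```

The pairs this surfaces become the targets for the simulation loop: generate ambiguous queries against them, measure mis-selection, and rewrite descriptions until the similarity collapses.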
Oct 2025
MCP Tool Disambiguation: Fixing Agentic Tool-Call Misfires
LLM agents often fail when faced with overlapping tool descriptions or similar function names across MCP servers. I built a middleware layer that intercepts these calls, using a custom hybrid similarity score — rather than standard cosine similarity alone — to rank tool intent. By identifying conflicts and returning structured hints for self-correction, this system eliminates execution loops and significantly improves reliability in complex, multi-tool agentic workflows.
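A hedged sketch of what a hybrid score could look like: blend description overlap with function-name similarity, since near-identical names across MCP servers should raise a conflict even when descriptions diverge. The 0.6/0.4 weights and token-set overlap are illustrative choices, not the middleware's actual formula.

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    # Character-level similarity of the two function names.
    return SequenceMatcher(None, a, b).ratio()

def desc_sim(a, b):
    # Token-set overlap of the two descriptions (stand-in for cosine).
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def hybrid_score(tool_a, tool_b, w_desc=0.6, w_name=0.4):
    # Weighted blend: either signal alone can miss a conflict.
    return (w_desc * desc_sim(tool_a["desc"], tool_b["desc"])
            + w_name * name_sim(tool_a["name"], tool_b["name"]))

a = {"name": "fetch_page",  "desc": "download a web page"}
b = {"name": "fetch_pages", "desc": "download several web pages"}
score = hybrid_score(a, b)
```

When a score crosses a conflict threshold, the middleware can return a structured hint (e.g. "two near-identical tools matched; disambiguate by server") instead of letting the agent loop on a wrong call.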
Oct 2025
Launch: Flik (The Contextual, Ambient AI)
Flik is a new system-wide multimodal interaction layer that lets you point or draw on any app to trigger AI actions instantly. Starting with native Linear and Jira integrations, Flik allows you to highlight a bug or insight and use it as a visual prompt for AI. It bridges the gap between humans and apps, enabling seamless, in-place AI interactions without ever leaving your current window.
Oct 2025
Layer Streaming: Running Full-Precision LLMs on Small Devices
Most edge/consumer devices struggle to run full-precision LLMs without aggressive quantization or massive RAM. I developed a custom inference strategy that streams model layers on demand, loading only what's needed for a single forward pass before releasing it. This "layer streaming" approach enabled a 1.7B parameter model to run with just a 250MB memory footprint—an achievement that won "Best Memory Hack" at MongoDB.local London and demonstrates that high-fidelity AI doesn't require high-end hardware. The trade-off is throughput versus memory, but it unlocks unique use cases like background jobs and ongoing, user-specific learning for preferences and personalization.
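A toy illustration of the layer-streaming idea: each layer's weights live on disk and are loaded only for the duration of their forward pass, so peak memory is one layer rather than the whole model. A real runtime would mmap tensor files; pickled scalars stand in here so the loop is runnable.

```python
import os
import pickle
import tempfile

def save_layers(layers, folder):
    # Persist each layer's weights as its own file on disk.
    paths = []
    for i, w in enumerate(layers):
        path = os.path.join(folder, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(w, f)
        paths.append(path)
    return paths

def streamed_forward(x, paths):
    for path in paths:
        with open(path, "rb") as f:
            weight = pickle.load(f)    # load just this layer
        x = [xi * weight for xi in x]  # stand-in for a real layer op
        del weight                     # release before the next load
    return x

with tempfile.TemporaryDirectory() as d:
    paths = save_layers([2.0, 0.5, 3.0], d)
    out = streamed_forward([1.0, 2.0], paths)
```

The trade-off mentioned above is visible even in the sketch: every forward pass pays one disk read per layer, trading throughput for a memory ceiling of a single layer.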
Sep 2025
Exploring Custom Lossless Compression of Weights: Great Ratios, Not Ready for Prime Time
The original goal was on-the-fly decompression of tensors in memory to shrink footprint during inference. Built and tested a custom compressor, first on PNGs and semi-structured NASA logs (beating gzip and LZMA, down to ~5.8% of original size), then applied it to GPT-2-medium tensors. Full-precision weights only shrink ~7%, but int8 quantization unlocks dramatic gains — up to 85% size reduction and ~6.65× overall ratios. The catch: decompression time is far too slow to justify the savings for runtime use. Still, the extreme compressibility of quantized models is the real signal — worth exploring further for offline/on-disk storage where slow decompression is acceptable.
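You can reproduce the flavor of this with stdlib zlib on simulated weights. The numbers are only directional: real GPT-2 tensors and the custom compressor behave differently, and random Gaussians are a rough stand-in for weight distributions.

```python
import random
import struct
import zlib

random.seed(0)
# Simulated weight tensor: small, zero-centered values like trained weights.
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]

# Full-precision: pack as raw float32 bytes.
fp32_bytes = struct.pack(f"{len(weights)}f", *weights)

# int8 quantization: scale by max-abs so values map onto [-127, 127].
scale = max(abs(w) for w in weights) / 127
int8_bytes = bytes(round(w / scale) & 0xFF for w in weights)

comp_fp32 = len(zlib.compress(fp32_bytes, 9))
comp_int8 = len(zlib.compress(int8_bytes, 9))
bytes_per_weight = {
    "fp32": comp_fp32 / len(weights),
    "int8": comp_int8 / len(weights),
}
```

The float32 mantissa bytes are near-random and resist compression, while the quantized bytes have low entropy and compress below one byte per weight, which is the signal worth chasing for on-disk storage.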
Sep 2025
Multimodal Interactions: Need For Better I/O Devices
Most computers still rely on the keyboard as the primary input, limiting how we use voice and visual modalities. To unlock richer multimodal workflows, we need new I/O primitives: pens, programmable remotes, and hybrid devices that let you seamlessly talk and act. I repurposed a pair of old earbuds — originally designed to launch Siri or Music — by intercepting Bluetooth and low-level events, turning them into a programmable remote for my Mac. Now, a double tap can take a screenshot, paste it in VS Code, and trigger voice input. Small, custom I/O upgrades like this can transform everyday computer interaction and enable entirely new workflows.
Aug 2025
Rich Interleaved Transcripts: Signal Processing + ML Ensemble for Voice/Background Audio
Idea: use Digital Signal Processing (DSP) to separate voice from background audio in realtime, process vocals with Apple's SpeechAnalyzer (or similar STT), run background stream through an ML ensemble for sound/music classification, then merge with timestamp alignment for richly annotated transcripts. Results on movie clips (Top Gun radio chatter + jet sounds, Iron Man scenes) and casual standup audio look promising — adds context or background music cues without losing dialogue accuracy. Proves hybrid classical/ML pipelines still have legs even in 2025 — and we can do a lot more with transcripts generated today.
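The final merge step can be sketched on its own: interleave speech segments and background sound labels by start time into one annotated transcript. Both streams are assumed to arrive as (start, end, text) tuples from the upstream DSP/STT/classifier stages; the "(bg)" marker is an illustrative convention.

```python
def interleave(speech, background):
    # Tag each segment with its stream, then sort everything by start time.
    tagged = [(s, e, t, "speech") for s, e, t in speech]
    tagged += [(s, e, t, "sound") for s, e, t in background]
    lines = []
    for start, end, text, kind in sorted(tagged):
        prefix = "" if kind == "speech" else "(bg) "  # mark non-dialogue cues
        lines.append(f"[{start:05.1f}-{end:05.1f}] {prefix}{text}")
    return lines

speech = [(0.0, 2.1, "Talk to me, Goose."), (4.0, 5.5, "Roger that.")]
background = [(1.0, 6.0, "jet engine roar")]
transcript = interleave(speech, background)
```

Because the two streams are processed independently and only merged at the end, a noisy background classifier can never degrade dialogue accuracy.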
Aug 2025
Personalization with LoRAs: Promising, but Edge Constraints Loom
LoRA experiments on my chat conversations capture ~90% of my writing style (casual, emojis included) — beating homogenized generic LLM outputs. This worked well on 8B+ param models, but models smaller than 4B params look iffy for real personalization depth. Average consumers can't run big models on-device while multitasking. Eyeing what Gemini Nano / Apple Intelligence do with edge AI personalization. Bottom line: for on-device viability, focus on custom inference tricks to shrink the memory footprint (LoRA training does not have to be realtime).
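Back-of-envelope math for why the adapters themselves stay edge-friendly: a rank-r LoRA adds r * (d_in + d_out) parameters per targeted matrix. The numbers below (32 layers, d_model 4096, rank 16, four attention projections per layer, roughly 8B-class shapes) are illustrative assumptions, not measurements from these experiments.

```python
def lora_params(layers, d_model, rank, matrices_per_layer=4):
    # Each adapted d_model x d_model matrix gains A (d x r) + B (r x d).
    per_matrix = rank * (d_model + d_model)
    return layers * matrices_per_layer * per_matrix

params = lora_params(layers=32, d_model=4096, rank=16)
megabytes_fp16 = params * 2 / 1e6  # 2 bytes per fp16 param
```

So the adapter is tens of megabytes; the constraint is the multi-gigabyte base model it rides on, which is why the effort goes into shrinking the inference footprint rather than the LoRA.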
Aug 2025
The Missing HCI Primitive: Why AI Needs System-Wide Drawing
We've given computers "eyes" through multimodal AI, but we still have no universal way to show them where to look. This post explores why freehand drawing should be a first-class HCI primitive — as fundamental as typing or clicking — to unlock true spatial grounding. By enabling drawing and unlocking input parallelism — the ability to sketch and speak simultaneously — we move past the "screenshot-and-prompt" ritual toward a higher-signal, Jarvis-like experience that matches the way we actually think and communicate.
Dec 2024
UX Exploration: Highlights for Email
A Chrome extension that identifies the most important snippets in an email and visually highlights them for quick triage, as an alternative to the commonly used AI summary as the default AI interaction. Built for a hackathon targeting apps built with Gemini Nano (Google's local model in Chrome).
Jun 2024
Launch: Spacefold ('Computer Use' and Visual Prompting)
Spacefold is a functional prototype of true multimodal AI interaction for macOS. It explores autonomous control of apps (update: also known as Computer Use since Anthropic's launch in Oct 2024) and visual prompting. In beta, it is an ongoing exploration of what works and what doesn't with real users. After launch, I learned that even though the technology works, simulating mouse clicks or typing in front of users felt jarring and out of their control. What we really need is a new protocol (and new APIs) that lets people talk to an app directly, with the app itself handling the orchestration of actions. For user-facing experiences, simulating mouse movements or keystrokes is a transient phase. Testing also revealed all the failure modes and the realization that screenshots alone are not enough to provide true context for AI interaction. What works? Visual prompting. Next steps: deeper OS-level integration to prevent the context loss that comes from relying on screenshots alone.
Apr 2024
Model Capability Hack: Animations from Text Models
Vignette is a text-to-image and animation generator built on top of the Claude API, created for a hackathon hosted by Anthropic. One of the first demonstrations of using text-to-text models to generate SVG animations, before it became common.
Apr 2024
Launch: Snapsked
Snapsked turns screenshots and photos into structured action items that can be sent directly to a calendar, reminders, etc. Update: One of the key lessons from building Snapsked was that screenshots alone were not enough for robust model capabilities at the time. For example, a screenshot of a chat can't capture the reference date needed to calculate when "in 3 days" actually is. The ease of sending screenshots to a bot is great, but if reliability struggles, users quickly churn.