Explorations & Thoughts
Log of select projects, experiments, and thoughts. I began documenting these very late in the process and there's a backlog from pre-2024 I hope to catch up on sometime; feel free to check out my GitHub repos, gists, or X feed for more in the meantime. -Rishi
Apr 2026
Precision != Information Density?
It's fascinating how we treat "full precision" in LLMs: 32-bit floats as the ground truth of intelligence, with quantization often viewed as just a tool to save VRAM. But after tinkering with signal processing techniques, learned projections, etc., I'm more convinced that precision is just a property of the data container, whereas information density and structure are what matter more. Back in January, I had this vague notion that the real breakthrough in efficiency wouldn't come from making operations faster, but from figuring out which computations to ignore or reuse. Today, it feels like some of "efficiency engineering" could really just be repurposing old-school signal processing techniques to strip away the floating-point "jitter", i.e., bits that represent noise rather than actual semantic meaning.
I ran an embedding/retrieval experiment with 50k documents, comparing a standard fp32 index against one built on a 2-bit learned quantizer that borrows DSP techniques. At k=1, the 2-bit model is "blurry" (expected), but as k increases to 10 and 20, the 2-bit learned signal actually outperforms fp32. My intuition is that this is a denoising effect: by quantizing intelligently, we're applying a semantic low-pass filter that ignores the high-frequency noise distracting the distance calculations in high-dimensional space. The graph in the linked post shows the crossover clearly: as the search space widens, the "lower-resolution" structured signal becomes a more reliable map of the global topology than the raw, unfiltered floats.
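To make the low-pass intuition concrete, here's a minimal sketch on synthetic data. A crude per-dimension quartile quantizer stands in for the learned one (the corpus, dimensions, and thresholds are all illustrative): each float collapses to one of four buckets, and the scan runs purely on the 2-bit codes.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64)).astype(np.float32)        # synthetic corpus
query = docs[42] + 0.1 * rng.normal(size=64).astype(np.float32)

# Per-dimension quartile thresholds -> 2-bit codes (values 0..3).
# A learned quantizer would place these cuts to preserve semantic
# structure; quartiles are the crude stand-in here.
t = np.quantile(docs, [0.25, 0.5, 0.75], axis=0)             # (3, 64)
codes = (docs[None] > t[:, None, :]).sum(axis=0)             # (1000, 64)
qcode = (query[None, None] > t[:, None, :]).sum(axis=0)[0]   # (64,)

# "Blurry" scan: L1 distance in code space ignores sub-bucket jitter,
# which is exactly the low-pass behavior described above.
coarse = np.abs(codes - qcode).sum(axis=1)
top20 = np.argsort(coarse)[:20]
print("true neighbor in top-20:", 42 in top20)
```

The noisy copy of document 42 still lands in the coarse top-20 because the per-dimension jitter rarely crosses a bucket boundary, while unrelated documents differ by whole buckets in most dimensions.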
I also noticed another approach performed better at low k, which led me to a hybrid, adaptive quantizer. If we structure the information so the first 2 bits carry the core topology and the subsequent bits carry the residual error, we effectively create a dynamic filter: use the blurry 2-bit base for a massive, efficient scan to find the general neighborhood, then layer in the residuals for a sharp 4-bit refinement at the very top. This yields a system that is 8-16x smaller yet beats fp32 in every scenario in this specific embedding-and-retrieval experiment. The other advantage of such an encoding is that the low-bit scan can use SIMD-accelerated computation and similar tricks, making it far faster, while the expensive refinement work is deferred to the CPU/GPU on demand.
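Under the same toy assumptions as above, the hybrid scheme can be sketched as a two-stage search: a cheap scan over the 2-bit base codes picks a shortlist, and only the shortlist gets the residuals layered back in. (Here the residuals stay in float just to show the layering reconstructs the original exactly; the real scheme would quantize them too.)

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.normal(size=(2000, 32)).astype(np.float32)
query = docs[7] + 0.05 * rng.normal(size=32).astype(np.float32)
d = docs.shape[1]

# 2-bit base: quartile codes plus per-bucket centers for dequantization.
t = np.quantile(docs, [0.25, 0.5, 0.75], axis=0)
codes = (docs[None] > t[:, None, :]).sum(axis=0)
centers = np.quantile(docs, [0.125, 0.375, 0.625, 0.875], axis=0)
base = centers[codes, np.arange(d)]          # blurry reconstruction
resid = docs - base                          # the "subsequent bits"

qcode = (query[None, None] > t[:, None, :]).sum(axis=0)[0]

# Stage 1: massive, cheap scan in code space (SIMD/popcount territory).
shortlist = np.argsort(np.abs(codes - qcode).sum(axis=1))[:64]
# Stage 2: layer residuals back in and rank only the shortlist exactly.
refined = base[shortlist] + resid[shortlist]
dists = np.linalg.norm(refined - query, axis=1)
best = shortlist[np.argsort(dists)][0]
```

The design point is that the expensive per-element math only ever touches the shortlist, so the scan cost is governed by the tiny code array, not the full-precision vectors.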
This experiment is definitely promising and has real-world utility, but the real stress test will be how much of this transfers beyond embeddings and into dynamic state like the KV cache, where the "signal" is constantly evolving. The real takeaway for me is that precision alone is neither signal nor information density, and there's a lot to explore in efficiency engineering and training rather than just throwing compute or memory at noise.
Mar 2026
Edge AI is a Fundamentally Different Optimization Problem — and GPU-first Playbooks are Often Wrong
We've grown lazy with standard PyTorch and MLX defaults for local/edge AI. For edge AI, the "best" compute unit isn't a constant; it's a function of workload, batch size, cache residency, hardware arithmetic intensity, and more. I've been hearing a lot about Apple's ANE recently, and while the ANE is finally getting attention, the AMX (Apple's matrix coprocessor) remains under-explored and under-exploited. In basic tests, I hit 1.4 TFLOPS on an M2 laptop, making AMX ~481x faster than naive C++ and orders of magnitude faster than even SIMD/NEON.
Skip the popular AI abstractions, write custom kernels that target the AMX (via Accelerate), use CoreML to target the ANE, and the performance gap is staggering. On an M2 (a 4-year-old device), a C++/Accelerate (AMX) MiniLM cross-reranker hits a 47ms cold start, nearly 150x faster than the ~7s of Python/MPS, and query times are great too. While the industry fixates on GPUs, I found that for single-query latency, the CPU/AMX crushes the GPU by eliminating dispatch overhead and the memory-move tax. The key is data structuring: by staying resident in the tiny L-cache/SLC/SRAM, one can get outsized latency and throughput efficiencies for many tasks. It's work, but the payoff is huge for certain tasks.
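You don't need exotic tooling to reach the AMX: Accelerate's BLAS routes large GEMMs through it, and anything linked against Accelerate inherits that, including a NumPy built that way. A rough, hypothetical harness (sizes and the GFLOPS arithmetic are illustrative; on non-Apple platforms the same call just hits whatever BLAS is linked):

```python
import time
import numpy as np

# On an Apple Silicon Mac with NumPy linked against Accelerate, this GEMM
# is dispatched to the AMX blocks; elsewhere it uses the local BLAS. The
# point stands either way: the matrix hardware is reached through a plain
# BLAS call, no ML framework required.
n = 512
rng = np.random.default_rng(0)
a = rng.normal(size=(n, n)).astype(np.float32)
b = rng.normal(size=(n, n)).astype(np.float32)

start = time.perf_counter()
c = a @ b                                      # sgemm under the hood
elapsed = time.perf_counter() - start
gflops = 2 * n ** 3 / elapsed / 1e9            # matmul costs ~2*n^3 flops
print(f"{gflops:.1f} GFLOPS")
```

In C++ the equivalent is a direct `cblas_sgemm` call against the Accelerate framework, which is how the reranker above stays framework-free.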
Some of the trade-offs are also interesting: ANE+GPU dominates throughput for batch processing (e.g., embedding 7,500+ docs/sec) but degrades once you hit ANE tiling limits, while CPU-only AMX scales more steadily and wins at low batch sizes. Edge AI isn't about the biggest model, the most VRAM, or simply running quantized models. It's about building hardware-accelerated, cache-optimized, and workload-aware pipelines. It's about finding the optimal path to let users multitask while delivering intelligence without the wait.
Mar 2026
Intent Classification with Margin Loss
Intent classification (query → intent) on Banking77 (MTEB) with a simple margin-loss head is surprisingly strong. Freeze a SentenceTransformer encoder (e.g., MiniLM-L6-v2 or BGE-Small), L2-normalize embeddings, and train a lightweight MLP using a combined margin loss + label-smoothed cross-entropy (90/10 weight split; label smoothing=0.1). With margin=0.5, dropout=0.3, lr=1e-3 and early stopping, a "dumb" MiniLM-L6-v2 jumped ~32% in accuracy — achieving top-1 93.5% and top-3 98.3% — while keeping deployment cost near-zero since only the MLP is trained. If you can generate labeled data easily, this is a low-resource, high-accuracy option for intent classification.
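The combined objective is easy to write down. Here's a minimal NumPy sketch under my reading of the recipe, using a multi-class hinge as the margin term blended 90/10 with label-smoothed cross-entropy; in the actual setup this sits on top of frozen, L2-normalized SentenceTransformer embeddings, and the exact margin formulation is an assumption:

```python
import numpy as np

def combined_loss(logits, y, margin=0.5, smooth=0.1, w_margin=0.9):
    """0.9 * margin loss + 0.1 * label-smoothed CE, per the recipe above.
    The margin term here is a multi-class hinge: every wrong class must
    trail the true class by at least `margin`."""
    n, c = logits.shape
    true = logits[np.arange(n), y][:, None]
    hinge = np.maximum(0.0, margin - (true - logits))
    hinge[np.arange(n), y] = 0.0                 # don't penalize the true class
    margin_loss = hinge.sum(axis=1).mean()

    m = logits.max(axis=1, keepdims=True)        # numerically stable log-softmax
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    target = np.full((n, c), smooth / c)         # smoothed one-hot target
    target[np.arange(n), y] += 1.0 - smooth
    ce = -(target * logp).sum(axis=1).mean()
    return w_margin * margin_loss + (1.0 - w_margin) * ce

# Confident, correct logits should score far lower than wrong ones.
good = combined_loss(np.array([[4.0, 0.0, 0.0]]), np.array([0]))
bad = combined_loss(np.array([[4.0, 0.0, 0.0]]), np.array([1]))
```

Training then just means backpropagating this through the small MLP head; the encoder never sees a gradient.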
Dec 2025
Self-Correcting Smol Models via Token Engineering
Proof of concept showing how monitoring token probabilities during inference lets small models detect and correct bad outputs in real time. The system intercepts likely errors, triggers verification loops with a code runtime, and returns control, with no multi-turn prompting needed. Surprisingly, it outperforms larger RL-trained models that second-guess themselves despite built-in verification. Small models have improved dramatically over three years, but tool-calling reliability remains the critical bottleneck; solving it unlocks edge AI at scale.
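The monitoring loop itself is small. A sketch with a scripted stand-in for the decode step (`step_fn`, the threshold, and the flagging policy are all illustrative, not the actual system):

```python
CONF_THRESHOLD = 0.55  # assumption: tuned per model/task in practice

def generate_with_watchdog(step_fn, max_tokens=32):
    """Decode while watching the chosen token's probability. Positions
    where confidence dips below the threshold get flagged so a
    verification pass (e.g. executing the emitted code) can run before
    the output is returned."""
    tokens, flags = [], []
    for i in range(max_tokens):
        token, probs = step_fn(tokens)
        if token is None:                    # end of sequence
            break
        tokens.append(token)
        if probs[token] < CONF_THRESHOLD:    # low confidence -> verify later
            flags.append(i)
    return tokens, flags

# Scripted decode: token 7 is emitted with low confidence and gets flagged.
script = [(3, {3: 0.91}), (7, {7: 0.38}), (None, {})]
tokens, flags = generate_with_watchdog(lambda prefix: script[len(prefix)])
```

The key property is that flagging happens inline during generation, so the verification loop fires at the exact position of the suspect token instead of after a full multi-turn round trip.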
Dec 2025
AI Discourse vs. Reality (Analysis)
Today, 80% of active consumer AI users send fewer than 4 messages per day — behavior that resembles search (occasional queries) rather than true assistant use (continuous engagement). A handful of power users dominate the conversation, skewing perceptions. The future is here, just not evenly distributed yet.
Dec 2025
Rethinking Function Calling with Small Language Models
The bottleneck in reliable function calling with small models isn't model size — it's probability collapse during unconstrained generation. By observing token-level confidence and decomposing function calling into staged classification and extraction steps with constrained decoding, even 350-700MB models achieve reliable results.
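Constrained decoding at each stage is essentially logit masking. A sketch with synthetic logits (the staged vocabularies here are placeholders, e.g. first the set of valid function names, then the set of valid argument tokens):

```python
import numpy as np

def constrained_pick(logits, allowed):
    """Mask everything outside the current stage's vocabulary so
    probability mass cannot collapse onto an invalid token."""
    masked = np.full_like(logits, -np.inf)
    masked[list(allowed)] = logits[list(allowed)]
    return int(np.argmax(masked))

logits = np.array([0.1, 1.2, 0.3, 2.9, 0.7])   # unconstrained argmax is 3
choice = constrained_pick(logits, allowed={1, 2})   # but 3 is invalid here
```

Decomposing the call into stages means each mask is tiny and exact, which is what keeps a sub-GB model from drifting into malformed output.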
Dec 2025
The AI Recalibration: Cloud-first to Edge-first (Analysis)
Is the cloud-first AI era already peaking? While the industry fixates on massive infrastructure, model efficiency has surged 50-100x in just five years. We are reaching a tipping point where 3B-parameter local models are outperforming the 175B-parameter giants of 2020. This analysis explores the parameter and intelligence efficiency, and why the next 2-3 years could see true personal AI workloads move off the server for common tasks and onto your device.
Nov 2025
Instant TTS on the Edge
Building a custom, native C++ inference runtime for Kokoro TTS makes high-quality audio generation possible entirely on-device, on the CPU. This optimized implementation delivers a Real-Time Factor (RTF) of 0.2, generating speech roughly five times faster than real time, and provides up to a 10x reduction in cold-start times and a 3x reduction in memory usage compared to standard inference frameworks.
Nov 2025
Sub-150ms Cold Starts and Instant Semantic Retrieval
Most local embedding setups are weighed down by gigabytes of RAM and long cold-start latencies (and thus people believe they need cloud setups). By building a standalone C runtime on ONNX Runtime (C + ORT), I was able to achieve a 130ms cold start and 10ms inference. This implementation delivers an 80x faster startup and a 10x lower memory footprint (~170MB) than standard PyTorch/Hugging Face configurations, enabling instant, private semantic search entirely on the CPU.
Nov 2025
Make Lynx Popular Again: Semantic 'AI Browser' for High-Recall Search
Built on top of the ultra-lightweight Lynx engine, this "AI browser" prototype replaces SEO-driven rankings with semantic relevance. By stripping web bloat and focusing on information density, it delivers high-recall search results optimized for knowledge retrieval rather than human sorting. Packaged as a Model Context Protocol (MCP) server, it provides a faster, cleaner alternative to traditional web search or fetch APIs for AI agents and CLI-based workflows. The challenge, however, is that Lynx doesn't have a JS runtime, so it needs more work.
Oct 2025
MCP Tool Disambiguation: Tool Descriptor Tuning via Simulations
ML techniques first spot confusing tool pairs (e.g. reset_user_password vs reset_system_password at high similarity). Then: use an LLM as the simulation engine to generate realistic but ambiguous user queries against those pairs → measure mis-selection risk → iteratively rewrite descriptions until the problematic similarity collapses. A lightweight, repeatable way to make tool-calling more reliable without over-engineering the model.
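The first step can be sketched without any model at all. Token Jaccard stands in for embedding cosine similarity here, and the tool descriptions and threshold are made up for illustration:

```python
def jaccard(a, b):
    """Token-overlap similarity: a dependency-free stand-in for the
    embedding cosine similarity the real pipeline would use."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def confusing_pairs(tools, threshold=0.5):
    """Flag tool pairs whose descriptions are suspiciously similar; these
    are the pairs the LLM simulation then probes with ambiguous queries."""
    names = list(tools)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = jaccard(tools[a], tools[b])
            if score >= threshold:
                flagged.append((a, b, round(score, 2)))
    return flagged

tools = {
    "reset_user_password": "Reset the password for a user account",
    "reset_system_password": "Reset the password for the system admin account",
    "get_weather": "Fetch current weather for a city",
}
pairs = confusing_pairs(tools)
```

After each description rewrite, re-running this check closes the loop: the iteration stops once the flagged pair's similarity drops below the threshold.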
Oct 2025
MCP Tool Disambiguation: Fixing Agentic Tool-Call Misfires
LLM agents often fail when faced with overlapping tool descriptions or similar function names across MCP servers. I built a middleware layer that intercepts these calls, using a custom hybrid similarity score — rather than standard cosine similarity alone — to rank tool intent. By identifying conflicts and returning structured hints for self-correction, this system eliminates execution loops and significantly improves reliability in complex, multi-tool agentic workflows.
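One possible shape of that middleware, sketched with stand-in components: fuzzy name similarity blended with description token overlap instead of cosine similarity alone, and a structured hint returned whenever the top two candidates are too close to call. The weights, the ambiguity gap, and both similarity components are illustrative, not the actual hybrid score.

```python
from difflib import SequenceMatcher

def hybrid_score(query, name, desc, w_name=0.3):
    """Blend fuzzy tool-name similarity with description token overlap
    rather than trusting a single similarity signal."""
    name_sim = SequenceMatcher(None, query.lower(), name.lower()).ratio()
    q, d = set(query.lower().split()), set(desc.lower().split())
    desc_sim = len(q & d) / len(q | d) if q | d else 0.0
    return w_name * name_sim + (1 - w_name) * desc_sim

def route(query, tools, ambiguity_gap=0.05):
    """Return the winning tool, or a structured hint the agent can use
    to self-correct instead of looping on a bad call."""
    scored = sorted(((hybrid_score(query, n, d), n) for n, d in tools.items()),
                    reverse=True)
    if scored[0][0] - scored[1][0] < ambiguity_gap:
        return {"status": "ambiguous", "candidates": [n for _, n in scored[:2]]}
    return {"status": "ok", "tool": scored[0][1]}

tools = {
    "reset_user_password": "Reset the password for a user account",
    "reset_system_password": "Reset the password for the system admin account",
}
result = route("reset my user password", tools)
```

Returning a hint object rather than silently picking a winner is what breaks the execution loop: the agent sees the ambiguity and can ask or re-plan.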
Oct 2025
Launch: Flik (The Contextual, Ambient AI)
Flik is a new system-wide, multimodal interaction layer that lets you point or draw on any app to trigger AI actions instantly. Starting with native Linear and Jira integrations, Flik lets you highlight a bug or insight and use it as a visual prompt to AI. It bridges the gap between humans and apps, enabling seamless, in-place AI interactions without ever leaving your current window.
Oct 2025
Layer Streaming: Running Full-Precision LLMs on Small Devices
Most edge/consumer devices struggle to run full-precision LLMs without aggressive quantization or massive RAM. I developed a custom inference strategy that streams model layers on demand, loading only what's needed for a single forward pass before releasing it. This "layer streaming" approach enabled a 1.7B parameter model to run with just a 250MB memory footprint—an achievement that won "Best Memory Hack" at MongoDB.local London and demonstrates that high-fidelity AI doesn't require high-end hardware. The trade-off is throughput versus memory, but it unlocks unique use cases like background jobs and ongoing, user-specific learning for preferences and personalization.
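The mechanic is easy to demonstrate with a toy model: each layer lives in its own file and is resident only for the instant it's needed. This NumPy stand-in is a sketch; the real implementation streams the actual checkpoint shards and is where the throughput cost comes from.

```python
import os
import tempfile
import numpy as np

tmp = tempfile.mkdtemp()
rng = np.random.default_rng(0)
dim, n_layers = 256, 8

# Shard the "checkpoint": one file per layer, so peak weight memory
# during a forward pass is a single layer, not the whole model.
for i in range(n_layers):
    w = rng.normal(scale=dim ** -0.5, size=(dim, dim)).astype(np.float32)
    np.save(os.path.join(tmp, f"layer_{i}.npy"), w)

def forward(x):
    for i in range(n_layers):
        w = np.load(os.path.join(tmp, f"layer_{i}.npy"))   # load on demand
        x = np.tanh(x @ w)                                 # use it
        del w                                              # release it
    return x

out = forward(np.ones(dim, dtype=np.float32))
```

Here peak weight residency is one 256KB layer instead of 2MB of model; at 1.7B parameters the same idea keeps one transformer block resident instead of gigabytes, which is exactly the throughput-for-memory trade described above.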
Sep 2025
Exploring Custom Lossless Compression of Weights: Great Ratios, Not Ready for Prime Time
The original goal was on-the-fly decompression of tensors in memory to shrink footprint during inference. Built and tested a custom compressor first on PNGs and semi-structured NASA logs (beats gzip and LZMA down to ~5.8% original size), then applied it to GPT-2-medium tensors. Full-precision weights only shrink ~7%, but int8 quantization unlocks dramatic gains — up to 85% size reduction and ~6.65× overall ratios. The catch: decompression time is far too slow to justify the savings for runtime use. Still, the extreme compressibility of quantized models is the real signal — worth exploring further for offline/on-disk storage where slow decompression is acceptable.
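The fp32-vs-int8 asymmetry is easy to reproduce with a stock codec. This uses zlib, not the custom compressor, and synthetic heavy-tailed "weights" (real weight matrices have outliers, which is what makes max-scaled int8 values cluster near zero), so the exact ratios are illustrative:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Heavy-tailed draws mimic weight matrices with outliers.
w = (rng.standard_t(df=2, size=50_000) * 0.02).astype(np.float32)

# fp32 barely compresses: mantissa bits look like noise to any codec.
fp32 = w.tobytes()
fp32_ratio = len(zlib.compress(fp32, 9)) / len(fp32)

# Max-scaled int8 concentrates most symbols near zero, so entropy
# coding bites hard on top of the 4x dtype shrink.
scale = np.abs(w).max() / 127.0
q = np.round(w / scale).astype(np.int8)
q_bytes = q.tobytes()
q_ratio = len(zlib.compress(q_bytes, 9)) / len(q_bytes)

# "Lossless" means the stored representation round-trips exactly.
assert zlib.decompress(zlib.compress(q_bytes, 9)) == q_bytes
print(f"fp32 ratio {fp32_ratio:.2f}, int8 ratio {q_ratio:.2f}")
```

The same effect, applied with a better codec to real GPT-2 tensors, is where the ~85% reductions above come from.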
Sep 2025
Multimodal Interactions: Need For Better I/O Devices
Most computers still rely on the keyboard as the primary input, limiting how we use voice and visual modalities. To unlock richer multimodal workflows, we need new I/O primitives: pens, programmable remotes, and hybrid devices that let you seamlessly talk and act. I repurposed a pair of old earbuds — originally designed to launch Siri or Music — by intercepting Bluetooth and low-level events, turning them into a programmable remote for my Mac. Now, a double tap can take a screenshot, paste it in VS Code, and trigger voice input. Small, custom I/O upgrades like this can transform everyday computer interaction and enable entirely new workflows.
Aug 2025
Rich Interleaved Transcripts: Signal Processing + ML Ensemble for Voice/Background Audio
Idea: use Digital Signal Processing (DSP) to separate voice from background audio in realtime, process vocals with Apple's SpeechAnalyzer (or similar STT), run background stream through an ML ensemble for sound/music classification, then merge with timestamp alignment for richly annotated transcripts. Results on movie clips (Top Gun radio chatter + jet sounds, Iron Man scenes) and casual standup audio look promising — adds context or background music cues without losing dialogue accuracy. Proves hybrid classical/ML pipelines still have legs even in 2025 — and we can do a lot more with transcripts generated today.
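The separation stage's interface can be sketched with a crude FFT band-split: energy in the telephone-speech band goes to the "voice" stream, the rest to "background". The real pipeline uses proper DSP (filters, masks) rather than a hard spectral cut, and the test signals here are synthetic tones, but the shape is the same: one input, two time-aligned streams.

```python
import numpy as np

def split_voice_background(signal, sr, lo=300.0, hi=3400.0):
    """Hard spectral split: keep 300-3400 Hz (roughly speech) in one
    stream and everything else in the other. The two streams sum back
    to the original, so nothing is lost before classification."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sr)
    voice_band = (freqs >= lo) & (freqs <= hi)
    voice = np.fft.irfft(np.where(voice_band, spec, 0), n=len(signal))
    background = np.fft.irfft(np.where(voice_band, 0, spec), n=len(signal))
    return voice, background

sr = 16_000
t = np.arange(sr) / sr
speech_like = np.sin(2 * np.pi * 1000 * t)   # tone inside the voice band
rumble = np.sin(2 * np.pi * 60 * t)          # low-frequency background
voice, background = split_voice_background(speech_like + rumble, sr)
```

Because both streams share the original timeline, timestamp alignment for the merged transcript is free.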
Aug 2025
Personalization with LoRAs: Promising, but Edge Constraints Loom
LoRA experiments on my chat conversations capture ~90% of my writing style (casual, emojis included), beating homogenized generic LLM outputs. This worked well on 8B+ param models, but models smaller than 4B params look iffy for real personalization depth, and average consumers can't run big models on-device while multitasking. Eyeing what Gemini Nano / Apple Intelligence do with edge AI personalization. Bottom line: for on-device viability, focus on custom inference tricks to shrink the memory footprint (LoRA training does not have to be realtime).
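The memory argument falls out of the LoRA math itself. A minimal sketch (dimensions, rank, and alpha are illustrative): the frozen base weight W is augmented by a low-rank update B @ A scaled by alpha / r, and only A and B are trained, which is why a per-user adapter is tiny compared to the model.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x @ (W + (alpha/r) * B @ A)^T, with W frozen and only the
    rank-r factors A, B trainable."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8
W = rng.normal(size=(d_out, d_in)) * 0.02    # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01        # trainable down-projection
B = np.zeros((d_out, r))                     # standard init: update starts at 0
x = rng.normal(size=(1, d_in))
y = lora_forward(x, W, A, B)
```

With rank 8 the adapter is ~1.6% of the layer's parameters, which is the property that makes shipping and hot-swapping per-user style adapters plausible on-device.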
Aug 2025
The Missing HCI Primitive: Why AI Needs System-Wide Drawing
We've given computers "eyes" through multimodal AI, but we still have no universal way to show them where to look. This post explores why freehand drawing should be a first-class HCI primitive — as fundamental as typing or clicking — to unlock true spatial grounding. By enabling drawing and unlocking input parallelism — the ability to sketch and speak simultaneously — we move past the "screenshot-and-prompt" ritual toward a higher-signal, Jarvis-like experience that matches the way we actually think and communicate.
Dec 2024
UX Exploration: Highlights for Email
A Chrome extension that identifies the most important snippets in an email and visually highlights them for quick triage, instead of the commonly used AI summary as the default AI interaction. Built for a hackathon targeting apps built with Gemini Nano (Google's local model in Chrome).
Jun 2024
Launch: Spacefold ('Computer Use' and Visual Prompting)
Spacefold is a functional prototype of true multimodal AI interaction for macOS. It explores autonomous control of apps (update: also known as Computer Use since Anthropic's launch in Oct 2024) and visual prompting. In beta, it is an ongoing exploration of what works and what doesn't with real users. After launch, I learned that even though the technology works, simulating mouse clicks or typing in front of users felt jarring and out of their control. What we really need is a new protocol (and new APIs) that lets people talk to an app directly, with the app itself handling the orchestration of actions; for user-facing experiences, simulated mouse movements and keystrokes are a transient phase. Testing also revealed all the failure modes and the realization that screenshots alone are not enough to provide true context for AI interaction. What works? Visual prompting. Next steps: deeper OS-level integration to prevent the context loss that comes from relying on screenshots alone.
Apr 2024
Model Capability Hack: Animations from Text Models
Vignette is a text-to-image and animation generator built on top of the Claude API, created for a hackathon by Anthropic. It was one of the first demonstrations of using text-to-text models to generate SVG animations, before that became common.
Apr 2024
Launch: Snapsked
Snapsked turns screenshots and photos into structured action items that can be sent directly to a calendar, reminders, etc. Update: One of the key lessons from building Snapsked was that screenshots alone were not enough for robust model capabilities at the time. For example, a screenshot of a chat can't capture the reference date needed to calculate when "in 3 days" actually is. The ease of sending screenshots to a bot is great, but if reliability struggles, users quickly churn.