Side projects born from real frustrations. Eval-driven from the start. Nothing ships until I can measure that it works. Some are tools I use, some are bets on what's coming, some are just for fun. All are public.
If eval has friction, people skip it. When people skip eval, bad outputs reach users.
A Python library that scores LLM output quality in one function call. Works with any model provider. Zero mandatory dependencies. Ships with CI regression gates so a prompt change can't quietly break production accuracy.
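To make the CI regression gate idea concrete, here is a minimal sketch. It is not the library's actual API: `exact_match`, `mean_score`, `regression_gate`, and the `tolerance` parameter are hypothetical names chosen for illustration, and exact-match scoring stands in for whatever quality metric the library really uses.

```python
from typing import Callable, Sequence


def exact_match(output: str, expected: str) -> float:
    """Hypothetical scorer: 1.0 if the strings match after trimming and lowercasing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def mean_score(
    outputs: Sequence[str],
    expected: Sequence[str],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Average the scorer over paired model outputs and expected answers."""
    if not outputs or len(outputs) != len(expected):
        raise ValueError("outputs and expected must be non-empty and the same length")
    return sum(scorer(o, e) for o, e in zip(outputs, expected)) / len(outputs)


def regression_gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass only if the current score is no more than `tolerance` below the baseline."""
    return current >= baseline - tolerance
```

In a CI job, a failing gate would typically translate into a non-zero exit code, so a prompt change that lowers the score blocks the merge instead of reaching production.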
AI models are becoming the first stop for purchase decisions. No established playbook exists for optimizing brand presence in LLM responses.
When someone asks ChatGPT "best CRM?", most companies have no idea what comes back. This runs a 4-stage agentic pipeline across multiple AI models: audit current visibility, send a probe agent to dig into gaps, run cross-model diagnosis, then output a concrete GEO playbook with prioritized actions.
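The four stages can be sketched roughly as below. This is an illustrative outline under loud assumptions, not the project's implementation: each model is reduced to a hypothetical `ModelFn` callable (prompt in, text out), "visibility" is simplified to a substring check for the brand name, and the playbook stage emits a single placeholder action rather than real GEO recommendations.

```python
from typing import Callable, Dict, List

# Hypothetical model interface: takes a prompt, returns the response text.
ModelFn = Callable[[str], str]


def audit(models: Dict[str, ModelFn], query: str, brand: str) -> Dict[str, bool]:
    """Stage 1: record whether each model mentions the brand for the query."""
    return {name: brand.lower() in ask(query).lower() for name, ask in models.items()}


def probe(models: Dict[str, ModelFn], query: str, brand: str,
          visibility: Dict[str, bool]) -> Dict[str, str]:
    """Stage 2: send a follow-up question only to models where the brand was missing."""
    follow_up = f"{query} Why was {brand} not included?"
    return {name: models[name](follow_up) for name, seen in visibility.items() if not seen}


def diagnose(visibility: Dict[str, bool]) -> str:
    """Stage 3: compare across models to tell a systemic gap from a model-specific one."""
    missing = [name for name, seen in visibility.items() if not seen]
    if not missing:
        return "none"
    return "systemic" if len(missing) == len(visibility) else "model-specific"


def build_playbook(diagnosis: str, gaps: Dict[str, str]) -> List[str]:
    """Stage 4: turn the diagnosis into actions (a single placeholder action here)."""
    if diagnosis == "none":
        return []
    scope = "all models" if diagnosis == "systemic" else ", ".join(sorted(gaps))
    return [f"Review brand coverage for: {scope}"]


def run_pipeline(models: Dict[str, ModelFn], query: str, brand: str) -> dict:
    """Run the four stages in order and return every intermediate result."""
    visibility = audit(models, query, brand)
    gaps = probe(models, query, brand, visibility)
    diagnosis = diagnose(visibility)
    return {"visibility": visibility, "gaps": gaps, "diagnosis": diagnosis,
            "playbook": build_playbook(diagnosis, gaps)}
```

Keeping each stage a plain function with explicit inputs and outputs makes it possible to evaluate the stages separately before wiring in real model calls.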
View on GitHub →
Eval before you build. Know if the problem is solvable with your current approach before spending cycles iterating.
A prompt evaluation harness for VLM-based extraction on freight documents. Compares three strategies (naive, structured, few-shot) against ground truth and classifies every failure: is it layout ambiguity, format variance, scan quality, or something a better prompt can actually fix? The taxonomy tells teams what to iterate on and what to stop wasting time on.
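A rough sketch of how such a harness might score strategies and sort failures follows. The heuristics are assumptions made for illustration, not the project's actual taxonomy logic: a field counts as "prompt_fixable" if at least one strategy got it right, as "format_variance" if a value matches only after normalization, and as "needs_review" otherwise (layout ambiguity and scan quality would require inspecting the document itself). All function and field names are hypothetical.

```python
import re
from typing import Dict

Extraction = Dict[str, str]  # field name -> extracted value


def normalize(value: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def accuracy(pred: Extraction, truth: Extraction) -> float:
    """Fraction of ground-truth fields the strategy extracted exactly."""
    return sum(pred.get(field) == value for field, value in truth.items()) / len(truth)


def classify_failures(results: Dict[str, Extraction], truth: Extraction) -> Dict[str, str]:
    """Label each field that at least one strategy got wrong."""
    labels: Dict[str, str] = {}
    for field, expected in truth.items():
        values = [pred.get(field, "") for pred in results.values()]
        if all(v == expected for v in values):
            continue  # every strategy succeeded, nothing to classify
        if any(v == expected for v in values):
            labels[field] = "prompt_fixable"   # some prompt already gets it right
        elif any(normalize(v) == normalize(expected) for v in values):
            labels[field] = "format_variance"  # right content, different surface form
        else:
            labels[field] = "needs_review"     # possibly layout or scan quality
    return labels
```

With per-strategy predictions keyed as `naive`, `structured`, and `few_shot`, `accuracy` gives the headline comparison and `classify_failures` indicates which misses are worth another prompt iteration.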
View on GitHub →
You learn agent patterns better when the stakes are low and the feedback loop is fast.
Type a shower thought. Four AI agents debate it: an Optimist argues why you're brilliant, a Cynic finds the fatal flaw, a Researcher searches the web for prior art, and a Judge delivers the verdict. Built to explore multi-agent orchestration, role-based personas, and tool use in a context where experimentation is the point.
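The orchestration pattern described above might look roughly like this. It is a simplified sketch, not the project's code: the `LLM` callable signature, the `Agent` class, and the persona strings are all hypothetical, and the Researcher's web search tool is left out, so here it is just another persona.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical model interface: (system_prompt, user_message) -> reply text.
LLM = Callable[[str, str], str]


@dataclass
class Agent:
    name: str
    persona: str  # used as the system prompt for this role

    def respond(self, llm: LLM, message: str) -> str:
        return llm(self.persona, message)


DEBATERS: List[Agent] = [
    Agent("Optimist", "Argue why the idea is brilliant."),
    Agent("Cynic", "Find the fatal flaw in the idea."),
    Agent("Researcher", "Summarize prior art for the idea."),
]
JUDGE = Agent("Judge", "Weigh the arguments and deliver a verdict.")


def debate(thought: str, llm: LLM) -> dict:
    """Collect one reply per debater, then hand the full transcript to the Judge."""
    transcript: List[Tuple[str, str]] = [
        (agent.name, agent.respond(llm, thought)) for agent in DEBATERS
    ]
    summary = "\n".join(f"{name}: {reply}" for name, reply in transcript)
    verdict = JUDGE.respond(llm, f"Idea: {thought}\n{summary}")
    return {"transcript": transcript, "verdict": verdict}
```

Passing the model in as a plain callable keeps the loop testable with a stub, which suits a project where fast, low-stakes experimentation is the point.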
View on GitHub →
Enterprise GenAI delivered for real clients. Production systems, customer rooms, the whole thing.
Open to interesting problems in AI engineering, enterprise GenAI, and anything that actually needs to ship.