Specialization

AI PM - where product thinking meets the model

I build LLM products the way a PM ships any other product: with a crisp problem, an eval rubric, a cost envelope, and a way to roll back. This page collects the playbooks, artefacts, and shipped work behind that stance - most of it learned building Aarchid with Dilpreet Grover.

Playbooks

How I work on AI products

Scoping an LLM feature

How to write a PRD when the model is the product. Success criteria, eval harness, guardrails, and cost envelope - before a single prompt is written.

Framework

Eval-driven development

Treat your golden set like a test suite. Offline evals → shadow traffic → A/B. How we validated 92% diagnosis accuracy on Aarchid.

Method

Cost modelling at the edge

Per-request math for multi-model pipelines (vision + retrieval + research). Caching, batching, and the $0.25/user/mo envelope.

Economics

Citations or it didn't happen

Why user trust collapses without grounded sources, and the architectural pattern for research-augmented LLM responses.

Trust

Live Demo

An eval harness, in your browser

Six plant-diagnosis cases. Two model versions. One confidence gate. Toggle the controls and watch the same golden set re-score in real time - this is how I validate an LLM feature before it ships.

Confidence gate: 70%

Accuracy: 100% (6/6)

Avg confidence: 90%

Avg latency: 2.75s

PASS: Monstera deliciosa, yellowing lower leaves, soil wet 4 days post-water
Expected: Overwatering / root rot risk
Predicted: Overwatering / root rot risk (cited)
Confidence: 91%
Latency: 2.94s

PASS: Fiddle-leaf fig, brown spots with yellow halo, recent move near AC vent
Expected: Cold draft + fungal stress
Predicted: Cold draft + fungal stress (cited)
Confidence: 86%
Latency: 2.71s

PASS: Snake plant, mushy base, leaves falling at touch
Expected: Advanced root rot
Predicted: Advanced root rot (cited)
Confidence: 96%
Latency: 2.53s

PASS: Pothos, pale variegated leaves, low-light corner for 6 weeks
Expected: Insufficient light
Predicted: Insufficient light (cited)
Confidence: 92%
Latency: 2.64s

PASS: Calathea orbifolia, crispy edges, indoor humidity 28%
Expected: Low humidity stress
Predicted: Low humidity stress (cited)
Confidence: 89%
Latency: 2.82s

PASS: ZZ plant, drooping stems, watered weekly past month
Expected: Overwatering
Predicted: Overwatering (cited)
Confidence: 88%
Latency: 2.89s

Toggle between the v1 baseline and the grounded v2 stack, or raise the confidence gate, to see how the same golden set re-scores. This is the same shape of harness we used on Aarchid to validate the 92% diagnosis accuracy claim before any user saw the model in production.
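
A minimal sketch of the scoring loop behind a harness like this one, in Python. The six cases and their numbers are taken from the readout above; the names `Case` and `score`, and the rule that a case passes only when the prediction matches the expected label and its confidence clears the gate, are assumptions made for illustration, not Aarchid's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One golden-set entry plus the model output recorded for it."""
    prompt: str
    expected: str
    predicted: str
    confidence_pct: int
    latency_s: float


# Values copied from the demo readout (all six predictions match their labels).
GOLDEN_SET = [
    Case("Monstera deliciosa, yellowing lower leaves",
         "Overwatering / root rot risk", "Overwatering / root rot risk", 91, 2.94),
    Case("Fiddle-leaf fig, brown spots with yellow halo",
         "Cold draft + fungal stress", "Cold draft + fungal stress", 86, 2.71),
    Case("Snake plant, mushy base",
         "Advanced root rot", "Advanced root rot", 96, 2.53),
    Case("Pothos, pale variegated leaves",
         "Insufficient light", "Insufficient light", 92, 2.64),
    Case("Calathea orbifolia, crispy edges",
         "Low humidity stress", "Low humidity stress", 89, 2.82),
    Case("ZZ plant, drooping stems",
         "Overwatering", "Overwatering", 88, 2.89),
]


def score(cases, gate_pct):
    """Re-score the golden set under a confidence gate.

    Assumed rule: a case passes only if the prediction is correct AND the
    model's confidence is at or above the gate.
    """
    passed = sum(
        1 for c in cases
        if c.predicted == c.expected and c.confidence_pct >= gate_pct
    )
    total = len(cases)
    return {
        "passed": passed,
        "total": total,
        "accuracy": passed / total,
        "avg_confidence_pct": sum(c.confidence_pct for c in cases) / total,
        "avg_latency_s": sum(c.latency_s for c in cases) / total,
    }


if __name__ == "__main__":
    for gate in (70, 90):
        print(gate, score(GOLDEN_SET, gate))
```

Under this assumed rule, raising the gate from 70 to 90 leaves every prediction unchanged but fails the three cases whose confidence sits below 90, which is the trade-off the gate control is meant to expose.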

Live Demo

Cost modelling, in real time

Same harness mindset, applied to economics. Move the sliders to see how batch size, cache hit rate, and request volume reshape the per-user-per-month bill - and whether you stay inside the $0.25 envelope.

Requests / user / month: 8
Batch size: 1
Cache hit rate: 45%
Active users: 5,000

Effective cost / request: $0.00472

Cost / user / month: $0.038 · in envelope

Monthly run-rate: $189

Per-request breakdown

Vision (Gemini 1.5 Pro): $0.00350
Retrieval (Exa AI API): $0.00500
Embed (cache lookup, on hit): $0.00010
Edge worker: $0.0000005

The Aarchid envelope is $0.25 / active user / month. Vision and retrieval make up nearly all of the per-request cost - batching vision calls across multiple images (gallery upload, time-lapse) and caching repeat diagnoses by perceptual hash are the two levers that keep us under budget at scale.
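
The per-request math can be sketched in a few lines of Python. The unit prices are the ones in the breakdown above. The formula is an assumption chosen because it reproduces the displayed figures: a cache miss pays for vision plus retrieval, a cache hit pays only the lookup, and the edge worker runs on every request. Dividing the vision cost by batch size is likewise a guess at how the batch slider behaves, and the function names are hypothetical.

```python
# Unit prices from the per-request breakdown (USD).
VISION_USD = 0.00350
RETRIEVAL_USD = 0.00500
CACHE_LOOKUP_USD = 0.00010
EDGE_WORKER_USD = 0.0000005
ENVELOPE_USD = 0.25  # per active user per month


def effective_cost_per_request(cache_hit_rate, batch_size=1):
    """Expected cost of one request, blended over cache hits and misses.

    Assumptions: batching amortises only the vision call, and a cache hit
    skips both vision and retrieval.
    """
    miss_cost = VISION_USD / batch_size + RETRIEVAL_USD
    blended = (1 - cache_hit_rate) * miss_cost + cache_hit_rate * CACHE_LOOKUP_USD
    return blended + EDGE_WORKER_USD


def monthly_bill(requests_per_user, active_users, cache_hit_rate, batch_size=1):
    """Roll the per-request cost up to per-user and fleet-wide monthly totals."""
    per_request = effective_cost_per_request(cache_hit_rate, batch_size)
    per_user = per_request * requests_per_user
    return {
        "per_request": per_request,
        "per_user": per_user,
        "run_rate": per_user * active_users,
        "in_envelope": per_user <= ENVELOPE_USD,
    }


if __name__ == "__main__":
    # Slider values from the demo: 8 requests, batch 1, 45% hits, 5,000 users.
    bill = monthly_bill(requests_per_user=8, active_users=5_000, cache_hit_rate=0.45)
    print(f"${bill['per_request']:.5f} / request")
    print(f"${bill['per_user']:.3f} / user / month")
    print(f"${bill['run_rate']:.0f} monthly run-rate")
    print("in envelope" if bill["in_envelope"] else "over envelope")
```

With the slider values shown, this yields roughly $0.00472 per request, $0.038 per user per month, and a $189 run-rate, matching the readout. With the cache hit rate at zero, the same formula gives $0.0085 per request, or about $0.068 per user per month at 8 requests.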

Case Study

Aarchid - shipped proof

AI Botanical Intelligence · 92% diagnosis accuracy

Co-created with Dilpreet Grover. Multimodal vision (Gemini 1.5 Pro) grounded by research-augmented reasoning (Exa AI API), running on Cloudflare Workers. Sub-10s P95, $0.25 per active user per month at scale.

Read the case study →

Writing

Essays on AI + product

Shipping LLM Products Starts With the Eval Harness, Not the Prompt

A prompt is an artefact. An eval harness is a product. Here's how I scope LLM features so the output doesn't surprise users - or me.

8 min read

What's Next

On the bench

AI PM interview prep kit - deconstructed case questions, eval-harness design, and model economics cheatsheets.
Second Aarchid-scale build - applying the same Edge Stack pattern to a different problem domain.
Essay series: “The PRD is dead, long live the eval set” - in progress.

Looking for an AI PM who can spec, eval, and ship? Get in touch.