Specialization

AI PM - where product thinking meets the model

I build LLM products the way a PM ships any other product: with a crisp problem, an eval rubric, a cost envelope, and a way to roll back. This page collects the playbooks, artefacts, and shipped work behind that stance - most of it learned building Aarchid with Dilpreet Grover.

How I work on AI products

Scoping an LLM feature

How to write a PRD when the model is the product. Success criteria, eval harness, guardrails, and cost envelope - before a single prompt is written.

Framework

Eval-driven development

Treat your golden set like a test suite. Offline evals → shadow traffic → A/B. How we validated 92% diagnosis accuracy on Aarchid.

Method

Cost modelling at the edge

Per-request math for multi-model pipelines (vision + retrieval + research). Caching, batching, and the $0.25/user/mo envelope.

Economics

Citations or it didn't happen

Why user trust collapses without grounded sources, and the architectural pattern for research-augmented LLM responses.

Trust

An eval harness, in your browser

Six plant-diagnosis cases. Two model versions. One confidence gate. Toggle the controls and watch the same golden set re-score in real time - this is how I validate an LLM feature before it ships.

Accuracy: 100% (6/6)
Avg confidence: 90%
Avg latency: 2.75s

  • PASS - Monstera deliciosa, yellowing lower leaves, soil wet 4 days post-water
    Expected: Overwatering / root rot risk
    Predicted: Overwatering / root rot risk (cited)
    Confidence: 91%
    Latency: 2.94s
  • PASS - Fiddle-leaf fig, brown spots with yellow halo, recent move near AC vent
    Expected: Cold draft + fungal stress
    Predicted: Cold draft + fungal stress (cited)
    Confidence: 86%
    Latency: 2.71s
  • PASS - Snake plant, mushy base, leaves falling at touch
    Expected: Advanced root rot
    Predicted: Advanced root rot (cited)
    Confidence: 96%
    Latency: 2.53s
  • PASS - Pothos, pale variegated leaves, low-light corner for 6 weeks
    Expected: Insufficient light
    Predicted: Insufficient light (cited)
    Confidence: 92%
    Latency: 2.64s
  • PASS - Calathea orbifolia, crispy edges, indoor humidity 28%
    Expected: Low humidity stress
    Predicted: Low humidity stress (cited)
    Confidence: 89%
    Latency: 2.82s
  • PASS - ZZ plant, drooping stems, watered weekly past month
    Expected: Overwatering
    Predicted: Overwatering (cited)
    Confidence: 88%
    Latency: 2.89s

Toggle between the v1 baseline and the grounded v2 stack, or raise the confidence gate, to see how the same golden set re-scores. This is the same shape of harness we used on Aarchid to validate the 92% diagnosis accuracy claim before any user saw the model in production.
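
A minimal sketch of that harness in Python, using the six cases listed above. The Case type, the score function, and the rule that a prediction below the confidence gate counts as a miss are assumptions for illustration; the live widget and the Aarchid harness may score gated cases differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    plant: str
    expected: str
    predicted: str
    confidence: float  # model confidence, 0.0 to 1.0
    latency_s: float   # end-to-end latency in seconds


# Golden set copied from the six cases shown on this page.
GOLDEN_SET = [
    Case("Monstera deliciosa", "Overwatering / root rot risk",
         "Overwatering / root rot risk", 0.91, 2.94),
    Case("Fiddle-leaf fig", "Cold draft + fungal stress",
         "Cold draft + fungal stress", 0.86, 2.71),
    Case("Snake plant", "Advanced root rot", "Advanced root rot", 0.96, 2.53),
    Case("Pothos", "Insufficient light", "Insufficient light", 0.92, 2.64),
    Case("Calathea orbifolia", "Low humidity stress",
         "Low humidity stress", 0.89, 2.82),
    Case("ZZ plant", "Overwatering", "Overwatering", 0.88, 2.89),
]


def score(cases, gate=0.0):
    """Re-score a golden set under a confidence gate.

    Assumption: a case passes only if the prediction matches the expected
    label AND the confidence is at or above the gate.
    """
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for c in cases
        if c.predicted == c.expected and c.confidence >= gate
    )
    total = len(cases)
    return {
        "passed": passed,
        "total": total,
        "accuracy": passed / total,
        "avg_confidence": sum(c.confidence for c in cases) / total,
        "avg_latency_s": sum(c.latency_s for c in cases) / total,
    }


if __name__ == "__main__":
    for gate in (0.0, 0.90):
        result = score(GOLDEN_SET, gate=gate)
        print(f"gate={gate:.2f}: {result['passed']}/{result['total']} passed")
```

Raising the gate trades coverage for precision: every case still has a correct prediction, but fewer of them clear the bar, which is the effect the confidence control is meant to expose.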

Cost modelling, in real time

Same harness mindset, applied to economics. Move the sliders to see how batch size, cache hit rate, and request volume reshape the per-user-per-month bill - and whether you stay inside the $0.25 envelope.

Effective / request: $0.00472
Per user / month: $0.038 · in envelope
Monthly run-rate: $189

Per-request breakdown

  • Vision (Gemini 1.5 Pro): $0.00350
  • Retrieval (Exa AI API): $0.00500
  • Embed (cache lookup, on hit): $0.00010
  • Edge worker: $0.0000005

The Aarchid envelope is $0.25 / active user / month. Vision and retrieval dominate the per-request cost - batching vision calls across multiple images (gallery upload, time-lapse) and caching repeat diagnoses by perceptual hash are the two levers that keep us under budget at scale.
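
The per-request math can be sketched in Python from the line items above. The blending rule here is an assumption, not the widget's actual formula: a cache hit pays only the embed lookup and the edge worker, a miss pays every line item, and batching divides the vision cost across the images in the batch. The function names and the example inputs (8 requests per user per month, 50% cache hit rate, batch size 2) are hypothetical.

```python
# Per-request line items in USD, copied from the breakdown on this page.
VISION = 0.00350
RETRIEVAL = 0.00500
EMBED_LOOKUP = 0.00010
EDGE_WORKER = 0.0000005

ENVELOPE = 0.25  # USD per active user per month


def effective_cost_per_request(cache_hit_rate, batch_size=1):
    """Blend hit and miss costs into one expected per-request figure.

    Assumed model: hits skip vision and retrieval entirely; misses pay
    vision (amortised over the batch) plus retrieval.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    hit_cost = EMBED_LOOKUP + EDGE_WORKER
    miss_cost = hit_cost + VISION / batch_size + RETRIEVAL
    return cache_hit_rate * hit_cost + (1.0 - cache_hit_rate) * miss_cost


def monthly_cost_per_user(requests_per_user, cache_hit_rate, batch_size=1):
    """Expected monthly bill for one active user, in USD."""
    per_request = effective_cost_per_request(cache_hit_rate, batch_size)
    return requests_per_user * per_request


def in_envelope(requests_per_user, cache_hit_rate, batch_size=1):
    """True if the monthly per-user cost stays at or under the envelope."""
    cost = monthly_cost_per_user(requests_per_user, cache_hit_rate, batch_size)
    return cost <= ENVELOPE


if __name__ == "__main__":
    cost = monthly_cost_per_user(
        requests_per_user=8, cache_hit_rate=0.5, batch_size=2
    )
    print(f"${cost:.4f} / user / month, in envelope: {cost <= ENVELOPE}")
```

Keeping the line items as named constants makes it cheap to re-run the same check whenever a provider changes its pricing.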

Aarchid - shipped proof

AI Botanical Intelligence · 92% diagnosis accuracy

Co-created with Dilpreet Grover. Multimodal vision (Gemini 1.5 Pro) grounded by research-augmented reasoning (Exa AI API), running on Cloudflare Workers. Sub-10s P95 latency, $0.25 per active user per month at scale.

Read the case study →

On the bench

  • AI PM interview prep kit - deconstructed case questions, eval-harness design, and model economics cheatsheets.
  • Second Aarchid-scale build - applying the same Edge Stack pattern to a different problem domain.
  • Essay series: “The PRD is dead, long live the eval set” - in progress.

Looking for an AI PM who can spec, eval, and ship? Get in touch.