

Every team building RAG runs into the same problem: a demo that wows the founder and falls apart in production. The cause is almost always a missing eval set, which leaves no way to know whether last Friday's prompt change made things better or worse.
Start with a golden set
Before tuning anything, build 200–500 graded examples that cover your real distribution of questions. Have humans grade outputs on faithfulness, helpfulness, and citation coverage. Without a ground truth, every change feels like progress because the demo still works.
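One way to keep a golden set honest is to store every graded example in a fixed shape and check coverage before trusting it. The sketch below is illustrative only: the `GoldenExample` fields, the 1-5 grading scale, the `category` label, and the `min_per_category` cutoff are assumptions, not requirements of any particular tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    question: str
    answer: str             # the pipeline output that humans graded
    category: str           # hypothetical label for the kind of question
    faithfulness: int       # human grade, assumed 1-5 scale
    helpfulness: int        # human grade, assumed 1-5 scale
    citation_coverage: int  # human grade, assumed 1-5 scale


def coverage_report(examples, min_size=200, min_per_category=10):
    """Report whether the set is big enough and which categories are thin."""
    counts = Counter(ex.category for ex in examples)
    thin = sorted(c for c, n in counts.items() if n < min_per_category)
    return {
        "size": len(examples),
        "large_enough": len(examples) >= min_size,
        "thin_categories": thin,
    }
```

A thin category is a hint that the set does not yet match your real distribution of questions, even if the total count looks healthy.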
Measure the right things
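Track faithfulness, context recall, and answer relevance separately. A pipeline can be perfectly faithful and completely useless if it answers a different question than the user asked. Each metric catches a different failure mode: faithfulness flags claims the retrieved context does not support, context recall flags retrieval that missed the passages needed to answer, and answer relevance flags responses that drift away from the question.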
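As a rough sketch of what "separately" means in practice, the snippet below keeps one average per metric and checks each against its own floor instead of blending them into a single score. It assumes per-example scores between 0 and 1 have already been produced by whatever grader you use; the function names and floor values are hypothetical.

```python
from statistics import mean

METRICS = ("faithfulness", "context_recall", "answer_relevance")


def summarize(scores):
    """Average each metric on its own; scores is a list of per-example dicts."""
    return {m: mean(row[m] for row in scores) for m in METRICS}


def failing_metrics(summary, floors):
    """Return the metrics whose average sits below its own floor."""
    return sorted(m for m in METRICS if summary[m] < floors[m])


# A faithful pipeline that answers the wrong question (made-up numbers).
scores = [
    {"faithfulness": 1.0, "context_recall": 0.9, "answer_relevance": 0.2},
    {"faithfulness": 1.0, "context_recall": 0.7, "answer_relevance": 0.4},
]
floors = {"faithfulness": 0.8, "context_recall": 0.6, "answer_relevance": 0.6}
print(failing_metrics(summarize(scores), floors))
```

A single blended average would hide the problem in this example; the per-metric view points straight at answer relevance.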
Tie eval to deploys
Block PRs on regressions. Fail CI when faithfulness drops below its threshold. Eval drift is silent until users complain, and by then you have already shipped the regression to your most important customer.
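A minimal sketch of such a gate, assuming the eval run has already produced a dictionary of metric averages for the baseline and for the candidate change. The function names, the floors, and the `tolerance` allowance for run-to-run noise are all hypothetical choices.

```python
def regressions(baseline, candidate, floors, tolerance=0.01):
    """List every metric that breaks its floor or drops versus the baseline."""
    problems = []
    for metric, floor in floors.items():
        old, new = baseline[metric], candidate[metric]
        if new < floor:
            problems.append(f"{metric} {new:.3f} is below floor {floor:.3f}")
        elif new < old - tolerance:
            problems.append(f"{metric} dropped from {old:.3f} to {new:.3f}")
    return problems


def exit_code(baseline, candidate, floors, tolerance=0.01):
    """Print each problem and return 1 on any regression, 0 otherwise."""
    problems = regressions(baseline, candidate, floors, tolerance)
    for problem in problems:
        print("EVAL REGRESSION:", problem)
    return 1 if problems else 0
```

In a CI job, a wrapper script would load the two metric files and call `sys.exit(exit_code(baseline, candidate, floors))`; most CI systems treat a nonzero exit status as a failed step, which is what blocks the PR.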
What good looks like
After three months of disciplined eval, your team should be able to answer: which combination of model, prompt, and retrieval won this week, by how much, and on which slice of users. If you cannot answer that, you do not have an eval; you have wishful thinking.
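To make that question concrete, here is a small sketch that takes per-example scores tagged with a configuration and a user slice, then reports the winning configuration and its margin for each slice. The row layout, the configuration labels, and the slice names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def winners_by_slice(rows):
    """rows: iterable of (config, slice_name, score) tuples.

    Returns {slice_name: (best_config, margin_over_runner_up)}; the margin
    is None when only one configuration was evaluated on that slice.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for config, slice_name, score in rows:
        grouped[slice_name][config].append(score)

    report = {}
    for slice_name, by_config in grouped.items():
        ranked = sorted(
            ((mean(scores), config) for config, scores in by_config.items()),
            reverse=True,
        )
        best_score, best_config = ranked[0]
        margin = best_score - ranked[1][0] if len(ranked) > 1 else None
        report[slice_name] = (best_config, margin)
    return report


rows = [
    ("model-a+prompt-v2", "enterprise", 0.8),
    ("model-a+prompt-v2", "enterprise", 0.9),
    ("model-b+prompt-v1", "enterprise", 0.7),
    ("model-b+prompt-v1", "free-tier", 0.6),
]
for slice_name, (config, margin) in winners_by_slice(rows).items():
    print(slice_name, config, margin)
```

A margin computed from a handful of examples is noisy, so treat small gaps as ties until the slice has enough graded examples behind it.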
Author
Aarav Mehta
Writing on web development at Technoblick


