Highlights
- Understand the limits of traditional unit tests
- Define AI evaluation criteria
- Build a representative test dataset
- Implement LLM-as-judge evaluation
- Write deterministic output checks
- Detect and flag hallucinations
- Build a prompt-change regression suite
- Track metrics over time
- Integrate evaluation into CI/CD
- Communicate quality standards clearly
Course Details
Why AI testing is different:
non-determinism, semantic correctness, and exactly where assert statements break down
Defining your rubric:
a workshop to build shared scoring criteria for accuracy, tone, helpfulness, and safety
Test dataset construction:
sourcing real prompts from production logs, writing expected outputs, and covering the important edge cases
LLM-as-judge setup:
configuring a second model to score outputs automatically and understanding when to trust that score
Structured output validation:
JSON schema checks, field-level assertions, and format regression tests that run in milliseconds
Hallucination detection lab:
reference matching techniques, factual grounding checks, and a confidence scoring approach you can implement today
Regression suite build:
running your full evaluation set on every prompt change and alerting when scores drop below a set threshold
Metrics dashboard:
tracking pass rates, average quality scores, and trends over time in a lightweight tool
CI/CD integration:
adding evaluation runs to your existing pipeline and configuring deploy blocks on quality failures
Stakeholder reporting:
translating evaluation scores into plain-language quality summaries that non-technical audiences can act on
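
To give a flavour of the structured output validation and regression suite topics above, here is a minimal Python sketch of a deterministic output check combined with a pass-rate gate. It is an illustration only, not the course material itself: the field names (`answer`, `confidence`), the helper names (`check_output`, `pass_rate`, `gate`), and the 0.9 threshold are all hypothetical.

```python
import json

# Hypothetical required fields and accepted types (illustrative only).
REQUIRED_FIELDS = {"answer": str, "confidence": (int, float)}


def check_output(raw: str) -> list[str]:
    """Return the problems found in one model output; an empty list means pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif isinstance(data[field], bool) or not isinstance(data[field], expected_type):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"wrong type for field: {field}")

    # Field-level assertion: a numeric confidence must lie in [0, 1].
    conf = data.get("confidence")
    if isinstance(conf, (int, float)) and not isinstance(conf, bool):
        if not 0.0 <= conf <= 1.0:
            problems.append("confidence out of range")
    return problems


def pass_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass every deterministic check."""
    if not outputs:
        return 0.0
    passed = sum(1 for raw in outputs if not check_output(raw))
    return passed / len(outputs)


def gate(outputs: list[str], threshold: float = 0.9) -> bool:
    """True if the pass rate meets the (assumed) threshold."""
    return pass_rate(outputs) >= threshold
```

In a pipeline, a script could call `gate` on the outputs collected for the full evaluation set and exit with a non-zero status when it returns `False`, which is one simple way to block a deploy on a quality failure.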
Who should attend
Developers and Engineers
Feedback
4.8 out of 5 average
"Our tailored course provided a well rounded introduction and also covered some intermediate level topics that we needed to know. Clive gave us some best practice ideas and tips to take away. Fast paced but the instructor never lost any of the delegates"
Brian Leek, Data Analyst, May 2022