Highlights
- Identify data quality issues for AI
- Design structured and unstructured ingestion
- Implement cleaning and normalisation
- Build schema validation and rejection
- Connect to a real data source
- Automate pipeline scheduling
- Log data lineage end to end
- Test with synthetic bad data
- Monitor pipeline health and drift
- Document data contracts for AI teams
Course Details
- AI failure modes caused by bad data: real case studies showing how upstream data problems destroy downstream model quality
- Ingestion architecture lab: building working connectors for databases, REST APIs, CSV files, and email attachments in a single pipeline
- Cleaning pipeline build: handling nulls, duplicates, encoding errors, and format normalisation with tested, reusable code
- Schema validation workshop: writing contracts, routing bad records to a rejection queue, and alerting on unexpected format changes
- Live source connection lab: participants connect the pipeline to a real or simulated data source relevant to their actual work environment
- Orchestration setup: configuring scheduling, dependency management, and retry logic in Airflow or Prefect with working examples
- Lineage logging: tagging every data record with source, timestamp, and version so any model run is fully traceable back to its input
- Chaos testing: deliberately injecting bad records, missing fields, and schema breaks to verify error handling works as expected
- Pipeline monitoring: dashboards covering run success rates, row counts, processing latency, and early indicators of data drift
- Data contract documentation: writing a clear specification that an AI or ML team can consume without needing to ask you questions
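To give a flavour of the schema validation, lineage logging, and chaos testing sessions, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the course's own material: the contract fields, `CONTRACT_VERSION`, the `process` and `validate` helpers, and the `crm_export.csv` source name are all hypothetical, and a real pipeline would more likely use a dedicated validation library and a durable rejection queue.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

# Hypothetical data contract: field name -> required Python type.
CONTRACT: dict[str, type] = {"customer_id": int, "email": str, "amount": float}
CONTRACT_VERSION = "1.0.0"  # illustrative version label


@dataclass
class PipelineResult:
    accepted: list[dict[str, Any]] = field(default_factory=list)
    rejected: list[dict[str, Any]] = field(default_factory=list)  # stand-in for a rejection queue


def validate(record: dict[str, Any]) -> list[str]:
    """Return the contract violations for a record (an empty list means valid)."""
    errors = []
    for name, expected in CONTRACT.items():
        value = record.get(name)
        if value is None:
            errors.append(f"missing field: {name}")
        elif not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


def process(records: list[dict[str, Any]], source: str) -> PipelineResult:
    """Validate each record, tag it with lineage, and route it to accepted or rejected."""
    result = PipelineResult()
    for record in records:
        # Lineage tag: where the record came from, when, and under which contract.
        lineage = {
            "source": source,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "contract_version": CONTRACT_VERSION,
        }
        errors = validate(record)
        if errors:
            result.rejected.append(
                {"record": record, "errors": errors, "lineage": lineage}
            )
        else:
            result.accepted.append({**record, "_lineage": lineage})
    return result


# Synthetic bad data, in the spirit of the chaos testing module.
sample = [
    {"customer_id": 1, "email": "a@example.com", "amount": 19.99},  # valid
    {"customer_id": "2", "email": "b@example.com", "amount": 5.0},  # wrong type
    {"customer_id": 3, "amount": 7.5},                              # missing field
]
result = process(sample, source="crm_export.csv")
print(len(result.accepted), "accepted,", len(result.rejected), "rejected")
```

Keeping the rejection reasons and the lineage tag alongside each rejected record means a bad batch can be traced back to its source and contract version without re-running the pipeline.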
Who should attend
Data Engineer
Feedback
Average rating: 4.8 out of 5
"Our tailored course provided a well-rounded introduction and also covered some intermediate-level topics that we needed to know. Clive gave us some best-practice ideas and tips to take away. Fast-paced, but the instructor never lost any of the delegates."
Brian Leek, Data Analyst, May 2022