Highlights
- Compare self-hosted versus API models
- Select the right open-source model
- Install and run Ollama or vLLM
- Quantise models for available hardware
- Expose a local API endpoint
- Secure the endpoint internally
- Monitor GPU and memory usage
- Update models without disruption
- Benchmark against cloud quality
- Document deployment for handover
Course Details
- Build versus buy analysis: total cost of ownership, data control, latency, and capability gaps between hosted and self-hosted options
- Model selection workshop: Llama, Mistral, Phi, and others compared on benchmark scores and hardware requirements for your infrastructure
- Ollama and vLLM install lab: getting a chosen model running on target hardware with a working endpoint in under one hour
- Quantisation explained and applied: GGUF formats, Q4 versus Q8 tradeoffs, and live measurement of quality impact on your test prompts
- Local API setup: configuring an OpenAI-compatible endpoint so existing application code requires minimal or no changes
- Network security lab: binding to internal interfaces only, firewall rule configuration, and role-based access control setup
- Resource monitoring setup: dashboards tracking GPU utilisation, memory pressure, and request throughput with alert thresholds
- Model update process: pulling new versions safely, running quality tests before promoting, and a documented rollback procedure
- Quality benchmarking: running your evaluation test set against the self-hosted model and a cloud baseline side by side
- Handover documentation: writing a runbook covering deployment steps, monitoring response, update procedure, and incident handling
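
As a taste of the quantisation session, the Q4 versus Q8 tradeoff can be roughed out with simple arithmetic before any model is downloaded. The sketch below is illustrative only and is not course material: the function names `weight_memory_gib` and `fits`, and the 20% headroom figure, are assumptions made for this example. It counts weights alone at a nominal bit width. Real GGUF files store scaling data alongside the weights, so effective bits per weight run somewhat higher, and the KV cache and activations need memory on top.

```python
def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB.

    Assumes every parameter is stored at exactly `bits_per_weight` bits,
    which understates real quantised file sizes slightly.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billions: float, bits_per_weight: float,
         vram_gib: float, headroom: float = 0.2) -> bool:
    """Rough check: do the weights fit while leaving `headroom` free?

    The headroom fraction (hypothetical default of 20%) stands in for the
    KV cache, activations, and runtime overhead.
    """
    budget = vram_gib * (1 - headroom)
    return weight_memory_gib(params_billions, bits_per_weight) <= budget


if __name__ == "__main__":
    # Compare nominal bit widths for a 7-billion-parameter model.
    for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
        size = weight_memory_gib(7, bits)
        verdict = "fits" if fits(7, bits, vram_gib=8) else "does not fit"
        print(f"7B at {label}: about {size:.1f} GiB of weights, {verdict} in 8 GiB")
```

An estimate like this only narrows the shortlist. Whether the lower-precision model is good enough still has to be measured on your own test prompts.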
Who should attend
Feedback
4.8 out of 5 average
"Our tailored course provided a well-rounded introduction and also covered some intermediate-level topics that we needed to know. Clive gave us some best practice ideas and tips to take away. Fast paced but the instructor never lost any of the delegates."
Brian Leek, Data Analyst, May 2022