Infrastructure
Overview of the hardware and cloud setup for Echoes.
Headquarters VM (Persistent)
The main Azure VM serves as the central hub.
Specifications:
- OS: Ubuntu 20.04 LTS
- Size: Standard_D4s_v3 (4 vCPU, 16GB RAM)
- OS Disk: 29GB (ephemeral - code and environments)
- Data Disk: 118GB (persistent - /mnt/echoes_data/)
Running Services:
- nginx (reverse proxy)
- MLflow server
- TensorBoard server
- Custom dashboard server
- Code repository
Storage Mounts:
```
/home/aclarke/echoes/          # Code repository
├── experiments/               # Training scripts
├── models/                    # Model implementations
├── scripts/                   # Utilities (Azure runner, deploy, etc.)
├── logs → /mnt/echoes_data/logs/
├── mlruns → /mnt/echoes_data/mlruns/
└── tfruns → /mnt/echoes_data/tfruns/

/mnt/echoes_data/              # Persistent data disk
├── ucf101/                    # Dataset (13GB, 13,320 videos)
├── logs/                      # TensorBoard logs
├── mlruns/                    # MLflow experiment data
├── tfruns/                    # TensorFlow/TB runs
└── azure_results/             # Downloaded results from experiment VMs
```
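Recreating the repository symlinks after re-attaching the data disk is a one-off step; a minimal sketch using the paths above (the helper itself is hypothetical, not part of scripts/):

```python
"""Recreate the repo -> data-disk symlinks after re-attaching the disk.

Hypothetical helper (not part of scripts/); paths follow the layout above,
and Path.home() assumes the script runs as the repo owner.
"""
from pathlib import Path

REPO = Path.home() / "echoes"
DATA = Path("/mnt/echoes_data")

for name in ("logs", "mlruns", "tfruns"):
    target = DATA / name
    link = REPO / name
    target.mkdir(parents=True, exist_ok=True)  # make sure the target exists
    if link.is_symlink():
        link.unlink()                          # drop a stale link
    if not link.exists():
        link.symlink_to(target, target_is_directory=True)
        print(f"{link} -> {target}")
```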
Experiment VMs (Ephemeral)
Temporary GPU VMs for running experiments.
Provisioning:
```bash
python scripts/azure_gpu_runner.py experiments/train_comprehensive.py --vm-size Standard_NC6s_v3
```
VM Lifecycle:
```
1. Create VM
   ├─ Configure networking
   ├─ Assign public IP
   └─ Create/attach data disk
2. Environment Setup
   ├─ Install dependencies (conda, etc.)
   ├─ Copy code from headquarters
   └─ Download dataset from persistent disk
3. Run Experiment
   ├─ Train model on GPU
   ├─ Log metrics to MLflow
   └─ Save artifacts to persistent disk
4. Result Collection
   ├─ Download artifacts locally
   └─ Update headquarters persistent storage
5. Cleanup
   ├─ Delete VM
   ├─ Release public IP
   ├─ Delete OS disk (saves costs)
   └─ Keep data disk results
```
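scripts/azure_gpu_runner.py drives this lifecycle; as a rough sketch of the create/run/delete loop (the resource group, VM name, image, and admin user below are illustrative assumptions, not the script's actual values, and the az CLI is assumed to be installed and logged in):

```python
"""Stripped-down sketch of the experiment-VM lifecycle (create -> run -> delete)."""
import subprocess

RG, VM, SIZE = "echoes-experiments", "echoes-exp-01", "Standard_NC6s_v3"

def az(*args: str) -> str:
    """Run an az CLI command and return its stdout."""
    result = subprocess.run(["az", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout

# 1. Create the VM; az handles networking, the public IP, and disks.
#    --os-disk-delete-option Delete removes the OS disk with the VM.
az("vm", "create", "--resource-group", RG, "--name", VM,
   "--size", SIZE, "--image", "Ubuntu2204",
   "--admin-username", "azureuser", "--generate-ssh-keys",
   "--os-disk-delete-option", "Delete")

# 2-3. Environment setup and training happen over SSH
#      (code copy and dataset download omitted here).
ip = az("vm", "show", "-d", "-g", RG, "-n", VM,
        "--query", "publicIps", "-o", "tsv").strip()
subprocess.run(["ssh", "-o", "StrictHostKeyChecking=accept-new",
                f"azureuser@{ip}",
                "python experiments/train_comprehensive.py"], check=True)

# 4-5. After results are collected, delete the VM; the OS disk goes with it
#      (delete option above). The real script also releases the public IP
#      and keeps the data disk with results.
az("vm", "delete", "--resource-group", RG, "--name", VM, "--yes")
```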
Available Sizes:
| Size | GPU | vCPU | RAM | Cost/hr |
|---|---|---|---|---|
| Standard_NC6s_v3 | 1x V100 | 6 | 112GB | ~$0.94 |
| Standard_NC12s_v3 | 2x V100 | 12 | 224GB | ~$1.88 |
| Standard_NC24s_v3 | 4x V100 | 24 | 448GB | ~$3.76 |
| Standard_B1s | None | 1 | 1GB | ~$0.01 |
Recommended:
- Standard_NC6s_v3 - Good balance of cost and performance
- Standard_B1s - For quick testing (no GPU)
Networking
Domain Names
```
lonel.ai                     # Main domain
├── mlflow.lonel.ai          # MLflow UI (port 5000)
├── tensorboard.lonel.ai     # TensorBoard (port 6006)
├── dashboard.lonel.ai       # Model comparison dashboard
└── docs.lonel.ai            # Documentation (this site)
```
SSL/TLS Certificates
- Provider: Let's Encrypt
- Renewal: Automatic via certbot systemd timer
- Validity: 90 days per certificate
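Renewal is hands-off, but the remaining validity can be spot-checked from any machine; a stdlib-only sketch:

```python
"""Report days until a host's TLS certificate expires (stdlib only)."""
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like 'Jun  1 12:00:00 2025 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (expires.replace(tzinfo=timezone.utc)
            - datetime.now(timezone.utc)).days

print(days_until_expiry("mlflow.lonel.ai"), "days left on the certificate")
```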
Reverse Proxy
nginx forwards requests:
```
Client Request
      ↓
nginx (port 443, HTTPS)
 ├─ mlflow.lonel.ai      → localhost:5000 (MLflow UI)
 ├─ tensorboard.lonel.ai → localhost:6006 (TensorBoard)
 ├─ dashboard.lonel.ai   → localhost:8000 (Dashboard)
 └─ docs.lonel.ai        → localhost:8080 (Documentation)
      ↓
Local services (ports 5000-8080)
```
Authentication
- HTTP Basic Auth for MLflow and TensorBoard
- Credentials stored in /etc/nginx/.htpasswd
Deployment Architecture
Services
MLflow Server
```
systemd service: mlflow.service
 ├─ Runs: mlflow ui --host 0.0.0.0 --port 5000
 ├─ Data: /mnt/echoes_data/mlruns/
 ├─ Logs: systemd journal
 └─ nginx proxy: mlflow.lonel.ai
```
TensorBoard
```
systemd service: tensorboard.service
 ├─ Runs: tensorboard --logdir=/mnt/echoes_data/tfruns/
 ├─ Port: 6006
 ├─ Data: /mnt/echoes_data/tfruns/
 ├─ Logs: systemd journal
 └─ nginx proxy: tensorboard.lonel.ai
```
Dashboard
```
systemd service: dashboard.service
 ├─ Serves: Static HTML/JS from /var/www/dashboard/
 ├─ Port: 8000
 ├─ Data Source: MLflow API
 ├─ Logs: systemd journal
 └─ nginx proxy: dashboard.lonel.ai
```
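The dashboard pulls its comparison data through the MLflow client API; a hedged sketch of that kind of query (the experiment name, metric key, and credentials are illustrative placeholders):

```python
"""Fetch run data from the MLflow server, roughly as the dashboard might."""
import os
import mlflow

# Basic-auth credentials for the nginx proxy (see Authentication above).
os.environ["MLFLOW_TRACKING_USERNAME"] = "mlflow"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "********"

mlflow.set_tracking_uri("https://mlflow.lonel.ai")

# One row per run, returned as a pandas DataFrame.
runs = mlflow.search_runs(
    experiment_names=["ucf101-comprehensive"],
    order_by=["metrics.val_accuracy DESC"],
)
print(runs[["run_id", "metrics.val_accuracy"]].head())
```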
Documentation
```
systemd service: docs.service
 ├─ Serves: Static HTML from /var/www/docs/site/
 ├─ Port: 8080
 ├─ Data: Built MkDocs output
 ├─ Logs: systemd journal
 └─ nginx proxy: docs.lonel.ai
```
Deployment Script
scripts/local_deploy.sh automates deployment of all services:
```bash
# Deploy all services
./scripts/local_deploy.sh

# Deploy a specific service
./scripts/local_deploy.sh --service mlflow
./scripts/local_deploy.sh --service tensorboard
./scripts/local_deploy.sh --service dashboard
./scripts/local_deploy.sh --service docs

# Validate deployment
./scripts/local_deploy.sh --validate
```
Data Flow
Training Experiment
```
┌─────────────────────────────────────────┐
│ Create Experiment VM on Azure           │
└──────────────────┬──────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ Copy code + download dataset            │
│ (from headquarters persistent disk)     │
└──────────────────┬──────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ Run training on GPU                     │
│ Log to MLflow                           │
│ Save checkpoints to persistent disk     │
└──────────────────┬──────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ Download results to headquarters        │
│ (via persistent disk share)             │
└──────────────────┬──────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ Delete experiment VM (save costs!)      │
└──────────────────┬──────────────────────┘
                   ↓
┌─────────────────────────────────────────┐
│ View results in MLflow/Dashboard        │
│ at mlflow.lonel.ai                      │
└─────────────────────────────────────────┘
```
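On the experiment-VM side, the training script only needs to point MLflow at headquarters for this flow to work; a minimal sketch (the experiment, run, and metric names are placeholders, and basic-auth credentials go in the MLFLOW_TRACKING_USERNAME / MLFLOW_TRACKING_PASSWORD environment variables as shown earlier):

```python
"""Minimal MLflow hookup for a training run on an experiment VM."""
import mlflow

def train_one_epoch(epoch: int) -> float:
    """Stand-in for the real training loop; returns a fake accuracy."""
    return 0.50 + 0.01 * epoch

mlflow.set_tracking_uri("https://mlflow.lonel.ai")
mlflow.set_experiment("ucf101-comprehensive")  # created if missing

with mlflow.start_run(run_name="nc6sv3-baseline"):
    mlflow.log_params({"vm_size": "Standard_NC6s_v3", "epochs": 30})
    for epoch in range(30):
        mlflow.log_metric("val_accuracy", train_one_epoch(epoch), step=epoch)
```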
Cost Optimization
Headquarters
- Always running (~$150-200/month)
- Persistent data disk storage adds ~$50/month
Experiment VMs
- Create on-demand only
- Automatically destroy after experiments
- Standard_NC6s_v3 at ~$0.94/hr ≈ $11 per 12-hour experiment
Example: 1 week of experiments at 12 GPU-hours/day:
- 7 days × 12 hours × $0.94/hr ≈ $79
- Much cheaper than an always-on GPU VM
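The same arithmetic generalizes to any size in the table; a small helper using those approximate rates:

```python
"""Estimate experiment cost from the approximate rates in the table above."""
RATES = {  # USD per hour
    "Standard_NC6s_v3": 0.94,
    "Standard_NC12s_v3": 1.88,
    "Standard_NC24s_v3": 3.76,
    "Standard_B1s": 0.01,
}

def experiment_cost(size: str, hours_per_day: float, days: int) -> float:
    return RATES[size] * hours_per_day * days

print(f"${experiment_cost('Standard_NC6s_v3', 12, 7):.0f}")  # -> $79
```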
Troubleshooting
Check Service Status
```bash
sudo systemctl status mlflow
sudo systemctl status tensorboard
sudo systemctl status dashboard
sudo systemctl status docs
```
View Logs
```bash
sudo journalctl -u mlflow -f       # MLflow logs
sudo journalctl -u tensorboard -f  # TensorBoard logs
```
Restart Services
```bash
sudo systemctl restart mlflow
sudo systemctl restart tensorboard
sudo systemctl restart dashboard
sudo systemctl restart docs

# Or use the deployment script
./scripts/local_deploy.sh --restart-all
```
Verify Web Services
```bash
curl -u mlflow:password https://mlflow.lonel.ai
curl https://docs.lonel.ai
```
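The same checks can be scripted across all four endpoints; a stdlib-only sketch (credentials are placeholders):

```python
"""Ping every public endpoint and report its HTTP status (stdlib only)."""
import base64
import urllib.request

# Endpoint -> (user, password) for basic auth, or None for open endpoints.
CHECKS = {
    "https://mlflow.lonel.ai": ("mlflow", "password"),
    "https://tensorboard.lonel.ai": ("tensorboard", "password"),
    "https://dashboard.lonel.ai": None,
    "https://docs.lonel.ai": None,
}

for url, auth in CHECKS.items():
    req = urllib.request.Request(url)
    if auth:
        token = base64.b64encode(f"{auth[0]}:{auth[1]}".encode()).decode()
        req.add_header("Authorization", f"Basic {token}")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{url}: {resp.status}")
    except Exception as exc:  # auth failures, DNS, TLS, timeouts
        print(f"{url}: FAILED ({exc})")
```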
Next Steps
- Deployment Guide - How to deploy services
- Running Experiments - How to run on Azure VMs