Architecture Overview
Echoes is designed to compare different neural network architectures for video classification through a modular, scalable system.
High-Level Architecture
┌─────────────────────────────────────────────────────────┐
│ Experiment Scripts │
│ (train_simple.py, train_comprehensive.py) │
└────────────┬────────────────────────────────────────────┘
│
├──────────────────────────┬──────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Data Layer │ │Model Layer │ │Training Loop│
├─────────────┤ ├──────────────┤ ├─────────────┤
│ UCF101 │ │ESN │ │Metrics │
│ CIFAR-10 │ │RNN │ │Checkpoints │
│ etc. │ │LSTM/GRU │ │Logging │
└─────────────┘ └──────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────┐
│ MLflow Experiment Tracking │
│ (metrics, params, models, artifacts) │
└─────────────────────────────────────────────────────────┘
│
├──────────────────┬──────────────────┬──────────────────┐
│ │ │ │
▼ ▼ ▼ ▼
TensorBoard MLflow UI Dashboard Plots
(local logs) (experiment view) (architecture viz) (analysis)
Component Overview
Data Layer (data/)
Handles dataset loading, preprocessing, and batching (a short sketch follows this list):
- Dataset classes: UCF101Dataset, VideoFrameDataset
- Dataloaders: Batch management with custom transforms
- Preprocessing: Frame extraction, normalization, augmentation
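A minimal sketch of how a frame dataset and its dataloader might fit together. The constructor arguments, tensor shapes, and the `_load_frames` helper are illustrative assumptions, not the actual `UCF101Dataset`/`VideoFrameDataset` implementation:

```python
# Hypothetical sketch -- the real dataset classes in data/ may differ in signature and layout.
import torch
from torch.utils.data import Dataset, DataLoader

class VideoFrameDataset(Dataset):
    """Loads pre-extracted frames and returns (frames, label) pairs."""

    def __init__(self, root_dir, num_frames=16, transform=None):
        self.root_dir = root_dir
        self.num_frames = num_frames
        self.transform = transform
        self.samples = []  # populated by scanning root_dir for (clip_path, label) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        clip_path, label = self.samples[idx]
        frames = self._load_frames(clip_path)          # (num_frames, C, H, W)
        if self.transform is not None:
            frames = self.transform(frames)            # normalization / augmentation
        return frames, label

    def _load_frames(self, clip_path):
        # Placeholder: real code would decode the video or read frame images from disk.
        return torch.zeros(self.num_frames, 3, 112, 112)

# Batching is handled by a standard PyTorch DataLoader with custom transforms.
loader = DataLoader(VideoFrameDataset("data/ucf101"), batch_size=8, shuffle=True)
```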
Model Layer (models/)
Implements the architectures under comparison (a readout-only ESN sketch follows the table):
| Model | Type | Trainable Params | Use Case |
|---|---|---|---|
| SimpleRNN | Traditional RNN | ~50K-500K | Baseline |
| SimpleESN | Echo State Network | ~10K-50K (readout only) | Research |
| DeepRNN | Multi-layer RNN | ~1M-10M | Complex patterns |
| DeepESN | Multi-layer ESN | ~50K-100K (readout only) | Scalability |
| LSTM | Gated RNN | Comparable to SimpleRNN/DeepRNN | Gradient stability |
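To make the "readout only" parameter counts concrete, here is a hedged sketch of an echo state network layer in PyTorch. The reservoir size, scaling, and leak-free update rule are illustrative; the actual `SimpleESN`/`DeepESN` implementations in `models/` may differ:

```python
# Illustrative ESN sketch -- reservoir weights are fixed, only the readout is trained.
import torch
import torch.nn as nn

class EchoStateNetwork(nn.Module):
    def __init__(self, input_size, reservoir_size, num_classes, spectral_radius=0.9):
        super().__init__()
        # Fixed random reservoir weights, registered as buffers so backprop never updates them.
        w_in = 0.1 * torch.randn(reservoir_size, input_size)
        w_res = torch.randn(reservoir_size, reservoir_size)
        w_res = w_res * (spectral_radius / torch.linalg.eigvals(w_res).abs().max())
        self.register_buffer("w_in", w_in)
        self.register_buffer("w_res", w_res)
        # The linear readout is the only trainable component the optimizer sees.
        self.readout = nn.Linear(reservoir_size, num_classes)

    def forward(self, x):                                   # x: (batch, time, features)
        state = x.new_zeros(x.size(0), self.w_res.size(0))
        for t in range(x.size(1)):
            state = torch.tanh(x[:, t] @ self.w_in.T + state @ self.w_res.T)
        return self.readout(state)                          # classify from the final reservoir state
```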
Training (experiments/)
Experiment orchestration:
- Data loading and preprocessing
- Model instantiation
- Loss computation and backpropagation
- Metrics logging (accuracy, loss, inference time)
- Model checkpointing and artifact storage
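A condensed, hypothetical sketch of one training epoch, assuming a PyTorch model and an active MLflow run; names like `model` and `train_loader` are placeholders, not the actual `train_simple.py` code:

```python
# Hypothetical single-epoch loop -- names and metric keys are illustrative.
import mlflow
import torch
import torch.nn as nn

def train_one_epoch(model, train_loader, optimizer, device, epoch):
    criterion = nn.CrossEntropyLoss()
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for frames, labels in train_loader:
        frames, labels = frames.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(frames)                 # forward pass
        loss = criterion(logits, labels)       # loss computation
        loss.backward()                        # backpropagation
        optimizer.step()                       # weight update
        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    # Log per-epoch metrics to the active MLflow run.
    mlflow.log_metrics(
        {"train_loss": total_loss / seen, "train_accuracy": correct / seen},
        step=epoch,
    )
```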
Tracking (mlruns/)
MLflow stores:
- Params: model size, learning rate, batch size, epochs
- Metrics: loss, accuracy, inference time, memory usage
- Artifacts: trained models, configuration files
- Metadata: tags, notes, timestamps
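A minimal sketch of how a run might log these with the standard `mlflow` Python API; the run name, keys, values, and file paths are examples, not the project's exact schema:

```python
import mlflow

# Illustrative logging calls -- keys and paths are placeholders.
with mlflow.start_run(run_name="simple_esn_ucf101"):
    mlflow.set_tags({"architecture": "SimpleESN", "dataset": "UCF101"})
    mlflow.log_params({"reservoir_size": 1000, "learning_rate": 1e-3,
                       "batch_size": 8, "epochs": 20})
    mlflow.log_metrics({"val_loss": 0.42, "val_accuracy": 0.81}, step=20)
    mlflow.log_artifact("configs/simple_esn.yaml")   # configuration file
    # mlflow.pytorch.log_model(model, "model")       # trained model weights
```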
Storage Architecture
Local Development
echoes/
├── models/ # Model implementations
├── data/ # Dataset loading code
├── experiments/ # Training scripts
├── logs/ # TensorBoard logs (local)
├── mlruns/ # MLflow tracking (local)
└── tfruns/ # TensorFlow/TB runs (local)
Production (Azure)
Headquarters VM (persistent disk)
/mnt/echoes_data/
├── ucf101/ # Dataset (13GB)
├── logs/ # Training logs
├── mlruns/ # MLflow experiments
├── tfruns/ # TensorBoard logs
└── azure_results/ # Downloaded results
Ephemeral experiment VMs automatically:
1. Copy code and dataset from headquarters
2. Run experiments
3. Stream results back to persistent storage
4. Self-destruct to save costs
Data Flow
Training Run
1. Load data from disk
↓
2. Create model (initialize weights)
↓
3. Forward pass → Compute loss
↓
4. Backward pass → Update weights
↓
5. Log metrics to MLflow
↓
6. Save checkpoint to persistent disk
↓
7. Repeat for N epochs
↓
8. Evaluate on test set
↓
9. Save final model and artifacts
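Steps 6–9 above typically reduce to saving a checkpoint each epoch and running a final evaluation pass. A hedged sketch; the checkpoint path and helper names are assumptions:

```python
# Hypothetical checkpoint/evaluation helpers -- paths and names are illustrative.
import mlflow
import torch

def save_checkpoint(model, optimizer, epoch, path="/mnt/echoes_data/checkpoints/last.pt"):
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict()}, path)

@torch.no_grad()
def evaluate(model, test_loader, device):
    model.eval()
    correct, seen = 0, 0
    for frames, labels in test_loader:
        logits = model(frames.to(device))
        correct += (logits.argmax(dim=1) == labels.to(device)).sum().item()
        seen += labels.size(0)
    accuracy = correct / seen
    mlflow.log_metric("test_accuracy", accuracy)
    return accuracy
```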
Experiment Comparison
1. MLflow stores all run data
↓
2. Dashboard queries MLflow API
↓
3. Extract model architectures
↓
4. Render comparison visualizations
↓
5. Display at https://dashboard.lonel.ai
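The dashboard's query step can be reproduced locally with the MLflow client API. A sketch, assuming a local mlruns/ store; the column names are illustrative:

```python
import mlflow

# Query all runs from the local tracking store and compare key metrics.
mlflow.set_tracking_uri("file:./mlruns")
runs = mlflow.search_runs(search_all_experiments=True)   # returns a pandas DataFrame

# Keep only the columns relevant for architecture comparison (names are illustrative).
columns = ["run_id", "params.model_type", "metrics.val_accuracy", "metrics.inference_time"]
print(runs[[c for c in columns if c in runs.columns]].head())
```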
Key Design Principles
Modularity
- Models are independent and can be swapped easily
- Datasets are abstracted and support multiple sources
- Training logic is kept separate from data and model code
Reproducibility
- All hyperparameters logged to MLflow
- Models saved with exact configuration
- Seeds set for deterministic results
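Seed setting is typically a small helper called at the start of each run. A sketch; the project's exact seeding code may differ:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Seed Python, NumPy, and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade speed for determinism in cuDNN convolutions.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```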
Scalability
- Local training for quick iteration
- Azure GPU VMs for serious experiments
- Persistent storage for long-term tracking
Observability
- Comprehensive logging at each step
- Real-time monitoring via TensorBoard
- Experiment comparison via MLflow UI
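Real-time monitoring usually amounts to a `SummaryWriter` pointed at the local logs/ (or tfruns/) directory. A minimal sketch using the standard `torch.utils.tensorboard` API; the run name and scalar values are placeholders:

```python
from torch.utils.tensorboard import SummaryWriter

# TensorBoard picks these scalars up live from the local log directory.
writer = SummaryWriter(log_dir="logs/simple_esn_ucf101")               # run name is illustrative
for epoch, (loss, acc) in enumerate([(1.2, 0.35), (0.9, 0.52)]):       # placeholder values
    writer.add_scalar("train/loss", loss, epoch)
    writer.add_scalar("train/accuracy", acc, epoch)
writer.close()
```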
Next Steps
- Models - Detailed model implementations
- Infrastructure - Cloud setup and deployment
- Running Experiments - How to train models