What Imitation Learning Actually Does

Before running the training command, take two minutes to understand what the model is actually learning. Imitation learning trains a policy network to map observations (camera images + current joint state) to actions (next joint angles). The network never receives a reward signal — it only sees your demonstrations and learns to reproduce the distribution of actions you performed in similar states.
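Concretely, this is plain supervised regression: minimize the error between the action the policy predicts and the action you demonstrated. A toy sketch (all names here are illustrative, not the LeRobot API):

```python
# Toy behavior-cloning objective: regress demonstrated actions from
# observations. No reward signal appears anywhere; the only learning
# signal is the gap between predicted and demonstrated actions.

def mse_loss(predicted, target):
    """Mean squared error between predicted and demonstrated actions."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def policy(observation, weights):
    """Toy linear policy: maps an observation vector to joint angles."""
    return [sum(w * o for w, o in zip(row, observation)) for row in weights]

# One demonstration pair: observation -> action the human performed.
obs = [0.5, -0.2, 1.0]
demo_action = [0.3, 0.1]
weights = [[0.1, 0.0, 0.2], [0.0, 0.5, 0.0]]

pred = policy(obs, weights)
loss = mse_loss(pred, demo_action)  # gradient descent on this is all BC does
```

The real network is a transformer over camera images and joint states rather than a linear map, but the objective has this same shape.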

ACT (Action Chunking with Transformers) predicts a chunk of 100 future actions at once rather than a single step. This mitigates compounding error across the episode: even if an individual prediction is slightly off, the chunk provides a stable trajectory to follow. The policy then re-plans every 100 timesteps (2 seconds at the 50 Hz control rate). This is why ACT handles longer tasks better than plain behavior cloning.
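The chunk-then-re-plan loop can be sketched in a few lines (function names are illustrative stand-ins, not the LeRobot API):

```python
# Sketch of ACT-style chunked execution: the policy predicts CHUNK_SIZE
# actions in one forward pass, the controller executes them open-loop,
# then re-plans from the next observation.

CHUNK_SIZE = 100   # actions per prediction
CONTROL_HZ = 50    # so one chunk covers 2 seconds of motion

def predict_chunk(observation):
    """Stand-in for the ACT forward pass: returns CHUNK_SIZE actions."""
    return [observation] * CHUNK_SIZE

def rollout(initial_obs, episode_len):
    actions_taken = []
    obs = initial_obs
    for t in range(0, episode_len, CHUNK_SIZE):
        chunk = predict_chunk(obs)           # one network call per chunk
        for action in chunk[: episode_len - t]:
            actions_taken.append(action)     # execute without re-querying
        obs = actions_taken[-1]              # re-plan from the latest state
    return actions_taken

trace = rollout(0.0, 250)   # a 250-step episode needs 3 network calls, not 250
```

Compare this to single-step behavior cloning, where every timestep both queries the network and compounds whatever error the previous prediction introduced.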

For the full theoretical background, read Imitation Learning Fundamentals in the Robotics Library.

GPU or CPU?

Training on an NVIDIA GPU with 8GB+ VRAM takes approximately 45 minutes for 100k steps. Training on CPU takes 3–4 hours for the same run. Both produce equivalent model quality — GPU is just faster. If you don't have a local GPU, the training command works identically on a cloud instance (Lambda Labs or Google Colab with A100 runtime). Instructions are in the README of the LeRobot repo.

Train ACT on Your Dataset

Run the training script from your virtual environment. The config values below are calibrated for 50-episode pick-and-place datasets on OpenArm — do not change them for your first run:

source ~/openarm-env/bin/activate

python -m lerobot.scripts.train \
  --dataset-path ~/openarm-datasets/pick-and-place \
  --policy act \
  --batch-size 8 \
  --lr 1e-5 \
  --num-train-steps 100000 \
  --eval-freq 5000 \
  --save-freq 10000 \
  --log-freq 500 \
  --output-dir ~/openarm-policies/pick-and-place-v1

# Training will print loss every 500 steps and eval results every 5000 steps
# Checkpoints saved every 10k steps to ~/openarm-policies/pick-and-place-v1/

Start training, then monitor the output. You do not need to watch it the entire time — but check back every 20–30 minutes to confirm the loss is decreasing and the run has not crashed. Training can run overnight while you sleep.

Understanding Training Curves

ACT's training output shows two key metrics. Learn to read them correctly — they tell you whether your training is healthy and when to stop.

Training Loss

Should decrease steeply in the first 20k steps, then continue decreasing more slowly. A loss that plateaus above 0.05 usually indicates data quality problems — check your dataset. A loss that oscillates widely suggests your learning rate is too high.
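The two failure signatures just described can be checked mechanically on a window of recent loss values; a rough sketch (thresholds and names are illustrative, not part of the training script):

```python
# Heuristic health check on a window of recent training-loss values:
# a high flat loss points at the data, a widely swinging loss points
# at the learning rate.

def diagnose(recent_losses, plateau_floor=0.05, osc_ratio=0.5):
    lo, hi = min(recent_losses), max(recent_losses)
    mean = sum(recent_losses) / len(recent_losses)
    spread = (hi - lo) / mean          # relative variation in the window
    if mean > plateau_floor and spread < 0.05:
        return "plateau above 0.05: inspect the dataset"
    if spread > osc_ratio:
        return "wide oscillation: learning rate likely too high"
    return "healthy"

print(diagnose([0.03, 0.028, 0.027, 0.026]))   # steadily decreasing: healthy
```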

Eval Success Rate

Appears every 5k steps (requires a physical arm or sim). This is the number that actually matters. You want this above 70% before deploying. It often lags the training loss — the loss can look good while success rate is still improving.

Action MSE

Mean squared error between predicted and ground-truth actions. Should drop below 0.01 for a well-trained pick-and-place policy. High action MSE after 80k steps means the model is struggling with the task complexity or your data is inconsistent.

KL Divergence (ACT-specific)

ACT trains a CVAE whose loss adds a KL-divergence term, weighted by 10 in the reference implementation. Watch for this term stabilizing around step 40k. If it never settles, the latent is failing to encode demonstration style; adding more (or more consistent) data usually helps.
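For reference, a minimal sketch of the CVAE objective: action-reconstruction error plus a weighted KL term pulling the latent toward a standard normal. Variable names are illustrative, and beta is shown fixed at 10, the commonly used default:

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dims."""
    return sum(
        0.5 * (math.exp(lv) + m * m - 1.0 - lv)
        for m, lv in zip(mu, log_var)
    )

def act_loss(recon_error, mu, log_var, beta=10.0):
    """Reconstruction term plus beta-weighted KL regularizer."""
    return recon_error + beta * kl_to_standard_normal(mu, log_var)

# A latent that already matches N(0, 1) contributes zero KL:
total = act_loss(recon_error=0.42, mu=[0.0, 0.0], log_var=[0.0, 0.0])
```

The KL number the training log prints is this regularizer term; "stabilizing" means it stops trending, not that it reaches zero.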

When to Stop Training

Do not simply run to 100k steps and stop. Use these signals to decide when your checkpoint is ready for deployment:

  • Eval success rate has plateaued for 3 consecutive evaluations — the model has converged. Further training will not help without more or different data.
  • Eval success rate is above 70% — this is the threshold for Unit 6 deployment. If you hit 70% at 60k steps, you can stop early and deploy that checkpoint.
  • Training loss is still decreasing but eval is flat or declining — the model is overfitting. Take the last checkpoint where eval was at its peak. This is the best checkpoint.
  • After 100k steps — if success rate is below 40%, go back to Unit 4. A data problem is more likely than a training problem at this point.
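The stopping rules above can be expressed as a small checkpoint-selection sketch. The 70% threshold and three-evaluation patience come from the bullets; everything else (names, the exact-equality plateau test) is illustrative:

```python
# Decide which checkpoint to deploy from a history of eval results.
# eval_history maps training step -> eval success rate.

def pick_checkpoint(eval_history, target=0.70, patience=3):
    steps = sorted(eval_history)
    rates = [eval_history[s] for s in steps]
    best_step = max(steps, key=lambda s: eval_history[s])
    # Stop early as soon as the deployment threshold is hit...
    for s in steps:
        if eval_history[s] >= target:
            return s, "deploy: hit 70% success"
    # ...or once eval has been flat for `patience` evaluations in a row.
    if len(rates) >= patience and len(set(rates[-patience:])) == 1:
        return best_step, "converged: take the best checkpoint"
    return best_step, "keep training (or revisit the data if < 40%)"

step, reason = pick_checkpoint({50000: 0.55, 55000: 0.62, 60000: 0.71})
```

In practice "flat" means within noise rather than exactly equal, but the decision logic is the same: track the peak-eval checkpoint and deploy that one, not simply the last.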
Optional Deep-Dives

Beyond ACT — Diffusion Policy and π₀

Once you have a working ACT policy, the natural next experiment is Diffusion Policy. It handles multi-modal tasks better (e.g., the arm can approach the object from two angles) at the cost of slower inference. The SVRC Research section covers both. Browse research articles →

Unit 5 Complete When...

Training has completed (or you have stopped it at a good checkpoint). Your eval success rate is above 70% on the pick-and-place task. You have a saved checkpoint at ~/openarm-policies/pick-and-place-v1/ and you know which step number produced your best result. You are ready to put this policy on the real arm in Unit 6.