Everything you need to know to score robot behavior.
We are training a humanoid robot to follow instructions like a good human worker. The dataset consists of head-mounted camera videos of the humanoid doing various tasks — folding clothes, picking up objects, cleaning the kitchen, and many others. Each video contains many instructions, with at most one instruction active at any time.
The robot will learn to maximize its score, so your scores directly shape its behavior.
Each episode is segmented into labels — each label has a start frame, end frame, and a language instruction (e.g. "pick up the red block").
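Concretely, a label can be thought of as a small record. The field names below are illustrative, not the tool's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Label:
    """One segment of an episode (hypothetical field names)."""
    start_frame: int             # first frame where the instruction is active
    end_frame: int               # last frame covered by the instruction
    instruction: str             # natural-language command for the robot
    score: Optional[int] = None  # 0-5 once reviewed, None while unreviewed

label = Label(start_frame=120, end_frame=340, instruction="pick up the red block")
print(label.score is None)  # True: not yet reviewed
```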
For each label you do two things: check that the instruction makes sense, then score the robot's behavior.
The instructions were generated automatically, so some of them won't be right. Before scoring, ask yourself one question:
Would it make sense to give this instruction to the robot in this situation?
If the answer is yes — even if the robot completely fails to do it — leave the instruction as-is and give it a low score. We need those low scores to teach the robot that ignoring a valid instruction is bad. Only edit or reject when the instruction itself doesn't make sense.
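The order of those two checks matters, and can be sketched as follows (the helper names are hypothetical, not the tool's API):

```python
def review(instruction_is_plausible: bool, robot_score: int):
    """Illustrative review logic: plausibility first, then scoring.

    robot_score: how well the robot performed, 0 (dangerous) to 5 (impressive).
    """
    if not instruction_is_plausible:
        # The instruction itself doesn't make sense here: edit or reject it.
        return "reject_or_edit"
    # The instruction is valid, so it always gets a score -- a failed
    # attempt at a valid instruction gets a low score, not a rejection.
    return robot_score

print(review(True, 1))   # valid instruction, robot ignored it -> 1
print(review(False, 5))  # implausible instruction -> reject_or_edit
```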
Examples:
- "Pick up the cup" while a cup sits on the table in front of the robot: plausible. Keep it, and score however well the robot does — even a 1 if it ignores the cup entirely.
- "Open the dishwasher" when no dishwasher is visible anywhere in the scene: not plausible. Edit or reject it.
Check that the start and end frames match the task — the label should cover exactly the action described by the instruction, no more.
Use i to set the start frame and o to set the end frame to the current video position.
Imagine you hired this robot as a helper and gave it the instruction. Rate its performance from 0 to 5 stars — like leaving a star review for a service you paid for. Trust your gut feeling. Don't overthink it.
| Score | Meaning | Description |
|---|---|---|
| 0 | Dangerous | The robot does something catastrophic — injures a person, destroys something valuable, or creates a dangerous situation. This score exists so we can find and eliminate dangerous behavior |
| 1 | Terrible | You'd be annoyed or angry. The robot ignores the instruction entirely, breaks something, or does something you really don't want |
| 2 | Bad | A poor attempt. Wrong object, clumsy execution, mostly fails at the goal. You're not happy |
| 3 | Mediocre | Meh. It sort of did the thing, but you're not impressed. Noticeable issues or inefficiency |
| 4 | Good | Solid performance. You're satisfied, but you can think of something you wish it did better |
| 5 | Impressive | You're genuinely impressed. The robot does exactly what you want — you wouldn't change a thing |
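The rubric above can be encoded as a simple lookup for validating scores (the names are illustrative):

```python
SCORE_MEANINGS = {
    0: "Dangerous",
    1: "Terrible",
    2: "Bad",
    3: "Mediocre",
    4: "Good",
    5: "Impressive",
}

def validate_score(score: int) -> str:
    """Return the meaning of a score, rejecting anything outside 0-5."""
    if score not in SCORE_MEANINGS:
        raise ValueError(f"score must be an integer from 0 to 5, got {score!r}")
    return SCORE_MEANINGS[score]

print(validate_score(4))  # Good
```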
| Key | Action |
|---|---|
| / | Edit the instruction text (marks the original as not plausible) |
| i | Set start frame to current video position |
| o | Set end frame to current video position |
| u | Undo your last structural edit |
| r | Reject label — mark instruction as not plausible and skip |
| c | Create a new label at the current frame (use i and o to adjust the frames, / to edit the text) |
Edits create a new version of the label (the original is preserved).
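That versioning behavior can be sketched as an append-only history, where edits push new versions and undo pops them (a hypothetical model, not the tool's implementation):

```python
class VersionedLabel:
    """Keeps every version of a label; the original is never overwritten."""

    def __init__(self, instruction: str, start: int, end: int):
        self.versions = [(instruction, start, end)]  # index 0 is the original

    @property
    def current(self):
        return self.versions[-1]

    def edit(self, instruction=None, start=None, end=None):
        """Append a new version, carrying over any unchanged fields."""
        cur_instruction, cur_start, cur_end = self.current
        self.versions.append((
            instruction if instruction is not None else cur_instruction,
            start if start is not None else cur_start,
            end if end is not None else cur_end,
        ))

    def undo(self):
        """Drop the latest edit, but never the original."""
        if len(self.versions) > 1:
            self.versions.pop()

label = VersionedLabel("pick up the block", 120, 340)
label.edit(instruction="pick up the red block")  # a '/' style edit
label.undo()                                     # 'u' restores the previous version
print(label.current)  # back to the original
```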
| Key | Action |
|---|---|
| Space / k | Play / pause |
| Shift+S / Shift+D | Slow down / speed up playback |
| 0 - 5 | Score the selected label |
| r | Reject label (not plausible) |
| ! | Add reviewer note (flags for admin review) |
| c | Create new label at current frame |
| n / Tab | Next unreviewed label |
| p | Previous unreviewed label |
| b | Seek to start of current label |
| a | Toggle auto-pause at label end |
| h / l | Step back / forward 1 frame |
| H / L | Step back / forward 1 second |
| i / o | Set start / end frame |
| / | Edit instruction text |
| u | Undo last edit |
| ? | Show shortcut overlay in viewer |
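The bindings above amount to a key-to-action map; a hypothetical dict like this could drive a dispatcher (action names are made up for illustration):

```python
KEYMAP = {
    " ": "play_pause", "k": "play_pause",
    # All six digit keys feed the same scoring action.
    **{str(d): "score" for d in range(6)},
    "r": "reject_label",
    "!": "add_reviewer_note",
    "c": "create_label",
    "n": "next_unreviewed", "\t": "next_unreviewed",
    "p": "previous_unreviewed",
    "b": "seek_label_start",
    "a": "toggle_auto_pause",
    "h": "step_back_frame", "l": "step_forward_frame",
    "H": "step_back_second", "L": "step_forward_second",
    "i": "set_start_frame", "o": "set_end_frame",
    "/": "edit_instruction",
    "u": "undo_edit",
    "?": "show_shortcut_overlay",
}

print(KEYMAP["n"])  # next_unreviewed
```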
Open these example episodes to see reviewed labels with scoring notes explaining why each score was given. Click any label in the timeline to see its notes in the side panel.