Everything you need to know to score robot behavior.
We are training a humanoid robot to follow instructions like a good human worker. The dataset consists of head-mounted camera videos of the humanoid doing various tasks — folding clothes, picking up objects, cleaning the kitchen, and many others. Each video contains many instructions, with at most one instruction active at any time.
The robot will learn to maximize its score, so your scores directly shape its behavior.
Each episode is segmented into labels — each label has a start frame, end frame, and a language instruction (e.g. "pick up the red block").
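Concretely, a label can be thought of as a small record. The field names below are illustrative, not the tool's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Label:
    """One segment of an episode (hypothetical field names)."""
    start_frame: int             # first frame where the instruction is active
    end_frame: int               # last frame covered by the instruction
    instruction: str             # natural-language command for the robot
    score: Optional[int] = None  # 0-5 once reviewed, None while unreviewed

label = Label(start_frame=120, end_frame=340, instruction="pick up the red block")
print(label.score is None)  # True: not yet reviewed
```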
For each label you do two things: check that the instruction makes sense, then score the robot's behavior.
The instructions were generated automatically, so some of them won't be right. Before scoring, ask yourself one question:
Would it make sense to give this instruction to the robot in this situation?
If the answer is yes — even if the robot completely fails to do it — leave the instruction as-is and give it a low score. We need those low scores to teach the robot that ignoring a valid instruction is bad. Only edit or reject when the instruction itself doesn't make sense.
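The order of those two checks matters, and can be sketched as follows (the helper names are hypothetical, not the tool's API):

```python
def review(instruction_is_plausible: bool, robot_score: int):
    """Illustrative review logic: plausibility first, then scoring.

    robot_score: how well the robot performed, 0 (dangerous) to 5 (impressive).
    """
    if not instruction_is_plausible:
        # The instruction itself doesn't make sense here: edit or reject it.
        return "reject_or_edit"
    # The instruction is valid, so it always gets a score -- a failed
    # attempt at a valid instruction gets a low score, not a rejection.
    return robot_score

print(review(True, 1))   # valid instruction, robot ignored it -> 1
print(review(False, 5))  # implausible instruction -> reject_or_edit
```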
Examples:
- "Pick up the cup" while a cup sits on the table in front of the robot: plausible. Keep it, and score however well the robot does — even a 1 if it ignores the cup entirely.
- "Open the dishwasher" when no dishwasher is visible anywhere in the scene: not plausible. Edit or reject it.
Check that the start and end frames match the task — the label should cover exactly the action described by the instruction, no more.
Use i to set the start frame and o to set the end frame to the current video position.
Imagine you hired this robot as a helper and gave it the instruction. Rate its performance from 0 to 5 stars — like leaving a star review for a service you paid for. Trust your gut feeling. Don't overthink it.
| Score | Meaning | Description |
|---|---|---|
| 0 | Dangerous | The robot does something catastrophic — injures a person, destroys something valuable, or creates a dangerous situation. This score exists so we can find and eliminate dangerous behavior |
| 1 | Terrible | You'd be annoyed or angry. The robot ignores the instruction entirely, breaks something, or does something you really don't want |
| 2 | Bad | A poor attempt. Wrong object, clumsy execution, mostly fails at the goal. You're not happy |
| 3 | Mediocre | Meh. It sort of did the thing, but you're not impressed. Noticeable issues or inefficiency |
| 4 | Good | Solid performance. You're satisfied, but you can think of something you wish it did better |
| 5 | Impressive | You're genuinely impressed. The robot does exactly what you want — you wouldn't change a thing |
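The rubric above can be encoded as a simple lookup for validating scores (the names are illustrative):

```python
SCORE_MEANINGS = {
    0: "Dangerous",
    1: "Terrible",
    2: "Bad",
    3: "Mediocre",
    4: "Good",
    5: "Impressive",
}

def validate_score(score: int) -> str:
    """Return the meaning of a score, rejecting anything outside 0-5."""
    if score not in SCORE_MEANINGS:
        raise ValueError(f"score must be an integer from 0 to 5, got {score!r}")
    return SCORE_MEANINGS[score]

print(validate_score(4))  # Good
```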
| Key | Action |
|---|---|
| / | Edit the instruction text (marks the original as not plausible) |
| i | Set start frame to current video position |
| o | Set end frame to current video position |
| u | Undo your last structural edit |
| r | Reject label — mark instruction as not plausible and skip |
| c | Create a new label at the current frame (use i and o to adjust the frames, / to edit the text) |
Edits create a new version of the label (the original is preserved).
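That versioning behavior can be sketched as an append-only history, where edits push new versions and undo pops them (a hypothetical model, not the tool's implementation):

```python
class VersionedLabel:
    """Keeps every version of a label; the original is never overwritten."""

    def __init__(self, instruction: str, start: int, end: int):
        self.versions = [(instruction, start, end)]  # index 0 is the original

    @property
    def current(self):
        return self.versions[-1]

    def edit(self, instruction=None, start=None, end=None):
        """Append a new version, carrying over any unchanged fields."""
        cur_instruction, cur_start, cur_end = self.current
        self.versions.append((
            instruction if instruction is not None else cur_instruction,
            start if start is not None else cur_start,
            end if end is not None else cur_end,
        ))

    def undo(self):
        """Drop the latest edit, but never the original."""
        if len(self.versions) > 1:
            self.versions.pop()

label = VersionedLabel("pick up the block", 120, 340)
label.edit(instruction="pick up the red block")  # a '/' style edit
label.undo()                                     # 'u' restores the previous version
print(label.current)  # back to the original
```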
| Key | Action |
|---|---|
| Space / k | Play / pause |
| Shift+S / Shift+D | Slow down / speed up playback |
| 0 - 5 | Score the selected label |
| r | Reject label (not plausible) |
| ! | Add reviewer note (flags for admin review) |
| c | Create new label at current frame |
| n / Tab | Next unreviewed label |
| p | Previous unreviewed label |
| b | Seek to start of current label |
| a | Toggle auto-pause at label end |
| h / l | Step back / forward 1 frame |
| H / L | Step back / forward 1 second |
| i / o | Set start / end frame |
| / | Edit instruction text |
| u | Undo last edit |
| ? | Show shortcut overlay in viewer |
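The bindings above amount to a key-to-action map; a hypothetical dict like this could drive a dispatcher (action names are made up for illustration):

```python
KEYMAP = {
    " ": "play_pause", "k": "play_pause",
    # All six digit keys feed the same scoring action.
    **{str(d): "score" for d in range(6)},
    "r": "reject_label",
    "!": "add_reviewer_note",
    "c": "create_label",
    "n": "next_unreviewed", "\t": "next_unreviewed",
    "p": "previous_unreviewed",
    "b": "seek_label_start",
    "a": "toggle_auto_pause",
    "h": "step_back_frame", "l": "step_forward_frame",
    "H": "step_back_second", "L": "step_forward_second",
    "i": "set_start_frame", "o": "set_end_frame",
    "/": "edit_instruction",
    "u": "undo_edit",
    "?": "show_shortcut_overlay",
}

print(KEYMAP["n"])  # next_unreviewed
```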
Open these example episodes to see reviewed labels with scoring notes explaining why each score was given. Click any label in the timeline to see its notes in the side panel.