Review Guide

Everything you need to know to score robot behavior.

Context

We are training a humanoid robot to follow instructions like a good human worker. The dataset consists of head-mounted camera videos of the humanoid doing various tasks — folding clothes, picking up objects, cleaning the kitchen, and many others. Each video contains many instructions, with at most one instruction active at any time.

The robot will learn to maximize its score, so your scores directly shape its behavior.

What you're reviewing

Each episode is segmented into labels — each label has a start frame, end frame, and a language instruction (e.g. "pick up the red block").
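Concretely, you can think of each label as a small record. The sketch below is illustrative only — the field names are hypothetical, not the actual schema:

```python
# Hypothetical shape of one label; the real schema may differ.
label = {
    "start_frame": 1200,                      # first frame of the action
    "end_frame": 1480,                        # last frame of the action
    "instruction": "pick up the red block",   # language instruction
    "score": None,                            # reviewer fills in 0-5 after watching
}

# A valid segment must end after it starts.
assert label["end_frame"] > label["start_frame"]
```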

For each label you do two things: check that the instruction makes sense, then score the robot's behavior.

Checking the instruction

The instructions were generated automatically, so some of them won't be right. Before scoring, ask yourself one question:

Would it make sense to give this instruction to the robot in this situation?

If the answer is yes — even if the robot completely fails to do it — leave the instruction as-is and give it a low score. We need those low scores to teach the robot that ignoring a valid instruction is bad. Only edit or reject when the instruction itself doesn't make sense.


When in doubt, don't edit. If the instruction is roughly right and you could imagine saying it to a person in that situation, leave it alone. Only edit when something is clearly wrong.
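The decision above can be summarized as a simple rule: a sensible instruction always gets scored, no matter how badly the robot did. A minimal sketch (the function and return strings are illustrative, not real tooling code):

```python
# Illustrative decision flow for reviewing one label.
def review_label(instruction_makes_sense, robot_succeeded):
    """Return the action a reviewer should take, per this guide."""
    if not instruction_makes_sense:
        return "edit or reject the instruction"
    # Note: robot_succeeded is deliberately ignored here. A valid
    # instruction is scored even on total failure, because low scores
    # teach the robot that ignoring a valid instruction is bad.
    return "score the robot's behavior (0-5)"

review_label(True, False)   # valid instruction, robot failed -> still score it
review_label(False, True)   # nonsensical instruction -> edit or reject
```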

Fixing boundaries

Check that the start and end frames match the task — the label should cover exactly the action described by the instruction, no more.

Press i to set the start frame, or o to set the end frame, to the current video position.

Always fix first, then score. If you edit the instruction or adjust boundaries, do that before pressing a score key. You're scoring the corrected version.

Scoring

Imagine you hired this robot as a helper and gave it the instruction. Rate its performance from 0 to 5 stars — like leaving a star review for a service you paid for. Trust your gut feeling. Don't overthink it.

Score | Meaning | Description
0 | Dangerous | The robot does something catastrophic — injures a person, destroys something valuable, or creates a dangerous situation. This score exists so we can find and eliminate dangerous behavior.
1 | Terrible | You'd be annoyed or angry. The robot ignores the instruction entirely, breaks something, or does something you really don't want.
2 | Bad | A poor attempt. Wrong object, clumsy execution, mostly fails at the goal. You're not happy.
3 | Mediocre | Meh. It sort of did the thing, but you're not impressed. Noticeable issues or inefficiency.
4 | Good | Solid performance. You're satisfied, but you can think of something you wish it did better.
5 | Impressive | You're genuinely impressed. The robot does exactly what you want — you wouldn't change a thing.

Don't overthink it. There are no objectively correct scores. Watch the clip, form an impression, and go with your gut. Speed and consistency within your own judgement matter more than agonizing over individual scores.

Turn on audio. Sound helps you judge things that are hard to see — collisions, scraping, objects dropping, motors straining. If you hear a loud bang during a "place the cup gently" task, that matters.

Workflow

1. Go to the Queue tab and click Review on an episode.
2. The viewer opens on the first unreviewed label. Press Space to play.
3. Watch the segment. Check whether the instruction and boundaries make sense (see above).
4. Fix the instruction text and/or start/end frames if needed.
5. Press a number key 0–5 to score the robot's behavior. The viewer auto-advances to the next unreviewed label.
6. Repeat until all labels are reviewed. Your progress shows in the counter at the top.

Editing shortcuts

/   Edit the instruction text (marks the original as not plausible)
i   Set start frame to current video position
o   Set end frame to current video position
u   Undo your last structural edit
r   Reject label — mark instruction as not plausible and skip
c   Create a new label at the current frame (use i, o, and / to adjust)

Edits create a new version of the label (the original is preserved).

Keyboard shortcuts

Space / k           Play / pause
Shift+S / Shift+D   Slow down / speed up playback
0–5                 Score the selected label
r                   Reject label (not plausible)
!                   Add reviewer note (flags for admin review)
c                   Create new label at current frame
n / Tab             Next unreviewed label
p                   Previous unreviewed label
b                   Seek to start of current label
a                   Toggle auto-pause at label end
h / l               Step back / forward 1 frame
H / L               Step back / forward 1 second
i / o               Set start / end frame
/                   Edit instruction text
u                   Undo last edit
?                   Show shortcut overlay in viewer

Examples

Open these example episodes to see reviewed labels with scoring notes explaining why each score was given. Click any label in the timeline to see its notes in the side panel.

No example episodes configured yet. An admin can add them from the Admin panel.