If you train manipulation policies, you already know the unglamorous truth of imitation learning: the model is only as good as the segments. A single robot demonstration — "open the drawer, pick up the block, place it in the bin" — is one continuous video, but the policy needs to know where one action ends and the next begins. Get those boundaries right and a behavior-cloning or skill-conditioned model has clean, well-aligned supervision. Get them sloppy and you've quietly poisoned every downstream training run.

This is a practical guide to action-segmentation labeling for robot-demonstration video: what makes a boundary "correct," the rules that keep a timeline consistent, how to do quality control across multiple annotators without hand-waving, and how to get the result into the formats the open robot-learning ecosystem actually trains on (LeRobotDataset v3 and RLDS / Open X-Embodiment).

Everything described here is something you can do today on FoodforThought, Kindly's open platform for robotics data. If you want to skip the theory and feel it for yourself, you can label a robot demo in under two minutes — no account needed.

What action segmentation actually is

Action segmentation is temporal annotation: you watch a demonstration video and mark labeled spans on the timeline — each span has a start time, an end time, and an action label (reach, grasp, transport, place, retract, and so on).

It is different from the other annotation types you'll meet in a robot-data pipeline:

Bounding boxes localize objects in space (where is the cube in this frame).
Polygon / mask segmentation gives you precise object shapes.
Q&A annotation enriches metadata (which gripper, what surface, did the grasp succeed).
Action segmentation localizes behavior in time (when does the grasp happen).

For policy learning, temporal segments are what turn a long, unstructured demo into a set of named, reusable sub-skills — and they're what let you slice a dataset by "show me every grasp across 4,000 episodes."
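To make the "slice by label" idea concrete, here is a minimal sketch of a segment record and a label filter. The `Segment` class, its field names, and `segments_with_label` are hypothetical illustrations, not FoodforThought's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One labeled span on a demo timeline (hypothetical schema)."""
    episode_id: str
    label: str
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def segments_with_label(segments, label):
    """Return every segment carrying the given action label."""
    return [s for s in segments if s.label == label]


segments = [
    Segment("ep_0001", "reach", 0.0, 1.0),
    Segment("ep_0001", "grasp", 1.0, 1.5),
    Segment("ep_0002", "reach", 0.0, 1.5),
    Segment("ep_0002", "grasp", 1.5, 2.5),
]

# "Show me every grasp", across however many episodes are loaded.
grasps = segments_with_label(segments, "grasp")
print([(s.episode_id, s.duration_s) for s in grasps])
```

The same filter works unchanged whether the list holds four segments or the segments of 4,000 episodes, which is the practical payoff of consistent labels.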

The thing that separates good segments from bad ones: frame accuracy

The most common quality problem in temporal labeling is boundaries that are close but not exact. A grasp that "starts around 3.2 seconds" is useless if the actual finger-contact frame is at 3.07s, because the next person — and the model — has no way to know whether your boundary was deliberate or eyeballed.

Two habits fix this.

1. Mark on frames, not on seconds. Video is discrete. At 30 fps, one frame lasts about 33.3 ms, and the difference between "the gripper is open" and "the gripper has closed" is often a single frame. Scrub frame-by-frame to the exact transition and set your in/out point there. Frame-stepping (rather than dragging a playhead) is what makes repeated boundaries reproducible: stepping one frame at a time and snapping to frame boundaries avoids the arbitrary sub-frame timestamps you get from clicking around on a timeline.

2. Define your transition convention and keep it. Decide once what "start of grasp" means — first frame of finger motion toward the object? First frame of contact? First frame the object moves with the gripper? — and apply it to every episode. Most disagreement between annotators isn't carelessness; it's two people using two reasonable-but-different definitions of the same boundary.
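As a rough illustration of the first habit, the sketch below snaps a timestamp to a frame index and converts it back. The function names are hypothetical, and using `Fraction` for the frame rate is a choice made here (it also handles rates such as 30000/1001), not a description of how any particular labeler stores time.

```python
from fractions import Fraction


def time_to_frame(t_seconds: float, fps: Fraction) -> int:
    """Snap a timestamp in seconds to the nearest frame index."""
    return round(Fraction(t_seconds) * fps)


def frame_to_time(frame: int, fps: Fraction) -> Fraction:
    """Exact start time of a frame, kept as a rational number of seconds."""
    return Fraction(frame) / fps


fps = Fraction(30)

eyeballed = time_to_frame(3.2, fps)   # the "around 3.2 seconds" guess
contact = time_to_frame(3.07, fps)    # the actual finger-contact moment

print(eyeballed, contact, eyeballed - contact)
print(round(float(frame_to_time(contact, fps)), 4))
```

At 30 fps the eyeballed boundary lands four frames after the contact frame, which is the kind of error frame-stepping is meant to remove. Storing the integer frame index, and deriving seconds from it only when needed, keeps repeated boundaries identical.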

FoodforThought's action labeler is built around this. It uses the video's fps for frame-accurate scrubbing (defaulting to 30 fps when a clip carries no fps metadata), shows you the current time and frame number, and exposes a small keyboard layout so you never have to leave the keyboard mid-demo:

Key	Action
`Space`	Play / pause
`←` / `→`	Step one frame back / forward
`i`	Set in-point
`o`	Set out-point (creates the segment)
`l`	Cycle the action label
`n` / `p`	Jump to next / previous segment

The workflow is: scrub to the exact first frame of the action and press `i`; scrub to the exact last frame and press `o`, which creates the segment; then press `l` until the right label is selected. It's the same muscle memory as an NLE (non-linear editor) timeline, which is the point: frame-exact work should feel like editing video, not filling out a form.

The two rules that keep a timeline consistent

Once you have multiple segments on one demo, two structural questions decide whether the annotation is valid: can segments overlap, and do gaps matter? FoodforThought encodes a specific, defensible answer to both, and it's worth understanding the reasoning even if you label somewhere else.

Overlaps are illegal (with one deliberate exception)

Two segments that share more than a negligible sliver of timeline (anything beyond touching endpoints) are treated as an error, and the editor refuses to create the second one. The rationale: a single robot does one thing at a time during a manipulation primitive, so two overlapping grasp/transport spans almost always indicate a mistake, not intent.

The deliberate exception is touching endpoints. If one segment ends at exactly the frame the next begins (reach ends, grasp starts), that is not an overlap — adjacent, back-to-back segments are legal and expected. Under the hood this is handled with a small time epsilon so that a boundary touch never gets misread as a one-frame collision. In practice: you can tile the whole demo with consecutive segments, you just can't double-cover the same instant.
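The overlap rule with its touching-endpoint exception can be sketched as follows. This is an illustration of the idea, not the editor's code: the epsilon value, the tuple representation, and the function names are all assumptions.

```python
EPSILON_S = 1e-6  # assumed tolerance; the editor's real value is not stated here


def overlaps(a, b, eps=EPSILON_S):
    """True if spans a and b share more than eps seconds of timeline.

    Each span is a (start_s, end_s) tuple. Touching endpoints give a shared
    length of zero, so they are not counted as an overlap.
    """
    shared = min(a[1], b[1]) - max(a[0], b[0])
    return shared > eps


def can_add(existing, candidate, eps=EPSILON_S):
    """A new segment is accepted only if it overlaps no existing segment."""
    return not any(overlaps(seg, candidate, eps) for seg in existing)


timeline = [(0.0, 1.0), (1.0, 1.5)]      # reach, then grasp, back to back

print(can_add(timeline, (1.5, 2.5)))     # True: starts exactly where grasp ends
print(can_add(timeline, (1.2, 2.0)))     # False: double-covers 1.2 to 1.5
```

Comparing the shared length against a small tolerance, rather than against zero, is what keeps a boundary touch from being rejected because of rounding in the timestamps.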

Gaps are allowed, but surfaced

Uncovered spans between segments — the moment between place and the start of retract, or dead time before the robot starts moving — are advisory, not illegal. They never block you from saving, because real demonstrations genuinely have idle stretches. But the editor shows them to you (including a leading gap before your first segment and a trailing gap to the end of the clip) so an accidental gap, where you meant the segments to be contiguous, doesn't slip through silently.
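A gap report of this kind can be computed in one pass over the sorted segments. The sketch below is a hypothetical helper, assuming segments are non-overlapping `(start_s, end_s)` tuples and that the clip starts at time zero; it returns warnings and never raises.

```python
def find_gaps(segments, clip_duration_s, eps=1e-6):
    """Return uncovered (start_s, end_s) spans, including leading and trailing gaps."""
    gaps = []
    cursor = 0.0  # end of the timeline covered so far
    for start, end in sorted(segments):
        if start - cursor > eps:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if clip_duration_s - cursor > eps:
        gaps.append((cursor, clip_duration_s))
    return gaps


segments = [(0.5, 1.0), (1.0, 2.0), (2.5, 3.0)]
print(find_gaps(segments, clip_duration_s=4.0))
# [(0.0, 0.5), (2.0, 2.5), (3.0, 4.0)]
```

The result is advisory data for the annotator to review: an empty list means the demo is fully tiled, and a non-empty list is a prompt to confirm that each idle stretch is intentional.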

This "overlaps block, gaps warn" split is the honest version of timeline validation: it stops the errors that are almost always wrong, and merely flags the ones that are sometimes intentional.

Quality control: how do you know a label is right?

Frame-accurate, well-structured segments from a single annotator are still one person's opinion. For a dataset you intend to train on, you need inter-annotator agreement — and for temporal labels, agreement needs a temporal definition.

The metric that makes sense for action segments is temporal IoU (intersection-over-union of two time spans for the same label). If two annotators both mark a grasp and their spans overlap heavily, IoU is near 1.0; if one is shifted half a second late, IoU drops. FoodforThought's consensus layer computes this per annotation type — temporal IoU for action segments, spatial IoU for boxes and masks, value match for Q&A — and rolls it up into a per-task agreement score.
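For reference, temporal IoU for two spans is the length of their intersection divided by the length of their union. A minimal sketch, assuming spans are `(start_s, end_s)` tuples (the function name is illustrative, not the platform's API):

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start_s, end_s) time spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


same = temporal_iou((1.0, 2.0), (1.0, 2.0))       # identical spans
shifted = temporal_iou((1.0, 2.0), (1.5, 2.5))    # one annotator half a second late
disjoint = temporal_iou((1.0, 2.0), (3.0, 4.0))   # no shared time at all

print(same, round(shifted, 3), disjoint)
```

For a one-second grasp, a half-second shift leaves 0.5 s of intersection over 1.5 s of union, so agreement falls to about one third even though both annotators "found the grasp".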

A few details that make this usable rather than decorative:

Greedy same-label matching. Each segment one annotator marked is paired with the best-IoU same-label segment from the other annotator, and the score is normalized by the larger segment count, so missing or extra segments are penalized, not hidden.
Reputation weighting. Agreement can be weighted by an annotator's level and rolling accuracy, with the weight bounded so no single voice dominates a task. (Reputation is read-only to the consensus math — it never silently edits anyone's XP.)
Auto-flagging for adjudication. Tasks whose agreement falls below a threshold are routed to a reviewer instead of being trusted blindly. There's an adjudication queue for exactly this — a human breaks the tie when annotators genuinely disagree.
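The matching and flagging steps above can be sketched as follows. This is an unweighted illustration under stated assumptions: each of the second annotator's segments is used at most once, two empty timelines count as full agreement, and the 0.7 threshold is an arbitrary example value. None of these details are confirmed by the platform, and reputation weighting is left out.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start_s, end_s) time spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def agreement(segs_a, segs_b):
    """Greedy same-label agreement between two annotators.

    Each segment is a (label, start_s, end_s) tuple.
    """
    if not segs_a and not segs_b:
        return 1.0  # assumption: nothing marked by either side counts as agreement
    unused = list(range(len(segs_b)))  # assumption: each B segment matches once
    total = 0.0
    for label, start, end in segs_a:
        best_j, best_iou = None, 0.0
        for j in unused:
            other_label, other_start, other_end = segs_b[j]
            if other_label != label:
                continue
            iou = temporal_iou((start, end), (other_start, other_end))
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            unused.remove(best_j)
            total += best_iou
    # Dividing by the larger count penalizes missing or extra segments.
    return total / max(len(segs_a), len(segs_b))


ADJUDICATION_THRESHOLD = 0.7  # hypothetical value for illustration

annotator_a = [("reach", 0.0, 1.0), ("grasp", 1.0, 2.0)]
annotator_b = [("reach", 0.0, 1.0), ("grasp", 1.5, 2.5), ("place", 2.5, 3.0)]

score = agreement(annotator_a, annotator_b)
needs_review = score < ADJUDICATION_THRESHOLD
print(round(score, 3), needs_review)
```

In this example the reach matches perfectly, the grasp scores about 0.33, and the unmatched place segment contributes nothing while still counting in the denominator, so the task scores about 0.44 and would be routed to a reviewer.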

The point of QC isn't to produce a number; it's to find the demos where reasonable people disagreed and resolve them, because those are precisely the boundaries your model will struggle with too.

Getting clean segments into a trainable dataset

Labels are only valuable if your training stack can read them. The open robot-learning world has effectively standardized on two formats, and FoodforThought is built to speak both natively rather than inventing a silo:

LeRobotDataset v3 — Hugging Face's current on-disk format, where frame rows and video from many episodes are concatenated into a few large chunked files and per-episode views are reconstructed from metadata offsets (the v2 format used one file per episode).
RLDS — the episodes-of-steps schema that Open X-Embodiment, RT-X, and Octo train on, where each step bundles observation + action with the RLDS boundary flags (is_first / is_last / is_terminal) and a language_instruction.
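To show how span-level segments relate to a step-based layout, here is a hedged sketch that expands frame-indexed segments into per-step records. The flag names `is_first`, `is_last`, `is_terminal`, and `language_instruction` come from the description above; the `segment_label` field, the helper names, the half-open `[start, end)` frame convention, and the `"idle"` default are assumptions made for illustration. Observations and actions are omitted, and this is not the exporter's actual output.

```python
def label_for_frame(frame, segments, default="idle"):
    """Look up the action label covering a frame.

    Segments are (label, start_frame, end_frame) tuples, treated as half-open
    [start, end) so a shared boundary frame belongs to the later segment.
    """
    for label, start_f, end_f in segments:
        if start_f <= frame < end_f:
            return label
    return default  # frames inside a gap get the default label


def to_steps(num_frames, segments, instruction):
    """Build one plain-dict step per frame with RLDS-style boundary flags."""
    steps = []
    for f in range(num_frames):
        last = f == num_frames - 1
        steps.append({
            "is_first": f == 0,
            "is_last": last,
            "is_terminal": last,  # assumes the demo ends in a terminal state
            "language_instruction": instruction,
            "segment_label": label_for_frame(f, segments),  # hypothetical field
        })
    return steps


segments = [("reach", 0, 3), ("grasp", 3, 5)]
steps = to_steps(6, segments, "pick up the block")
print([s["segment_label"] for s in steps])
# ['reach', 'reach', 'reach', 'grasp', 'grasp', 'idle']
```

Whatever convention is used for the shared boundary frame, it needs to be applied uniformly, for the same reason a single grasp definition does.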

The open-source FoodforThought CLI (ate, MIT-licensed, on PyPI) reads and writes both. A typical round-trip:

pip install foodforthought-cli

# Pull an existing open dataset in — LeRobot from the HF Hub,
# or an Open X-Embodiment / RLDS dataset:
ate data import lerobot/pusht --source hf-lerobot
ate data import <oxe-dataset> --source open-x-embodiment

# Export your labeled recordings back out to a trainable format:
ate data export --from ./labeled_recordings --format lerobot-v3 -o ./export
ate data export --from ./labeled_recordings --format rlds       -o ./export

Because the LeRobot v3 codec is the canonical source of truth, the v3 and RLDS exports of the same labeled dataset stay consistent by construction — your grasp segment means the same thing whether you train with tensorflow_datasets or the LeRobot loader. A note on honesty: the RLDS path is a pure-Python, TF-free structure writer/reader (TensorFlow is only needed via an optional extra), and video re-encoding is out of scope — frame-level tabular data round-trips losslessly and existing MP4 shards are copied verbatim, but the codec doesn't transcode video streams.

This is the design → data → deploy → operate loop FoodforThought is built around, with lineage tracked at every hop: raw demos → processed → labeled → trained skill, so you can always trace a deployed behavior back to the exact segments it learned from.

A short, honest checklist

If you take nothing else from this guide, take the checklist:

Pick one definition per boundary (e.g. "grasp = first frame of contact") and write it down.
Step frame-by-frame to the transition; never eyeball a boundary from a dragged playhead.
Tile, don't overlap — use touching endpoints for back-to-back actions; let real idle time be a (visible) gap.
Label every segment before you save.
Get a second annotator on a sample and look at temporal-IoU agreement; adjudicate the disagreements rather than averaging them away.
Export to LeRobotDataset v3 or RLDS so the work is trainable and portable — and so you can take it elsewhere.

Why neutrality matters here

Robot data and tooling are consolidating fast into a few hyperscaler and foundation-lab orbits. Kindly's deliberate position is the opposite: a neutral, open, interoperable layer — open clients and open schemas (ROS 2, LeRobotDataset, RLDS/OXE, URDF/Xacro, MCAP), with the CLI and SDK MIT-licensed so nothing about adopting Kindly traps your team. The labeling you do feeds a data flywheel for open robot learning, not a single vendor's silo. (See our open-source posture for what's open, what's commercial, and why.) Kindly Robotics is a 501(c)(3) nonprofit; FoodforThought is researcher-first and free to use.

Try it — under two minutes, no account

Reading about frame-exact in/out points is one thing; doing one is another. We built a guided flow where you can label your first robot demonstration in under two minutes, with no sign-up:

→ Try labeling a robot demo now

It's the same task machinery real annotators use. When you're ready for more, you can browse real labeling tasks, check the leaderboard, or explore the open datasets already on the platform — Open X-Embodiment, BridgeData V2, and more.

Good segments are the cheapest way to make a robot-learning model better. Spend two minutes seeing how they're made.

A Practical Guide to Action-Segmentation Labeling for Robot-Demonstration Video