Wednesday, 14 November 2012

HTML5 Text Track Model

The purpose of WebVTT files in HTML is to create a list of text track cues for the HTML5 text track object, which is in turn part of a media element. It is the responsibility of the parser to return a list of text track cues to the text track, along with the rules for rendering them. That means our WebVTT parser will be called upon to return those two things. There are extensive rules for how the browser should deal with tracks and how they behave; I will not cover them here. This post describes what the objects look like.

I wrote before about how the HTML5 <track> element is composed. Now I will explain how the text track model is composed in the DOM.

Text Track Model

A media element has a list of text tracks, which can come from three sources. They are added in the following order:
  1. Tracks listed with the <track> tag in the order specified in HTML.
  2. Tracks added dynamically with the addTextTrack() method in the order they are added.
  3. Tracks that are embedded in a media object.
Tracks can be enabled or loaded automatically depending on many settings, most importantly user preferences. The rules for track selection and fetching are not needed to describe the model; they can be found in the specifications.
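
The ordering of the three sources can be sketched as a sort. This is only an illustration: TrackSource, TrackEntry, sequence and orderTracks are names I made up for the example and are not part of the HTML5 API.

```typescript
// Hypothetical model of where a text track came from (not a DOM API).
type TrackSource = "element" | "script" | "in-band";

interface TrackEntry {
  label: string;
  source: TrackSource;
  sequence: number; // position within its own source (document order, call order, ...)
}

// Rank of each source, following the order given in the list above.
const SOURCE_RANK: Record<TrackSource, number> = {
  "element": 0, // <track> tags
  "script": 1,  // addTextTrack()
  "in-band": 2, // embedded in the media object
};

// Returns a new array sorted by source rank, then by position within the source.
function orderTracks(tracks: TrackEntry[]): TrackEntry[] {
  return tracks.slice().sort(
    (a, b) =>
      SOURCE_RANK[a.source] - SOURCE_RANK[b.source] || a.sequence - b.sequence
  );
}
```

The input array is copied before sorting so the caller's list is left untouched.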

The following are components of a text track.


kind

A string which represents how the text track is handled by the browser. It is set by the <track> tag and can change: if the tag's value is changed, so must this value. It must be one of the following:
  1. subtitles
  2. captions
  3. descriptions
  4. chapters
  5. metadata
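
The five values above can be written as a small type with a guard function. This is a sketch; TRACK_KINDS, TrackKind and isTrackKind are illustrative names, and the comparison is an exact string match with no case folding.

```typescript
// The five kinds listed above, as a readonly tuple.
const TRACK_KINDS = [
  "subtitles",
  "captions",
  "descriptions",
  "chapters",
  "metadata",
] as const;

// Union type derived from the tuple: "subtitles" | "captions" | ...
type TrackKind = typeof TRACK_KINDS[number];

// Type guard: true only when the string is exactly one of the five kinds.
function isTrackKind(value: string): value is TrackKind {
  return (TRACK_KINDS as readonly string[]).indexOf(value) !== -1;
}
```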


label

A string that identifies the track for users. It is set by the <track> tag and can change: if the tag's value is changed, so must this value. If it is empty, the browser should generate a label based on other properties such as kind and language.

in-band metadata track dispatch type

If the track is embedded in the media object and its kind is metadata, then this is a string that lets scripts work out how to handle the track; otherwise it is an empty string. The rules for obtaining it vary with the type of media (see the specification).


language

A string which indicates the language of the track. It is a BCP 47 language tag. It is set by the <track> tag and can change: if the tag's value is changed, so must this value.

readiness state

This is not actually in this model but in the <track> tag model (see pseudocode at bottom of this post). It is the loading status of the track: a number which indicates one of the following states. Initially it is set to NONE, for not loaded.
  1. NONE
    • Value is 0
    • The track has not obtained any cues.
  2. LOADING
    • Value is 1
    • The track is loading and has not hit any errors. The cues are still loading.
  3. LOADED
    • Value is 2
    • The track has loaded and there were no errors. All cues are loaded.
  4. ERROR
    • Value is 3
    • The track hit one or more errors. Cues may not have been loaded.
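
The four states map directly onto the numeric constants in the <track> interface at the bottom of this post. Here is a minimal sketch of working with them; TRACK_READY_STATE, readyStateName and allCuesLoaded are illustrative names, not part of any API.

```typescript
// Numeric values taken from the constants in the HTMLTrackElement interface.
const TRACK_READY_STATE = { NONE: 0, LOADING: 1, LOADED: 2, ERROR: 3 } as const;
type TrackReadyStateValue =
  typeof TRACK_READY_STATE[keyof typeof TRACK_READY_STATE];

// Names indexed by state value, for logging.
const READY_STATE_NAMES: readonly string[] = ["NONE", "LOADING", "LOADED", "ERROR"];

function readyStateName(state: TrackReadyStateValue): string {
  return READY_STATE_NAMES[state];
}

// Only LOADED guarantees the full cue list; ERROR may have left it incomplete.
function allCuesLoaded(state: TrackReadyStateValue): boolean {
  return state === TRACK_READY_STATE.LOADED;
}
```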


mode

The active state of the track. Initially set to disabled. It is one of three values:

  1. disabled
    • Track is not active and is ignored by the browser.
  2. hidden
    • The track is active but the cues are not being rendered.
  3. showing
    • The track is active and the cues are being rendered as appropriate for the track's kind.
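
The difference between the three modes comes down to two yes/no questions: is the track active, and are its cues rendered? A small sketch of that, where TrackMode, ModeInfo and modeInfo are names invented for the example:

```typescript
// The three mode strings from the TextTrackMode enum.
type TrackMode = "disabled" | "hidden" | "showing";

interface ModeInfo {
  active: boolean;   // the browser tracks which cues are current
  rendered: boolean; // the cues are drawn for the user
}

function modeInfo(mode: TrackMode): ModeInfo {
  return {
    active: mode !== "disabled",
    rendered: mode === "showing",
  };
}
```

A hidden track is therefore still useful to scripts: it is active, just not rendered.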

list of cues

A list of text track cues. This list is dynamic since the cues are parsed asynchronously. It is initially empty. It also carries the rules for rendering the text track, which for WebVTT are found in the WebVTT specification.

Text Track Cue Model

The WebVTT parser returns a list of text track cues, which is added to the text track. A text track cue has the following components.


identifier

An arbitrary string. It is initially an empty string.

start time

The start time of the cue in seconds and fractions of a second.

end time

The end time of the cue in seconds and fractions of a second.
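
Since the times are plain seconds, a parser has to convert WebVTT timestamps such as 00:01:02.500 into numbers. The sketch below does that and also checks whether a cue covers a given playback time. The function names are mine, and the rule "start time at or before the current time, end time after it" is my reading of the specification, so treat it as an assumption.

```typescript
// Converts "mm:ss.ttt" or "hh:mm:ss.ttt" to seconds.
// Returns null when the string does not have that shape.
function timestampToSeconds(stamp: string): number | null {
  const match = /^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$/.exec(stamp);
  if (match === null) return null;
  const hours = match[1] === undefined ? 0 : Number(match[1]);
  const minutes = Number(match[2]);
  const seconds = Number(match[3]);
  const millis = Number(match[4]);
  return hours * 3600 + minutes * 60 + seconds + millis / 1000;
}

interface CueTiming {
  startTime: number; // seconds
  endTime: number;   // seconds
}

// Assumed rule: the start is inclusive and the end is exclusive.
function isCueActiveAt(cue: CueTiming, currentTime: number): boolean {
  return cue.startTime <= currentTime && currentTime < cue.endTime;
}
```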

pause-on-exit flag

A true or false value. If true, the media element will pause at the end of the current cue. It is initially false.

writing direction

A string representing if the writing is to be displayed horizontally or vertically. It is initially horizontal. There are three possible values:
  1. Horizontal
    • Value is an empty string
    • lines are horizontal
    • consecutive lines are displayed below each other
    • line position is relative to height
    • text position and size are relative to width
  2. Vertical growing left
    • Value is string rl
    • lines are vertical
    • consecutive lines are displayed to the left of each other
    • line position is relative to width
    • text position and size are relative to height
  3. Vertical growing right
    • Value is string lr
    • lines are vertical
    • consecutive lines are displayed to the right of each other
    • line position is relative to width
    • text position and size are relative to height
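
The three cases above are easier to compare side by side as a lookup table. This is just the list restated as data; WritingDirection, DirectionLayout and layoutFor are illustrative names.

```typescript
// The three values of the cue's writing direction: "", "rl", "lr".
type WritingDirection = "" | "rl" | "lr";

interface DirectionLayout {
  lines: "horizontal" | "vertical";
  nextLine: "below" | "left" | "right"; // where consecutive lines go
  linePositionAxis: "height" | "width"; // what line position is relative to
  textPositionAxis: "width" | "height"; // what text position and size are relative to
}

const LAYOUTS: Record<WritingDirection, DirectionLayout> = {
  "": { lines: "horizontal", nextLine: "below", linePositionAxis: "height", textPositionAxis: "width" },
  "rl": { lines: "vertical", nextLine: "left", linePositionAxis: "width", textPositionAxis: "height" },
  "lr": { lines: "vertical", nextLine: "right", linePositionAxis: "width", textPositionAxis: "height" },
};

function layoutFor(direction: WritingDirection): DirectionLayout {
  return LAYOUTS[direction];
}
```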

snap-to-lines flag

A true or false value. If true, line position indicates a position like a line of text in a document. If false, line position is a percentage. It is initially true.

line position

An integer representing the position where the text is to be displayed. The direction is indicated by writing direction. The snap-to-lines flag indicates whether it is a percentage or a position like a line number in a document. It can also be set to the string auto, which means it is determined based on other active cues. If it is a percentage, it must be between 0 and 100 (inclusive). It is initially auto.
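
The interaction between the snap-to-lines flag and line position can be sketched as a validity check. LinePosition and isValidLinePosition are names made up for this example. Accepting any integer (including negative ones) as a line number is an assumption on my part, based on the attribute being typed as long in the interface.

```typescript
// Line position is either an integer or the keyword "auto".
type LinePosition = number | "auto";

function isValidLinePosition(line: LinePosition, snapToLines: boolean): boolean {
  if (line === "auto") return true;
  // Must be a finite whole number in both interpretations.
  if (!isFinite(line) || Math.floor(line) !== line) return false;
  // ASSUMPTION: as a line number, any integer is acceptable.
  if (snapToLines) return true;
  // As a percentage, it must lie between 0 and 100 inclusive.
  return line >= 0 && line <= 100;
}
```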

text position

An integer percentage between 0 and 100 (inclusive) that represents the position where the text is to be displayed. The direction is indicated by writing direction. It is initially 50.


size

An integer percentage between 0 and 100 (inclusive) that represents the width (or height) of the text display area. The direction is indicated by writing direction.

Example 1: Let writing direction be horizontal, then size is the width of the Caption Rendering Box.


alignment

A string that indicates how text is aligned within the rendering area (Caption Rendering Box in Example 1). The start and end side depend on the writing direction. It is initially middle. There are five possible values:

  1. start
    • Text is aligned to the start side.
  2. middle
    • Text is centred.
  3. end
    • Text is aligned to the end side.
  4. left
    • Text is aligned to the left.
  5. right
    • Text is aligned to the right.


text

The actual text of the cue. It is also associated with the rules for how it is to be interpreted: the WebVTT parsing rules, the WebVTT cue text rendering rules, and the WebVTT DOM construction rules.

active flag

A true or false value. It is used to make sure the cue is rendered properly. Its behavior is dynamic and complex. Please see the specifications for more information.

display state

It is used for rendering, in conjunction with the active flag. Its behavior is dynamic and complex. Please see the specifications for more information.
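
Putting the initial values from this section together, a freshly created cue could be modelled as below. This is a plain-object sketch, not the browser's TextTrackCue: CueModel and makeCue are illustrative names, the three arguments mirror the constructor in the interface at the bottom of this post, and the initial size of 100 is an assumption because this post does not state it.

```typescript
type CueDirection = "" | "rl" | "lr";
type CueAlignment = "start" | "middle" | "end" | "left" | "right";

interface CueModel {
  id: string;
  startTime: number; // seconds
  endTime: number;   // seconds
  pauseOnExit: boolean;
  vertical: CueDirection;
  snapToLines: boolean;
  line: number | "auto";
  position: number;  // percentage, 0 to 100
  size: number;      // percentage, 0 to 100
  align: CueAlignment;
  text: string;
}

// Builds a cue with the initial values described in this section.
function makeCue(startTime: number, endTime: number, text: string): CueModel {
  return {
    id: "",
    startTime,
    endTime,
    pauseOnExit: false,
    vertical: "",      // horizontal
    snapToLines: true,
    line: "auto",
    position: 50,
    size: 100,         // ASSUMPTION: initial size is not given in this post
    align: "middle",
    text,
  };
}
```

A parser would then overwrite only the settings that actually appear on the cue's timing line.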

Additional Text Track Cue Information

There are a number of methods that the different objects require but I want to mention only one since it relates to how WebVTT files get parsed.


getCueAsHTML()

This is a method of the text track cue that returns a document fragment (a small document object, or piece of the DOM) by converting the cue text according to the WebVTT parsing rules and WebVTT DOM construction rules.


W3C HTML5 Specification [1] [2] [3] [4]

interface HTMLTrackElement : HTMLElement {
           attribute DOMString kind;
           attribute DOMString src;
           attribute DOMString srclang;
           attribute DOMString label;
           attribute boolean default;

  const unsigned short NONE = 0;
  const unsigned short LOADING = 1;
  const unsigned short LOADED = 2;
  const unsigned short ERROR = 3;
  readonly attribute unsigned short readyState;

  readonly attribute TextTrack track;
};

enum TextTrackMode { "disabled", "hidden", "showing" };
interface TextTrack : EventTarget {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;
  readonly attribute DOMString inBandMetadataTrackDispatchType;

           attribute TextTrackMode mode;

  readonly attribute TextTrackCueList? cues;
  readonly attribute TextTrackCueList? activeCues;

  void addCue(TextTrackCue cue);
  void removeCue(TextTrackCue cue);

           attribute EventHandler oncuechange;
};

// Represents a dynamically updating list
interface TextTrackCueList {
  readonly attribute unsigned long length;   // Number of cues
  getter TextTrackCue (unsigned long index);
  TextTrackCue? getCueById(DOMString id);    // By identifier
};

enum AutoKeyword { "auto" };
[Constructor(double startTime, double endTime, DOMString text)]
interface TextTrackCue : EventTarget {
  readonly attribute TextTrack? track;

           attribute DOMString id;              // Identifier
           attribute double startTime;
           attribute double endTime;
           attribute boolean pauseOnExit;
           attribute DOMString vertical;
           attribute boolean snapToLines;
           attribute (long or AutoKeyword) line;
           attribute long position;
           attribute long size;
           attribute DOMString align;
           attribute DOMString text;
  DocumentFragment getCueAsHTML();

           attribute EventHandler onenter;
           attribute EventHandler onexit;
};
