Thursday, 4 October 2012

Web Video Text Tracks (WebVTT)

NO LONGER UPDATED
SEE: https://developer.mozilla.org/en-US/docs/HTML/WebVTT

------------------------------------------------------------------------------------

UPDATE 11//2012: Cue start time may not be less than any previous start time but may be equal to them.

UPDATE 10/27/2012: The string "-->" is no longer allowed in a cue text payload. Also, the CSS now allows opacity and visibility properties.

UPDATE 10/14/2012: Undid newlines in header part. This post is correct for the WebVTT syntax. The parsing rules are more flexible than the syntax rules, but just because it passes the parser doesn't mean it is valid WebVTT. That's technical for "it may work but that doesn't mean it's right".

UPDATE 10/10/2012: Cue start time must be greater than all previous cue start times. End times may be such that two cues overlap. Also, you are allowed newlines in the WebVTT header, but two in a row (a blank line) or the string "-->" indicates the start of the cues.

WebVTT is a new standard developed by the World Wide Web Consortium (W3C), primarily for displaying subtitles in HTML5 video. It is a living standard, meaning that it continues to develop. It is set to be adopted by the major browsers, which would mean that WebVTT could become the standard for subtitles on the internet. Right now a variety of file types are in use, including SubRip (.srt) and SubStation Alpha (.ssa).

Text vs. Image Subtitles

WebVTT is a text-based format, meaning that it is encoded in plain text. This type of file is commonly used on the internet, but it is not what is typically used on DVDs and Blu-rays. Those use images for the subtitles; the VobSub format (.idx, .sub) is an example. It is relatively easy to convert text-based subtitles to image-based ones, but converting images to text requires some sort of optical character recognition (OCR). On the internet it is much easier to use text-based files.


Track Format

A WebVTT file is embedded in an HTML5 page by using the track element. The track element must be contained by a media element. According to the HTML5 specification, a media element is either video or audio. However, according to the WebVTT specification, WebVTT will not work with an audio element. track is defined in W3C HTML5 Spec Section 4.8.9.

Please note that, technically, attribute values do not need to be surrounded by quotes unless the values contain certain characters. However, quotes are highly recommended. Please see W3C HTML5 Spec Section 8.1.2.3 for more information.

Example 1 - Simple track example
 <video src="mytrip.webm">  
   <track src="subtitles.vtt" srclang="en">
 </video>  

The track element has five attributes:
  1. kind
  2. src
  3. srclang
  4. label
  5. default
No two track elements can have the same kind, srclang, and label. Omitted and empty attributes are equivalent.

kind

kind specifies how the subtitles are meant to be used. If omitted, the default kind value is subtitles. There are five possible values for kind:
  1. subtitles
    • Forced subtitles that provide a translation of content that cannot be understood by the viewer, for example dialogue or text that is not English in an English-language film. Subtitles may also contain additional content, usually extra background information, for example the text at the beginning of the Star Wars films, or the current date, time, and/or location of the scene.
  2. captions
    • Closed captioning provides a transcription and possibly a translation of the audio. It may include important non-verbal information such as music cues and sound effects. It should also contain the content that would be provided in subtitles (translations and additional content), and it should indicate the source (music, text, dialogue, etc.).
  3. descriptions
    • This is a description of the content of the video. It is usually synthesised into audio. It is for people who cannot see the video possibly because they are blind or the video cannot be seen clearly or at all.
  4. chapters
    • Chapter titles used for navigating the media content.
  5. metadata
    • Meant to be used in scripts. Not meant to be displayed.

Example 2 - kind examples (The first two are equivalent)
 <track src="subtitles.vtt" srclang="en">  
 <track kind="subtitles" src="subtitles.vtt" srclang="en">  
 <track kind="captions" src="subtitles.vtt">
 <track kind="descriptions" src="subtitles.vtt">
 <track kind="chapters" src="subtitles.vtt">

src

The URL of the file must be specified in src. It is required and cannot be empty.

srclang

The language of the track is specified by srclang. It must be a valid BCP 47 language tag. If the kind attribute is "subtitles", then srclang must be defined. If srclang is not required and it is omitted or empty, then the track has no language. It is recommended that it be defined.

Example 3 - srclang examples (The first two are equivalent)
 <track src="subtitles.vtt" srclang="en">  
 <track kind="subtitles" src="subtitles.vtt" srclang="en">  
 <track kind="captions" src="subtitles.vtt">
 <track kind="captions" src="subtitles.vtt" srclang="en"> 

label

A label is the title of the track that the user will see. If omitted, the label is an empty string.

Example 4 - label examples (The first two are equivalent)
 <track src="subtitles.vtt" srclang="en">  
 <track src="subtitles.vtt" srclang="en" label="">  
 <track src="subtitles.vtt" srclang="en" label="English">  

default

A track with the default attribute set will automatically be enabled unless the user's preferences indicate a more appropriate setting. default can be set in the following ways:

Example 4 - default examples (All three are equivalent)
 <track src="subtitles.vtt" srclang="en" default>  
 <track src="subtitles.vtt" srclang="en" default="default">  
 <track src="subtitles.vtt" srclang="en" default="">  
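
The attribute rules in this section can be checked mechanically. The following Python sketch is only an illustration: the function names and the use of a plain dictionary to represent a track element are assumptions made for this example, not part of any specification or library. It treats omitted and empty attributes as equivalent, applies the default kind of subtitles, and reports tracks that break the rules described above.

```python
def track_key(attrs):
    """Return the (kind, srclang, label) triple that must be unique per track."""
    # Omitted and empty attributes are equivalent; kind defaults to "subtitles".
    kind = attrs.get("kind") or "subtitles"
    srclang = attrs.get("srclang") or ""
    label = attrs.get("label") or ""
    return (kind, srclang, label)


def check_tracks(tracks):
    """Return a list of problems found in a list of track attribute dicts.

    The dict representation is a hypothetical one chosen for this sketch.
    """
    problems = []
    seen = {}  # maps each triple to the index of the first track that used it
    for index, attrs in enumerate(tracks):
        key = track_key(attrs)
        kind, srclang, _label = key
        if not attrs.get("src"):
            problems.append(f"track {index}: src is required and cannot be empty")
        if kind == "subtitles" and not srclang:
            problems.append(f"track {index}: srclang is required for subtitles")
        if key in seen:
            problems.append(
                f"track {index}: same kind, srclang, and label as track {seen[key]}"
            )
        else:
            seen[key] = index
    return problems
```

For instance, passing the first two tracks of Example 2 as dictionaries should report the second as a duplicate of the first, because an omitted kind and kind="subtitles" are the same thing.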



WebVTT Format

The primary purpose of WebVTT files is to add subtitles to a video. This is done using a simple format that is similar to other subtitle formats such as SubRip. A WebVTT file is intended to be used in the track element in HTML5. The MIME type of WebVTT is "text/vtt".

A WebVTT file must be encoded in UTF-8 format. Where I indicate that you can use spaces, you can use spaces, tabs, or both in any combination if more than one character is allowed.

Newlines can be in Windows or Unix format. Specifically they can be a carriage return (CR), line feed (LF), or both (CR+LF). You may recognize these by the C escape sequences \r for carriage return, \n for line feed, or \r\n for both. Two newlines is the same as having a blank line.
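
Since all three newline styles are allowed, a script that reads WebVTT files may find it convenient to convert them to a single style first. The following is a minimal Python sketch of that idea; the function name is an assumption made for this example.

```python
import re


def normalize_newlines(text):
    """Convert CR+LF and lone CR line endings to LF."""
    # CR+LF is listed first so that it becomes one LF rather than two.
    return re.sub(r"\r\n|\r", "\n", text)
```

After this step a blank line is always the two-character sequence "\n\n", regardless of how the file was saved.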

Example 5 - Simplest possible WebVTT file
WEBVTT


Example 6 - Simple WebVTT example
WEBVTT

00:00:22.230 --> 00:00:24.606
Nobody lives here now.

00:00:30.739 --> 00:00:34.074
They stayed only a few hours.

00:00:34.159 --> 00:00:35.743
When they had gone,

00:00:35.827 --> 00:00:40.122
a community, which had lived for a thousand years, was dead.

00:00:43.251 --> 00:00:48.005
This is Oradour-sur-Glane in France.

Before we describe how to structure a WebVTT file, it is beneficial to describe cues, which are the real meat and potatoes of subtitles.

WebVTT Cues

A cue is a single subtitle block that has a single start time, end time, and textual payload. Example 6 consists of the header, a blank line, and then five cues separated by blank lines. A cue consists of exactly five components:
  1. Optionally, a cue identifier followed by a new line
  2. Cue timings
  3. If cue settings are used, at least one space followed by the cue settings.
  4. A new line
  5. The cue payload (text)
Example 7 - Example of a cue
1 - Title Crawl
00:00:05.000 --> 00:00:10.000 line:0 position:20% size:60% align:start
A long time ago in a galaxy far,
far away....

Cue Identifier

The identifier is a name that identifies the cue. It is typically used to reference the cue from a script. The only requirements are that it must not contain a new line character, cannot contain the string --> which is used to identify cue timings, and it must end with a single new line character. There is no requirement that they are unique, although it is common to number them (1, 2, 3, ...).

Example 8 - Cue identifier from Example 7
1 - Title Crawl

Example 9 - Common usage of identifiers
WEBVTT

1
00:00:22.230 --> 00:00:24.606
Nobody lives here now.

2
00:00:30.739 --> 00:00:34.074
They stayed only a few hours.

3
00:00:34.159 --> 00:00:35.743
When they had gone,

4
00:00:35.827 --> 00:00:40.122
a community, which had lived for a thousand years, was dead.

5
00:00:43.251 --> 00:00:48.005
This is Oradour-sur-Glane in France.

Cue Timings

A cue timing indicates when the cue is shown. It has a start and end time, which are represented by timestamps. The end time must be greater than the start time, and the start time must be greater than or equal to all previous start times. Cues may have overlapping timings, although if two overlapping cues are in the same position they may be unreadable.

If the WebVTT file is being used for chapters (kind attribute is set to chapters for the track in HTML5) then the file cannot have overlapping timings.

The timestamps must be in one of two formats:
  1. mm:ss.ttt
  2. hh:mm:ss.ttt
Where the components are defined as follows.
  • hh is hours
    • must be at least two digits (00 is allowed, as in the examples in this post)
    • hours can be greater than two digits (9999:00:00.000)
  • mm is minutes
    • must be between 00 and 59 inclusive
  • ss is seconds
    • must be between 00 and 59 inclusive
  • ttt is milliseconds
    • must be between 000 and 999 inclusive
    • I used ttt because milliseconds are thousandths of a second
Each cue timing contains exactly five components:
  1. Timestamp for start time
  2. At least one space
  3. The string -->
  4. At least one space
  5. Timestamp for end time which must be greater than the start time
Example 10 - Cue timing examples
00:22.230 --> 00:24.606
00:30.739 --> 00:00:34.074
00:00:34.159 --> 00:35.743
00:00:35.827 --> 00:00:40.122
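
To make the timestamp rules concrete, here is a small Python sketch that converts a timestamp to milliseconds and reads a bare cue timing line. The function names and the regular expressions are my own assumptions for illustration; they follow the syntax rules listed above rather than the more forgiving parsing rules, and they do not handle cue settings after the end time.

```python
import re

# Optional hours (two or more digits), then mm:ss.ttt with mm and ss in 00-59.
TIMESTAMP = re.compile(r"(?:([0-9]{2,}):)?([0-5][0-9]):([0-5][0-9])\.([0-9]{3})")
# start, one or more spaces or tabs, "-->", one or more spaces or tabs, end.
TIMING = re.compile(r"(\S+)[ \t]+-->[ \t]+(\S+)")


def parse_timestamp(text):
    """Return the timestamp as a number of milliseconds."""
    match = TIMESTAMP.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid timestamp: {text!r}")
    hours, minutes, seconds, millis = match.groups()
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def parse_cue_timing(line):
    """Return (start_ms, end_ms) for a timing line without cue settings."""
    match = TIMING.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"invalid cue timing: {line!r}")
    start = parse_timestamp(match.group(1))
    end = parse_timestamp(match.group(2))
    if end <= start:
        raise ValueError("end time must be greater than start time")
    return start, end
```

With this sketch, all four lines of Example 10 are accepted, while a timestamp with a one-digit seconds field such as 00:00:5.000 is rejected.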

Example 11 - Overlapping cue timing examples
00:00:00.000 --> 00:00:10.000
00:00:05.000 --> 00:01:00.000
00:00:30.000 --> 00:00:50.000

Example 12 - Non-overlapping cue timing examples
00:00:00.000 --> 00:00:10.000
00:00:10.000 --> 00:01:00.581
00:01:00.581 --> 00:02:00.100
00:02:01.000 --> 00:02:02.000
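
The ordering and overlap rules can also be expressed as two short checks. This Python sketch assumes that the timings have already been converted to (start, end) pairs in milliseconds; the function names are hypothetical. The overlap check is the one that matters for chapter tracks, which cannot have overlapping timings.

```python
def starts_are_ordered(timings):
    """True if no cue starts before the cue that precedes it in the file."""
    return all(
        earlier[0] <= later[0] for earlier, later in zip(timings, timings[1:])
    )


def has_overlap(timings):
    """True if any cue starts before an earlier cue has ended.

    Assumes the timings are already ordered by start time.
    """
    latest_end = 0
    for start, end in timings:
        if start < latest_end:
            return True
        latest_end = max(latest_end, end)
    return False
```

A cue that starts exactly when the previous one ends does not count as overlapping in this sketch, which matches the non-overlapping example above.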

Cue Settings

Cue settings are used to position where the cue payload text will be displayed on the video. This includes whether the text is displayed horizontally or vertically. They are optional. You can use zero or more of them, and they can be used in any order, so long as each setting is used no more than once.

The cue settings are added to the right of the cue timings. There must be one or more spaces between the cue timing and the first setting, and between each cue setting. Each cue setting's name and value are separated by a colon. The settings are case sensitive, so use lower case as shown. There are exactly five cue settings:
  1. vertical
    • Indicates that the text will be displayed vertically rather than horizontally, such as in some Asian languages.
    • Values are either rl for right to left, or lr for left to right
    • Example 13 - vertical usage
      vertical:rl
      vertical:lr

  2. line
    • Specifies where text appears vertically. If vertical is set, specifies where text appears horizontally.
    • Value can be a line number
      • The line height is the height of the first line of the cue as it appears on the video.
      • If the line number is positive, it is starting from the top of the video. If it is negative, it is starting from the bottom.
      • The video height may not be an exact multiple of the line height, so positive line numbers are unlikely to place text exactly at the bottom. Use negative numbers when working from the bottom of the video.
    • Or value can be a percentage
      • Must be an integer (no decimals) between 0 and 100 inclusive.
      • Must be followed by a percent sign (%).
    • Table 1 - line examples
      Setting     none     vertical:rl   vertical:lr
      line:0      top      right         left
      line:-1     bottom   left          right
      line:0%     top      right         left
      line:100%   bottom   left          right

  3. position
    • Specifies where the text will appear horizontally. If vertical is set, position specifies where the text will appear vertically.
    • Value is a percentage
      • Must be an integer (no decimals) between 0 and 100 inclusive.
      • Must be followed by a percent sign (%).
    • Table 2 - position examples
      Setting         none    vertical:rl   vertical:lr
      position:0%     left    top           top
      position:100%   right   bottom        bottom

  4. size
    • Specifies the width of the text area. If vertical is set, size specifies the height of the text area.
    • Value is a percentage
      • Must be an integer (no decimals) between 0 and 100 inclusive.
      • Must be followed by a percent sign (%).
    • Table 3 - size examples
      Setting     none         vertical:rl   vertical:lr
      size:100%   full width   full height   full height
      size:50%    half width   half height   half height

  5. align
    • Specifies the alignment of the text. Text is aligned within the space given by the size cue setting if it is set.
    • Values are start, middle, and end.
    • Table 4 - align examples
      Setting        none                   vertical:rl          vertical:lr
      align:start    left                   top                  top
      align:middle   centred horizontally   centred vertically   centred vertically
      align:end      right                  bottom               bottom

Example 14 - Cue setting examples
00:00:05.000 --> 00:00:10.000
00:00:05.000 --> 00:00:10.000 line:63% position:72% align:start
00:00:05.000 --> 00:00:10.000 line:0 position:20% size:60% align:start
00:00:05.000 --> 00:00:10.000 vertical:rl line:-1 align:end

In Example 14, the first example demonstrates no settings. The second example is what you might use to overlay text over a sign or writing. The third example might be used for a title. The last example might be used for an Asian language.
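
The rules for cue settings lend themselves to a small validator. The Python sketch below is an illustration under my own assumptions: the function and variable names are invented for this example, and it checks only the rules described in this section (five lower-case setting names, each used at most once, with the value formats listed above).

```python
import re

PERCENT = re.compile(r"([0-9]+)%")
LINE_NUMBER = re.compile(r"-?[0-9]+")


def _is_percentage(value):
    """True for an integer from 0 to 100 followed by a percent sign."""
    match = PERCENT.fullmatch(value)
    return match is not None and int(match.group(1)) <= 100


# One validation function per setting name; names are case sensitive.
VALIDATORS = {
    "vertical": lambda v: v in ("rl", "lr"),
    "line": lambda v: LINE_NUMBER.fullmatch(v) is not None or _is_percentage(v),
    "position": _is_percentage,
    "size": _is_percentage,
    "align": lambda v: v in ("start", "middle", "end"),
}


def parse_cue_settings(text):
    """Return a dict of setting name to value, or raise ValueError."""
    settings = {}
    for item in text.split():  # settings are separated by spaces or tabs
        name, colon, value = item.partition(":")
        if name not in VALIDATORS:
            raise ValueError(f"unknown cue setting: {name!r}")
        if name in settings:
            raise ValueError(f"cue setting used more than once: {name!r}")
        if not colon or not VALIDATORS[name](value):
            raise ValueError(f"invalid value for {name}: {value!r}")
        settings[name] = value
    return settings
```

The text passed in is everything to the right of the end timestamp, so an empty string simply yields an empty dictionary.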

Cue Payload

The payload is where the main information is. In normal usage the payload contains the subtitles to be displayed. If you recall from the section on the HTML5 track element, track has 5 values for kind: subtitles, captions, descriptions, chapters, and metadata.

The payload text may contain newlines but it cannot contain a blank line, which is two newlines in a row. If you wish to get around that use a newline, a space, and then another newline. It should look the same. A blank line signifies the end of a cue. A cue text payload may not contain the string -->.

If you are using the WebVTT file for metadata, there are no further restrictions concerning the text. Otherwise you cannot use the ampersand character (&) or the less-than sign (<). Instead use the escape sequences "&amp;" for ampersand and "&lt;" for less-than. It is also recommended that you use the greater-than escape sequence "&gt;" instead of the character (>), despite the fact that it is allowed. This is to avoid confusion with tags.

In addition to the three escape sequences mentioned above, there are three others. They are all listed below in Table 5. If the WebVTT file is being used for metadata, escape sequences may not work, because that depends on whatever scripts are using the metadata.

Table 5 - Escape sequences
Name                 Character          Escape Sequence
Ampersand            &                  &amp;
Less-than            <                  &lt;
Greater-than         >                  &gt;
Left-to-right mark   (U+200E)           &lrm;
Right-to-left mark   (U+200F)           &rlm;
Non-breaking space   (U+00A0)           &nbsp;
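
If you generate cue text from a script, the escaping can be automated. This Python sketch shows one possible way to do it; the function names are my own, and the unescape function covers only the six sequences in Table 5.

```python
import re

# The six escape sequences from Table 5, keyed by name.
ESCAPES = {
    "amp": "&",
    "lt": "<",
    "gt": ">",
    "lrm": "\u200e",   # left-to-right mark
    "rlm": "\u200f",   # right-to-left mark
    "nbsp": "\u00a0",  # non-breaking space
}


def escape_cue_text(text):
    """Escape &, < and > so the text is safe to use in a cue payload."""
    # The ampersand must be replaced first, or "&lt;" would become "&amp;lt;".
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def unescape_cue_text(text):
    """Replace the six known escape sequences with their characters."""
    return re.sub(
        r"&(amp|lt|gt|lrm|rlm|nbsp);",
        lambda match: ESCAPES[match.group(1)],
        text,
    )
```

Escaping and then unescaping returns the original text, which is a quick way to check that nothing was escaped twice.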

Cue Payload Text Tags

There are also a number of tags, such as bold (<b>text</b>), that can be used. However, if you are using the WebVTT file in a track where kind is chapters, then you cannot use tags.

Before I describe the rest of the tags, I'd like to cover one important and special tag. You can use timestamps inside the cue payload. The timestamp must be greater than the cue's start timestamp, greater than any previous timestamp in the cue payload, and less than the cue's end timestamp. The active text is the text between a timestamp and the next timestamp, or the end of the payload if there is no later timestamp. Any text before the active text in the payload is previous text. Any text beyond the active text is future text. This enables karaoke style captions.

Example 15 - Karaoke style text
1
00:16.500 --> 00:18.500
When the moon <00:17.500>hits your eye

1
00:00:18.500 --> 00:00:20.500
Like a <00:19.000>big-a <00:19.500>pizza <00:20.000>pie

1
00:00:20.500 --> 00:00:21.500
That's <00:00:21.000>amore
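
To illustrate how previous, active, and future text are determined, here is a Python sketch that splits a payload at a given playback time. Everything in it is an assumption made for illustration: the function names, the use of milliseconds, and the choice to treat text before the first timestamp as starting at the cue's start time and to treat a timestamp as reached when the playback time is equal to it.

```python
import re

# An inline timestamp tag such as <00:19.000> or <00:00:21.000>.
INLINE = re.compile(r"<((?:[0-9]{2,}:)?[0-5][0-9]:[0-5][0-9]\.[0-9]{3})>")


def to_ms(stamp):
    """Convert mm:ss.ttt or hh:mm:ss.ttt to milliseconds."""
    numbers = [int(part) for part in stamp.replace(".", ":").split(":")]
    if len(numbers) == 3:  # no hours component
        numbers.insert(0, 0)
    hours, minutes, seconds, millis = numbers
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def split_karaoke(payload, cue_start_ms, now_ms):
    """Return (previous, active, future) text at playback time now_ms."""
    # re.split with one capturing group alternates text and timestamps.
    pieces = INLINE.split(payload)
    segments = [(cue_start_ms, pieces[0])]
    for i in range(1, len(pieces), 2):
        segments.append((to_ms(pieces[i]), pieces[i + 1]))

    previous, active, future = [], [], []
    for index, (start, text) in enumerate(segments):
        is_last = index + 1 == len(segments)
        if now_ms < start:
            future.append(text)
        elif not is_last and now_ms >= segments[index + 1][0]:
            previous.append(text)
        else:
            active.append(text)
    return "".join(previous), "".join(active), "".join(future)
```

For the second cue of Example 15 at 19.6 seconds, this sketch gives "Like a big-a " as previous text, "pizza " as active text, and "pie" as future text.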

In addition to the timestamp tag, there are 7 other tags. Unlike the timestamp tag, these additional tags require opening and closing tags (<b>text</b>).
  1. Class tag (<c></c>)
    • Style the contained text using a CSS class.
    • Example 16 - Class tag
      <c.classname>text</c>
  2. Italics tag (<i></i>)
    • Italicize the contained text.
    • Example 17 - Italics tag
      <i>text</i>
  3. Bold tag (<b></b>)
    • Bold the contained text.
    • Example 18 - Bold tag
      <b>text</b>
  4. Underline tag (<u></u>)
    • Underline the contained text.
    • Example 19 - Underline tag
      <u>text</u>
  5. Ruby tag (<ruby></ruby>)
    • Used with ruby text tags to display ruby characters (small annotative characters above other characters).
    • Example 20 - Ruby tag
      <ruby>汉 <rt> hàn </rt>字 <rt> zì  </rt></ruby>
  6. Ruby text tag (<rt></rt>)
    • Used with ruby tags to display ruby characters (small annotative characters above other characters).
    • Example 21 - Ruby text tag
      <ruby>汉 <rt> hàn </rt>字 <rt> zì  </rt></ruby>
  7. Voice tag (<v></v>)
    • Similar to class tag, also used to style the contained text using CSS.
    • Example 22 - Voice tag
      <v Bob>text</v>
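
A script that only needs the plain text of a cue, or the names used in voice tags, can get by with two regular expressions. The Python sketch below is a rough illustration with invented function names; it assumes that literal < characters in the text have been escaped as described earlier, and it is not a full tag parser.

```python
import re

ANY_TAG = re.compile(r"<[^>]*>")           # any opening, closing, or timestamp tag
VOICE_TAG = re.compile(r"<v[ \t]+([^>]+)>")  # a voice tag and its name


def strip_cue_tags(payload):
    """Return the payload with every tag removed."""
    return ANY_TAG.sub("", payload)


def voice_names(payload):
    """Return the names used in voice tags, in order of appearance."""
    return VOICE_TAG.findall(payload)
```

Note that the ruby text inside <rt> tags is kept by strip_cue_tags, so annotated text will include its annotations.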

WebVTT Body

The structure of a WebVTT file is fairly simple. It has two required components and four optional ones.
  1. The first character may be an optional byte order mark (BOM). If you don't know what it is, don't worry about it.
  2. The string WEBVTT is required.
  3. You may optionally add text as a header to the right of WEBVTT. You could use this to add a description of the file. You may use anything except newlines or the string -->. If you add text there must be at least one space after WEBVTT.
  4. A blank line (two newlines in a row) is required, unless the header is followed directly by a cue, in which case the cue timing line (the line containing the string -->) marks the start of the cues.
  5. You may have zero or more cues.
  6. You may have zero or more blank lines.
Example 16 - Simplest possible WebVTT file
WEBVTT


Example 17 - Very simple WebVTT file
WEBVTT - This file has no cues.


Example 18 - Common WebVTT example
WEBVTT - This file has cues.

14
00:01:14.815 --> 00:01:18.114
- Hmm ?
- Suddenly there was
a terrible roar all around us...

15
00:01:18.171 --> 00:01:20.991
and the sky was full of
what looked like huge bats...

16
00:01:21.058 --> 00:01:23.868
- [ Bats Screeching ]
- all swooping and screeching
and diving around the car.
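
Putting the pieces of the file structure together, the following Python sketch splits a WebVTT file into its header line and a list of cues. It is a simplified illustration under stated assumptions: the function name and the dictionary layout are my own, it expects a blank line after the header (it does not handle a cue that starts directly after the header), and it does not validate timings or settings.

```python
import re


def parse_webvtt(text):
    """Return (header_line, cues) for the text of a WebVTT file."""
    if text.startswith("\ufeff"):  # drop an optional byte order mark
        text = text[1:]
    text = re.sub(r"\r\n|\r", "\n", text)  # accept CR, LF, or CR+LF
    # Blocks are separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n{2,}", text.strip("\n")) if b]
    if not blocks:
        raise ValueError("file is empty")

    header = blocks[0]
    valid_start = header == "WEBVTT" or header.startswith(("WEBVTT ", "WEBVTT\t"))
    if not valid_start or "\n" in header or "-->" in header:
        raise ValueError("file must start with a single WEBVTT header line")

    cues = []
    for block in blocks[1:]:
        lines = block.split("\n")
        identifier = None
        if "-->" not in lines[0]:  # the first line is a cue identifier
            identifier = lines.pop(0)
        if not lines or "-->" not in lines[0]:
            raise ValueError(f"cue has no timing line: {block!r}")
        cues.append(
            {"id": identifier, "timing": lines[0], "payload": "\n".join(lines[1:])}
        )
    return header, cues
```

Running this on a file like Example 18 should give the header "WEBVTT - This file has cues." and three cues with the identifiers 14, 15, and 16.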


CSS Styling for WebVTT

To style cues in CSS, four new selectors have been added: two pseudo-elements (::cue and ::cue(selector)) and two pseudo-classes (:past and :future). Although it would seem that only class (<c>) and voice (<v>) tags can be selected, any of the seven tag spans can be selected. Timestamp tags cannot be selected. It also appears that you can add classes to any of the tags, just like the class tag, and then use a class selector in CSS. However, it is recommended to use CSS only on class and voice tags.

The following are the new selectors:
  1. ::cue
    • Matches all cues
    • Example 19 - ::cue
      video::cue { background-color:white; color:black; }
      

  2. ::cue(selector)
    • Matches selected cues
    • Example 20 - ::cue(selector)
      video::cue(c.classpink) { text-outline:2px 2px pink; }
      video::cue(v) { font-family:"Times New Roman"; }
      video::cue(v[voice="Bob"]) { color:blue; }
      

  3. :past
    • Matches all text previous to the active text in a cue with timestamp tags.
    • Example 21 - :past
      video::cue(:past) { color:green; }
      

  4. :future
    • Matches all text beyond the active text in a cue with timestamp tags.
    • Example 22 - :future
      video::cue(:future) { color:red; }
      
Not all CSS properties can be used. Only the following can:
  1. color
  2. opacity
  3. visibility
  4. text-decoration
  5. text-outline
  6. text-shadow
  7. all properties corresponding to the background shorthand
  8. all properties corresponding to the outline shorthand
  9. all properties corresponding to the font shorthand (including line-height)
    • Except if :future or :past is used in the selector
  10. white-space
    • Except if :future or :past is used in the selector
  11. properties relating to the transition and animation features
    • Only when using ::cue(selector)

The following table lists WebVTT elements that can be selected. The voice tag includes the attribute voice which corresponds to the name used in the tag (<v Bob>text</v>).

Table 6 - WebVTT CSS
Tag Name    Element   Attributes
Class       c
Italic      i
Bold        b
Underline   u
Ruby        ruby
Ruby Text   rt
Voice       v         voice

Example 23 - CSS Example
video::cue { background-color:white; color:black; }
video::cue(c.music) { text-outline:2px 2px pink; }
video::cue(v) { font-family:"Times New Roman"; }
video::cue(v[voice="Bob"]) { color:blue; }
video::cue(v[voice="Narrator"]) { color:grey; }
video::cue(v[voice="Doug"]) { color:red; }
video::cue(:past) { color:green; }
video::cue(:future) { color:teal; }
