****************************
* ISO 14496-1 Media Format *
****************************

- values use big endian (network) byte order
- general terms: integer = signed value
- general values: byte/char/octet = 8-bit value; short/word = 16-bit value; long = 32-bit value
- fixed point values: value made up of an integer for the whole part and an unsigned value for the fractional part
- binary values: base-2 long unsigned values (digits 0 and 1)
- octal values: base-8 long unsigned values (digits 0 through 7)
- decimal values: base-10 long unsigned values (digits 0 through 9)
- hexadecimal (hex) values: base-16 long unsigned values (digits 0 to 9 and A to F)
- box offsets: values relative to the start of the box, used to skip to the next box
- sample chunk/block offsets: values relative to the start of the file
- UUID: a hexadecimal Universal Unique Identifier that is 128 bits in length

FILE INFO

Suffixes = ".mp4", ".m4a"; Mac OS Type = "mpg4"; Mac OS Creator = "TVOD"; MIME = "video/mp4" and "audio/mp4"

Standard single-fork binary file. A resource fork is used only on HFS/HFS+ volumes, where it stores Mac-specific file info and QuickTime movie previews and can hold a QuickTime version of the file's header. That header copy is valid only if the file is transcoded to the QuickTime format, because other storage media may not use or support multiple file forks.

Unknown boxes can be safely skipped over, and most boxes can appear in any order. Most lowercase long ASCII text strings used for box names/types were pre-defined by Apple; any others are reserved for future use by Apple and the ISO. Custom boxes are discouraged; use only ISO-defined ones.

A box type is either a standard 4-byte atom type string or a 16-byte (128-bit) UUID. A UUID is appended after the standard type of 'uuid'. If the box offset is equal to one, a 64-bit box offset is appended directly after the 4-byte box type string (ahead of any UUID). 'wide' boxes, as used with the 'mdat' box, can be used with other box types as needed.
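The box header rules above can be sketched in a few lines of Python. This is only an illustration: the names BoxHeader and read_box_header are made up for this example, and the field order (32-bit offset, 4-byte type, optional 64-bit offset, then the 16-byte UUID for 'uuid' boxes) is assumed to follow the ISO base media layout.

```python
import struct
from typing import NamedTuple, Optional


class BoxHeader(NamedTuple):
    box_type: bytes          # 4-byte type string, e.g. b"moov"
    size: int                # total box size in bytes (0 = runs to end of file)
    header_len: int          # bytes used by the header itself
    uuid: Optional[bytes]    # 16-byte UUID, present only for 'uuid' boxes


def read_box_header(buf: bytes, pos: int = 0) -> BoxHeader:
    """Decode one box header starting at buf[pos] (hypothetical helper)."""
    # All values are big endian: 32-bit unsigned offset + 4-byte type string.
    size, box_type = struct.unpack_from(">I4s", buf, pos)
    header_len = 8
    if size == 1:
        # An offset of 1 means a 64-bit offset follows the type string.
        (size,) = struct.unpack_from(">Q", buf, pos + header_len)
        header_len += 8
    uuid = None
    if box_type == b"uuid":
        # Assumed order: the 128-bit UUID comes after any 64-bit offset.
        uuid = bytes(buf[pos + header_len:pos + header_len + 16])
        if len(uuid) != 16:
            raise ValueError("truncated UUID in box header")
        header_len += 16
    return BoxHeader(box_type, size, header_len, uuid)
```

An unknown box can then be skipped by moving to pos + size, which is all a reader needs in order to ignore box types it does not understand.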
The term QUICKTIME denotes an unused atom/box or item from the format that this one was based upon. The terms 3GPP and APPLE denote custom additions to the format. Even though the original ISO specification is static, Apple members have added parts of the 3GPP and iTunes versions as extensions, such as those in parts 10 and 12.

FILE IDENTIFICATION

* 8+ bytes file type box = long unsigned offset + long ASCII text string 'ftyp'
  -> 4 bytes major brand = long ASCII text main type string
  -> 4 bytes major brand version = long unsigned main type revision value
  -> 4+ bytes compatible brands = list of long ASCII text used technology strings
    - types are ISO 14496-1 Base Media = isom ; ISO 14496-12 Base Media = iso2
    - types are ISO 14496-1 vers. 1 = mp41 ; ISO 14496-1 vers. 2 = mp42
    - types are quicktime movie = 'qt  ' ; JVT AVC = avc1
    - types are 3G MP4 profile = '3gp' + ASCII value ; 3G Mobile MP4 = mmp4
    - types are Apple AAC audio w/ iTunes info = 'M4A ' ; AES encrypted audio = 'M4P '
    - types are Apple audio w/ iTunes position = 'M4B ' ; ISO 14496-12 MPEG-7 meta data = 'mp71'
    - NOTE: all are compatible with 'isom'; vers. 1 uses no Scene Description Tracks; vers. 2 uses the full part one spec; M4A uses custom ISO 14496-12 info; qt means the format complies with the original Apple spec; 3gp uses sample descriptions in the same style as the original Apple spec.

FILE MEDIA DATA

Note: if any box grows in excess of 2^32 bytes (about 4.3 GB), the box size can be extended to a 64-bit value (up to about 18.4 EB) by setting the box size to 1 and appending a new 64-bit box size. This is why empty 'wide' boxes may be found on either side of this box header: they reserve room for future expansion of the sample data. By setting the box size to 0, the media data box is open ended and extends to the end of the file.
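The role of the 'wide' placeholder in the note above can be illustrated with a short Python sketch. It assumes the common arrangement in which an empty 8-byte 'wide' box sits immediately before the 8-byte 'mdat' header, so that the 16 bytes can be rewritten in place as one 64-bit header. The function name promote_wide_mdat is invented for this example.

```python
import struct


def promote_wide_mdat(headers: bytes, payload_len: int) -> bytes:
    """Rewrite a 'wide' + 'mdat' header pair as a single 64-bit 'mdat' header.

    headers:     the 16 bytes holding the empty 'wide' box and the mdat header
    payload_len: number of sample-data bytes that follow those 16 bytes
    """
    wide_size, wide_type, _old_size, mdat_type = struct.unpack(">I4sI4s", headers)
    if (wide_size, wide_type, mdat_type) != (8, b"wide", b"mdat"):
        raise ValueError("expected an empty 'wide' box directly before 'mdat'")
    # The new box starts where 'wide' started, so its size covers the
    # 16 header bytes plus the sample data. Offset 1 flags the 64-bit size.
    return struct.pack(">I4sQ", 1, b"mdat", 16 + payload_len)
```

Because the replacement header occupies exactly the same 16 bytes, no sample data has to move and no chunk offsets change when the media data outgrows a 32-bit size.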
* 8+ bytes media (sample) data box = long unsigned offset + long ASCII text string 'mdat' -> 8 bytes larger file offset place holder box = long unsigned offset set to 8 + long ASCII text string 'wide' OR -> 8 bytes wider mdat box offset = 64-bit unsigned offset - only if mdat standard offset set to 1 -> Sample data = hex dump - Media with multiple tracks have sample data interleaved unless preloaded. UNUSED SPACE OR DATA TO BE DELETED/REUSED WITHIN FILE * 8+ bytes free space (current) box = long unsigned offset + long ASCII text string 'free' * 8+ bytes skip over (older) box = long unsigned offset + long ASCII text string 'skip' * 8+ bytes widen (lengthen) file box = long unsigned offset + long ASCII text string 'wide' EXTERNAL MPEG-7 META DATA ONLY * 8+ bytes optional ISO/IEC 14496-12 presentation meta data box = long unsigned offset + long ASCII text string 'meta' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) * 8+ bytes ISO/IEC 14496-12 handler reference box = long unsigned offset + long ASCII text string 'hdlr' - this box must be toward the start of the meta box -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes QUICKTIME type = long ASCII text string (eg. Media Handler = 'mhlr') -> 4 bytes subtype/meta data type = long ASCII text string - types are MPEG-7 XML = 'mp7t' ; MPEG-7 binary XML = 'mp7b' - type is APPLE meta data for iTunes reader = 'mdir' -> 4 bytes QUICKTIME manufacturer reserved = long ASCII text string (eg. Apple = 'appl' or 0) -> 4 bytes QUICKTIME component reserved flags = long hex flags (none = 0) -> 4 bytes QUICKTIME component reserved flags mask = long hex mask (none = 0) -> component type name ASCII string (eg. 
"Meta Data Handler" - no name = zero length string) -> 1 byte component name string end = byte padding set to zero - note: the quicktime spec uses a Pascal string instead of the above C string * 8+ bytes optional ISO/IEC 14496-12 MPEG-7 XML box = long unsigned offset + long ASCII text string 'xml ' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> MPEG-7 XML meta data = text dump * 8+ bytes optional ISO/IEC 14496-12 MPEG-7 binary XML box = long unsigned offset + long ASCII text string 'bxml' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> MPEG-7 encoded XML meta data = hex dump * 8+ bytes optional ISO/IEC 14496-12 item location box = long unsigned offset + long ASCII text string 'iloc' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 1 nibble size of access offsets = 4 bits one byte multiples - 8-bit offset = 0 ; 32-bit offset = 4 ; 64-bit offset = 8 -> 1 nibble size of data lengths = 4 bits one byte multiples - 8-bit offset = 0 ; 32-bit offset = 4 ; 64-bit offset = 8 -> 1 nibble size of starting offset = 4 bits one byte multiples - 8-bit offset = 0 ; 32-bit offset = 4 ; 64-bit offset = 8 -> 1 nibble reserved = 4 bits set to zero -> 2 bytes number of locations = short unsigned index total -> 2+ bytes item reference = short unsigned id -> 2+ bytes stream data reference = short unsigned index from 'dref' box - if meta data item in same file set to zero -> 1-8+ bytes starting offset = byte - dlong unsigned offset -> 2+ bytes number of access points = short unsigned index total -> 1-8+ bytes access offset = byte - dlong unsigned relative offset (relative to starting offset) -> 1-8+ bytes data length = byte - dlong unsigned length * 8+ bytes optional ISO/IEC 14496-12 primary item box = long unsigned offset + long ASCII text string 'pitm' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes main item reference = short unsigned id * 8+ bytes 
optional ISO/IEC 14496-12 item encryption box = long unsigned offset + long ASCII text string 'ipro' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes number of encryption boxes = short unsigned index total * 8+ bytes ISO/IEC 14496-12 encryption scheme info box = long unsigned offset + long ASCII text string 'sinf' - if meta data encrypted to ISO/IEC 14496-12 standards * 8+ bytes ISO/IEC 14496-12 original format box = long unsigned offset + long ASCII text string 'frma' -> 4 bytes description format = long ASCII text string * 8+ bytes optional ISO/IEC 14496-12 IPMP info box = long unsigned offset + long ASCII text string 'imif' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> IPMP descriptors = hex dump from IPMP part of ES Descriptor box * 8+ bytes optional ISO/IEC 14496-12 scheme type box = long unsigned offset + long ASCII text string 'schm' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0 ; contains URI if flags = 0x000001) -> 4 bytes encryption type = long ASCII text string - types are 128-bit AES counter = 'ACM1' ; 128-bit AES FS = 'AFS1' - types are NULL algorithm = 'ENUL' ; 160-bit HMAC-SHA-1 = 'SHM2' - types are RTCP = 'ANUL' ; private scheme = ' ' -> 2 bytes encryption version = short unsigned version -> optional scheme URI string = UTF-8 text string (eg. 
web site)
        -> 1 byte optional scheme URI string end = byte padding set to zero
      * 8+ bytes ISO/IEC 14496-12 scheme data box = long unsigned offset + long ASCII text string 'schi'
        -> encryption related key = hex dump
  * 8+ bytes optional ISO/IEC 14496-12 item information box = long unsigned offset + long ASCII text string 'iinf'
    -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0)
    -> 2 bytes main item reference = short unsigned id
    -> 2 bytes encryption box array value = short unsigned index
    -> item name or URL string = UTF-8 text string
    -> 1 byte name or URL c string end = byte value set to zero
    -> item mime type string = UTF-8 text string
    -> 1 byte mime type c string end = byte value set to zero
    -> optional item transfer encoding string = UTF-8 text string
    -> 1 byte optional transfer encoding c string end = byte value set to zero

FILE MEDIA HEADER

Note: the header is safer when stored at the beginning of the file or in another file fork as HFS resource type 'moov' (any ID). The advantage of using another file fork is that the header can be lengthened without recalculating the sample offsets or having to write a new header at the end of the file.
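As an illustration of the note above, the following Python sketch walks the top-level boxes of a file image and reports whether the 'moov' header comes before the 'mdat' sample data. The helper names top_level_boxes and header_is_first are invented for this example, and only the data fork is examined; a header kept in a resource fork is outside its scope.

```python
import struct
from typing import Iterator, Tuple


def top_level_boxes(data: bytes) -> Iterator[Tuple[bytes, int, int]]:
    """Yield (type, start, size) for each top-level box in a file image."""
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header_len = 8
        if size == 1:
            # 64-bit box size stored after the type string
            if pos + 16 > len(data):
                raise ValueError("truncated 64-bit box size")
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header_len = 16
        elif size == 0:
            # open-ended box: extends to the end of the file
            size = len(data) - pos
        if size < header_len:
            raise ValueError(f"bad box size {size} at offset {pos}")
        yield box_type, pos, size
        pos += size


def header_is_first(data: bytes) -> bool:
    """True if a 'moov' box is found before any 'mdat' box."""
    for box_type, _start, _size in top_level_boxes(data):
        if box_type == b"moov":
            return True
        if box_type == b"mdat":
            return False
    return False
```

When the header sits in front of the sample data, lengthening it shifts every sample byte, which is why the chunk offsets (relative to the start of the file) would need to be recalculated.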
* 8+ bytes movie (presentation) box = long unsigned offset + long ASCII text string 'moov' * 8+ bytes QUICKTIME movie data reference atom = long unsigned offset + long ASCII text string 'mdra' - if this is used no other atoms or boxes should be present at this level * 8+ bytes data reference atom = long unsigned offset + long ASCII text string 'dref' -> 4 bytes reference type name = long ASCII text string - types are file alias = 'alis' ; resource alias = 'rsrc' ; - types are url c string = 'url ' -> 4 bytes reference version/flags = byte hex version (current = 0) + 24-bit hex flags - some flags are external data = 0x000000 ; internal data = 0x000001 -> mac os file alias record structure OR -> mac os file alias record structure plus resource info OR -> url c string = ASCII text string -> 1 byte url c string end = byte value set to zero * 8+ bytes QUICKTIME compressed moov atom = long unsigned offset + long ASCII text string 'cmov' - if this is used no other atoms should be present as this is for an entire compressed movie resource * 8+ bytes data compression atom = long unsigned offset + long ASCII text string 'dcom' -> 4 bytes compression code = long ASCII text string - compression codes are Deflate = 'zlib' ; Apple Compression = 'adec' * 8+ bytes compressed moov data atom = long unsigned offset + long ASCII text string 'cmvd' -> 4 bytes uncompressed size = long unsigned value -> entire compressed movie 'moov' resource = hex dump * 8+ bytes QUICKTIME reference movie record atom = long unsigned offset + long ASCII text string 'rmra' - if this atom is used it must come first within the movie resource box * 8+ bytes reference movie descriptor atom = long unsigned offset + long ASCII text string 'rmda' * 8+ bytes reference movie data reference atom = long unsigned offset + long ASCII text string 'rdrf' -> 4 bytes reference version/flags = byte hex version (current = 0) + 24-bit hex flags - some flags are external data = 0x000000 ; internal data = 0x000001 -> 4 bytes 
reference type name = long ASCII text string (if internal = 0) - types are file alias = 'alis' ; resource alias = 'rsrc' ; - types are url c string = 'url ' -> 4+ bytes reference data = long unsigned length -> mac os file alias record structure OR -> mac os file alias record structure plus resource info OR -> url c string = ASCII text string -> 1 byte url c string end = byte value set to zero * 8+ bytes optional reference movie quality atom = long unsigned offset + long ASCII text string 'rmqu' -> 4 bytes queue position = long unsigned value from 100 to 0 * 8+ bytes optional reference movie cpu rating atom = long unsigned offset + long ASCII text string 'rmcs' -> 4 bytes reserved flag = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes speed rating = short unsigned value from 500 to 100 * 8+ bytes optional reference movie version check atom = long unsigned offset + long ASCII text string 'rmvc' -> 4 bytes flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes gestalt selector = long ASCII text string (eg. quicktime = 'qtim') -> 4 bytes gestalt min value = long hex value (eg. QT 3.02 mac file version = 0x03028000) -> 4 bytes gestalt no value = long value set to zero OR -> 4 bytes gestalt value mask = long hex mask -> 4 bytes gestalt value = long hex value -> 2 bytes gestalt check type = short unsigned value (min value = 0 or mask = 1) * 8+ bytes optional reference movie component check atom = long unsigned offset + long ASCII text string 'rmcd' -> 4 bytes flags = byte hex version + 24-bit hex flags (current = 0) -> 8 bytes component type/subtype = long ASCII text string + long ASCII text string (eg. Timecode Media Handler = 'mhlrtmcd') -> 4 bytes component manufacturer = long ASCII text string (eg. 
Apple = 'appl' or 0) -> 4 bytes component flags = long hex flags (none = 0) -> 4 bytes component flags mask = long hex mask (none = 0) -> 4 bytes component min version = long hex value (none = 0) * 8+ bytes optional reference movie data rate atom = long unsigned offset + long ASCII text string 'rmdr' -> 4 bytes flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes data rate = long integer bit rate value - common analog modem rates are 1400; 2800; 3300; 5600 - common broadband rates are 5600; 11200; 25600; 38400; 51200; 76800; 100000 - common high end broadband rates are T1 = 150000; no limit/LAN = 0x7FFFFFFF * 8+ bytes optional reference movie language atom = long unsigned offset + long ASCII text string 'rmla' -> 4 bytes flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes mac language = short unsigned language value (english = 0) * 8+ bytes optional reference movie alternate group atom = long unsigned offset + long ASCII text string 'rmag' (structure was not provided in MoviesFormat.h of the 4.1.2 win32 sdk) -> 4 bytes flags = long value set to zero -> 2 bytes alternate/other = short integer track id value (none = 0) * 8+ bytes optional initial object descriptor box = long unsigned offset + long ASCII text string 'iods' - NOTE: this was added in vers. 
2 of spec -> 4 bytes version/flags = 8-bit hex version + 24-bit hex flags -> 1 byte file IOD type tag = 8-bit hex value 0x10 -> 3 bytes extended descriptor type tag string = 3 * 8-bit hex value - types are Start = 0x80 ; End = 0xFE - NOTE: the extended start tags may be left out -> 1 byte descriptor type length = 8-bit unsigned length -> 2 bytes OD ID = 16-bit unsigned value -> 1 byte OD profile level = 8-bit unsigned value -> 1 byte scene profile level = 8-bit unsigned value -> 1 byte audio profile level = 8-bit unsigned value -> 1 byte video profile level = 8-bit unsigned value -> 1 byte graphics profile level = 8-bit unsigned value - NOTE: if level unused then set to 0xFF -> 1 byte ES ID included descriptor type tag = 8-bit hex value 0x0E -> 3 bytes extended descriptor type tag string = 3 * 8-bit hex value - types are Start = 0x80 ; End = 0xFE - NOTE: the extended start tags may be left out -> 1 byte descriptor type length = 8-bit unsigned length -> 4 bytes Track ID = 32-bit unsigned value - NOTE: refers to non-data system tracks * 8+ bytes movie (presentation) header box = long unsigned offset + long ASCII text string 'mvhd' -> 1 byte version = 8-bit unsigned value - if version is 1 then date and duration values are 8 bytes in length -> 3 bytes flags = 24-bit hex flags (current = 0) -> 4 bytes created mac UTC date = long unsigned value in seconds since beginning 1904 to 2040 -> 4 bytes modified mac UTC date = long unsigned value in seconds since beginning 1904 to 2040 OR -> 8 bytes created mac UTC date = 64-bit unsigned value in seconds since beginning 1904 -> 8 bytes modified mac UTC date = 64-bit unsigned value in seconds since beginning 1904 -> 4 bytes time scale = long unsigned time unit per second (default = 600) -> 4 bytes duration = long unsigned time length (in time units) OR -> 8 bytes duration = 64-bit unsigned time length (in time units) -> 4 bytes decimal user playback speed = long fixed point rate (normal = 1.0) -> 2 bytes decimal user volume = 
short fixed point level (mute = 0.0 ; normal = 1.0 ; QUICKTIME MAX = 3.0)
    -> 10 bytes reserved = 5 * short values set to zero
    -> 4 bytes decimal window geometry matrix value A = long fixed point width scale (normal = 1.0)
    -> 4 bytes decimal window geometry matrix value B = long fixed point width rotate (normal = 0.0)
    -> 4 bytes decimal window geometry matrix value U = long fixed point width angle (restricted to 0.0)
    -> 4 bytes decimal window geometry matrix value C = long fixed point height rotate (normal = 0.0)
    -> 4 bytes decimal window geometry matrix value D = long fixed point height scale (normal = 1.0)
    -> 4 bytes decimal window geometry matrix value V = long fixed point height angle (restricted to 0.0)
    -> 4 bytes decimal window geometry matrix value X = long fixed point position (left = 0.0)
    -> 4 bytes decimal window geometry matrix value Y = long fixed point position (top = 0.0)
    -> 4 bytes decimal window geometry matrix value W = long fixed point divider scale (restricted to 1.0)
    -> 8 bytes QUICKTIME preview = long unsigned start time + long unsigned time length (in time units)
    -> 4 bytes QUICKTIME still poster = long unsigned frame time (in time units)
    -> 8 bytes QUICKTIME selection time = long unsigned start time + long unsigned time length (in time units)
    -> 4 bytes QUICKTIME current time = long unsigned frame time (in time units)
    -> 4 bytes next/new track id = long integer value (single track = 2)
  * 8+ bytes QUICKTIME clipping (mask) atom = long unsigned offset + long ASCII text string 'clip'
    * 8+ bytes clipping region atom = long unsigned offset + long ASCII text string 'crgn'
      -> 2 bytes region size = short unsigned box size
      -> 8 bytes region boundary = long fixed point x value + long fixed point y value
      -> QuickDraw Region Data = hex dump
  * 8+ bytes track (element) box = long unsigned offset + long ASCII text string 'trak'
    * 8+ bytes track (element) header box = long unsigned offset + long ASCII text string 'tkhd'
      -> 1 byte version = byte unsigned value -
if version is 1 then date and duration values are 8 bytes in length
      -> 3 bytes flags = 24-bit unsigned flags
        - sum of TrackEnabled = 1 ; TrackInMovie = 2 ; TrackInPreview = 4 ; TrackInPoster = 8
        - MPEG-4 only defines TrackEnabled as being valid
      -> 4 bytes created mac UTC date = long unsigned value in seconds since beginning 1904 to 2040
      -> 4 bytes modified mac UTC date = long unsigned value in seconds since beginning 1904 to 2040
      OR
      -> 8 bytes created mac UTC date = 64-bit unsigned value in seconds since beginning 1904
      -> 8 bytes modified mac UTC date = 64-bit unsigned value in seconds since beginning 1904
      -> 4 bytes track id = long integer value (first track = 1)
      -> 4 bytes reserved = long value set to zero
      -> 4 bytes duration = long unsigned time length (in time units)
      OR
      -> 8 bytes duration = 64-bit unsigned time length (in time units)
        - if duration is undefined set above bits to all ones
      -> 8 bytes reserved = 2 * long value set to zero
      -> 2 bytes video layer = short integer position (middle = 0 ; negatives are in front)
      -> 2 bytes QUICKTIME alternate/other = short integer track id (none = 0)
      -> 2 bytes track audio volume = short fixed point level (mute = 0x0001 ; 100% = 1.0 ; QUICKTIME 200% max = 2.0)
      -> 2 bytes reserved = short value set to zero
      -> 4 bytes decimal video geometry matrix value A = long fixed point width scale (normal = 1.0)
      -> 4 bytes decimal video geometry matrix value B = long fixed point width rotate (normal = 0.0)
      -> 4 bytes decimal video geometry matrix value U = long fixed point width angle (restricted to 0.0)
      -> 4 bytes decimal video geometry matrix value C = long fixed point height rotate (normal = 0.0)
      -> 4 bytes decimal video geometry matrix value D = long fixed point height scale (normal = 1.0)
      -> 4 bytes decimal video geometry matrix value V = long fixed point height angle (restricted to 0.0)
      -> 4 bytes decimal video geometry matrix value X = long fixed point position (left = 0.0)
      -> 4 bytes decimal video geometry matrix value Y = long
fixed point position (top = 0.0)
      -> 4 bytes decimal video geometry matrix value W = long fixed point divider scale (restricted to 1.0)
      -> 8 bytes decimal video frame size = long fixed point width + long fixed point height
    * 8+ bytes QUICKTIME clipping (mask) atom = long unsigned offset + long ASCII text string 'clip'
      - see moov clipping atom above
    * 8+ bytes QUICKTIME matte (video overlay) atom = long unsigned offset + long ASCII text string 'matt'
      * 8+ bytes compressed matte atom = long unsigned offset + long ASCII text string 'kmat'
        -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0)
        -> Matte Image Description Structure (similar to Media Sample Description Table)
        -> Matte Data = hex dump
    * 8+ bytes optional edits (# of external tracks) box = long unsigned offset + long ASCII text string 'edts'
      - if tracks have different start times this atom is needed to maintain media sync.
      * 8+ bytes optional edit list box = long unsigned offset + long ASCII text string 'elst'
        -> 1 byte version = byte unsigned value
          - if version is 1 then duration values are 8 bytes in length
        -> 3 bytes flags = 24-bit hex flags (current = 0)
        -> 4 bytes number of edits = long unsigned total (default = 1)
        -> 8 bytes edit time = long unsigned time length + long unsigned start time (in time units)
        OR
        -> 16 bytes edit time = 64-bit unsigned time length + 64-bit unsigned start time (in time units)
          - if start time is -1, then that time length is edited out
        -> 4 bytes decimal playback speed = long fixed point rate (normal = 1.0)
    * 8+ bytes QUICKTIME preload atom = long unsigned offset + long ASCII text string 'load'
      -> 8 bytes preload time = long unsigned start time + long unsigned time length (in time units)
      -> 4 bytes flags = long integer value
        - flags are PreloadAlways = 1 or TrackEnabledPreload = 2
      -> 4 bytes default hints flags = long hex data play options
        - flags are KeepInBuffer = 0x00000004 ; HighQuality = 0x00000100 ;
        - flags are SingleFieldPlayback = 0x00100000
        -
flags are DeinterlaceFields = 0x04000000
    * 8+ bytes optional track references box = long unsigned offset + long ASCII text string 'tref'
      * 8+ bytes type of reference box = long unsigned offset + long ASCII text string
        -> vers. 1 box type is stream hint = 'hint'
        -> vers. 2 box types are other dependency = 'dpnd' ; IPI declarations = 'ipir'
        -> vers. 2 box types are elementary stream = 'mpod' ;
        -> vers. 2 box types are synchronization (video/audio) = 'sync'
        -> QUICKTIME atom types are timecode = 'tmcd' ; chapterlist = 'chap'
        -> QUICKTIME atom types are transcript (text) = 'scpt'
        -> QUICKTIME atom types are non-primary source (used in other track) = 'ssrc'
        -> 4+ bytes Track IDs = long integer track numbers (Disabled Track ID = 0)
    * 8+ bytes QUICKTIME non-primary source input map atom = long unsigned offset + long ASCII text string 'imap'
      * 8+ bytes input atom = long unsigned offset + long ASCII text string 0x0000 + 'in'
        -> 4 bytes atom ID = long integer atom reference (first ID = 1)
        -> 2 bytes reserved = short value set to zero
        -> 2 bytes number of internal atoms = short unsigned count
        -> 4 bytes reserved = long value set to zero
        * 8+ bytes input type atom = long unsigned offset + long ASCII text string 0x0000 + 'ty'
          -> 4 bytes type modifier name = long integer value
          -> name values are matrix = 1 ; clip = 2 ;
          -> name values are volume = 3 ; audio balance = 4
          -> name values are graphics mode = 5 ; matrix object = 6
          -> name values are graphics mode object = 7 ; image type = 'vide'
        * 8+ bytes object ID atom = long unsigned offset + long ASCII text string 'obid'
          -> 4 bytes object ID = long integer value
    * 8+ bytes media (stream) box = long unsigned offset + long ASCII text string 'mdia'
      * 8+ bytes media (stream) header box = long unsigned offset + long ASCII text string 'mdhd'
        -> 1 byte version = byte unsigned value
          - if version is 1 then date and duration values are 8 bytes in length
        -> 3 bytes flags = 24-bit unsigned flags (current = 0)
        -> 4 bytes created mac UTC date =
long unsigned value in seconds since beginning 1904 to 2040
        -> 4 bytes modified mac UTC date = long unsigned value in seconds since beginning 1904 to 2040
        OR
        -> 8 bytes created mac UTC date = 64-bit unsigned value in seconds since beginning 1904
        -> 8 bytes modified mac UTC date = 64-bit unsigned value in seconds since beginning 1904
        -> 4 bytes time scale = long unsigned media time unit (video = fps rate ; audio = sample per sec. rate)
        -> 4 bytes duration = long unsigned media time length (in media time units)
        OR
        -> 8 bytes duration = 64-bit unsigned time length (in time units)
        -> 1/8 byte ISO language padding = 1-bit value set to 0
        -> 1 7/8 bytes content language = 3 * 5-bit ISO 639-2 language code letters, each less 0x60
          - example code for english = 0x15C7
        -> 2 bytes QUICKTIME quality = short integer playback quality value (normal = 0)
      * 8+ bytes handler reference box = long unsigned offset + long ASCII text string 'hdlr'
        - this box must be toward the start of the media box
        -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0)
        -> 4 bytes QUICKTIME type = long ASCII text string (eg. Media Handler = 'mhlr')
        -> 4 bytes subtype/media type = long ASCII text string
          - types are Visual Media = 'vide' ; Audio Media = 'soun' ; Hint = 'hint'
          - types are Object Descriptor = 'odsm' ; Clock Reference = 'crsm'
          - types are Scene Description = 'sdsm' ; MPEG-7 Stream = 'm7sm'
          - types are Object Content Info = 'ocsm' ; IPMP = 'ipsm' ; MPEG-J = 'mjsm'
        -> 4 bytes QUICKTIME manufacturer reserved = long ASCII text string (eg. Apple = 'appl' or 0)
        -> 4 bytes QUICKTIME component reserved flags = long hex flags (none = 0)
        -> 4 bytes QUICKTIME component reserved flags mask = long hex mask (none = 0)
        -> component type name ASCII string (eg.
"Media Handler" - no name = zero length string) -> 1 byte component name string end = byte padding set to zero - note: the quicktime spec uses a Pascal string instead of the above C string * 8+ bytes media (stream) information box = long unsigned offset + long ASCII text string 'minf' * 8+ bytes visual media (stream) info header box = long unsigned offset + long ASCII text string 'vmhd' -> 4 bytes version/flags = byte hex version + 24-bit hex flags - version = 0 ; flags = 0x000001 for QUICKTIME or zero MPEG-4 -> 2 bytes QuickDraw graphic mode = short hex type - mode types are copy = 0x0000 ; dither copy = 0x0040 ; straight alpha = 0x0100 - mode types are composition dither copy = 0x0103 ; blend = 0x0020 - mode premultipled types are white alpha = 0x101 ; black alpha = 0x102 - mode color types are transparent = 0x0024; straight alpha blend = 0x0104 - NOTE: MPEG-4 only uses copy mode and quicktime uses dither copy by default -> 6 bytes graphic mode color = 3 * short unsigned QuickDraw RGB color values OR * 8+ bytes sound media (stream) info header box = long unsigned offset + long ASCII text string 'smhd' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes audio balance = short fixed point value - balnce scale is left = negatives ; normal = 0.0 ; right = positives -> 2 bytes reserved = short value set to zero OR * 8+ bytes hint stream (stream) info header box = long unsigned offset + long ASCII text string 'hint' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes maximum packet delivery unit = short unsigned value -> 2 bytes average packet delivery unit = short unsigned value -> 4 bytes maximum bit rate = long unsigned value -> 4 bytes average bit rate = long unsigned value -> 4 bytes reserved = long value set to zero OR * 8+ bytes mpeg-4 media (stream) header box = long unsigned offset + long ASCII text string 'nmhd' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) * 
8+ bytes QUICKTIME handler reference atom = long unsigned offset + long ASCII text string 'hdlr' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 8 bytes type/subtype = long ASCII text string + long ASCII text string (eg. Alias Data Handler = 'dhlralis' ; URL Data Handler = 'dhlrurl ') -> 4 bytes manufacturer reserved = long ASCII text string (eg. Apple = 'appl' or 0) -> 4 bytes component reserved flags = long hex flags (none = 0) -> 4 bytes component reserved flags mask = long hex mask (none = 0) -> 1 byte component name string length = byte unsigned length (no name = zero length string) -> component type name ASCII string (eg. "Data Handler") * 8+ bytes data (locator) information box = long unsigned offset + long ASCII text string 'dinf' * 8+ bytes data reference box = long unsigned offset + long ASCII text string 'dref' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes number of references = long unsigned total (minimum = 1) * 8+ bytes reference type box = long unsigned offset + long ASCII text string - box types are url c string = 'url ' ; urn c strings = 'urn ' - QUICKTIME atom types are file alias = 'alis' ; resource alias = 'rsrc' -> 4 bytes version/flags = byte hex version (current = 0) + 24-bit hex flags - some flags are external data = 0x000000 ; internal data = 0x000001 -> url c string = ASCII text string points to external data -> 1 byte url c string end = byte value set to zero OR -> urn c string = ASCII text string points to external data -> 1 byte urn c string end = byte value set to zero -> url c string = ASCII text string points to external data -> 1 byte url c string end = byte value set to zero OR -> QUICKTIME mac os file alias record structure points to external data OR -> QUICKTIME mac os file alias record structure plus resource info points to external data OR * 8+ bytes Data URL box = long unsigned offset + long ASCII text string 'url ' -> 4 bytes version/flags = byte hex 
version + 24-bit hex flags (current = 0) -> url c string = ASCII text string points to external data -> 1 byte url c string end = byte value set to zero OR * 8+ bytes Data URN box = long unsigned offset + long ASCII text string 'urn ' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> urn c string = ASCII text string points to external data -> 1 byte urn c string end = byte value set to zero -> url c string = ASCII text string points to external data -> 1 byte url c string end = byte value set to zero * 8+ bytes sample (framing info) table box = long unsigned offset + long ASCII text string 'stbl' * 8+ bytes sample (frame encoding) description box = long unsigned offset + long ASCII text string 'stsd' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes number of descriptions = long unsigned total (default = 1) -> 4 bytes description length = long unsigned length -> 4 bytes description visual format = long ASCII text string 'mp4v' - if encoded to ISO/IEC 14496-10 or 3GPP AVC standards then use: -> 4 bytes description visual format = long ASCII text string 'avc1' - if encrypted to ISO/IEC 14496-12 or 3GPP standards then use: -> 4 bytes description visual format = long ASCII text string 'encv' - if encoded to 3GPP H.263v1 standards then use: -> 4 bytes description visual format = long ASCII text string 's263' -> 6 bytes reserved = 48-bit value set to zero -> 2 bytes data reference index = short unsigned index from 'dref' box - there are other sample descriptions available in the Apple QT format dev docs -> 2 bytes QUICKTIME video encoding version = short hex version - default = 0 ; audio data size before decompression = 1 -> 2 bytes QUICKTIME video encoding revision level = byte hex version - default = 0 ; video can revise this value -> 4 bytes QUICKTIME video encoding vendor = long ASCII text string - default = 0 -> 4 bytes QUICKTIME video temporal quality = long unsigned value (0 to 1024) -> 4 bytes 
QUICKTIME video spatial quality = long unsigned value (0 to 1024) - some quality values are lossless = 1024 ; maximum = 1023 ; high = 768 - some quality values are normal = 512 ; low = 256 ; minimum = 0 -> 4 bytes video frame pixel size = short unsigned width + short unsigned height -> 8 bytes video resolution = long fixed point horizontal + long fixed point vertical - defaults to 72.0 dpi -> 4 bytes QUICKTIME video data size = long value set to zero -> 2 bytes video frame count = short unsigned total (set to 1) -> 1 byte video encoding name string length = byte unsigned length -> 31 bytes video encoder name string -> NOTE: if video encoder name string < 31 chars then pad with zeros -> 2 bytes video pixel depth = short unsigned bit depth - colors are 1 (Monochrome), 2 (4), 4 (16), 8 (256) - colors are 16 (1000s), 24 (Ms), 32 (Ms+A) - grays are 33 (B/W), 34 (4), 36 (16), 40(256) -> 2 bytes QUICKTIME video color table id = short integer value (no table = -1) -> optional QUICKTIME color table data if above set to 0 (see color table atom below for layout) OR -> 4 bytes description length = long unsigned length -> 4 bytes description audio format = long ASCII text string 'mp4a' - if encrypted to ISO/IEC 14496-12 or 3GPP standards then use: -> 4 bytes description audio format = long ASCII text string 'enca' - if encoded to 3GPP GSM 6.10 AMR narrowband standards then use: -> 4 bytes description audio format = long ASCII text string 'samr' - if encoded to 3GPP GSM 6.10 AMR wideband standards then use: -> 4 bytes description audio format = long ASCII text string 'sawb' -> 6 bytes reserved = 48-bit value set to zero -> 2 bytes data reference index = short unsigned index from 'dref' box -> 2 bytes QUICKTIME audio encoding version = short hex version - default = 0 ; audio data size before decompression = 1 -> 2 bytes QUICKTIME audio encoding revision level = byte hex version - default = 0 ; video can revise this value -> 4 bytes QUICKTIME audio encoding vendor = long ASCII 
text string - default = 0 -> 2 bytes audio channels = short unsigned count (mono = 1 ; stereo = 2) -> 2 bytes audio sample size = short unsigned value (8 or 16) -> 2 bytes QUICKTIME audio compression id = short integer value - default = 0 -> 2 bytes QUICKTIME audio packet size = short value set to zero -> 4 bytes audio sample rate = long unsigned fixed point rate OR -> 4 bytes description length = long unsigned length -> 4 bytes description system format = long ASCII text string 'mp4s' - if encrypted to ISO/IEC 14496-12 standards then use: -> 4 bytes description system format = long ASCII text string 'encs' -> 6 bytes reserved = 48-bit value set to zero -> 2 bytes data reference index = short unsigned index from 'dref' box * 8+ bytes ISO/IEC 14496-12/3GPP encryption scheme info box = long unsigned offset + long ASCII text string 'sinf' - if stream encrypted to ISO/IEC 14496-12 standards * 8+ bytes ISO/IEC 14496-12/3GPP/QUICKTIME original format box = long unsigned offset + long ASCII text string 'frma' -> 4 bytes description format = long ASCII text string - formats are MPEG-4 visual = 'mp4v' ; MPEG-4 AVC = 'avc1' - formats are MPEG-4 audio = 'mp4a' ; MPEG-4 system = 'mp4s' - 3GPP formats are H.263 = 's263' ; AMR narrow = 'samr' - 3GPP format is AMR wide = 'sawb' * 8+ bytes optional ISO/IEC 14496-12 IPMP info box = long unsigned offset + long ASCII text string 'imif' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> IPMP descriptors = hex dump from IPMP part of ES Descriptor box * 8+ bytes optional ISO/IEC 14496-12/3GPP scheme type box = long unsigned offset + long ASCII text string 'schm' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0 ; contains URI if flags = 0x000001) -> 4 bytes encryption type = long ASCII text string - types are 128-bit AES counter = 'ACM1' ; 128-bit AES FS = 'AFS1' - types are NULL algorithm = 'ENUL' ; 160-bit HMAC-SHA-1 = 'SHM2' - types are RTCP = 'ANUL' ; private scheme = ' ' -> 2 
bytes encryption version = short unsigned version -> optional scheme URI string = UTF-8 text string (eg. web site) -> 1 byte optional scheme URI string end = byte padding set to zero * 8+ bytes ISO/IEC 14496-12/3GPP scheme data box = long unsigned offset + long ASCII text string 'schi' -> encryption related key = hex dump * 8+ bytes 3GPP H.263v1 decode config box = long unsigned offset + long ASCII text string 'd263' -> 4 bytes encoder vendor = long ASCII text string -> 1 byte encoder version = 8-bit unsigned revision -> 1 byte H.263 level = 8-bit unsigned stream level -> 1 byte H.263 profile = 8-bit unsigned stream profile * 8+ bytes optional 3GPP H.263v1 bit rate box = long unsigned offset + long ASCII text string 'bitr' -> 4 bytes average bit rate = 32-bit unsigned value -> 4 bytes maximum bit rate = 32-bit unsigned value * 8+ bytes 3GPP GSM 6.10 AMR decode config box = long unsigned offset + long ASCII text string 'damr' -> 4 bytes encoder vendor = long ASCII text string -> 1 byte encoder version = 8-bit unsigned revision -> 2 byte packet modes = 16-bit unsigned bit mode index -> 1 byte number of packet mode changes = 8-bit unsigned value -> 1 byte samples per packet = 8-bit unsigned value * 8+ bytes ISO/IEC 14496-10 or 3GPP AVC decode config box = long unsigned offset + long ASCII text string 'avcC' -> 1 byte version = 8-bit hex version (current = 1) -> 1 byte H.264 profile = 8-bit unsigned stream profile -> 1 byte H.264 compatible profiles = 8-bit hex flags -> 1 byte H.264 level = 8-bit unsigned stream level -> 1 1/2 nibble reserved = 6-bit unsigned value set to 63 -> 1/2 nibble NAL length = 2-bit length byte size type - 1 byte = 0 ; 2 bytes = 1 ; 4 bytes = 3 -> 1 byte number of SPS = 8-bit unsigned total -> 2+ bytes SPS length = short unsigned length -> + SPS NAL unit = hexdump -> 1 byte number of PPS = 8-bit unsigned total -> 2+ bytes PPS length = short unsigned length -> + PPS NAL unit = hexdump * 8+ bytes vers. 
2 ES Descriptor box = long unsigned offset + long ASCII text string 'esds' - if encoded to ISO/IEC 14496-10 AVC standards then optionally use: = long unsigned offset + long ASCII text string 'm4ds' -> 4 bytes version/flags = 8-bit hex version + 24-bit hex flags (current = 0) -> 1 byte ES descriptor type tag = 8-bit hex value 0x03 -> 3 bytes extended descriptor type tag string = 3 * 8-bit hex value - types are Start = 0x80 ; End = 0xFE - NOTE: the extended start tags may be left out -> 1 byte descriptor type length = 8-bit unsigned length -> 2 bytes ES ID = 16-bit unsigned value -> 1 byte stream priority = 8-bit unsigned value - Defaults to 16 and ranges from 0 through to 31 -> 1 byte decoder config descriptor type tag = 8-bit hex value 0x04 -> 3 bytes extended descriptor type tag string = 3 * 8-bit hex value - types are Start = 0x80 ; End = 0xFE - NOTE: the extended start tags may be left out -> 1 byte descriptor type length = 8-bit unsigned length -> 1 byte object type ID = 8-bit unsigned value - type IDs are system v1 = 1 ; system v2 = 2 - type IDs are MPEG-4 video = 32 ; MPEG-4 AVC SPS = 33 - type IDs are MPEG-4 AVC PPS = 34 ; MPEG-4 audio = 64 - type IDs are MPEG-2 simple video = 96 - type IDs are MPEG-2 main video = 97 - type IDs are MPEG-2 SNR video = 98 - type IDs are MPEG-2 spatial video = 99 - type IDs are MPEG-2 high video = 100 - type IDs are MPEG-2 4:2:2 video = 101 - type IDs are MPEG-4 ADTS main = 102 - type IDs are MPEG-4 ADTS Low Complexity = 103 - type IDs are MPEG-4 ADTS Scalable Sampling Rate = 104 - type IDs are MPEG-2 ADTS = 105 ; MPEG-1 video = 106 - type IDs are MPEG-1 ADTS = 107 ; JPEG video = 108 - type IDs are private audio = 192 ; private video = 208 - type IDs are 16-bit PCM LE audio = 224 ; vorbis audio = 225 - type IDs are dolby v3 (AC3) audio = 226 ; alaw audio = 227 - type IDs are mulaw audio = 228 ; G723 ADPCM audio = 229 - type IDs are 16-bit PCM Big Endian audio = 230 - type IDs are Y'CbCr 4:2:0 (YV12) video = 240 ; H264 video = 
241 - type IDs are H263 video = 242 ; H261 video = 243 -> 6 bits stream type = 3/4 byte hex value - type IDs are object descript. = 1 ; clock ref. = 2 - type IDs are scene descript. = 3 ; visual = 4 - type IDs are audio = 5 ; MPEG-7 = 6 ; IPMP = 7 - type IDs are OCI = 8 ; MPEG Java = 9 - type IDs are user private = 32 -> 1 bit upstream flag = 1/8 byte hex value -> 1 bit reserved flag = 1/8 byte hex value set to 1 -> 3 bytes buffer size = 24-bit unsigned value -> 4 bytes maximum bit rate = 32-bit unsigned value -> 4 bytes average bit rate = 32-bit unsigned value -> 1 byte decoder specific descriptor type tag = 8-bit hex value 0x05 -> 3 bytes extended descriptor type tag string = 3 * 8-bit hex value - types are Start = 0x80 ; End = 0xFE - NOTE: the extended start tags may be left out -> 1 byte descriptor type length = 8-bit unsigned length -> ES header start codes = hex dump -> 1 byte SL config descriptor type tag = 8-bit hex value 0x06 -> 3 bytes extended descriptor type tag string = 3 * 8-bit hex value - types are Start = 0x80 ; End = 0xFE - NOTE: the extended start tags may be left out -> 1 byte descriptor type length = 8-bit unsigned length -> 1 byte SL value = 8-bit hex value set to 0x02 * 8+ bytes QUICKTIME video gamma atom = long unsigned offset + long ASCII text string 'gama' -> 4 bytes decimal level = long fixed point level * 8+ bytes QUICKTIME video field order atom = long unsigned offset + long ASCII text string 'fiel' -> 2 bytes field count/order = byte integer total + byte integer order * 8+ bytes QUICKTIME video m-jpeg quantize table atom = long unsigned offset + long ASCII text string 'mjqt' -> quantization table = hex dump * 8+ bytes QUICKTIME video m-jpeg huffman table atom = long unsigned offset + long ASCII text string 'mjht' -> huffman table = hex dump * 8+ bytes time to sample (frame timing) box = long unsigned offset + long ASCII text string 'stts' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes number of 
times = long unsigned total -> 8+ bytes time per frame = long unsigned frame count + long unsigned duration - multiple durations mean a variable framing rate - a single duration means a fixed framing rate - calculate framing (fps): media time scale (units per second) / (average) duration * 8+ bytes optional sync sample (key/intra frame) box = long unsigned offset + long ASCII text string 'stss' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes number of key frames = long unsigned total -> 4+ bytes key/intra frame location = long unsigned framing time - key/intra frame location according to sample/framing time * 8+ bytes sample/framing to chunk/block box = long unsigned offset + long ASCII text string 'stsc' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes number of blocks = long unsigned total -> 8+ bytes frames per block = long unsigned first/next block + long unsigned # of frames -> 4+ bytes samples description id = long unsigned description number * 8+ bytes sample (block byte) size box = long unsigned offset + long ASCII text string 'stsz' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes block byte size for all = 32-bit integer byte value (different sizes = 0) -> 4 bytes number of block sizes = long unsigned total -> 4+ bytes block byte sizes = long unsigned byte values * 8+ bytes chunk/block offset box = long unsigned offset + long ASCII text string 'stco' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes number of offsets = long unsigned total -> 4+ bytes block offsets = long unsigned byte values * 8+ bytes larger chunk/block offset box = long unsigned offset + long ASCII text string 'co64' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes number of offsets = long unsigned total -> 8+ bytes larger block offsets = 64-bit unsigned byte values * 8+ bytes optional user data (any custom info) atom = long unsigned offset + long ASCII text string 'udta' (copyright and MPEG-7 meta 
data related to element tracks) * 8+ bytes optional copyright notice box = long unsigned offset + long ASCII text string 'cprt' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 1/8 byte ISO language padding = 1-bit value set to 0 -> 1 7/8 bytes content language = 3 * 5-bits ISO 639-2 language code less 0x60 - example code for english = 0x15C7 -> annotation string = ASCII text string -> 1 byte annotation c string end = byte value set to zero * 8+ bytes optional ISO/IEC 14496-12 element meta data box = long unsigned offset + long ASCII text string 'meta' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) * 8+ bytes ISO/IEC 14496-12 handler reference box = long unsigned offset + long ASCII text string 'hdlr' - this box must be toward the start of the meta box -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes QUICKTIME type = long ASCII text string (eg. Media Handler = 'mhlr') -> 4 bytes subtype/meta data type = long ASCII text string - types are MPEG-7 XML = 'mp7t' ; MPEG-7 binary XML = 'mp7b' - type is APPLE meta data iTunes reader = 'mdir' -> 4 bytes QUICKTIME manufacturer reserved = long ASCII text string (eg. Apple = 'appl' or 0) -> 4 bytes QUICKTIME component reserved flags = long hex flags (none = 0) -> 4 bytes QUICKTIME component reserved flags mask = long hex mask (none = 0) -> component type name ASCII string (eg. 
"Meta Data Handler" - no name = zero length string) -> 1 byte component name string end = byte padding set to zero - note: the quicktime spec uses a Pascal string instead of the above C string * 8+ bytes optional ISO/IEC 14496-12 MPEG-7 XML box = long unsigned offset + long ASCII text string 'xml ' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> MPEG-7 XML meta data = text dump * 8+ bytes optional ISO/IEC 14496-12 MPEG-7 binary XML box = long unsigned offset + long ASCII text string 'bxml' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> MPEG-7 encoded XML meta data = hex dump * 8+ bytes optional ISO/IEC 14496-12 item location box = long unsigned offset + long ASCII text string 'iloc' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 1 nibble size of access offsets = 4 bits one byte multiples - 8-bit offset = 0 ; 32-bit offset = 4 ; 64-bit offset = 8 -> 1 nibble size of data lengths = 4 bits one byte multiples - 8-bit offset = 0 ; 32-bit offset = 4 ; 64-bit offset = 8 -> 1 nibble size of starting offset = 4 bits one byte multiples - 8-bit offset = 0 ; 32-bit offset = 4 ; 64-bit offset = 8 -> 1 nibble reserved = 4 bits set to zero -> 2 bytes number of locations = short unsigned index total -> 2+ bytes item reference = short unsigned id -> 2+ bytes stream data reference = short unsigned index from 'dref' box - if meta data item in same file set to zero -> 1-8+ bytes starting offset = byte - dlong unsigned offset -> 2+ bytes number of access points = short unsigned index total -> 1-8+ bytes access offset = byte - dlong unsigned relative offset (relative to starting offset) -> 1-8+ bytes data length = byte - dlong unsigned length * 8+ bytes optional ISO/IEC 14496-12 primary item box = long unsigned offset + long ASCII text string 'pitm' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes main item reference = short unsigned id * 8+ bytes 
optional ISO/IEC 14496-12 item encryption box = long unsigned offset + long ASCII text string 'ipro' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes number of encryption boxes = short unsigned index total * 8+ bytes ISO/IEC 14496-12 encryption scheme info box = long unsigned offset + long ASCII text string 'sinf' - if meta data encrypted to ISO/IEC 14496-12 standards * 8+ bytes ISO/IEC 14496-12 original format box = long unsigned offset + long ASCII text string 'frma' -> 4 bytes description format = long ASCII text string * 8+ bytes optional ISO/IEC 14496-12 IPMP info box = long unsigned offset + long ASCII text string 'imif' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> IPMP descriptors = hex dump from IPMP part of ES Descriptor box * 8+ bytes optional ISO/IEC 14496-12 scheme type box = long unsigned offset + long ASCII text string 'schm' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0 ; contains URI if flags = 0x000001) -> 4 bytes encryption type = long ASCII text string - types are 128-bit AES counter = 'ACM1' ; 128-bit AES FS = 'AFS1' - types are NULL algorithm = 'ENUL' ; 160-bit HMAC-SHA-1 = 'SHM2' - types are RTCP = 'ANUL' ; private scheme = ' ' -> 2 bytes encryption version = short unsigned version -> optional scheme URI string = UTF-8 text string (eg. 
web site) -> 1 byte optional scheme URI string end = byte padding set to zero * 8+ bytes ISO/IEC 14496-12 scheme data box = long unsigned offset + long ASCII text string 'schi' -> encryption related key = hex dump * 8+ bytes optional ISO/IEC 14496-12 item information box = long unsigned offset + long ASCII text string 'iinf' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes main item reference = short unsigned id -> 2 bytes encryption box array value = short unsigned index -> item name or URL string = UTF-8 text string -> 1 byte name or URL c string end = byte value set to zero -> item mime type string = UTF-8 text string -> 1 byte mime type c string end = byte value set to zero -> optional item transfer encoding string = UTF-8 text string -> 1 byte optional transfer encoding c string end = byte value set to zero * 8+ bytes optional user data (any custom info) box = long unsigned offset + long ASCII text string 'udta' * 8+ bytes optional copyright notice box = long unsigned offset + long ASCII text string 'cprt' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 1/8 byte ISO language padding = 1-bit value set to 0 -> 1 7/8 bytes content language = 3 * 5-bits ISO 639-2 language code less 0x60 - example code for english = 0x15C7 -> annotation string = UTF text string -> 1 byte annotation c string end = byte value set to zero * 8+ bytes optional 3GPP notice box = long unsigned offset + long ASCII text string - box types are title = 'titl'; author = 'auth'; description = 'dscp' - box types are performers = 'perf'; genre = 'gnre' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 1/8 byte ISO language padding = 1-bit value set to 0 -> 1 7/8 bytes content language = 3 * 5-bits ISO 639-2 language code less 0x60 - example code for english = 0x15C7 -> annotation string = UTF text string -> 1 byte annotation c string end = byte value set to zero * 8+ bytes optional ISO/IEC 
14496-12 presentation meta data box = long unsigned offset + long ASCII text string 'meta' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) * 8+ bytes ISO/IEC 14496-12 handler reference box = long unsigned offset + long ASCII text string 'hdlr' - this box must be toward the start of the meta box -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 4 bytes QUICKTIME type = long ASCII text string (eg. Media Handler = 'mhlr') -> 4 bytes subtype/meta data type = long ASCII text string - types are MPEG-7 XML = 'mp7t' ; MPEG-7 binary XML = 'mp7b' - type is APPLE meta data for iTunes reader = 'mdir' -> 4 bytes QUICKTIME manufacturer reserved = long ASCII text string (eg. Apple = 'appl' or 0) -> 4 bytes QUICKTIME component reserved flags = long hex flags (none = 0) -> 4 bytes QUICKTIME component reserved flags mask = long hex mask (none = 0) -> component type name ASCII string (eg. "Meta Data Handler" - no name = zero length string) -> 1 byte component name string end = byte padding set to zero - note: the quicktime spec uses a Pascal string instead of the above C string * 8+ bytes optional ISO/IEC 14496-12 MPEG-7 XML box = long unsigned offset + long ASCII text string 'xml ' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> MPEG-7 XML meta data = text dump * 8+ bytes optional ISO/IEC 14496-12 MPEG-7 binary XML box = long unsigned offset + long ASCII text string 'bxml' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> MPEG-7 encoded XML meta data = hex dump * 8+ bytes optional ISO/IEC 14496-12 item location box = long unsigned offset + long ASCII text string 'iloc' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 1 nibble size of access offsets = 4 bits one byte multiples - 8-bit offset = 0 ; 32-bit offset = 4 ; 64-bit offset = 8 -> 1 nibble size of data lengths = 4 bits one byte multiples - 8-bit offset = 0 ; 32-bit offset = 4 ; 
64-bit offset = 8 -> 1 nibble size of starting offset = 4 bits one byte multiples - 8-bit offset = 0 ; 32-bit offset = 4 ; 64-bit offset = 8 -> 1 nibble reserved = 4 bits set to zero -> 2 bytes number of locations = short unsigned index total -> 2+ bytes item reference = short unsigned id -> 2+ bytes stream data reference = short unsigned index from 'dref' box - if meta data item in same file set to zero -> 1-8+ bytes starting offset = byte - dlong unsigned offset -> 2+ bytes number of access points = short unsigned index total -> 1-8+ bytes access offset = byte - dlong unsigned relative offset (relative to starting offset) -> 1-8+ bytes data length = byte - dlong unsigned length * 8+ bytes optional ISO/IEC 14496-12 primary item box = long unsigned offset + long ASCII text string 'pitm' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes main item reference = short unsigned id * 8+ bytes optional ISO/IEC 14496-12 item encryption box = long unsigned offset + long ASCII text string 'ipro' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes number of encryption boxes = short unsigned index total * 8+ bytes ISO/IEC 14496-12 encryption scheme info box = long unsigned offset + long ASCII text string 'sinf' - if meta data encrypted to ISO/IEC 14496-12 standards * 8+ bytes ISO/IEC 14496-12 original format box = long unsigned offset + long ASCII text string 'frma' -> 4 bytes description format = long ASCII text string * 8+ bytes optional ISO/IEC 14496-12 IPMP info box = long unsigned offset + long ASCII text string 'imif' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> IPMP descriptors = hex dump from IPMP part of ES Descriptor box * 8+ bytes optional ISO/IEC 14496-12 scheme type box = long unsigned offset + long ASCII text string 'schm' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0 ; contains URI if flags = 0x000001) -> 4 bytes encryption 
type = long ASCII text string - types are 128-bit AES counter = 'ACM1' ; 128-bit AES FS = 'AFS1' - types are NULL algorithm = 'ENUL' ; 160-bit HMAC-SHA-1 = 'SHM2' - types are RTCP = 'ANUL' ; private scheme = ' ' -> 2 bytes encryption version = short unsigned version -> optional scheme URI string = UTF-8 text string (eg. web site) -> 1 byte optional scheme URI string end = byte padding set to zero * 8+ bytes ISO/IEC 14496-12 scheme data box = long unsigned offset + long ASCII text string 'schi' -> encryption related key = hex dump * 8+ bytes optional ISO/IEC 14496-12 item information box = long unsigned offset + long ASCII text string 'iinf' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current = 0) -> 2 bytes main item reference = short unsigned id -> 2 bytes encryption box array value = short unsigned index -> item name or URL string = UTF-8 text string -> 1 byte name or URL c string end = byte value set to zero -> item mime type string = UTF-8 text string -> 1 byte mime type c string end = byte value set to zero -> optional item transfer encoding string = UTF-8 text string -> 1 byte optional transfer encoding c string end = byte value set to zero * 8+ bytes optional APPLE item list box = long unsigned offset + long ASCII text string 'ilst' * 8+ bytes optional APPLE annotation box = long unsigned offset + 0xA9 + 24-bit ASCII text string - box types are full name = 'nam' ; comment = 'cmt' ; content created year = 'day' - box types are artist = 'ART' ; track = 'trk' ; album = 'alb' ; composer = 'com' - box types are composer = 'wrt' ; encoder = 'too' OR = long unsigned offset + 32-bit ASCII text string - box types are genre = 'gnre' ; CD set number = 'disk' ; track number = 'trkn' - box types are beats per minute = 'tmpo' ; compilation = 'cpil' - box types are cover art = 'covr' ; itunes specific info = '----' * 8+ bytes APPLE item data box = long unsigned offset + long ASCII text string 'data' -> 4 bytes version/flags 
= byte hex version + 24-bit hex flags (current version = 0 ; contains text flag = 0x000001 ; contains data flag = 0x000000 ; for tmpo/cpil flag = 0x000015 ; contains image data flag = 0x00000D) -> 4 bytes reserved = 32-bit value set to zero -> annotation text or data values = text or hex dump (NOTE: Genre is either text or a 16-bit short ID3 value and most other non-text data are short unsigned values with the exception of compilation which is a byte flag) * 8+ bytes optional APPLE additional info box = long unsigned offset + long ASCII text string - box types are Java style app name = 'mean' ; item name = 'name' -> 4 bytes version/flags = byte hex version + 24-bit hex flags (current version = 0 ; current flags = 0x000000) -> string text = ASCII text dump -> 4 bytes compatibility udta end = long value set to zero
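
EXAMPLE (illustrative sketch only, not part of any specification)
Two of the layouts listed above are easier to follow in code: the packed content language field used by the 'cprt' and 3GPP notice boxes (1 padding bit + 3 * 5 bits, each letter stored less 0x60), and the APPLE 'data' box (offset, type, version/flags, reserved, then the value). The Python below is a minimal sketch under those assumptions. The function names decode_language and parse_apple_data_box are made up for this example, the sample box is built by hand, and the special box sizes 0 and 1 are not handled.

```python
import struct


def decode_language(packed: int) -> str:
    """Unpack the 16-bit language field: 1 padding bit, then three
    5-bit letters, each stored as the ASCII code less 0x60."""
    return "".join(
        chr(((packed >> shift) & 0x1F) + 0x60) for shift in (10, 5, 0)
    )


def parse_apple_data_box(buf: bytes):
    """Split an APPLE 'data' box into (flags, value bytes).

    Assumed layout, taken from the listing above:
    4 bytes box offset/size, 4 bytes 'data', 1 byte version +
    3 bytes flags, 4 bytes reserved, then the value itself.
    """
    if len(buf) < 16:
        raise ValueError("buffer too short for a 'data' box")
    size, box_type = struct.unpack(">I4s", buf[:8])  # big endian
    if box_type != b"data":
        raise ValueError("not a 'data' box")
    if size < 16 or size > len(buf):
        raise ValueError("box size out of range")
    version_flags, _reserved = struct.unpack(">II", buf[8:16])
    flags = version_flags & 0x00FFFFFF  # low 24 bits are the flags
    return flags, buf[16:size]


# Hand-built sample: a text value (flags = 0x000001) holding "Example".
value = "Example".encode("utf-8")
sample_box = struct.pack(">I4sII", 16 + len(value), b"data", 0x000001, 0) + value

flags, payload = parse_apple_data_box(sample_box)
print(decode_language(0x15C7))  # the english example code from the listing
print(hex(flags), payload)
```

Feeding the english example code 0x15C7 through decode_language gives the three letters 'eng', which matches the 5-bit packing described for the content language field.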