User:Ryan Cooley/MPEG1: Difference between revisions
imported>Rcooley m (15) |
imported>Rcooley (14 Long, nightmarish, brutally tedious cleanup) |
This is my mind-dump and accommodating others before I'm done will just make much, much more work for me. Put any suggestions on the Talk page, and I will eventually address them. -RC | This is my mind-dump and accommodating others before I'm done will just make much, much more work for me. Put any suggestions on the Talk page, and I will eventually address them. -RC | ||
'''MPEG-1''' was an early [[standard]] for [[lossy]] compression of [[video]] and [[audio]]. It was designed to compress raw video and CD audio from about 43 Mbit/s down to 1. | '''MPEG-1''' was an early [[standard]] for [[lossy]] compression of [[video]] and [[audio]]. It was designed to compress raw video and CD audio from about 43 Mbit/s down to 1.5 Mbit/s without obvious (discernible) quality loss, making [[Video CD]]s and [[Digital Video Broadcasting]] possible. <ref>http://www.chiariglione.org/mpeg/meetings/kurihama89/kurihama_press.htm</ref> | ||
MPEG-1 is used extensively, in a large number of products and technologies. Perhaps the most well-known part of the MPEG-1 standard today is the MP3 audio format it introduced. | MPEG-1 is used extensively, in a large number of products and technologies. Perhaps the most well-known part of the MPEG-1 standard today is the MP3 audio format it introduced. | ||
Despite its age, MPEG-1 is not necessarily obsolete or substantially inferior to newer technologies. According to [[Leonardo Chiariglione]] (co-founder of [[MPEG]]): "the idea that compression technology keeps on improving is a myth." <ref name=opensource>http://www.chiariglione.org/leonardo/publications/linux/linux00.htm</ref>
The MPEG-1 standard is published as [[ISO/IEC | The MPEG-1 standard is published as '''[[ISO]]/[[IEC]]-11172'''. | ||
== History == | == History == | ||
Modeled on the successful collaborative approach and the compression technologies developed by the [[Joint | Modeled on the successful collaborative approach and the compression technologies developed by the [[Joint Photographic Experts Group]] and [[CCITT]]'s [[Experts Group on Telephony]] (creators of the [[JPEG]] image compression standard and the [[H.261]] standard for [[video conferencing]] respectively) the [[MPEG]] working group was established in January 1988. MPEG was formed to address the need for [[standard]] video and audio encoding formats, and build on H.261 to get better quality through the use of more complex (non-[[real time]]) encoding methods. <ref>http://www.cis.temple.edu/~vasilis/Courses/CIS750/Papers/mpeg_6.pdf pp.2</ref> <ref name=faq>http://bmrc.berkeley.edu/research/mpeg/faq/mpeg2-v38/faq_v38.html</ref> | ||
Development of the MPEG-1 standard began in [[May 1988]]. 14 video and 14 audio codec proposals were submitted by individual companies and institutions for evaluation. The codecs were extensively tested for computational complexity and subjective (human perceived) quality, at | Development of the MPEG-1 standard began in [[May 1988]]. 14 video and 14 audio codec proposals were submitted by individual companies and institutions for evaluation. The codecs were extensively tested for [[computational complexity]] and [[Subjectivity|subjective]] (human perceived) quality, at data rates of 1.5 Mbit/s. This specific bitrate was chosen for transmission over [[Digital Signal 1|T-1]]/[[E-carrier|E-1]] lines and the approximate data rate of [[CDDA|audio CDs]]. <ref name=opensource /> The codecs that excelled in this testing were utilized as the basis for the standard and refined further, with additional features and other improvements being incorporated in the process. <ref>http://www.chiariglione.org/mpeg/meetings/santa_clara90/santa_clara_press.htm</ref> | ||
After 20 meetings of the full group in various cities around the world, and 4 <sup>1</sup>/<sub>2</sub> years of development and testing, the final standard was approved in early [[November 1992]]. <ref>http://www.chiariglione.org/mpeg/meetings.htm</ref> The completion date | After 20 meetings of the full group in various cities around the world, and 4 <sup>1</sup>/<sub>2</sub> years of development and testing, the final standard was approved in early [[November 1992]] and published a few months later. <ref>http://www.chiariglione.org/mpeg/meetings.htm</ref> The completion date for the MPEG-1 standard, as commonly reported, varies greatly because a largely complete draft standard was produced in [[September 1990]], and from that point on, only minor changes were introduced. In [[July 1990]], before the first draft of the MPEG-1 standard had even been written, work began on a second standard, [[MPEG-2]], intended to extend MPEG-1 technology to provide full broadcast-quality video at high bitrates (3 - 15 Mbit/s), and support for [[interlaced]] video. <ref>http://www.chiariglione.org/mpeg/meetings/london/london_press.htm</ref> Due in part to the similarity between the two codecs, the MPEG-2 standard included full backwards compatibility with MPEG-1 video, so any MPEG-2 decoder can play MPEG-1 videos. | ||
Notably, the MPEG-1 standard very strictly defines the [[bitstream]], and decoder function, but does not define how MPEG-1 encoding is to be performed (although they did provide a reference implementation: [[ISO/IEC 11172-5 | Notably, the MPEG-1 standard very strictly defines the [[bitstream]], and decoder function, but does not define how MPEG-1 encoding is to be performed (although they did provide a reference implementation: '''[[ISO]]/[[IEC]]-11172-5'''). This means that MPEG-1 [[coding efficiency]] can drastically vary depending on the encoder used, and generally means that newer encoders perform significantly better than their predecessors. | ||
== Applications == | == Applications == | ||
*Today, MPEG-1 has become by far the most widely compatible lossy audio/video format in the world. | *Today, MPEG-1 has become by far the most widely compatible lossy audio/video format in the world. | ||
*MPEG-1 Video and Layer I/II audio can be implemented without payment of license fees. <ref>http://www.emedialive.com/Articles/ReadArticle.aspx?ArticleID=12165</ref> <ref>http://www.extremetech.com/article2/0,1697,1153916,00.asp</ref> <ref>http://www.snazzizone.com/TP09.html</ref> <ref>http://213.130.34.82/resources/technical/mpegcompared/index.htm</ref> | *MPEG-1 Video and Layer I/II audio can be implemented without payment of license fees. <ref>http://www.emedialive.com/Articles/ReadArticle.aspx?ArticleID=12165</ref> <ref>http://www.extremetech.com/article2/0,1697,1153916,00.asp</ref> <ref>http://www.snazzizone.com/TP09.html</ref> <ref>http://213.130.34.82/resources/technical/mpegcompared/index.htm</ref> Due to its age, many of the patents on the technology have expired. | ||
*Most computer software for video playback includes MPEG-1 decoding, in addition to any other supported formats. | *Most computer software for video playback includes MPEG-1 decoding, in addition to any other supported formats. | ||
*The | *The popularity of [[MP3]] audio has established a massive [[installed base]] of hardware that can playback MPEG-1 audio (all 3 layers). | ||
*Millions of portable [[digital audio]] [[digital audio players|players]] | *Millions of portable [[digital audio]] [[digital audio players|players]] can playback MPEG-1 audio. | ||
*The widespread popularity of MPEG-2 | *The widespread popularity of MPEG-2 with broadcasters means MPEG-1 is playable by most digital cable/satellite set-top boxes, and digital disc and tape players, due to backwards compatibility. | ||
*MPEG-1 video and audio | *MPEG-1 is the exclusive video and audio format used on [[Video CD]]s (VCD), the first [[consumer]] digital video format, and still a very popular format around the world. | ||
*The [[Super Video CD]] standard, based on VCD, uses MPEG-1 audio exclusively, as well as MPEG-2 video. | *The [[Super Video CD]] standard, based on VCD, uses MPEG-1 audio exclusively, as well as MPEG-2 video. | ||
*[[DVD | *[[DVD Video]] uses MPEG-2 video primarily, but MPEG-1 support is explicitly defined/specified in the standard. | ||
*The [[DVD | *The [[DVD Video]] standard originally required MPEG-1 Layer II audio for PAL countries, but was changed to allow AC-3/[[Dolby Digital]]-only discs. MPEG-1 Layer II audio is still allowed on DVDs, although newer extensions to the format like [[MPEG Multichannel]] and [[variable bitrate]] (VBR), are rarely supported. | ||
*Most DVD players also support [[Video CD]] and [[MP3 CD]] playback, which use MPEG-1. | |||
*The international [[Digital Video Broadcasting]] (DVB) standard primarily uses MPEG-1 Layer II audio, as well as MPEG-2 video. | *The international [[Digital Video Broadcasting]] (DVB) standard primarily uses MPEG-1 Layer II audio, as well as MPEG-2 video. | ||
*The international [[Digital Audio Broadcasting]] (DAB) standard uses MPEG-1 Layer II audio exclusively, due to error resilience and low complexity of decoding. | *The international [[Digital Audio Broadcasting]] (DAB) standard uses MPEG-1 Layer II audio exclusively, due to error resilience and low complexity of decoding. | ||
*MPEG-1 Layer II audio, with [[MPEG | *MPEG-1 Layer II audio, with [[MPEG Multichannel]] extensions, was proposed for use in the [[ATSC]] [[digital TV]] broadcasting standard, but [[Dolby Digital]] (aka. AC-3, A/52) was chosen instead. This is a matter of significant controversy, as it has been revealed that the organizations (The [[Massachusetts Institute of Technology]] and [[Zenith Electronics]]) behind 2 of the 4 voting board members received tens of millions of dollars of compensation from secret deals with [[Dolby Laboratories]] in exchange for their votes. <ref>http://www-tech.mit.edu/V122/N54/54hdtv.54n.html</ref> | ||
== Video == | == Video == | ||
Part 2 of the MPEG-1 standard covers video and is defined in ISO/[[IEC 11172-2 | Part 2 of the MPEG-1 standard covers video and is defined in '''[[ISO]]/[[IEC]]-11172-2'''. | ||
=== Color Space === | === Color Space === | ||
Before encoding video to MPEG-1 the color-space is transformed to Y'CbCr (Y'=Luma, Cb=Chroma Blue, Cr=Chroma Red). Luma (brightness/resolution) is stored separately from chroma (color, hue, phase) and even further separated into red and blue components. The chroma is also subsampled to 4:2:0, meaning it is [[decimated]] by half vertically and half horizontally, to just one quarter the resolution of the video. | Before encoding video to MPEG-1 the color-space is transformed to [[Y'CbCr]] (Y'=Luma, Cb=Chroma Blue, Cr=Chroma Red). [[Luma (video)|Luma]] (brightness/resolution) is stored separately from [[Chrominance|chroma]] (color, hue, phase) and even further separated into red and blue components. The chroma is also subsampled to [[4:2:0]], meaning it is [[Decimation (signal processing)|decimated]] by half vertically and half horizontally, to just one quarter the resolution of the video. | ||
Y'CbCr is often inaccurately called [[YUV]] which is actually only used in the domain of analog video signals. Similarly, the terms [[luminance]] and [[chrominance]] are often used instead of the more accurate terms | Y'CbCr is often inaccurately called [[YUV]] which is actually only used in the domain of analog video signals. Similarly, the terms [[luminance]] and [[chrominance]] are often used instead of the more accurate/appropriate terms luma and chroma. | ||
Because the human eye is much less sensitive to small changes in color than in brightness, [[chroma subsampling]] is a very effective way to reduce the amount of video data that needs to be compressed. On videos with fine ( | Because the human eye is much less sensitive to small changes in color than in brightness, [[chroma subsampling]] is a very effective way to reduce the amount of video data that needs to be compressed. On videos with fine details (high [[Spatial_frequency#Visual_perception|spatial complexity]]) this can manifest as chroma [[aliasing]] artifacts. Compared to other digital [[compression artifact]]s, this issue seems to be very rarely a source of annoyance. | ||
Because of subsampling, Y'CbCr video must always be stored | Because of subsampling, Y'CbCr video must always be stored using even dimensions ([[divisible]] by 2), otherwise chroma mismatch ("ghosts") will occur, and it will appear as if the color is ahead of, or behind the rest of the video, much like a shadow. | ||
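The color-space step described above can be sketched in a few lines. This is only an illustration: the function names are made up, the weights are the common BT.601 ones, and full-range (0-255) values are assumed for simplicity, which is not necessarily what a real MPEG-1 encoder uses.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit R'G'B' pixel to Y'CbCr (BT.601 weights, full range assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma: weighted brightness
    cb = 128 + 0.564 * (b - y)              # blue-difference chroma
    cr = 128 + 0.713 * (r - y)              # red-difference chroma
    return y, cb, cr


def subsample_420(plane):
    """Halve a chroma plane in both directions by averaging each 2x2 group."""
    h, w = len(plane), len(plane[0])
    if h % 2 or w % 2:
        # Mirrors the even-dimension requirement mentioned in the text.
        raise ValueError("4:2:0 subsampling needs even dimensions")
    return [[(plane[y][x] + plane[y][x + 1]
              + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]
```

A 352x240 chroma plane passed through `subsample_420` comes out as 176x120, one quarter of the original sample count.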
=== Resolution === | === Resolution === | ||
MPEG-1 supports resolutions up to 4095×4095, and bitrates up to 100 Mbit/sec. <ref name=faq /> | MPEG-1 supports resolutions up to 4095×4095, and bitrates up to 100 Mbit/sec. <ref name=faq /> | ||
MPEG-1 videos are most commonly found using [[Source Input Format]] (SIF) resolutions: 352x240, 352x288, or 320x240. These low resolutions, combined with a bitrate less than 1.5 Mbit/s, make up what is known as a [[constrained parameters bitstream]]. These are the minimum video specifications any [[decoder]] should be able to handle to be considered MPEG-1 [http://en.wiktionary.org/wiki/compliant compliant]. They were selected to provide a good balance between quality and performance, allowing the use of reasonably inexpensive hardware of the time.
=== I-Frames === | === I-Frames === | ||
MPEG-1 has several frame and picture types. The first, most important, yet simplest are '''I-frames'''. | MPEG-1 has several frame and picture types. The first, most important, yet simplest are '''I-frames'''. | ||
I-frame is an abbreviation for '''Intra-frame'''. They may also be known as I-pictures, or | I-frame is an abbreviation for '''[http://en.wiktionary.org/wiki/intra- Intra-frame]''', so-called because they can be decoded independently of any other frames. They may also be known as '''I-pictures''', or '''keyframes''' due to their somewhat similar function to the [[key frame]]s used in animation. I-frames can be considered effectively identical to [[JPEG]] images. | ||
I- | High-speed seeking through an MPEG-1 video is only possible to the nearest I-frame. When cutting a video it is not possible to start playback of a segment of video before the first I-frame in the segment (at least not without computationally-intensive re-encoding). For this reason, I-frame only MPEG videos are used in editing applications. | ||
I-frame only compression is very fast, but produces very large file sizes, on the order of 2 - 14 × larger than normally encoded MPEG-1 video. <ref>mostly mathematical fact; verifiable using eg. [[libavcodec]]: compare MPEG1 encoding using vqscale=X:keyint=1 to vqscale=X:keyint=15)</ref> I-frame only MPEG-1 video is very similar to [[MJPEG]] video, so much so that very high-speed and nearly lossless (except rounding errors) conversion can be made from one format to the other, provided a couple restrictions (color space and quantizer table) are followed in the creation of the original bitstream. <ref>http://citeseer.ist.psu.edu/acharya98compressed.html ''Compressed Domain Transcoding of MPEG'' (Requires clever reading: says quantization tables differ, but those are user selectable)</ref> | |||
I- | The length between I-frames is known as the [[group of pictures]] (GOP) size. MPEG-1 most commonly uses a GOP size of 12-15. ie. 1 I-frame for every 11-14 non-I-frames (some combination of P-frames and B-frames). With more intelligent encoders, GOP size is dynamically chosen, up to some pre-selected maximum limit. | ||
Limits are placed on the maximum number of frames between I-frames due to decoding complexity, decoder buffer size, seeking ability, and the accumulation of IDCT errors in low-precision implementations common in hardware decoders.
=== P-frames === | === P-frames === | ||
'''P-frame''' is an abbreviation for '''Predicted-frame'''. They may also be known as '''forward-predicted frames''', or '''inter-frames''' (B-frames are also inter-frames).
P-frames | P-frames exist to improve compression by exploiting the [http://en.wiktionary.org/wiki/temporal temporal] (over time) [http://en.wiktionary.org/wiki/redundancy redundancy] in a video. P-frames store only the ''difference'' in image from the frame (either an I-frame or P-frame) immediately preceding it (this reference frame is also called the ''[[Anchoring|anchor]] frame''). This is called [[conditional replenishment]]. | ||
The difference between a P-frame and its anchor frame is calculated using ''motion vectors'' on each ''macroblock'' of the frame (see below). Motion vector data will be embedded in the P-frame for use by the decoder.
If a video | If a reasonable match is found, the block from the previous frame is used, and any ''prediction error'' (difference between the predicted block and the actual video) is encoded and stored in the P-frame. If a reasonably close match from the previous frame for a block cannot be found, the block will be intra-coded (storing the block as an image, in full). A P-frame can contain any number of intra-coded blocks, in addition to any forward-predicted blocks. | ||
If a video drastically changes from one frame to the next (such as a [[scene change]]), it can be more efficient to encode it as an I-frame. | |||
=== B-frames === | === B-frames === | ||
'''B-frame''' stands for '''bidirectional-frame'''. They may also be known as B-pictures or backwards-predicted frames. | |||
B-frames are quite similar to P-frames, except they can make predictions using either, or both of, the previous and future frames (ie. two anchor frames). | |||
It is therefore necessary for the player to first decode the next I- or P- anchor frame sequentially after the B-frame, before the B-frame can be decoded and displayed. This makes B-frames very computationally complex, requires larger [[data buffer]]s, and causes a delay of around half a second <ref>FPS / 60 * (number of sequential B-frames) == X seconds delay</ref> on decoding, and more during encoding. This also necessitates the decoding time stamps (DTS) feature in the container/system stream (see below). As such, B-frames have long been the subject of much controversy, are often avoided in videos, and have limited support in hardware.
Because B-frames | No other frames are predicted from a B-frame. Because of this, a very low bitrate B-frame can be inserted, where needed, to help control the bitrate. If this was done with a P-frame, future P-frames would be predicted from it and this will lower the quality of the rest of the GOP. However, similarly, the future P-frame must still encode all the changes between it and the previous I-frame or P-frame anchor frame, a second time, in addition to much of the changes being coded in the B-frame. B-frames can also be beneficial in videos where the background behind an object is being revealed over several frames, and in fading transitions, such as from one scene to the next. | ||
A B-frame can contain any number of intra-coded blocks and forward-predicted blocks, in addition to backwards-predicted, | A B-frame can contain any number of intra-coded blocks and forward-predicted blocks, in addition to backwards-predicted, or bidirectionally predicted blocks. | ||
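The reordering this implies can be shown with a small sketch. The function name and the frame labels are made up for illustration; the only rule applied is the one stated above, that a B-frame cannot be decoded until the anchor frame following it has been decoded.

```python
def display_to_coded_order(display):
    """Reorder a string of frame types from display order to coded order.

    Each B-frame is held back until the next anchor (I or P) has been
    emitted, because the decoder needs that anchor first.
    """
    coded, pending_b = [], []
    for i, ftype in enumerate(display):
        label = f"{ftype}{i}"          # e.g. "B2" = B-frame shown third
        if ftype == "B":
            pending_b.append(label)
        else:
            coded.append(label)        # anchor goes out first
            coded.extend(pending_b)    # then the B-frames that depend on it
            pending_b = []
    coded.extend(pending_b)            # trailing B-frames (no later anchor in this sketch)
    return coded
```

For the display sequence `"IBBPBBP"` this yields `I0, P3, B1, B2, P6, B4, B5`, which is why the decoder must buffer frames and why separate decode and display timing is needed.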
=== D-frames === | === D-frames === | ||
MPEG-1 has a unique frame type not found in later video standards. D-frames or DC-pictures are independent images ( | MPEG-1 has a unique frame type not found in later video standards. '''D-frames''' or '''DC-pictures''' are independent images (intra-frames) that have been encoded DC-only (AC coefficients are removed—see DCT below) and hence are very low quality. D-frames are never referenced by I-, P- or B- frames. D-frames are only used for fast previews of video, for instance when seeking through a video at high speed. | ||
Given moderately higher-performance decoding equipment, this feature can be approximated by decoding | Given moderately higher-performance decoding equipment, this feature can be approximated by decoding I-frames instead. This provides higher quality previews, and without the need for D-frames taking up space in the stream, yet not contributing to overall quality. | ||
=== Macroblocks === | === Macroblocks === | ||
MPEG-1 operates on video in a series of 8x8 blocks for quantization, motion estimation, etc. However, because chroma is subsampled by a factor of 4, 1 chroma block corresponds to 4 luma blocks. This gives us the 16x16 macroblock as the smallest independent unit in MPEG-1 video.
It is very important to maintain video resolutions that are [[multiple]]s of 16. See Motion Vectors for more reasons. | It is very important to maintain video resolutions that are [[multiple]]s of 16. See Motion Vectors for more reasons. | ||
=== DCT === | === DCT === | ||
Each 8x8 block is encoded using the ''Forward'' Discrete | Each 8x8 block is encoded using the ''Forward'' [[Discrete Cosine Transform]] (FDCT). This process by itself is practically lossless (there are some rounding errors), and is reversed by the ''Inverse'' DCT ([[IDCT]]) upon playback to produce the original values. | ||
The FDCT process converts the 64 uncompressed pixel values (brightness) into 64 different ''frequency'' values: one (large) value that is the average of the entire 8x8 block (the '''DC coefficient''') and 63 smaller, positive or negative values (the '''AC coefficients''') that are relative to the value of the DC coefficient.
The | The DC coefficient remains mostly consistent from one block to the next, and can be compressed quite effectively with [[DPCM]] so only the amount of difference between each DC value needs to be stored. Also, a significant number of the AC coefficients will be near 0, (known as [http://en.wiktionary.org/wiki/sparse sparse] data) which can then be more efficiently compressed in a later step. Additionally, this DCT frequency conversion is necessary for quantization (see below). | ||
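A direct (slow) implementation of the 8x8 transform pair makes the DC/AC split concrete. This is a textbook DCT-II with the usual orthonormal scaling, written for clarity rather than speed; the function names are illustrative, and real encoders use fast integer approximations instead.

```python
import math

N = 8  # block size used by MPEG-1


def _c(k):
    """Normalization factor: 1/sqrt(2) for the zero-frequency term, else 1."""
    return math.sqrt(0.5) if k == 0 else 1.0


def fdct_8x8(block):
    """Forward DCT: 64 pixel values in, 64 frequency coefficients out."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * _c(u) * _c(v) * s
    return out


def idct_8x8(coeffs):
    """Inverse DCT: recovers the pixel values up to floating-point rounding."""
    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for y in range(N):
            s = 0.0
            for u in range(N):
                for v in range(N):
                    s += (_c(u) * _c(v) * coeffs[u][v]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[x][y] = 0.25 * s
    return out
```

With this scaling the DC coefficient `out[0][0]` is 8 times the block average, so a perfectly flat block produces a single non-zero value and 63 zeros.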
=== Quantization === | === Quantization === | ||
[[Quantization (image processing)|Quantization]] (of digital data) is, essentially, the process of reducing the accuracy of a signal. | [[Quantization (image processing)|Quantization]] (of digital data) is, essentially, the process of reducing the accuracy of a signal, by dividing it into some larger step size (eg. finding the nearest multiple, and discarding the remainder/modulus). | ||
The frame-level quantizer is a number from 1 to 31 (although encoders will often omit/disable some of the extreme values) which determines how much information will be removed from a given frame. The frame-level quantizer is either dynamically selected by the encoder to maintain a certain specified bitrate, or (much less commonly) specified by the user. | The frame-level quantizer is a number from 1 to 31 (although encoders will often omit/disable some of the extreme values) which determines how much information will be removed from a given frame. The frame-level quantizer is either dynamically selected by the encoder to maintain a certain specified bitrate, or (much less commonly) specified by the user. | ||
Contrary to popular belief, a fixed quantizer | Contrary to popular belief, a fixed quantizer (set by the user) does not deliver a constant level of quality. Instead, it is an arbitrary metric, that will provide a somewhat varying level of quality depending on the contents of each frame. Given two files of identical sizes, the one encoded by setting the bitrate should look better than the one with a set quantizer. Constant quantizer encoding can be used, however, to determine the minimum and maximum bitrates possible for encoding a given video. | ||
A '''quantization table''' is a string of 64-numbers (0-255) that tells the encoder how relatively important or unimportant each piece of visual information is. Each number in the table corresponds to a certain frequency component of the video image. | A '''quantization table''' is a string of 64-numbers (0-255) that tells the encoder how relatively important or unimportant each piece of visual information is. Each number in the table corresponds to a certain frequency component of the video image. | ||
This quantization process usually reduces a significant number of the ''AC coefficients'' to zero, which improves the effectiveness of entropy coding (lossless compression) in the next step. | This quantization process usually reduces a significant number of the ''AC coefficients'' to zero, which improves the effectiveness of entropy coding (lossless compression) in the next step. | ||
Quantization eliminates a large amount of data, and is the main lossy processing step in MPEG-1 video encoding. This is also the primary source of most MPEG-1 video [[compression artifacts]], like blockiness, [[color banding]], noise, [[ringing]], discoloration, et al. when video is encoded with an insufficient bitrate | Quantization eliminates a large amount of data, and is the main lossy processing step in MPEG-1 video encoding. This is also the primary source of most MPEG-1 video [[compression artifacts]], like [[blockiness]], [[color banding]], [[noise]], [[ringing]], [[discoloration]], et al. when video is encoded with an insufficient bitrate, therefore being forced to use high frame-level quantizers (''strong quantization'') through much of the video. | ||
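A simplified sketch of the idea follows. It assumes a step size of (table entry × frame-level quantizer) / 8 for every coefficient and rounds to the nearest step; the exact rounding rules and the special handling of the DC coefficient in the standard are not reproduced here, and the function names are illustrative.

```python
import math


def _round_half_away(x):
    """Round to nearest integer, with halves going away from zero."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize(coeffs, table, qscale):
    """Divide each coefficient by its step size and keep only the integer level."""
    return [[_round_half_away(c / (w * qscale / 8))
             for c, w in zip(crow, wrow)]
            for crow, wrow in zip(coeffs, table)]


def dequantize(levels, table, qscale):
    """Multiply levels back by the step size; the discarded remainder is lost."""
    return [[lv * (w * qscale / 8)
             for lv, w in zip(lrow, wrow)]
            for lrow, wrow in zip(levels, table)]
```

With a table entry of 16 and a quantizer of 8 the step is 16, so a coefficient of 100 becomes level 6 and is reconstructed as 96, while small coefficients such as 3 or -7 collapse to zero. Raising the quantizer widens the step and zeroes more coefficients.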
=== Entropy Coding === | === Entropy Coding === | ||
Several steps in the encoding of MPEG-1 video are lossless, meaning they will be reversed on decoding to produce exactly the same values. Since these lossless data compression steps don't add noise into or otherwise change the video (unlike quantization), it is often referred to as [[noiseless coding]] in the context of lossy codecs. Since lossless compression aims to remove as much redundancy as possible, it is also known as [[entropy coding]] in [[information theory]]. | Several steps in the encoding of MPEG-1 video are lossless, meaning they will be reversed on decoding to produce exactly the same values. Since these lossless data compression steps don't add noise into or otherwise change the video (unlike quantization), it is often referred to as [[Source coding theorem|noiseless coding]] in the context of lossy codecs. Since lossless compression aims to remove as much redundancy as possible, it is also known as [[entropy coding]] in [[information theory]]. | ||
=== RLE === | === RLE === | ||
[[Run-length encoding]] (RLE) is a very simple method of compressing repetition. A sequential string of characters, no matter how long, can be replaced with a few bytes, noting the value that repeats, and how many times. For example, if someone is to say "five nines", you would know they mean the number 99999. | [[Run-length encoding]] (RLE) is a very simple method of compressing repetition. A sequential string of characters, no matter how long, can be replaced with a few bytes, noting the value that repeats, and how many times. For example, if someone is to say "five nines", you would know they mean the number 99999. | ||
RLE is particularly effective after quantization, as a significant number of the AC coefficients are now zero, and can be represented with just a couple bytes | RLE is particularly effective after quantization, as a significant number of the AC coefficients are now zero (called [http://en.wiktionary.org/wiki/sparse sparse] data), and can be represented with just a couple bytes, in a special 2-[[dimensional]] Huffman table that codes the run-length and the run-ending character. | ||
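The (run, value) pairing can be sketched as follows. The input is assumed to be a list of already quantized coefficients in scan order; the function names are illustrative, and the trailing zeros are simply dropped here where a real bitstream would use an end-of-block marker.

```python
def run_level_pairs(coeffs):
    """Encode a coefficient list as (zeros skipped, non-zero value) pairs."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1              # extend the current run of zeros
        else:
            pairs.append((run, c))
            run = 0
    return pairs                  # trailing zeros are implied, not stored


def expand_pairs(pairs, length):
    """Rebuild the coefficient list, padding the implied trailing zeros."""
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * (length - len(out)))
    return out
```

For example, `[6, 0, 0, 3, 0, -1, 0, 0, 0, 0]` becomes `[(0, 6), (2, 3), (1, -1)]`: ten values stored as three pairs.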
=== Huffman Coding === | === Huffman Coding === | ||
The data is then analyzed to find strings that repeat often. Those strings are then put into a special table, with the most frequently repeating data assigned the shortest code to keep the data as small as possible. | The data is then analyzed to find strings that repeat often. Those strings are then put into a special [[Huffman]] table, with the most frequently repeating data assigned the shortest code to keep the data as small as possible. | ||
Once the table is constructed, those strings | Once the table is constructed, those strings in the data are replaced with their (much smaller) codes, which references the appropriate entry in the table. | ||
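The table construction can be illustrated with the generic textbook Huffman algorithm. This is a sketch of the general technique only, not of the specific code tables used in MPEG-1 bitstreams, and the function name is made up.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code where frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}   # a lone symbol still needs one bit
    # Heap entries: (weight, tie-breaker, {symbol: code-so-far})
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)  # two least frequent groups
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]
```

Given the input `"aaaaaaaabbbc"`, the most common symbol `a` receives a 1-bit code while `b` and `c` receive 2-bit codes, so the 12 symbols fit in 16 bits instead of the 24 a fixed 2-bit code would need.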
=== Motion Vectors === | === Motion Vectors === | ||
Quantization* | Quantization* | ||
Ringing (large coefficients in high frequency sub-bands) | Ringing* (large coefficients in high frequency sub-bands) | ||
zigzag | zigzag | ||
Motion Vectors/Estimation | Motion Vectors/Estimation | ||
Blockiness | Blockiness | ||
CBR/VBR | CBR/VBR | ||
Spatial Complexity*
Temporal Complexity | Temporal Complexity* | ||
== Audio == | == Audio == | ||
Part 3 of the MPEG-1 standard covers audio and is defined in '''[[ISO]]/[[IEC]] 11172-3'''. | |||
MPEG-1 audio utilizes [[psychoacoustic]]s to significantly reduce the data rate required by an audio stream. It reduces or completely discards certain parts of the audio that the human ear can't ''hear'', either because they are in frequencies where the ear has limited [[sensitivity]], or are ''[[masked]]'' by other, typically louder, sounds. | |||
Channel Encoding: | |||
*Mono | *Mono | ||
*Joint Stereo (intensity encoded) | *Joint Stereo (intensity encoded) | ||
*Frame size (fixed): 1152 samples (coefficients) | *Frame size (fixed): 1152 samples (coefficients) | ||
=== Layer I === | |||
MPEG-1 Layer I is nothing more than a simplified version of Layer II, designed for even lower delay and lower complexity to facilitate [[real-time]] encoding on the hardware available circa [[1990]], for applications like teleconferencing and studio editing. With the substantial performance improvements in digital processing since, Layer I has long been obsolete. | |||
It saw limited adoption in its time, most notably on the defunct [[Philips]] [[Digital Compact Cassette]] at 384 kbps. Layer I audio files typically use the extension '''.mp1'''.
=== Layer II ===
MPEG-1 Layer II is a time-domain encoder. It uses a low-delay 32 sub-band [[Polyphase quadrature filter|polyphased]] [[filter bank]] for time-frequency mapping; having overlapping ranges (ie. polyphased) to prevent aliasing. The psychoacoustic model is based on [[auditory masking]] / [[simultaneous masking]] effects and the [[absolute threshold of hearing]] (ATH) / [[global masking threshold]]. | |||
[[Time domain]] refers to how the psychoacoustic model is applied: to short, discrete samples/chunks of the audio waveform. This offers low-delay as a small number of samples are analyzed before encoding, as opposed to [[frequency domain]] encoding (like MP3) which must analyze a large number of samples before it can decide how to transform and output encoded audio. This also offers higher performance on complex, random and [[Transient (acoustics)|transient]] impulses (such as percussive instruments, and applause), allowing avoidance of artifacts like pre-echo. | |||
The 32 sub-band filter bank returns 32 [[amplitude]] [http://en.wiktionary.org/wiki/coefficient coefficients], one for each equal-sized frequency band/segment of the audio, each about 700 Hz wide. The encoder then utilizes the psychoacoustic model to determine which sub-bands contain audible information that is less important, and so, where quantization will be inaudible, or at least much less noticeable (higher [[masking threshold]]). | |||
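The "about 700 Hz" figure follows from splitting the representable frequency range (half the sampling rate) into 32 equal bands. A minimal arithmetic sketch, where the function name `subband_width_hz` is only illustrative:

```python
def subband_width_hz(sample_rate_hz, subbands=32):
    """Width of each equal-sized sub-band for a given sampling rate."""
    nyquist_hz = sample_rate_hz / 2  # highest frequency the signal can carry
    return nyquist_hz / subbands


# The three sampling rates supported by MPEG-1 audio.
for rate in (32000, 44100, 48000):
    print(rate, round(subband_width_hz(rate), 1))
```

At 44.1 kHz this gives roughly 689 Hz per sub-band, which matches the rounded figure in the text; at 32 kHz and 48 kHz the bands are 500 Hz and 750 Hz wide.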
The psychoacoustic model is applied using a 1024-point [[Fast Fourier Transform]] (FFT). 128 of the 1152 samples are ignored (discarded as presumably insignificant) for this analysis. The psychoacoustic model determines which sub-bands contribute more to the [[masking threshold]], and the available bits are assigned to each sub-band accordingly. "Noise" components are typically more important than "tonal" components. | |||
Typically, sub-bands are less important if they contain quieter sounds (smaller coefficient) than a neighboring (ie. similar frequency) sub-band with louder sounds (larger coefficient). The less significant sub-band is then reduced in accuracy by, basically, compressing the frequency range/amplitude, (aka. raising the noise floor), and computing an amplification factor to re-expand it to the proper frequency range for playback/decoding. <ref>http://citeseer.ist.psu.edu/257196.html pp.7</ref> <ref>http://www.twolame.org/doc/psycho.html</ref> | |||
Additionally, Layer II can use intensity stereo coding. This means that both channels are down-mixed into one single (mono) channel, but the information on the relative intensity (volume, amplitude) of each channel is preserved and encoded into the bitstream separately. On playback, the single channel is played through left and right speakers, with the intensity information applied to each channel to give the illusion of stereo sound. This can allow further reduction of the audio bitrate without much perceivable loss of fidelity. | |||
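The down-mix and per-channel intensity idea can be sketched as follows. This is a simplified illustration, not the Layer II bitstream procedure: it works on a whole block of time-domain samples, uses RMS level as the "intensity" measure, and the names `intensity_encode` and `intensity_decode` are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def intensity_encode(left, right):
    """Down-mix to mono and keep one gain per channel (assumed RMS-based)."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    mono_level = rms(mono)
    if mono_level == 0:
        return mono, 1.0, 1.0  # silent or fully cancelled block: neutral gains
    return mono, rms(left) / mono_level, rms(right) / mono_level


def intensity_decode(mono, left_gain, right_gain):
    """Rebuild two channels by scaling the shared mono signal."""
    return [m * left_gain for m in mono], [m * right_gain for m in mono]


left = [1.0, -1.0, 1.0, -1.0]
right = [0.5, -0.5, 0.5, -0.5]
mono, gain_l, gain_r = intensity_encode(left, right)
print(intensity_decode(mono, gain_l, gain_r))
```

Only the level difference between the channels survives; any phase or timing difference between left and right is lost, which is why the result is described as an illusion of stereo.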
Subjective audio testing by experts, in the most critical conditions ever implemented, has shown MP2 to offer transparent audio compression at 256kbps for 16-bit 44.1khz [[CD]] audio. <ref>http://www.faqs.org/faqs/mpeg-faq/part1/ "You can compress the same stereo program down to 256 Kbits/s with no loss in discernible quality." (the original papers would be much, much better refs, but I can't seem to find them! This just proves they exist!)</ref> That (approximately) 1:6 compression ratio for CD audio is particularly impressive since it | Subjective audio testing by experts, in the most critical conditions ever implemented, has shown MP2 to offer transparent audio compression at 256kbps for 16-bit 44.1khz [[CD]] audio. <ref>http://www.faqs.org/faqs/mpeg-faq/part1/ "You can compress the same stereo program down to 256 Kbits/s with no loss in discernible quality." (the original papers would be much, much better refs, but I can't seem to find them! This just proves they exist!)</ref> That (approximately) 1:6 compression ratio for CD audio is particularly impressive since it is quite close to upper theoretical limit of [[perceptual entropy]], at just over 1:8. <ref>J. Johnston, ''Estimation of Perceptual Entropy Using Noise Masking Criteria,'' in Proc. ICASSP-88, pp. 2524-2527, May 1988.</ref> | ||
<ref>6. J. Johnston, ''Transform Coding of Audio Signals Using Perceptual Noise Criteria,'' IEEE J. Sel. Areas in Comm., pp. 314-323, Feb. 1988.</ref> | <ref>6. J. Johnston, ''Transform Coding of Audio Signals Using Perceptual Noise Criteria,'' IEEE J. Sel. Areas in Comm., pp. 314-323, Feb. 1988.</ref> | ||
Achieving much higher compression is simply not possible without discarding some perceptible information. | Achieving much higher compression is simply not possible without discarding some perceptible information. | ||
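The compression ratio quoted above can be checked with simple arithmetic from the CD audio parameters given in the text (16-bit, 44.1 kHz, stereo). A small sketch, with `pcm_bitrate_kbps` as an illustrative helper name:

```python
def pcm_bitrate_kbps(sample_rate_hz=44100, bits_per_sample=16, channels=2):
    """Bitrate of uncompressed PCM audio in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000


cd_kbps = pcm_bitrate_kbps()   # uncompressed CD audio
ratio = cd_kbps / 256          # against 256 kbit/s MP2
print(cd_kbps, round(ratio, 2))
```

This works out to 1411.2 kbit/s for CD audio and a ratio of about 5.5:1 at 256 kbit/s, which the text rounds to approximately 1:6.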
Line 228: | Line 210: | ||
MPEG-1 Layer II was derived from the Musicam audio codec (developed by Philips-needs ref). Most key features were directly inherited, including the filter bank, time-domain processing, frame sizes, etc. However, improvements were made, and the actual Musicam algorithm was not used in the final Layer II standard. The widespread usage of the term Musicam to refer to Layer II is entirely incorrect and discouraged for both technical and legal reasons. <ref>http://www.chiariglione.org/MPEG/faq/mp1-aud/mp1-aud.htm#16</ref> | MPEG-1 Layer II was derived from the Musicam audio codec (developed by Philips-needs ref). Most key features were directly inherited, including the filter bank, time-domain processing, frame sizes, etc. However, improvements were made, and the actual Musicam algorithm was not used in the final Layer II standard. The widespread usage of the term Musicam to refer to Layer II is entirely incorrect and discouraged for both technical and legal reasons. <ref>http://www.chiariglione.org/MPEG/faq/mp1-aud/mp1-aud.htm#16</ref> | ||
Despite some 20 years of progress in the field of digital audio coding, MP2 remains the preeminent lossy audio coding standard due to its especially high audio coding performances on highly critical audio material such as castanet, symphonic orchestra, male and female voices and particularly complex and high energy transients (impulses) like percussive sounds: triangle, glockenspiel and audience applause. More recent testing (of multichannel audio codecs) has shown that [[MPEG | Despite some 20 years of progress in the field of digital audio coding, MP2 remains the preeminent lossy audio coding standard due to its especially high audio coding performances on highly critical audio material such as castanet, symphonic orchestra, male and female voices and particularly complex and high energy transients (impulses) like percussive sounds: triangle, glockenspiel and audience applause. More recent testing (of multichannel audio codecs) has shown that [[MPEG Multichannel]] (based on MP2), despite being compromised by an inferior matrixed mode, rates just slightly lower than much more recent audio codecs, such as [[Dolby Digital]] AC-3 and [[Advanced Audio Coding]] (AAC) (mostly within the margin of error, actually — and still superior in some cases, namely audience applause).<ref>Wustenhagen et al, ''Subjective Listening Test of Multi-channel Audio Codecs'', AES 105th Convention Paper 4813, San Francisco 1998</ref> <ref>http://www.ebu.ch/CMSimages/en/tec_doc_t3324-2007_tcm6-53801.pdf</ref> | ||
This is one reason that MP2 audio continues to be used extensively. MP2's especially high quality, low decoder performance requirements, and tolerance of errors makes it a popular choice for applications like [[digital audio broadcasting]] (DAB). | |||
Layer II audio files typically use the extension '''.mp2''' or sometimes '''.m2a''' | |||
=== Layer III/MP3 === | === Layer III/MP3 === | ||
Line 249: | Line 228: | ||
MP3 does not benefit from the 32 sub-band filter bank, instead just MDCT transforming the data again, and processing it in the frequency domain in much smaller pieces. In fact, being forced to use the filter bank (to fit in the MPEG-1 audio standard) wastes processing time and compromises MP3 quality. | MP3 does not benefit from the 32 sub-band filter bank, instead just MDCT transforming the data again, and processing it in the frequency domain in much smaller pieces. In fact, being forced to use the filter bank (to fit in the MPEG-1 audio standard) wastes processing time and compromises MP3 quality. | ||
The Layer II 1024 point (FFT) window for spectral estimation is too small | The Layer II 1024 point (FFT) analysis window for spectral estimation is too small to cover all 1152 samples, so MP3 utilizes two sequential passes, reducing performance and increasing delay. | ||
MP3 outputs 1152 samples, but spreads the larger MP3 frames over a varying number of several Layer I/II-sized frames, making editing much more difficult, and proving more vulnerable to errors. | MP3 outputs 1152 samples, but spreads the larger MP3 frames over a varying number of several Layer I/II-sized frames, making editing much more difficult, and proving more vulnerable to errors. | ||
Line 255: | Line 234: | ||
Unlike Layers I/II, MP3 uses [[Huffman coding]] (after perceptual) to (losslessly) further reduce the bitrate, without any further quality loss, making MP3 further affected by small transmission errors. | Unlike Layers I/II, MP3 uses [[Huffman coding]] (after perceptual) to (losslessly) further reduce the bitrate, without any further quality loss, making MP3 further affected by small transmission errors. | ||
MP3 benefits greatly from being able to divide the audio into 576 frequency components using the (overlapping) MDCT transform. This allows MP3 to more accurately apply psychoacoustic rules (than can Layer II with just 32 sub-bands), particularly in the [[critical bands]] and providing much better low-bitrate performance. | MP3 benefits greatly from being able to divide the audio into 576 frequency components using the (overlapping) MDCT transform. This allows MP3 to more accurately apply psychoacoustic rules (than can Layer II with just 32 sub-bands), particularly in the [[critical bands]] ("barks") and providing much better low-bitrate performance. | ||
The frequency domain (MDCT) design of MP3 imposes some limitations as well. It causes a factor of 12 - 36 times worse temporal resolution than MP2, which can cause artifacts due to | The frequency domain (MDCT) design of MP3 imposes some limitations as well. It causes a factor of 12 - 36 times worse temporal resolution than MP2, which can cause artifacts due to transient sounds like percussive events with artifacts spread over a larger window. This results in audible smearing and [[pre-echo]]. <ref>http://www.cs.columbia.edu/~coms6181/slides/6R/mpegaud.pdf pp.8</ref> | ||
This hybrid design also introduces aliasing artifacts | This hybrid design also introduces additional aliasing artifacts. MP3 has an aliasing compensation stage to mask this, but it instead produces frequency domain energy that is pushed to the top of the frequency range, resulting in audible high frequency distortion. | ||
Because of these issues, MP2 sound quality is actually superior to MP3 at high bitrates (at the VERY LEAST, above 112 kbps/channel) <!--uncited facts are bad, but it can be clearly inferred from a combination of a few other citations.--> | Because of these issues, MP2 sound quality is actually superior to MP3 at high bitrates (at the VERY LEAST, above 112 kbps/channel) <!--uncited facts are bad, but it can be clearly inferred from a combination of a few other citations.--> | ||
Line 266: | Line 245: | ||
No scale factor band for frequencies above 15.5/15.8 kHz" | No scale factor band for frequencies above 15.5/15.8 kHz" | ||
ASPEC (Fraunhofer)* | |||
ASPEC (Fraunhofer) | |||
entropy coding (Huffman)* | entropy coding (Huffman)* | ||
Hybrid filtering* | Hybrid filtering* | ||
aliasing issues* | aliasing issues* | ||
"aliasing compensation"? need more details | "aliasing compensation"?* need more details! | ||
mid/side (or intensity) joint stereo | mid/side (or intensity) joint stereo* | ||
"If there is a transient, 192 samples are taken instead of 576 to limit the temporal spread of quantization noise" | "If there is a transient, 192 samples are taken instead of 576 to limit the temporal spread of quantization noise" | ||
psychoacoustic model and frame format from MP1/2* | psychoacoustic model and frame format from MP1/2* | ||
Line 280: | Line 258: | ||
== Systems == | == Systems == | ||
Part 1 of the MPEG-1 standard covers ''systems'' which is the logical layout of the encoded audio, video, and other bitstream data. | Part 1 of the MPEG-1 standard covers ''systems'', which is the logical layout of the encoded audio, video, and other bitstream data, and is defined in '''[[ISO]]/[[IEC]] 11172-1'''. | ||
"The MPEG-1 Systems design is essentially identical to the MPEG-2 Program Stream structure." <ref>http://www.chiariglione.org/mpeg/faq/mp1-sys/mp1-sys.htm</ref> | "The MPEG-1 Systems design is essentially identical to the MPEG-2 Program Stream structure." <ref>http://www.chiariglione.org/mpeg/faq/mp1-sys/mp1-sys.htm</ref> | ||
ES (Elementary Stream) | |||
PES (Packetized ES) | |||
PES | |||
SCR | SCR | ||
PTS | PTS (Presentation Time Stamp) | ||
Wrap-around | Wrap-around | ||
DTS | DTS (Decoding Time Stamp) | ||
Timebase correction | Timebase correction | ||
Program Stream | |||
Interleaving | |||
Pixel/Display Aspect Ratio | Pixel/Display Aspect Ratio | ||
Line 299: | Line 279: | ||
*[[MPEG]] The Moving Picture Experts Group, developers of the MPEG-1 format | *[[MPEG]] The Moving Picture Experts Group, developers of the MPEG-1 format | ||
*[[MP3]] More details on MPEG-1 Layer III audio | *[[MP3]] More details on MPEG-1 Layer III audio | ||
*[[MPEG | *[[MPEG Multichannel]] Backwards compatible 5.1 channel [[surround sound]] extension to Layer II | ||
*[[MPEG-2]] The direct successor to the MPEG-1 standard. | *[[MPEG-2]] The direct successor to the MPEG-1 standard. | ||
;Implementations | ;Implementations | ||
*[[Libavcodec]] includes MPEG-1 video/audio encoders and decoders | *[[Libavcodec]] includes MPEG-1 video/audio encoders and decoders | ||
*[ | *[http://mjpeg.sourceforge.net/ Mjpegtools] MPEG-1/2 video/audio encoders | ||
*[ | *[http://www.twolame.org/ Twolame] high quality MPEG-1 Layer II audio encoder based on [[Lame]] psychoacoustic models | ||
*[[Musepack]] high quality audio format originally based on MPEG-1 Layer II, with significant incompatible changes and improvements | *[[Musepack]] high quality audio format originally based on MPEG-1 Layer II, with significant incompatible changes and improvements |
Revision as of 09:40, 31 March 2008
Do not make any changes to this page for now. This is my mind-dump and accommodating others before I'm done will just make much, much more work for me. Put any suggestions on the Talk page, and I will eventually address them. -RC
MPEG-1 was an early standard for lossy compression of video and audio. It was designed to compress raw video and CD audio from about 43 Mbit/s down to 1.5 Mbit/s without obvious (discernible) quality loss, making Video CDs and Digital Video Broadcasting possible. [1]
MPEG-1 is used extensively, in a large number of products and technologies. Perhaps the most well-known part of the MPEG-1 standard today is the MP3 audio format it introduced.
Despite its age, MPEG-1 is not necessarily obsolete or substantially inferior to newer technologies. According to Leonardo Chiariglione (co-founder of MPEG): "the idea that compression technology keeps on improving is a myth." [2]
The MPEG-1 standard is published as ISO/IEC-11172.
History
Modeled on the successful collaborative approach and the compression technologies developed by the Joint Photographic Experts Group and CCITT's Experts Group on Telephony (creators of the JPEG image compression standard and the H.261 standard for video conferencing respectively) the MPEG working group was established in January 1988. MPEG was formed to address the need for standard video and audio encoding formats, and build on H.261 to get better quality through the use of more complex (non-real time) encoding methods. [3] [4]
Development of the MPEG-1 standard began in May 1988. 14 video and 14 audio codec proposals were submitted by individual companies and institutions for evaluation. The codecs were extensively tested for computational complexity and subjective (human perceived) quality, at data rates of 1.5 Mbit/s. This specific bitrate was chosen for transmission over T-1/E-1 lines and the approximate data rate of audio CDs. [2] The codecs that excelled in this testing were utilized as the basis for the standard and refined further, with additional features and other improvements being incorporated in the process. [5]
After 20 meetings of the full group in various cities around the world, and 4 1/2 years of development and testing, the final standard was approved in early November 1992 and published a few months later. [6] The completion date for the MPEG-1 standard, as commonly reported, varies greatly because a largely complete draft standard was produced in September 1990, and from that point on, only minor changes were introduced. In July 1990, before the first draft of the MPEG-1 standard had even been written, work began on a second standard, MPEG-2, intended to extend MPEG-1 technology to provide full broadcast-quality video at high bitrates (3 - 15 Mbit/s), and support for interlaced video. [7] Due in part to the similarity between the two codecs, the MPEG-2 standard included full backwards compatibility with MPEG-1 video, so any MPEG-2 decoder can play MPEG-1 videos.
Notably, the MPEG-1 standard very strictly defines the bitstream, and decoder function, but does not define how MPEG-1 encoding is to be performed (although they did provide a reference implementation: ISO/IEC-11172-5). This means that MPEG-1 coding efficiency can drastically vary depending on the encoder used, and generally means that newer encoders perform significantly better than their predecessors.
Applications
- Today, MPEG-1 has become by far the most widely compatible lossy audio/video format in the world.
- MPEG-1 Video and Layer I/II audio can be implemented without payment of license fees. [8] [9] [10] [11] Due to its age, many of the patents on the technology have expired.
- Most computer software for video playback includes MPEG-1 decoding, in addition to any other supported formats.
- The popularity of MP3 audio has established a massive installed base of hardware that can play back MPEG-1 audio (all 3 layers).
- Millions of portable digital audio players can play back MPEG-1 audio.
- The widespread popularity of MPEG-2 with broadcasters means MPEG-1 is playable by most digital cable/satellite set-top boxes, and digital disc and tape players, due to backwards compatibility.
- MPEG-1 is the exclusive video and audio format used on Video CDs (VCD), the first consumer digital video format, and still a very popular format around the world.
- The Super Video CD standard, based on VCD, uses MPEG-1 audio exclusively, as well as MPEG-2 video.
- DVD Video uses MPEG-2 video primarily, but MPEG-1 support is explicitly defined/specified in the standard.
- The DVD Video standard originally required MPEG-1 Layer II audio for PAL countries, but was changed to allow AC-3/Dolby Digital-only discs. MPEG-1 Layer II audio is still allowed on DVDs, although newer extensions to the format like MPEG Multichannel and variable bitrate (VBR), are rarely supported.
- Most DVD players also support Video CD and MP3 CD playback, which use MPEG-1.
- The international Digital Video Broadcasting (DVB) standard primarily uses MPEG-1 Layer II audio, as well as MPEG-2 video.
- The international Digital Audio Broadcasting (DAB) standard uses MPEG-1 Layer II audio exclusively, due to error resilience and low complexity of decoding.
- MPEG-1 Layer II audio, with MPEG Multichannel extensions, was proposed for use in the ATSC digital TV broadcasting standard, but Dolby Digital (aka. AC-3, A/52) was chosen instead. This is a matter of significant controversy, as it has been revealed that the organizations (The Massachusetts Institute of Technology and Zenith Electronics) behind 2 of the 4 voting board members received tens of millions of dollars of compensation from secret deals with Dolby Laboratories in exchange for their votes. [12]
Video
Part 2 of the MPEG-1 standard covers video and is defined in ISO/IEC-11172-2.
Color Space
Before encoding video to MPEG-1 the color-space is transformed to Y'CbCr (Y'=Luma, Cb=Chroma Blue, Cr=Chroma Red). Luma (brightness/resolution) is stored separately from chroma (color, hue, phase) and even further separated into red and blue components. The chroma is also subsampled to 4:2:0, meaning it is decimated by half vertically and half horizontally, to just one quarter the resolution of the video.
Y'CbCr is often inaccurately called YUV which is actually only used in the domain of analog video signals. Similarly, the terms luminance and chrominance are often used instead of the more accurate/appropriate terms luma and chroma.
Because the human eye is much less sensitive to small changes in color than in brightness, chroma subsampling is a very effective way to reduce the amount of video data that needs to be compressed. On videos with fine details (high spatial complexity) this can manifest as chroma aliasing artifacts. Compared to other digital compression artifacts, this issue seems to be very rarely a source of annoyance.
Because of subsampling, Y'CbCr video must always be stored using even dimensions (divisible by 2), otherwise chroma mismatch ("ghosts") will occur, and it will appear as if the color is ahead of, or behind the rest of the video, much like a shadow.
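The 4:2:0 decimation and the even-dimension requirement described above can be illustrated with a short sketch. It assumes the simplest possible filter (averaging each 2x2 group of chroma samples); real encoders may use other filters, and `subsample_420` is a hypothetical name.

```python
def subsample_420(plane):
    """Halve a chroma plane in both directions by averaging 2x2 groups."""
    height = len(plane)
    width = len(plane[0])
    if height % 2 or width % 2:
        # An odd dimension leaves a row or column without a complete 2x2 group.
        raise ValueError("4:2:0 subsampling needs even dimensions")
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1]) // 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


cb = [
    [100, 102, 50, 50],
    [98, 100, 50, 50],
]
print(subsample_420(cb))  # [[100, 50]]
```

The 4x2 chroma plane shrinks to 2x1, one quarter of the original sample count, while the luma plane would be left at full resolution.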
Resolution
MPEG-1 supports resolutions up to 4095×4095, and bitrates up to 100 Mbit/sec. [4]
MPEG-1 videos are most commonly found using Source Input Format (SIF) resolutions: 352x240, 352x288, or 320x240. These low resolutions, combined with a bitrate less than 1.5 Mbit/s, make up what is known as a constrained parameters bitstream. These are the minimum video specifications any decoder should be able to play to be considered MPEG-1 compliant. They were selected to provide a good balance between quality and performance, allowing the use of reasonably inexpensive hardware of the time.
I-Frames
MPEG-1 has several frame and picture types. The first, most important, yet simplest are I-frames.
I-frame is an abbreviation for Intra-frame, so-called because they can be decoded independently of any other frames. They may also be known as I-pictures, or keyframes due to their somewhat similar function to the key frames used in animation. I-frames can be considered effectively identical to JPEG images.
High-speed seeking through an MPEG-1 video is only possible to the nearest I-frame. When cutting a video it is not possible to start playback of a segment of video before the first I-frame in the segment (at least not without computationally-intensive re-encoding). For this reason, I-frame only MPEG videos are used in editing applications.
I-frame only compression is very fast, but produces very large file sizes, on the order of 2 - 14 × larger than normally encoded MPEG-1 video. [13] I-frame only MPEG-1 video is very similar to MJPEG video, so much so that very high-speed and nearly lossless (except rounding errors) conversion can be made from one format to the other, provided a couple restrictions (color space and quantizer table) are followed in the creation of the original bitstream. [14]
The length between I-frames is known as the group of pictures (GOP) size. MPEG-1 most commonly uses a GOP size of 12-15, i.e. 1 I-frame for every 11-14 non-I-frames (some combination of P-frames and B-frames). With more intelligent encoders, GOP size is dynamically chosen, up to some pre-selected maximum limit.
Limits are placed on the maximum number of frames between I-frames due to decoding complexity, decoder buffer size, seeking ability, and the accumulation of IDCT errors in low-precision implementations common in hardware decoders.
P-frames
P-frame is an abbreviation for Predicted-frame. They may also be known as forward-predicted frames, or inter-frames (B-frames are also inter-frames).
P-frames exist to improve compression by exploiting the temporal (over time) redundancy in a video. P-frames store only the difference in image from the frame (either an I-frame or P-frame) immediately preceding it (this reference frame is also called the anchor frame). This is called conditional replenishment.
The difference between a P-frame and its anchor frame is calculated using motion vectors on each macroblock of the frame (see below). Motion vector data will be embedded in the P-frame for use by the decoder.
If a reasonable match is found, the block from the previous frame is used, and any prediction error (difference between the predicted block and the actual video) is encoded and stored in the P-frame. If a reasonably close match from the previous frame for a block cannot be found, the block will be intra-coded (storing the block as an image, in full). A P-frame can contain any number of intra-coded blocks, in addition to any forward-predicted blocks.
If a video drastically changes from one frame to the next (such as a scene change), it can be more efficient to encode it as an I-frame.
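The search for a "reasonable match" in the previous frame can be sketched as an exhaustive block search. The standard does not define how an encoder finds matches, so everything here is an assumption for illustration: the sum of absolute differences (SAD) as the cost, the small 4x4 block and search range, and the names `sad`, `get_block` and `best_motion_vector`.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_motion_vector(reference, current, top, left, size=4, search=2):
    """Return (cost, dy, dx) of the best match for a block of `current`."""
    target = get_block(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block would fall outside the frame
            cost = sad(get_block(reference, y, x, size), target)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# Reference frame with distinct values; the current frame is the same
# picture shifted one pixel to the right.
reference = [[y * 8 + x for x in range(8)] for y in range(8)]
current = [[row[0]] + row[:-1] for row in reference]
print(best_motion_vector(reference, current, top=2, left=2))  # (0, 0, -1)
```

A cost of 0 means the prediction error for that block is empty; an encoder following the description above would fall back to intra-coding the block when the best cost stays above some threshold of its choosing.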
B-frames
B-frame stands for bidirectional-frame. They may also be known as B-pictures or backwards-predicted frames.
B-frames are quite similar to P-frames, except they can make predictions using either, or both of, the previous and future frames (ie. two anchor frames).
It is therefore necessary for the player to first decode the next I- or P- anchor frame sequentially after the B-frame, before the B-frame can be decoded and displayed. This makes B-frames very computationally complex, requires larger data buffers, and causes a delay of around half a second [15] on decoding, and more during encoding. This also necessitates the decoding time stamps (DTS) feature in the container/system stream (see below). As such, B-frames have long been the subject of much controversy, are often avoided in videos, and have limited/restricted support in hardware.
No other frames are predicted from a B-frame. Because of this, a very low bitrate B-frame can be inserted, where needed, to help control the bitrate. If this was done with a P-frame, future P-frames would be predicted from it and this will lower the quality of the rest of the GOP. However, similarly, the future P-frame must still encode all the changes between it and the previous I-frame or P-frame anchor frame, a second time, in addition to much of the changes being coded in the B-frame. B-frames can also be beneficial in videos where the background behind an object is being revealed over several frames, and in fading transitions, such as from one scene to the next.
A B-frame can contain any number of intra-coded blocks and forward-predicted blocks, in addition to backwards-predicted, or bidirectionally predicted blocks.
D-frames
MPEG-1 has a unique frame type not found in later video standards. D-frames or DC-pictures are independent images (intra-frames) that have been encoded DC-only (AC coefficients are removed—see DCT below) and hence are very low quality. D-frames are never referenced by I-, P- or B- frames. D-frames are only used for fast previews of video, for instance when seeking through a video at high speed.
Given moderately higher-performance decoding equipment, this feature can be approximated by decoding I-frames instead. This provides higher quality previews, and without the need for D-frames taking up space in the stream, yet not contributing to overall quality.
Macroblocks
MPEG-1 operates on video in a series of 8x8 blocks for quantization, motion estimation, etc. However, because chroma is subsampled by a factor of 4, 1 chroma block corresponds to 4 luma blocks. This gives us the 16x16 macroblock as the smallest independent unit in MPEG-1 video.
It is very important to maintain video resolutions that are multiples of 16. See Motion Vectors for more reasons.
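A quick way to see why multiples of 16 matter is to count how many 16x16 macroblocks a frame needs and how many pixels of padding are left over. A small sketch, with `macroblock_grid` as an illustrative name:

```python
def macroblock_grid(width, height, size=16):
    """Return (columns, rows, pad_right, pad_bottom) for a frame."""
    cols = -(-width // size)   # ceiling division
    rows = -(-height // size)
    return cols, rows, cols * size - width, rows * size - height


print(macroblock_grid(352, 240))  # (22, 15, 0, 0)
print(macroblock_grid(350, 250))  # (22, 16, 2, 6)
```

The SIF resolution 352x240 divides evenly into 22x15 macroblocks, while a 350x250 frame needs the same 22 columns plus an extra row, with partially filled macroblocks along the right and bottom edges.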
Black Bars Cropped macroblocks Noise around edges
DCT
Each 8x8 block is encoded using the Forward Discrete Cosine Transform (FDCT). This process by itself is practically lossless (there are some rounding errors), and is reversed by the Inverse DCT (IDCT) upon playback to produce the original values.
The FDCT process converts the 64 uncompressed pixel values (brightness) into 64 different frequency values: one (large) value that is the average of the entire 8x8 block (the DC coefficient) and 63 smaller, positive or negative values (the AC coefficients), that are relative to the value of the DC coefficient.
The DC coefficient remains mostly consistent from one block to the next, and can be compressed quite effectively with DPCM so only the amount of difference between each DC value needs to be stored. Also, a significant number of the AC coefficients will be near 0, (known as sparse data) which can then be more efficiently compressed in a later step. Additionally, this DCT frequency conversion is necessary for quantization (see below).
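The transform and the DPCM step described above can be sketched in plain Python. This uses the common textbook scaling of the 8x8 DCT, under which the DC coefficient comes out as 8 times the block average rather than the average itself; the function names `fdct_8x8` and `dpcm_encode` are illustrative, and storing the first DC value against a predictor of 0 is an assumption of this sketch.

```python
import math

N = 8


def fdct_8x8(block):
    """Forward 2-D DCT of an 8x8 block (direct, unoptimized form)."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out


def dpcm_encode(dc_values):
    """Store each DC value as the difference from the previous one."""
    previous = 0  # assumed starting predictor for this sketch
    deltas = []
    for dc in dc_values:
        deltas.append(dc - previous)
        previous = dc
    return deltas


flat = [[100] * N for _ in range(N)]   # a block of uniform brightness
coeffs = fdct_8x8(flat)
print(round(coeffs[0][0]))             # 800
print(dpcm_encode([800, 808, 805, 805]))  # [800, 8, -3, 0]
```

For the flat block, all of the energy lands in the DC coefficient and the 63 AC coefficients are zero apart from floating-point rounding, which is the sparse pattern the later compression steps rely on.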
Quantization
Quantization (of digital data) is, essentially, the process of reducing the accuracy of a signal, by dividing it into some larger step size (eg. finding the nearest multiple, and discarding the remainder/modulus).
The frame-level quantizer is a number from 1 to 31 (although encoders will often omit/disable some of the extreme values) which determines how much information will be removed from a given frame. The frame-level quantizer is either dynamically selected by the encoder to maintain a certain specified bitrate, or (much less commonly) specified by the user.
Contrary to popular belief, a fixed quantizer (set by the user) does not deliver a constant level of quality. Instead, it is an arbitrary metric, that will provide a somewhat varying level of quality depending on the contents of each frame. Given two files of identical sizes, the one encoded by setting the bitrate should look better than the one with a set quantizer. Constant quantizer encoding can be used, however, to determine the minimum and maximum bitrates possible for encoding a given video.
A quantization table is a string of 64 numbers (0-255) that tells the encoder how relatively important or unimportant each piece of visual information is. Each number in the table corresponds to a certain frequency component of the video image.
Each of the 64 frequency values of the DCT block is divided by the frame-level quantizer, then divided by its corresponding value in the quantization table, and rounded off. This reduces or completely eliminates the information in some frequency components of the video, deemed less visually important. Typically, high frequency information is less visually important, and so high frequencies are much more strongly quantized (ie. reduced or removed). MPEG-1 actually uses two separate quantization tables, one for intra-blocks (I-blocks) and one for inter-blocks (P-/B-blocks) so quantization of different block types can be done independently.
This quantization process usually reduces a significant number of the AC coefficients to zero, which improves the effectiveness of entropy coding (lossless compression) in the next step.
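The divide-and-round step can be sketched directly from the description above. This follows the simplified wording of the text rather than the exact arithmetic of the standard, the weights in `table` are illustrative values, and `quantize` and `dequantize` are hypothetical names.

```python
def quantize(coeffs, table, quantizer):
    """Divide each coefficient by quantizer * table weight and round."""
    return [round(c / (quantizer * w)) for c, w in zip(coeffs, table)]


def dequantize(levels, table, quantizer):
    """Decoder side: multiply back; the rounding loss is not recoverable."""
    return [level * quantizer * w for level, w in zip(levels, table)]


coeffs = [800.0, 120.0, -45.0, 10.0, 3.0, -2.0]  # first few DCT values
table = [8, 16, 19, 22, 26, 27]                  # illustrative weights
levels = quantize(coeffs, table, quantizer=2)
print(levels)                        # [50, 4, -1, 0, 0, 0]
print(dequantize(levels, table, 2))  # [800, 128, -38, 0, 0, 0]
```

The small high-frequency values collapse to zero and the surviving ones come back only approximately (120 becomes 128, -45 becomes -38), which is where the loss in this step occurs.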
Quantization eliminates a large amount of data, and is the main lossy processing step in MPEG-1 video encoding. This is also the primary source of most MPEG-1 video compression artifacts, like blockiness, color banding, noise, ringing, discoloration, et al. when video is encoded with an insufficient bitrate, therefore being forced to use high frame-level quantizers (strong quantization) through much of the video.
Entropy Coding
Several steps in the encoding of MPEG-1 video are lossless, meaning they will be reversed on decoding to produce exactly the same values. Since these lossless data compression steps don't add noise into or otherwise change the video (unlike quantization), they are often referred to as noiseless coding in the context of lossy codecs. Since lossless compression aims to remove as much redundancy as possible, it is also known as entropy coding in information theory.
RLE
Run-length encoding (RLE) is a very simple method of compressing repetition. A sequential string of characters, no matter how long, can be replaced with a few bytes, noting the value that repeats, and how many times. For example, if someone is to say "five nines", you would know they mean the number 99999.
RLE is particularly effective after quantization, as a significant number of the AC coefficients are now zero (called sparse data), and can be represented with just a couple bytes, in a special 2-dimensional Huffman table that codes the run-length and the run-ending character.
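The run-length and run-ending value pairing can be sketched as below. The pair layout and the name `run_level_pairs` are illustrative; the sketch stops at producing the pairs and does not model how they are then mapped to variable-length codes.

```python
def run_level_pairs(coeffs):
    """Turn a coefficient list into (zero_run, value) pairs."""
    pairs = []
    run = 0
    for value in coeffs:
        if value == 0:
            run += 1          # extend the current run of zeros
        else:
            pairs.append((run, value))
            run = 0
    # Trailing zeros are not emitted; an end-of-block marker implies them.
    return pairs


quantized = [50, 4, -1, 0, 0, 0, 2, 0, 0, 0]
print(run_level_pairs(quantized))  # [(0, 50), (0, 4), (0, -1), (3, 2)]
```

Ten coefficients shrink to four pairs here, and the saving grows with longer zero runs, which is why this step works best after strong quantization.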
Huffman Coding
The data is then analyzed to find strings that repeat often. Those strings are then put into a special Huffman table, with the most frequently repeating data assigned the shortest code to keep the data as small as possible.
Once the table is constructed, those strings in the data are replaced with their (much smaller) codes, which reference the appropriate entries in the table.
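The table construction described above can be illustrated with a generic Huffman coder. This is a textbook sketch rather than the code tables used by MPEG-1 itself; the function names are hypothetical, and single characters stand in for the symbols being coded.

```python
import heapq
from collections import Counter


def build_huffman_codes(symbols):
    """Return {symbol: bit string}; frequent symbols get shorter codes."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie-breaker, partial code table).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, codes1 = heapq.heappop(heap)
        c2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def huffman_encode(symbols, codes):
    return "".join(codes[s] for s in symbols)


def huffman_decode(bits, codes):
    lookup = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:  # codes are prefix-free, so this is unambiguous
            out.append(lookup[current])
            current = ""
    return "".join(out)


data = "aaaaaaaabbbbccd"
codes = build_huffman_codes(data)
encoded = huffman_encode(data, codes)
print(codes)
print(len(encoded), "bits instead of", 8 * len(data))
```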
Motion Vectors
Conditional Replenishment
P and B frames
P-frames can use up to one motion vector per macroblock, while B-frames can use two: one from the previous frame and one from the next frame. [16]
prediction error encoded
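A rough sketch of how two motion vectors and a prediction error fit together is given below. It assumes that the forward and backward predictions are combined by a rounded average, which is an assumption of this example; frames are plain lists of rows, vectors are whole-pixel (dy, dx) offsets, and all names are hypothetical.

```python
def fetch_block(frame, top, left, mv, size):
    """Copy a size x size block from frame, displaced by mv = (dy, dx)."""
    dy, dx = mv
    return [
        row[left + dx:left + dx + size]
        for row in frame[top + dy:top + dy + size]
    ]


def predict_b_block(prev_frame, next_frame, top, left, mv_fwd, mv_bwd, size):
    """Average the forward and backward predictions (assumed rounding)."""
    fwd = fetch_block(prev_frame, top, left, mv_fwd, size)
    bwd = fetch_block(next_frame, top, left, mv_bwd, size)
    return [
        [(a + b + 1) // 2 for a, b in zip(fwd_row, bwd_row)]
        for fwd_row, bwd_row in zip(fwd, bwd)
    ]


def prediction_error(actual, predicted):
    """The residual that is left over to be transformed and coded."""
    return [
        [a - p for a, p in zip(actual_row, pred_row)]
        for actual_row, pred_row in zip(actual, predicted)
    ]


prev_frame = [[10 * r + c for c in range(4)] for r in range(4)]
next_frame = [[v + 4 for v in row] for row in prev_frame]
pred = predict_b_block(prev_frame, next_frame, 1, 1, (0, 0), (1, 1), size=2)
print(pred)
print(prediction_error([[20, 20], [30, 28]], pred))
```

A P-frame corresponds to using only the forward prediction; the better the prediction, the closer the residual is to zero and the cheaper it is to code.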
Macroblock multiples of 16
Cropped macroblocks
The same problem can be seen where black bars do not fall on a macroblock boundary.
- Quantization*
- Ringing* (large coefficients in high frequency sub-bands)
- zigzag
- Motion Vectors/Estimation
- Black borders/Noise
- pel precision (half pixel IIRC)
- Two MV per macroblock (forward/backward pred)
- Prediction error DPCM encoded, just like DC coeffs
- Blockiness
- CBR/VBR
- Spatial Complexity*
- Temporal Complexity*
Audio
Part 3 of the MPEG-1 standard covers audio and is defined in ISO/IEC 11172-3.
MPEG-1 audio utilizes psychoacoustics to significantly reduce the data rate required by an audio stream. It reduces or completely discards certain parts of the audio that the human ear can't hear, either because they are in frequencies where the ear has limited sensitivity, or are masked by other, typically louder, sounds.
Channel Encoding:
- Mono
- Joint Stereo (intensity encoded)
- Stereo
- Dual (two uncorrelated mono channels)
- Sampling rates: 32, 44.1 and 48 kHz
- Bitrates: 32, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320 and 384 kbit/s
- Frame size (fixed): 1152 samples (coefficients)
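The parameters above determine how long each frame lasts and how many bytes it occupies. The following sketch works through that arithmetic using the 1152-sample frame size listed above; the function names are made up for this example, and header or padding details are ignored.

```python
SAMPLES_PER_FRAME = 1152  # frame size from the list above


def frame_duration_ms(sample_rate_hz):
    """Playback time covered by one frame, in milliseconds."""
    return 1000.0 * SAMPLES_PER_FRAME / sample_rate_hz


def frame_size_bytes(bitrate_bps, sample_rate_hz):
    """Average number of bytes per frame at a constant bitrate."""
    return SAMPLES_PER_FRAME * bitrate_bps / sample_rate_hz / 8


for rate in (32000, 44100, 48000):
    print(rate, "Hz:", round(frame_duration_ms(rate), 2), "ms per frame")

print(frame_size_bytes(384000, 48000), "bytes per frame at 384 kbit/s, 48 kHz")
print(round(frame_size_bytes(256000, 44100), 2), "bytes per frame at 256 kbit/s, 44.1 kHz")
```

At 44.1 kHz the result is not a whole number of bytes, so a constant-bitrate stream cannot give every frame exactly the same length.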
Layer I
MPEG-1 Layer I is nothing more than a simplified version of Layer II, designed for even lower delay and lower complexity to facilitate real-time encoding on the hardware available circa 1990, for applications like teleconferencing and studio editing. With the substantial performance improvements in digital processing since, Layer I has long been obsolete.
It saw limited adoption in its time, most notably on the defunct Philips Digital Compact Cassette at 384 kbit/s. Layer I audio files typically use the extension .mp1.
Layer II
MPEG-1 Layer II is a time-domain encoder. It uses a low-delay 32 sub-band polyphase filter bank for time-frequency mapping, with overlapping ranges (i.e. polyphase) to prevent aliasing. The psychoacoustic model is based on auditory masking / simultaneous masking effects and the absolute threshold of hearing (ATH) / global masking threshold.
Time domain refers to how the psychoacoustic model is applied: to short, discrete samples/chunks of the audio waveform. This offers low delay, as only a small number of samples are analyzed before encoding, as opposed to frequency domain encoding (like MP3), which must analyze a large number of samples before it can decide how to transform and output encoded audio. This also offers higher performance on complex, random and transient impulses (such as percussive instruments and applause), allowing avoidance of artifacts like pre-echo.
The 32 sub-band filter bank returns 32 amplitude coefficients, one for each equal-sized frequency band/segment of the audio, each of which is about 700 Hz wide. The encoder then utilizes the psychoacoustic model to determine which sub-bands contain audible information that is less important, and so where quantization will be inaudible, or at least much less noticeable (higher masking threshold).
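The width of each sub-band follows from splitting the representable frequency range (half the sampling rate) into 32 equal parts. A short calculation, with a hypothetical helper name, shows where the figure of roughly 700 Hz comes from:

```python
def subband_width_hz(sample_rate_hz, subbands=32):
    """Width of one equal-sized sub-band, in Hz."""
    return (sample_rate_hz / 2) / subbands


for rate in (32000, 44100, 48000):
    print(rate, "Hz sampling:", subband_width_hz(rate), "Hz per sub-band")
```

At the CD sampling rate of 44.1 kHz each band is about 689 Hz wide; the width changes with the sampling rate.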
The psychoacoustic model is applied using a 1024-point Fast Fourier Transform (FFT). 128 of the 1152 samples are ignored (discarded as presumably insignificant) for this analysis. The psychoacoustic model determines which sub-bands contribute more to the masking threshold, and the available bits are assigned to each sub-band accordingly. "Noise" components are typically more important than "tonal" components.
Typically, sub-bands are less important if they contain quieter sounds (smaller coefficient) than a neighboring (i.e. similar frequency) sub-band with louder sounds (larger coefficient). The less significant sub-band is then reduced in accuracy by, basically, compressing the frequency range/amplitude (a.k.a. raising the noise floor), and computing an amplification factor to re-expand it to the proper frequency range for playback/decoding. [17] [18]
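The "compress, then re-expand with an amplification factor" idea can be sketched as a scale factor plus a coarse quantizer. This is a simplified illustration: the function names and the way the scale factor is chosen are assumptions of this example, not the scale factor tables or quantizer classes defined for Layer II.

```python
def encode_subband(samples, bits):
    """Normalize by a scale factor, then quantize to `bits` bits per sample."""
    scale = max(abs(s) for s in samples) or 1.0  # avoid dividing by zero
    max_level = 2 ** (bits - 1) - 1
    codes = [int(round(s / scale * max_level)) for s in samples]
    return scale, codes


def decode_subband(scale, codes, bits):
    """Re-expand the coarse codes using the stored scale factor."""
    max_level = 2 ** (bits - 1) - 1
    return [c / max_level * scale for c in codes]


samples = [0.5, -0.2, 0.1, 0.0]
for bits in (4, 2):
    scale, codes = encode_subband(samples, bits)
    decoded = decode_subband(scale, codes, bits)
    worst = max(abs(d - s) for d, s in zip(decoded, samples))
    print(bits, "bits:", codes, "worst error", round(worst, 4))
```

Giving a sub-band fewer bits raises its quantization noise; the psychoacoustic model is what decides where that extra noise will be masked.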
Additionally, Layer II can use intensity stereo coding. This means that both channels are down-mixed into one single (mono) channel, but the information on the relative intensity (volume, amplitude) of each channel is preserved and encoded into the bitstream separately. On playback, the single channel is played through left and right speakers, with the intensity information applied to each channel to give the illusion of stereo sound. This can allow further reduction of the audio bitrate without much perceivable loss of fidelity.
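The down-mix and per-channel intensity described above can be sketched as follows. This is a simplified, full-band illustration with hypothetical function names; it measures intensity as RMS level over one block of samples, which is an assumption of this example rather than the procedure from the standard.

```python
import math


def rms(block):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(v * v for v in block) / len(block))


def intensity_encode(left, right):
    """Down-mix to mono and keep one gain per channel."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    mono_rms = rms(mono) or 1.0  # avoid dividing by zero on silence
    return mono, rms(left) / mono_rms, rms(right) / mono_rms


def intensity_decode(mono, gain_left, gain_right):
    """Play the mono signal through both channels at different levels."""
    return [m * gain_left for m in mono], [m * gain_right for m in mono]


left = [0.4, -0.8, 0.2, 0.6]
right = [0.5 * v for v in left]  # same sound, half as loud on the right
mono, gain_l, gain_r = intensity_encode(left, right)
out_l, out_r = intensity_decode(mono, gain_l, gain_r)
print(round(gain_l, 4), round(gain_r, 4))
print([round(v, 4) for v in out_l])
```

When the two channels differ only in level, as here, the reconstruction is essentially exact. Any other difference between the channels (timing, phase, distinct content) is lost, which is why the stereo image is described above as an illusion.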
Subjective audio testing by experts, in the most critical conditions ever implemented, has shown MP2 to offer transparent audio compression at 256 kbit/s for 16-bit 44.1 kHz CD audio. [19] That (approximately) 1:6 compression ratio for CD audio is particularly impressive since it is quite close to the upper theoretical limit of perceptual entropy, at just over 1:8. [20] [21] Achieving much higher compression is simply not possible without discarding some perceptible information.
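The compression ratio quoted above can be checked with simple arithmetic, assuming two-channel (stereo) CD audio. The helper name below is made up for this example.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of uncompressed PCM audio, in bit/s."""
    return sample_rate_hz * bits_per_sample * channels


cd_bitrate = pcm_bitrate(44100, 16, 2)
print(cd_bitrate, "bit/s for CD audio")
print("ratio at 256 kbit/s: 1 :", cd_bitrate / 256000)
print("bitrate at a 1:8 ratio:", cd_bitrate / 8, "bit/s")
```

The ratio works out to about 1:5.5, and a 1:8 ratio would correspond to roughly 176 kbit/s.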
MPEG-1 Layer II was derived from the Musicam audio codec (developed by Philips-needs ref). Most key features were directly inherited, including the filter bank, time-domain processing, frame sizes, etc. However, improvements were made, and the actual Musicam algorithm was not used in the final Layer II standard. The widespread usage of the term Musicam to refer to Layer II is entirely incorrect and discouraged for both technical and legal reasons. [22]
Despite some 20 years of progress in the field of digital audio coding, MP2 remains the preeminent lossy audio coding standard due to its especially high audio coding performance on highly critical audio material such as castanets, symphonic orchestra, male and female voices, and particularly complex and high energy transients (impulses) like percussive sounds: triangle, glockenspiel and audience applause. More recent testing (of multichannel audio codecs) has shown that MPEG Multichannel (based on MP2), despite being compromised by an inferior matrixed mode, rates just slightly lower than much more recent audio codecs, such as Dolby Digital AC-3 and Advanced Audio Coding (AAC). The difference is mostly within the margin of error, and MPEG Multichannel is still superior in some cases, namely audience applause. [23] [24]
This is one reason that MP2 audio continues to be used extensively. MP2's especially high quality, low decoder performance requirements, and tolerance of errors make it a popular choice for applications like digital audio broadcasting (DAB).
Layer II audio files typically use the extension .mp2 or sometimes .m2a.
Layer III/MP3
MP3 is a frequency domain transform encoder that utilizes a dynamic psychoacoustic model. Layer III audio files use the extension .mp3.
MP3 is based on Optimum Coding in the Frequency Domain (OCF), the Ph.D. thesis of Karlheinz Brandenburg, which was the primary basis for Adaptive Spectral Perceptual Entropy Coding (ASPEC), developed by Fraunhofer. ASPEC was then adapted to fit in with Layer II, to become MP3.
Even though it utilizes some of the lower layer functions, MP3 is quite different from MP2.
In addition to intensity encoded joint stereo, Layer III can alternatively use mid/side joint stereo, which encodes the sum (mid) and the difference (side) of the two channels instead of encoding the left and right channels directly.
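Mid/side coding is commonly described as a sum/difference transform of the two channels. The sketch below shows that transform in its simplest form; the function names are hypothetical, and the scaling used here (dividing by two) is an assumption of this example and may differ from the scaling in the standard.

```python
def ms_encode(left, right):
    """Convert left/right samples into mid (sum) and side (difference)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Recover left/right from mid/side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


left = [4, 6, -2]
right = [2, 6, -4]
mid, side = ms_encode(left, right)
print(mid, side)
print(ms_decode(mid, side))
```

When the two channels are similar, the side signal is small and cheap to encode. Unlike intensity stereo, the transform itself loses nothing.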
MP3 does not benefit from the 32 sub-band filter bank; instead it simply applies an MDCT transform to the data again, and processes it in the frequency domain in much smaller pieces. In fact, being forced to use the filter bank (to fit in the MPEG-1 audio standard) wastes processing time and compromises MP3 quality.
The Layer II 1024-point (FFT) analysis window for spectral estimation is too small to cover all 1152 samples, so MP3 utilizes two sequential passes, reducing performance and increasing delay.
MP3 outputs 1152 samples, but spreads the larger MP3 frames over a varying number of Layer I/II-sized frames, making editing much more difficult, and proving more vulnerable to errors.
Unlike Layers I/II, MP3 uses Huffman coding (after perceptual coding) to losslessly reduce the bitrate further, without any additional quality loss. This also makes MP3 more sensitive to small transmission errors.
MP3 benefits greatly from being able to divide the audio into 576 frequency components using the (overlapping) MDCT transform. This allows MP3 to more accurately apply psychoacoustic rules (than can Layer II with just 32 sub-bands), particularly in the critical bands ("barks") and providing much better low-bitrate performance.
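The figure of 576 frequency components can be related to the 32 sub-bands with a little arithmetic. The sketch below assumes that each sub-band is split into the same number of MDCT lines (18 for 576 components in total); the helper names are made up for this example.

```python
SUBBANDS = 32


def mdct_lines(lines_per_subband):
    """Total number of frequency components after the MDCT."""
    return SUBBANDS * lines_per_subband


def line_width_hz(sample_rate_hz, lines):
    """Approximate width of one frequency component, in Hz."""
    return (sample_rate_hz / 2) / lines


total = mdct_lines(18)
print(total, "components, about", line_width_hz(44100, total), "Hz each at 44.1 kHz")
print("Layer II sub-band for comparison:", line_width_hz(44100, SUBBANDS), "Hz")
```

At 44.1 kHz this gives components about 38 Hz wide, against about 689 Hz for a Layer II sub-band, which is the finer frequency resolution referred to above.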
The frequency domain (MDCT) design of MP3 imposes some limitations as well. It causes a factor of 12 to 36 times worse temporal resolution than MP2, so the quantization noise from transient sounds, such as percussive events, is spread over a larger window. This results in audible smearing and pre-echo. [25]
This hybrid design also introduces additional aliasing artifacts. MP3 has an aliasing compensation stage to mask this, but it instead produces frequency domain energy that is pushed to the top of the frequency range, and results in audible high frequency distortion.
Because of these issues, MP2 sound quality is actually superior to MP3 at high bitrates (at the very least, above 112 kbit/s per channel).
"Frequency resolution is limited by the small long block window size, decreasing coding efficiency. No scale factor band for frequencies above 15.5/15.8 kHz."
- ASPEC (Fraunhofer)*
- entropy coding (Huffman)*
- Hybrid filtering*
- aliasing issues*
- "aliasing compensation"?* need more details!
- mid/side (or intensity) joint stereo*
- "If there is a transient, 192 samples are taken instead of 576 to limit the temporal spread of quantization noise"
- psychoacoustic model and frame format from MP1/2*
- ringing
- CBR/VBR
Systems
Part 1 of the MPEG-1 standard covers systems, which is the logical layout of the encoded audio, video, and other bitstream data, and is defined in ISO/IEC 11172-1.
"The MPEG-1 Systems design is essentially identical to the MPEG-2 Program Stream structure." [26]
- ES (Elementary Stream)
- PES (Packetized ES)
- SCR
- PTS (Presentation Time Stamp)
- Wrap-around
- DTS (Decoding Time Stamp)
- Timebase correction
- Program Stream
- Interleaving
- Pixel/Display Aspect Ratio
See Also
- MPEG The Moving Picture Experts Group, developers of the MPEG-1 format
- MP3 More details on MPEG-1 Layer III audio
- MPEG Multichannel Backwards compatible 5.1 channel surround sound extension to Layer II
- MPEG-2 The direct successor to the MPEG-1 standard.
- Implementations
- Libavcodec includes MPEG-1 video/audio encoders and decoders
- Mjpegtools MPEG-1/2 video/audio encoders
- Twolame high quality MPEG-1 Layer II audio encoder based on Lame psychoacoustic models
- Musepack high quality audio format originally based on MPEG-1 Layer II, with significant incompatible changes and improvements
References
- [1] http://www.chiariglione.org/mpeg/meetings/kurihama89/kurihama_press.htm
- [2] http://www.chiariglione.org/leonardo/publications/linux/linux00.htm
- [3] http://www.cis.temple.edu/~vasilis/Courses/CIS750/Papers/mpeg_6.pdf pp.2
- [4] http://bmrc.berkeley.edu/research/mpeg/faq/mpeg2-v38/faq_v38.html
- [5] http://www.chiariglione.org/mpeg/meetings/santa_clara90/santa_clara_press.htm
- [6] http://www.chiariglione.org/mpeg/meetings.htm
- [7] http://www.chiariglione.org/mpeg/meetings/london/london_press.htm
- [8] http://www.emedialive.com/Articles/ReadArticle.aspx?ArticleID=12165
- [9] http://www.extremetech.com/article2/0,1697,1153916,00.asp
- [10] http://www.snazzizone.com/TP09.html
- [11] http://213.130.34.82/resources/technical/mpegcompared/index.htm
- [12] http://www-tech.mit.edu/V122/N54/54hdtv.54n.html
- [13] mostly mathematical fact; verifiable using e.g. libavcodec: compare MPEG1 encoding using vqscale=X:keyint=1 to vqscale=X:keyint=15
- [14] http://citeseer.ist.psu.edu/acharya98compressed.html Compressed Domain Transcoding of MPEG (Requires clever reading: says quantization tables differ, but those are user selectable)
- [15] FPS / 60 * (number of sequential B-frames) == X seconds delay
- [16] http://www.hpl.hp.com/personal/Susie_Wee/PAPERS/hpidc97/hpidc97.html
- [17] http://citeseer.ist.psu.edu/257196.html pp.7
- [18] http://www.twolame.org/doc/psycho.html
- [19] http://www.faqs.org/faqs/mpeg-faq/part1/ "You can compress the same stereo program down to 256 Kbits/s with no loss in discernible quality." (the original papers would be much, much better refs, but I can't seem to find them! This just proves they exist!)
- [20] J. Johnston, Estimation of Perceptual Entropy Using Noise Masking Criteria, in Proc. ICASSP-88, pp. 2524-2527, May 1988.
- [21] J. Johnston, Transform Coding of Audio Signals Using Perceptual Noise Criteria, IEEE J. Sel. Areas in Comm., pp. 314-323, Feb. 1988.
- [22] http://www.chiariglione.org/MPEG/faq/mp1-aud/mp1-aud.htm#16
- [23] Wustenhagen et al, Subjective Listening Test of Multi-channel Audio Codecs, AES 105th Convention Paper 4813, San Francisco 1998
- [24] http://www.ebu.ch/CMSimages/en/tec_doc_t3324-2007_tcm6-53801.pdf
- [25] http://www.cs.columbia.edu/~coms6181/slides/6R/mpegaud.pdf pp.8
- [26] http://www.chiariglione.org/mpeg/faq/mp1-sys/mp1-sys.htm
External Links
- http://www.chiariglione.org/mpeg/ Official Home Page of the Moving Picture Experts Group (MPEG) a working group of ISO/IEC