JPEG - Idea and Practice/The header part

The markers

The header part of a JPEG file is divided into segments, and each segment starts with a marker that identifies it. Usually a JPEG file contains 7 different markers. A marker is a pair of bytes: the first is 255 and the second is different from 0 and 255. We identify a marker by its second byte. Two markers stand alone (and thus do not open a segment): SOI (Start Of Image) = 216, which opens the file, and EOI (End Of Image) = 217, which closes it. (One more type of marker stands alone, but we do not use it in the sequential DCT files considered here: it marks a restart within a scan and is indexed by one of the numbers 0, 1, ..., 7: RST0, ..., RST7 (ReSTart) = 208, ..., 215.) The other markers open a segment, and in this case the following pair of bytes (b1, b2) states the length l of the segment (including these two bytes, but not the marker): l = b1 * 256 + b2. The following l - 2 bytes are the content of the segment. There are the following types of segments (identified by their markers):

APP0, APP1, ..., APP15 (APPlication) 224-239

COM (COMment) 254

SOF (Start Of Frame) 192-207, except 196, 200 and 204

DHT (Define Huffman Table) 196

DQT (Define Quantization Table) 219

SOS (Start Of Scan) 218

(and a few more, which are not used here: DNL (Define Number of Lines) = 220, DRI (Define Restart Interval) = 221, DHP (Define Hierarchical Progression) = 222, EXP (EXPand reference component(s)) = 223, DAC (Define Arithmetic Coding conditioning(s)) = 204 and TEM (for TEMporary use in arithmetic coding) = 1, besides some reserved markers: JPG (reserved for JPeG extensions) = 200, 240, 241, ..., 253 and RES (REServed) = 2, ..., 191)
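The marker and length rules above can be sketched as a small segment walker. This is only an illustration: the function name `iter_segments` is hypothetical, the input is assumed to be the raw bytes of a well-formed file, and fill bytes (extra 255s before a marker) are not handled.

```python
# Markers that stand alone (no length field): SOI, EOI, RST0..RST7 and TEM.
STANDALONE = {216, 217, 1} | set(range(208, 216))

def iter_segments(data: bytes):
    """Yield (marker, content) pairs for the header part of a JPEG file.

    Stops after EOI, or after SOS, because the bytes following the SOS
    segment are encoded picture data and not segments.
    """
    pos = 0
    while pos + 1 < len(data):
        if data[pos] != 255:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        pos += 2
        if marker in STANDALONE:
            yield marker, b""
            if marker == 217:  # EOI closes the file
                return
            continue
        # l = b1 * 256 + b2, counting the two length bytes themselves
        length = data[pos] * 256 + data[pos + 1]
        yield marker, data[pos + 2 : pos + length]
        pos += length
        if marker == 218:  # SOS: encoded data follows
            return

# A toy stream: SOI, a COM segment containing "hi", EOI.
toy = bytes([255, 216, 255, 254, 0, 4, 104, 105, 255, 217])
segments = list(iter_segments(toy))
```

Here the COM segment has length 4: two length bytes plus the two content bytes.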

The first two - APP and COM - specify things that lie outside the proper JPEG procedure. Usually only a single APP segment is present (namely APP0), specifying the implementation. An APP segment can also contain information on camera type and on when the picture was taken. COM can state the program used to make the file, the chosen quality per cent, etc.

The frame segment SOF

The point of departure of the JPEG procedure is a "picture", and a picture can be defined as a (rectangular) matrix of either numbers, pairs of numbers, triples of numbers or quadruples of numbers. That is, a picture is a matrix of arrays having one of the numbers 1-4 as length. A grey scale picture is a matrix of bytes. A colour picture is a matrix of RGB triples (of bytes) or of YCbCr triples (of signed bytes). A picture can thus be regarded as consisting of one or more (at most four) matrices of integers, and such a matrix is called a component of the picture. To each component is assigned a component identifier (byte): for instance 0 for the (one) component of a grey scale picture, and 0, 1 and 2 for the three components of a colour picture.

The dimensions of the picture, the component identifiers and the order of the components are specified in the frame segment SOF, along with how the components are to be handled in relation to each other. Because colours usually alter only slowly from place to place (and because we are not very good at distinguishing small alterations in colour), we can, for the two colour components, divide the picture into 2x2-squares of pixels and take the average values, so that each square is regarded as one pixel and the colour components have only a quarter as many samples. We can also restrict ourselves to averaging two pixels, lying either horizontally or vertically.

A pair of numbers (Hi, Vi) for each component determines how the components are to be sampled in relation to each other. Hi and Vi can go from 1 to 4, but they must be rather small: the sum of the products Hi * Vi over all components must not exceed 10. Let H and V be the maximum Hi and Vi value, respectively. These maximum values are usually linked to the Y component, and (Hi, Vi) = (H, V) means that the pixels are taken as they are: there are as many samples horizontally as the width of the picture, and there are as many horizontal lines as the height of the picture. If a (colour) component has the pair (Hi, Vi), the number of samples in a horizontal line is (Hi/H) times the width, and the number of sampling lines is (Vi/V) times the height; that is, small rectangles of (H/Hi)x(V/Vi) pixels are collected (and regarded as one pixel). Usually (Hi, Vi) = (1, 1) for the two colour components, and (Hi, Vi) = (1, 1), (2, 1), (1, 2) or (2, 2) for the Y component. (Hi, Vi) = (2, 2) for the Y component means that four colour pixels are collected and that "this" pixel is combined with four Y pixels. As the picture is divided into 8x8-squares, this means that four 8x8-squares for the Y component are combined with one 8x8-square for each colour component.

The coded data (the coded 64-arrays) for the four Y squares are written to the file in the usual scanning order: from left to right along the lines, and from top to bottom. Next come the coded data (the coded 64-arrays) for the two colour components. The procedure is analogous when only two pixels are collected (horizontally or vertically). Such a part of the data stream, arising from all the components and the collected 8x8-squares, is called a minimum coded unit (MCU).
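As a rough illustration of the sampling arithmetic, the following sketch computes, for each component, the number of samples per line, the number of lines and the number of 8x8-squares it contributes to one MCU. The function names and the dictionary layout are hypothetical, and rounding up when the picture dimensions are not exact multiples is an assumption of this sketch.

```python
def component_layout(width, height, factors):
    """factors maps a component name to its pair (Hi, Vi).

    Returns name -> (samples per line, number of lines, 8x8-squares per MCU).
    """
    total = sum(h * v for h, v in factors.values())
    if total > 10:
        raise ValueError("the sum of the products Hi * Vi must not exceed 10")
    H = max(h for h, v in factors.values())
    V = max(v for h, v in factors.values())
    layout = {}
    for name, (h, v) in factors.items():
        samples = (width * h + H - 1) // H   # (Hi/H) times width, rounded up
        lines = (height * v + V - 1) // V    # (Vi/V) times height, rounded up
        layout[name] = (samples, lines, h * v)
    return layout

def mcu_count(width, height, factors):
    """Number of MCUs: each MCU covers 8*H by 8*V pixels of the picture."""
    H = max(h for h, v in factors.values())
    V = max(v for h, v in factors.values())
    across = (width + 8 * H - 1) // (8 * H)
    down = (height + 8 * V - 1) // (8 * V)
    return across * down

# A 640 x 480 colour picture with 2x2-squares collected for the colours.
factors = {"Y": (2, 2), "Cb": (1, 1), "Cr": (1, 1)}
layout = component_layout(640, 480, factors)
count = mcu_count(640, 480, factors)
```

For this example each MCU holds four Y squares followed by one square for each colour component, six coded 64-arrays in all.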

This picture shows the drawing (pixel for pixel - and on an enlarged scale) when four Y component 8x8-squares are collected - you are to imagine four 8x8-squares in the centre: the two uppermost have been drawn, and the third is being drawn:

The drawing

The two pictures below the following picture (which takes up 3.2 Kb) are this picture with every second vertical line drawn black, but scanned in different ways: for the colour components, two pixels are collected in the vertical and the horizontal direction, respectively (that is, (Hi, Vi) = (1, 1) for the colour components, and (Hi, Vi) = (1, 2) and (2, 1) for the Y component). In the first picture (which takes up 5.9 Kb) the colours are correct, in the second picture (which takes up 4.7 Kb) the colours are faded, because they are mixed with the black of the lines:

Original

Vertical subsampling

Horizontal subsampling

The frame segment SOF consists of the following bytes. First comes the marker (255, b), where the byte b specifies the mode; we assume here that b = 192, meaning the baseline sequential DCT mode. Then comes the pair of bytes stating the length of the segment (including these two bytes); this pair is (0, 8 + 3 * the number of components). Then comes a byte stating the number of bits of the colour values, here set to 8 (meaning that the colour values are bytes); in the extended mode it can also be 12. Then comes a pair of bytes (b1, b2) stating the height (= b1 * 256 + b2) of the picture and a pair of bytes stating the width. Finally comes a byte stating the number of components (1-4), and for each component these bytes: the component identifier (byte), Hi (½ byte) and Vi (½ byte) packed into one byte (= Hi * 16 + Vi), and the quantization table destination selector (byte).

The pair (Hi, Vi) is here (1, 1) for the colour components and (1, 1), (1, 2), (2, 1) or (2, 2) for the Y component. The quantization table destination selector is one of the numbers 0-3, for instance 0 for the Y component and 1 for the colour components.
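The byte layout of the frame segment can be read back with a few lines of code. This is a minimal sketch: `parse_sof` is a hypothetical name, and its argument is assumed to be the content of the segment, that is, the bytes after the marker and the two length bytes.

```python
def parse_sof(content: bytes):
    """Decode the content of a SOF segment (marker and length already removed)."""
    precision = content[0]                    # bits per colour value, 8 here
    height = content[1] * 256 + content[2]
    width = content[3] * 256 + content[4]
    n = content[5]                            # number of components (1-4)
    # The stated length is 8 + 3 * n and includes the two length bytes.
    if len(content) + 2 != 8 + 3 * n:
        raise ValueError("segment length does not match the number of components")
    components = []
    for i in range(n):
        ident, hv, tq = content[6 + 3 * i : 9 + 3 * i]
        # hv = Hi * 16 + Vi, so the two half bytes are hv // 16 and hv % 16
        components.append({"id": ident, "H": hv // 16, "V": hv % 16, "table": tq})
    return {"precision": precision, "height": height, "width": width,
            "components": components}

# 640 x 480, Y with (2, 2) and table 0, two colour components with (1, 1) and table 1.
example = bytes([8, 1, 224, 2, 128, 3, 0, 34, 0, 1, 17, 1, 2, 17, 1])
frame = parse_sof(example)
```

In the example the byte 34 = 2 * 16 + 2 encodes (Hi, Vi) = (2, 2), and 17 = 1 * 16 + 1 encodes (1, 1).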

The Huffman table segment DHT

Usually there are two Huffman table segments in the file for a grey scale picture and four for a colour picture: for each component the DC and the AC numbers are coded differently, and the Y component and the two colour components are coded differently. In a Huffman segment the information (after the marker and the pair of bytes stating the length) is arranged in this way: the first half byte is 0 if the Huffman tables are for DC numbers and 1 if they are for the AC numbers. The next half byte is the Huffman table destination identifier (0 or 1), for instance 0 for the Y component and 1 for the colour components (to be referred to in the scan segment SOS where the Huffman tables are specified). The following sequence of 16 bytes is the list BITS, stating for i = 1, ..., 16 the number of codes of length i. And then comes the list HUFFVAL of Huffman values: for each code length different from zero, there will be just as many values as there are codes of this length. If we call the number of Huffman values nhv, the number of bytes in the segment (including the pair stating the length) is 19 + nhv.
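The layout of a Huffman segment can be illustrated by decoding its content into the lists BITS and HUFFVAL. The sketch below assumes a segment holding a single table, as described above; `parse_dht` is a hypothetical name, and the example values are the DC table for the Y component suggested in the annex of T.81.

```python
def parse_dht(content: bytes):
    """Decode the content of a DHT segment holding one table."""
    table_class = content[0] // 16        # first half byte: 0 = DC, 1 = AC
    destination = content[0] % 16         # second half byte: destination identifier
    bits = list(content[1:17])            # BITS: number of codes of length 1..16
    nhv = sum(bits)                       # total number of Huffman values
    huffval = list(content[17 : 17 + nhv])
    if len(content) != 17 + nhv:
        raise ValueError("content length does not match BITS")
    # Distribute the values over the code lengths that actually occur.
    by_length, start = {}, 0
    for i, count in enumerate(bits, start=1):
        if count:
            by_length[i] = huffval[start : start + count]
            start += count
    return table_class, destination, bits, huffval, by_length

dc_bits = [0, 1, 5, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
dc_content = bytes([0 * 16 + 0] + dc_bits + list(range(12)))
table_class, destination, bits, huffval, by_length = parse_dht(dc_content)
segment_length = 2 + len(dc_content)      # 19 + nhv
```

With 12 Huffman values the length stated in the segment is 19 + 12 = 31.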

The Quantization table segment DQT

A quantization table is an 8x8 matrix of bytes, stored in the file in zigzag order. There are usually different quantization tables for the Y component and for the colour components. In the annex "Examples and guidelines" of T.81 you can find the following two tables, the first for the Y component and the second for the colour components:

16 11 10 16 24 40 51 61

12 12 14 19 26 58 60 55

14 13 16 24 40 57 69 56

14 17 22 29 51 87 80 62

18 22 37 56 68 109 103 77

24 35 55 64 81 104 113 92

49 64 78 87 103 121 120 101

72 92 95 98 112 100 103 99

17 18 24 47 99 99 99 99

18 21 26 66 99 99 99 99

24 26 56 99 99 99 99 99

47 66 99 99 99 99 99 99

99 99 99 99 99 99 99 99

99 99 99 99 99 99 99 99

99 99 99 99 99 99 99 99

99 99 99 99 99 99 99 99

It is mentioned there that "If these quantization values are divided by 2, the resulting reconstructed image is usually nearly indistinguishable from the source image". With our program "JPEG_File" you can see the tables for a picture that uses the sequential DCT procedure and has been given the name "pict". In our program to produce a (true) JPEG file we have chosen a different table for the Y component, namely the following, which is used by an image editing program (IrfanView) when the quality is set to 70 per cent:

10 7 6 10 14 24 31 37

7 7 8 11 16 35 36 33

8 8 10 14 24 34 41 34

8 10 13 17 31 52 48 37

11 13 22 34 41 65 62 46

14 21 33 38 49 62 68 55

29 38 47 52 62 73 72 61

43 55 57 59 67 60 62 59

A quantization table is specified in a DQT segment. A DQT segment begins with the marker DQT = 219 and the length, which is (0, 67). Then comes a byte whose first half is here 0, meaning that the table consists of bytes (8 bit numbers); in the extended mode it may be 1, meaning that the table consists of words (16 bit numbers). The last half of this byte is the destination identifier of the table (0-3), for instance 0 for the Y component and 1 for the colour components. Next follow the 64 numbers of the table (bytes).
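Writing a DQT segment can be sketched as follows. The names `zigzag_order` and `build_dqt` are hypothetical, and the sketch assumes that the 8x8 table is given row by row as printed above and has to be rearranged into zigzag order before it is written. The table used is the T.81 table for the Y component.

```python
def zigzag_order(n=8):
    """Return the (row, column) positions of an n x n matrix in zigzag order."""
    order = []
    for s in range(2 * n - 1):                 # s = row + column of a diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:                         # even diagonals run upwards
            rows = reversed(rows)
        order.extend((r, s - r) for r in rows)
    return order

def build_dqt(table, destination):
    """Build a DQT segment for a table of bytes (8 bit numbers)."""
    body = bytes([0 * 16 + destination])       # half bytes: precision 0, identifier
    body += bytes(table[r][c] for r, c in zigzag_order())
    length = 2 + len(body)                     # 2 + 1 + 64 = 67
    return bytes([255, 219, length // 256, length % 256]) + body

LUMA = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]
dqt = build_dqt(LUMA, 0)
```

The first table values in the segment are then 16, 11, 12, 14, 12, 10, 16, 14, which is the top left corner of the matrix read along the zigzag path.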

The scan segment SOS

Just after the scan segment SOS come the encoded data of the picture, and the scan segment specifies the Huffman tables to be used for the components. The segment begins with the marker SOS = 218 and the length, which is (0, 6 + 2 * the number of components). Then comes a byte stating the number of components (1-4), and then for each component two bytes, the first is the component identifier (defined in the frame segment) and the second is divided up in two parts, the first stating the destination selector of the DC Huffman table and the second the destination selector of the AC Huffman table (for instance 0 for the Y component and 1 for the colour components). The segment closes with three bytes which in our case (sequential DCT) are 0, 63 and 0 (the last divided in two half bytes).
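The scan segment is short enough to be assembled directly from this description. The following is a minimal sketch for the sequential DCT case; `build_sos` is a hypothetical name, and each component is assumed to be given as a triple of component identifier, DC table selector and AC table selector.

```python
def build_sos(components):
    """Build an SOS segment from (identifier, DC selector, AC selector) triples."""
    n = len(components)
    length = 6 + 2 * n                         # includes the two length bytes
    segment = bytearray([255, 218, length // 256, length % 256, n])
    for ident, dc, ac in components:
        # second byte: first half DC table selector, second half AC table selector
        segment += bytes([ident, dc * 16 + ac])
    segment += bytes([0, 63, 0])               # closing bytes for sequential DCT
    return bytes(segment)

# Y uses tables 0, the two colour components use tables 1.
sos = build_sos([(0, 0, 0), (1, 1, 1), (2, 1, 1)])
```

For three components the stated length is 6 + 2 * 3 = 12, so the whole segment with its marker takes up 14 bytes, and the encoded data of the picture start right after it.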