POIFS File System Internals
by Marc Johnson
1.1. Introduction
POIFS file systems are essentially normal files stored on a Java-compatible platform's native file system. They are typically identified by names ending in a four-character extension noting what type of data they contain. For example, a file ending in ".xls" would likely contain spreadsheet data, and a file ending in ".doc" would probably contain a word processing document.
POIFS file systems are called "file systems" because they contain multiple embedded files in a manner similar to traditional file systems. Along functional lines, it would be more accurate to call them POIFS archives. For the remainder of this document, they are referred to as file systems in order to avoid confusion with the "files" they contain.
POIFS file systems are compatible with the document formats used by a well-known software company's popular office productivity suite and by programs outputting compatible data. Because the POIFS file system does not provide compression, encryption or any other worthwhile feature, it's not a good choice unless you require interoperability with these programs.
The POIFS file system does not encode the documents themselves. For example, if you had a word processor file with the extension ".doc", you would actually have a POIFS file system with a document file archived inside of that file system.
Note - this document is a good overview and explanation of the file format, but for the very nitty-gritty details, you should refer to [MS-CFB].pdf in the (now public) Microsoft Documentation.
Copyright 2002-2011 The Apache Software Foundation All rights reserved.
A byte is an 8 bit signed integer ranging from -128 to 127. A short is a 16 bit signed integer ranging from -32768 to 32767. An int is a 32 bit signed integer ranging from -2147483648 to 2147483647. A long is a 64 bit signed integer ranging from roughly -9.22E18 to 9.22E18. The Java Language Specification spells out a number of other types that are not referred to by this document.
Where this document makes references to "endian conversion", it is referring to the byte order of stored numbers. Numbers in "little-endian order" are stored with the least significant byte first. To read a short, for example, you read two bytes, shift the second byte 8 bits to the left, and OR the result with the first byte, which is masked to its low 8 bits because Java bytes are signed. The following code illustrates this method:
public int getShort(byte[] rec) {
    // rec[1] is the high byte; rec[0] is the low byte, masked to 0..255
    return (rec[1] << 8) | (rec[0] & 0x00ff);
}
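The same shift-and-OR approach extends to the 4-byte integers used throughout the format. The following is a minimal sketch, not code taken from POIFS itself; the class and method names (LittleEndianSketch, getInt, getIntNio) are made up for illustration. It shows a hand-written version next to the equivalent call on the standard java.nio.ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class, named here only for illustration.
final class LittleEndianSketch {

    // Hand-written little-endian read of a 32 bit int starting at offset.
    // Each byte is masked to 0..255 before shifting so that the sign
    // extension of a negative byte cannot spill into the higher bits.
    static int getInt(byte[] rec, int offset) {
        return (rec[offset] & 0xff)
             | (rec[offset + 1] & 0xff) << 8
             | (rec[offset + 2] & 0xff) << 16
             | (rec[offset + 3] & 0xff) << 24;
    }

    // The same read using the standard library's byte-order support.
    static int getIntNio(byte[] rec, int offset) {
        return ByteBuffer.wrap(rec).order(ByteOrder.LITTLE_ENDIAN).getInt(offset);
    }
}
```

Both methods should return the same value for any four bytes; the ByteBuffer form is usually easier to read when many fields are decoded from one block.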
The property table is essentially the directory storage for the file system. Each entry consists of the name of the file or directory, its start block in both the file system and BAT, and its actual size. The first property in the property table is the root element. It has two purposes: to be a directory entry (the root of the directory tree, to be specific), and to hold the start block for the small block data. Small block data is a special file that contains the data for small files (less than 4096 bytes). It subdivides its blocks into smaller blocks, and a special small block allocation table, like the main BAT for larger files, maps a small file to its small blocks.
the BAT blocks enumerated in the header block are BAT blocks 0 through 108, the BAT blocks enumerated in the first XBAT block are BAT blocks 109 through 235, the BAT blocks enumerated in the second XBAT block are BAT blocks 236 through 362, and so on. While a normal BAT block holds 128 entries, each XBAT only references 127 BAT blocks. The last, 128th entry in an XBAT is the offset to the next XBAT block in the chain (or -1 if this is the last XBAT). Through the use of XBAT blocks, the limit on the overall document size is that imposed by the 4-byte block indices; if the indices are unsigned ints, the maximum file size is 2 terabytes, 1 terabyte if the indices are treated as signed ints. Either way, I have yet to see a disk drive large enough to accommodate such a file on the shelves at the local office supply stores.
1.4.3. SBATs
If a file contained in a POIFS archive is smaller than 4096 bytes, it is stored in small blocks. Small blocks are 64 bytes in length and are contained within big blocks, up to 8 to a big block. As the main BAT is used to navigate the array of big blocks, so the small block allocation table is used to navigate the array of small blocks. The SBAT's start block index is found at offset 0x3C of the header block, and remaining blocks constituting the SBAT are found by walking the main BAT as if it were an ordinary file in the POIFS file system (this process is described below).
1.4.4. Property Table Start Index
An integer at address 0x30 specifies the start index of the property table. This integer is specified as a "block index". The Property Table is stored, as is almost everything in a POIFS file system, in big blocks and walked via the BAT. The Property Table is described below.
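The block arithmetic in this section can be sketched as follows. This is an illustration under stated assumptions rather than the POIFS implementation: the class and method names are hypothetical, the 512-byte big block size is inferred from the figures above (8 small blocks of 64 bytes, or 128 four-byte BAT entries, per big block), and block index 0 is taken to be the first block after the header block.

```java
// Hypothetical helper class, named here only for illustration.
final class BlockMathSketch {
    static final int BIG_BLOCK_SIZE = 512;      // assumption: 8 small blocks x 64 bytes
    static final int HEADER_BAT_ENTRIES = 109;  // BAT blocks 0 through 108 live in the header
    static final int XBAT_BAT_ENTRIES = 127;    // the 128th XBAT entry links to the next XBAT

    // Returns -1 when the BAT block is listed in the header, otherwise the
    // zero-based position of the XBAT block that lists it.
    // Assumes a non-negative BAT block number.
    static int xbatIndexFor(int batBlockNumber) {
        if (batBlockNumber < HEADER_BAT_ENTRIES) {
            return -1;
        }
        return (batBlockNumber - HEADER_BAT_ENTRIES) / XBAT_BAT_ENTRIES;
    }

    // Position of the entry inside the header array or inside its XBAT block.
    static int slotFor(int batBlockNumber) {
        if (batBlockNumber < HEADER_BAT_ENTRIES) {
            return batBlockNumber;
        }
        return (batBlockNumber - HEADER_BAT_ENTRIES) % XBAT_BAT_ENTRIES;
    }

    // Byte offset of a big block in the underlying file, assuming that
    // block 0 immediately follows the header block.
    static long bigBlockFileOffset(int blockIndex) {
        return (blockIndex + 1L) * BIG_BLOCK_SIZE;
    }
}
```

For example, BAT block 235 is the last entry (slot 126) of the first XBAT block, and BAT block 236 is the first entry of the second XBAT block, matching the ranges given in the text.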
following rules:
1. The root of the tree is always black
2. Two consecutive nodes cannot both be red
3. A property is less than another property if its name length is less than the other property's name length
4. If two properties have the same name length, the sort order is determined by the sort order of the properties' names.
At offset 0x44 is the index (int) of the previous property.
At offset 0x48 is the index (int) of the next property.
At offset 0x4C is the index (int) of the first directory entry. This is used by directory entries.
At offset 0x74 is an integer giving the start block for the file described by this property. This index corresponds to an index in the array of indices that is the Block Allocation Table (or the Small Block Allocation Table) as well as the index of the first block in the file. This is used by files and the root entry.
At offset 0x78 is an integer giving the total actual size of the file pointed at by this property. If the file size is less than 4096, the file is stored in small blocks and the SBAT is used to walk the small blocks making up the file. If the file size is 4096 or larger, the file is stored in big blocks and the main BAT is used to walk the big blocks making up the file. The exception to this rule is the Root Entry, which, regardless of its size, is always stored in big blocks and the main BAT is used to walk the big blocks making up this special file.
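Rules 3 and 4 and the size threshold can be sketched in a few lines. This is a hedged illustration: the class and method names are hypothetical, and the use of String.compareTo for names of equal length is an assumption, since the text only says that such names are ordered by "the sort order of the properties' names".

```java
// Hypothetical helper class, named here only for illustration.
final class PropertyOrderSketch {
    static final int SMALL_FILE_LIMIT = 4096;

    // Rule 3: the shorter name sorts first.
    // Rule 4: names of equal length are compared as strings
    // (assumption: plain String.compareTo ordering).
    static int compareNames(String a, String b) {
        if (a.length() != b.length()) {
            return Integer.compare(a.length(), b.length());
        }
        return a.compareTo(b);
    }

    // Files under 4096 bytes live in small blocks and are walked via the SBAT.
    // The Root Entry is the exception and always uses big blocks.
    static boolean usesSmallBlocks(int size, boolean isRootEntry) {
        return !isRootEntry && size < SMALL_FILE_LIMIT;
    }
}
```

With this ordering, "Z" sorts before "AA" because it is shorter, which is the opposite of ordinary alphabetical order. The method can be used wherever a comparator is expected, for example as PropertyOrderSketch::compareNames.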
Each directory entry (i.e., a property whose type is directory or root entry) uses its CHILD_PROP field to point to one of its subordinate (child) properties. It doesn't seem to matter which of its children it points to. Thus, in the previous drawing, the Root Entry's CHILD_PROP field may contain 1, 4, or the index of one of its other children. Similarly, the directory node (index 1) may have, in its CHILD_PROP field, 2, 3, or the index of one of its other children. The children of a given directory property point to each other in a similar fashion by using their NEXT_PROP and PREVIOUS_PROP fields. Unused NEXT_PROP, PREVIOUS_PROP, and CHILD_PROP fields contain the marker value of -1. For example, all file properties have a value of -1 in their CHILD_PROP fields.
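Collecting the children of a directory can be sketched as below. This is an illustration only: the class name, the method name, and the representation of the property table as three parallel int arrays are assumptions made for the example. It starts at the directory's CHILD_PROP entry and follows every PREVIOUS_PROP and NEXT_PROP link that is not -1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
final class ChildWalkSketch {
    static final int UNUSED = -1;

    // previous[i], next[i] and child[i] hold the PREVIOUS_PROP, NEXT_PROP and
    // CHILD_PROP values of property i (assumed layout for this sketch).
    static List<Integer> childrenOf(int directory, int[] previous, int[] next, int[] child) {
        List<Integer> result = new ArrayList<>();
        Deque<Integer> pending = new ArrayDeque<>();
        if (child[directory] != UNUSED) {
            pending.push(child[directory]);
        }
        while (!pending.isEmpty()) {
            int index = pending.pop();
            if (result.size() >= child.length) {
                // More children than properties means the links form a cycle.
                throw new IllegalStateException("cyclic property links");
            }
            result.add(index);
            if (previous[index] != UNUSED) {
                pending.push(previous[index]);
            }
            if (next[index] != UNUSED) {
                pending.push(next[index]);
            }
        }
        return result;
    }
}
```

The result is in visiting order, not in sorted order, and only direct children are returned; a nested directory's own children are reached by calling the method again with that directory's index.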
The BAT blocks are pointed at by the bat array contained in the header and supplemented, if necessary, by the XBAT blocks. These blocks form a large table of integers. These integers are block numbers. The Block Allocation Table holds chains of integers. These chains are terminated with -2. The elements in these chains refer to blocks in the files. The starting block of a file is NOT specified in the BAT. It is specified by the property for a given file.
The elements in this BAT are both the block number (within the file minus the header) and the number of the next BAT element in the chain. This can be thought of as a linked list of blocks. The BAT array contains the links from one block to the next, including the end of chain marker.
Here's an example: Let's assume that the BAT begins as follows:
BAT[ 0 ] = 2
BAT[ 1 ] = 5
BAT[ 2 ] = 3
BAT[ 3 ] = 4
BAT[ 4 ] = 6
BAT[ 5 ] = -2
BAT[ 6 ] = 7
BAT[ 7 ] = -2
...
Now, if we have a file whose Property Table entry says it begins with index 0, we walk the BAT array and see that the file consists of blocks 0 (because the start block is 0), 2 (because BAT[ 0 ] is 2), 3 (BAT[ 2 ] is 3), 4 (BAT[ 3 ] is 4), 6 (BAT[ 4 ] is 6), and 7 (BAT[ 6 ] is 7). It ends at block 7 because BAT[ 7 ] is -2, which is the end of chain marker. Similarly, a file beginning at index 1 consists of blocks 1 and 5.
Other special numbers in a BAT array are:
-1, which indicates an unused block
-3, which indicates a "special" block, such as a block used to make up the Small Block Array, the Property Table, the main BAT, or the SBAT
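The chain walk in the example above can be sketched as follows. The class and method names are hypothetical and the BAT is assumed to be already loaded into a single int array; the bounds and length checks are an addition for safety, not something the format description requires.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
final class BatWalkSketch {
    static final int END_OF_CHAIN = -2;

    // Returns the block numbers making up a file, given the start block
    // from its property entry and the BAT as one flat array.
    static List<Integer> blocksOf(int startBlock, int[] bat) {
        List<Integer> blocks = new ArrayList<>();
        int current = startBlock;
        while (current != END_OF_CHAIN) {
            // Reject unused (-1) and special (-3) markers, out-of-range
            // indices, and chains longer than the table (a cycle).
            if (current < 0 || current >= bat.length || blocks.size() >= bat.length) {
                throw new IllegalStateException("corrupt chain at " + current);
            }
            blocks.add(current);
            current = bat[current];
        }
        return blocks;
    }
}
```

With the example table, blocksOf(0, bat) yields blocks 0, 2, 3, 4, 6 and 7, and blocksOf(1, bat) yields blocks 1 and 5, as described in the text.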
Field                   Offset                              Type       Default     Description
-                       0x0000                              -          -           Magic number identifying this as a POIFS file system.
UK1                     0x0008                              -          0           Unknown Constant
UK2                     0x000C                              -          0           Unknown Constant
UK3                     0x0014                              -          0           Unknown Constant
UK4                     0x0018                              -          0x003B      Unknown Constant (revision?)
UK5                     0x001A                              Short      0x0003      Unknown Constant (version?)
UK6                     0x001C                              -          -           Unknown Constant
LOG_2_BIG_BLOCK_SIZE    0x001E                              -          -           Log, base 2, of the big block size
LOG_2_SMALL_BLOCK_SIZE  0x0020                              -          -           Log, base 2, of the small block size
UK7                     0x0024                              -          -           Unknown Constant
UK8                     0x0028                              -          -           Unknown Constant
BAT_COUNT               -                                   Integer    required    -
-                       0x0030                              Integer    -           Start index of the property table
UK9                     0x0034                              Integer    0           Unknown Constant
UK10                    0x0038                              Integer    0x00001000  Unknown Constant
SBAT_START              0x003C                              Integer    -2          Block index of first big block containing the small block allocation table (SBAT)
SBAT_Block_Count        0x0040                              Integer    -           Number of big blocks holding the SBAT
XBAT_START              0x0044                              Integer    -2          Block index of the first block in the Extended Block Allocation Table (XBAT)
XBAT_COUNT              0x0048                              Integer    -           Number of elements in the Extended Block Allocation Table (to be added to the BAT)
BAT_ARRAY               0x004C, 0x0050, 0x0054 ... 0x01FC   Integer[]  -           Array of block indices constituting the Block Allocation Table (BAT)
N/A                     N/A                                 N/A        -           Header block data not otherwise described in this table
(- = not given)
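Reading the header fields listed above can be sketched as follows. This is a minimal illustration, not the POIFS reader: the class and field names are hypothetical, only fields whose offsets appear in the table or in the text (0x30 for the property table start) are read, and treating LOG_2_BIG_BLOCK_SIZE as a 16 bit value is an assumption based on the distance to the next offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical holder class, named here only for illustration.
final class HeaderFieldsSketch {
    final int log2BigBlockSize;   // 0x001E, assumed to be a short
    final int propertyTableStart; // 0x0030
    final int sbatStart;          // 0x003C
    final int sbatBlockCount;     // 0x0040
    final int xbatStart;          // 0x0044
    final int xbatCount;          // 0x0048
    final int[] batArray;         // 0x004C through 0x01FC

    private HeaderFieldsSketch(ByteBuffer b) {
        log2BigBlockSize = b.getShort(0x1E);
        propertyTableStart = b.getInt(0x30);
        sbatStart = b.getInt(0x3C);
        sbatBlockCount = b.getInt(0x40);
        xbatStart = b.getInt(0x44);
        xbatCount = b.getInt(0x48);
        // (0x200 - 0x4C) / 4 = 109 entries, i.e. BAT blocks 0 through 108.
        batArray = new int[(0x200 - 0x4C) / 4];
        for (int i = 0; i < batArray.length; i++) {
            batArray[i] = b.getInt(0x4C + 4 * i);
        }
    }

    // Parses the first 512 bytes of a file as a header block.
    static HeaderFieldsSketch parse(byte[] header) {
        if (header.length < 0x200) {
            throw new IllegalArgumentException("header block must be at least 512 bytes");
        }
        return new HeaderFieldsSketch(ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN));
    }
}
```

A caller would typically read the first 512 bytes of the file into a byte array, call HeaderFieldsSketch.parse on it, and then use batArray together with any XBAT blocks to load the full BAT.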
Field        Offset                              Type     Description
BAT_ELEMENT  0x0000, 0x0004, 0x0008, ... 0x01FC  Integer  Any given element in the BAT block
Values of BAT_ELEMENT:
-1 = unused
-2 = end of chain
-3 = special (e.g., BAT block)
All other values point to the next element in the chain and the next index of a block composing the file.
Field          Offset                      Type     Default   Description
NAME           0x00, 0x02, 0x04, ... 0x3E  Short[]  -         A unicode null-terminated uncompressed 16bit string (lose the high bytes) containing the name of the property.
NAME_SIZE      0x40                        Short    Required  Number of characters in the NAME field
PROPERTY_TYPE  0x42                        Byte     -         Property type (directory, file, or root)
NODE_COLOR     0x43                        Byte     -         Node color
PREVIOUS_PROP  0x44                        Integer  -         Previous property index
NEXT_PROP      0x48                        Integer  -1        Next property index
CHILD_PROP     0x4c                        Integer  -1        First child property index
SECONDS_1      0x64                        -        0         Seconds component of the created timestamp?
DAYS_1         0x68                        Integer  -         Days component of the created timestamp?
SECONDS_2      0x6C                        Integer  -         Seconds component of the modified timestamp?
DAYS_2         0x70                        Integer  -         Days component of the modified timestamp?
START_BLOCK    0x74                        Integer  Required  Starting block of the file, used as the first block in the file and the pointer to the next block from the BAT
SIZE           0x78                        Integer  -         Actual size of the file this property points to (used to truncate the blocks to the real size).
(- = not given)
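Decoding one property entry from these offsets can be sketched as follows. The class and field names are hypothetical, and the sketch makes two assumptions: the name is read up to its null terminator rather than by trusting NAME_SIZE, and the entry is addressed by a byte offset into a block that has already been loaded.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical holder class, named here only for illustration.
final class PropertyEntrySketch {
    final String name;
    final int type;        // PROPERTY_TYPE at 0x42
    final int previous;    // PREVIOUS_PROP at 0x44
    final int next;        // NEXT_PROP at 0x48
    final int child;       // CHILD_PROP at 0x4C
    final int startBlock;  // START_BLOCK at 0x74
    final int size;        // SIZE at 0x78

    // block holds the raw bytes; offset is where this property entry begins.
    PropertyEntrySketch(byte[] block, int offset) {
        ByteBuffer b = ByteBuffer.wrap(block).order(ByteOrder.LITTLE_ENDIAN);
        StringBuilder sb = new StringBuilder();
        // NAME occupies 0x00 through 0x3E as 16 bit characters; stop at the
        // null terminator or at the end of the field, whichever comes first.
        for (int i = 0; i < 0x40; i += 2) {
            char c = b.getChar(offset + i);
            if (c == 0) {
                break;
            }
            sb.append(c);
        }
        name = sb.toString();
        type = b.get(offset + 0x42);
        previous = b.getInt(offset + 0x44);
        next = b.getInt(offset + 0x48);
        child = b.getInt(offset + 0x4C);
        startBlock = b.getInt(offset + 0x74);
        size = b.getInt(offset + 0x78);
    }
}
```

The previous, next and child values are indices into the property table, with -1 meaning unused, while startBlock and size are what a reader needs to walk the BAT or SBAT chain for the file's contents.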