
Ext3/4 on-disk layout

Kalpak Shah, Lustre Group

GEEP - geeksofpune.org

AGENDA

Layout of EXT3/4
Essential on-disk data structures
New features in ext4
    Extents, uninit_bg, nanosecond timestamps, 48-bit support, preallocation, mballoc, flex_bg, journal checksums
    Their effects on the on-disk layout
Crash recovery
Latest filesystem design layouts

Basic layout of EXT2/3/4 partition

All block groups are the same size and are stored sequentially. The superblock and group descriptors are duplicated in multiple block groups, as governed by the SPARSE_SUPER feature. Block sizes from 1KB up to 8KB are supported, subject to the block size not exceeding the system page size (4KB on x86).

Creating an ext3 fs
mkfs.ext3 -b 4096 -I 512 -i 8192 -J size=256 /dev/sda1

Blocksize consideration (-b)
Number of inodes and inode sizes (-i bytes per inode, -I inode size)
Journal size (-J size, in MB)

For example, consider an 8GB ext3 fs with a 4KB blocksize. Each 4KB block bitmap holds 32K bits and therefore describes 32K data blocks, that is, 128MB. The fs will therefore contain 8GB / 128MB = 64 block groups.
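The block group arithmetic above can be sketched in C. This is only an illustration: the function names are hypothetical, and the calculation assumes that one bitmap block covers exactly one group and ignores any reserved leading block.

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block holds block_size * 8 bits, one bit per block in the group. */
static uint64_t blocks_per_group(uint32_t block_size)
{
    assert(block_size != 0);
    return (uint64_t)block_size * 8;
}

/* Number of block groups needed to cover fs_bytes, rounding up. */
static uint64_t group_count(uint64_t fs_bytes, uint32_t block_size)
{
    uint64_t total_blocks = fs_bytes / block_size;
    uint64_t bpg = blocks_per_group(block_size);
    return (total_blocks + bpg - 1) / bpg;
}
```

Calling group_count with 8GB (8ULL << 30) and a 4096-byte block size reproduces the 64 groups from the example.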

EXT3 superblock
The ext3 superblock is stored in an ext3_super_block structure. Some important fields are listed here:
s_inodes_count, s_blocks_count, s_free_blocks_count, s_free_inodes_count, s_inode_size
s_blocks_per_group, s_inodes_per_group
s_mnt_count, s_max_mnt_count
s_feature_{compat,incompat,ro_compat}
s_uuid, s_volume_name
s_journal_inum, s_journal_dev
s_state, s_errors
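As a rough sketch of how a few of these fields can be read, the following C code parses a superblock from an in-memory image of the start of a partition. It assumes the usual layout (superblock at byte offset 1024, little-endian fields, magic 0xEF53 at offset 56 within the superblock). The sb_summary struct and the function names are hypothetical helpers, not kernel or e2fsprogs APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SB_OFFSET 1024u   /* superblock starts 1024 bytes into the partition */
#define SB_SIZE   1024u
#define EXT_MAGIC 0xEF53u /* value of s_magic for ext2/3/4 */

/* Hypothetical summary of a few superblock fields, in host byte order. */
struct sb_summary {
    uint32_t inodes_count;
    uint32_t blocks_count;
    uint32_t block_size;
    uint32_t blocks_per_group;
    uint32_t inodes_per_group;
};

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the image is too short or does not look like ext2/3/4. */
static int parse_superblock(const uint8_t *img, size_t len, struct sb_summary *out)
{
    assert(img != NULL && out != NULL);
    if (len < SB_OFFSET + SB_SIZE)
        return -1;
    const uint8_t *sb = img + SB_OFFSET;
    if (le16(sb + 56) != EXT_MAGIC)        /* s_magic */
        return -1;
    uint32_t log_block_size = le32(sb + 24); /* s_log_block_size */
    if (log_block_size > 6)                /* reject anything above 64KB */
        return -1;
    out->inodes_count     = le32(sb + 0);  /* s_inodes_count */
    out->blocks_count     = le32(sb + 4);  /* s_blocks_count (low 32 bits) */
    out->block_size       = 1024u << log_block_size;
    out->blocks_per_group = le32(sb + 32); /* s_blocks_per_group */
    out->inodes_per_group = le32(sb + 40); /* s_inodes_per_group */
    return 0;
}
```

The image passed in would typically be the first 2KB read from the block device.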

Group Descriptors
Each block group has its own group descriptor, represented by the ext3_group_desc structure, which has these fields:
bg_block_bitmap
bg_inode_bitmap
bg_inode_table
bg_free_{blocks,inodes}_count
bg_used_dirs_count
Most fields are used by the inode and block allocators.

EXT3/4 inode
The on-disk ext3/4 inode structure has these fields:
i_mode, i_uid, i_gid
i_size, i_blocks
i_atime, i_mtime, i_dtime, i_ctime
i_links_count
i_block[EXT2_N_BLOCKS] (15 entries)
i_version (for NFS)
i_file_acl
i_dir_acl (i_size_high)

New in ext4:
i_extra_isize
i_size_high, l_i_file_acl_high
i_{ctime,mtime,atime,crtime}_extra
i_version_hi

Directory layout
EXT3/4 implements directories as a special kind of file whose data blocks store filenames along with their corresponding inode numbers. These data blocks contain structures of type ext3_dir_entry_2, which has the following fields:
Inode number
Directory entry length
Name length
Filetype
Name
Directory entries are indexed using 2-level hashing (htree) for fast retrieval.
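To make the entry format concrete, here is a minimal C sketch of a linear scan over one directory block. It assumes the classic entry header of 8 bytes (4-byte inode number, 2-byte entry length, 1-byte name length, 1-byte file type, all little-endian) followed by the name. The function names are hypothetical, and the hashed index is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRENT_HDR 8u /* inode (4) + rec_len (2) + name_len (1) + file_type (1) */

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Linear search of one directory block; returns the inode number, or 0 if not found. */
static uint32_t dir_block_lookup(const uint8_t *blk, size_t blk_size, const char *name)
{
    assert(blk != NULL && name != NULL);
    size_t want = strlen(name);
    size_t off = 0;

    while (off + DIRENT_HDR <= blk_size) {
        const uint8_t *de = blk + off;
        uint32_t ino = rd32(de);
        uint16_t rec_len = rd16(de + 4);
        uint8_t name_len = de[6];

        /* Stop on an entry length that cannot be valid (corrupt block). */
        if (rec_len < DIRENT_HDR || rec_len % 4 != 0 || off + rec_len > blk_size)
            break;
        /* Inode 0 marks an unused entry. */
        if (ino != 0 && name_len == want &&
            DIRENT_HDR + name_len <= rec_len &&
            memcmp(de + DIRENT_HDR, name, want) == 0)
            return ino;
        off += rec_len; /* the entry length doubles as the link to the next entry */
    }
    return 0;
}
```

The entry length is what lets deleted entries be absorbed into their predecessor: the scan simply steps over the freed space.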

EXT4 features - EXTENTS


Replaces the traditional indirect block mapping scheme, which causes high metadata overhead and poor performance with large files. An extent is a single descriptor that represents a range of contiguous blocks:

    struct ext4_extent {
        __le32 ee_block;    /* first logical block */
        __le16 ee_len;      /* number of blocks */
        __le16 ee_start_hi; /* high 16 bits of physical block */
        __le32 ee_start_lo; /* low 32 bits of physical block */
    };

The extent tree leads to efficient lookups and improves performance on sequential IO as well as mail server workloads. Ext4 supports both extents and indirect mapping schemes, and files can be converted between the two formats.
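The following C sketch shows how a single extent can translate a logical block number into a 48-bit physical block number. It works on a host-endian copy of the descriptor (the struct extent name and both helper functions are hypothetical), and it assumes an initialized extent, so ee_len is taken as a plain block count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian copy of the on-disk extent descriptor. */
struct extent {
    uint32_t ee_block;    /* first logical block */
    uint16_t ee_len;      /* number of blocks */
    uint16_t ee_start_hi; /* high 16 bits of physical block */
    uint32_t ee_start_lo; /* low 32 bits of physical block */
};

/* Combine the two halves into the 48-bit starting physical block. */
static uint64_t extent_pblock(const struct extent *ex)
{
    assert(ex != NULL);
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}

/* Physical block backing logical block lblk, or 0 if the extent does not cover it. */
static uint64_t extent_map(const struct extent *ex, uint32_t lblk)
{
    assert(ex != NULL);
    if (lblk < ex->ee_block || lblk - ex->ee_block >= ex->ee_len)
        return 0;
    return extent_pblock(ex) + (lblk - ex->ee_block);
}
```

One descriptor of 12 bytes covers the whole contiguous range, where indirect mapping would need one 4-byte pointer per block.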

EXTENTS


EXT4 features
UNINIT_BG
For very large filesystems, e2fsck times are starting to become unacceptable. The uninitialized block groups feature uses flags in the group descriptor to indicate whether the block group is initialized or not. E2fsck can simply ignore block groups that are marked as uninitialized. The flags marking the block group uninitialized and the high watermark of used inodes are checksummed so that corruption can be detected. We have seen a 2-10x speedup for e2fsck in many cases.
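A minimal sketch of how a checker could use these flags to decide how much of a group's inode table to scan is given below. The flag values follow the ext4 EXT4_BG_INODE_UNINIT and EXT4_BG_BLOCK_UNINIT definitions, and the watermark is modelled on the bg_itable_unused descriptor field. The struct and function names are hypothetical, and verification of the descriptor checksum is assumed to have happened already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BG_INODE_UNINIT 0x0001u /* inode table and bitmap are not initialized */
#define BG_BLOCK_UNINIT 0x0002u /* block bitmap is not initialized */

/* Hypothetical view of the descriptor fields that matter here. */
struct group_desc_view {
    uint16_t bg_flags;
    uint16_t bg_itable_unused; /* count of never-used inodes at the end of the table */
};

/* Number of inode table entries a checker would need to read for this group. */
static uint32_t inodes_to_scan(const struct group_desc_view *gd, uint32_t inodes_per_group)
{
    assert(gd != NULL);
    if (gd->bg_flags & BG_INODE_UNINIT)
        return 0;                    /* whole inode table can be skipped */
    if (gd->bg_itable_unused > inodes_per_group)
        return inodes_per_group;     /* implausible watermark: scan everything */
    return inodes_per_group - gd->bg_itable_unused;
}
```

On a mostly empty filesystem nearly every group returns 0 or a small number, which is where the reported speedup would come from.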

Nanosecond timestamp support


Using the i_{atime, ctime, mtime, crtime}_extra fields.
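As an illustration of how an extra field can be combined with the classic 32-bit seconds field, the C sketch below assumes the encoding used by ext4: the low 2 bits of the extra field extend the seconds value and the upper 30 bits hold nanoseconds. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EPOCH_BITS 2
#define EPOCH_MASK 0x3u /* low 2 bits of the extra field extend the seconds */

/* Hypothetical decoded timestamp. */
struct decoded_time {
    int64_t  sec;
    uint32_t nsec;
};

/* Combine the signed 32-bit seconds field with its *_extra companion. */
static void decode_timestamp(int32_t base_sec, uint32_t extra, struct decoded_time *out)
{
    assert(out != NULL);
    out->sec  = (int64_t)base_sec + ((int64_t)(extra & EPOCH_MASK) << 32);
    out->nsec = extra >> EPOCH_BITS; /* upper 30 bits hold the nanoseconds */
}
```

Thirty bits are enough for nanoseconds because 2^30 is larger than 10^9.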

EXT4 features
Large FS support
Ext3 uses 32-bit block numbers, so with a 4KB blocksize the filesystem is limited to a maximum size of 16TB. Ext4 uses 48-bit block numbers. All on-disk structures that store block numbers had to be changed to support the 48-bit block number.
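The size limits follow directly from the width of the block number. The small C helper below (a hypothetical function, working in units of TiB to avoid overflow) shows the arithmetic for a block size given as a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum filesystem size in TiB (2^40 bytes) for a given block-number width. */
static uint64_t max_fs_tib(unsigned addr_bits, unsigned log2_block_size)
{
    unsigned total_bits = addr_bits + log2_block_size;
    assert(total_bits >= 40 && total_bits < 104); /* result must fit in 64 bits */
    return (uint64_t)1 << (total_bits - 40);
}
```

With 4KB blocks (log2 = 12), 32-bit block numbers give 2^44 bytes, that is 16TB, while 48-bit block numbers give 2^60 bytes, that is 1EB.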

Persistent preallocation (fallocate support)


Applications such as large databases often write zeros to a file to reserve guaranteed, contiguous space. Ext4 improves this scenario by skipping the zero-out and marking the extents as uninitialized instead.
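From the application side, the reservation can be requested without writing zeros. The sketch below uses the POSIX posix_fallocate() call; the wrapper name preallocate_file is hypothetical. On filesystems without native support the C library may fall back to writing the blocks itself, so the speed benefit described above is an assumption about the underlying filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve len bytes for path. Returns 0 on success, non-zero on failure. */
static int preallocate_file(const char *path, off_t len)
{
    assert(path != NULL && len > 0);
    int fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return -1;

    /* posix_fallocate returns an error number directly instead of setting errno. */
    int err = posix_fallocate(fd, 0, len);

    /* On success the file size must cover the reserved range. */
    struct stat st;
    if (err == 0 && (fstat(fd, &st) != 0 || st.st_size < len))
        err = -1;

    close(fd);
    return err;
}
```

A database could call this once when creating a data file and then write into the reserved range without risking ENOSPC later.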


EXT4 features
Multi-block allocator (mballoc)
Allocates multiple blocks at once using a buddy data structure.
Includes inode and group preallocation.
Includes special allocation modes for small files and GOAL blocks.

flex_bg
This feature groups the metadata (inode bitmap, block bitmap and inode table) from a series of block groups at the beginning of a flex group in order to improve performance during heavy metadata operations.


Crash recovery - JBD/2


First, a copy of the blocks to be written is stored in the journal. Then, once the I/O transfer to the journal has completed (the commit block is written), the blocks are written to their final locations in the filesystem. After a crash, committed transactions are replayed from the journal.

Journaling modes:
Journal: All data and metadata is journaled.
Ordered: Only metadata changes are journaled. Data blocks are written to disk before the metadata to avoid data corruption.
Writeback: Only metadata is journaled, with no ordering of data writes. Fastest mode.

Journal checksums
All blocks in a transaction are checksummed and the checksum is stored in the commit block header. While replaying the transaction (either by e2fsck or by ext4), this checksum ensures that corrupt or partial transactions are not written to disk.
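The idea can be sketched in C as follows. This is an illustration only: it uses the common reflected CRC-32 (polynomial 0xEDB88320), which is not claimed to be the exact variant or seed used by JBD2, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320); chainable across calls. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Checksum over all blocks of a transaction, laid out back to back in memory. */
static uint32_t transaction_checksum(const uint8_t *blocks, size_t nblocks, size_t block_size)
{
    assert(blocks != NULL && block_size > 0);
    uint32_t crc = 0;
    for (size_t i = 0; i < nblocks; i++)
        crc = crc32_update(crc, blocks + i * block_size, block_size);
    return crc;
}

/* Replay only if the journaled blocks still match the checksum from the commit block. */
static int transaction_is_replayable(const uint8_t *blocks, size_t nblocks,
                                     size_t block_size, uint32_t commit_checksum)
{
    return transaction_checksum(blocks, nblocks, block_size) == commit_checksum;
}
```

A transaction whose blocks were only partly written before the crash fails the comparison and is skipped rather than replayed.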

Latest filesystem design layouts


Trees
Recent filesystems such as ZFS, Btrfs and Tux3 use indexed trees for efficient directory layouts, block maps, objects (inodes, EAs) and snapshots. With 64-bit or 128-bit pointers, the practical limits on the number of inodes, EA sizes and the number of files within a directory effectively disappear.

Checksumming
All data/metadata is checksummed for early detection and possible correction.

In-built volume manager
The volume manager and filesystem are tightly coupled to take advantage of mirroring and RAID-like functionality.

In-built encryption, compression



QUESTIONS?
