Understanding Inodes

Understanding inodes is crucial to understanding Unix filesystems. A file consists of data and metadata; metadata is information about the file, and it is stored in the file's inode. An inode typically contains:

  1. Inode number
  2. UID (owner)
  3. GID (group)
  4. Size
  5. atime (last access time)
  6. mtime (last modification time)
  7. ctime (last inode change time)
  8. Block size
  9. Mode (file type and permissions)
  10. Number of hard links
  11. ACLs

Inodes are usually 256 bytes in size. Filenames are not stored in inodes; instead they are stored in the data portion of a directory. Traditionally, directory entries are kept as a simple linear list, which is why looking up a filename in a large directory can take a long time. Ext4 and XFS index directory entries with tree structures (hashed B-trees in ext4, B+trees in XFS), which makes lookups in large directories far faster than a linear scan.

A dentry (short for directory entry) keeps track of the filename-to-inode mapping within a directory.
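
For a quick look at that mapping, ls -i prints the inode number behind a name (the paths below are just examples):

# print the inode number that the directory entry for /etc/hosts points to
$ ls -i /etc/hosts
# -d lists the directory itself rather than its contents
$ ls -id /etc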

An inode can contain direct or indirect pointers to the blocks of data for a given file. A direct block pointer means the inode holds the block number of a block containing actual file data. An indirect block pointer means the inode holds the block number of a block that itself contains further block numbers to read data from.
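
If you want to see where those pointers lead for a real file, filefrag -v (used again further down in this post) prints the logical-to-physical block/extent mapping; the path here is just a placeholder:

# show which physical blocks/extents back each logical offset of the file
$ filefrag -v /path/to/some/file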

The ext family of filesystems creates a fixed number of inodes when the filesystem is formatted. If you run out of inodes you have to recreate the filesystem to get more. XFS does not have a fixed number of inodes; they are allocated on demand.
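
On a running system, df -i shows per-filesystem inode usage, which is a quick way to spot inode exhaustion (the mount point is just an example):

# show inodes used/free (IUsed/IFree) instead of block usage
$ df -i /home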

When you delete a file, the unlink() system call removes the directory entry and decrements the inode's link count; once the count reaches zero (and no process still has the file open), the inode is marked available for reuse. The data blocks themselves are not erased, only marked free.
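
A small sketch of this behaviour (the file name and commands are illustrative): as long as some process still holds the file open, the inode and its blocks stay allocated even though the name is gone.

# keep a descriptor open on the file, then unlink it
$ sleep 300 < /tmp/bigfile &
$ rm /tmp/bigfile
# lsof +L1 lists open files whose link count is below 1; the space is
# only freed once the last descriptor is closed
$ lsof +L1 | grep bigfile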

The number of links to a file is maintained in its inode. Each time a hard link is created the link count increases. Soft (symbolic) links do not increase the link count of the file or directory they point to.
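
A quick demonstration using stat's %h format, which prints the link count (the file names are arbitrary):

$ touch file
$ stat -c %h file      # 1 link
$ ln file hardlink     # hard link: link count becomes 2
$ stat -c %h file
$ ln -s file symlink   # symlink: link count of 'file' stays at 2
$ stat -c %h file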

The superblock contains metadata about a filesystem. A filesystem typically stores several copies of the superblock in case one of them gets damaged. Some of the information in a superblock is:

– Filesystem size
– Block size
– Empty and filled blocks
– Size and location of inode table
– Disk block map

You can locate the superblock copies on an ext filesystem with 'dumpe2fs /dev/<device> | grep -i superblock'.

Sparse Files

Sparse files are files whose metadata reports one (apparent) size, while the file actually occupies fewer blocks on the filesystem.
Sparse files are a common way to use disk space efficiently. They can be created using the 'truncate' command.
You can also create them programmatically by opening a file, seeking past the end, and writing at the new offset (or calling ftruncate()); the skipped range becomes a hole that takes up no blocks.
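
For instance, dd can do the seek-and-write trick from the shell (the file name and size here are arbitrary):

# seek 1G into a new file and write a single byte; the skipped
# range becomes a hole, so almost no blocks are allocated
$ dd if=/dev/zero of=sparse.img bs=1 count=1 seek=1G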

ls -l reports the apparent length of the file, so a 2-byte file is reported as 2 bytes.
ls -s reports the size based on allocated blocks, so for a 2-byte file on a filesystem with 4K blocks, ls -s reports one 4K block.

du also reports size based on the blocks in use: for a 2-byte file with a 4096-byte block size, du reports 4K.
du -b reports the same size as ls -l, since -b means apparent size.

Neither ls -l nor du -b takes sparseness into account: if a file is sparse, du -b and ls -l report it as though it were not sparse.
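
Putting that together, here is what the commands above report for a hypothetical 2-byte file on a filesystem with 4K blocks:

$ printf 'hi' > tiny
$ ls -l tiny     # apparent size: 2 bytes
$ ls -s tiny     # allocated size: 4 (1K units, i.e. one 4K block)
$ du tiny        # 4 (1K units)
$ du -b tiny     # 2, the apparent size, same as ls -l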

When using the 'cp' command, use the 'cp --sparse=always' option to keep sparse files sparse at the destination.

'scp' is not sparse-aware, so if you use scp to copy a sparse file it will take up "more" room on the destination host. If you instead use rsync with the -S option, sparse files are kept sparse.
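
For example (hosts and paths are placeholders):

# -S/--sparse tells rsync to recreate holes on the destination
$ rsync -avS /source/ user@destinationhost:/destination-dir/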

tar is not sparse-aware by default. If you tar a sparse file, both the tar archive and the file you get back when you untar it have the holes filled with zeros, resulting in more disk blocks being used. Use the '-S' option to make tar sparse-aware.

# create a sparse file of size 1GB
$ truncate -s +1G test

# The first number shows 0, which is the block-based size
$ ls -lsh test
0 -rw-rw-r-- 1 orion orion 1.0G Jan  7 14:07 test

# create tar file
$ tar -cvf test.tar test

# test.tar now really takes up 1GB
$ ls -ls test.tar
1.1G -rw-rw-r-- 1 orion orion 1.1G Jan  7 14:08 test.tar

# untarring shows the file now uses 1.0G of blocks, whereas before it used 0
$ rm test
$ tar xvf test.tar
$ ls -lsh test
1.0G -rw-rw-r-- 1 orion orion 1.0G Jan  7 14:07 test

With the -S option, tar is sparse-aware and the extracted file remains sparse.

# create a sparse file of size 1GB
$ truncate -s +1G test

# The first number again shows 0, the block-based size
$ ls -lsh test
0 -rw-rw-r-- 1 orion orion 1.0G Jan  7 14:07 test

# create tar file with -S
$ tar -S -cvf test.tar test

# test.tar's block-based allocated size is now only 12 (in 1K units)
$ ls -ls test.tar
12 -rw-rw-r-- 1 orion orion      10240 Jan  7 14:19 test.tar

# untarring now shows that the file is still sparse
$ rm test
$ tar xvf test.tar
$ ls -ls test
0 -rw-rw-r-- 1 orion orion 1073741824 Jan  7 14:19 test

When we run 'stat' on a sparse file, we see that it takes up no space in terms of blocks.

$ stat test
  File: `test'
  Size: 1073741824	Blocks: 0          IO Block: 4096   regular file
Device: fd07h/64775d	Inode: 1046835     Links: 1
Access: (0664/-rw-rw-r--)  Uid: (  500/   orion)   Gid: (  500/   orion)
Access: 2015-01-07 14:19:53.957911258 -0800
Modify: 2015-01-07 14:17:58.000000000 -0800
Change: 2015-01-07 14:19:53.957623281 -0800

We can also measure the extents used.

$ filefrag test
test: 0 extents found

Copying files in Linux

Copying files should be simple, yet there are quite a few ways to transfer them.
Here are the ones I could think of.

# -a=archive mode; equals -rlptgoD
# -r=recurse into directories
# -l=copy symlinks as symlinks
# -p=preserve permissions
# -t=preserve modification times
# -g=preserve group
# -o=preserve owner (super-user only)
# -D=preserve device files (super-user only) and special files
# -v=verbose
# -P=keep partially transferred files and show progress
# -H=preserve hardlinks
# -A=preserve ACLs
# -X=preserve selinux and other extended attributes
$ rsync -avPHAX /source /destination

# cross systems using ssh
# -z=compress
# -e=specify remote shell to use
$ rsync -azv -e ssh /source user@destinationhost:/destination-dir

# -xdev=don't descend into directories on other filesystems
# -print=print the filenames found
# -p=Run in copy-pass mode
# -d=make directories
# -m=preserve-modification-time
# -v=verbose
$ find /source -xdev -print | cpio -pdmv /destination

# let's not forget good old cp
# -r=recursive
# -p=preserve mode,ownership,timestamps
# -v=verbose
$ cp -rpv --sparse=always /source /destination

# tar
# -c=create a new archive
# -v=verbose
# -f=use archive file
$ tar cvf - /source | (cd /destination && tar xvf -)

# scp
$ scp -r /source user@destinationhost:/destination-dir

# copy an entire partition
$ dd if=/dev/source-partition of=/dev/destination-partition bs=<block-size>

Common Mount Options

async -> Allows asynchronous input/output operations on the file system.
auto -> Allows the file system to be mounted automatically with the mount -a command.
defaults -> Provides an alias for async,auto,dev,exec,nouser,rw,suid.
exec -> Allows the execution of binary files on the particular file system.
loop -> Mounts an image file as a loop device.
noauto -> Prevents the file system from being mounted automatically with the mount -a command.
noexec -> Disallows the execution of binary files on the particular file system.
nouser -> Disallows an ordinary user (that is, other than root) from mounting and unmounting the file system.
remount -> Remounts an already-mounted file system, for example to change its options.
ro -> Mounts the file system read-only.
rw -> Mounts the file system for both reading and writing.
user -> Allows an ordinary user (that is, other than root) to mount and unmount the file system.
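
A few examples of these options in use (devices, image file and mount points are placeholders):

# mount an image file read-only through a loop device
$ mount -o loop,ro disk.img /mnt/image
# change options on an already-mounted filesystem
$ mount -o remount,rw /
# a typical /etc/fstab line combining several of the options above
# /dev/sdb1  /data  ext4  defaults,noexec  0 2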

Understanding proc filesystem

Proc is a pseudo filesystem that is generally mounted at /proc. It provides an interface into kernel data structures. Proc contains a directory for each process ID running on the system, and inside each of these directories you can find additional information about the process. For instance, if you need to know all the file descriptors a process has open, you can ls /proc/<pid>/fd. Let's take rsyslog as an example:

$ pgrep rsyslog
1405
$ ls -l /proc/1405/fd
total 0
lrwx------ 1 root root 64 May  6 00:20 0 -> socket:[11792]
l-wx------ 1 root root 64 May  6 00:20 1 -> /var/log/messages
l-wx------ 1 root root 64 May  6 00:20 2 -> /var/log/cron
lr-x------ 1 root root 64 May  6 00:20 3 -> /proc/kmsg
l-wx------ 1 root root 64 May  6 00:20 4 -> /var/log/secure

Rsyslog is running with process ID 1405; when I run 'ls -l /proc/1405/fd' I can see that rsyslog has /var/log/messages open, which makes sense since rsyslog writes messages to /var/log/messages.

If you want to know which environment variables a process is running with, you can run (cat /proc/<pid>/environ; echo) | tr '\000' '\n'. Continuing the example, let's say I want to see which environment variables rsyslog started with:

$  (cat /proc/1405/environ; echo) | tr '\000' '\n'
TERM=linux
PATH=/sbin:/usr/sbin:/bin:/usr/bin
runlevel=3
RUNLEVEL=3
LANGSH_SOURCED=1
PWD=/
LANG=en_US.UTF-8
previous=N
PREVLEVEL=N
CONSOLETYPE=serial
SHLVL=3
UPSTART_INSTANCE=
UPSTART_EVENTS=runlevel
UPSTART_JOB=rc
_=/sbin/rsyslogd

If you want to know the path of the binary that was executed, look at the exe symlink:

$ ls -l /proc/1405/exe
lrwxrwxrwx 1 root root 0 May  5 22:36 /proc/1405/exe -> /sbin/rsyslogd

Each process has certain limits, generally defined in /etc/security/limits.conf. These can be viewed in the /proc/<pid>/limits file. For the rsyslog example, here is the output:

[root@hv1 proc]# cat 1405/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            10485760             unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             127212               127212               processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       127212               127212               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
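
If your distribution ships util-linux's prlimit, the same limits can be read (or changed) without digging through /proc; 1405 is the rsyslog PID from the example above:

# show the soft and hard open-file limits for pid 1405
$ prlimit --pid 1405 --nofile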

XFS vs Ext4 performance

I wanted to test XFS vs Ext4 performance, so I created two logical volumes: /dev/mapper/vg_hv1-lv_vm1, formatted as XFS, and /dev/mapper/vg_hv1-lv_vm2, formatted as ext4. Both are backed by the same RAID-1 array.

[hv ~]$ sudo hdparm -Tt /dev/mapper/vg_hv1-lv_vm1

/dev/mapper/vg_hv1-lv_vm1:
Timing cached reads: 24786 MB in 2.00 seconds = 12413.11 MB/sec
Timing buffered disk reads: 370 MB in 3.01 seconds = 123.01 MB/sec
[hv ~]$ sudo hdparm -Tt /dev/mapper/vg_hv1-lv_vm1

/dev/mapper/vg_hv1-lv_vm1:
Timing cached reads: 24602 MB in 2.00 seconds = 12320.66 MB/sec
Timing buffered disk reads: 366 MB in 3.00 seconds = 121.80 MB/sec
[hv ~]$ sudo hdparm -Tt /dev/mapper/vg_hv1-lv_vm1

/dev/mapper/vg_hv1-lv_vm1:
Timing cached reads: 24300 MB in 2.00 seconds = 12169.27 MB/sec
Timing buffered disk reads: 374 MB in 3.01 seconds = 124.37 MB/sec
[hv ~]$ sudo hdparm -Tt /dev/mapper/vg_hv1-lv_vm2

/dev/mapper/vg_hv1-lv_vm2:
Timing cached reads: 24566 MB in 2.00 seconds = 12302.76 MB/sec
Timing buffered disk reads: 392 MB in 3.01 seconds = 130.37 MB/sec
[hv ~]$ sudo hdparm -Tt /dev/mapper/vg_hv1-lv_vm2

/dev/mapper/vg_hv1-lv_vm2:
Timing cached reads: 24576 MB in 2.00 seconds = 12307.80 MB/sec
Timing buffered disk reads: 366 MB in 3.01 seconds = 121.42 MB/sec
[hv ~]$ sudo hdparm -Tt /dev/mapper/vg_hv1-lv_vm2

/dev/mapper/vg_hv1-lv_vm2:
Timing cached reads: 24322 MB in 2.00 seconds = 12180.78 MB/sec
Timing buffered disk reads: 396 MB in 3.01 seconds = 131.41 MB/sec

I would expect the timing cached reads results to be very similar, since timing cached reads measures the processor, cache, and memory; it basically reads from the Linux buffer cache without touching the disk.

Timing buffered disk reads, on the other hand, flushes the buffer cache first and then reads from the disk with no prior caching of the data. These numbers were also very similar.

I was hoping that dd numbers would be significantly different for XFS and Ext4, but as you can see below there is minimal difference in the write operations:
[hv ~]$ sudo dd bs=1M count=128 if=/dev/zero of=/vm1/test conv=fdatasync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 1.12586 s, 119 MB/s
[hv ~]$ sudo dd bs=1M count=128 if=/dev/zero of=/vm1/test conv=fdatasync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 1.12256 s, 120 MB/s
[hv ~]$ sudo dd bs=1M count=128 if=/dev/zero of=/vm1/test conv=fdatasync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 1.15067 s, 117 MB/s
[hv ~]$ sudo dd bs=1M count=128 if=/dev/zero of=/vm1/test conv=fdatasync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 1.13103 s, 119 MB/s
[hv ~]$ sudo dd bs=1M count=128 if=/dev/zero of=/vm2/test conv=fdatasync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 1.18037 s, 114 MB/s
[hv ~]$ sudo dd bs=1M count=128 if=/dev/zero of=/vm2/test conv=fdatasync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 1.09832 s, 122 MB/s
[hv ~]$ sudo dd bs=1M count=128 if=/dev/zero of=/vm2/test conv=fdatasync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 1.10921 s, 121 MB/s

The conv=fdatasync option tells dd to physically write the output file's data to disk before finishing, which makes for a more realistic test. What are your thoughts on the above results?
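
A possible follow-up, sketched here rather than something I ran for this post, would be reading the same test files back with the page cache bypassed:

# read back the test files written above; iflag=direct uses O_DIRECT
# so the page cache does not inflate the numbers
$ sudo dd if=/vm1/test of=/dev/null bs=1M iflag=direct
$ sudo dd if=/vm2/test of=/dev/null bs=1M iflag=direct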