Sparse Files

Sparse files are files whose metadata reports one size, but the file itself takes less space on the filesystem.
Spare files are a common way to efficiently use disk space. They can be created using the ‘truncate’ command.
Or you can create them by opening a file programmatically, seeking to an offset and then closing the file, without writing anything.

ls -l reports the length of the file, so the same file of 2 bytes will be reported as 2 bytes.
ls -s reports the size based on blocks, so for a 2 byte file, ls -s will report a size of 4K, since the block size is 4K.

du reports size based on blocks being used. For instance, if a file is 2 bytes, and the block size is 4096, du will report the file being 4K.
du -b will report the same size of a file as ls -l, since -b means apparent size.

Both ls -l and du -b do not take into account spare files. If a file if sparse, du -b and ls -l report it as though it is not sparse.

When using the ‘cp’ command, use the ‘cp –sparse=always’ option to keep sparse files as sparse.

‘scp’ is not sparse aware and if you use scp to copy a file that is spare it will take up “more” room on the destination host. Instead if you use rsync with -S option, spare files will be maintained as sparse.

tar is not sparse file smart by default. If you tar a sparse file, the tar file itself and when you untar the tar file, both will fill the sparse areas of the sparse file with zeros, resulting in more disk blocks being used. You should use the ‘-S’ option with tar to make it sparse file smart.

# create a sparse file of size 1GB
$ truncate -s +1G test

# The first number shows 0, which is block based size
$ ls -lsh test
total 1.G
0 -rw-rw-r-- 1 orion orion 1.0G Jan  7 14:07 test

# create tar file
$ tar -cvf test.tar test

# test.tar now really takes up 1GB
$ ls -ls test.tar
1.1G -rw-rw-r-- 1 orion orion 1.1G Jan  7 14:08 test.tar

# untarring now shows that the file is now using 1GB, before it was using 0GB
$ rm test
$ tar xvf test.tar
$ ls -lsh test
1.0G -rw-rw-r-- 1 orion orion 1.0G Jan  7 14:07 test

With the -S option, tar is smarter and the file continues to be still sparse.

# create a sparse file of size 1GB
$ truncate -s +1G test

# The first number shows 0, which is block based size
$ ls -lsh test
total 1.G
0 -rw-rw-r-- 1 orion orion 1.0G Jan  7 14:07 test

# create tar file with -S
$ tar -S -cvf test.tar test

# test.tar allocated size based on blocks is now 12
$ ls -ls test.tar
12 -rw-rw-r-- 1 orion orion      10240 Jan  7 14:19 test.tar

# untarring now shows that the file is still sparse
$ rm test
$ tar xvf test.tar
$ ls -lsh test
0 -rw-rw-r-- 1 orion orion 1073741824 Jan  7 14:19 test

When we do a ‘stat’ on a spare file, we see it it taking up no space in terms of blocks.

$ stat test
  File: `test'
  Size: 1073741824	Blocks: 0          IO Block: 4096   regular file
Device: fd07h/64775d	Inode: 1046835     Links: 1
Access: (0664/-rw-rw-r--)  Uid: (  500/   orion)   Gid: (  500/   orion)
Access: 2015-01-07 14:19:53.957911258 -0800
Modify: 2015-01-07 14:17:58.000000000 -0800
Change: 2015-01-07 14:19:53.957623281 -0800

We can also measure the extents used.

$ filefrag test
test: 0 extents found

Copying files in Linux

Copying files should be simple, yet there are a number of ways of transferring files.
Some of the ways that I could think of are listed here.

# -a=archive mode; equals -rlptgoD
# -r=recurse into directories
# -l=copy symlinks as symlinks
# -p=preserve permissions
# -t=preserve modification times
# -g=preserve group
# -o=preserve owner (super-user only)
# -D=preserve device files (super-user only) and special files
# -v=verbose
# -P=keep partially transferred files and show progress
# -H=preserve hardlinks
# -A=preserve ACLs
# -X=preserve selinux and other extended attributes
$ rsync -avPHAX /source /destination

# cross systems using ssh
# -z=compress
# -e=specify remote shell to use
$ rsync -azv -e ssh /source user@destinationhost:/destination-dir

# -xdev=Don’t  descend  directories on other filesystems
# -print=print the filenames found
# -p=Run in copy-pass mode
# -d=make directories
# -m=preserve-modification-time
# -v=verbose
$ find /source -xdev -path | cpio -pdmv /destination

# let's not forget good old cp
# -r=recursive
# -p=preserve mode,ownership,timestamps
# -v=verbose
$ cp -rpv --sparse=always /source /destination

# tar
# -c=create a new archive
# -v=verbose
# -f=use archive file
$ tar cvf - /source | (cd /destination && tar xvf -)

# scp
$ scp -r /source user@destinationhost:/destination-dir

# copy an entire partition
$ dd if=/dev/source-partition of=/dev/destination-partition bs=<block-size>

Useful Linux Memory Calculation Commands

Calculate total memory used by apps

ps aux | awk '{sum+=$6} END {print sum / 1024}’

To free pagecache:

    echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes:

    echo 2 > /proc/sys/vm/drop_caches

To free pagecache, dentries and inodes:

    echo 3 > /proc/sys/vm/drop_caches

dd physical memory being used by a user, in this it’s user daemon

ps aux |awk '{if($1 ~ "daemon"){Total+=$6}} END {print Total/1024" MB"}'

see how many processes are using swap:

grep Swap /proc/[1-9]*/smaps | grep -v '\W0 kB'

list the top 10 processes using the most swap:

ps ax | sed "s/^ *//" > /tmp/ps_ax.output 
for x in $(grep Swap /proc/[1-9]*/smaps | grep -v '\W0 kB' | tr -s ' ' | cut -d' ' -f-2 | sort -t' ' -k2 -n | tr -d ' ' | tail -10); do 
    swapusage=$(echo $x | cut -d: -f3)
    pid=$(echo $x | cut -d/ -f3)
    procname=$(cat /tmp/ps_ax.output | grep ^$pid)
    echo "============================" 
    echo "Process   : $procname" 
    echo "Swap usage: $swapusage kB"; done

Common Mount Options

async -> Allows the asynchronous input/output operations on the file system.
auto -> Allows the file system to be mounted automatically using the mount -a command.
defaults -> Provides an alias for async,auto,dev,exec,nouser,rw,suid.
exec -> Allows the execution of binary files on the particular file system.
loop -> Mounts an image as a loop device.
noauto -> Default behavior disallows the automatic mount of the file system using the mount -a command.
noexec -> Disallows the execution of binary files on the particular file system.
nouser -> Disallows an ordinary user (that is, other than root) to mount and unmount the file system.
remount -> Remounts the file system in case it is already mounted.
ro -> Mounts the file system for reading only.
rw -> Mounts the file system for both reading and writing.
user -> Allows an ordinary user (that is, other than root) to mount and unmount the file system.

TCP/IP Tuning

Improving Network Performance

When it comes to network performance, unless there is an explicit problem, leave the network optimization to the kernel.
If you see a network problem, then study the problem in detail and come up with a solution for that problem, and leave the rest alone.

Linux handles incoming network packets in the following way:

1. Hardware reception, frame arrives into the network
2. NIC sends a hard IRQ to the CPU about the frame
3. Soft IRQ state, wherein the kernel removes the frame from the NIC, passing it to the application
4. The application recevies the frame and handles it via standard POSIX calls such as read, recv, recvfrom

First, do the obvious, check for errors:

– Check duplex, full/half/auto (ethtool)
– Check speed, 10/100/1000/auto
– Check for errors using ‘netstat -i’

Next, try some NIC optimizations:

* Input Traffic
* Queue Depth
* Application Call Frequency
* Change socket queues
* RSS: Receive Side Scaling
* RFS: Receive Flow Steering
* RPS: Receive Packet Steering
* XPS: Transmit Packet Steering

One option is to slow down incoming traffic. You can do this by lowering the NICs device weight.Input traffic can be slowed down by reducing the value of /proc/sys/net/core/dev_weight.

Also try to increase the physical NIC queue depth to the maximum supported as shown below.

# ethtool --show-ring em1
Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		200
RX Mini:	0
RX Jumbo:	0
TX:		511

# ethtool --set-ring em1 rx 511

# ethtool --show-ring em1
Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511

Next, increase the frequency at which the application calls recv or read.

Socket Queue can be veiewed using the below command. Pruned packets or collapsed packets indicate queue issues.

netstat -s | grep socket
    3339 resets received for embryonic SYN_RECV sockets
    2 packets pruned from receive queue because of socket buffer overrun
    99617 TCP sockets finished time wait in fast timer
    6 delayed acks further delayed because of locked socket
    39 packets collapsed in receive queue due to low socket buffe

You can increase the socket queue by changing the values of receive and send window as shown below. The values are min, default and max.

# cat /proc/sys/net/ipv4/tcp_wmem
4096	65536	16777216
# cat /proc/sys/net/ipv4/tcp_rmem
4096	87380	16777216

Receive-Side Scaling (RSS), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency.
RSS should be enabled when latency is a concern or whenever receive interrupt processing forms a bottleneck. RSS may be enabled by default, depending upon your NIC driver.
To check if it is enabled, look in /proc/interrupts. For instance: ‘# egrep ‘CPU|eth0′ /proc/interrupts’. If only one entry is shown, then your NIC does not support RSS.

Receive Flow Steering (RFS) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS backend to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency.
RFS is disabled by default. To enable RFS, you must edit two files: /proc/sys/net/core/rps_sock_flow_entries and /sys/class/net//queues/rx-queue/rps_flow_cnt.
For /proc/sys/net/core/rps_sock_flow_entries, a value of 32768 is recommended for moderate server loads.
For /sys/class/net//queues/rx-queue/rps_flow_cnt value of rps_sock_flow_entries/N where N is number of receive queues on device. Number of receive queues is defined in the file
/sys/class/net//queues/rx-0/rps_cpus.

Receive Packet Steering is prefferred over RSS, since it is the software implementation of RSS and does not require NIC card support.
Check /sys/class/net//queues/rx-0/rps_cpus to see how many queues are configured. The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of their positions in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to 00001111 (1+2+4+8), or f (the hexadecimal value for 15). To monitor which CPU is receiving network interrupts, looks for NET_RX in ‘watch -n1 cat /proc/softirqs’

Transmit Packet Steering is a mechanism for intelligently selecting which transmit queue to use when transmitting a packet on a multi-queue
device. To accomplish this, a mapping from CPU to hardware queue(s) is recorded. XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). The functionality remains disabled until explicitly configured. For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. To enable XPS, the bitmap of CPUs that may use a transmit
queue is configured using the sysfs file entry: /sys/class/net//queues/tx-/xps_cpus

Some of the above material came from:
https://www.kernel.org/doc/Documentation/networking/scaling.txt and also RedHat Tuning Guide.

Filesystem Tuning

Tips for RedHat filesystem tuning.

– Optimize file system block size.
Block size is selected at mkfs time. If you plan on storing mostly smaller or larger files than the default block size,
then decrease or increase the block size of the filesystem during mkfs time. This will allow for less wasted space and
also for faster reads when the file size fits in a single block.

# tune2fs -l /dev/mapper/vg_hv1-lv_home | grep 'Block size'
Block size:               4096 (bytes)

– Match filesystem geometry to storage subsystem.
During mkfs time, align the geometry of the filesystem with that of the RAID sub-system.
This might be set automatically, but it does not hurt to check. As per RedHat “For striped block devices (for example, RAID5 arrays), the stripe geometry can be specified at the time of file system creation. Using proper stripe geometry greatly enhances the performance of an ext4 file system…
For example, to create a file system with a 64k stride (that is, 16 x 4096) on a 4k-block file system, use the following command:”

# mkfs.ext4 -E stride=16,stripe-width=64 /dev/device

– External journals
If you are using a journaling filesystem such as ext4, move the journal to a faster disk that the data.
This will speed up journal writes and make your writes faster. Journals are specified at mkfs time.

# mke2fs -O journal_dev

– Use the ‘nobarrier’ option when mounting filesystems.
Barriers is a way of ensuring that filesystem metadata is written correctly on persistent storage.
This will slow down filesystems that have fsync() being used excessively.
If you disable barriers you risk corruption of data in case of power loss, however the risk might be worth it
if data is redundant on other systems.

– Disable access time, using the ‘noatime’ mount option in /etc/fstab.
File access time is updated when it is read, you may not need this to ‘feature’.
in RHEL relatime behavior updates atime only if atime is older than mtime or ctime.
Enabling noatime also enables nodiratime.

– Increased read-ahead support
Read-ahead speeds up file access by pre-fetching data into the page cache.
Get current value using ‘blockdev -getra device’.
Change the value using ‘blocdev -setra N device’.
N is the number of 512-byte sectors.

# blockdev --getra  /dev/mapper/vg_hv1-lv_root
1024

TCP State Diagram

TCP State Diagram

TCP goes through different states. The diagram below, which is from Richard Stevens TCP/IP Illustrated Volume 1, should help in understanding the states that TCP goes through.

TCP State Diagram

TCP State Diagram

TCP Header

TCP header is usually 20 bytes unless options are present. Maximum header length is 60 bytes. A typical TCP header contains the following:

– Source Port (16 bits)
– Destination Port (16 bits)
– Sequence bits (32 bits)
– Acknowledgement Number (32 bits) **
– Header Length (4 bits)
– Reserved (4 bits)
– CWR
– ECE **
– URG
– ACK **
– PSH
– RST
– SYN
– FIN
– Window size (16 bits) **
– TCP Checksum (16 bits)
– Urgent Pointer (16 bits)
– Options (variable)

The items with ** refer to data flowing from receiver to sender.
The rest of the items are from the sender to the receiver.

Source port and destination port are the ports used in the communications.

Sequence number identifies the byte in the stream of data. TCP numbers each byte of data. The sequence number is a 32 bit unsigned integer that loops around. Initial sequence number is often picked randomly.

Acknowledgement number is the next byte that the receiver is expecting.

Header length is between 20 bytes and 60 bytes.

CWR – Congestion Window Reduced (the sender reduced its sending rate)
ECE – ECN Echo (the sender received an earlier congestion notification)

URG, ACK, PSH, RST, SYN and FIN are explained in the TCP Flags post in my earlier blog.

Window size is the number of bytes starting with the acknowledgement number that the receiver is willing to accept. Since it’s 16 bits, it limited in size to 65,535 bytes.

TCP Checksum contains header and data checksum.

Urgent pointer is only valid if URG bit is set. This “pointer” is a positive offset that must be added to the Sequence Number field of the segment to yield the sequence number of the last byte of urgent data.

Options – Most common option is MSS or maximum segment size.
Each end of a connection normally specifies this option on the first segment it sends (the ones with the SYN bit field set to establish the connection).
The MSS option specifies the maximum-size segment that the sender of the option is willing to receive in the reverse direction.

(Some of the material above is from Richard Stevens book TCP/IP Illustrated, Volume 1)

IP Header Explained

Mininum 20 bytes, maximum 60 bytes.
Maximum size is 65,535 bytes for IP header + data.

– Version field (4 bits) for IPv4 or IPv6
– IHL or Internet Header Length (4 bits), it is the number of 32-bit words in the header
– DS field, (6 bits) or Differentiated Service
– ECN (2 bits) explicit congestion notification
– Total length (16-bit) indicates total length of IP datagram
– Identification (16 bits) – Helps identify each datagram, counter based, very imp for fragmentation.
– Flags (3 bits) – Used to indiciate fragmentation.
– Fragment offset (13 bits) – Used to indicate offset of fragmentation.
– TTL (8 bits) upper limit of routers the datagram can pass through. Normally set at 64, although 128 or 255 is also common.
– Protocol (8 bits) – Specifies protocol encapsulated. 6 is TCP and 17 is UDP.
– Header Checksum (16 bits) – Checksum of header only and not payload.
– Source IP (32 bit)
– Destination IP (32 bit)
– Options (Up to 320 bits/40 bytes)
– IP Data if any, up to 65,515 bytes

TCP Flags Explained

TCP Flags

TCP has six flags that can help you troubleshoot a connection. The flags are:

U – URG
A – ACK
P – PSH
R – RST
S – SYN
F – FIN

When using tcpdump command to troubleshoot network connections, you can view TCP conversations with these flags as follows:

# tcpdump 'tcp[13] & 32 != 0' #URG
# tcpdump 'tcp[13] & 16 != 0' #ACK
# tcpdump 'tcp[13] & 8  != 0' #PSH
# tcpdump 'tcp[13] & 4  != 0' #RST
# tcpdump 'tcp[13] & 2  != 0' #SYN
# tcpdump 'tcp[13] & 0  != 0' #FIN

Another way of expressing the values is ‘tcp-fin, tcp-syn, tcp-rst, tcp-push, tcp-ack, tcp-urg.’.

# tcpdump 'tcp[tcpflags] & tcp-urg != 0' #URG
# tcpdump 'tcp[tcpflags] & tcp-ack != 0' #ACK
# tcpdump 'tcp[tcpflags] & tcp-push != 0' #PSH
# tcpdump 'tcp[tcpflags] & tcp-rst != 0' #RST
# tcpdump 'tcp[tcpflags] & tcp-sync != 0' #SYN
# tcpdump 'tcp[tcpflags] & tcp-fin != 0' #FIN

A mnemonic to remember the above is ‘Unskilled Attackers Pester Real Security Folks’.

URG flag is used to indicate that the packet should be prioritized over other packets for processing.
This flag is not used often. I can only think of telnet that uses it.

#  tcpdump 'tcp[13] & 32 != 0'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:29:47.741814 IP destination > source.39697: Flags [P.U], seq 2342494158:2342494159, ack 1232430662, win 31, urg 1, options [nop,nop,TS val 877959588 ecr 703327131], length 1
18:29:51.293145 IP destination.telnet > source.39697: Flags [P.U], seq 12:13, ack 5, win 31, urg 1, options [nop,nop,TS val 877963140 ecr 703330673], length 1

SYN is used for starting a connection.
ACK is used to acknowledge packets received.
PSH is used to ask the receiving end not to buffer packets, but to process them as soon as they are received.
RST is used to denote that no service is listening on the given port.

# tcpdump -n -v 'tcp[tcpflags] & (tcp-rst) != 0'
tcpdump: listening on br0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:44:11.087487 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    destination.ms-sql-s > source.x11: Flags [R.], cksum 0xb12d (correct), seq 0, ack 1435238401, win 0, length 0