TCP/IP Tuning

Improving Network Performance

When it comes to network performance, unless there is an explicit problem, leave the network optimization to the kernel.
If you see a network problem, then study the problem in detail and come up with a solution for that problem, and leave the rest alone.

Linux handles incoming network packets in the following way:

1. Hardware reception, frame arrives into the network
2. NIC sends a hard IRQ to the CPU about the frame
3. Soft IRQ state, wherein the kernel removes the frame from the NIC, passing it to the application
4. The application recevies the frame and handles it via standard POSIX calls such as read, recv, recvfrom

First, do the obvious, check for errors:

– Check duplex, full/half/auto (ethtool)
– Check speed, 10/100/1000/auto
– Check for errors using ‘netstat -i’

Next, try some NIC optimizations:

* Input Traffic
* Queue Depth
* Application Call Frequency
* Change socket queues
* RSS: Receive Side Scaling
* RFS: Receive Flow Steering
* RPS: Receive Packet Steering
* XPS: Transmit Packet Steering

One option is to slow down incoming traffic. You can do this by lowering the NICs device weight.Input traffic can be slowed down by reducing the value of /proc/sys/net/core/dev_weight.

Also try to increase the physical NIC queue depth to the maximum supported as shown below.

# ethtool --show-ring em1
Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		200
RX Mini:	0
RX Jumbo:	0
TX:		511

# ethtool --set-ring em1 rx 511

# ethtool --show-ring em1
Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511

Next, increase the frequency at which the application calls recv or read.

Socket Queue can be veiewed using the below command. Pruned packets or collapsed packets indicate queue issues.

netstat -s | grep socket
    3339 resets received for embryonic SYN_RECV sockets
    2 packets pruned from receive queue because of socket buffer overrun
    99617 TCP sockets finished time wait in fast timer
    6 delayed acks further delayed because of locked socket
    39 packets collapsed in receive queue due to low socket buffe

You can increase the socket queue by changing the values of receive and send window as shown below. The values are min, default and max.

# cat /proc/sys/net/ipv4/tcp_wmem
4096	65536	16777216
# cat /proc/sys/net/ipv4/tcp_rmem
4096	87380	16777216

Receive-Side Scaling (RSS), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency.
RSS should be enabled when latency is a concern or whenever receive interrupt processing forms a bottleneck. RSS may be enabled by default, depending upon your NIC driver.
To check if it is enabled, look in /proc/interrupts. For instance: ‘# egrep ‘CPU|eth0′ /proc/interrupts’. If only one entry is shown, then your NIC does not support RSS.

Receive Flow Steering (RFS) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS backend to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency.
RFS is disabled by default. To enable RFS, you must edit two files: /proc/sys/net/core/rps_sock_flow_entries and /sys/class/net//queues/rx-queue/rps_flow_cnt.
For /proc/sys/net/core/rps_sock_flow_entries, a value of 32768 is recommended for moderate server loads.
For /sys/class/net//queues/rx-queue/rps_flow_cnt value of rps_sock_flow_entries/N where N is number of receive queues on device. Number of receive queues is defined in the file

Receive Packet Steering is prefferred over RSS, since it is the software implementation of RSS and does not require NIC card support.
Check /sys/class/net//queues/rx-0/rps_cpus to see how many queues are configured. The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of their positions in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to 00001111 (1+2+4+8), or f (the hexadecimal value for 15). To monitor which CPU is receiving network interrupts, looks for NET_RX in ‘watch -n1 cat /proc/softirqs’

Transmit Packet Steering is a mechanism for intelligently selecting which transmit queue to use when transmitting a packet on a multi-queue
device. To accomplish this, a mapping from CPU to hardware queue(s) is recorded. XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). The functionality remains disabled until explicitly configured. For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. To enable XPS, the bitmap of CPUs that may use a transmit
queue is configured using the sysfs file entry: /sys/class/net//queues/tx-/xps_cpus

Some of the above material came from: and also RedHat Tuning Guide.

Filesystem Tuning

Tips for RedHat filesystem tuning.

– Optimize file system block size.
Block size is selected at mkfs time. If you plan on storing mostly smaller or larger files than the default block size,
then decrease or increase the block size of the filesystem during mkfs time. This will allow for less wasted space and
also for faster reads when the file size fits in a single block.

# tune2fs -l /dev/mapper/vg_hv1-lv_home | grep 'Block size'
Block size:               4096 (bytes)

– Match filesystem geometry to storage subsystem.
During mkfs time, align the geometry of the filesystem with that of the RAID sub-system.
This might be set automatically, but it does not hurt to check. As per RedHat “For striped block devices (for example, RAID5 arrays), the stripe geometry can be specified at the time of file system creation. Using proper stripe geometry greatly enhances the performance of an ext4 file system…
For example, to create a file system with a 64k stride (that is, 16 x 4096) on a 4k-block file system, use the following command:”

# mkfs.ext4 -E stride=16,stripe-width=64 /dev/device

– External journals
If you are using a journaling filesystem such as ext4, move the journal to a faster disk that the data.
This will speed up journal writes and make your writes faster. Journals are specified at mkfs time.

# mke2fs -O journal_dev

– Use the ‘nobarrier’ option when mounting filesystems.
Barriers is a way of ensuring that filesystem metadata is written correctly on persistent storage.
This will slow down filesystems that have fsync() being used excessively.
If you disable barriers you risk corruption of data in case of power loss, however the risk might be worth it
if data is redundant on other systems.

– Disable access time, using the ‘noatime’ mount option in /etc/fstab.
File access time is updated when it is read, you may not need this to ‘feature’.
in RHEL relatime behavior updates atime only if atime is older than mtime or ctime.
Enabling noatime also enables nodiratime.

– Increased read-ahead support
Read-ahead speeds up file access by pre-fetching data into the page cache.
Get current value using ‘blockdev -getra device’.
Change the value using ‘blocdev -setra N device’.
N is the number of 512-byte sectors.

# blockdev --getra  /dev/mapper/vg_hv1-lv_root

TCP State Diagram

TCP State Diagram

TCP goes through different states. The diagram below, which is from Richard Stevens TCP/IP Illustrated Volume 1, should help in understanding the states that TCP goes through.

TCP State Diagram

TCP State Diagram

TCP Header

TCP header is usually 20 bytes unless options are present. Maximum header length is 60 bytes. A typical TCP header contains the following:

– Source Port (16 bits)
– Destination Port (16 bits)
– Sequence bits (32 bits)
– Acknowledgement Number (32 bits) **
– Header Length (4 bits)
– Reserved (4 bits)
– ECE **
– ACK **
– Window size (16 bits) **
– TCP Checksum (16 bits)
– Urgent Pointer (16 bits)
– Options (variable)

The items with ** refer to data flowing from receiver to sender.
The rest of the items are from the sender to the receiver.

Source port and destination port are the ports used in the communications.

Sequence number identifies the byte in the stream of data. TCP numbers each byte of data. The sequence number is a 32 bit unsigned integer that loops around. Initial sequence number if often picked randomly.

Acknowledgement number is the next byte that the receiver is expecting.

Header length is between 20 bytes and 60 bytes.

CWR – Congestion Window Reduced (the sender reduced its sending rate)
ECE – ECN Echo (the sender received an earlier congestion notification)

URG, ACK, PSH, RST, SYN and FIN are explained in the TCP Flags post in my earlier blog.

Window size is the number of bytes starting with the acknowledgement number that the receiver is willing to accept. Since it’s 16 bits, it limited in size to 65,535 bytes.

TCP Checksum contains header and data checksum.

Urgent pointer is only valid if URG bit is set. This “pointer” is a positive offset that must be added to the Sequence Number field of the segment to yield the sequence number of the last byte of urgent data.

Options – Most common option is MSS or maximum segement size.
Each end of a connection normally specifies this option on the first segment it sends (the ones with the SYN bit field set to establish the connection).
The MSS option specifies the maximum-size segment that the sender of the option is willing to receive in the reverse direction.

(Some of the material above is from Richard Stevens book TCP/IP Illustrated, Volume 1)

IP Header Explained

Mininum 20 bytes, maximum 60 bytes.
Maximum size is 65,535 bytes for IP header + data.

– Version field (4 bits) for IPv4 or IPv6
– IHL or Internet Header Length (4 bits), it is the number of 32-bit words in the header
– DS field, (6 bits) or Differentiated Service
– ECN (2 bits) explicit congestion notification
– Total length (16-bit) indicates total length of IP datagram
– Identification (16 bits) – Helps identify each datagram, counter based, very imp for fragmentation.
– Flags (3 bits) – Used to indiciate fragmentation.
– Fragment offset (13 bits) – Used to indicate offset of fragmentation.
– TTL (8 bits) upper limit of routers the datagram can pass through. Normally set at 64, although 128 or 255 is also common.
– Protocol (8 bits) – Specifies protocol encapsulated. 6 is TCP and 17 is UDP.
– Header Checksum (16 bits) – Checksum of header only and not payload.
– Source IP (32 bit)
– Destination IP (32 bit)
– Options (Up to 320 bits/40 bytes)
– IP Data if any, up to 65,515 bytes

TCP Flags Explained

TCP Flags

TCP has six flags that can help you troubleshoot a connection. The flags are:


When using tcpdump command to troubleshoot network connections, you can view TCP conversations with these flags as follows:

# tcpdump 'tcp[13] & 32 != 0' #URG
# tcpdump 'tcp[13] & 16 != 0' #ACK
# tcpdump 'tcp[13] & 8  != 0' #PSH
# tcpdump 'tcp[13] & 4  != 0' #RST
# tcpdump 'tcp[13] & 2  != 0' #SYN
# tcpdump 'tcp[13] & 0  != 0' #FIN

Another way of expressing the values is ‘tcp-fin, tcp-syn, tcp-rst, tcp-push, tcp-act, tcp-urg.’.

# tcpdump 'tcp[tcpflags] & tcp-urg != 0' #URG
# tcpdump 'tcp[tcpflags] & tcp-ack != 0' #ACK
# tcpdump 'tcp[tcpflags] & tcp-push != 0' #PSH
# tcpdump 'tcp[tcpflags] & tcp-rst != 0' #RST
# tcpdump 'tcp[tcpflags] & tcp-sync != 0' #SYN
# tcpdump 'tcp[tcpflags] & tcp-fin != 0' #FIN

A mnemonic to remember the above is ‘Unskilled Attackers Pester Real Security Folks’.

URG flag is used to indicate that the packet should be prioritized over other packets for processing.
This flag is not used often. I can only think of telnet that uses it.

#  tcpdump 'tcp[13] & 32 != 0'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:29:47.741814 IP destination > source.39697: Flags [P.U], seq 2342494158:2342494159, ack 1232430662, win 31, urg 1, options [nop,nop,TS val 877959588 ecr 703327131], length 1
18:29:51.293145 IP destination.telnet > source.39697: Flags [P.U], seq 12:13, ack 5, win 31, urg 1, options [nop,nop,TS val 877963140 ecr 703330673], length 1

SYN is used for starting a connection.
ACK is used to acknowledge packets received.
PSH is used to ask the receiving end not to buffer packets, but to process them as soon as they are received.
RST is used to denote that no service is listening on the given port.

# tcpdump -n -v 'tcp[tcpflags] & (tcp-rst) != 0'
tcpdump: listening on br0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:44:11.087487 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40) > source.x11: Flags [R.], cksum 0xb12d (correct), seq 0, ack 1435238401, win 0, length 0

More on TLS

TLS uses asymmetric and symmetric encryption. Asymmetric encryption is used for the initial communication, followed by faster symmetric key encryption.

Symmetric ciphers are stream based or block based. Stream based encrypt one message at a time. Block based take a number of bits, and encrypt them together as one. A few symmetric key encryption algorithms are:

– Blowfish
– RC4
– 3DES

A few asymmetric key encryption algorithms are:

– DH
– Elliptic Curve

A couple of message digest (MD) algorithms are:

– MD5

If you want to see which algorithms an SSL server supports, use the tool ‘sslscan’ which can be installed using ‘yum install sslscan -y’.
You might have to enable EPEL repository to install using yum. After installation, if you run ‘sslscan’ you will see a lot of very useful output, as show below. First you wil see the algorithms that sslscan supports, followed by the ones that accepts. The most important item section is the one below:

Preferred Server Cipher(s):
SSLv2 0 bits (NONE)
SSLv3 128 bits ECDHE-RSA-RC4-SHA
TLSv1 128 bits ECDHE-RSA-RC4-SHA
TLS11 128 bits ECDHE-RSA-AES128-SHA
TLS12 128 bits ECDHE-RSA-AES128-GCM-SHA256

This is showing that prefers SSLv3, TLSv1,1.1 and 1.2. The cipher suites preferred are ECDE-RSA-RC4-SHA.
EDCE is Elliptic Curve Ephemeral Diffie Hellman which supports PFS or Perfect Forward Secrecy.
Normally with RSA, a symmetric key is picked once as part of the SSL HELLO protocol. After that the key does not change.
This means that if the servers private key is compromised, then an attacker can get the symmetric key.
With EDCE and PFS, the symmetric key is changed every session, so even if one key is compromised, the other key will not be impacted.

You can configure Apache to prefer cipher suites, see and