Category Archives: tcp/ip

TCP/IP Tuning

Improving Network Performance

When it comes to network performance, unless there is an explicit problem, leave the network optimization to the kernel.
If you see a network problem, then study the problem in detail and come up with a solution for that problem, and leave the rest alone.

Linux handles incoming network packets in the following way:

1. Hardware reception, frame arrives into the network
2. NIC sends a hard IRQ to the CPU about the frame
3. Soft IRQ state, wherein the kernel removes the frame from the NIC, passing it to the application
4. The application recevies the frame and handles it via standard POSIX calls such as read, recv, recvfrom

First, do the obvious, check for errors:

– Check duplex, full/half/auto (ethtool)
– Check speed, 10/100/1000/auto
– Check for errors using ‘netstat -i’

Next, try some NIC optimizations:

* Input Traffic
* Queue Depth
* Application Call Frequency
* Change socket queues
* RSS: Receive Side Scaling
* RFS: Receive Flow Steering
* RPS: Receive Packet Steering
* XPS: Transmit Packet Steering

One option is to slow down incoming traffic. You can do this by lowering the NICs device weight.Input traffic can be slowed down by reducing the value of /proc/sys/net/core/dev_weight.

Also try to increase the physical NIC queue depth to the maximum supported as shown below.

# ethtool --show-ring em1
Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		200
RX Mini:	0
RX Jumbo:	0
TX:		511

# ethtool --set-ring em1 rx 511

# ethtool --show-ring em1
Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511

Next, increase the frequency at which the application calls recv or read.

Socket Queue can be veiewed using the below command. Pruned packets or collapsed packets indicate queue issues.

netstat -s | grep socket
    3339 resets received for embryonic SYN_RECV sockets
    2 packets pruned from receive queue because of socket buffer overrun
    99617 TCP sockets finished time wait in fast timer
    6 delayed acks further delayed because of locked socket
    39 packets collapsed in receive queue due to low socket buffe

You can increase the socket queue by changing the values of receive and send window as shown below. The values are min, default and max.

# cat /proc/sys/net/ipv4/tcp_wmem
4096	65536	16777216
# cat /proc/sys/net/ipv4/tcp_rmem
4096	87380	16777216

Receive-Side Scaling (RSS), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency.
RSS should be enabled when latency is a concern or whenever receive interrupt processing forms a bottleneck. RSS may be enabled by default, depending upon your NIC driver.
To check if it is enabled, look in /proc/interrupts. For instance: ‘# egrep ‘CPU|eth0′ /proc/interrupts’. If only one entry is shown, then your NIC does not support RSS.

Receive Flow Steering (RFS) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS backend to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency.
RFS is disabled by default. To enable RFS, you must edit two files: /proc/sys/net/core/rps_sock_flow_entries and /sys/class/net//queues/rx-queue/rps_flow_cnt.
For /proc/sys/net/core/rps_sock_flow_entries, a value of 32768 is recommended for moderate server loads.
For /sys/class/net//queues/rx-queue/rps_flow_cnt value of rps_sock_flow_entries/N where N is number of receive queues on device. Number of receive queues is defined in the file
/sys/class/net//queues/rx-0/rps_cpus.

Receive Packet Steering is prefferred over RSS, since it is the software implementation of RSS and does not require NIC card support.
Check /sys/class/net//queues/rx-0/rps_cpus to see how many queues are configured. The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of their positions in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to 00001111 (1+2+4+8), or f (the hexadecimal value for 15). To monitor which CPU is receiving network interrupts, looks for NET_RX in ‘watch -n1 cat /proc/softirqs’

Transmit Packet Steering is a mechanism for intelligently selecting which transmit queue to use when transmitting a packet on a multi-queue
device. To accomplish this, a mapping from CPU to hardware queue(s) is recorded. XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). The functionality remains disabled until explicitly configured. For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. To enable XPS, the bitmap of CPUs that may use a transmit
queue is configured using the sysfs file entry: /sys/class/net//queues/tx-/xps_cpus

Some of the above material came from:
https://www.kernel.org/doc/Documentation/networking/scaling.txt and also RedHat Tuning Guide.

TCP State Diagram

TCP State Diagram

TCP goes through different states. The diagram below, which is from Richard Stevens TCP/IP Illustrated Volume 1, should help in understanding the states that TCP goes through.

TCP State Diagram

TCP State Diagram

TCP Header

TCP header is usually 20 bytes unless options are present. Maximum header length is 60 bytes. A typical TCP header contains the following:

– Source Port (16 bits)
– Destination Port (16 bits)
– Sequence bits (32 bits)
– Acknowledgement Number (32 bits) **
– Header Length (4 bits)
– Reserved (4 bits)
– CWR
– ECE **
– URG
– ACK **
– PSH
– RST
– SYN
– FIN
– Window size (16 bits) **
– TCP Checksum (16 bits)
– Urgent Pointer (16 bits)
– Options (variable)

The items with ** refer to data flowing from receiver to sender.
The rest of the items are from the sender to the receiver.

Source port and destination port are the ports used in the communications.

Sequence number identifies the byte in the stream of data. TCP numbers each byte of data. The sequence number is a 32 bit unsigned integer that loops around. Initial sequence number is often picked randomly.

Acknowledgement number is the next byte that the receiver is expecting.

Header length is between 20 bytes and 60 bytes.

CWR – Congestion Window Reduced (the sender reduced its sending rate)
ECE – ECN Echo (the sender received an earlier congestion notification)

URG, ACK, PSH, RST, SYN and FIN are explained in the TCP Flags post in my earlier blog.

Window size is the number of bytes starting with the acknowledgement number that the receiver is willing to accept. Since it’s 16 bits, it limited in size to 65,535 bytes.

TCP Checksum contains header and data checksum.

Urgent pointer is only valid if URG bit is set. This “pointer” is a positive offset that must be added to the Sequence Number field of the segment to yield the sequence number of the last byte of urgent data.

Options – Most common option is MSS or maximum segment size.
Each end of a connection normally specifies this option on the first segment it sends (the ones with the SYN bit field set to establish the connection).
The MSS option specifies the maximum-size segment that the sender of the option is willing to receive in the reverse direction.

(Some of the material above is from Richard Stevens book TCP/IP Illustrated, Volume 1)

IP Header Explained

Mininum 20 bytes, maximum 60 bytes.
Maximum size is 65,535 bytes for IP header + data.

– Version field (4 bits) for IPv4 or IPv6
– IHL or Internet Header Length (4 bits), it is the number of 32-bit words in the header
– DS field, (6 bits) or Differentiated Service
– ECN (2 bits) explicit congestion notification
– Total length (16-bit) indicates total length of IP datagram
– Identification (16 bits) – Helps identify each datagram, counter based, very imp for fragmentation.
– Flags (3 bits) – Used to indiciate fragmentation.
– Fragment offset (13 bits) – Used to indicate offset of fragmentation.
– TTL (8 bits) upper limit of routers the datagram can pass through. Normally set at 64, although 128 or 255 is also common.
– Protocol (8 bits) – Specifies protocol encapsulated. 6 is TCP and 17 is UDP.
– Header Checksum (16 bits) – Checksum of header only and not payload.
– Source IP (32 bit)
– Destination IP (32 bit)
– Options (Up to 320 bits/40 bytes)
– IP Data if any, up to 65,515 bytes

TCP Flags Explained

TCP Flags

TCP has six flags that can help you troubleshoot a connection. The flags are:

U – URG
A – ACK
P – PSH
R – RST
S – SYN
F – FIN

When using tcpdump command to troubleshoot network connections, you can view TCP conversations with these flags as follows:

# tcpdump 'tcp[13] & 32 != 0' #URG
# tcpdump 'tcp[13] & 16 != 0' #ACK
# tcpdump 'tcp[13] & 8  != 0' #PSH
# tcpdump 'tcp[13] & 4  != 0' #RST
# tcpdump 'tcp[13] & 2  != 0' #SYN
# tcpdump 'tcp[13] & 0  != 0' #FIN

Another way of expressing the values is ‘tcp-fin, tcp-syn, tcp-rst, tcp-push, tcp-ack, tcp-urg.’.

# tcpdump 'tcp[tcpflags] & tcp-urg != 0' #URG
# tcpdump 'tcp[tcpflags] & tcp-ack != 0' #ACK
# tcpdump 'tcp[tcpflags] & tcp-push != 0' #PSH
# tcpdump 'tcp[tcpflags] & tcp-rst != 0' #RST
# tcpdump 'tcp[tcpflags] & tcp-sync != 0' #SYN
# tcpdump 'tcp[tcpflags] & tcp-fin != 0' #FIN

A mnemonic to remember the above is ‘Unskilled Attackers Pester Real Security Folks’.

URG flag is used to indicate that the packet should be prioritized over other packets for processing.
This flag is not used often. I can only think of telnet that uses it.

#  tcpdump 'tcp[13] & 32 != 0'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:29:47.741814 IP destination > source.39697: Flags [P.U], seq 2342494158:2342494159, ack 1232430662, win 31, urg 1, options [nop,nop,TS val 877959588 ecr 703327131], length 1
18:29:51.293145 IP destination.telnet > source.39697: Flags [P.U], seq 12:13, ack 5, win 31, urg 1, options [nop,nop,TS val 877963140 ecr 703330673], length 1

SYN is used for starting a connection.
ACK is used to acknowledge packets received.
PSH is used to ask the receiving end not to buffer packets, but to process them as soon as they are received.
RST is used to denote that no service is listening on the given port.

# tcpdump -n -v 'tcp[tcpflags] & (tcp-rst) != 0'
tcpdump: listening on br0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:44:11.087487 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    destination.ms-sql-s > source.x11: Flags [R.], cksum 0xb12d (correct), seq 0, ack 1435238401, win 0, length 0

Understanding IP addresses

In IPv4 there are 5 different IP address classes, Class A,B,C,D,E. Keep in mind that when figuring out number of hosts, we always subtract 2, all 0’s for network and all 1’s for broadcast.

  • Class A – 0 to 127. 2^8 networks, 2^24-2 hosts.
  • Class B – 128 to 191. 2^16 networks, 2^16-2 hosts.
  • Class C – 192 to 223. 2^24 networks, 2^8-2 hosts.
  • Class D – 224 – 239. Networks and hosts not defined.
  • Class E – 240 – 255. Networks and hosts not defined.
  • Let’s take an example, network 10.0.0.0/8, and we have to figure out the subnet id, IP range and broadcast address. Since it’s /8 the netmask is 255.0.0.0 because only 8 bits are being used for network, and 24 its are being used for host. So the first IP is 10.0.0.1 and the last IP is 10.255.255.254. The broadcast address will be all 1’s in the host field, which would translate to 10.255.255.255. The subnet address is all 0’s in the host field, which is 10.0.0.0.