Category Archives: performance management

Useful Linux Memory Calculation Commands

Calculate the total resident memory (RSS) used by all processes, in MB:

ps aux | awk '{sum+=$6} END {print sum / 1024}'

To free pagecache:

    echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes:

    echo 2 > /proc/sys/vm/drop_caches

To free pagecache, dentries and inodes:

    echo 3 > /proc/sys/vm/drop_caches
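
Only clean pages can be dropped; dirty pages are not freed, so running sync first releases more memory:

    sync; echo 3 > /proc/sys/vm/drop_caches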

Add up the physical memory being used by a particular user, in this case the daemon user:

ps aux |awk '{if($1 ~ "daemon"){Total+=$6}} END {print Total/1024" MB"}'

See which processes are using swap:

grep Swap /proc/[1-9]*/smaps | grep -v '\W0 kB'

List the top 10 processes using the most swap:

# Save a 'ps ax' listing (leading spaces stripped) so PIDs can be looked up later
ps ax | sed "s/^ *//" > /tmp/ps_ax.output
# Sort the non-zero Swap entries from every process's smaps and keep the 10 largest
for x in $(grep Swap /proc/[1-9]*/smaps | grep -v '\W0 kB' | tr -s ' ' | cut -d' ' -f-2 | sort -t' ' -k2 -n | tr -d ' ' | tail -10); do
    swapusage=$(echo $x | cut -d: -f3)          # third ':'-separated field is the size in kB
    pid=$(echo $x | cut -d/ -f3)                # third '/'-separated field is the PID
    procname=$(grep "^$pid " /tmp/ps_ax.output) # trailing space avoids matching a longer PID
    echo "============================"
    echo "Process   : $procname"
    echo "Swap usage: $swapusage kB"
done

TCP/IP Tuning

Improving Network Performance

When it comes to network performance, unless there is an explicit problem, leave network optimization to the kernel.
If you do see a network problem, study it in detail, come up with a solution for that specific problem, and leave the rest alone.

Linux handles incoming network packets in the following way:

1. Hardware reception: the frame arrives at the NIC
2. The NIC raises a hard IRQ to notify the CPU of the frame
3. Soft IRQ stage, wherein the kernel removes the frame from the NIC and passes it up toward the application
4. The application receives the data and handles it via standard POSIX calls such as read, recv, and recvfrom

First, do the obvious, check for errors:

– Check duplex, full/half/auto (ethtool)
– Check speed, 10/100/1000/auto
– Check for errors using ‘netstat -i’

Next, try some NIC optimizations:

* Input Traffic
* Queue Depth
* Application Call Frequency
* Change socket queues
* RSS: Receive Side Scaling
* RFS: Receive Flow Steering
* RPS: Receive Packet Steering
* XPS: Transmit Packet Steering

One option is to slow down incoming traffic. You can do this by lowering the NIC’s device weight, i.e. by reducing the value of /proc/sys/net/core/dev_weight.
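
For example (the default device weight is typically 64; 16 is only an illustrative lower value):

# cat /proc/sys/net/core/dev_weight
64
# sysctl -w net.core.dev_weight=16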

Also try increasing the physical NIC queue depth (ring buffer size) to the maximum supported, as shown below.

# ethtool --show-ring em1
Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		200
RX Mini:	0
RX Jumbo:	0
TX:		511

# ethtool --set-ring em1 rx 511

# ethtool --show-ring em1
Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511

Next, increase the frequency at which the application calls recv or read.

The socket queues can be viewed using the command below. Pruned or collapsed packets indicate queue issues.

netstat -s | grep socket
    3339 resets received for embryonic SYN_RECV sockets
    2 packets pruned from receive queue because of socket buffer overrun
    99617 TCP sockets finished time wait in fast timer
    6 delayed acks further delayed because of locked socket
    39 packets collapsed in receive queue due to low socket buffer

You can increase the socket queues by changing the values of the receive and send windows as shown below. The values are min, default, and max.

# cat /proc/sys/net/ipv4/tcp_wmem
4096	65536	16777216
# cat /proc/sys/net/ipv4/tcp_rmem
4096	87380	16777216
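
The triplets can be raised at runtime with sysctl; the maximum values below are only examples (add the same settings to /etc/sysctl.conf to make them persistent):

# sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
# sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"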

Receive-Side Scaling (RSS), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency.
RSS should be enabled when latency is a concern or whenever receive interrupt processing forms a bottleneck. RSS may be enabled by default, depending upon your NIC driver.
To check if it is enabled, look in /proc/interrupts, for instance with ‘egrep "CPU|eth0" /proc/interrupts’. If only one entry is shown, then your NIC does not support RSS.
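
The receive queues the driver exposes can also be listed directly; on a hypothetical 4-queue NIC named eth0 the output would look roughly like this:

# ls /sys/class/net/eth0/queues/
rx-0  rx-1  rx-2  rx-3  tx-0  tx-1  tx-2  tx-3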

Receive Flow Steering (RFS) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS backend to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency.
RFS is disabled by default. To enable RFS, you must edit two files: /proc/sys/net/core/rps_sock_flow_entries and /sys/class/net/<device>/queues/rx-<n>/rps_flow_cnt.
For /proc/sys/net/core/rps_sock_flow_entries, a value of 32768 is recommended for moderate server loads.
Set each /sys/class/net/<device>/queues/rx-<n>/rps_flow_cnt to rps_sock_flow_entries divided by N, where N is the number of receive queues on the device (count the rx-<n> directories under /sys/class/net/<device>/queues/).
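
For example, on a hypothetical device eth0 with 16 receive queues, each queue would get 32768/16 = 2048 flow entries:

# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
# echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
(repeat the second command for rx-1 through rx-15)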

Receive Packet Steering (RPS) is preferred over RSS, since it is the software implementation of RSS and does not require NIC support.
Each receive queue has a CPU bitmap at /sys/class/net/<device>/queues/rx-<n>/rps_cpus. The rps_cpus files use comma-delimited CPU bitmaps, so to allow a CPU to handle interrupts for a receive queue on an interface, set the bit at that CPU’s position to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to 00001111 (1+2+4+8), or f (the hexadecimal value for 15). To monitor which CPU is receiving network interrupts, look for NET_RX in ‘watch -n1 cat /proc/softirqs’.
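
For example, to let CPUs 0-3 handle receive queue rx-0 on a placeholder interface eth0:

# echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus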

Transmit Packet Steering (XPS) is a mechanism for intelligently selecting which transmit queue to use when transmitting a packet on a multi-queue device. To accomplish this, a mapping from CPU to hardware queue(s) is recorded. XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by default for SMP), and the functionality remains disabled until explicitly configured. For a network device with a single transmission queue, XPS configuration has no effect, since there is no choice in this case. To enable XPS, the bitmap of CPUs that may use a transmit queue is configured using the sysfs entry /sys/class/net/<device>/queues/tx-<n>/xps_cpus.
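
For example, to allow only CPU 2 (bitmap value 4) to use transmit queue tx-1 on a placeholder interface eth0:

# echo 4 > /sys/class/net/eth0/queues/tx-1/xps_cpus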

Some of the above material came from:
https://www.kernel.org/doc/Documentation/networking/scaling.txt and the RedHat Tuning Guide.

Filesystem Tuning

Tips for RedHat filesystem tuning.

– Optimize file system block size.
Block size is selected at mkfs time. If you plan on storing files that are mostly smaller or larger than the default block size,
then decrease or increase the block size of the filesystem accordingly at mkfs time. This wastes less space and
allows faster reads when a file fits in a single block.

# tune2fs -l /dev/mapper/vg_hv1-lv_home | grep 'Block size'
Block size:               4096 (bytes)
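
For example, to create a filesystem with a 1k block size for mostly small files (/dev/device is a placeholder, and mkfs destroys existing data):

# mkfs.ext4 -b 1024 /dev/device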

– Match filesystem geometry to storage subsystem.
During mkfs time, align the geometry of the filesystem with that of the RAID sub-system.
This might be set automatically, but it does not hurt to check. As per RedHat “For striped block devices (for example, RAID5 arrays), the stripe geometry can be specified at the time of file system creation. Using proper stripe geometry greatly enhances the performance of an ext4 file system…
For example, to create a file system with a 64k stride (that is, 16 x 4096) on a 4k-block file system, use the following command:”

# mkfs.ext4 -E stride=16,stripe-width=64 /dev/device

– External journals
If you are using a journaling filesystem such as ext4, move the journal to a faster disk than the data.
This will speed up journal writes and therefore make your writes faster overall. Journals are specified at mkfs time.

# mke2fs -O journal_dev
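
A minimal sketch of the two steps, assuming /dev/sdb1 is the fast journal device and /dev/sdc1 holds the data (both device names are placeholders): first create the journal device, then point the data filesystem at it.

# mke2fs -O journal_dev /dev/sdb1
# mkfs.ext4 -J device=/dev/sdb1 /dev/sdc1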

– Use the ‘nobarrier’ option when mounting filesystems.
Barriers are a way of ensuring that filesystem metadata is written correctly to persistent storage.
They slow down filesystems on which fsync() is used heavily.
If you disable barriers you risk data corruption in case of power loss; however, the risk might be worth it
if the data is redundant on other systems.
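
For example, to remount an existing ext4 filesystem with barriers disabled (the mount point is a placeholder):

# mount -o remount,nobarrier /home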

– Disable access time updates, using the ‘noatime’ mount option in /etc/fstab.
A file’s access time is updated every time it is read; you may not need this ‘feature’.
In RHEL, the default relatime behavior updates atime only if it is older than mtime or ctime.
Enabling noatime also enables nodiratime.
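
An example /etc/fstab entry, assuming the logical volume shown earlier is mounted on /home:

/dev/mapper/vg_hv1-lv_home  /home  ext4  defaults,noatime  1 2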

– Increase read-ahead
Read-ahead speeds up file access by pre-fetching data into the page cache.
Get the current value using ‘blockdev --getra device’.
Change the value using ‘blockdev --setra N device’.
N is the number of 512-byte sectors.

# blockdev --getra  /dev/mapper/vg_hv1-lv_root
1024
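
To double the value shown above, for example (the read-ahead value is in 512-byte sectors and purely illustrative):

# blockdev --setra 2048 /dev/mapper/vg_hv1-lv_root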

Load Balancing Algorithms

Load balancers use a number of algorithms to direct traffic. Some of the most common algorithms are listed below.

1)Random: Using a random number generator, the load balancer directs connections randomly to the web servers behind it. This type of algorithm may be used if the web servers are of similar or same hardware specifications. If connection monitoring is not enabled, then the load balancer will continue sending traffic to failed web servers.

2)Round Robin: Using a circular queue, the load balancer walks through it, sending one request per server. As with the random method, this works best when the web servers are of similar or identical hardware specifications.

3)Weighted Round Robin: Web servers or groups of web servers are assigned a static weight. For instance, new web servers that can handle more load are assigned a higher weight and older web servers are assigned a lower weight. The load balancer will send more traffic to the servers with a higher weight than to those with a lower weight. For instance, if web server ‘A’ has a weight of 3 and web server ‘B’ has a weight of 1, then web server ‘A’ will get 3 times as much traffic as web server ‘B’ (see the sketch after this list).

4)Dynamic Round Robin: Servers are not assigned a static weight; instead the weight is built dynamically based on metrics. These metrics may be generated on the servers, and the load balancer will check them. It is still round robin because if two servers end up with the same dynamic weight, the load balancer will round-robin between them.

5)Fastest: The load balancer keeps track of response time from the web server, and will prefer to send connections to those servers that respond the quickest.

6)Least Connection: Keeps track of the connections to the web servers, and prefers to send connections to the servers with the fewest connections.

7)Observed: Uses a combination of least connection and fastest algorithms. “With this method, servers are ranked based on a combination of the number of current connections and the response time. Servers that have a better balance of fewest connections and fastest response time receive a greater proportion of the connections. ”

8)Predictive: Works well in any environment. It uses the observed method, but tries to predict which servers will perform well based on the trend of their observed ranking, and sends more traffic to servers with a higher rank.
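
As a rough illustration of weighted round robin (not taken from any particular load balancer; server names and weights are made up), here is a small bash sketch that expands static weights into a request schedule:

# Illustrative only: static weighted round robin (requires bash 4 for associative arrays)
declare -A weight=( [web-a]=3 [web-b]=1 )

# Expand the weights into a schedule: each server appears weight-many times
schedule=()
for server in "${!weight[@]}"; do
    for ((i = 0; i < ${weight[$server]}; i++)); do
        schedule+=("$server")
    done
done

# Hand out 8 requests by walking the schedule in a circle;
# web-a ends up with 3 times as many requests as web-b
for ((req = 0; req < 8; req++)); do
    echo "request $req -> ${schedule[req % ${#schedule[@]}]}"
done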

For additional details read http://bit.ly/1dyGelW.

CentOS CPU Scaling

CPU scaling allows the processor to adjust its speed on demand. A CPUfreq governor defines the speed and power usage behavior of the processor. The different types of governors are:

cpufreq_performance – for heavy workloads, always uses the highest cpu frequency, cost is power
cpufreq_powersave – uses the lowest cpu frequency, provides the most power savings, cost is performance
cpufreq_ondemand – adjusts cpu frequency based on need, can save power when system is idle, while ramping up when system is not idle, cost is latency while switching
cpufreq_userspace – allows any process running as root to set the frequency, most configurable
cpufreq_conservative – similar to ondemand, however unlike the ondemand governor which switches between lowest and highest, conservative performs gradual change

To use a CPU governor, run ‘sudo yum install cpupowerutils -y’.
To view available governors use ‘cpupower frequency-info --governors’.
To load a particular governor module use ‘modprobe [governor]’, as in ‘modprobe cpufreq_ondemand’.
To enable a given governor use ‘cpupower frequency-set --governor [governor]’, as in ‘cpupower frequency-set --governor ondemand’.
To view CPU speed and policy use ‘cpupower frequency-info’.
To set a frequency use ‘cpupower frequency-set’.
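
To confirm which governor is currently active you can also read sysfs directly (the output shown assumes the ondemand governor is in use):

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand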

To view the available drivers for CPU scaling, checking ‘ls /lib/modules/[kernel version]/kernel/arch/[architecture]/kernel/cpu/cpufreq/’ is a good start.

For additional information view https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/cpufreq_governors.html.