TCP/IP Tuning

Improving Network Performance

When it comes to network performance, unless there is an explicit problem, leave the network optimization to the kernel.
If you see a network problem, then study the problem in detail and come up with a solution for that problem, and leave the rest alone.

Linux handles incoming network packets in the following way:

1. Hardware reception: the frame arrives at the NIC
2. The NIC raises a hard IRQ to tell the CPU about the frame
3. Soft IRQ stage, wherein the kernel removes the frame from the NIC and passes it up the stack to the application
4. The application receives the data and handles it via standard POSIX calls such as read, recv, recvfrom

First, do the obvious, check for errors:

– Check duplex, full/half/auto (ethtool)
– Check speed, 10/100/1000/auto
– Check for errors using ‘netstat -i’

Next, try some NIC optimizations:

* Input Traffic
* Queue Depth
* Application Call Frequency
* Change socket queues
* RSS: Receive Side Scaling
* RFS: Receive Flow Steering
* RPS: Receive Packet Steering
* XPS: Transmit Packet Steering

One option is to slow down incoming traffic by lowering the NIC’s device weight, that is, by reducing the value of /proc/sys/net/core/dev_weight.

Also try to increase the physical NIC queue depth to the maximum supported as shown below.

# ethtool --show-ring em1
Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		200
RX Mini:	0
RX Jumbo:	0
TX:		511

# ethtool --set-ring em1 rx 511

# ethtool --show-ring em1
Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
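
The check-then-raise step above can be scripted. The following is a minimal sketch, assuming the ‘ethtool --show-ring’ output keeps the layout shown above. max_rx_ring is a hypothetical helper name, and the sample text is copied from the em1 output above rather than read from a live NIC.

```shell
# max_rx_ring (hypothetical helper): read `ethtool --show-ring` output on stdin
# and print the first "RX:" value, which sits under "Pre-set maximums".
max_rx_ring() (
    awk '$1 == "RX:" { print $2; exit }'
)

# Sample input copied from the em1 output shown earlier (assumed layout).
sample='Ring parameters for em1:
Pre-set maximums:
RX:		511
RX Mini:	0
RX Jumbo:	0
TX:		511
Current hardware settings:
RX:		200'

printf '%s\n' "$sample" | max_rx_ring    # prints 511

# On a real host, as root, the same helper could feed --set-ring:
#   ethtool --set-ring em1 rx "$(ethtool --show-ring em1 | max_rx_ring)"
```

Reading the maximum from the NIC instead of hard-coding 511 keeps the command usable on interfaces with different ring limits.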

Next, increase the frequency at which the application calls recv or read.

The socket queue can be viewed using the command below. Pruned or collapsed packets indicate queue issues.

netstat -s | grep socket
    3339 resets received for embryonic SYN_RECV sockets
    2 packets pruned from receive queue because of socket buffer overrun
    99617 TCP sockets finished time wait in fast timer
    6 delayed acks further delayed because of locked socket
    39 packets collapsed in receive queue due to low socket buffer

You can increase the socket queue by changing the send and receive buffer values shown below. Each file holds three values: min, default and max.

# cat /proc/sys/net/ipv4/tcp_wmem
4096	65536	16777216
# cat /proc/sys/net/ipv4/tcp_rmem
4096	87380	16777216
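
As a rough sketch of how the max value could be raised while leaving min and default alone: with_new_max is a hypothetical helper name, and 33554432 is only an illustrative value, not a recommendation from the original text.

```shell
# with_new_max (hypothetical helper): take a "min default max" triple and a
# new max, and print the triple with only the max replaced.
with_new_max() (
    triple=$1
    new_max=$2
    # Word-split the triple; the /proc files separate the fields with tabs.
    set -- $triple
    printf '%s %s %s\n' "$1" "$2" "$new_max"
)

with_new_max "4096 87380 16777216" 33554432    # prints 4096 87380 33554432

# On a real host, as root, the result could be applied with sysctl:
#   sysctl -w net.ipv4.tcp_rmem="$(with_new_max "$(cat /proc/sys/net/ipv4/tcp_rmem)" 33554432)"
```

Changing only the max lets the kernel keep autotuning between the existing default and the new ceiling.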

Receive-Side Scaling (RSS), also known as multi-queue receive, distributes network receive processing across several hardware-based receive queues, allowing inbound network traffic to be processed by multiple CPUs. RSS can be used to relieve bottlenecks in receive interrupt processing caused by overloading a single CPU, and to reduce network latency.
RSS should be enabled when latency is a concern or whenever receive interrupt processing forms a bottleneck. RSS may be enabled by default, depending upon your NIC driver.
To check if it is enabled, look in /proc/interrupts. For instance: ‘egrep "CPU|eth0" /proc/interrupts’. If only one entry is shown, then your NIC does not support RSS.

Receive Flow Steering (RFS) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency. Where RPS forwards packets based solely on queue length, RFS uses the RPS backend to calculate the most appropriate CPU, then forwards packets based on the location of the application consuming the packet. This increases CPU cache efficiency.
RFS is disabled by default. To enable RFS, you must edit two files: /proc/sys/net/core/rps_sock_flow_entries and /sys/class/net//queues/rx-queue/rps_flow_cnt.
For /proc/sys/net/core/rps_sock_flow_entries, a value of 32768 is recommended for moderate server loads.
For /sys/class/net//queues/rx-queue/rps_flow_cnt, use a value of rps_sock_flow_entries/N, where N is the number of receive queues on the device. The number of receive queues is defined in the file /sys/class/net//queues/rx-0/rps_cpus.
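
The per-queue value is simple integer division. A minimal sketch, where rps_flow_cnt is a hypothetical helper name and the queue count of 16 is an assumed example:

```shell
# rps_flow_cnt (hypothetical helper): divide rps_sock_flow_entries evenly
# across the receive queues of a device.
rps_flow_cnt() (
    entries=$1
    queues=$2
    echo $(( entries / queues ))
)

rps_flow_cnt 32768 16    # prints 2048
```

With the recommended 32768 entries and an assumed 16 receive queues, each queue's rps_flow_cnt would be set to 2048.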

Receive Packet Steering (RPS) is sometimes preferred over RSS, since it is a software implementation of RSS and does not require NIC support.
Check /sys/class/net//queues/rx-0/rps_cpus to see how many queues are configured. The rps_cpus files use comma-delimited CPU bitmaps. Therefore, to allow a CPU to handle interrupts for the receive queue on an interface, set the value of its position in the bitmap to 1. For example, to handle interrupts with CPUs 0, 1, 2, and 3, set the value of rps_cpus to 00001111 (1+2+4+8), or f (the hexadecimal value for 15). To monitor which CPU is receiving network interrupts, look for NET_RX in ‘watch -n1 cat /proc/softirqs’.
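
The bitmap arithmetic above can be sketched as a small function. cpus_to_mask is a hypothetical helper name, and the sketch assumes few enough CPUs that the mask fits in a single shell integer.

```shell
# cpus_to_mask (hypothetical helper): print the hexadecimal bitmap that has
# one bit set for each CPU number given as an argument.
cpus_to_mask() (
    mask=0
    for cpu in "$@"; do
        # CPU 0 is the least significant bit, CPU 1 the next one, and so on.
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
)

cpus_to_mask 0 1 2 3    # prints f
```

The printed value is what would be written to the rps_cpus file of a receive queue, as root.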

Transmit Packet Steering is a mechanism for intelligently selecting which transmit queue to use when transmitting a packet on a multi-queue
device. To accomplish this, a mapping from CPU to hardware queue(s) is recorded. XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). The functionality remains disabled until explicitly configured. For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. To enable XPS, the bitmap of CPUs that may use a transmit
queue is configured using the sysfs file entry: /sys/class/net//queues/tx-/xps_cpus

Some of the above material came from:
https://www.kernel.org/doc/Documentation/networking/scaling.txt and also RedHat Tuning Guide.

Filesystem Tuning

Tips for RedHat filesystem tuning.

– Optimize file system block size.
Block size is selected at mkfs time. If you plan on storing mostly smaller or larger files than the default block size,
then decrease or increase the block size of the filesystem during mkfs time. This will allow for less wasted space and
also for faster reads when the file size fits in a single block.

# tune2fs -l /dev/mapper/vg_hv1-lv_home | grep 'Block size'
Block size:               4096 (bytes)

– Match filesystem geometry to storage subsystem.
During mkfs time, align the geometry of the filesystem with that of the RAID sub-system.
This might be set automatically, but it does not hurt to check. As per RedHat “For striped block devices (for example, RAID5 arrays), the stripe geometry can be specified at the time of file system creation. Using proper stripe geometry greatly enhances the performance of an ext4 file system…
For example, to create a file system with a 64k stride (that is, 16 x 4096) on a 4k-block file system, use the following command:”

# mkfs.ext4 -E stride=16,stripe-width=64 /dev/device
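
The two numbers in that command follow from the RAID geometry. A minimal sketch, assuming a 64 KiB chunk size and 4 data-bearing disks (the disk count is an assumption chosen to match stripe-width=64); ext4_stripe_opts is a hypothetical helper name.

```shell
# ext4_stripe_opts (hypothetical helper): derive the -E options for mkfs.ext4.
#   stride       = RAID chunk size / filesystem block size
#   stripe-width = stride * number of data-bearing disks
ext4_stripe_opts() (
    chunk_kib=$1
    block_bytes=$2
    data_disks=$3
    stride=$(( chunk_kib * 1024 / block_bytes ))
    stripe_width=$(( stride * data_disks ))
    echo "stride=$stride,stripe-width=$stripe_width"
)

ext4_stripe_opts 64 4096 4    # prints stride=16,stripe-width=64
```

For RAID5, the number of data-bearing disks is the total number of disks minus one.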

– External journals
If you are using a journaling filesystem such as ext4, move the journal to a faster disk than the one holding the data.
This will speed up journal writes and make your writes faster. Journals are specified at mkfs time.

# mke2fs -O journal_dev

– Use the ‘nobarrier’ option when mounting filesystems.
Barriers are a way of ensuring that filesystem metadata is written correctly to persistent storage.
They slow down workloads that call fsync() excessively.
If you disable barriers you risk data corruption in case of power loss. However, the risk might be worth it
if the data is redundant on other systems.

– Disable access time, using the ‘noatime’ mount option in /etc/fstab.
File access time is updated every time a file is read; you may not need this ‘feature’.
In RHEL, relatime behavior updates atime only if atime is older than mtime or ctime.
Enabling noatime also enables nodiratime.

– Increased read-ahead support
Read-ahead speeds up file access by pre-fetching data into the page cache.
Get the current value using ‘blockdev --getra device’.
Change the value using ‘blockdev --setra N device’.
N is the number of 512-byte sectors.

# blockdev --getra  /dev/mapper/vg_hv1-lv_root
1024
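
Since the value is counted in 512-byte sectors, it helps to convert it to and from KiB. A minimal sketch with two hypothetical helper names:

```shell
# Convert a read-ahead value between 512-byte sectors and KiB.
# Both helper names are hypothetical, chosen for this example.
ra_sectors_to_kib() (
    echo $(( $1 * 512 / 1024 ))
)

ra_kib_to_sectors() (
    echo $(( $1 * 1024 / 512 ))
)

ra_sectors_to_kib 1024    # prints 512
ra_kib_to_sectors 4096    # prints 8192
```

So the 1024 reported above is a 512 KiB read-ahead, and a 4 MiB read-ahead would be requested with ‘blockdev --setra 8192 device’.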

TCP State Diagram

TCP goes through different states. The diagram below, which is from Richard Stevens’ TCP/IP Illustrated, Volume 1, should help in understanding the states that TCP goes through.

[Figure: TCP State Diagram]

TCP Header

TCP header is usually 20 bytes unless options are present. Maximum header length is 60 bytes. A typical TCP header contains the following:

– Source Port (16 bits)
– Destination Port (16 bits)
– Sequence Number (32 bits)
– Acknowledgement Number (32 bits) **
– Header Length (4 bits)
– Reserved (4 bits)
– CWR
– ECE **
– URG
– ACK **
– PSH
– RST
– SYN
– FIN
– Window size (16 bits) **
– TCP Checksum (16 bits)
– Urgent Pointer (16 bits)
– Options (variable)

The items with ** refer to data flowing from receiver to sender.
The rest of the items are from the sender to the receiver.

Source port and destination port are the ports used in the communications.

Sequence number identifies the byte in the stream of data. TCP numbers each byte of data. The sequence number is a 32-bit unsigned integer that wraps around. The initial sequence number is often picked randomly.

Acknowledgement number is the next byte that the receiver is expecting.

Header length is a 4-bit field that counts 32-bit words, so the header is between 20 bytes (a value of 5) and 60 bytes (a value of 15).

CWR – Congestion Window Reduced (the sender reduced its sending rate)
ECE – ECN Echo (the sender received an earlier congestion notification)

URG, ACK, PSH, RST, SYN and FIN are explained in the TCP Flags post in my earlier blog.

Window size is the number of bytes, starting with the acknowledgement number, that the receiver is willing to accept. Since it’s 16 bits, it is limited in size to 65,535 bytes.

TCP Checksum covers both the header and the data.

Urgent pointer is only valid if URG bit is set. This “pointer” is a positive offset that must be added to the Sequence Number field of the segment to yield the sequence number of the last byte of urgent data.

Options – The most common option is MSS, or maximum segment size.
Each end of a connection normally specifies this option on the first segment it sends (the ones with the SYN bit field set to establish the connection).
The MSS option specifies the maximum-size segment that the sender of the option is willing to receive in the reverse direction.

(Some of the material above is from Richard Stevens’ book TCP/IP Illustrated, Volume 1)

IP Header Explained

The IP header is a minimum of 20 bytes and a maximum of 60 bytes.
Maximum size is 65,535 bytes for IP header + data.

– Version field (4 bits) for IPv4 or IPv6
– IHL or Internet Header Length (4 bits), it is the number of 32-bit words in the header
– DS field, (6 bits) or Differentiated Service
– ECN (2 bits) explicit congestion notification
– Total length (16-bit) indicates total length of IP datagram
– Identification (16 bits) – Helps identify each datagram, counter based, very important for fragmentation.
– Flags (3 bits) – Used to indicate fragmentation.
– Fragment offset (13 bits) – Used to indicate offset of fragmentation.
– TTL (8 bits) upper limit of routers the datagram can pass through. Normally set at 64, although 128 or 255 is also common.
– Protocol (8 bits) – Specifies protocol encapsulated. 6 is TCP and 17 is UDP.
– Header Checksum (16 bits) – Checksum of header only and not payload.
– Source IP (32 bit)
– Destination IP (32 bit)
– Options (Up to 320 bits/40 bytes)
– IP Data if any, up to 65,515 bytes
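
The Version and IHL fields share the first byte of the header, so both can be recovered with a shift and a mask. A minimal sketch, where ip_first_byte is a hypothetical helper name and the byte is passed as a decimal number:

```shell
# ip_first_byte (hypothetical helper): split the first byte of an IP header
# into the version (high 4 bits) and the IHL (low 4 bits, in 32-bit words).
ip_first_byte() (
    byte=$1
    version=$(( byte >> 4 ))
    ihl_words=$(( byte & 15 ))
    # Each IHL unit is one 32-bit word, that is 4 bytes.
    echo "version=$version header_bytes=$(( ihl_words * 4 ))"
)

ip_first_byte 69    # 69 is 0x45; prints version=4 header_bytes=20
ip_first_byte 79    # 79 is 0x4f; prints version=4 header_bytes=60
```

The two calls cover the minimum and maximum header sizes listed above: an IHL of 5 gives 20 bytes and an IHL of 15 gives 60 bytes.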

TCP Flags Explained

TCP Flags

TCP has six flags that can help you troubleshoot a connection. The flags are:

U – URG
A – ACK
P – PSH
R – RST
S – SYN
F – FIN

When using tcpdump command to troubleshoot network connections, you can view TCP conversations with these flags as follows:

# tcpdump 'tcp[13] & 32 != 0' #URG
# tcpdump 'tcp[13] & 16 != 0' #ACK
# tcpdump 'tcp[13] & 8  != 0' #PSH
# tcpdump 'tcp[13] & 4  != 0' #RST
# tcpdump 'tcp[13] & 2  != 0' #SYN
# tcpdump 'tcp[13] & 1  != 0' #FIN

Another way of expressing the values is ‘tcp-fin, tcp-syn, tcp-rst, tcp-push, tcp-ack, tcp-urg’.

# tcpdump 'tcp[tcpflags] & tcp-urg != 0' #URG
# tcpdump 'tcp[tcpflags] & tcp-ack != 0' #ACK
# tcpdump 'tcp[tcpflags] & tcp-push != 0' #PSH
# tcpdump 'tcp[tcpflags] & tcp-rst != 0' #RST
# tcpdump 'tcp[tcpflags] & tcp-syn != 0' #SYN
# tcpdump 'tcp[tcpflags] & tcp-fin != 0' #FIN

A mnemonic to remember the above is ‘Unskilled Attackers Pester Real Security Folks’.
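
The masks used in the tcp[13] filters are the bit values of the six flags: URG is 32, ACK is 16, PSH is 8, RST is 4, SYN is 2 and FIN is 1. A minimal sketch of a decoder for that byte, where decode_tcp_flags is a hypothetical helper name and the byte is passed as a decimal number:

```shell
# decode_tcp_flags (hypothetical helper): print the names of the flags set
# in the value of tcp[13], highest bit first.
decode_tcp_flags() (
    byte=$1
    out=""
    for pair in 32:URG 16:ACK 8:PSH 4:RST 2:SYN 1:FIN; do
        mask=${pair%%:*}
        name=${pair##*:}
        if [ $(( byte & mask )) -ne 0 ]; then
            # Append the name, with a comma only if something is already there.
            out="${out:+$out,}$name"
        fi
    done
    echo "$out"
)

decode_tcp_flags 18    # prints ACK,SYN
decode_tcp_flags 17    # prints ACK,FIN
```

A value of 18 is what the second packet of a three-way handshake carries, which is why ‘tcp[13] = 18’ is a common filter for SYN-ACK packets.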

URG flag is used to indicate that the packet should be prioritized over other packets for processing.
This flag is not used often. I can only think of telnet that uses it.

#  tcpdump 'tcp[13] & 32 != 0'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:29:47.741814 IP destination > source.39697: Flags [P.U], seq 2342494158:2342494159, ack 1232430662, win 31, urg 1, options [nop,nop,TS val 877959588 ecr 703327131], length 1
18:29:51.293145 IP destination.telnet > source.39697: Flags [P.U], seq 12:13, ack 5, win 31, urg 1, options [nop,nop,TS val 877963140 ecr 703330673], length 1

SYN is used for starting a connection.
ACK is used to acknowledge packets received.
PSH is used to ask the receiving end not to buffer packets, but to process them as soon as they are received.
RST is used to abort a connection, for example when no service is listening on the given port.
FIN is used to close a connection gracefully.

# tcpdump -n -v 'tcp[tcpflags] & (tcp-rst) != 0'
tcpdump: listening on br0, link-type EN10MB (Ethernet), capture size 65535 bytes
18:44:11.087487 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    destination.ms-sql-s > source.x11: Flags [R.], cksum 0xb12d (correct), seq 0, ack 1435238401, win 0, length 0

More on TLS

TLS uses asymmetric and symmetric encryption. Asymmetric encryption is used for the initial communication, followed by faster symmetric key encryption.

Symmetric ciphers are stream based or block based. Stream-based ciphers encrypt data one bit or byte at a time. Block-based ciphers take a fixed number of bits and encrypt them together as one block. A few symmetric key encryption algorithms are:

– AES
– Blowfish
– RC4
– DES
– 3DES

A few asymmetric key encryption algorithms are:

– DH
– RSA
– Elliptic Curve
– DSS/DSA

A couple of message digest (MD) algorithms are:

– MD5
– SHA

If you want to see which algorithms an SSL server supports, use the tool ‘sslscan’, which can be installed using ‘yum install sslscan -y’.
You might have to enable the EPEL repository to install it using yum. After installation, if you run ‘sslscan www.google.com:443’ you will see a lot of very useful output. First you will see the algorithms that sslscan supports, followed by the ones that www.google.com accepts. The most important section is the one below:


Preferred Server Cipher(s):
SSLv2 0 bits (NONE)
SSLv3 128 bits ECDHE-RSA-RC4-SHA
TLSv1 128 bits ECDHE-RSA-RC4-SHA
TLS11 128 bits ECDHE-RSA-AES128-SHA
TLS12 128 bits ECDHE-RSA-AES128-GCM-SHA256

This shows that www.google.com accepts SSLv3, TLSv1, TLSv1.1 and TLSv1.2, but not SSLv2. The preferred cipher suite is ECDHE-RSA-RC4-SHA for SSLv3 and TLSv1, and an ECDHE-RSA-AES128 suite for TLSv1.1 and TLSv1.2.
ECDHE is Elliptic Curve Diffie-Hellman Ephemeral, which supports PFS or Perfect Forward Secrecy.
Normally with RSA key exchange, the secret from which the symmetric key is derived is sent encrypted with the server’s public key during the SSL handshake.
This means that if the server’s private key is compromised, then an attacker can recover the symmetric key of any recorded session.
With ECDHE and PFS, a new key exchange takes place for every session, so even if one key is compromised, the other sessions will not be impacted.

You can configure Apache to prefer cipher suites, see https://httpd.apache.org/docs/current/ssl/ssl_intro.html and https://httpd.apache.org/docs/2.2/mod/mod_ssl.html#sslciphersuite.

Extending VM LVM disk

I have CentOS VMs running in VMware, and one of them ran out of disk space. The disk was originally 85GB, and I tried to increase it to 345GB. In an effort to increase the disk size, I tried:

  1. Shut down the VM
  2. Increase the size of the disk
  3. Power on the VM
  4. Use parted to resize the partition

Unfortunately, step 4 did not work; parted complained that it could not detect the filesystem:

# parted
GNU Parted 2.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: VMware Virtual disk (scsi)
Disk /dev/sda: 344GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  538MB   537MB   primary  ext4         boot
 2      538MB   85.9GB  85.4GB  primary               lvm

(parted) resize
WARNING: you are attempting to use parted to operate on (resize) a file system.
parted's file system manipulation code is not as robust as what you'll find in
dedicated, file-system-specific packages like e2fsprogs.  We recommend
you use parted only to manipulate partition tables, whenever possible.
Support for performing most operations on most types of file systems
will be removed in an upcoming release.
Partition number? 2
Start?  [538MB]?
End?  [85.9GB]? 344GB
Error: Could not detect file system.

It looks like parted does not like LVM. The next thing that I did was to add a new partition, then use ‘pvcreate’ to create a new physical volume, and then add that to the volume group, as seen below:

# parted
GNU Parted 2.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: VMware Virtual disk (scsi)
Disk /dev/sda: 344GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  538MB   537MB   primary  ext4         boot
 2      538MB   85.9GB  85.4GB  primary               lvm

(parted) mkpart
Partition type?  primary/extended? primary
File system type?  [ext2]?
Start? 85.9
End? 344GB
Warning: You requested a partition from 85.9MB to 344GB.
The closest location we can manage is 85.9GB to 344GB.
Is this still acceptable to you?
Yes/No? yes
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda (Device or resource busy).  As a result, it may not reflect all of your changes until after reboot.
(parted) print
Model: VMware Virtual disk (scsi)
Disk /dev/sda: 344GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  538MB   537MB   primary  ext4         boot
 2      538MB   85.9GB  85.4GB  primary               lvm
 3      85.9GB  344GB   258GB   primary

(parted) quit

#ls -l /dev/sda
sda   sda1  sda2  sda3

# pvcreate /dev/sda3
  dev_is_mpath: failed to get device for 8:3
  Physical volume "/dev/sda3" successfully created
  
# pvs
  PV         VG   Fmt  Attr PSize   PFree
  /dev/sda2  vg0  lvm2 a--   79.50g      0
  /dev/sda3       lvm2 a--  240.00g 240.00g
  
# vgextend  vg0 /dev/sda3
  Volume group "vg0" successfully extended

# vgs
  VG   #PV #LV #SN Attr   VSize   VFree
  vg0    2   4   0 wz--n- 319.49g 240.00g

How to hire talent

Introduction

Having been on both sides of a site reliability engineer interview, I have come to realize that interviewing is as much art as it is science. Although there are companies which have tried to make interviewing a numbers game, and to abstract out the human element, interviewing continues to be an imperfect process. Almost anything can go wrong in an interview, from both sides. From a company’s perspective, you should not consider a candidate confirmed until he/she shows up for work on the first day. In this article, I list a general format that, if adhered to, may prove useful for interviewing software engineers or site reliability engineers.

Format of Interview

Once a candidate has been identified for a role, or has been “sourced” through different means such as LinkedIn, referrals, or otherwise, a recruiter should try to arrange a phone conversation with the candidate. Once a phone conversation has been arranged, start with the steps below.

– Over the phone, the recruiter asks some basic technical questions, passing of which is mandatory. The recruiter also judges if the candidate is a good fit for the position based on experience and job description. The technical questions a recruiter asks should be limited to fewer than half a dozen, and should be straightforward ones that do not require any technical expertise on the recruiter’s part. A few examples for an SRE (site reliability engineer) position are “Which command do you use to view network traffic on a Linux host?”, “Describe the process of opening a TCP port on a Linux host”, “What is the difference between a hard link and a soft link on a filesystem?”.

– There may be a 45 minute or so technical phone interview with an engineer if a candidate passes the initial phone screen with the recruiter. In this technical phone interview, ask additional questions on various topics, such as networking, filesystems, Linux internals, processes, etc., to get a better idea of the candidate’s strengths.
After a successful technical phone interview, the next step should be to schedule an onsite interview.

– Onsite interview is generally held by 4 or 5 engineers, and may be broken up in the following format:
* Unix/Linux system internals
* General troubleshooting questions
* Scripting/programming
* Project management
* Scalable architecture

You should try to avoid having different folks ask the same question. You can also split the 5 interviews across a couple of days if you like. Try to avoid non-technical questions; you can gauge other aspects of the person from how they handle the technical ones.

What to look for in a candidate

From a company’s perspective, below are some qualities you should look for in a candidate:
– Smart
– Able to get things done
– Self-directed
– Diverse
– Takes Initiative
– Solves Problems on their own

The question you should be asking yourself is “Is this an awesome person who I want to work with?”. The idea is to hire across the board with consistent standards. Your job is not to break the candidate, it is to try to figure out what he/she does best and where they are lacking in strength.

During the Interview

Before the interview starts, try to make the candidate comfortable. This may sound obvious, but it is not. I remember being in 5 hour interviews without a single break, feeling tired and thirsty and unable to concentrate. Give the candidate frequent breaks for hydration if needed. You want to bring out the best in a candidate. To do this:
– Come with some hints
– Give candidates opportunities to succeed
– Your job is not to hand out a test
– Don’t get stuck on a problem
– Make the interview enjoyable
– Don’t make a candidate feel miserable
– If they do not feel good, they won’t be a good referral
– Try to limit general questions, such as definitions, which can be easy to memorize and do not give an idea of depth of a topic.

Good interview questions

– Start with easy questions, work your way to harder problems
– Don’t look for specific answers, as a question may have different valid answers
– Ask 2 or 3 questions
– One question is ok if it has multiple areas to dive into

Comparatively, not so good interview feedback:
– Is generally incomplete
– Don’t include pictures of the candidate

What not to ask

Don’t ask about:
– Religion
– Age
– Disability
– Genetic info
– National origin
– Pregnancy
– Race and color
– Gender

Asking questions about the above can result in legal complications. If a candidate does not get hired, he/she may turn around and sue your company if you asked him/her about any of the above. He/she may claim that they did not get hired because they belong to a particular race or religion.

Limit questions on:
– Textbook knowledge questions
– Culture questions
– Communications questions

You can pick up on culture fit and communication questions based on the technical questions you ask.

Providing interview feedback

Good interview feedback is factual. At the least, it should:
– List your questions
– List the answers as given by the candidate
– Include the interviewer’s analysis
– Transcribe code
– Include a high level summary
– Evaluate culture fit
– Evaluate role level knowledge
– Evaluate general cognitive ability
– General cognitive ability

About recordings:
– Avoid audio, video recording

How do you conduct your interviews?

How to setup a BIND DNS Server

The Internet runs mostly on ISC’s BIND DNS server. In this article I will explain how to set up a simple BIND server on a Linux box.
The basic steps involved in setting up BIND are:

– Download BIND from ISC’s website
When downloading BIND, you might be better off if you pick an ESV, or an extended support version. As of the writing of this article, 9.9.5 is the latest ESV.
ESV versions are supported for at least 3 years by ISC. You can check the latest ESV version here https://www.isc.org/downloads/. If you are wondering why not to use the vendor provided BIND version, it’s because the vendor may not be using the ESV version, and may also lag behind in patches. Using the latest ESV from ISC will get you the most stable version for production use.

– Compile BIND
You will need gcc and make to compile BIND. You can run ‘sudo yum groupinstall "Development Tools" -y’ on your CentOS or RedHat box to install GCC. In order to compile BIND, after downloading it, untar it, and then try the following from the directory you have untarred the distribution in.

$sudo yum groupinstall "Development Tools" -y
$./configure --prefix=/opt/bind-9.9.5 --enable-newstats --with-libxml2
$make
#make install

The above will install BIND in /opt/bind-9.9.5, after that you can link /opt/bind to /opt/bind-9.9.5 with “sudo ln -s /opt/bind-9.9.5 /opt/bind”.
I have enabled newstats and with libxml because this allows me to view BIND query stats in a format which I find easy to use.

– Configure a master server
Now that you have downloaded BIND, and compiled it, as well as installed it, we will move onto configuring it. The basic steps in configuration are as follows:

I have placed extensive comments in the conf file and in the zone files; please review them in order to understand what information to place in the files.

A. Create a named.conf file in the etc directory of the chroot. See https://github.com/syedaali/configs/blob/master/bind-master-named.conf for a sample with comments.
B. Install a root hints file, example is here https://github.com/syedaali/configs/blob/master/bind-root.hints.
C. Create a forward lookup zone file, sample is here https://github.com/syedaali/configs/blob/master/bind-example.com.zone.
D. Create a reverse lookup zone file, sample is here https://github.com/syedaali/configs/blob/master/bind-10.1.1.rev.
E. Create a loopback forward and reverse zone. Forward sample is here https://github.com/syedaali/configs/blob/master/bind-master.localhost.
Sample of reverse lookup for localhost is here https://github.com/syedaali/configs/blob/master/bind-127.0.0.rev.

– Install one slave server
A. Compile BIND as seen in above steps for master
B. Install BIND in /opt/bind-9.9.5
C. Link to /opt/bind
D. Repeat steps B through E from master, with the exception that in named.conf you specify slave instead of master for the zone type.
Sample slave named.conf is here https://github.com/syedaali/configs/blob/master/bind-slave-named.conf.

– Configure your clients to point to the slave server
It is generally a good idea to avoid pointing your BIND clients to the BIND master. Instead create two slaves, and point the clients to the slaves. For instance if your master has IP address 10.1.1.10, and you have two BIND slaves, with IP 10.1.1.11 and 10.1.1.12, then in your client’s /etc/resolv.conf do the following:

$cat /etc/resolv.conf
domain example.com
nameserver 10.1.1.11
nameserver 10.1.1.12

In future blogs I will specify how to monitor BIND. Do share your experience with BIND setup here.

High Availability using Keepalived

HAProxy (http://haproxy.1wt.eu/) is a popular solution for load balancing servers. In a typical load balanced configuration, there may be a number of web servers behind a pair of HAProxy load balancers. The question arises, how do you make the HAProxy servers themselves highly available? One way is to use ‘keepalived’ (http://www.keepalived.org/). In order to use this solution, you need at least two HAProxy servers. On both of them install keepalived as explained below. Both servers will share a floating IP, for which you can create a DNS record and give that name to your clients. For instance www.example.com may have IP 10.1.1.30, which is the floating IP between the two HAProxy servers. Clients will attempt to connect to 10.1.1.30. The IP is owned by whichever HAProxy server is currently the master. If that server fails, then the backup server will start to issue gratuitous ARP responses for the same IP of 10.1.1.30, and the requests to the web servers will then go through the backup HAProxy server, which has now become the primary.

– Install keepalived

$sudo yum install keepalived -y

– Set up two hosts with the following IP addresses. The floating IP address will be assigned to the virtual router instance in the config later.

10.1.1.10 is h1.example.com (HAProxy server 1)
10.1.1.20 is h2.example.com (HAProxy server 2)
10.1.1.30 is floating IP (shared between the two servers)
10.1.1.100 is SMTP server

– Sample basic config file for master. Note the use of ‘state MASTER’ and also ‘priority 101’

! Configuration File for keepalived

global_defs {
   notification_email {
     admin@example.com
   }
   notification_email_from keepalived@example.com
   smtp_server 10.1.1.100
   smtp_connect_timeout 30
   router_id LVS_DEVEL
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.1.1.30
    }
}

– Set up the backup server. Sample basic config file for backup is below. Note the use of ‘state BACKUP’ and also ‘priority 100’. Priority should be lower on backup.

! Configuration File for keepalived

global_defs {
   notification_email {
     admin@example.com
   }
   notification_email_from keepalived@example.com
   smtp_server 10.1.1.100
   smtp_connect_timeout 30
   router_id LVS_DEVEL
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.1.1.30
    }
}

– Start keepalived on both master and backup.

$sudo service keepalived start

– Verify that keepalived is running. You can do this by checking the IP address on the MASTER.

On 10.1.1.10, MASTER, floating IP is assigned to eth0 when MASTER is up.

# ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:bd:4c:c7 brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.10/24 brd 10.1.1.255 scope global eth0
    inet 10.1.1.30/32 scope global eth0

As you can see the 10.1.1.30 IP is with the master. Check the IP of the BACKUP HAProxy host.
On BACKUP, we have only the BACKUP IP, and not the floating IP.

# ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:bd:7a:5f brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.20/24 brd 10.1.1.255 scope global eth0

– You can also use tcpdump and verify that the master is sending VRRP advertisements.
VRRP uses multicast to keep track of state. You can view multicast traffic using tcpdump as shown below.

#tcpdump net 224.0.0.0/4

Host h1.example.com is the master.

15:49:38.342468 IP h1.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
15:49:39.342767 IP h1.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
15:49:40.343062 IP h1.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
15:49:41.343371 IP h1.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20

– To test the failover, turn off keepalived on the master using ‘service keepalived stop’.
Once I stop keepalived on the MASTER, I see the floating IP has now been assigned to the BACKUP.

# ip addr show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:bd:7a:5f brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.20/24 brd 10.1.1.255 scope global eth0
    inet 10.1.1.30/32 scope global eth0

– If you keep tcpdump running, you should see that the BACKUP host is now sending out the VRRP advertisements. Host h2.example.com is now the master.

15:49:45.953584 IP h2.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
15:49:46.953889 IP h2.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
15:49:47.954202 IP h2.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
15:49:48.954519 IP h2.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
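
If you save or pipe the capture, the sender of the most recent advertisement tells you which host currently acts as master. This is a small sketch that assumes the tcpdump line layout shown above (timestamp, "IP", sender, ...); the function name current_master is hypothetical.

```shell
#!/bin/sh
# current_master
# Reads tcpdump text output on stdin and prints the sender of the last
# VRRP advertisement seen. In the line format shown in this article the
# sender is the third whitespace-separated field.
current_master() {
    awk '/VRRPv[0-9]+, Advertisement/ { sender = $3 }
         END { if (sender != "") print sender }'
}

# Example with a bounded capture (-c stops tcpdump after 5 packets):
#   tcpdump -c 5 net 224.0.0.0/4 | current_master
```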

– In /var/log/messages on the BACKUP HAProxy host, you should see it taking over the floating IP. Host h2.example.com is the master now.

Mar  8 15:49:44 h2.example.com Keepalived_vrrp[4652]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar  8 15:49:45 h2.example.com Keepalived_vrrp[4652]: VRRP_Instance(VI_1) Entering MASTER STATE
Mar  8 15:49:45 h2.example.com Keepalived_vrrp[4652]: VRRP_Instance(VI_1) setting protocol VIPs.
Mar  8 15:49:45 h2.example.com Keepalived_vrrp[4652]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 10.1.10.30
Mar  8 15:49:45 h2.example.com Keepalived_healthcheckers[4651]: Netlink reflector reports IP 10.1.10.30 added
Mar  8 15:49:46 h2.example.com ntpd[2974]: Listen normally on 4 eth0 10.1.10.30 UDP 123
Mar  8 15:49:46 h2.example.com ntpd[2974]: peers refreshed
Mar  8 15:49:50 h2.example.com Keepalived_vrrp[4652]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 10.1.10.30
Mar  8 15:49:52 h2.example.com kernel: device eth0 left promiscuous mode

– Since the test worked, re-enable keepalived on the MASTER host using ‘service keepalived start’ and watch in tcpdump as host h1.example.com goes back to being the master.

16:16:41.841635 IP h1.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
16:16:42.842722 IP h1.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
16:16:43.843847 IP h1.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
16:16:44.844982 IP h1.example.com > vrrp.mcast.net: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20

– You can also confirm in /var/log/messages on h1.example.com that it is the master again.

Mar  8 15:50:18 h1.example.com Keepalived_vrrp[4324]: Kernel is reporting: interface eth0 UP
Mar  8 15:50:18 h1.example.com Keepalived_vrrp[4324]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar  8 15:50:19 h1.example.com Keepalived_vrrp[4324]: VRRP_Instance(VI_1) Entering MASTER STATE
Mar  8 15:50:19 h1.example.com Keepalived_vrrp[4324]: VRRP_Instance(VI_1) setting protocol VIPs.
Mar  8 15:50:19 h1.example.com Keepalived_healthcheckers[4323]: Netlink reflector reports IP 10.1.10.30 added
Mar  8 15:50:19 h1.example.com Keepalived_vrrp[4324]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 10.1.10.30
Mar  8 15:50:21 h1.example.com Keepalived_vrrp[4324]: Netlink reflector reports IP 10.1.10.10 added
    

– Once you return h1.example.com to being the MASTER by starting keepalived, /var/log/messages on h2.example.com shows it giving up the floating IP.
After I bring h1.example.com back online, h2.example.com becomes the backup again.

Mar  8 15:50:18 h2.example.com Keepalived_vrrp[4652]: VRRP_Instance(VI_1) Received higher prio advert
Mar  8 15:50:18 h2.example.com Keepalived_vrrp[4652]: VRRP_Instance(VI_1) Entering BACKUP STATE
Mar  8 15:50:18 h2.example.com Keepalived_vrrp[4652]: VRRP_Instance(VI_1) removing protocol VIPs.
Mar  8 15:50:18 h2.example.com Keepalived_healthcheckers[4651]: Netlink reflector reports IP 10.1.10.30 removed

– You can further verify your configuration by pinging the floating IP from another host during the above exercise.
Only a few packets are lost while the floating IP moves from the MASTER to the BACKUP and back again.

$ ping 10.1.10.30
PING 10.1.10.30 (10.1.10.30): 56 data bytes
64 bytes from 10.1.10.30: icmp_seq=0 ttl=61 time=90.486 ms
64 bytes from 10.1.10.30: icmp_seq=1 ttl=61 time=89.514 ms
64 bytes from 10.1.10.30: icmp_seq=2 ttl=61 time=87.989 ms
64 bytes from 10.1.10.30: icmp_seq=3 ttl=61 time=98.162 ms
64 bytes from 10.1.10.30: icmp_seq=4 ttl=61 time=87.107 ms
64 bytes from 10.1.10.30: icmp_seq=5 ttl=61 time=89.163 ms
64 bytes from 10.1.10.30: icmp_seq=6 ttl=61 time=88.792 ms
64 bytes from 10.1.10.30: icmp_seq=7 ttl=61 time=89.156 ms
Request timeout for icmp_seq 8                                <------ At this point I stopped the keepalived on MASTER
Request timeout for icmp_seq 9
64 bytes from 10.1.10.30: icmp_seq=10 ttl=61 time=88.386 ms   <------ BACKUP has now started to respond to ping for floating IP
64 bytes from 10.1.10.30: icmp_seq=11 ttl=61 time=91.164 ms
64 bytes from 10.1.10.30: icmp_seq=12 ttl=61 time=88.215 ms
64 bytes from 10.1.10.30: icmp_seq=13 ttl=61 time=88.457 ms
64 bytes from 10.1.10.30: icmp_seq=14 ttl=61 time=87.170 ms
64 bytes from 10.1.10.30: icmp_seq=15 ttl=61 time=120.544 ms
64 bytes from 10.1.10.30: icmp_seq=16 ttl=61 time=91.861 ms
Request timeout for icmp_seq 17                               <------ I restarted keepalived on MASTER
64 bytes from 10.1.10.30: icmp_seq=18 ttl=61 time=89.658 ms
64 bytes from 10.1.10.30: icmp_seq=19 ttl=61 time=90.201 ms
64 bytes from 10.1.10.30: icmp_seq=20 ttl=61 time=88.008 ms
64 bytes from 10.1.10.30: icmp_seq=21 ttl=61 time=88.369 ms
^C
--- 10.1.10.30 ping statistics ---
22 packets transmitted, 19 packets received, 13.6% packet loss
round-trip min/avg/max/stddev = 87.107/91.179/120.544/7.315 ms
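
The loss figure in the summary follows directly from the counters: 3 of the 22 probes went unanswered (two during the failover, one during the failback). A short sketch of that arithmetic, with a made-up helper name loss_pct:

```shell
#!/bin/sh
# loss_pct SENT RECEIVED
# Prints the packet loss percentage with one decimal place.
loss_pct() {
    awk -v sent="$1" -v recv="$2" \
        'BEGIN { printf "%.1f\n", (sent - recv) * 100 / sent }'
}

loss_pct 22 19   # prints 13.6
```

At the default ping interval of one second, each lost probe corresponds to roughly one second in which the floating IP did not answer.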

The above is a simple example of IP based monitoring. You can also do application based monitoring. To do this, we modify the config file on both master and backup to include a script that checks the status of the HAProxy service. As long as the service is running on the MASTER, the MASTER continues to serve the floating IP. If the service stops, the BACKUP takes over ownership of the floating IP. This level of monitoring is in addition to monitoring that the network interface is up.

! Configuration File for keepalived

global_defs {
   notification_email {
     admin@example.com
   }
   notification_email_from keepalived@example.com
   smtp_server 10.1.1.100
   smtp_connect_timeout 30
   router_id LVS_DEVEL
}

vrrp_script check_haproxy {
    script    "/sbin/service haproxy status"
    interval 2
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.1.1.30
    }
    track_script {
        check_haproxy
    }
}

Notice that in the above we added two additional sections. One is ‘vrrp_script check_haproxy’, which runs ‘service haproxy status’ on the MASTER every 2 seconds. If the return code is ‘0’ then the service is considered to be up. If the return code is other than ‘0’ then the service is considered to be down and the BACKUP host will then take over the floating IP. The other is the ‘track_script’ block inside ‘vrrp_instance VI_1’, which ties the check to the VRRP instance.
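
The ‘fall 2’ and ‘rise 2’ settings mean that a single failed or successful run does not flip the state: it takes two consecutive failures to mark the service down and two consecutive successes to mark it up again. The sketch below only simulates that counting logic on a list of exit codes; it is not keepalived code, the function name simulate_checks is invented, and the starting state of "up" is an assumption made for the illustration.

```shell
#!/bin/sh
# simulate_checks FALL RISE CODE...
# Prints the tracked state (up/down) after each simulated script run.
# CODE is the exit status of one run: 0 means success, anything else failure.
simulate_checks() {
    fall="$1"; rise="$2"; shift 2
    state=up   # assumed starting state for this illustration
    ok=0; bad=0
    for code in "$@"; do
        if [ "$code" -eq 0 ]; then
            ok=$((ok + 1)); bad=0
            # Enough consecutive successes bring a down service back up.
            if [ "$state" = down ] && [ "$ok" -ge "$rise" ]; then state=up; fi
        else
            bad=$((bad + 1)); ok=0
            # Enough consecutive failures mark an up service as down.
            if [ "$state" = up ] && [ "$bad" -ge "$fall" ]; then state=down; fi
        fi
        echo "$state"
    done
}

# One isolated failure is ignored; two in a row take the service down,
# and two successes in a row bring it back.
simulate_checks 2 2 0 1 0 1 1 0 0
```

With ‘interval 2’ this means a stopped HAProxy is noticed after roughly two check intervals rather than immediately.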

– The BACKUP server config for application monitoring is similar to the MASTER’s. Only the state and the priority differ.

! Configuration File for keepalived

global_defs {
   notification_email {
     admin@example.com
   }
   notification_email_from keepalived@example.com
   smtp_server 10.1.1.100
   smtp_connect_timeout 30
   router_id LVS_DEVEL
}

vrrp_script check_haproxy {
    script    "/sbin/service haproxy status"
    interval 2
    fall 2
    rise 2
}


vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        10.1.1.30
    }
    track_script {
        check_haproxy
    }
    
}

How do you use keepalived and HAProxy in your network? Share your comments.

Adding a new LVM partition with GNU parted

In this brief article I will explain how to add a new physical partition to an existing disk, and then use that new partition to create a mountable logical volume.
Let’s assume that we have a 1TB disk and we are running CentOS/RedHat. In this case I am using version 6.5.

My first attempt to view the partition table with fdisk shows that the disk has a GUID Partition Table, which fdisk does not support:

# fdisk -l /dev/sda

WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sda: 999.7 GB, 999653638144 bytes
255 heads, 63 sectors/track, 121534 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x6929f946

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1      121535   976224255+  ee  GPT

Let’s view the partition table using GNU parted instead, and then add a 189GB partition.
The command ‘print free’ shows the free space at the end of the disk.
Partition 3 stops at 211GB, so the new partition can start from 211GB onwards. The free space ends at 1000GB; I picked 400GB as my end, leaving another 600GB free for later use.
I used the command ‘mkpart’ to add the partition. I called it ‘bigdisk’ and it is numbered 4.

# parted /dev/sda
GNU Parted 2.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print free
Model: Dell Virtual Disk (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
        17.4kB  1049kB  1031kB  Free Space
 1      1049kB  211MB   210MB   fat16              boot
 2      211MB   840MB   629MB   ext4
 3      840MB   211GB   210GB                      lvm
        211GB   1000GB  789GB   Free Space

(parted) mkpart
Partition name?  []? bigdisk
File system type?  [ext2]? ext4
Start? 211GB
End? 400GB
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda (Device or resource busy).  As a result, it may not reflect all of your changes until after reboot.
(parted) print
Model: Dell Virtual Disk (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End    Size   File system  Name     Flags
 1      1049kB  211MB  210MB  fat16                 boot
 2      211MB   840MB  629MB  ext4
 3      840MB   211GB  210GB                        lvm
 4      211GB   400GB  189GB               bigdisk

I have decided to remove the 189GB partition and recreate it with a larger size. Instead of starting at 211GB and ending at 400GB, it now ends at 600GB.
This gives me a 389GB partition.

# parted
GNU Parted 2.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Dell Virtual Disk (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End    Size   File system  Name     Flags
 1      1049kB  211MB  210MB  fat16                 boot
 2      211MB   840MB  629MB  ext4
 3      840MB   211GB  210GB                        lvm
 4      211GB   400GB  189GB               bigdisk

(parted) rm 4
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda (Device or resource busy).  As a result, it may not reflect all of your changes until after reboot.
(parted) mkpart
Partition name?  []? bigdisk
File system type?  [ext2]? ext4
Start? 211GB
End? 600GB
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sda (Device or resource busy).  As a result, it may not reflect all of your changes until after reboot.
(parted) print
Model: Dell Virtual Disk (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End    Size   File system  Name     Flags
 1      1049kB  211MB  210MB  fat16                 boot
 2      211MB   840MB  629MB  ext4
 3      840MB   211GB  210GB                        lvm
 4      211GB   600GB  389GB               bigdisk

(parted) quit
Information: You may need to update /etc/fstab.

I will now view my LVM physical volumes:

# pvs
  PV         VG     Fmt  Attr PSize   PFree
  /dev/sda3  vg_hv1 lvm2 a--  195.31g 11.71g

I need to add the 389GB partition as a physical volume. I know it’s /dev/sda4 because parted showed it as partition number 4.

# pvcreate /dev/sda4
  dev_is_mpath: failed to get device for 8:4
  Physical volume "/dev/sda4" successfully created
# pvs
  PV         VG     Fmt  Attr PSize   PFree
  /dev/sda3  vg_hv1 lvm2 a--  195.31g  11.71g
  /dev/sda4         lvm2 a--  362.70g 362.70g
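
Note that parted reported the partition as 389GB while pvs reports 362.70g. Most of that difference comes from units: parted’s default display uses decimal gigabytes (10^9 bytes), while LVM’s lowercase ‘g’ is binary (2^30 bytes). A quick sketch of the conversion, with a made-up helper name gb_to_gib:

```shell
#!/bin/sh
# gb_to_gib GB
# Converts decimal gigabytes (10^9 bytes) to binary gibibytes (2^30 bytes).
gb_to_gib() {
    awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 1000000000 / 1073741824 }'
}

gb_to_gib 389   # prints 362.28
```

The small remaining gap to 362.70g is presumably due to parted rounding the start and end values it displays.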

I have one volume group on my system:

# vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  vg_hv1   1   7   0 wz--n- 195.31g 11.71g

I am going to add another volume group. I don’t have to do this, since I could simply extend my existing volume group, but I want a separate volume group in order to logically separate my applications.

# vgcreate vg_ic  /dev/sda4
  Volume group "vg_ic" successfully created
# vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  vg_hv1   1   7   0 wz--n- 195.31g  11.71g
  vg_ic    1   0   0 wz--n- 362.70g 362.70g

Now I can create my logical volumes as needed. My existing logical volumes are:

# lvs
  LV      VG     Attr       LSize  Pool Origin Data%  Move Log Cpy%Sync Convert
  lv_home vg_hv1 -wi-ao---- 19.53g
  lv_root vg_hv1 -wi-ao---- 39.06g
  lv_swap vg_hv1 -wi-ao----  7.81g
  lv_tmp  vg_hv1 -wi-ao----  9.77g
  lv_var  vg_hv1 -wi-ao----  9.77g
  lv_vm1  vg_hv1 -wi-ao---- 48.83g
  lv_vm2  vg_hv1 -wi-ao---- 48.83g

Adding additional logical volumes:

# lvcreate -L 80G vg_ic -n lv_cdrive
  Logical volume "lv_cdrive" created
# lvs
  LV        VG     Attr       LSize  Pool Origin Data%  Move Log Cpy%Sync Convert
  lv_home   vg_hv1 -wi-ao---- 19.53g
  lv_root   vg_hv1 -wi-ao---- 39.06g
  lv_swap   vg_hv1 -wi-ao----  7.81g
  lv_tmp    vg_hv1 -wi-ao----  9.77g
  lv_var    vg_hv1 -wi-ao----  9.77g
  lv_vm1    vg_hv1 -wi-ao---- 48.83g
  lv_vm2    vg_hv1 -wi-ao---- 48.83g
  lv_cdrive vg_ic  -wi-a----- 80.00g

That’s about it! The next step can be to run ‘mkfs.ext4 /dev/mapper/vg_ic-lv_cdrive’ if you want an ext4 filesystem on the logical volume, followed by adding it to /etc/fstab and mounting it.
Or you can use it with KVM to install a VM on. If you add disks in another way, do share your experience.
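
For the ext4 route, the /etc/fstab entry might look like the one generated below. This is only a sketch: the mount point /mnt/cdrive and the helper name fstab_entry are assumptions for illustration, and the mount options shown are common defaults rather than a recommendation.

```shell
#!/bin/sh
# fstab_entry DEVICE MOUNTPOINT
# Prints an /etc/fstab line for an ext4 filesystem with default options.
# The final "0 2" means: not dumped, fsck'd after the root filesystem.
fstab_entry() {
    printf '%s %s ext4 defaults 0 2\n' "$1" "$2"
}

fstab_entry /dev/mapper/vg_ic-lv_cdrive /mnt/cdrive

# As root, the full sequence could look like this. It is left commented
# out because mkfs.ext4 destroys any data on the volume:
#   mkfs.ext4 /dev/mapper/vg_ic-lv_cdrive
#   mkdir -p /mnt/cdrive
#   fstab_entry /dev/mapper/vg_ic-lv_cdrive /mnt/cdrive >> /etc/fstab
#   mount /mnt/cdrive
```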
