My Linux Thoughts!

Wednesday, 8 June 2016

Why your web application performance is so slow and how to fix it!

1. Determine whether the problem is on the client side or the server side by accessing your web application from a different machine.

2. System Level troubleshooting

a) RAM related issues
b) Disk Space related Issues
c) Disk I/O read write issues
d) Network Hardware issues
e) Mount issues
f) Too many processes running on the machine

3. Application Level troubleshooting

a) Application is not behaving properly. Check the application log file, the application server log file or the web server log file and try to understand the issue.

b) Zombie processes - Find out whether any such processes are causing the system performance issues.

c) Application log - What to look for depends on the application installed; use your experience with the project to interpret the log and troubleshoot.

d) Web server log - Check the Apache and Tomcat logs as well.

e) Memory leak in an application - This is a well-known issue on Linux-based servers and is usually caused by bad application code. It can often be resolved by fixing the code or worked around by rebooting, but other solutions exist as well.

4. Dependent Services troubleshooting

a) SMTP response time - A slow SMTP server delays responses and causes many processes to queue up.

b) Network issues - Many system performance issues depend on the network, or on a service which in turn depends on the network.

c) Firewall related issues

d) Antivirus related issues

Some useful commands for system troubleshooting are

a) top
b) tail -f <logfile>
c) free
d) df -k
e) du -sh
f) ps -eaf | grep <process_name>
g) vmstat
h) iostat -x
i) sar
j) dmesg
k) crash
l) strace
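
As a small illustration of putting one of these commands to work, the sketch below filters df output for file systems at or above a usage threshold. The function name filesystems_over and the default of 90% are hypothetical choices for this example, and it assumes the POSIX df -P output format with mount points that contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and print every mount
# point whose usage is at or above the given percentage (default 90).
filesystems_over() {
    local threshold="${1:-90}"
    awk -v t="$threshold" '
        NR > 1 {                    # skip the header line
            use = $5
            sub(/%/, "", use)       # "95%" -> "95"
            if (use + 0 >= t + 0) print $6, $5
        }'
}

# Usage: df -P | filesystems_over 80
```

Reading from stdin rather than calling df inside the function keeps the filter easy to try against saved output from another machine.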

Some useful commands for network troubleshooting are

a) ping
b) traceroute
c) telnet/nc
d) ifconfig/ip
e) netstat -tulpn/ss -tuln
f) nslookup/host/dig
g) route
h) nmap
i) tcpdump

Saturday, 30 November 2013

PXE/Kickstart boot in a nutshell

One of the key requirements of provisioning is the hardware server's ability to boot over the network instead of a diskette or CD-ROM. There are several ways computers can boot over a network, and Preboot Execution Environment (PXE) is one of them. PXE is an open industry standard supported by a number of hardware and software vendors.

PXE works with the Network Interface Card (NIC) of the system by making it function like a boot device. The PXE-enabled NIC of the client sends out a broadcast request to the DHCP server, which returns the IP address of the client along with the address of the TFTP server and the location of the boot files on the TFTP server. The following steps describe how it works:

Target Machine (either bare metal or with boot sector removed) is booted.

The Network Interface Card (NIC) of the machine triggers a DHCP request.

The DHCP server receives the request and responds with standard information (IP address, subnet mask, gateway, DNS etc.). In addition, it provides the location of a TFTP server and the name of the boot image (pxelinux.0).

When the client receives this information, it contacts the TFTP server for obtaining the boot image.

TFTP server sends the boot image (pxelinux.0), and the client executes it.

By default, the boot image searches the pxelinux.cfg directory on the TFTP server for a boot configuration file using the following approach:

First, it searches for a boot configuration file named after the ARP hardware type code (01 for Ethernet) followed by the MAC address, in lower case hexadecimal digits with dash separators. For example, for the MAC address "88:99:AA:BB:CC:DD", it searches for the file 01-88-99-aa-bb-cc-dd.

Then, it searches for the configuration file using the IP address (of the machine that is being booted) in upper case hexadecimal digits. For example, for the IP address "192.0.2.91", it searches for the file "C000025B".

If that file is not found, it removes one hexadecimal digit from the end and tries again, repeating until a file is found or no digits remain. If the search is still not successful, it finally looks for a file named "default" (in lower case).

For example, if the boot file name is /tftpboot/pxelinux.0, the Ethernet MAC address is 88:99:AA:BB:CC:DD, and the IP address 192.0.2.91, the boot image looks for file names in the following order:

/tftpboot/pxelinux.cfg/01-88-99-aa-bb-cc-dd
/tftpboot/pxelinux.cfg/C000025B
/tftpboot/pxelinux.cfg/C000025
/tftpboot/pxelinux.cfg/C00002
/tftpboot/pxelinux.cfg/C0000
/tftpboot/pxelinux.cfg/C000
/tftpboot/pxelinux.cfg/C00
/tftpboot/pxelinux.cfg/C0
/tftpboot/pxelinux.cfg/C
/tftpboot/pxelinux.cfg/default
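
The search order described above can be reproduced with a short script, which is handy when deciding what to name a host-specific configuration file. This is only a sketch: the function name pxe_config_candidates is hypothetical, and it assumes an Ethernet MAC address and a dotted-quad IPv4 address with no leading zeros in the octets.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print, in search order, the configuration files
# pxelinux would look for, given a MAC address and an IPv4 address.
pxe_config_candidates() {
    local mac="$1" ip="$2" dir="${3:-/tftpboot/pxelinux.cfg}"
    local a b c d hex

    # MAC form: ARP type 01 (Ethernet), lower case hex, dash separators.
    mac="$(printf '%s' "$mac" | tr '[:upper:]' '[:lower:]' | tr ':' '-')"
    printf '%s/01-%s\n' "$dir" "$mac"

    # IP form: the four octets as eight upper case hex digits.
    IFS=. read -r a b c d <<< "$ip"
    hex="$(printf '%02X%02X%02X%02X' "$a" "$b" "$c" "$d")"

    # Drop one trailing digit at a time until nothing is left.
    while [ -n "$hex" ]; do
        printf '%s/%s\n' "$dir" "$hex"
        hex="${hex%?}"
    done

    # Final fallback.
    printf '%s/default\n' "$dir"
}

pxe_config_candidates "88:99:AA:BB:CC:DD" "192.0.2.91"
```

Because the shorter hexadecimal names match whole address ranges, a file named C00002 would apply to every client in 192.0.2.0/24 that has no more specific file.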

The client downloads all the files it needs (kernel and root file system), and then loads them.

Target Machine reboots.

The Provisioning application uses Red Hat's Kickstart method to automate the installation of Red Hat Linux on target machines. Using Kickstart, the system administrator can create a single file containing answers to all the questions that would usually be asked during a typical Red Hat Linux installation.

The host-specific boot configuration file contains the location of the kickstart file. This kickstart file would have been created earlier by the stage directive of the OS image, based on the input from the user.

Friday, 29 November 2013

What happens when you type a URL in your browser?

Most web browsers cache DNS information so that they don't have to submit a DNS query each time they connect to a recently visited website. Your local DNS server does the same, acting as a cache as well as a recursive name server. For this walkthrough I assume that neither your web browser nor your local DNS server has cached any information about the website you typed.

1. You type www.mylinuxthoughts.com into your web browser's address bar and hit enter.

2. Your browser asks the system resolver, which sends a request to the first DNS server listed in your DNS client configuration file /etc/resolv.conf.

3. This name server sends a query to a root name server, which returns a list of the authoritative name servers for the appropriate Top Level Domain (TLD) (.com, .net, .org etc.).

4. The TLD name servers look at the next part of the name, reading www.mylinuxthoughts.com from right to left, and refer the query to the authoritative name servers for mylinuxthoughts.com.

5. Since you are looking for the IP address of www.mylinuxthoughts.com, your local DNS server queries the authoritative name server for the A resource record of www.mylinuxthoughts.com and returns it to your machine.

6. Now your browser will use this IP address to establish a communication with the web server which hosts the domain www.mylinuxthoughts.com that you want to visit.

7. The TCP/IP stack of your system initiates a TCP 3-way handshake with the IP address of the server, typically on port 80/TCP.

8. Once the handshake is successful, your browser has a connection with www.mylinuxthoughts.com's web server. The browser sends an HTTP GET request over this TCP connection to retrieve the HTML code of the requested page.

9. The browser receives the HTTP response, which begins with a status line made of three parts separated by spaces: first the HTTP version, second a response status code that gives the result of the request, and third an explanation of the status code.

eg: HTTP/1.1 200 OK

10. Once your browser receives the HTML code from the web server, it renders the HTML code to your browser window.

11. Your local DNS server stores this IP in its cache for future use.

12. When you close your browser, TCP connection terminates.
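
To make the three-part status line from the walkthrough concrete, here is a minimal sketch that splits one into its fields. The function name parse_status_line is hypothetical; the only assumption about the input is that it is a single status line, optionally still carrying the carriage return of its CRLF line ending.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split an HTTP status line into its three parts.
parse_status_line() {
    local line="${1%$'\r'}"        # drop the trailing CR of the CRLF line ending
    local version code reason
    # The first two words are the version and the code; the rest of the
    # line, spaces included, is the explanation.
    read -r version code reason <<< "$line"
    printf 'version=%s\ncode=%s\nreason=%s\n' "$version" "$code" "$reason"
}

parse_status_line "HTTP/1.1 200 OK"
```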

Ref: DNS and BIND, 5th Edition
Pro DNS and BIND
Computer Networks, 4th Edition

Thursday, 31 October 2013

Linux Server Performance Tuning and Hardening!

System hardening is one of the toughest jobs for any system administrator. I would like to share a few steps which can make your server more secure.

1. Physical Server Security

You must protect physical console access to Linux servers. Configure the BIOS to disable booting from external devices such as DVDs, CDs and USB pen drives. Set a BIOS password and a GRUB boot loader password to protect these settings.

To set the GRUB password, generate a password hash using the command /sbin/grub-md5-crypt.

# /sbin/grub-md5-crypt
Password:
Retype password:
$1$.bvWQ1$8Cf.vpU5BKQCPlr1u07iQ1

Add the hash to the first line of /etc/grub.conf as follows:
password --md5 $1$.bvWQ1$8Cf.vpU5BKQCPlr1u07iQ1
This prevents users from entering single user mode or changing settings at boot time.

Note: "md5sum" stands for Compute and Check MD5 Message Digest. An MD5 checksum (commonly called a hash) is used to match or verify the integrity of files that may have changed as a result of a faulty file transfer, a disk error or non-malicious interference.

2. Remove unnecessary packages

Install packages according to the functional requirements of your server. It is good practice not to run services like Apache or Samba on mail servers, and not to have development packages or desktop software such as an X server installed on production servers. It is crucial to remove unnecessary packages and packages that don't comply with your organization's security policy. Packages like ftp and telnet should not be installed unless you have a justified business reason for them. There are secure alternatives, such as scp or sftp, which run under the SSH suite.

To get a list of all installed RPMs you can use the following command:

# rpm -qa

To remove the unwanted RPMs you can use

# rpm -e <package_name> or # yum erase <package_name>

If you don't want to remove a package, you can instead disable its service using chkconfig.

To list the services configured to start at boot, run the following command:

# /sbin/chkconfig --list

To stop a service from starting in runlevels 2 to 5:

# chkconfig --level 2345 <service_name> off

To disable the startup script by hand, rename it from a start (S) script to a kill (K) script:
# /bin/mv /etc/rc.d/rc3.d/S25<script_name> /etc/rc.d/rc3.d/K25<script_name>

To disable services, you can either rename the startup script or use commands like chkconfig or ntsysv. There are two steps to stopping a service: 1) stop the currently running service, and 2) change the configuration so that the service doesn't start on the next reboot.

3. Apply Patches

Building an infrastructure for patch management is another very important step to proactively secure Linux production environments. It is recommended to have a written security policy and procedure to handle Linux security updates and issues.

For example, a security policy should detail the time frame for assessment, testing, and roll out of patches. Network related security vulnerabilities should get the highest priority and should be addressed immediately within a short time frame.

All security updates should be reviewed and applied as soon as possible. If you don't have a security update policy and you decide to apply automatic updates, make sure that you exclude the packages you don't want updated automatically by listing them in /etc/yum.conf. This is especially important when you don't want to update your kernel.

4. Disable unwanted SUID/SGID files

Files that run with the effective permissions of their owner or group can be a real security threat, so you need to be very careful with files which have SUID/SGID set.

You can list all the files that have SUID/SGID set:

# find / -path /proc -prune -o -type f -perm /6000 -ls
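
One way to keep an eye on SUID/SGID files over time is to save the list once and compare later runs against it. The sketch below assumes that approach; the function name list_setid_files and the baseline path in the comments are hypothetical. It tests the two permission bits separately with -perm -4000 and -perm -2000, which behaves the same across find implementations.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list regular files under a directory that have the
# SUID (4000) or SGID (2000) bit set, without crossing file system boundaries.
list_setid_files() {
    local dir="${1:-/}"
    find "$dir" -xdev -type f \( -perm -4000 -o -perm -2000 \) -print 2>/dev/null | sort
}

# Usage idea: save a baseline once, then compare later runs against it.
#   list_setid_files / > /root/setid.baseline
#   list_setid_files / | diff /root/setid.baseline -
```

Any line that diff reports as new is a file that gained SUID/SGID since the baseline was taken and deserves a closer look.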

5. Secure services like SSH, Postfix, NFS etc

SSH
Root login should be disabled. You can also create a group whose members are the only users allowed to use ssh.

# vim /etc/ssh/sshd_config
PermitRootLogin no
AllowGroups sshusers

Postfix
Linux servers that are not dedicated mail or relay servers should not accept external emails. To make sure that Postfix accepts only local emails for delivery, edit the configuration file and specify mydestination (the domains to receive emails for) and inet_interfaces (the network interfaces to listen on).

# vim /etc/postfix/main.cf
mydestination = $myhostname, localhost.$mydomain, localhost
inet_interfaces = localhost

NFS
List all your exports in /etc/exports; you can then export all the file systems using: # exportfs -a

6. Impose password policy

a) Enable Password Aging
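Password aging defaults such as the maximum and minimum password age are stored in the /etc/login.defs file, and the per-user values (last password change, expiry, inactive period etc.) are stored in /etc/shadow.
To get this information for a particular user, use the chage command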

# chage -l <user_name>

Last password change : Oct 31, 2013
Password expires : never
Password inactive : never
Account expires : never

b) Enforce stronger passwords
To set complexity of the password edit the system-auth file under the directory /etc/pam.d

# vim /etc/pam.d/system-auth

The pam_cracklib module checks the password against dictionary words and other constraints, such as whether the password contains the user's name in some form, is a palindrome, is too short, is just a change of case or is too similar to the old one. Look for the line containing the pam_cracklib module like below. To enforce password complexity, change the line to:

password requisite pam_cracklib.so try_first_pass retry=3 minlen=8 ucredit=-1 dcredit=-1 ocredit=-1 lcredit=-1

minlen minimum length of the password
lcredit minimum number of lower case letters
ucredit minimum number of upper case letters
dcredit minimum number of digits
ocredit minimum number of other characters
retry prompt the user at most N times before returning with an error

A negative credit value, such as the -1 used above, means that at least that many characters of the class are required.

c) Restricting use of previous passwords

The pam_unix module is used for traditional password authentication; it obtains account information from /etc/passwd, and from /etc/shadow as well if shadow passwords are enabled.

Now update the existing password line and append remember=10 to prevent a user from re-using any of his or her last 10 passwords. Do not add a new line; edit the existing password line.

password sufficient pam_unix.so use_authtok md5 shadow remember=10

The old password list is stored in the /etc/security/opasswd file.

d) Lock user accounts after too many login failures

The pam_tally module maintains a count of attempted accesses, can reset the count on success, and can deny access if too many attempts fail. Add the following entry to the /etc/pam.d/system-auth file.

# vim /etc/pam.d/system-auth
pam_tally.so per_user deny=5 no_magic_root reset

The failure count is kept in /var/log/faillog by default (the newer pam_tally2 module uses /var/log/tallylog), while the failed login attempts themselves are logged in /var/log/secure.

7. Restrict su usage

You can list the commands that the current user is allowed to run through sudo with the sudo -l command.

# sudo -l

I have seen many system administrators add users to the wheel group and enable the wheel group in the sudoers file. This can be a security risk, as these users get the same privileges as root.

7. Set SELINUX

To see your SELinux status, current mode and policy type (SELINUXTYPE), use the command sestatus. You can also use the getenforce command to get just the current mode.

# getenforce
Enforcing

To change it, edit the selinux configuration file

# vim /etc/sysconfig/selinux
SELINUX=enforcing
SELINUXTYPE=targeted

8. Set firewall rules

You should set firewall rules

# vim /etc/sysconfig/iptables

More information about iptables can be found here

9. Tune your kernel

The sysctl command is used to configure kernel parameters at run time. To list the current values

# sysctl -a

You can optimize the performance of your server by tuning the kernel. I'm discussing only a few commonly used parameters. More information can be found here.

Open the sysctl configuration file and add these settings. Once done, use sysctl -p to load them.

# vim /etc/sysctl.conf

# Disable IP packet forwarding
net.ipv4.ip_forward = 0

# Disable IP source routing.
# Sender of a packet can specify the route that a packet should take through the network.
net.ipv4.conf.all.accept_source_route = 0

# Enable IP spoofing protection, turn on reverse path (source address) verification
net.ipv4.conf.all.rp_filter = 1

# Decrease the default value for TCP Keepalive time.
# Checking your connected socket to determine whether the connection is still up and running or if it has broken.
net.ipv4.tcp_keepalive_time = 1800

# Enable TCP SYN Flooding Attack Protection
net.ipv4.tcp_syncookies = 1

# Ignore all broadcast/multicast ICMP ECHO/TIMESTAMP request to prevent common DoS attack
net.ipv4.icmp_echo_ignore_broadcasts = 1

# Enable ExecShield protection to prevent Buffer overflow attack
kernel.exec-shield = 1

# Set limits for system wide open files/file descriptors (FD)
fs.file-max = 65535

# Set allowed PID limit
kernel.pid_max = 65535

# Limit allowed local port range
net.ipv4.ip_local_port_range = 1024 65535

# After a kernel panic, reboot after a 30 second delay
kernel.panic = 30

Note:
To verify kernel parameters use sysctl <kernel_parameter_name>

# sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0

You can also use
# cat /proc/sys/net/ipv4/ip_forward
0
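
Since every parameter has a matching file under /proc/sys, the verification above can be automated for a whole configuration file. The following is a rough sketch, not a replacement for sysctl itself: the function names are hypothetical, it assumes simple "key = value" lines, and the plain dot-to-slash mapping does not handle keys whose components contain a dot (such as VLAN interface names).

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compare the values in a sysctl-style file with the
# values currently active under /proc/sys.

# net.ipv4.ip_forward -> /proc/sys/net/ipv4/ip_forward
sysctl_key_to_path() {
    printf '%s/%s\n' "${2:-/proc/sys}" "$(printf '%s' "$1" | tr '.' '/')"
}

# Trim and collapse whitespace so "1024<TAB>65535" equals "1024 65535".
normalize() { awk '{$1=$1; print}'; }

check_sysctl_conf() {
    local conf="$1" root="${2:-/proc/sys}" key want have path rc=0
    while IFS='=' read -r key want || [ -n "$key" ]; do
        key="$(printf '%s\n' "$key" | normalize)"
        case "$key" in ''|'#'*|';'*) continue ;; esac   # blank or comment
        want="$(printf '%s\n' "$want" | normalize)"
        path="$(sysctl_key_to_path "$key" "$root")"
        if [ ! -r "$path" ]; then
            printf 'MISSING  %s\n' "$key"; rc=1; continue
        fi
        have="$(normalize < "$path")"
        if [ "$have" != "$want" ]; then
            printf 'MISMATCH %s: running=%s wanted=%s\n' "$key" "$have" "$want"
            rc=1
        fi
    done < "$conf"
    return "$rc"
}

# Usage: check_sysctl_conf /etc/sysctl.conf && echo "all settings active"
```

The optional second argument exists only so the check can be pointed at a copy of the tree instead of the live /proc/sys.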

10. Set limit for resource usage

You can use ulimit -a to see the limits set on resources such as CPU time, the maximum number of processes that a single user may own, priority, open files etc.

Limiting the maximum number of processes a user may own is especially important as protection against a 'fork bomb' attack.

To make a limit persistent, add the corresponding entry to the limits.conf file

# vim /etc/security/limits.conf

Note:
You can set user-specific limits for open files/file descriptors (FD) by editing the /etc/security/limits.conf file and specifying soft and hard limits.

# vim /etc/security/limits.conf
httpd soft nofile 4096
httpd hard nofile 10240

Ref:
Red Hat Enterprise Linux
Linux Security Cookbook
Linux System Security
Linux Security HOWTO
Hardening Tips

Thursday, 12 September 2013

Enable Core File Dumps for Application Crashes or Segmentation Faults

Core dumps are a really useful debugging tool for system administrators, especially when applications like Apache or MySQL crash. When an application is poorly written or has a bug, it may try to access a memory location that it is not supposed to, and a segmentation fault occurs.

On Red Hat and its clones, core file creation is disabled by default. You can verify this with

# ulimit -c
0

Here zero is the maximum size of the core dump file, which means no core file is written. To enable core dumps, follow these steps.

Step 1: Set limit for core dump file

You can restrict your core dump file to 100MB; the value must be given in kilobytes (102400KB).

# vim /etc/security/limits.conf

Replace #* soft core 0 with * soft core 102400

Step 2: Set limit for all the users

Now you need to change this in the /etc/profile file. If you are using RHEL you will find

# vim /etc/profile
ulimit -S -c 0 > /dev/null 2>&1

Replace the above ulimit command in /etc/profile with the following

ulimit -S -c 102400 > /dev/null 2>&1

If you don't find it on your Red Hat clone, don't worry; just add this line to the /etc/profile file.

Step 3: Enable debugging for all the applications

To enable debugging for all the applications, edit /etc/sysconfig/init file and add the following
# vim /etc/sysconfig/init

DAEMON_COREFILE_LIMIT=102400

If you want to enable core dumping for a particular application only, you can set that in its corresponding file in the /etc/sysconfig/ directory.

Now edit /etc/sysctl.conf file and add the following

# vim /etc/sysctl.conf

kernel.core_pattern = /tmp/core-%e-%s-%u-%g-%p-%t
fs.suid_dumpable = 2

When an application crashes or is killed by a signal that dumps core, the kernel by default writes a file named core in the process's current working directory. With kernel.core_pattern you can define the location and name of the core dump file instead, as done above for /tmp. The template can contain % specifiers, which are substituted by the following values when a core file is created:

%% - A single % character
%p - PID of dumped process
%u - real UID of dumped process
%g - real GID of dumped process
%s - number of the signal causing dump
%t - time of dump (seconds since 0:00h, 1 Jan 1970)
%h - hostname
%e - application file name

Reload the settings in /etc/sysctl.conf by running the following command
# sysctl -p

Step 4: Test that a core dump file is created on a crash

Now I'm going to kill Apache using the signal SIGQUIT, which dumps core on termination. Don't try this in a production environment.

# kill -s SIGQUIT `cat /var/run/httpd/httpd.pid`

Now you can see that a core dump file has been created in the /tmp directory.

# ls -l /tmp
-rw------- 1 root root 5431296 Jan 12 21:07 core-httpd-3-0-0-4277-1379034432

Here httpd is the name of the application which was killed by the signal SIGQUIT, 3 is the signal number, the two zeros are the UID and GID of root, 4277 is the PID and 1379034432 is the time of the crash.
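
When many core files pile up, it helps to decode their names automatically. The sketch below does this for the specific pattern configured above (core-%e-%s-%u-%g-%p-%t); the function name parse_core_name is hypothetical, and a different core_pattern would need different parsing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode a core file name written with the pattern
# core-%e-%s-%u-%g-%p-%t. Fields are peeled off from the right, so an
# executable name that itself contains dashes is still handled.
parse_core_name() {
    local rest="${1##*/}"          # strip any leading directory
    local exe sig uid gid pid ts
    rest="${rest#core-}"
    ts="${rest##*-}";  rest="${rest%-*}"
    pid="${rest##*-}"; rest="${rest%-*}"
    gid="${rest##*-}"; rest="${rest%-*}"
    uid="${rest##*-}"; rest="${rest%-*}"
    sig="${rest##*-}"; exe="${rest%-*}"
    printf 'exe=%s signal=%s uid=%s gid=%s pid=%s time=%s\n' \
        "$exe" "$sig" "$uid" "$gid" "$pid" "$ts"
}

parse_core_name /tmp/core-httpd-3-0-0-4277-1379034432
```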

Step 5: Analyse the core dump file

You need gdb (GNU Debugger) to analyse the core dump file. If gdb is not on your system, install it.

# yum install gdb

Use the gdb command as follows
# gdb <path of the application> <path of the core dump file>

# which httpd
/usr/sbin/httpd

Now you may run gdb

# gdb /usr/sbin/httpd /tmp/core-httpd-3-0-0-24631-1374966080
Core was generated by `/usr/sbin/httpd'.
Program terminated with signal 3, Quit.

You may get an error message like "Missing separate debuginfos, use: debuginfo-install httpd-2.2.15-28.el6.centos.i686". The debuginfo-install command is provided by the yum-utils package, so check whether yum-utils is installed on your system.

# rpm -qa | grep yum-utils
yum-utils-1.1.30-14.el6.noarch

If it is not, install it and run the debuginfo-install command. If you still get an error message like "Could not find debuginfo for main pkg: httpd-2.2.15-28.el6.centos.i686", edit the file /etc/yum.repos.d/CentOS-Debuginfo.repo and set enabled=1. If you use some other Red Hat clone, you need to use the appropriate Debuginfo repo file.

Tuesday, 10 September 2013

Processes, Daemons, Signals and Services

Process - A process is a running program. Each process is uniquely identified by a number called a process ID (PID). Similar to files, each process has one owner and group, and the owner and group permissions are used to determine which files and devices the process can open. Init, the parent of all processes, is the first process to start at boot time and has a PID of 1. A process state can be running (R), sleeping (S), stopped (T) or zombie* (Z). We can find the state of a process from the STAT field of the ps command or from the S field of the top command. To find zombie processes, filter on that field:

# ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

Daemon - A daemon is a process which runs in background and has no controlling terminal.

Signal - A signal is a notification sent to a process or to a specific thread within the process to notify that an event occurred. Signals are used for the communication between user processes and from kernel to user process. We can communicate with a daemon or any running process by sending a signal using the command kill.

Signal names start with SIG, and signals are numbered from 1 to 64. The kill -l command displays all signals with their number and corresponding name, while fuser -l gives you only the signal names.

# kill -l

1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL 5) SIGTRAP
6) SIGABRT 7) SIGBUS 8) SIGFPE 9) SIGKILL 10) SIGUSR1
11) SIGSEGV 12) SIGUSR2 13) SIGPIPE 14) SIGALRM 15) SIGTERM
16) SIGSTKFLT 17) SIGCHLD 18) SIGCONT 19) SIGSTOP 20) SIGTSTP
.
.
.
# fuser -l
HUP INT QUIT ILL TRAP ABRT IOT BUS FPE KILL USR1 SEGV USR2 PIPE ALRM TERM
STKFLT CHLD CONT STOP TSTP TTIN TTOU URG XCPU XFSZ VTALRM PROF WINCH IO PWR SYS UNUSED

Each signal has one of five default dispositions

1. Term (Terminate the process)
2. Ign (Ignore the signal)
3. Core (Terminate the process and dump core)
4. Stop (Stop the process)
5. Cont (Continue the process if it is currently stopped)
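
The difference between a default disposition and a caught signal can be observed from the shell. In the sketch below, which uses hypothetical function names, one background process is left with the default Term disposition for SIGTERM and another installs a handler with trap. A process ended by a signal is reported by the shell with exit status 128 plus the signal number, and kill -l can translate that number back into a name.

```shell
#!/usr/bin/env bash
# Translate an exit status into the name of the signal that caused it.
signal_name_from_status() {
    local status="$1"
    if [ "$status" -gt 128 ]; then
        kill -l "$((status - 128))"
    else
        echo "none"
    fi
}

# Default disposition (Term): the process is ended by the signal.
run_default() {
    local pid status=0
    sleep 30 &
    pid=$!
    kill -TERM "$pid"
    wait "$pid" || status=$?
    echo "$status"
}

# Caught signal: the handler runs instead and chooses its own exit status.
run_with_handler() {
    local pid status=0
    ( trap 'exit 0' TERM; while :; do sleep 0.1; done ) &
    pid=$!
    sleep 0.5                      # give the subshell time to install the trap
    kill -TERM "$pid"
    wait "$pid" || status=$?
    echo "$status"
}

# Usage:
#   signal_name_from_status "$(run_default)"   # the default action ended it
#   run_with_handler                            # the handler exited cleanly
```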

The Signals every system admin should know!

1 SIGHUP This signal indicates that the controlling terminal has gone away, for example because someone closed the terminal program without first ending the applications running inside the terminal window. The default action for this signal is to terminate the program. You can make processes immune to SIGHUP with the nohup command so that they continue to run after the user logs out. Many daemons, which have no terminal to lose, instead catch SIGHUP and treat it as a request to re-read their configuration file, much like calling init q for init.

After changing a web server's configuration file, the web server needs to be told to re-read its configuration. Stopping and starting Apache would result in a brief outage period on the web server. Instead, send the daemon the SIGHUP signal, which makes it re-read its configuration and restart its child processes. For a graceful restart that lets requests in progress finish, Apache uses SIGUSR1, which is what # /etc/init.d/httpd graceful relies on.

2 SIGINT This signal is sent to your application when a user tries to end it by pressing Ctrl+C (mostly when a process freezes). It's a request to terminate the current operation. Most programs will stop cleanly (if they catch the signal) or simply allow themselves to be killed, which is the default if the signal is not caught. The default action for this signal is to terminate your program.

3 SIGQUIT This signal is used to stop processes that could not be ended with SIGINT; you can send it by pressing Ctrl+\. The default action for this signal is to terminate the process and dump a core file.

6 SIGABRT This is yet another way to terminate your program, used as an emergency stop. The function abort() raises the SIGABRT signal, which terminates your program. It is normally initiated by a debugging process or a self-detected error. The default action for this signal is to terminate the process and leave a core file for debugging purposes.

9 SIGKILL This signal terminates the program ungracefully; the program may not save open files, etc. This signal cannot be caught or ignored by a process. This is the "I do not care what you are doing, stop right now" signal. Sending a SIGKILL to a process will usually stop that process there and then. The default action for this signal is to terminate your program.

10 SIGUSR1 This is a general purpose signal available for programs to use in whatever way they'd like. For example, the Apache web server interprets the SIGUSR1 signal as a request to gracefully restart.

11 SIGSEGV If an application is badly written and tries to access memory that it is not supposed to, the kernel sends the process the Segmentation Violation signal. This is often caused by reading off the end of an array. The default action for this signal is to terminate the process and dump core.

12 SIGUSR2 This is also a general purpose signal available for programs to use in whatever way they'd like. You may use this signal to synchronize your program with some other program or to communicate with it.

15 SIGTERM This signal asks the program to terminate gracefully: to close any log files it may have open and to finish what it is doing before shutting down. In some cases, a process may ignore SIGTERM if it is in the middle of some task that cannot be interrupted. This is the default signal sent by the kill command. The default action for this signal is to terminate your program.

17 SIGCHLD The kernel sends a process this signal when one of its child processes has stopped or terminated. We can use this signal to try to clear a zombie process: first find the zombie's parent PID (PPID), then send the parent the SIGCHLD signal with kill -17 ppid, which prompts it to collect the child's exit status. The default action for this signal is to ignore it.

18 SIGCONT If a process has been suspended by a SIGSTOP or SIGTSTP signal, it continues its execution when it receives a SIGCONT signal. The default action for this signal is to continue your process, if stopped.

19 SIGSTOP This signal suspends a process until it receives a SIGCONT signal. Unlike SIGTSTP, it cannot be caught or ignored. The default action for this signal is to stop your process.

20 SIGTSTP Both SIGTSTP and SIGSTOP are designed to suspend a process which will eventually be resumed with SIGCONT. The main difference between them is that SIGSTOP is typically sent from a script or command (eg: kill -STOP pid) while SIGTSTP is sent by a user pressing Ctrl+Z on the keyboard. The default action for this signal is to stop your process.

Services - In Windows, daemons are called services. We can manage these services by typing services.msc at the command prompt. Linux has a command called /sbin/service, used to run the init scripts located in /etc/init.d/.

* A zombie process or defunct process is a process that has completed execution but still has an entry in the process table, usually because of bugs and coding errors, and is waiting for its parent process to pick up the return value. A zombie process is different from an orphan process. An orphan process is a process that is still executing but whose parent process has died; orphans are adopted by init.

Note 1

The signals named SIGKILL and SIGSTOP cannot be caught, blocked, or ignored. The SIGKILL signal destroys the receiving process, and SIGSTOP suspends its execution until a SIGCONT signal is received. SIGCONT may be caught or ignored, but not blocked.

Note 2

You can view the key mappings that send specific signals to a process using the "stty -a" command as shown below.

# stty -a
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>; eol2 = <undef>; swtch = <undef>; start = ^Q;
stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W; lnext = ^V; flush = ^O;

Friday, 30 August 2013

Nagios: Monitor your Network Infrastructure

You can monitor your firewall, routers, network switches etc. using Nagios. These days most switches and routers support SNMP, and you can monitor port status with the check_snmp plugin and bandwidth using MRTG with the check_mrtgtraf plugin. You need to install a separate plugin if you want to monitor your firewall.

I assume you have already installed and configured Nagios on the Nagios monitoring server. If not, follow the instructions here. Once your Nagios server is ready, you'll need to follow these steps to monitor your network infrastructure.

1. Enable Switch configuration file in nagios.cfg

Edit the Nagios configuration file and uncomment the switch.cfg line.

# vim /usr/local/nagios/etc/nagios.cfg

cfg_file=/usr/local/nagios/etc/objects/switch.cfg

2. Define hosts for Switch/Router/Firewall

Open the configuration file and change the host_name, alias, and address fields to appropriate values for the switch.

# vim /usr/local/nagios/etc/objects/switch.cfg

# Define the switch that we'll be monitoring

define host{
use generic-switch ; Inherit default values from a template
host_name catalyst-4500 ; The name we're giving to this switch
alias Cisco Catalyst 4500 Switch ; A longer name associated with the switch
address 192.168.1.195 ; IP address of the switch
hostgroups switches ; Host groups this switch is associated with
}

For the firewall and the router, add similar host definitions and change the host_name, alias, and address fields to the appropriate values.

3. Monitoring services for Switch/Router/Firewall

Add the following service definition to monitor packet loss and round trip average between the Nagios host and the switch every 5 minutes under normal conditions.

# Create a service to PING to switch

define service{

use generic-service ; Inherit values from a template

host_name catalyst-4500 ; The name of the host the service is associated with

service_description PING ; The service description

check_command check_ping!200.0,20%!600.0,60% ; The command used to monitor the service

normal_check_interval 5 ; Check the service every 5 minutes under normal conditions

retry_check_interval 1 ; Re-check the service every minute until its final/hard state is determined

}

This service will be:

CRITICAL if the round trip average (RTA) is greater than 600 milliseconds or the packet loss is 60% or more

WARNING if the RTA is greater than 200 ms or the packet loss is 20% or more

OK if the RTA is less than 200 ms and the packet loss is less than 20%
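
The threshold rules above can be written out as a few lines of script, which makes it easy to check what state a given measurement would produce before changing the check_command values. This is a sketch of the rules as described here, not the plugin's own code; the function name classify_ping and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a ping result the way the service above does.
# Arguments: RTA in ms, packet loss in %, then optional thresholds
# (warning RTA, warning loss, critical RTA, critical loss).
classify_ping() {
    awk -v rta="$1" -v loss="$2" \
        -v wrta="${3:-200}" -v wpl="${4:-20}" \
        -v crta="${5:-600}" -v cpl="${6:-60}" 'BEGIN {
            if (rta + 0 > crta + 0 || loss + 0 >= cpl + 0)      print "CRITICAL"
            else if (rta + 0 > wrta + 0 || loss + 0 >= wpl + 0) print "WARNING"
            else                                                print "OK"
        }'
}

# Usage: classify_ping 250.4 0
```

awk is used for the comparison because RTA values are usually fractional, which the shell's own integer arithmetic cannot compare.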

# Monitor uptime via SNMP

define service{
use generic-service ; Inherit values from a template
host_name catalyst-4500
service_description Uptime
check_command check_snmp!-C public -o sysUpTime.0
}

# Monitor Port 1 status via SNMP

define service{
use generic-service ; Inherit values from a template
host_name catalyst-4500
service_description Port 1 Link Status
check_command check_snmp!-C public -o ifOperStatus.1 -r 1 -m RFC1213-MIB
}

Repeat this procedure for router as well. To monitor firewall you'll need to download the appropriate plugin and define the services. If you are using Cisco ASA you can download the plugin from here.

4. Monitor your Bandwidth

You need to install MRTG if you want to monitor bandwidth usage on your switches or routers. You can have Nagios alert you when traffic rates exceed the thresholds you specify, using the check_mrtgtraf plugin. The MRTG log file mentioned below should point to the MRTG log file on your system.

# Monitor bandwidth via MRTG logs

define service{
use generic-service ; Inherit values from a template
host_name catalyst-4500
service_description Port 1 Bandwidth Usage
check_command check_local_mrtgtraf!/var/lib/mrtg/192.168.1.195_1.log!AVG!1000000,1000000!5000000,5000000!10
}

In the example above, the "/var/lib/mrtg/192.168.1.195_1.log" option that gets passed to the check_local_mrtgtraf command tells the plugin which MRTG log file to read from. The "AVG" option tells it that it should use average bandwidth statistics. The "1000000,1000000" option sets the warning thresholds (in bytes) for the incoming and outgoing traffic rates. The "5000000,5000000" option sets the critical thresholds (in bytes) for the incoming and outgoing traffic rates. The "10" option causes the plugin to return a CRITICAL state if the MRTG log file is older than 10 minutes (it should be updated every 5 minutes).

5. Verify configuration and restart Nagios.

# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

# /etc/init.d/nagios restart
Stopping nagios: [ OK ]
Starting nagios: [ OK ]

Note 1:
If you want to monitor all the ports of the switch, add a service definition for each port, changing the port index at the end of ifOperStatus in each one.

check_command check_snmp!-C public -o ifOperStatus.1 -r 1 -m RFC1213-MIB
check_command check_snmp!-C public -o ifOperStatus.2 -r 1 -m RFC1213-MIB
check_command check_snmp!-C public -o ifOperStatus.3 -r 1 -m RFC1213-MIB
...

Note 2:
You can monitor your router/firewall using SNMP if you know the object identifier (OID) for the router/firewall, which you can find using snmpwalk.

# snmpwalk -v1 -c public 192.168.1.205 -m ALL .1

Here 192.168.1.205 is the IP address of your router/firewall.

Note 3:
You can monitor your remote Linux/Windows hosts using SNMP, but I'm not sure of the reliability of SNMP. One reason is that SNMP runs over UDP, which does not guarantee delivery, and the other is that there is no acknowledgement defined for SNMP traps.

Note 4:
There are a few occasions where we prefer UDP over TCP, especially when we don't require any acknowledgement or when losing a few packets doesn't make any difference.

1. UDP is used for broadcast and multicast, as TCP doesn't support broadcast/multicast.

2. UDP is faster: there is no acknowledgement defined and no need to resend lost packets, which is why it is widely used for videoconferencing.