Dave Williams - Nagios Log Server - Practical Experience

Nagios Log Server
Practical Experience
Dave Williams

| 31-07-2015 | Dave Williams | © Atos
GB | Managed Services | TTS
Agenda
▶ Background
▶ Why choose Nagios Log Server
▶ Implementation
▶ Source Configuration
▶ Useful things to know
▶ Initial Dashboards
▶ Final Dashboards
▶ System Performance
▶ Conclusions

Background
▶UK based
– Mainframe (IBM & Honeywell)
– Unix (HP-UX, AIX, Solaris)
– Linux (RedHat, SLES, Debian)
– Network (CASE, 3COM, CISCO)
▶Working for Atos
– French Outsourcing Company
– Mainframes, Unix, HPC, Security, Managed Services, Advisory Services

Background
▶ System Monitoring
– OpenView
– Netview
– Open Master
▶ Open Source Monitoring
– NetSaint on AIX
– Nagios
– Nagios XI

Why choose Nagios Log Server?
▶ Needed a log server of some nature
▶ Had already built an Elasticsearch & Logstash system (not using Kibana) by hand
▶ Used Splunk in a previous life to good effect
▶ Nagios Log Server was announced last year, after Ethan and others had taken note
▶ Seemed to be a ‘cost effective’ easy build option
▶ Included authentication & access control necessary for Managed Services
environment.

Implementation
▶ Because we use CentOS, we installed from source
– no great issues; the ntp requirement in the install script was overcome.
• Complete!
• 12 Aug 18:40:02 ntpdate[2930]: no server suitable for synchronization found
• ===================
• INSTALLATION ERROR!
• ===================
• Installation step failed - exiting.
• Check for error messages in the install log (install.log).
• If you require assistance in resolving the issue, please include install.log
• in your communications with Nagios Enterprises technical support.

Implementation
• The step that failed was: 'prereqs'
• # Set date/time because ssl certificates can be in the future... (fix for pypi and get-pip)
• # ntpdate -u pool.ntp.org
▶ Easily able to move data storage to a nominated filesystem

Implementation
▶ Connecting a new instance to the cluster :
– really is as simple as the manual describes
• install on new host
• connect to the web interface
• enter IP address / name of original cluster node
• enter Cluster ID of the original system
– Finish Installation.

Underlying Structure
[Diagram: Server 1 … Server N, each paired with a Logstash instance; Logstash pushes data into the Elasticsearch cluster, which is queried by Kibana.]

Source Configuration
▶ Creation of feeds straightforward.
– First syslog, using remote syslog to accept other systems' data
– SNMP traps are also recorded, because SNMPTT writes them to syslog
– Could use Eventlog (NXLog) for Windows in future
▶ VMware logs (from the ESXi hosts, not the VMs):
– Add an input:
    udp {
      type => 'esxilogs'
      port => 1514
    }
– Save and apply; adjust iptables if required
– Follow this VMware configuration guide to set up your ESXi hosts to log to udp://nagios.log.server.ip:1514
  http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007329
– Or read https://assets.nagios.com/downloads/nagios-log-server/docs/Sending-ESXi-Logs-To-Nagios-Log-Server.pdf
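Before touching the ESXi hosts, it can help to confirm that the new UDP input is reachable. The following is a minimal sketch, assuming the input above is listening on port 1514; the helper names `build_syslog_line` and `send_test_syslog`, the tag `nls-test` and the host name `esxi-test` are made up for illustration, and the message is a simplified syslog-style line rather than a faithful copy of what ESXi sends.

```python
import socket


def build_syslog_line(message, hostname="esxi-test", facility=1, severity=5):
    """Build a simplified syslog-style line: <PRI>hostname tag: message."""
    pri = facility * 8 + severity  # syslog priority = facility * 8 + severity
    return "<%d>%s nls-test: %s" % (pri, hostname, message)


def send_test_syslog(host, port=1514, message="test message for the esxilogs input"):
    """Send one UDP datagram to the Logstash input and return the payload sent."""
    payload = build_syslog_line(message).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload


# Usage (replace the address with your Nagios Log Server node):
#   send_test_syslog("192.0.2.50")
```

UDP gives no delivery confirmation, so the check is whether an event of type 'esxilogs' then shows up in the dashboard.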

Source Configuration
For NetFlow, use this:
Logstash has native NetFlow v5 and v9 codecs. It can't handle high volume (I'm guessing no more than a few hundred flows per second).
– Add an input:
    udp {
      host => "0.0.0.0"
      port => 2055
      codec => netflow {
        cache_ttl => 1
        versions => [ 5, 9 ]
      }
      type => "netflow"
    }
– Save and apply; adjust iptables if required
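If no exporter is available yet, the input can be exercised with a hand-built datagram. This is a rough sketch, assuming the standard NetFlow v5 layout (a 24-byte header followed by 48-byte flow records); the function names, addresses and counters are invented for the test and do not come from the slides.

```python
import socket
import struct
import time

# NetFlow v5: 24-byte header, then one 48-byte record per flow.
V5_HEADER = struct.Struct("!HHIIIIBBH")
V5_RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")


def build_v5_packet(src_ip, dst_ip, src_port, dst_port, packets, octets):
    """Pack one NetFlow v5 datagram containing a single TCP flow record."""
    now = time.time()
    header = V5_HEADER.pack(
        5,                      # version
        1,                      # number of records in this datagram
        60000,                  # sysUptime in ms (arbitrary test value)
        int(now),               # unix_secs
        int((now % 1) * 1e9),   # unix_nsecs
        0,                      # flow_sequence
        0, 0,                   # engine_type, engine_id
        0,                      # sampling_interval
    )
    record = V5_RECORD.pack(
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        socket.inet_aton("0.0.0.0"),  # next hop
        0, 0,                   # input / output interface index
        packets, octets,
        50000, 59000,           # first / last seen, in sysUptime ms
        src_port, dst_port,
        0,                      # pad1
        0x18,                   # tcp_flags (PSH + ACK)
        6,                      # protocol 6 = TCP
        0,                      # tos
        0, 0,                   # src_as, dst_as
        24, 24,                 # src_mask, dst_mask
        0,                      # pad2
    )
    return header + record


def send_v5_packet(host, port=2055):
    """Send one test flow to the Logstash NetFlow input; returns bytes sent."""
    packet = build_v5_packet("192.0.2.10", "192.0.2.20", 40000, 443, 12, 3400)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        return sock.sendto(packet, (host, port))
    finally:
        sock.close()
```

Calling `send_v5_packet` in a loop with a short sleep is also a crude way to probe the throughput ceiling guessed at above.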

Source Configuration (Pi)
http://www.paluch.biz/blog/134-capturing-and-visualizing-sensor-data-using-the-elk-stack.html
▶ IoT (Internet of Things) simple solution:
– RasPi distance sensor :
– The Raspberry Pi regularly sends its data to Logstash over the TCP input as JSON.
  JSON is one of the simplest data formats available on IoT platforms.
– Logstash configuration:
    input {
      tcp {
        port => 9400
        codec => "json_lines"
      }
    }
    output {
      elasticsearch_http {
        host => "localhost"
        port => 9200
        index => "distance-%{+YYYY.MM.dd}"
      }
    }
– Python sender on the Pi:
    import json
    import socket
    import time

    from distancemeter import get_distance, cleanup

    # Logstash TCP/JSON host
    JSON_PORT = 9400
    JSON_HOST = '192.168.55.34'

    if __name__ == '__main__':
        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.connect((JSON_HOST, JSON_PORT))
            while True:
                distance = get_distance()
                data = {'message': 'distance %.1f cm' % distance,
                        'distance': distance,
                        'hostname': socket.gethostname()}
                # json_lines expects one JSON document per line
                s.sendall((json.dumps(data) + '\n').encode('utf-8'))
                print("Received distance = %.1f cm" % distance)
                time.sleep(0.2)
        # interrupt
        except KeyboardInterrupt:
            print("Program interrupted")
        finally:
            # cleanup() is imported above; calling it on exit is assumed from its name
            cleanup()

Source Configuration (Pi)

Source Configuration (The Force Awakens)

Useful things to know
▶ How do I install Logstash plugins ?
– /usr/local/nagioslogserver/logstash/bin/plugin install logstash-codec-cef
– (Installs ArcSight logfile handler…)
▶ Check the latest upgrade documentation for how to pause shard allocation:
– https://assets.nagios.com/downloads/nagios-log-server/docs/Upgrade-Instructions-For-Nagios-Log-Server.pdf
– On large clusters this makes a real difference to how long a rolling update takes
▶ One of my favourite filters :
    if [severity_label] == "Notice" and [program] == "sudo" {
      drop {}
    }
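For anyone less familiar with Logstash conditionals, the effect of this filter can be sketched in a few lines of Python. This only illustrates the logic and is not how Logstash implements it; the function names and sample events are hypothetical.

```python
def should_drop(event):
    """True when the event is a Notice-level message from sudo."""
    return event.get("severity_label") == "Notice" and event.get("program") == "sudo"


def apply_filter(events):
    """Keep every event the drop rule does not match."""
    return [event for event in events if not should_drop(event)]


# Hypothetical sample events for illustration.
events = [
    {"program": "sudo", "severity_label": "Notice", "message": "session opened"},
    {"program": "sudo", "severity_label": "Error", "message": "auth failure"},
    {"program": "sshd", "severity_label": "Notice", "message": "accepted key"},
]
kept = apply_filter(events)
print([event["message"] for event in kept])  # ['auth failure', 'accepted key']
```

Both conditions must match, so sudo errors and Notice messages from other programs are still indexed.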

▶ Get used to looking at curl -XGET 'http://localhost:9200/'
▶ Need the cluster state?
– # curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "80e9022e-f73f-429e-8927-xxxxxxxxxx",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 86,
  "active_shards" : 136,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 30
}
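A yellow status means every primary shard is allocated but some replicas are not yet. The health output above can be summarised with a short script; this is a sketch using the figures from the slide, and the function name `shard_summary` and the way the percentage is derived are my own choices rather than anything Nagios Log Server provides.

```python
import json

# Figures copied from the health output above.
HEALTH_JSON = """
{
  "status": "yellow",
  "number_of_nodes": 3,
  "active_primary_shards": 86,
  "active_shards": 136,
  "relocating_shards": 0,
  "initializing_shards": 6,
  "unassigned_shards": 30
}
"""


def shard_summary(health):
    """Summarise how far the cluster is from having every shard active."""
    pending = health["initializing_shards"] + health["unassigned_shards"]
    total = health["active_shards"] + pending
    return {
        "status": health["status"],
        "pending": pending,
        "active_replicas": health["active_shards"] - health["active_primary_shards"],
        "percent_active": 100.0 * health["active_shards"] / total if total else 100.0,
    }


summary = shard_summary(json.loads(HEALTH_JSON))
print("%(status)s: %(pending)d shards pending, %(percent_active).1f%% active" % summary)
# yellow: 36 shards pending, 79.1% active
```

Watching the pending count fall towards zero is a simple way to follow recovery after a node restart.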

▶ Monitoring the Nagios Log Server
– Other presentations will cover this topic; see Eric Loyd, Track 1 @ 2:30 today
▶ Mainly, query port 9200 locally (via NRPE) and then use check_procs for the
appropriate processes.
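A local check of port 9200 could look roughly like the following. It is a sketch of a Nagios-style plugin, not the check used in this environment: the mapping of green/yellow/red to OK/WARNING/CRITICAL is an assumption, and the names `evaluate` and `main` are illustrative.

```python
import json
from urllib.request import urlopen

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]
# Assumed mapping from cluster status to Nagios exit codes.
STATUS_TO_EXIT = {"green": OK, "yellow": WARNING, "red": CRITICAL}


def evaluate(health):
    """Turn a parsed _cluster/health document into (exit_code, message)."""
    status = health.get("status", "unknown")
    code = STATUS_TO_EXIT.get(status, UNKNOWN)
    message = "cluster status %s, %s nodes, %s unassigned shards" % (
        status,
        health.get("number_of_nodes", "?"),
        health.get("unassigned_shards", "?"),
    )
    return code, message


def main(url="http://localhost:9200/_cluster/health"):
    """Fetch cluster health, print one plugin-style line, return the exit code."""
    try:
        with urlopen(url, timeout=10) as response:
            health = json.loads(response.read().decode("utf-8"))
    except Exception as exc:  # connection refused, timeout, invalid JSON, ...
        print("UNKNOWN - %s" % exc)
        return UNKNOWN
    code, message = evaluate(health)
    print("%s - %s" % (LABELS[code], message))
    return code


# When saved as a plugin script, finish with: sys.exit(main())  (after "import sys")
```

Run through NRPE on each node, this complements the process checks: the processes can all be up while the cluster is still red.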
▶ To uninstall manually :-
– Stop all of the relevant NLS processes (elasticsearch, logstash, and httpd)
and remove the following directories:
– rm -rf /usr/local/nagioslogserver
– rm -rf /var/www/html/nagioslogserver
– You can now do a ./fullinstall

▶ If you run equipment that has to output syslog on port 514 (NetApp is an example),
Log Server can cope by listening on the privileged port
– There’s a document for this! https://assets.nagios.com/downloads/nagios-log-server/docs/Listening-On-Privileged-Ports-With-Nagios-Log-Server.pdf
– You can change logstash to run as the root user.
– Open /etc/sysconfig/logstash and find the line: LS_USER=nagios
– Change this line to read LS_USER=root
– Restart the logstash service: # service logstash restart

▶ Alternative method of log shipping:
– Was lumberjack but now logstash-forwarder (still the lumberjack protocol)
• Encrypted shipping of compressed logs
• Low impact compared to a full Logstash install
• Use self-signed certificates.
• Runs in EC2 micro instances
▶ CentOS 6
– wget http://packages.elasticsearch.org/logstashforwarder/centos/logstash-forwarder-0.3.1-1.x86_64.rpm
– rpm -ivh logstash-forwarder-0.3.1-1.x86_64.rpm
▶ CentOS 5
– wget http://download.elasticsearch.org/logstash-forwarder/packages/logstash-forwarder-0.3.1-1.x86_64.rpm
– rpm -ivh logstash-forwarder-0.3.1-1.x86_64.rpm

▶ Logstash plugins: over 180 at https://github.com/logstash-plugins
– Nice thing to know:
    output {
      if [type] == "syslog"
         and [program] == "jenkins"
         and [job] == "Install on Cluster"
         and "_grokparsefailure" not in [tags] {
        nagios_nsca {
          host => "nagios.example.com"
          port => 5667
          send_nsca_config => "/etc/send_nsca.cfg"
          message_format => "%{job} %{repo}"
          nagios_host => "jenkins"
          nagios_service => "deployed %{repo}"
          nagios_status => "2"
        }
      } # if type=syslog, program=jenkins, job="Install on Cluster"
    } # output
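To see what such an output hands to Nagios, the sketch below expands the %{field} references and builds the tab-separated passive check line that send_nsca reads on standard input (host, service, return code, plugin output). It illustrates the data flow only and is not the plugin's code; the helper names and the sample event are hypothetical.

```python
import re


def render(template, event):
    """Expand Logstash-style %{field} references from an event dict.

    Unknown fields are left as-is.
    """
    return re.sub(
        r"%\{(\w+)\}",
        lambda match: str(event.get(match.group(1), match.group(0))),
        template,
    )


def nsca_line(event):
    """Build one passive service check result line for send_nsca."""
    host = "jenkins"                              # nagios_host
    service = render("deployed %{repo}", event)   # nagios_service
    status = "2"                                  # nagios_status: 2 = CRITICAL
    message = render("%{job} %{repo}", event)     # message_format
    return "\t".join([host, service, status, message]) + "\n"


# Hypothetical event that would pass the conditional above.
event = {"type": "syslog", "program": "jenkins",
         "job": "Install on Cluster", "repo": "webapp"}
print(repr(nsca_line(event)))
# 'jenkins\tdeployed webapp\t2\tInstall on Cluster webapp\n'
```

The Nagios side needs a matching passive service ("deployed webapp" on host "jenkins" in this example), otherwise the result is discarded.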

Initial Dashboards
▶ Apache dashboard :-
Hmm – what are the 404’s ?

Initial Dashboard

Initial Dashboards
▶ Zoom in by clicking on the 404 part of the Pie chart :-
Ah ! A good idea to find win40.jpg then.

Final Dashboards

Final Dashboards

Performance
▶ A good way to help control ES memory usage is to limit the indices fielddata
cache size. Limiting this cache makes sense because you rarely need to retrieve
logs that are older than a few days. By default ES keeps fielddata for old
indices in memory and never lets it go, so unless you have unlimited memory it
makes sense to set a limit.
▶ To limit the cache size, add the following value anywhere in your custom
elasticsearch.yml configuration file. This setting, together with adjusting the
Java heap size, should be enough to get started, but there are a few other things
that might be worth checking.
▶ indices.fielddata.cache.size: 40%

Performance
▶ Another easy performance boost is disabling swap if it has been enabled. In
most cloud environments and images swap is already turned off, but it is always
a setting worth checking.
▶ To bypass the OS swap setting you can tell ES to lock its memory by adding the
following to your elasticsearch.yml configuration file.
• bootstrap.mlockall: true
– To check that this value has been configured properly you can run this
command.
– curl http://localhost:9200/_nodes/process?pretty
– This may cause memory warnings when ES starts up (e.g. "Unable to lock JVM
memory (ENOMEM). This can result in part of the JVM being swapped out.
Increase RLIMIT_MEMLOCK (ulimit)."). ES still runs, but the warning means the
memory was not locked; to fix it, raise or remove the memlock limit at the OS level.
▶ CentOS /etc/sysctl.conf:
– fs.file-max = 16384
▶ CentOS /etc/security/limits.conf:
– * - nofile 16384
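Both settings can be verified from a script. The sketch below does two things: it scans a `_nodes/process` response for nodes that do not report `mlockall: true`, and it reads the open-file and locked-memory limits of the current process. The sample document is a trimmed, hypothetical response, the function names are illustrative, and the limits shown belong to whichever user runs the script, so run it as the account that starts Elasticsearch.

```python
import resource


def nodes_without_mlockall(nodes_doc):
    """Return names of nodes whose process info does not report mlockall: true."""
    missing = []
    for node_id, info in nodes_doc.get("nodes", {}).items():
        if not info.get("process", {}).get("mlockall", False):
            missing.append(info.get("name", node_id))
    return sorted(missing)


def current_limits():
    """Return (soft, hard) limits for open files and locked memory."""
    return {
        "nofile": resource.getrlimit(resource.RLIMIT_NOFILE),
        "memlock": resource.getrlimit(resource.RLIMIT_MEMLOCK),
    }


# Hypothetical, trimmed example of a _nodes/process response.
SAMPLE = {
    "nodes": {
        "abc123": {"name": "nls-node-1", "process": {"mlockall": True}},
        "def456": {"name": "nls-node-2", "process": {"mlockall": False}},
    }
}

print(nodes_without_mlockall(SAMPLE))  # ['nls-node-2']
print(current_limits())
```

A node listed by the first function has not locked its memory and is likely hitting the memlock limit that the second function reports.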

Performance
▶ Rules of thumb :-
– Due to issues with JVM heap size, individual Elasticsearch nodes don't scale well
beyond 64GB of RAM. After reaching 64GB of RAM (with 31GB allocated to
the Java heap), you should scale horizontally rather than vertically.
– Elasticsearch has a lot of optimizations built around fast retrieval from disk,
and a lot of knobs you can tweak to ensure that the most frequently searched
indices live on SSD.
– With respect to the concern about high-volume indexing causing search
performance problems: if this is a problem you can use index routing to help
by ensuring that data is indexed on nodes with the fastest disk (say SSD in
RAID 0), then moved to nodes with spinning disk. If your cluster is search-
heavy you could also increase the number of replica shards, which requires
more storage but decreases search time.
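The heap rule of thumb translates into simple arithmetic. This sketch assumes the commonly quoted guidance of giving the heap about half of physical RAM, capped at the 31GB figure mentioned above, and applies the 40% fielddata cache limit from the earlier slide; the half-of-RAM ratio and the function names are assumptions, not something stated in the slides.

```python
def recommended_heap_gb(ram_gb, heap_cap_gb=31):
    """Heap size: half of RAM (assumed ratio), never more than the cap."""
    return min(ram_gb / 2.0, heap_cap_gb)


def fielddata_budget_gb(heap_gb, cache_fraction=0.40):
    """Memory the fielddata cache may use with indices.fielddata.cache.size: 40%."""
    return heap_gb * cache_fraction


for ram in (16, 32, 64, 128):
    heap = recommended_heap_gb(ram)
    print("%3d GB RAM -> %4.1f GB heap, %4.1f GB fielddata cache"
          % (ram, heap, fielddata_budget_gb(heap)))
```

Beyond 64GB the heap stays at the cap, which is the point at which adding nodes becomes more useful than adding RAM.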

Conclusions
▶ Obvious ones first :
– You can’t run this on a RaspberryPi ! (Or maybe you can – ask me outside
this presentation….)
– You need log sources that matter
– You need time to develop filters and alerts that make sense to your
organisation.
▶ Anything can be a logfile
– You can point Logserver at any readable file and parse the content

Questions

Atos, the Atos logo, Atos Consulting, Atos Worldgrid, Worldline,
BlueKiwi, Bull, Canopy the Open Cloud Company, Yunano, Zero Email,
Zero Email Certified and The Zero Email Company are registered
trademarks of the Atos group. July 2015. © 2015 Atos. Confidential
information owned by Atos, to be used by the recipient only. This
document, or any part of it, may not be reproduced, copied, circulated
and/or distributed nor quoted without prior written approval from
Atos.
Thanks
For more information please contact:
T+ 33 1 98765432
M+ 44 (0) 7973226073
dave.2.williams@atos.net
