Main Page

From I Will Fear No Evil

Fault Management notes, thoughts, and example code

My code is finally up on github.com! Right now I am squishing bugs that are not critical but look bad. I have also begun work on the ECE (event correlation engine) for the different dashboards. I still have not gotten to an installer or added the seed data for the database. I will be focusing on that soon, but I want a little more functionality in place to prove the tool is useful first.

I have also had some thoughts on how to templatize some of the checks better. I am still kicking the idea around to see what weak points I will run into, but the basic idea is: give me an SNMP table, and if the returned data is numeric, graph and save it. If args are passed for what we care about, parse the table and raise events on what is found, all at the template level. I think I can build a simple skeleton where just about any check can be added this way: string comparison, existence, or number comparison. I like this idea, since we can get a lot of data but 99% of the time we only care about specific parts of a given table. I would like to make this something that can be created from the web page if possible, but I am not sure how ugly that would get.
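
The template idea can be roughed out in a few lines. This is only a Python sketch for illustration (the backend itself is PHP), and it assumes the SNMP table has already been walked into a list of row dicts. The names `compare`, `evaluate_table`, and the rule layout are made up here and are not taken from the NMS code.

```python
from numbers import Number


def compare(op, value, expected=None):
    """Return True when the rule matches, meaning an event should be raised."""
    if op == "exists":
        return value is not None
    if op == "missing":
        return value is None
    if value is None:
        return False
    if op == "str_eq":
        return str(value) == str(expected)
    if op == "str_ne":
        return str(value) != str(expected)
    # Number comparisons assume the column really holds numeric data.
    if op == "gt":
        return float(value) > float(expected)
    if op == "lt":
        return float(value) < float(expected)
    raise ValueError(f"unknown operator: {op}")


def evaluate_table(rows, rules=()):
    """Collect every numeric cell as a metric and apply optional event rules."""
    metrics, events = [], []
    for index, row in enumerate(rows):
        for column, value in row.items():
            # Anything numeric gets graphed/saved without needing a rule.
            if isinstance(value, Number) and not isinstance(value, bool):
                metrics.append((index, column, value))
        for rule in rules:
            value = row.get(rule["column"])
            if compare(rule["op"], value, rule.get("expected")):
                events.append((index, rule["column"], rule["op"], value))
    return metrics, events
```

With no rules passed, the function only returns metrics. Passing something like `{"column": "ifOperStatus", "op": "str_ne", "expected": "up"}` adds an event for every row where that column is not "up", which matches the "we only care about specific parts of a given table" case.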


I am mainly updating this wiki for the NMS work I am doing, and attempting to get some more exposure in the greater world for assistance on code. I am not a developer, just someone who is frustrated at a lack of good tools. I will be adding the NMS up to GitHub in the future as two separate repos. The first one will be the API server. The second is the front-end UI portion.

The backend is written in PHP, and has recently been migrated from PHP 7.4 to 8.1. I used the Slim 4 skeleton as the base of the application.

The frontend uses Bootstrap, PHP, and JavaScript.

Update status: 01-02-25

  • Currently working on getting the UI to not look like a first time high school project
  • Minor bug squishing
  • Investigating template rework to better calculate percentages to store in metrics
  • Beginning to investigate storing metric data in influxdb
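
For the last two items, here is a rough Python sketch (not the project's PHP code) of turning raw used/total values into a percentage and formatting it as an InfluxDB line protocol record. The measurement name, tag, field name, and the host `web01` are placeholders invented for the example, and only float fields are handled.

```python
def percent_used(used, total):
    """Percentage of total that is in use, or None when total is zero/missing."""
    if not total:
        return None
    return round(100.0 * used / total, 2)


def _escape(text):
    # Line protocol: commas, equals signs and spaces in tag keys, tag values
    # and field keys are escaped with a backslash.
    return str(text).replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value field=value timestamp."""
    name = str(measurement).replace(",", r"\,").replace(" ", r"\ ")
    tag_part = "".join(
        f",{_escape(key)}={_escape(value)}" for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(key)}={float(value)}" for key, value in fields.items()
    )
    return f"{name}{tag_part} {field_part} {int(timestamp_ns)}"


pct = percent_used(1536, 2048)
print(to_line("memory", {"host": "web01"}, {"used_percent": pct}, 1700000000000000000))
```

Storing the calculated percentage as its own field keeps the graphing side simple, at the cost of doing the math at collection time.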

Update status: 10-30-24

  • Currently hitting performance limits with 2 cores/2 GB RAM and 61 servers. The server is getting laggy.
  • Will recommend 2x2 (2 cores/2 GB RAM) for setups with fewer than 50 servers.
  • > 51 and < 125 servers: testing as 4x2. This is not a RAM-intensive app; it is more thread- and core-bound. Drive I/O is going to become the bottleneck before RAM does.

Update status: 10-15-24

  • Purchased CanvasJS license (I may hate JS, but it makes pretty pics)


Update status: 10-08-24

  • Minor bug squishing
  • Testing using evil JavaScript to make pretty graphs
  • Going to license CanvasJS for this (even a n00b can figure it out)


Update status: 08-18-24

  • Some minor bug fixes, but not too much coding done
  • Beginning to focus on the ECE and seeing what a mess it is. Not happy about it. Will likely redesign it, since it was only a skeleton anyway. This needs to be simple and understandable, dammit.
  • Looking at changing frontend UI for main page.
  • Building more templates for standards
  • Double-checking my logic for NRPE or shell commands. This could use more work. An NRPE failure caused by something other than the command itself should have discrete alarm values. Need to think about this more.
  • SNMP checks need to be smarter, and need parsable thresholds on a per-host basis, not a per-check basis.
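
One way the per-host threshold idea could look, as a Python sketch only (the real backend is PHP): check-level defaults with optional per-host overrides merged on top. The check name, host names, and numbers are invented for the example.

```python
# Check-level defaults, used when a host has no override of its own.
DEFAULT_THRESHOLDS = {
    "disk_used_percent": {"warning": 80, "critical": 90},
}

# Per-host overrides; only the keys that differ need to be listed.
HOST_OVERRIDES = {
    "db01": {"disk_used_percent": {"warning": 93, "critical": 95}},
}


def thresholds_for(host, check, defaults=DEFAULT_THRESHOLDS, overrides=HOST_OVERRIDES):
    """Return the effective thresholds for one host and one check."""
    merged = dict(defaults.get(check, {}))
    merged.update(overrides.get(host, {}).get(check, {}))
    return merged


def severity(value, limits):
    """Map a numeric value to ok/warning/critical using the given limits."""
    if "critical" in limits and value >= limits["critical"]:
        return "critical"
    if "warning" in limits and value >= limits["warning"]:
        return "warning"
    return "ok"
```

Here 92% disk usage is critical on a host using the defaults but still ok on `db01`, without needing a second copy of the check.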


Update status: 04-16-24

  • Still squishy-squishy on bugs
  • Focusing on stability and clean UI

Current Loads: 47 devices

  • 2 core 2 GB RAM
  • Load averages consistent: 0.70,1.5,1.6
  • From this, with an average number of checks (~10 per host), reliable monitoring can be done on 50 devices with decent results and minimal hardware for small environments.
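
For a feel of what that load means, here is a back-of-the-envelope calculation in Python. It assumes every check runs once per 5 minute cycle (the cycle discussed further down this page); the function name is made up for the example.

```python
def checks_per_second(devices, checks_per_device, cycle_seconds=300):
    """Average check rate needed to finish every check once per cycle."""
    return devices * checks_per_device / cycle_seconds


rate = checks_per_second(47, 10)
print(f"{rate:.2f} checks/second")  # prints "1.57 checks/second"
```

This is only an average. Real pollers tend to run checks in bursts, so peak load will be higher than this figure.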

Update on status: 03-21-2024

  • Many Many squishes of bugs
  • Reporting engine and templates much more usable
  • More documentation links for application with base examples written
  • Initial Event Correlation Engine (ECE) rules written
  • Some pages written for ECE
  • Did I mention squishing bugs?

Update on status: 02-11-2024


The focus of this wiki is for notes and gotchas for things relating to technology. It is mainly focused on the fault management tool that I am writing. However oddball things I have found or commonly forget since I rarely use them are also present. I do not go too in depth on the notes or what I find. In general it is more a quick summary and if reasonable an example that shows what the result is. I suspect that as time passes it will become more focused on the tool that I am writing. However there will likely be oddball stuff thrown in here as well that does not have to do with fault management at all....

Overall idea

I believe that fault management is commonly overlooked in many companies and is more of a bolt-on after they have been embarrassed by an outage or network event that a customer noticed. This feels backwards to me. Fault management really should be one of the first things put into place, and it should grow with your software and network over time. Being able to tell at a glance the "truth" of your network is almost as critical as the network running itself! Customer monitoring isn't monitoring and simply makes you look bad. If a customer or end user has to tell you something is down, you have failed. If it happens more than once, that should set off all kinds of warnings to technicians. This should not happen. In my opinion the problem comes from management seeing monitoring as a time suck that does not really return money on their investment of time. The only time they care is when they get embarrassed, and after that they lose interest until they are embarrassed again. This is stupid. A company or technician must be able to say with confidence that their applications are working AND PROVE IT. If you can't prove anything, it is just a hope that things are working as designed. Until you know, you are simply guessing at the health and availability of an application.

I have noticed a trend recently that companies are going to what I call "implied" monitoring: looking at metrics more than actively validating an application or service. While this does somewhat fit the bill for monitoring, it does a major disservice to the technicians who need to support the application. Usually this kind of monitoring will cry when something is down but does not tell you what the issue is or where it is happening, only that it is happening. It is also usually slower to report failures, which gives a customer more time to find the issue before the technician does. Parsing logs and dropping them in search indexes, or forwarding matches as events, are all very useful, but they are not really watching for specific application issues at the host level. I believe that an organization must work up to this kind of monitoring. It must be built on the basics. If you do not have the basics in place, then the advanced monitoring is much less useful for an org.

My common way to approach monitoring is to start as fundamental as possible. On Linux based systems this is the way that has given me the best results.

  • daemon
    • dead daemon, well dead app :)
  • port
    • zombie processes holding a port, or different daemons attempting to use the same port is bad
  • challenge and response.
    • Verify which app has control of the port (did you get an http, email, ssh header? Did it respond at all?)
  • performance
    • This is the n+1 point. After confirming the application is running, NOW is the time to verify it is at a basic level performant.
  • log and event parsing
    • In depth log parsing, and event correlations. Without the above this is much less useful to a technician.
      • I have seen in RARE cases this done well, where it states what failed and which host. However that is much less likely to happen as the logs parsed do not always state which application is at fault, only that there is one present.
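
The port, challenge-and-response, and basic performance layers can be sketched together in Python using only the standard library. This is an illustration, not the NMS implementation: the function name and status strings are invented, the daemon check is left out because it has to run on the host itself, and the banner read only suits protocols where the server speaks first (SSH, SMTP, and similar). HTTP would need a request sent before reading.

```python
import socket
import time


def check_service(host, port, expected_prefix, timeout=3.0):
    """Port check, then challenge/response, then a basic timing.

    Returns (status, seconds), where status is one of
    "port_closed", "no_response", "wrong_banner" or "ok".
    """
    start = time.monotonic()
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable or timed out: nothing is answering on the port.
        return "port_closed", time.monotonic() - start
    with conn:
        conn.settimeout(timeout)
        try:
            banner = conn.recv(256)
        except OSError:
            banner = b""
    elapsed = time.monotonic() - start
    if not banner:
        # Something holds the port but never answered.
        return "no_response", elapsed
    if not banner.startswith(expected_prefix):
        # Something answered, but not the application we expected.
        return "wrong_banner", elapsed
    return "ok", elapsed
```

A call such as `check_service("127.0.0.1", 22, b"SSH-")` separates "nothing is listening" from "the wrong thing is listening", and the elapsed time gives a cheap first performance number.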

How you get the answers is less important than getting the answers, with one gotcha. There is no reason to degrade the host with a complex and slow service check. If the validation cannot be done cheaply, then it is likely something that should be broken down into more basic pieces. Killing your servers or pod with service checks is exactly backwards of what you need. Simple, fast, and accurate are what you need to focus on. Inaccurate data is WORSE than no data. Another useful trick is to assume failure until the application proves that it is doing what you expect. I always try to avoid a bias of assuming something works until it can prove it in a service check. Until that time, it is only your opinion that things are working.
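
The "assume failure until proven otherwise" habit can be built into the check runner itself. A minimal Python sketch, with `run_check` as a made-up wrapper name and the status strings chosen for the example:

```python
def run_check(check, *args, **kwargs):
    """Run a check callable; report OK only when it positively proves health."""
    # Start from failure. Only an explicit True from the check changes that.
    status, detail = "CRITICAL", "check did not complete"
    try:
        if check(*args, **kwargs) is True:
            status, detail = "OK", "check passed"
        else:
            detail = "check returned a non-passing result"
    except Exception as exc:
        # A crashing check is not evidence that the service is healthy.
        detail = f"check raised {type(exc).__name__}: {exc}"
    return status, detail
```

A check that returns nothing, returns something ambiguous, or blows up all end up as failures, so a bug in the check cannot silently report a healthy service.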

Historically this has been done with SNMP and NRPE service checks on a common 5 minute iteration cycle. This allows many checks to be run across a fleet of hosts within the cycle, and it is a decent starting point for monitoring hosts. I realize that some technicians prefer a faster cycle time, but at the end of the day, you are still talking about human response times and investigation times. Getting things like sub-minute reporting does no good at all to a technician who is troubleshooting an issue. Additionally, fast cycle times do not really lend themselves well to the concept of retry. You will lose packets from time to time, and your application will do weird crap from time to time. That's just fact. Make sure your monitoring does not lose its mind and scream that the world is burning due to a transient packet-loss issue. Using something like retry will make it more likely you are picking up a legitimate event and not a transient failure.
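
A small Python sketch of the retry idea: a failure only becomes an alarm after several consecutive failed attempts, and a single success resets the count. The class name, the "soft"/"hard" labels, and the default of three attempts are choices made for this example, not settings from the NMS.

```python
class RetryTracker:
    """Track consecutive failures per check so transient blips do not alarm."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.failures = {}

    def record(self, key, passed):
        """Return "ok", "soft" (failing, still retrying) or "hard" (alarm)."""
        if passed:
            # Any success clears the failure history for this check.
            self.failures.pop(key, None)
            return "ok"
        count = self.failures.get(key, 0) + 1
        self.failures[key] = count
        return "hard" if count >= self.max_attempts else "soft"
```

One dropped packet shows up as a single "soft" result and then clears on the next pass. Only a check that keeps failing reaches "hard" and should page anyone.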

Every time there is an "Outage" or severe service degradation, one of the first questions that should be brought up is: did the monitoring catch this issue? Was it actually granular enough to state what the issue was, or did it only catch a side effect of the issue? An alarm for website down and an alarm for database down are two very different things. Both will imply a 100% outage, but usually one is faster to repair than the other. Also, looking at the wrong thing (say, the webserver when the database is toast) simply slows down getting the database back online.
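
The website versus database case is the kind of thing a correlation rule can sort out. Below is a hedged Python sketch that assumes a hand-maintained dependency map; `DEPENDS_ON` and `root_causes` are names invented for the example and are not the ECE's actual rules.

```python
# Which services each service needs in order to work (hypothetical map).
DEPENDS_ON = {
    "website": ["database"],
    "database": [],
}


def root_causes(failing, depends_on):
    """Split failing services into likely root causes and symptoms."""
    failing = set(failing)
    roots, symptoms = [], []
    for service in sorted(failing):
        # A failing service whose dependency is also failing is a symptom.
        blamed = [dep for dep in depends_on.get(service, []) if dep in failing]
        if blamed:
            symptoms.append((service, blamed))
        else:
            roots.append(service)
    return roots, symptoms
```

When both are down, the database is reported as the root cause and the website as a symptom of it, which points the technician at the right box first. When only the website is down, it is its own root cause.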

History of this wiki

MediaWiki destination for random notes and examples I do not want to lose or forget.

  • The site is for myself and friends who commonly use bash and other utilities and can have a one-stop-shop to find that oddball thing that was found six months ago and vaguely remembered. The site overall is not for the general public, however if you make it in here feel free to browse.
  • Keep in mind however I do have security measures in place and poking hard at stuff will block you from the domain for 30 days + if you hammer really hard.
  • If someone wants access to actually ADD information in here please sign up for an account, and I will likely grant access. I do try to set everything into some kind of category for easier searches as well as an attempt to keep this somewhat organized. Whenever possible (or I remember to do it) I do try to link to the original sources of the information. They are usually from SE, or other Q&A sites, so some of the comments are useful as well.
  • This is not wikipedia, likely the site is not going to be polished, since it is more of a catchall wiki on doing different things. There is not going to be too much rhyme or reason on what is posted on this wiki. Overall if it is something that I have had to do more than once and look up every time, I will have notes on it in here so I will not have to search next time.
  • There will be times that there are references to personal servers on my network, or oddball hostnames that are tied to iwillfearnoevil.com. It is unlikely that access will be granted to those hosts unless there is a very specific reason to do so. So don't bother asking :P
  • Category NMS has notes on the progress of my NMS design, and thoughts on monitoring overall.





Misc Notes on using Mediawiki

Consult the User's Guide for information on using the wiki software.

Getting started