Main Page
Fault Management notes, thoughts, and example code
The focus of this wiki is notes and gotchas for things relating to technology. It is mainly focused on the fault management tool that I am writing. However, oddball things I have found, or commonly forget since I rarely use them, are also present. I do not go too in depth on the notes or what I find. In general it is more a quick summary and, if reasonable, an example that shows what the result is.
Overall idea
I believe that fault management is commonly overlooked in many companies and is treated as a bolt-on after they have been embarrassed by an outage or hack. This feels backwards to me. Fault management really should be one of the first things put into place, and it should grow with your software and network over time. Being able to tell at a glance the "truth" of your network is almost as critical as the network running itself! Letting customers do your monitoring isn't monitoring and simply makes you look bad. If a customer or end user has to tell you something is down, you have failed. If it happens more than once then that should set off all kinds of warnings to technicians. This should not happen. In my opinion the problem comes from management seeing monitoring as a time suck that does not really return money on their investment of time. The only time they care is when they get embarrassed, and after that they lose interest until they are embarrassed again. This is stupid. A company or technician must be able to say with confidence that their applications are working AND PROVE IT. If you can't prove anything, it is just a hope that things are working as designed. Until you know, you are simply guessing at the health and availability of an application.
I have noticed a trend recently that companies are going to what I call "implied" monitoring: looking at metrics more than actively validating an application or service. While this does somewhat fit the bill for monitoring, it does a major disservice to the technicians that need to support the application. Usually this kind of monitoring will cry when something is down but does not tell you what the issue is or where it happens, only that it is happening. It is also usually slower to report failures. That gives a customer more time to find the issue before the technician does. Parsing logs, dropping them in search indexes, or forwarding matches as events are all very useful, but they are not really watching for specific application issues at the host level. I believe that an organization must work up to this kind of monitoring. It must be built on the basics. If you do not have the basics in place, then the advanced monitoring is much less useful for an org.
My common way to approach monitoring is to start as fundamental as possible. On Linux-based systems this is the order that has given me the best results:
- daemon
  - dead daemon, well dead app :)
- port
  - zombie processes, or different daemons attempting to use the same port, are bad
- challenge and response
  - Verify which app has control of the port (did you get an http, or ssh header?)
- performance
  - This is the n+1 point. After confirming the application is running, NOW is the time to verify it is at a basic level performant.
- log and event parsing
  - In depth log parsing, and event correlations. Without the above this is much less useful to a technician.
    - In RARE cases I have seen this done well, where it states what failed and on which host. However that is much less likely to happen, as the logs parsed do not always state which application is at fault, only that a fault is present.
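The first four layers of the list above can be sketched in a few lines of Python. This is only a rough illustration, not the tool this wiki is about: the function names (daemon_running, classify_banner, probe), the proc_root and banner_wait parameters, and the timeouts are all assumptions made up for the example. It relies on two well-known protocol behaviors: an SSH server sends a line starting with "SSH-" as soon as you connect, and an HTTP server answers a request with a status line starting with "HTTP/".

```python
import os
import socket
import time


def daemon_running(name, proc_root="/proc"):
    """Layer 1 (daemon): is a process with this command name alive?

    Linux only. The kernel truncates the comm name to 15 characters.
    proc_root is a hypothetical parameter so the scan can be pointed elsewhere.
    """
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    return True
        except OSError:
            continue  # the process exited while we were looking
    return False


def classify_banner(data):
    """Layer 3 helper: decide which protocol answered, from the first bytes."""
    if data.startswith(b"SSH-"):
        return "ssh"
    if data.startswith(b"HTTP/"):
        return "http"
    return "unknown"


def probe(host, port, timeout=2.0, banner_wait=0.5):
    """Layers 2-4: open the port, identify the service, time the exchange.

    Returns (service, seconds). service is None when the port could not be
    opened or the exchange failed, so the caller never gets an implied "ok".
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(banner_wait)
            try:
                data = sock.recv(64)  # servers such as sshd speak first
            except socket.timeout:
                data = b""
            if not data:
                # Nothing volunteered, so challenge with a minimal HTTP request.
                sock.settimeout(timeout)
                sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
                data = sock.recv(64)
    except OSError:
        return None, time.monotonic() - start
    return classify_banner(data), time.monotonic() - start
```

A call such as probe("127.0.0.1", 22) answers "is the port open", "who actually owns it", and "how long did that take" in one cheap round trip. Note that for services that wait for the client to speak first, the measured time includes the banner_wait pause, so treat the number as a coarse health signal rather than a benchmark.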
How you get the answers is less important than getting the answers, with one gotcha. There is no reason to degrade the host with a complex and slow service check. If the validation cannot be done cheaply, then it is likely something that should be broken down into more basic pieces. Killing your servers or pod with service checks is exactly backwards of what you need. Simple, fast, and accurate are what you need to focus on. Inaccurate data is WORSE than no data. Another useful trick is to assume failure until the application proves that it is doing what you expect. I always try to avoid a bias of assuming something works until it can prove it in a service check. Until that time, it is only your opinion that things are working.
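The "assume failure until proven" habit can be written directly into a check wrapper. The sketch below is an assumption-laden illustration, not code from the tool itself: run_check, budget_seconds, and the OK/CRITICAL values (borrowed from the common Nagios-style plugin exit codes) are names chosen for the example. The status starts as CRITICAL and is only changed to OK when the check explicitly returns True inside its time budget.

```python
import time

# Nagios-style status numbers; using them here is an assumption.
OK, CRITICAL = 0, 2


def run_check(check, budget_seconds=1.0):
    """Run check() and report OK only if it proves itself within the budget."""
    status = CRITICAL                   # assume failure until proven otherwise
    detail = "check did not complete"
    start = time.monotonic()
    try:
        result = check()
        elapsed = time.monotonic() - start
        if result is not True:
            # Anything other than an explicit True is not proof.
            detail = f"check returned {result!r}"
        elif elapsed > budget_seconds:
            # A slow answer is treated as a failure of the check itself.
            detail = f"passed but took {elapsed:.2f}s (budget {budget_seconds:.2f}s)"
        else:
            status, detail = OK, f"verified in {elapsed:.2f}s"
    except Exception as exc:
        detail = f"check raised {exc!r}"
    return status, detail
```

For example, run_check(lambda: False) and run_check(lambda: 1 / 0) both come back CRITICAL, and so does a check that returns True but blows through its budget. The wrapper only measures the elapsed time after the fact; it does not interrupt a hung check, so the check itself still needs its own timeouts.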
History of this wiki
MediaWiki destination for random notes and examples I do not want to lose or forget.
- The site is for myself and friends who commonly use bash and other utilities, so we can have a one-stop shop to find that oddball thing that was found six months ago and vaguely remembered. The site overall is not for the general public, however if you make it in here feel free to browse.
- Keep in mind however that I do have security measures in place, and poking hard at stuff will block you from the domain for 30 days, or longer if you hammer really hard.
- If someone wants access to actually ADD information in here please sign up for an account, and I will likely grant access. I do try to set everything into some kind of category for easier searches as well as an attempt to keep this somewhat organized. Whenever possible (or I remember to do it) I do try to link to the original sources of the information. They are usually from SE, or other Q&A sites, so some of the comments are useful as well.
- This is not Wikipedia. The site is likely not going to be polished, since it is more of a catchall wiki on doing different things. There is not going to be too much rhyme or reason to what is posted on this wiki. Overall if it is something that I have had to do more than once and look up every time, I will have notes on it in here so I will not have to search next time.
- There will be times that there are references to personal servers on my network, or oddball hostnames that are tied to iwillfearnoevil.com. It is unlikely that access will be granted to those hosts unless there is a very specific reason to do so. So don't bother asking :P
- Category NMS has notes on the progress of my NMS design, and thoughts on monitoring overall.
Misc Notes on using MediaWiki
Consult the User's Guide for information on using the wiki software.