Event correlations
Event Correlation Logic
With typical fault management tools, a single fault can generate many different events. Nagios attempts to address this with parent-child relationships, and Zenoss addresses it in its own way. However, the flat approach of Nagios and the opaque one Zenoss uses do not seem to go far enough, or are not flexible enough, to deal with events well when this happens. You can really get into the weeds going this route, and I feel an implied relationship is likely a better solution when working on a correlation engine.
How can we go about this? My thinking on the blood-and-guts details is still being fleshed out, but I feel a solution would look something like the following at a host / application group level:
- Application Group A
- load-balancer > frontend > backend > database
- An application failure in the database, for example, would generate a lesser alarm in the backend showing that it is impacted, and so on all the way up the chain to the load-balancer, each stating that it COULD be impacted (yes, a db down would likely affect all, but never assume anything until it is proven)
- An application failure in the backend would generate lesser alarms both at the frontend and database stating they COULD be impacted
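The chain above can be sketched in a few lines. This is only an illustration of the idea, not a finished design: the function name `propagate` and the alarm labels `FAILED` and `COULD_BE_IMPACTED` are made up for this example.

```python
# Ordered tiers for one application group, taken from the chain above.
CHAIN = ["load-balancer", "frontend", "backend", "database"]


def propagate(chain, failed_tier):
    """Return {tier: alarm} for a single application failure.

    The failing tier gets the real alarm. Every other tier in the chain
    only gets a lesser "could be impacted" alarm, because nothing is
    assumed until it is proven.
    """
    if failed_tier not in chain:
        raise ValueError(f"unknown tier: {failed_tier}")
    return {
        tier: "FAILED" if tier == failed_tier else "COULD_BE_IMPACTED"
        for tier in chain
    }


alarms = propagate(CHAIN, "backend")
```

With a backend failure, both the frontend and the database end up with the lesser alarm, matching the second example in the list.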
In effect, this would be a parent-child relationship, but one that is more easily visualized for the technician. Additionally, application events and host or pod events should be as isolated from each other as possible. Managers don't care that a drive is at 90% full. The technicians and team members do, but in general managers not as much, since we are not at an impacting event yet. This would additionally help in isolating the source of the problem. If you have a drive full, and your application dies, well, there is usually a good reason the application died. :) Being able to point at an unaddressed alarm for a full drive that has been firing for 3 days before a database crash is much easier when low level problems do not get rolled up into application problems. They are addressed differently, and while one can cause events in the other, they really are discrete "things". Applications should fail much more often than the OS or host does.
Following this thought process, it should be reasonable to assign an integer value, like a key => value array, for a given application at the host level. Something stating database[0] relations[frontend[0]=>2, load-balancer[0]=>3, backend[0]=>1] may be a reasonable option. Testing and adjustment would be necessary, but this appears to be a reasonable path forward. How far a host is from the application having trouble implies the level of impact, which will be at LEAST a degradation of service. So setting up a six-degrees-of-Kevin-Bacon type of family would reflect this. It is possible at this point to put in additional logic for these kinds of correlations; however, they will quickly become complex, and will likely be of less value overall once they get past a certain complexity.
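As a rough sketch of that key => value idea, the integers could simply be derived from how many hops separate two tiers in the chain. The function name `relation_weights` is hypothetical, and using plain hop count as the weight is an assumption that would need the testing and adjustment mentioned above.

```python
CHAIN = ["load-balancer", "frontend", "backend", "database"]


def relation_weights(chain, source):
    """Map every other tier to its hop distance from `source`."""
    origin = chain.index(source)
    return {
        tier: abs(position - origin)
        for position, tier in enumerate(chain)
        if tier != source
    }


weights = relation_weights(CHAIN, "database")
print(weights)  # {'load-balancer': 3, 'frontend': 2, 'backend': 1}
```

This reproduces the database[0] example: backend at 1, frontend at 2, load-balancer at 3.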
How to define the relationships
- Define a "group" such as Application A
- Define hosts associated with Application A
- Define generic relationships between the hosts.
- This could be defined as a template, so pods, or ec2 instances can be added or altered in a reproducible fashion
- You can have many applications related to the same host
- KVM, Application A, Reporting Servers, etc.
- This would allow discrete movement of hosts under these umbrellas without changing the relationships
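One way the group, host, and template pieces might fit together is sketched below. All class and function names (`GroupTemplate`, `ApplicationGroup`, `groups_for`) and the host name `db01` are invented for illustration; a real version could just as easily live in flat files or a database.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GroupTemplate:
    """Reusable, ordered list of roles; the relationships live here."""
    roles: tuple


@dataclass
class ApplicationGroup:
    name: str
    template: GroupTemplate
    hosts: dict = field(default_factory=dict)  # host name -> role

    def add_host(self, host, role):
        # Hosts come and go; the template (the relationships) never changes.
        if role not in self.template.roles:
            raise ValueError(f"{role!r} is not a role in this template")
        self.hosts[host] = role


def groups_for(host, groups):
    """A host may sit under several umbrellas at once."""
    return [group.name for group in groups if host in group.hosts]


web_stack = GroupTemplate(("load-balancer", "frontend", "backend", "database"))
app_a = ApplicationGroup("Application A", web_stack)
app_a.add_host("db01", "database")

reporting = ApplicationGroup("Reporting Servers", GroupTemplate(("reporting",)))
reporting.add_host("db01", "reporting")

memberships = groups_for("db01", [app_a, reporting])
```

Because hosts only reference a role, moving or replacing a host does not touch the relationships defined in the template.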
Event Definitions
- Define all events by default as OS level events
- Explicitly flag service checks or events that are NOT OS level
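A minimal sketch of that default, assuming a simple record per event definition. The class name `EventDefinition`, the field `os_level`, and the two event names are placeholders, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass
class EventDefinition:
    name: str
    os_level: bool = True  # everything is an OS level event unless flagged


# No flag given, so this stays an OS level event.
disk_full = EventDefinition("disk_usage_high")

# Explicitly flagged as NOT OS level, i.e. an application event.
db_down = EventDefinition("database_unreachable", os_level=False)
```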
Correlation Engine Actions
- Define a hierarchy from the events seen at the application level and the relationships defined for the application
- Do NOT alter an active events screen, but have a secondary specific to correlation
- Having two different views of the same issue can assist in isolation of the root of the problem
- The correlation engine can support OS events by treating things like KVM and VMware as applications, since they really are
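The "secondary view" idea could look something like the sketch below: the active event list is read but never modified, and a separate structure is built for correlation. The function name `build_correlation_view`, the event dictionary keys, and the choice to leave plain OS level events out of the view are all assumptions made for this example.

```python
def build_correlation_view(active_events, host_groups):
    """Build a per-group view without touching the active event list."""
    view = {}
    for event in active_events:
        if event["os_level"]:
            continue  # assumption: plain OS events stay only in the main view
        for group in host_groups.get(event["host"], []):
            # Store a copy so edits in one view never leak into the other.
            view.setdefault(group, []).append(dict(event))
    return view


active_events = [
    {"host": "db01", "name": "disk_usage_high", "os_level": True},
    {"host": "db01", "name": "database_unreachable", "os_level": False},
]
host_groups = {"db01": ["Application A"]}

view = build_correlation_view(active_events, host_groups)
```

The full drive alarm stays visible in the main list, while the correlation view only shows the application event under Application A.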
Thoughts and Spitball
- If correlation is done, generate a SINGLE event in the main event view warning users? Reasonable? They should be watching both UIs, especially if they are in an over-under UI.
- Allow easier filtering of correlated events to get sane results.
- Can age-out affect correlations independently? Should it?
- Correlation should be able to have a report generated for an Ops team to review and adjust correlations easily.
- Templates really seem to be the way to go here. Either a database template or flat files. While I have a bias toward flat files, database logic would likely be more "professional"; however, Ops people are generally not DBAs or developers. We learn code in self-defense because of developers! :)
- Templates must be maintained easily and with as little code as possible. If something is a PITA, it will not get done. There is always more work that someone will say is a higher priority.