Nms-definitions

From I Will Fear No Evil
Revision as of 14:08, 29 January 2024 by Chubbard (talk | contribs)

Definitions and Logic for Vigilare

Device

The simplest "thing" that can exist. In general it should be a monitorable device, but it does not have to be. Placeholder devices that are inactive but used for other purposes, such as CNAME ELBs and similar, can have a definition even though they are not real. Another example would be specific VHOST definitions for a web server.

HostGroup

A CSV list of device ids. In general it is used to run a number of identical checks against a given list of hosts. It can also serve as an ad hoc list of devices that a maintenance event is aware of. That second use is not live yet, but initial testing shows it would be a good option and would require minimal code or infrastructure changes to support. Judging from tools such as Nagios and Zenoss, which have a similar concept, a host group is almost static after setup, at least for long-lived devices.
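As a rough illustration of the idea above, here is a minimal sketch of turning a CSV list of device ids into per-device checks. The `HostGroup` class, the `expand_checks` helper, and the check names are hypothetical and are not taken from the Vigilare code; the only thing assumed from the text is that a host group is stored as a comma-separated string of device ids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostGroup:
    """Hypothetical host group record: a name plus a CSV string of device ids."""
    name: str
    device_csv: str

    def device_ids(self) -> list[str]:
        # Split on commas, trim whitespace, drop empty entries, and
        # de-duplicate while keeping the original order.
        seen: dict[str, None] = {}
        for raw in self.device_csv.split(","):
            device_id = raw.strip()
            if device_id:
                seen.setdefault(device_id, None)
        return list(seen)


def expand_checks(group: HostGroup, check_names: list[str]) -> list[tuple[str, str]]:
    """Pair every device in the group with every check name."""
    return [
        (device_id, check)
        for device_id in group.device_ids()
        for check in check_names
    ]


# Example data only; ids and check names are made up.
web = HostGroup(name="web", device_csv="12, 15,15, 27,")
pairs = expand_checks(web, ["check_load", "check_disk"])
```

Because the list is almost static after setup, parsing the string on demand like this is probably cheap enough; a real implementation might cache the result instead.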

Monitor

A service check of ANY type which can be tied to a given HostGroup or device id. The existence of a monitor does not necessarily mean that it DOES anything, only that it is defined. An example is a passive monitor, where the monitored device sends the metric directly to the API. While I expect these to be uncommon, the ability to behave like a Nagios passive check is necessary.

  • Shell-based NRPE checks are active and appear to be reliable.
  • Shell-based SNMP checks are also working and appear to be reliable.
  • Shell-based SSH checks are possible.
  • A curl-based monitor daemon should come soon.
  • Ingestion of JSON values for AWS should also come soon.
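The distinction between active and passive monitors described above could be modelled roughly as follows. This is only a sketch under assumptions: the `Monitor` fields, the `passive` flag, and `monitors_to_poll` are hypothetical names chosen for illustration, not the actual Vigilare schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Monitor:
    """Hypothetical monitor record tied to one device id or one host group."""
    name: str
    device_id: Optional[str] = None
    hostgroup_csv: Optional[str] = None
    # Passive: the monitored device pushes its own metric to the API,
    # so the poller never runs anything for this monitor.
    passive: bool = False

    def targets(self) -> list[str]:
        # A single device id takes precedence over a host group in this sketch.
        if self.device_id is not None:
            return [self.device_id]
        if self.hostgroup_csv:
            return [d.strip() for d in self.hostgroup_csv.split(",") if d.strip()]
        return []


def monitors_to_poll(monitors: list[Monitor]) -> list[Monitor]:
    """Return only the active monitors; passive ones wait for the device to report."""
    return [m for m in monitors if not m.passive]


# Example data only.
nrpe_load = Monitor(name="nrpe_load", hostgroup_csv="12,15,27")
pushed_temp = Monitor(name="pushed_temp", device_id="40", passive=True)
to_poll = monitors_to_poll([nrpe_load, pushed_temp])
```

A passive monitor still exists as a definition with targets; it is simply skipped by whatever schedules the active checks.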

Maintenance

Maintenance events are still being fleshed out, but the current thinking is event suppression within a time window: the event is still allowed to come in and be counted, but it is suppressed from the UI by default while a window is active. A filter will have to be available to see raw incoming events, as there will be times when that is reasonable, but in a healthy and properly configured system my hope is that users will not need it very often.
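Since this part is still being designed, the following is only a sketch of the suppression idea, not an implementation. `MaintenanceWindow`, `Event`, `visible_events`, and the `show_raw` filter flag are all hypothetical names; the sketch assumes events are always stored and counted, and that suppression happens only when building the default UI view.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    """Hypothetical maintenance window covering a set of device ids."""
    device_ids: frozenset
    start: datetime
    end: datetime

    def covers(self, device_id: str, when: datetime) -> bool:
        # Half-open interval: the window is active from start up to, not including, end.
        return device_id in self.device_ids and self.start <= when < self.end


@dataclass(frozen=True)
class Event:
    device_id: str
    message: str
    received: datetime


def visible_events(events, windows, show_raw=False):
    """Default UI view: hide events received inside a maintenance window.

    With show_raw=True every stored event is returned unfiltered.
    """
    if show_raw:
        return list(events)
    return [
        e for e in events
        if not any(w.covers(e.device_id, e.received) for w in windows)
    ]


# Example data only.
window = MaintenanceWindow(
    device_ids=frozenset({"12", "15"}),
    start=datetime(2024, 1, 29, 14, 0),
    end=datetime(2024, 1, 29, 16, 0),
)
events = [
    Event("12", "load high", datetime(2024, 1, 29, 14, 30)),  # inside window
    Event("27", "disk full", datetime(2024, 1, 29, 14, 45)),  # device not covered
    Event("12", "load high", datetime(2024, 1, 29, 16, 0)),   # window already over
]
shown = visible_events(events, [window])
```

The stored event list is never modified, so counts stay accurate and the raw filter simply bypasses the window check.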

Admin

Admin levels are currently defined as strings, but I am beginning to see that an integer-based system with overrides is likely a better solution.

  • For example, a user has access level 10, and the site simply checks that the user's access level is greater than or equal to (-ge) the required level of 10.
  • Secondary checks against a string would be for things like Kiosk mode, where we want to allow movement but the system is in an unsecured environment. This should allow for restrictions in the system if needed, and would likely be used very rarely.
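A minimal sketch of how an integer level with string overrides might look is shown below. Everything here is an assumption made for illustration: the `User` fields, the `can_access` function, and the `"kiosk"` flag name are hypothetical, and the real system may combine the two checks differently.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical user record: an integer level plus optional string flags."""
    name: str
    access_level: int
    flags: set = field(default_factory=set)  # e.g. {"kiosk"}


def can_access(user: User, required_level: int, denied_flags=frozenset()) -> bool:
    # Secondary string check first: any flag listed in denied_flags
    # blocks access no matter how high the integer level is.
    if user.flags & set(denied_flags):
        return False
    # Primary check: the user's level must be >= the required level.
    return user.access_level >= required_level


# Example data only.
operator = User(name="operator", access_level=10)
kiosk = User(name="lobby", access_level=10, flags={"kiosk"})

view_ok = can_access(kiosk, required_level=5)
edit_blocked = can_access(kiosk, required_level=10, denied_flags={"kiosk"})
```

Keeping the string check as a deny-list means most pages only need the integer comparison, and the rarely used restrictions can be added per page.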

Graphs

Two graphing engines are currently supported: RRDTool and Graphite. I am going to begin code for InfluxDB in the near future, once the remaining APIs I am debugging are cleaned up a bit better. Both of the current engines support the template system written for them, which shows data in more complex ways. This template engine will likely need better documentation and examples in the future, but it at least exists and works right now. The old database table for Graphite rendering was a hot mess and is being removed. While it worked, it was too rigid and fragile to be really usable.

Authentication

Currently only local database authentication is supported. I would like to get LDAP in place as the second method, but that is for future me to worry about.