NMS Goals and Plan
Highest Level Look and Reminders
- This NMS should be functional for small to medium-sized businesses with 1 to 200 devices on a single cheap host. Past 200 hosts, we will likely be looking at drive I/O problems, depending on how many checks are being done and what is being graphed.
- Without income it is going to be best effort on bug squishing.
- Never pull a ZenOSS. If you have community support NEVER forget it. (Jackasses)
- You are not a real dev (duh).
- One-offs may be useful for additions into the system if people make suggestions, but keep them low priority or normal stuff will get backlogged.
- Keep the system as anti-social as possible. If a user wishes to make it less secure for some reason, make it possible but let them know any risks of doing so.
- Try like hell to keep up with STIGs and security issues seen. Likely this will be a problem unless I get professional assistance.
- ALWAYS try to do more stuff with less hardware.
- Not all companies have $$$ for monitoring (or don't realize the importance).
- Throwing hardware at crap code is !@#$!@ stupid. (NetVigil)
- Avoid lock-in on any one type of technology as best as possible.
Where am I going?
The overall plan for the NMS is to open source it completely on github.com/gitlab.com. Once I have something that does not embarrass me too much and is functional enough to work on the majority of machines, I want others to be able to use it as well. My hope is that I can generate enough interest for others to help me get the system polished and usable by any small or medium-sized company.
There is the REMOTE possibility of attempting to start a company based around this software. I mean, who would not want to do something like that? I don't know if it will ever get to that point or what it would entail, but working with companies to make their problems known to them, and then addressing them, would be quite fun. As someone who freely admits to NOT being a developer, I am proud of what I have made so far, but I need to get it much further for it to be taken seriously in the future.
Overall Goal
The goal is to be able to maintain and EXTEND the usefulness of this tool even without professional developers on hand. My hope is that a "shade-tree" ops person would be able to leverage the template system I am implementing to add new functions that can be shared with other companies going forward.
This system must be able to be maintained by non-experts, and be able to be extended to support whatever they need monitored.
The system MUST take security into account and be anti-social when the rules are not followed. I would love to get this system FedRAMP certified, however I know that is going to be a long and painful task. I do believe such a certification would help make the system more trusted, but this is FAR down the road from where I am now.
Anything that the machine has "done" as far as a service check MUST be reproducible by a human. Far too often automation makes this difficult if not impossible to really do. But if you cannot reproduce a behavior, then how are you going to fix it? How can you trust the results if you cannot do the same thing and get that result back?
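One way to picture this reproducibility rule is a check runner that stores the exact command it ran next to the result, so a human can paste that same command into a shell and compare. This is only a minimal sketch and not how the NMS is actually built; the names CheckResult and run_check are made up for illustration, and the "check" here is just a trivial Python one-liner standing in for a real service check.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Everything a human needs to repeat the check by hand (hypothetical type)."""
    command: str    # exact, copy-pasteable command line
    exit_code: int  # exit status the machine saw
    output: str     # combined stdout and stderr the machine saw


def run_check(argv, timeout=10):
    """Run one check as a plain subprocess and record how to reproduce it."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return CheckResult(
        command=shlex.join(argv),  # shell-quoted form of the same argv
        exit_code=proc.returncode,
        output=(proc.stdout + proc.stderr).strip(),
    )


if __name__ == "__main__":
    # Stand-in check: a real one might be a ping or an SNMP query instead.
    result = run_check([sys.executable, "-c", "print('ok')"])
    print("To reproduce, run:", result.command)
    print("exit code:", result.exit_code)
    print("output:", result.output)
```

Because the stored command is shell-quoted from the same argument list that was executed, what the human runs and what the machine ran cannot drift apart.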
This system MUST be able to be trusted.
Monitoring "misses" MUST be able to be addressed quickly. If there was an outage or event that monitoring did not catch, the system MUST be able to have a fresh monitor in place and active. There is NEVER a good reason for a monitoring miss to happen a second time.
Support for the "new sexy" (containers) must also be baked into the system.
Support for AWS services must also be baked into the system.
Right now, the focus is physical datacenters and more traditional virtualization, but this MUST change.
Transient hosts must also be supported. Auto-scaled hosts and new containers need to show up in the monitoring seamlessly.
Inspiration and styles
Using many different fault management tools over the years, I noticed that they had several shortcomings. If they were NOC focused, management and outside parties had no idea what was going on other than "red bad". This led to some freakouts in several companies when someone assumed that seeing events meant an "outage" of some kind. In other cases the tool was so generic as to be useless for figuring out impact (looking at you, TECAM and NetVigil).
Using tools such as Nagios, NetVigil, TECAM, SNMPc, ZenOSS, and Observium has shown that the data is there; we just have to be able to present it in the way the end user needs to see it. We also must be able to make new passive or active monitors easily and reliably without special training or being a developer.
To this day, I still consider ZenOSS to be the "gold standard" of what I want to accomplish. Not the new versions that they have released now that they no longer care about users who don't give them $$$, but what it was early on. I am still looking for a ZenOSS 1.2.1 bin with no luck, but I do have a 2.5.2 that I run and use as almost a template of what a useful NMS can do. I believe that bringing together the ideas of "Customer facing events", "Internal Events" and a strong reporting engine will be quite useful to the industry overall.
I do not believe that the path ZenOSS took with its design is the correct one. Using Zope just because it was easy and the whole thing was written in Python was not a good enough reason to go the route they did. The performance bottlenecks due to that design choice, and the fact that they promised to remove Zope and did not, led to a loss of trust in the company. Designing my system around a database back-end written as flat as possible has, I believe, led to much better performance and a bit more tolerance against oddities and data mismatches. I don't use foreign key constraints even though they would make my life easier, because for non-developers trying to fix an issue they make things more complex.
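To make the trade-off concrete, here is a small sketch of a flat layout with no foreign key constraints, using Python's built-in sqlite3 module. The table and column names (host, event, hostname and so on) are invented for the example and are not the real schema. The point is that a mismatched row is accepted instead of rejected, and a plain query that anyone can run by hand finds the mismatches afterwards.

```python
import sqlite3


def build_demo_db():
    """Create two flat tables with no foreign keys (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE host  (hostname TEXT, address TEXT);
        CREATE TABLE event (hostname TEXT, check_name TEXT, state TEXT);
    """)
    conn.execute("INSERT INTO host VALUES ('web01', '192.0.2.10')")
    conn.executemany(
        "INSERT INTO event VALUES (?, ?, ?)",
        [
            ("web01", "ping", "ok"),
            # No matching host row. With no constraint this insert still works.
            ("db01", "ping", "down"),
        ],
    )
    return conn


def orphaned_events(conn):
    """Return events whose hostname has no row in the host table."""
    rows = conn.execute("""
        SELECT e.hostname, e.check_name, e.state
        FROM event e
        LEFT JOIN host h ON h.hostname = e.hostname
        WHERE h.hostname IS NULL
        ORDER BY e.hostname
    """)
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in orphaned_events(conn):
        print("event without a known host:", row)
```

The cost of this style is that the database will not stop bad data from going in, so a report like orphaned_events has to be run now and then to catch it.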