Nms Goals and Plan

From I Will Fear No Evil

Revision as of 08:59, 30 June 2023

Where am I going?

The overall plan for the NMS is to open source it completely on github.com. Once I have something that does not embarrass me too much and is functional enough to work on the majority of machines, I want others to be able to use it as well. My hope is that I can generate enough interest for others to help me get the system polished and usable for any small or medium-sized company.

There is the REMOTE possibility of attempting to start a company based around this software. I mean, who would not want to do something like that? I don't know if it will ever get to that point or what it would entail, but working with companies to make their problems known to them, and then address them, would be quite fun. As someone who freely admits to NOT being a developer, I am proud of what I have made so far, but I need to get it much further for it to be taken seriously in the future.

Overall Goal

To be able to maintain and EXTEND the usefulness of this tool even without professional developers on hand. My hope is that a "shade-tree" ops person would be able to leverage the template system I am implementing to add new functions that can be shared with other companies going forward.

This system must be able to be maintained by non-experts, and be able to be extended to support whatever they need monitored.

The system MUST take security into account and be anti-social when the rules are not followed. I would love to get this system FedRAMP certified; however, I know that is going to be a long and painful task. I do believe such a certification would help make the system more trusted, but this is FAR down the road from where I am now.

Anything that the machine has "done" as far as a service check MUST be reproducible by a human. Far too often, automation makes this difficult, if not impossible. But if you cannot reproduce a behavior, then how are you going to fix it? How can you trust the results if you cannot do the same thing and get that result back?
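One way to picture the "reproducible by a human" rule is to store the exact command line next to every check result, so an operator can paste it into a shell and compare. The sketch below is only an illustration under assumptions: `CheckResult` and `run_check` are hypothetical names, not part of the NMS, and the "check" is a stand-in that just prints a line.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    command: str    # exact command line a human can paste into a shell
    exit_code: int  # exit status returned by the check
    output: str     # combined stdout and stderr, whitespace-trimmed


def run_check(argv: List[str], timeout: float = 10.0) -> CheckResult:
    """Run a check and keep the exact command alongside its result."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return CheckResult(
        command=shlex.join(argv),  # shell-quoted, so it can be re-run by hand
        exit_code=proc.returncode,
        output=(proc.stdout + proc.stderr).strip(),
    )


# Stand-in check (hypothetical): a real monitor would call a probe here.
result = run_check([sys.executable, "-c", "print('service ok')"])
print(result.command)
print(result.exit_code, result.output)
```

Because the stored command is shell-quoted, whoever is debugging an alert can run the same thing the machine ran and see whether they get the same answer.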

This system MUST be able to be trusted.

Monitoring "misses" MUST be able to be addressed quickly. If there was an outage or event that monitoring did not catch, the system MUST be able to have a fresh monitor in place and active. There is NEVER a good reason for a monitoring miss to happen a second time.

Support for the "new sexy" (containers) must also be baked into the system.

Support for AWS services must also be baked into the system.

Right now, the focus is physical datacenters and more traditional virtualization, but this MUST change.

Transient hosts must also be supported. Auto-scaled instances and new containers need to be picked up seamlessly by the monitoring.

Inspiration and styles

Using many different fault management tools over the years, I noticed that they had several shortcomings. If they were NOC focused, management and outside parties had no idea what was going on other than "red bad". This led to some freakouts in several companies when someone assumed that seeing events meant an "outage" of some kind.

Using tools such as Nagios, NetVigil, TECAM, SNMPc, and ZenOSS has shown that the data is there; we just have to be able to present it in the way the end user needs to see it. We also must be able to make new passive or active monitors easily and reliably, without special training or being a developer.

To this day, I still consider ZenOSS to be the "gold standard" of what I want to accomplish. Not the new versions that they have released now that they no longer care about users who don't give them $$$, but what it was early on. I am still looking for a ZenOSS 1.2.1 bin with no luck, but I do have a 2.5.2 that I run and use as almost a template of what a useful NMS can do. Building on this, I believe that bringing together the ideas of "Customer facing events", "Internal Events", and a strong reporting engine will be quite useful to the industry overall.

I do not believe that the path ZenOSS took with the design was the correct one. Using Zope just because it was easy and the whole thing was written in Python was not a good enough reason to go the route they did. The performance bottlenecks caused by that design choice, and the fact that they promised to remove Zope and did not, led to a loss of trust in the company. Designing my system around a database back-end that is kept as flat as possible has, I believe, led to much better performance and a bit more tolerance for oddities and data mismatches. I don't use foreign key constraints, even though they would make my life easier, because for non-developers trying to fix an issue they make things more complex.
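As a rough illustration of what "flat, with no foreign key constraints" can look like in practice, here is a minimal sketch. Everything in it is an assumption for the example: the `host` and `event` tables and their columns are hypothetical, and SQLite stands in for whatever database the NMS actually uses.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Two flat tables that share a hostname column, with no FOREIGN KEY clause.
conn.executescript("""
CREATE TABLE host  (hostname TEXT PRIMARY KEY, address TEXT);
CREATE TABLE event (id INTEGER PRIMARY KEY, hostname TEXT, summary TEXT);
""")

conn.execute("INSERT INTO host VALUES ('web01', '192.0.2.10')")

# 'db07' has no row in host. With a foreign key this insert could be
# rejected; here the event is stored anyway.
conn.executemany(
    "INSERT INTO event (hostname, summary) VALUES (?, ?)",
    [("web01", "ping ok"), ("db07", "disk full")],
)

# A LEFT JOIN keeps events whose host is unknown, with address set to NULL.
rows = conn.execute("""
SELECT e.hostname, e.summary, h.address
FROM event e LEFT JOIN host h ON h.hostname = e.hostname
ORDER BY e.id
""").fetchall()

# Mismatches are found by a plain query instead of blocked by the schema.
orphans = [r for r in rows if r[2] is None]
print(rows)
print(orphans)
```

The trade-off in this sketch is that the database will not stop a mismatch from being written, so consistency has to be checked by queries like the one above. In exchange, a row can be inserted, fixed, or deleted by hand without first working out constraint ordering.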