How Distributed is Distributed Management or Can Too Many Cooks Spoil the Broth? by John Day
The need for network management has always been recognized. At the same time, it was recognized as overhead to selling equipment as well as a facility to smooth over the shortcomings of the equipment. Most datacomm networks in the 1970s and before were fairly small, often using equipment from a single vendor. The network management station, then called network control, was sold as a loss leader: sell the razor cheap and they buy more blades. As the 70s progressed, networks were getting not only larger but more and more diverse. A multi-vendor network was not only more likely but becoming more common. A broader view of network management was needed.
All of this was coming together as the 70s neared the end and the US computing industry initiated the OSI effort in 1978. Since the industry had always sold network control, they recognized its importance from the beginning. And they adopted a new, more sophisticated-sounding term, network management, not realizing that in these new packet networks it was qualitatively different. In OSI, it was given its own working group. At the same time, there were strong forces working against network management, and the work was slow to get off the ground, partly because it was a rather amorphous topic.
BBN had done such a good job providing network management for the ARPANET, that many didn’t realize how necessary it was. Consequently, the first rudimentary foray towards network management was with ICMP around 1980, and the first real network management wasn’t until 1988 with SNMP, which is considered below.
Up to this point, there had not been much of a consistent structure to network management, just lots of lists of parameters. There was, though, agreement on five broad ‘management applications,’ the so-called FCAPS: Fault, Configuration, Accounting, Performance, Security. But there was really no operational model of how they worked together. At least now, it was 5 lists. It didn’t take long for vendors to realize that network management standards were a major threat to their barriers to competitors. With management standards, network equipment could be more easily swapped out for a competitor’s. The response was predictable.
The contributions (many from IBM) continually added more and more issues to the discussions as to what these applications might be about. In this sort of environment, it was very hard to develop concrete proposals for standards and IBM favored postponing getting too concrete until it was clearer where it was going. Of course, this was for all sorts of good reasons. In the early 1980s, IBM ran full-page ads in places like Scientific American showing how the original 5-layer SNA model really had seven layers and followed the OSI Reference Model. The ads went on to point out that while OSI did data transfer, it didn’t handle network management. Was IBM stonewalling? Some said so. It was quite obvious that whoever defined (and sold) network management said a lot about how the equipment in the network had to behave and thus controlled the account.
In 1984, GM and Boeing with NIST had joined forces in an effort, called MAP/TOP, to develop factory and office automation based on the emerging LAN and OSI standards. In the fall of 1984, they visited the subsidiary of Motorola where I worked, looking for ideas on network management. Earlier that year, we had begun work on network management for our LAN products and in late Spring I had developed a network management model, which we presented to them. They were enthusiastic about it and thought we were far ahead of everyone else. Our staff began attending IEEE 802.1 meetings to help them pull together the specifications. The reaction by the other companies was typical, “Oh, now I understand.” Well, only superficially, but it got things moving. Within 18 months, IEEE 802.1 had an architecture and protocol completed, which were submitted fully formed into ISO as working drafts ready to be voted on. IBM never saw it coming. They tried to stop it or delay it, but the proposals had too much support. (All of the companies that had participated in its development in 802.) That broke the logjam and began in earnest the development of CMIP and its associated standards.
In some sense, that effort was relatively traditional in that it consisted of a Network Management System, modeled on the classic operations center concept, collecting information from Agents in the devices in the network and producing commands to the Agents to modify their configuration, etc. The state of hardware at the time made it important to minimize the resources the Agents required. This flipped the traditional client/server concept, putting most of the work at the client rather than the server. In addition, it had been recognized quite early in the development of packet networks that some functions like routing and congestion control had to be automatic feedback mechanisms, because events in the network were happening too fast to put a human in the loop. The most that could be done was manage the network, not control it. The change in terminology was not just the adoption of more sophisticated-sounding terminology. It was a real, practical shift in the nature of the problem.
The original insight in early summer of 1984 was that management was:
Monitor and Repair, but not Control.
To structure the idea further, I had drawn on my early interest in the structure of nervous systems. I posited that there were four planes of network management that parallel nervous systems. Progressing from the bottom up: sensors (peripheral) collecting raw data, to agents (hypothalamus) aggregating data from the operation of the layers and the equipment to be uploaded, to managers (cerebellum) the operational center of the network where the aggregated data was further processed and handled, to coordination (cerebrum) where the long-term projections, planning and adjustments to operation were considered. Aggregation of data increases moving up, and cycle time increases moving down. (See the slides in one of my presentations.) The automatic feedback functions constituting Layer Management (routing and congestion control were the current examples) clearly corresponded to the autonomic nervous system.
Collecting data on these much larger networks created a requirement: a database would be needed to make this data available, and network management had special requirements. But it also raised an issue that required more than passing knowledge of databases. Management is collecting data on the parts of the equipment. In the database world this is called a ‘bill of materials’ or ‘parts-explosion’ structure. The accepted wisdom at the time was that relational databases were the solution to all database problems. Charlie Bachman, the inventor of the Entity-Relation database model (as well as the 7-layer OSI model), always said, you can’t do a bill of materials structure in a relational database. I would counter, you could but who would want to! Consequently, our group adopted an E-R Model database. (One of our new hires from MIT kept wondering about relational, so we suggested doing some tests and found that in the best case the relational model was 19 times slower.) Meanwhile, every vendor (DEC, HP, IBM, etc.) adopted a relational database for their network management system, and they all failed, with some resorting to index-sequential files.
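To make the shape of a ‘parts-explosion’ structure concrete, here is a minimal sketch in Python. The part names (chassis, line-card, port, fan) and the `Part` and `explode` helpers are purely illustrative assumptions, not anything from our product or from the standards. The point is only that the data is a containment tree with quantities on the edges, and that the natural operation on it is a recursive walk.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """A hypothetical piece of equipment that may contain other parts."""
    name: str
    # Each entry is (component, quantity of that component per parent).
    components: list = field(default_factory=list)

    def add(self, part, qty=1):
        self.components.append((part, qty))
        return self


def explode(part, multiplier=1, totals=None):
    """Walk the containment tree and total the leaf parts it needs."""
    if totals is None:
        totals = {}
    if not part.components:
        # A leaf: accumulate how many are needed in the whole assembly.
        totals[part.name] = totals.get(part.name, 0) + multiplier
        return totals
    for child, qty in part.components:
        explode(child, multiplier * qty, totals)
    return totals


port = Part("port")
card = Part("line-card").add(port, 8)
fan = Part("fan")
chassis = Part("chassis").add(card, 4).add(fan, 2)

totals = explode(chassis)
print(totals)  # {'port': 32, 'fan': 2}
```

In a store that follows the containment links directly, this is a simple pointer chase. Flattened into tables, each level of the tree typically becomes another self-join or a recursive query, which is one plausible reading of ‘you could but who would want to.’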
The other big requirement for effective management that we recognized was commonality, commonality, commonality. Some progress was made on this in the standards with MIB definitions for each layer and protocol, as well as common structures for the management systems and the management applications. We actually made quite a bit more progress within our product group but upper management prevented us from submitting it to ISO. We also made considerable progress in further architecting FCAPS, but you can’t be everywhere in a large standards effort. Of course with RINA, we have essentially maximized commonality in data transfer, but there are still things we will learn in layer management that will lead to more simplification and enable more sophisticated management that can’t be done now.
[For completeness, we must mention what was happening in the Internet, although it did little to progress the state of the art and all in all was a drag on progress. In the late 80s, about the time the IEEE work was being moved to OSI, the IETF began to look at network management. There were two major proposals: a forward-looking object-oriented management protocol called HEMS, and a much more rudimentary protocol called SNMP. (The IEEE work had already experimented with a minimal protocol like SNMP and had found that bare Reads and Writes (Set/Get) were too minimal and generated too many requests/responses to accomplish anything: essentially too much traffic and delay. OSI had moved to the draft of an object-oriented CMIP that could operate on several objects at once with scope and filter. HEMS was similar to CMIP, although it lacked a function similar to scope and filter.) After much debate in the IETF, SNMP was chosen over HEMS, even though its implementation was larger than either HEMS or CMIP. The IETF completely missed the importance of commonality and by the time they recognized it, it was too late. Too many RFCs had been generated with specific devices as the models for the MIBs, similar to the device-specific lists that were common prior to 1985. When they did recognize it, some participants merely took the OSI management standards and did a global replace on terminology for the SNMP versions.
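The traffic argument can be illustrated with a toy model. This is a sketch under stated assumptions, not the SNMP or CMIP wire protocols: `ManagedObject`, `subtree` and `scoped_get` are hypothetical names, "scope" is modeled as a depth below a base object, and "filter" as a predicate evaluated at the agent.

```python
class ManagedObject:
    """A hypothetical node in an agent's containment tree."""

    def __init__(self, name, attrs=None, children=None):
        self.name = name
        self.attrs = attrs or {}
        self.children = children or []


def subtree(obj, depth):
    """Yield obj and its descendants down to `depth` levels below it."""
    yield obj
    if depth > 0:
        for child in obj.children:
            yield from subtree(child, depth - 1)


def scoped_get(base, depth, predicate, attr):
    """One request: the agent selects by scope, then by filter."""
    return {o.name: o.attrs[attr]
            for o in subtree(base, depth)
            if attr in o.attrs and predicate(o)}


ports = [ManagedObject(f"port{i}", {"errors": i * 10}) for i in range(6)]
bridge = ManagedObject("bridge", children=ports)

# Bare Get: one request per port, with the filtering done by the manager.
bare_requests = 0
bare_result = {}
for p in ports:
    bare_requests += 1
    if p.attrs["errors"] > 20:
        bare_result[p.name] = p.attrs["errors"]

# Scope and filter: a single request, with selection done at the agent.
scoped_requests = 1
scoped_result = scoped_get(
    bridge, 1, lambda o: o.attrs.get("errors", 0) > 20, "errors")

print(bare_requests, scoped_requests)  # 6 1
```

Both approaches return the same three ports here, but the bare Get costs one round trip per object while the scoped request costs one in total. The gap grows with the size of the subtree being examined.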
The situation was further complicated when SNMP was first approved: Cisco and other router vendors took the tack that SNMP would be suitable for monitoring, but not for configuration because it was insecure. Strictly speaking, it was. However, SNMP was encoded with ASN.1 (from OSI). (Needless to say, most computers in the world did not have ASN.1 compilers.) Instead, router vendors recommended using their management software for configuration, which used Telnet and sent passwords in the clear. (But every PC had Telnet.) Amazingly, the IETF fell for this argument. Account control raises its head again.
The IETF immediately set out to produce SNMPv2, which would be secure. The original authors tried to force through a draft, which predictably blew up in their faces. There ensued a decade-long confusion before something began to emerge. But by then the damage had been done. The state of network management was roughly where it was prior to the MAP/TOP effort.
Even now, the common solution involves so-called ‘element managers,’ i.e., vendor-specific management systems, topped with what has been called a manager of managers. This is overly complex and extremely limited in what it can accomplish. At one point this effort went to the extreme of trying to centralize many of the autonomic functions, e.g., SDN centralizing routing. Centralizing an inherently decentralized problem always has the same predictable result.]
In parallel with this, over the past 3 decades some have waxed eloquent about autonomic management with some even believing it was all that was required. However, there has been a dearth of real results. Somehow the papers all remain in the clouds with the rubber seldom finding the road. It might be true that autonomic is all that is needed as long as networks never get beyond the complexity of slime molds, sponges, and jellyfish. After that with the development of eyespots, there is a central nervous system. Just the act of observing the network implies some degree of centralization of information. Hence, there is some need for a homunculus.
More recently, the pendulum has begun to swing back the other way with the usual tendency to the other extreme: everything should be totally decentralized. Let us take a more detailed look at network management with respect to decentralization.
There are two forms of management we are interested in: Network Management and Application Management, or DAF Management. We will leave operating systems or systems management for another time.
Network Management is responsible for networking equipment and the DIFs they support. Since this involves real hardware, the domains of management are usually determined by who owns the equipment.
Application or DAF Management is responsible for managing the DAF and any DIFs uniquely required to support this DAF. Initially, our focus will be more on DAF Management. As we move RINA down to the legacy Media Layers and eventually to the wire, we will require network management as well.
As noted above, events are happening too fast for a human to be in the loop, so it has long been recognized that there was some degree of autonomic management in the DIFs. This was referred to as Layer Management. It was clear that routing (managing resource allocation within the DIF) and congestion management are autonomic. The two operate at different time scales and are governed by policies based on the QoS-cubes supported by the DIF.
In the early work on network management architecture, some went a little overboard (HP for one) with multiple levels of managers with overlapping management domains. There are no 2nd level managers. There may be a hierarchy of subnets in a network, perhaps with their own managers, but the managers are peers. The Coordination processes may prepare new input for managers but they do not effect the changes.
Let us look at the opportunities for decentralized management:
- Any number of “managers” may observe (monitor) all or part of a network. This monitoring is targeted at the health of the network. But while management domains may overlap for monitoring, they may not overlap for acting. Acting has to be unique. Only one management DAF can be responsible for modifying attributes in the management domain.
- The Agents – A Management DAF consists of Management Applications and Agents. The Agents are the local members of a DAF in the processing system. Each Agent has in its domain the IPC Processes of the DIFs this Management DAF is responsible for. The Agent is analogous to the sphere in Flatland. It has access to the state of all IPCPs in its domain. While it is possible that there would be multiple Agents belonging to different Management DAFs in the same processing system, the more common case is one Agent per processing system.
There is potential here for a new form of autonomic behavior, drawing on the nervous system analogy of ganglia. Decentralized strategies have a tendency to be good at finding local optima but not global optima. There might be strategies where the Agents or subsets of Agents could improve the health of the network by optimizing across DIFs to achieve more global optima. Also, for some networks, it may be the case that portions of the network need to undergo complex configuration changes in real time. All of these might be better coordinated by a ganglia function operating among one or more Agents.
- Event Management - The sensors in the DIF/DAFs provide the Agents with raw data. The Agents have direct access into the DIFs/DAFs. (They are, in effect, the sphere in Flatland.) Agents aggregate the data and report it to Event Management. This is how everyone else monitors the network or the distributed application. This could be distributed with all agents reporting to all monitors. (This sounds inefficient depending on the volume of data.) The important function for EM is to maintain a log of the management data. (Potentially this is a classic use of Blockchain, although it is not clear it requires that level of security. It is an open question whether given the inherent security of the DAF/DIF structure, modification of the log is seen as a threat. But it would be an interesting ‘recursive’ use of Blockchain.)
- Configuration - is under the control of the network manager responsible for given pieces of equipment. This appears to be a necessity. For DAFs, Coordination processes will probably develop new configurations or update old ones. Here the decision process can be a consensus, although the activation of a configuration probably must be centralized. (See below.) For network management, some configuration changes may be triggered by well-defined events such as day/night, weekdays/weekends, holidays, special events (Super Bowl Sunday), disaster response, etc. Some of these may require a consensus decision but most don’t, other than to determine what the configuration changes are and the conditions for activating them. Once the configuration differences are known, activating them can be fully automated. (A configuration is a tree. By embedding directives to indicate the order, activating a configuration is basically a tree walk.)
One interesting conjecture would be, can different parts of a large network sense the conditions for a configuration change and act independently without causing instabilities? However, there will be issues to resolve about what happens at the boundaries if adjoining areas do not perceive the same conditions, etc.
- Fault Management - is actually a management system with a small management domain, i.e. the equipment (or parts thereof) being diagnosed and hopefully repaired remotely. This pretty much has to be a traditional centralized management system.
- Performance Management - Everyone has always known there was a need for Performance Management, but it was always a bit vague what precisely it was for and where it fit. Given that the behavior in the DIFs is governed by the policies of layer management, i.e., it is autonomic, and that this is being managed, not controlled, it follows that performance data is collected and analyzed at the Coordination level to determine if any of the autonomic policies in the DIFs need to be adjusted, changed, etc.
All in all, this will be a ‘big data’ problem. Large amounts of data collected in the network will have to be analyzed and interpreted. This information is then reviewed by all participants (or their representatives) to determine whether policies in the DIFs need to be modified, or if there are upcoming events that will require changes in configuration and/or policies.
These decisions can be made by a consensus process of some sort. However, this level of decision making will require a degree of expertise. The decisions being made here are not direct, but indirect. They are modifying the management of the network, not its control. If any modifications are approved, they can be activated in the next configuration update. (Care will have to be given to the order this is done in.)
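The remark in the Configuration item that a configuration is a tree, and that activating it is basically a tree walk, can be sketched as follows. This is only an illustration under assumptions: the `order` field stands in for the embedded ordering directives, the walk is pre-order (a parent is applied before its children), and the node names are made up. A real design might well need a different traversal or richer directives.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigNode:
    """A hypothetical configuration item with child items."""
    name: str
    order: int = 0  # assumed directive: lower values activate first among siblings
    children: list = field(default_factory=list)


def activate(node, apply):
    """Pre-order walk: apply a node, then its children in directive order."""
    apply(node.name)
    for child in sorted(node.children, key=lambda c: c.order):
        activate(child, apply)


config = ConfigNode("router", children=[
    ConfigNode("routing-policy", order=2),
    ConfigNode("interfaces", order=1, children=[
        ConfigNode("eth1", order=2),
        ConfigNode("eth0", order=1),
    ]),
])

activated = []
activate(config, activated.append)
print(activated)
# ['router', 'interfaces', 'eth0', 'eth1', 'routing-policy']
```

Because the ordering lives in the tree itself, the decision about what to change can be made by consensus ahead of time, while the activation step stays mechanical.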
Basically, it would appear that Coordination is the center of decentralized management. It also appears that a considerable amount of the centralized functions can be made automatic.
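The one hard constraint running through the opportunities listed above is that management domains may overlap for monitoring but not for acting. A minimal sketch of a registry that enforces this is given below. The class `DomainRegistry`, its methods, and the DAF and element names are all hypothetical, chosen only to show the rule.

```python
class DomainRegistry:
    """Tracks which management DAFs may observe or modify each element."""

    def __init__(self):
        self.observers = {}  # element -> set of DAF names allowed to monitor
        self.actor = {}      # element -> the single DAF allowed to modify

    def add_observer(self, element, daf):
        # Any number of DAFs may monitor the same element.
        self.observers.setdefault(element, set()).add(daf)

    def set_actor(self, element, daf):
        # Acting has to be unique: reject a second, different actor.
        current = self.actor.get(element)
        if current is not None and current != daf:
            raise ValueError(f"{element} is already managed by {current}")
        self.actor[element] = daf
        self.add_observer(element, daf)

    def may_modify(self, element, daf):
        return self.actor.get(element) == daf


registry = DomainRegistry()
registry.set_actor("router-1", "ops-daf")
registry.add_observer("router-1", "audit-daf")

try:
    registry.set_actor("router-1", "audit-daf")
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```

Here both DAFs can monitor `router-1`, but only `ops-daf` may modify it, and the attempt to register a second actor is refused.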
For application management, of course, there are also the DIFs associated with the DAF. (We assume that ISPs, etc. would tend to provide certain ‘vanilla’ QoS-cubes. Some research is probably needed on what these would look like. Then DAF designs might have DIFs that use these ‘vanilla’ services and augment them to be specific to the needs of their DAF.) As for DAF management itself, I would foresee that the basic RIB, Configuration, Event, and Fault machinery would be largely the same. But we do need to explore what other DAF commonalities can be exploited.
 It is notable that in the OSI work, the head of delegation from most countries was from IBM.
 Because I was Rapporteur of the OSI Reference Model (as well as non-standards tasks), I did not attend the 802 meetings so as not to attract attention to the effort, even though I was writing and editing many of the contributions we submitted. IBM’s focus in 802 was on token ring, not on architecture. It worked.
 In the early 1970s on the assumption that networks were not as complex as the human nervous system, I had audited a course in invertebrate zoology to understand the range of complexity of nervous systems.
 It is analogous to ‘you could write a Java compiler for a Turing Machine, but who would want to!’
 In consensus organizations, like standards committees, it is practically a theorem that trying to ram through a complete draft without even minor changes will blow up and fail. Recent experience in the US with the Republicans’ attempt to modify the Affordable Care Act was a classic repeat of SNMPv2 (and others I have observed). The result seems to be true over a very large scale. See the Discourses on Livy.
 In many invertebrates and vertebrates, there is a mass of nerves in another part of the body, separate from the brain, that is used for muscle coordination or other functions in a remote part of the body. (Dinosaurs were known for having ganglia near their hind quarters.)