Disclaimer: This is a rant about technology. I will return you to your normal technical posts soon.
Recently I attended the Columbus Symphony with my daughter. She has an interest in music, and I want to encourage it. As I was sitting in the theater watching the performers, a few things struck me:
- Scale out
- The unknown problem
- The role of the conductor
As you watch the symphony play, there are many different instruments, each with its own function. Each sheet of music is shared by two performers: while playing, if a page needs to be turned, one of the two performers stops playing and turns it. At the next page turn, the other performer takes his turn. This orderly way of handling offline duties reminds me of infrastructure. We are constantly looking to remove single points of failure. We want to create redundancy so that maintenance or a failure (page turning or a broken string) does not affect the whole performance. Failure does not happen often, because each performer does regular maintenance on their individual instrument. This redundancy is critical to a well-running infrastructure: we must be able to perform regular maintenance and remain redundant while a failure occurs. During the periods when pages are being turned, our infrastructure may not be at full strength, and this is where scale out comes into play.
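The page-turner pattern above is essentially an active/standby pair: either member can drop out for maintenance while the other keeps playing, so the performance never stops. A minimal sketch of that idea (all names here are illustrative, not from any particular product):

```python
# Active/standby pair: one member serves while the other stands by.
# Taking a member offline for maintenance promotes its partner, so
# there is always exactly one active member.

class Pair:
    def __init__(self, a, b):
        self.members = {a: "playing", b: "standby"}

    def maintenance(self, member):
        # Take one member offline; promote the other if it was active.
        other = next(m for m in self.members if m != member)
        if self.members[member] == "playing":
            self.members[other] = "playing"
        self.members[member] = "maintenance"

    def active(self):
        return [m for m, s in self.members.items() if s == "playing"]

stand = Pair("violinist-1", "violinist-2")
stand.maintenance("violinist-1")   # first performer stops to turn the page
print(stand.active())              # → ['violinist-2']
```

The key property is that maintenance is a planned, orderly handoff rather than an outage, which is exactly what the two performers per music stand give the orchestra.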
It is easy to see that a single pair of performers cannot provide the required volume and power for the performance. This is where we see the principle of scale out in play. I can add as many two-person violin groups as I need to produce the required volume; adding more violins should be possible to meet the demands of the hall or the song. The challenge with scaling out is three-fold:
- Expertise required
- Management demands of scaling out
- Balancing the needs as a whole
In order to fill more chairs I need more highly trained performers. In infrastructure terms, I need more specialized devices that are compatible. I cannot simply hand a kazoo player a violin and expect beautiful music; compatibility and skill are both required. This tenet applies to all aspects of infrastructure.
Management demands of scaling out
As I scale out, I quickly find it hard to manage so many people. Simply put, unless I can manage 100 people exactly as I manage 1, there is a cost associated with scale. This is where scale-out compute solutions should have the advantage: assuming you buy a solution from a single vendor, we hope the parts can be managed as one entity. In practice, I have found that many vendors' solutions don't have this level of intelligence. VMware has brought us vSphere, which does abstract and pool compute resources, but a lot of storage and networking vendors have not yet delivered scale out without making it hard to manage.
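The "manage 100 exactly as I manage 1" idea boils down to declarative management: one desired state, applied uniformly to every node, so management cost does not grow with node count. A minimal sketch, with hypothetical names (no real vendor API implied):

```python
# One shared desired state for the whole fleet. Managing N nodes is the
# same operation as managing one: converge everything toward this state.
DESIRED_STATE = {"ntp": "pool.ntp.org", "mtu": 9000, "log_level": "info"}

class Node:
    def __init__(self, name):
        self.name = name
        self.state = {}

    def apply(self, desired):
        # Converge this node toward the shared desired state.
        self.state.update(desired)

class Fleet:
    """Treats any number of nodes as a single managed entity."""
    def __init__(self, nodes):
        self.nodes = nodes

    def converge(self, desired):
        for node in self.nodes:
            node.apply(desired)

    def drifted(self, desired):
        # Names of nodes whose state differs from the desired state.
        return [n.name for n in self.nodes
                if any(n.state.get(k) != v for k, v in desired.items())]

fleet = Fleet([Node(f"esx{i:02d}") for i in range(100)])
fleet.converge(DESIRED_STATE)
print(fleet.drifted(DESIRED_STATE))  # → []
```

Whether the fleet has 1 node or 100, the administrator issues one `converge` call; that is the property the scale-out management story depends on.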
Balancing the needs as a whole
Adding more violins increases the volume of my violins, but it may drown out every other instrument except the drums. This is not a desired effect. Adding more violins can require adding more of the other instruments to meet the new scale, and this is a very hard thing to balance. In storage systems we balance IOPS, cache sizes, algorithms, and spinning disks. In networking we see total throughput, hairpinning, and redundant architectures all affecting our ability to scale. In compute we have the introduction of server-side flash and cache, weighed against the needs of the application as a whole. One cannot simply increase one metric without looking at its effect upon the whole.
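A back-of-the-envelope example of why one storage metric cannot be tuned in isolation: effective read latency depends on both the cache hit ratio and the backing-disk latency, so adding spindles without looking at the cache can be the wrong lever. All numbers below are illustrative:

```python
# Effective read latency as a weighted average over cache hits and
# disk misses. Tuning disk_ms alone ignores the bigger lever: hit_ratio.

def effective_latency_ms(hit_ratio, cache_ms=0.1, disk_ms=8.0):
    """Weighted-average latency across cache hits and disk misses."""
    return hit_ratio * cache_ms + (1 - hit_ratio) * disk_ms

# Halving disk latency (say, by adding spindles) at a 50% hit ratio:
print(effective_latency_ms(0.50, disk_ms=8.0))  # baseline
print(effective_latency_ms(0.50, disk_ms=4.0))  # faster disks
# Improving the cache hit ratio instead, with the original disks:
print(effective_latency_ms(0.95, disk_ms=8.0))  # better cache
```

In this toy model, raising the hit ratio from 50% to 95% beats halving disk latency, which is the "look at the whole, not one metric" point in miniature.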
In the symphony everyone has a common goal. They know that goal from the start; they have practiced and trained for it (think QA testing and programming logic). They require that all components do their jobs in unity to achieve the goal. If one section of instruments is a few seconds off from the rest, the performance is ruined (at least for those who can tell the difference). Their unity and timing are critical. Humans are prone to mistakes, and mistakes will happen: performers will get out of sync and need to catch up. Infrastructure is the same way. If my network chooses to delay a message for a few seconds, everything else is affected. We need all the components to work perfectly every time, and this is harder than it sounds. Computers run programs, and a program cannot account for anything it was not written to handle.
The Unknown Problem
Here is the big problem: it's what we don't know that will kill us. In religion there is the concept of absolute truth and relative truth. Absolute truth is truth based upon all the facts; the idea is that if we understood everything, we could always make the right choice. We would be able to be perfect and create without failures. Religion is largely based upon following a being that has absolute truth. Relative truth is truth based upon our current understanding, until proven incorrect (think "the world is flat"... now it's roundish). Relative truth is the world we deal with each day.

In the performance, assumptions can be made about the required number of performers based on the size of the hall or past experience with it. Best practices around sizing can be written down. But the assumptions are just that: assumptions. They cannot take into account every possible eventuality. Disaster may strike, like a floor or roof caving in, or something simple like an accident outside the theater causing an ambulance's siren to ring through half the performance. These unknown factors are common. When writing code in college I often had my wife test my applications. It normally took her about 15 seconds to do something totally unexpected (by me) and break everything. It was so frustrating. Users and applications will do the unexpected. There are plenty of unknowns, like the effect of a lightning-induced power failure on your storage system (it's not good, trust me).

The unknown requires that we keep an open mind and adjust as needed. All IT is software defined. It does not matter whether it runs on a chip or in memory; it's software defined. Firmware is software that runs on hardware. The critical concept for me in software-defined IT is the ability to have intelligence and agility. I love the story about the last Google outage: a bug was introduced into their production networking, and it was detected and removed automatically by the software.
Can anyone else say "awesome," followed by "I am afraid of Skynet"? (For non-US readers, Skynet is an A.I. from the Terminator movie series that tried to kill all humans.) This is intelligent and agile. The latest movement to define everything in software should provide quicker, more resilient fixes to the unknown problem.
The role of the conductor
The conductor's role is to unify the performers, set a tempo, execute clear preparations and beats, and listen critically to shape the sound of the ensemble (Wikipedia). He is the big boss who keeps the whole ship running perfectly. It is tempting to call him the architect, but that's simply untrue: the architect is a person who works with relative truth, old truth, and observed truth. To understand my problem with calling the architect the conductor, I have to illustrate another challenge: the music changes. The symphony plays a song and then moves on to the next one. Roles and goals change with it. Violins may have had a heavy role in the last song and a very minor role in this one, making that scale-out of performers unnecessary. The game is constantly changing, and each change brings its own challenges. Infrastructure has a much larger problem: there is no common goal. Take for example that I am running 200 virtual machines. Each virtual machine has a different role and different needs. They are like 200 garage bands playing at the same time. No amount of conducting can solve the lack of a shared goal; it will sound really bad, or at least really loud. Each application really needs its own conductor and its own space. Applications need to be able to get access to resources in an intelligent way without affecting other applications.
Who is the conductor?
Like it or not, each of our applications is its own conductor. Treating them all as a single entity with the same metrics is only asking for trouble. In the compute arena we have been given a number of tools to manage individual applications: reservations, DRS, SDRS, NIOC, and so on. These allow the vSphere conductor to understand some metrics about our little bands, and that understanding is even automated at times to make our lives easier (DRS, for example). But this understanding of our applications ends at the compute layer. Storage and networking treat everyone the same. There have been some inroads into this problem, such as QoS and IOPS allocations, but at the end of the day storage systems want to deal with reads and writes, networks want to deal with transferring data, and neither wants to be intelligent about the 200 applications running on those ESXi hosts. When I provisioned storage to a single server it was easy. Now I provision storage to potentially 32 servers running 4,000 little bands. I need a master conductor, I need agility, I need scale, I need unity, I need something that allows my applications to be their own conductors, and most of all I need intelligence. I need all these things to work together in concert at the individual operating system layer. I need virtualized networking and storage. I need the same magic VMware brought with ESXi applied to those other realms. This post is not a slam on vendors; they do an awesome job and I geek out on their stuff every day. This is not easy, or it would already be done. There are vendors out there doing parts of this today. We need to find them and support them to bring change.
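The per-application fairness that mechanisms like shares and reservations aim for can be sketched as weighted proportional sharing: under contention, each "band" gets a slice of the resource in proportion to its weight instead of everyone being treated identically. This is a generic sketch of the idea, not any vendor's actual algorithm; all names and numbers are illustrative:

```python
# Weighted proportional-share allocation of a contended resource.
# Each app's slice is proportional to its shares, capped at its demand;
# slack from under-demanding apps is redistributed round by round.

def allocate(capacity, demands, shares):
    """Split capacity by shares, capped at each app's actual demand."""
    grant = {app: 0.0 for app in demands}
    remaining = dict(demands)
    free = capacity
    while free > 1e-9 and any(v > 1e-9 for v in remaining.values()):
        hungry = [a for a, v in remaining.items() if v > 1e-9]
        total_shares = sum(shares[a] for a in hungry)
        allocated = 0.0
        for app in hungry:
            slice_ = free * shares[app] / total_shares
            take = min(slice_, remaining[app])
            grant[app] += take
            remaining[app] -= take
            allocated += take
        if allocated < 1e-9:
            break  # nobody could take anything; avoid spinning
        free -= allocated
    return grant

# 10,000 IOPS of array capacity, three apps with different weights:
demands = {"oltp": 6000, "web": 5000, "backup": 9000}
shares  = {"oltp": 4,    "web": 2,    "backup": 1}
print(allocate(10000, demands, shares))
```

The point of the sketch is the contrast with today's storage and networking behavior: without weights, all 200 bands get the same treatment; with them, the noisy backup job can no longer starve the OLTP application.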