Design for Platform services controller (PSC)

This is the first part in a series about building PSC architecture the rest of the articles are here:

The platform services controller that was introduced in vSphere 6.0  has been a source of challenge for a lot of people who are upgrading into it.    I have struggled to identify the best architecture to follow.   This article assumes that you want to have a multi-vCenter single sign on domain with external PSC’s.   There are a few key items to consider in architecting PSC’s:

Recovery

  • If you lose all PSC’s you cannot connect a vCenter to a new PSC you must re-install the vCenter loosing all data
  • To recover all failed PSC’s restore a single PSC from backup (Image level backup is supported) then redeploy new PSC’s for the rest.   Restoring multiple PSC’s may introduce some inconsistencies depending on time of backup.
  • In 6.5 vCenter cannot be repointed to a PSC in a different site on the same domain (6.0 can)
  • All 6.x versions of vCenter do not support repointing to a PSC in a different domain
  • If you lose all PSC’s at a site you can install new PSC’s at the site as long as at least one PSC at another site survived then repoint the vCenter to the new PSC

 

Replication

  • All PSC replication is bi-directional but not automatically in a ring (big one)
  • By default each PSC is replicating with only a single other PSC (the one you select when installing the additional PSC)
  • Site names do not have anything to do with replication today they are a logical construct for load balancers and future usage
  • Changes are not unique to a site but to a domain – in other words all changes at all sites are replicated to all other PSC’s assuming they are part of the domain

 

Availability

  • vCenter points to a single PSC never more than one at a time
  • PSC’s behind a load balancer (up to 4 supported) are active/passive via load balancer configuration
  • If you use a load balancer configuration for PSC and have a failure of the active PSC the load balancer repoints to another PSC and no reconfiguration is required
  • Site name is important with load balancers you should place all PSC’s behind a load balancer in their own site – non-load balanced PSC’s at same site should have a different site name

 

Features

  • PSC’s have to be part of the same domain together to use enhanced linked mode

 

Performance

  • PSC can replicate to one or many other PSC’s  (with an impact with many).   You want to minimize the number of replication partners because of performance impact.

Topology

  • Ring is the supported topology best practice today
  • PSC’s know each other by IP address or domain name (ensure domain is correct including PTR) – using IP is discouraged because it can never be changed;  use of FQDN allows for IP mobility.
  • PSC’s are authentication sources so NTP is critical and the same NTP across all PSC’s is critical.  (If you join one PSC to AD all need to be joined to same AD – best not to mix appliance and windows PSC’s)
  • The only reason to have external PSC’s is to use enhanced linked mode – if you don’t need ELM use an embedded PSC with vCenter and back vCenter up at the same time – see http://vmware.com/go/psctree

 

Scalability

  • Current limits are on 8 PSC’s in a domain in 6.0 and 10 in a domain in 6.5

 

Your ads will be inserted here by

Easy Plugin for AdSense.

Please go to the plugin admin page to
Paste your ad code OR
Suppress this ad slot.

With all of these items in hand here are some design tips:

  • Always have n+1 PSC’s in other words never have a single PSC in a domain when using ELM
  • Have a solid method for restoring your PSC’s – Image level or 6.5 restore feature

 

So what is the correct topology for PSC’s? 

This is a challenging question.  Let’s identify some design elements to consider

  • Failure of a single component should not create replication partitions
  • Complexity of setup should be minimized
  • Number of replication agreements should be minimized for performance reasons
  • Scaling out additional PSC’s should be as simple as possible

Ring

I spent some time in the ISP world and learned to love rings.   They create two paths to every destination and are easy to setup and maintain.   They do have issues when two points fail at the same time and potentially create partitions of routing until one of the two is restored.   VMware recommends a ring topology for PSC’s at the time of this article as shown below:

Let’s review this topology against the design elements:

  • Failure of a single component should not create replication partitions
    • True due to ring there are two ways for everything to replicate
  • Complexity of setup should be minimized
    • The setup ensures redundancy without lots of manually created performance impacting replication agreements (one manual agreement)
  • Number of replication agreements should be minimized for performance reasons
    • True
  • Scaling out additional PSC’s should be as simple as possible
    • Adding a new PSC means the following:
      • Add new PSC joined to LAX-2
      • Add new agreement between new PSC and SFO-1
      • Remove agreement between LAX-2 and SFO-1

Looks mostly simple you do need to track who is providing your ring backup loop. Which is a manual documentation process today.

Ring with additional redundancy

The VMware validated design  states that for a two site enhanced linked mode topology you should build the following:

A few items to illustrate (in case you have not read the VVD)

  • Four vCenters
  • Four PSC’s (in blue)
  • Each PSC replicates with its same site peer and one remote site peer thus making sure it’s changes are stored at two sites and with two copies that are then replicated locally and remotely (all four get it)

Let’s evaluate against the design elements:

  • Failure of a single component should not create replication partitions
    • True due to ring there are four ways for everything to replicate
  • Complexity of setup should be minimized
    • The setup requires forethought and at least one manual replication agreements
  • Number of replication agreements should be minimized for performance reasons
    • It has more replication agreements
  • Scaling out additional PSC’s should be as simple as possible
    • Adding a new PSC means potentially more replication agreements or more design

Which option should I use?

That is really up to you.  I personally love the simplicity of a ring.  Nether of these options increase availability of the PSC layer they are about data consistency and integrity.   Use a load balancer if your management plane SLA does not support downtime.

NSX Manager still running but disconnected from vCenter

A quick note in case you run into this issue.   I was running into problems where my NSX manager was running and everything seemed fine (NSX manager login / Console) but I could not manage NSX elements from inside vCenter.   No NSX manager was showing up.   Reconnecting to vCenter or rebooting would resolve this issue but then I had the problem again the next day.   I could not figure out the issue… then it dawned on me what happens every day…. BACKUP!   Somehow my NSX manager was added to the nightly backup and it would lose connection during this time.    Here is the only approved method for backing up a NSX manager:

  1. Use the configuration backup in the NSX manager administration console to make normal and regular backups

 

To recover a NSX manager do the following:

  1. Deploy a new NSX manager using OVF (same version of NSX as backup) with same IP as original manager
  2. Restore the configuration from the backup
  3. Reboot the NSX manager to ensure clean configuration
  4. Ensure it shows up in the GUI

 

Image level backups are not supported or a good idea 🙂

VMkernel types updated with design guidance for multi-site

Holy crap what do all these VMware VMkernel type mean?  I started this article and realized I had already written one here.  Sad when google leads you to something you wrote… looks like I don’t remember too well… Perhaps I should just go yell for the kids to get off my lawn now.   I wanted to take a minute to revise my post with some new things I have learned and some guidance.

capture

From my previous post:

  • vMotion traffic – Required for vMotion – Moves the state of virtual machines (active datadisk svMotion, active memory and execution state) during a vMotion
  • Provisioning traffic – Not required will use management network if not setup – cold migration, cloning and snapshot creation (powered off virtual machines = cold)
  • Fault tolerance traffic (FT)  – Required for FT – Enables fault tolerance traffic on the host – only a single adapter may be used for FT per host
  • Management traffic – Required – Management of host and vCenter server
  • vSphere replication traffic – Only needed if using vSphere replication– outgoing replication data from ESXi host to vSphere replication server
  • vSphere replication NFC traffic – Only needed if using vSphere replication – handles incoming replication data on the target replication site
  • Virtual SAN – Required for VSAN – virtual san traffic on the host
  • VXLAN – used for NSX not controlled from the add vmkernel interface.

I wanted to provide a little better explanation around design elements with some interfaces.  Specifically I want to focus on vMotion and Provisioning traffic.    Let’s create a few scenario’s and see what interface is used assuming I have all the VMkernel interfaces listed above:

  1. VM1 is running and we want to migrate from host1 to host2 at datacenter1 – vMotion
  2. VM1 is running with a snapshot and we want to migrate from host1 to host2 at datacenter1 – Provisioning traffic (if it does not exist management network is used)
  3. VM1 is running with a snapshot and we want to storage migrate from host1 DC1 to host4 DC3 – storage vMotion – Provisioning traffic (if it does not exist management network is used)
  4. VM1 is not running and we want to migrate from host1 to host2 at datacenter1 – Provisioning traffic (very low bandwidth used)
  5. VM1 is not running has a snapshot and we want to migrate from host1 to host2 at datacenter1 – Provisioning traffic (very low bandwidth used)
  6. VM2 is being created at datacenter1 – Provisioning traffic

 

So design guidance in a multi-site implementation you should have the following interfaces if you wish to separate the TCP-IP stack  or use network IO control to avoid bad neighbor situations.   (Or you could just assign it all to management vmk and go nuts on that interface = bad idea)

  • Management
  • vMotion
  • Provisioning

Use of other vmkernel interfaces depends on if you are using replication, vSAN or NSX.

Should you have multi-nic vMotion? 

Multi-nic vMotion enables faster vMotion of multiple entries off a host (as long as they don’t have snapshots).   It still is a good idea if you have large vm’s or lots of vm’s on a host.

Should you have multi-nic Provisioning?

No idea if it’s even supported or a good idea.  Provisioning network is used for long distance vMotion so the idea might be good… I would not use it today.

Should IT build a castle or a mobile home?

So I have many hobbies to keep my mind busy during idle times… like when driving a car.   One of my favorite hobbies is to identify the best candidate locations to live in if the Zombie apocalypse was to happen.   As I drive in my car between locations I see many different buildings and I attempt to rate large buildings by their Zombie proof nature.   There are many things to consider in the perfect Zombie defense location for example:

  • Avoiding buildings with large amounts of windows or first floor windows
  • Building made of materials that cannot be bludgeoned open for example stone
  • More than one exit but not too many exits
  • A location that can be defended on all sides and allows visible approach

There are many other considerations like proximity to water and food etc..  but basically I am looking for the modern equivalent of a castle:pexels-photo

OK what does this have to do with IT

Traditional infrastructure is architected like a castle its primary goal is to secure at the perimeter and be very imposing to keep people out.   During a zombie attack this model is great until they get in then it becomes a grave yard.   IT architects myself include spend a lot of time considering all the factors that are required to build the perfect castle.   There are considerations like:

  • Availability
  • Recoverability
  •  Manageability
  • Performance
  • Security

That all have to be considered and as you add another wing to your castle every one of these elements of design must be considered for the whole castle.  We cannot add a new wing that bridges the moat without extending the moat etc..   Our design to build the perfect castle has created a monolithic drag.   While development teams move from annual releases to quarters or weeks or days we continue to attempt to control the world from a perimeter design perspective.   If we could identify all possible additions to the castle at the beginning we could potentially account for them.   This was true in the castle days:  there were only so many ways to get into the castle and so many methods to break in.    Even worse the castle provided lots of nooks and locations for zombies to hide and attack me when not expecting it..  This is the challenge with the Zombie attack they don’t follow the rules they just might create a ladder out of zombie bodies and get into your castle (World War Z style).   If we compare zombies to the challenges being thrown at IT today the story becomes valid.    How do we deal with constant change and unknown?   How do we become agile to change?   Is it from building a better castle?

Introducing the mobile home

pexels-photo-106401

Today I realized that the perfect solution to my Zombie question was the mobile home.   We can all assume that I need a place to sleep.   Something that I can secure with reasonable assurance.   I can re-enforce the walls and windows on a mobile home and I gain something I don’t have with a castle: mobility.  I can move my secured location and goods to new locations.  My mobile home is large enough to provide for my needs without providing too many places for zombies to hide.  IT needs this type of mobility.   Cloud has provided faster time to market for many enterprises but in reality you are only renting space in someone else’s castle.    There are all types of methods to secure your valuables from mine but in reality we are at the mercy of the castle owner.   What if my service could become a secured mobile home… that would provide the agility I need in the long run.   The roach motel is very alive and well in cloud providers today.   Many providers have no cross provider capabilities while others provide tools to transform the data between formats.   My mobile home needs to be secure and not reconfigured each time I move between locations while looking for resources or avoiding attack.  We need to reconsider IT as a secured mobile home and start to build this model.   Some functions to consider in my mobile home:

  • Small enough to provide the required functions (bathroom, kitchen and sleeping space or in IT terms business value) and not an inch larger than required
  • Self contained security the encircles the service
  • Mobility without interruption of services

Thanks for reading my rant.  Please feel free to provide your favorite zombie hiding location or your thoughts on the future of IT.

 

Breaking out a SSO/PSC to enable enhanced linked mode

The single sign on used to be a fairly painless portion of vCenter (once we got to 5.5, in 5.0 it was a major pain).    It was essentially a lightweight directory (vsphere.local) and gateway to active directory.  The platform services controller (PSC) of vCenter 6 is a completely different animal.  It performs a lot of new functions that are not easy to transfer between instances.  For example the PSC does the following:

  • Handles and stores SSL certificates
  • Handles and stores license keys
  • Handles and stores permissions via global permissions layer
  • Handles and stores replication of Tags and Catagories
  • Built in automation replication between different sites

Why does it do all these and why do I care?

Well VMware has come to understand that virtual machines cannot be bound to a specific location more and more customer want Hybrid and multi-site capabilities while keeping the same management.   A lot of the management functions are based around Tags and permissions have a over arching layer to provide that functionality is huge.   I assume that we are going to see more features passed up to the PSC layer in order to make cross site/ vCenter features available.

Architectural change

In 6.0 VMware changed the architecture to have external PSC’s as a preferred mode of operation.   In fact they support up to 8 replicated PSC’s and they have two constructs that matter:

  • Domain (traditionally this has been vsphere.local)
  • Sites (Physical locations)

Site designation changes how the PSC’s and their multi-masters replicate (choosing to replicate to a single instance at each site then have that instance replicate to local nodes)

The change to external PSC’s is a challenge for many users.  First let me be clear about a challenge you can only have one domain: merging domains is not supported. Once you get to 6 you cannot leave a domain and join a different domain I have not seen instructions to do it and it does not seem to be supported.  In 5 you can leave a SSO domain and join a different domain so if you are still on 5 and wish to join multiple machines to the same domain do it while on 5 using SSO.  If you wish to move from an embeded PSC to an external PSC the process is pretty simple:

  1. Install a new PSC (can be windows or Linux) joined to the embedded PSC
  2. Repoint the vCenter to the new PSC (instructions here)
  3. Remove the old PSC

The key takeaway for all of you who might have slotted off during this article is this: Make any topology changes to vCenter domains before upgrading to 6.

 

Long Distance Cross vCenter vMotion requirements

The ability to move virtual machines long distances between two datacenters while running seems like the key example of the power of abstraction.   VMware has enabled this feature but it has a number of requirements that make the cost of ownership a little high.    All of these requirements are listed in VMware KB articles but you have to mine them for the details to ensure you are compatible.   Having recently been stung by these requirements I thought I would collect them into a single location.

Assumptions:

The following assumptions are made:

  • You are running two vCenters one at each site
  • You are running virtual distributed switches at each site

KB Articles mined for the data

Requirements

  • The source and destination vCenter server instances and ESXi hosts must be running version 6.0 or later.
  • Requires Enterprise Plus licensing
  • When initiating the moves in the web client both source and destination vCenter instances must be in Enhanced Linked mode and in the same vCenter Single Sign-On domain (When using API this is not a requirement)
  • Both vCenter Servers must be time synced for SSO to work
  • For migration of compute resources only, both ESXi hosts must be connected to the shared virtual machine storage.
  • When using the vSphere APIs/SDK, both vCenter Server instances may exist in separate vSphere Single Sign-On domains. Additional parameters are required when performing a non-federated cross vCenter Server vMotion.
  • MAC address must no conflict (different vCenter ID’s will ensure this)
  • vMotion cannot take place from distributed switch to standard switch
  • vMotion cannot take place between distributed switches of different versions (source and destination vDS must be the same version)
  • RTT (round-trip time) latency of 150 milliseconds or less, between hosts
  • You must create a routeable network for the Traffic for Cold migrations (Provisioning network from VMkernel types)

 

These requirements can really bite you if you are not careful.   Notice there are no constraints on vMotioning from a standard switch to a distributed switch which helps you get around version differences.   The truth is that vMotion is a miracle of engineering and then cross vCenter vMotion is an even better miracle but it comes at a cost.   Essentially best case scenario you have to have two vCenters in enhanced linked mode on the same version of ESXi, with the same hardware type or in EVC with the same version of distributed switches.   It’s a lot of asks to enable the features and something to consider if your planning on using long distance cross vCenter vMotion.

 

Configuring a NSX load balancer from API

A customer asked me this week if there was any examples of customers configuring the NSX load balancer via vRealize Automation.   I was surprised when google didn’t turn up any examples.  The NSX API guide (which is one of the best guides around) provides the details for how to call each element.  You can download it here. Once you have the PDF you can navigate to page 200 which is the start of the load balancer section.

Too many Edge devices

NSX load balancers are Edge service gateways.   A normal NSX environment may have a few while others may have hundreds but not all are load balancers.   A quick API lookup of all Edges provides this information: (my NSX manager is 192.168.10.28 hence the usage in all examples)

 

This is for a single Edge gateway in my case I have 57 Edges deployed over the life of my NSX environment and 15 active right now.   But only Edge-57 is a load balancer.   This report does not provide anything that can be used to identify it as a load balancer from a Edge as a firewall.   In order to identify if it’s a load balancer I have to query it’s load balancer configuration using:

Notice the addition of the edge-57 name to the query.   It returns:

Notice that this edge has load balancer enabled as true with some default monitors.   To compare here is a edge without the feature enabled:

Enabled is false with the same default monitors.   So now we know how to identify which edges are load balancers:

  • Get list of all Edges via API and pull out id element
  • Query each id element for load balancer config and match on true

 

 

Adding virtual servers

You can add virtual servers assuming the application profile and pools are already in place with a POST command with a XML body payload like this (the virtual server IP must already be assigned to the Edge as an interface):

capture

You can see it’s been created.  A quick query:

 

Shows it’s been created.  To delete just use the virtualServerId and pass to DELETE

 

Pool Members

For pools you have to update the full information to add a backend member or for that matter remove a member.  So you first query it:

Then you form your PUT with the data elements you need (taken from API guide).

In the client we see a member added:

capture

Tie it all together

Each of these actions have a update delete and query function that can be done.  The real challenge is taking the API inputs and creating user friendly data into vRealize Input to make it user friendly.    NSX continues to amaze me as a great product that has a very powerful and documented API.    I have run into very little issues trying to figure out how to do anything in NSX with the API.  In a future post I may provide some vRealize Orchestrator actions to speed up configuration of load balancers.