Design Scenario: Gigabit network and iSCSI ESXi 5.x

Many months ago I posted some design tips on the VMware forums (I am Gortee there if you are wondering).   Today a user updated the thread with a new scenario looking for some advice.  While it would be a bad idea personally and professionally for me to give specific advice without a design engagement, I thought I might provide some thoughts about the scenario here.  This will allow me to justify some design choices I might make in the situation.   In no way should this be taken as law.  In reality everyone's situation is different, and small requirements can really change the design.   The original post is here.

The scenario provided was the following:

3 ESXI hosts (2xDell R620,1xDell R720) each with 3×4 port NICS (12 ports total), 64GB RAM. (Wish I would have put more on them ;-))

1 Dell MD3200i iSCSI disk array with 12 x 450GB SAS 15K Drives (11+1 Spare) w/2 4 port GB Ethernet Ports

2 x Dell 5424 switches dedicated for traffic between the MD3200i and the 3 Hosts

Each host is connected to the iSCSI network through 4 dedicated NIC Ports across two different cards

Each Host has 1 dedicated VMotion Nic Port connected to its own VLAN connected to a stacked N3048 Dell Layer 3 switch

Each Host will have 2 dedicated (active\standby) Nic ports (2 different NIC Cards) for management

Each Hosts will have a dedicated NIC for backup traffic (Has its own Layer 3 dedicated network/switch)

Each host will use the remaining 4 NIC Ports (two different NIC cards) for the production/VM traffic

Would you be so kind as to give me some recommendations based on our environment?

Requirements

  • Support 150 virtual machines
  • Do not interrupt systems during the design changes

Constraints

  • Cannot buy new hardware
  • Not all traffic is VLAN segmented
  • Lots of 1Gb ports per server

Assumptions

  • Standard Switches only (Assumed by me)
  • Software iSCSI is in use (Assumed again by me)
  • Not using Enterprise plus licenses

 

Storage

Dell MD3200i iSCSI disk array with 12 x 450GB SAS 15K Drives (11+1 Spare) w/2 4 port GB Ethernet Ports

2 x Dell 5424 switches dedicated for traffic between the MD3200i and the 3 Hosts

Each host is connected to the iSCSI network through 4 dedicated NIC Ports across two different cards

I personally have never used this array model, so the vendor should be included in the design to make sure my suggestions here are valid for this storage system.  Looking at the VMware HCL we learn the following:

  • Only supported on ESXi 4.1 U1 through 5.5 (no 5.5 U1 yet, so don't upgrade)
  • You should be using VMW_PSP_RR (Round Robin) for path failover (see the sketch after this list)
  • The array supports the following VAAI primitives: Block Zero, Full Copy, HW Assisted Locking
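A minimal PowerCLI sketch of checking and setting the Round Robin policy on one host (the host name is an example, and you should confirm with Dell that VMW_PSP_RR is their recommendation for the MD3200i before changing anything):

  # List the current multipathing policy for every disk device on the host
  Get-VMHost esx01.example.local | Get-ScsiLun -LunType disk |
    Select-Object CanonicalName, MultipathPolicy

  # Switch any device not already on Round Robin (filter further to just the MD3200i LUNs in a real environment)
  Get-VMHost esx01.example.local | Get-ScsiLun -LunType disk |
    Where-Object { $_.MultipathPolicy -ne "RoundRobin" } |
    Set-ScsiLun -MultipathPolicy RoundRobin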

The following suggestions should apply to physical cabling:

[Diagram: iSCSI storage cabling]

Looking at the diagram I made the following design choices:

  • From my limited understanding of the array, the cabling follows the best practice guide I could find.
  • Connections from the ESXi hosts to the switches are made to create as much redundancy as possible, using all available cards.  It is critical that the storage be as redundant as possible.
  • Each uplink (physical NIC) should be connected to an individual vmkernel port group, and each port group should be configured with only one active uplink (see the sketch after this list).
  • Physical switches and port groups should use the native VLAN, assuming these switches don't do anything other than carry storage traffic between these four devices (three ESXi hosts and one array).  If the array and switches provide storage to other systems, follow your vendor's best practices for segmenting traffic.
  • Port binding for iSCSI should be configured per the VMware and vendor documentation.
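To make the one-uplink-per-vmkernel rule concrete, here is a hedged PowerCLI sketch for one host and two of the four iSCSI uplinks (switch name, port group names, vmnic numbers, and addresses are all examples; the final bind step should follow the VMware and Dell port binding documents):

  $vmhost = Get-VMHost esx01.example.local
  $vss    = Get-VirtualSwitch -VMHost $vmhost -Name vSwitch1   # dedicated iSCSI vSwitch

  # One port group and one vmkernel port per physical uplink
  New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch $vss -PortGroup "iSCSI-1" `
    -IP 10.10.10.11 -SubnetMask 255.255.255.0
  New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch $vss -PortGroup "iSCSI-2" `
    -IP 10.10.10.12 -SubnetMask 255.255.255.0

  # Each port group gets exactly one active uplink; every other uplink is unused (not standby)
  Get-VirtualPortGroup -VMHost $vmhost -Name "iSCSI-1" | Get-NicTeamingPolicy |
    Set-NicTeamingPolicy -MakeNicActive vmnic2 -MakeNicUnused vmnic3
  Get-VirtualPortGroup -VMHost $vmhost -Name "iSCSI-2" | Get-NicTeamingPolicy |
    Set-NicTeamingPolicy -MakeNicActive vmnic3 -MakeNicUnused vmnic2

  # Repeat for the remaining two uplinks, then bind each vmk to the software iSCSI adapter
  # per the VMware/Dell docs (for example, from the ESXi shell:
  #   esxcli iscsi networkportal add -A vmhba33 -n vmk1
  # where vmhba33 and vmk1 are placeholders for your adapter and vmkernel ports)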

New design considerations from storage:

  • Four 1Gb links represent the maximum traffic the array will provide
  • The array does not support 5.5 U1 yet, so don't upgrade
  • We have some VAAI primitives to help speed up processes and avoid SCSI locks
  • Software iSCSI requires that forged transmits be allowed on the switch

Advice to speed up iSCSI storage

  • Find your bottleneck – is it switch speed, array processors, or ESXi software iSCSI – and solve it.
  • You might want to consider Storage DRS to automatically balance your datastores on capacity and I/O metrics (requires an Enterprise Plus license but saves a lot of time).  Note it has an impact on CBT backups, forcing a full backup after a move.
  • Hardware iSCSI adapters might also be worth the time… though they have little real benefit in the 5.x generation of ESXi

 

Networking

We will assume that we now have eight 1Gb ports available on each host after the four dedicated to iSCSI.   We have a current network architecture that looks like this (avoiding the question of how many virtual switches):

[Diagram: current network layout]

I may have made mistakes in my reading, but a few items pop out to me:

  • vMotion does not have any redundancy, which means if that card fails we will have to power off VMs to move them to another host.
  • Backup also does not have redundancy which is less of an issue than the vMotion network
  • Not all traffic has redundant switches, creating single points of failure

A few assumptions have to be made:

  • No single virtual machine will require more than 1Gb of traffic at any time (otherwise we would have to be looking into LACP or etherchannel solutions)
  • Management traffic, vMotion and virtual machine traffic can live on the same switches as long as they are segmented with VLANs

 

Recommended design:

[Diagram: recommended network design]

  • Combine the management switch and VM traffic switch into dual function switches to provide both types of traffic.
  • This design uses VLAN tags to put vMotion and management traffic on the same two uplinks, providing card redundancy (configured active/standby).  It could also be configured with multi-NIC vMotion, but I would avoid that due to the complexity around management network starvation in your situation.
  • Backup continues to have its own two adapters to avoid contention

This does require some careful planning and may not be the best possible use of links.   I am not sure you need six links for your VM traffic, but it cannot hurt.
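A hedged PowerCLI sketch of the management/vMotion half of this design for one host (switch name, VLAN IDs, and vmnic numbers are examples; in an existing environment you would modify the current management port group rather than create a new one):

  $vmhost = Get-VMHost esx01.example.local
  $vss    = Get-VirtualSwitch -VMHost $vmhost -Name vSwitch0

  # Management and vMotion share two uplinks from different cards, separated by VLAN tags
  New-VirtualPortGroup -VirtualSwitch $vss -Name "Management" -VLanId 10
  New-VirtualPortGroup -VirtualSwitch $vss -Name "vMotion"    -VLanId 20

  # Opposite active/standby order so each traffic type normally has an uplink to itself
  Get-VirtualPortGroup -VMHost $vmhost -Name "Management" | Get-NicTeamingPolicy |
    Set-NicTeamingPolicy -MakeNicActive vmnic0 -MakeNicStandby vmnic4
  Get-VirtualPortGroup -VMHost $vmhost -Name "vMotion" | Get-NicTeamingPolicy |
    Set-NicTeamingPolicy -MakeNicActive vmnic4 -MakeNicStandby vmnic0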

 

Final Thoughts:

Is any design perfect?  Nope, there is lots of room for error and unknowns.  Look at the design and let me know what I missed.  Tell me how you would have done it differently… share so we can both learn.  Either way I hope it helps.

Deep Dive: Network Health check

vSphere 5.1 introduced one of my favorite new features: network health check.  This feature is designed to identify problems with MTU and VLAN settings.   It is easy enough to set up MTU and VLANs in ESXi, especially with a dVS.  In most environments the vSphere admins don't control the physical switches, making confirmation of the upstream configuration hard.    The health check resolves these issues.  It is only available on dVS switches and only via the web client.  (I know, time to start using that web client… your magical fat client is going away.)  If you have an upstream issue with MTU you will get an alert in vCenter.   You can find the health check by selecting the dVS and clicking on the Manage tab; in the middle pane you will see Health check, which you can edit and enable.   But you came here because you want to know how it works.

 

MTU

MTU check is easy.   Each host sends a ping message to the other nodes.  This ping message has a special header that tells the network not to fragment (split) the packet.   In addition it has a payload (empty data) to make the ping the size of the maximum MTU.   If the host gets a return message from the ping it knows the MTU is correct.  If it fails then we know the MTU is bad.   Each node checks its MTU at an interval.   You can manually check your MTU with vmkping, but the syntax has changed between 5.0, 5.1 and 5.5, so look up the latest syntax.

 

VLAN

Checking the VLAN is a little more complex.    Each VLAN has to be checked.   One host on the vDS (not sure which one, but I am willing to bet it's the master node) sends out a broadcast layer 2 packet on the VLAN, then waits for each node to reply to the broadcast with a unicast layer 2 packet.   You can determine which hosts have VLAN issues based upon who reports back.   I assume a host marked as bad then tries to broadcast itself as a method to identify failed configuration or partitions.   This test is repeated on each VLAN and at regular intervals.  It only works when two peers can connect.

Teaming policy

In vSphere 5.5 a check was added for the teaming policy against the physical switch.  This check identifies mismatches between IP hash teaming and switches that are not configured for etherchannel/LACP.

 

Negative Effect of Health check

So why should I not use health check?  Well, it does produce some traffic.  It does require you to use the web client to enable it and to determine which VLANs are bad…  otherwise I cannot figure out a reason not to use it.   It is a simple and easy way to find issues.

Design Advice on health check

Health check is a proactive way to find upstream VLAN or MTU issues before you deploy production to that VLAN.  It saves a ton of time when troubleshooting and fighting between networking and server teams.  I really cannot see a reason not to use it.    I have not tested the required bandwidth, but it cannot be huge.   My two cents: turn it on if you have a vDS… if you don't have a vDS I hope you only have ten or fewer VLANs.

Deep Dive: vSphere Traffic Shaping

Traffic shaping is all about the bad actor scenario.  We have hundreds of virtual machines that all get along with each other.  Then the application team deploys an appliance that goes nuts and starts to use its link at 100%.  Suddenly you get a call about database and website outages.  How do you deal with the application team's bad actor?  This is the most common reason why every apartment has its own water heater.   My wife would be very unhappy if she could not take her hot shower in the morning because Bob upstairs took an extra long shower an hour ago.   Sharing resources is great as long as resources are unlimited, not over-provisioned, or usage patterns stay static.  In the real world none of those things are true.  You are likely limited on resources, over-provisioned, and your traffic patterns change every single day.   Limits allow us to create constraints upon portions of resources in order to control bad actors.

Limits (available on any type of switch)

Limits are, as expected, caps that a machine cannot cross.  This allows a machine to see a 10Gb uplink but only use 1Gb at most.  The slowdown is injected into the communication stream via normal protocol methods.   In VMware the limit settings can be applied on a port group, or on a dvPort or dvPort group; notice the difference: on dVS switches we can apply limits on individual ports as well as port groups.  Limits on standard switches apply to outbound traffic only, while a dVS can shape both inbound and outbound.  There are three settings on limits (a configuration sketch follows the list):

  • Average bandwidth – the average number of bits per second to allow across the port
  • Peak bandwidth – the maximum bits per second to allow across the port when it is using its burst; this caps the bandwidth the port consumes during a burst
  • Burst size – the maximum number of bytes to allow in a burst when allocation over the average is required.  This can be viewed as a bank: when you don't use all your average bandwidth, the unused allocation can be stored (up to the burst size) to be spent when needed
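Here is a hedged PowerCLI sketch of setting those three values on a dVS port group.  It assumes the distributed switch traffic shaping cmdlets available in newer PowerCLI builds, and that the bandwidth values are bits per second and the burst size is bytes, so verify against Get-Help before using it:

  # Cap a chatty port group at an average of 100 Mbps, peaks of 500 Mbps, and a 50 MB burst
  Get-VDPortgroup -Name "AppServers" | Get-VDTrafficShapingPolicy -Direction Out |
    Set-VDTrafficShapingPolicy -Enabled $true -AverageBandwidth 100000000 -PeakBandwidth 500000000 -BurstSize 52428800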

 

Limits of the Limits

Limits produce some, well… limits.   Limits are always enforced, meaning even if bandwidth is available it will not be allocated to the limited port group or port.  Limits on a VSS are outbound only, meaning you can still flood a switch.  Limits are not reservations: machines without limits can consume all available resources on a system.  So effectively limits are only useful to stop a bad actor from hurting everyone else; they are not a sharing method.  Network limits do have their place, but I would avoid general use if possible.

 

Network IO Control a better choice

Network IO Control (NIOC) is available only on the vDS.  It provides a solution to the bad actor problem while keeping flexibility.  NIOC is applied to outbound traffic.  NIOC works very much like resource pools for compute and memory.  You set up a NIOC share (resource pool) with a number between 1 and 100.   vSphere comes with some system-defined NIOC shares like vMotion and management.  You can also define new resource pools and assign them to port groups.  NIOC only comes into play during times of contention on the uplink.  All NIOC shares are calculated on an uplink-by-uplink basis: the shares of all the active traffic types on the uplink are added together.  For example, assume my uplink has the following shares:

  • Management 10
  • vMotion 20
  • iSCSI 40
  • Virtual machines 50

If contention arises and only management, iSCSI and virtual machines are active, we have 100 total shares.  The total available bandwidth on that uplink is then divided by this number.  Let's assume we have a 10Gb uplink.  Each active traffic type would then get, based on its shares:

  • Management 1Gb
  • iSCSI 4Gb
  • Virtual machines 5Gb

This example also assumes each type is using 100% of its allocation.  If management is only using 100Mb, the others get its leftover amount divided in proportion to their shares (in this case 900Mb across the remaining 90 shares, or 10Mb per share: an extra 400Mb to iSCSI and 500Mb to virtual machines).   If a new traffic type comes into play the shares are recalculated to meet the demands.   This allows you to work out worst-case guarantees for each traffic type, for example:

  • Management will get at least ~0.8Gb
  • vMotion will get at least ~1.7Gb
  • iSCSI will get at least ~3.3Gb
  • Virtual machines will get at least ~4.2Gb (with all four types active, the 10Gb uplink is split across 120 total shares)

There is one wrinkle to this plan with multi-nic vMotion but I will address that in another post.
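The share math is easy to sanity-check with a few lines of plain PowerShell (numbers from the first example above; this is just arithmetic, not a PowerCLI call):

  $uplinkGbps = 10
  $shares = @{ Management = 10; vMotion = 20; iSCSI = 40; VMs = 50 }
  $active = "Management","iSCSI","VMs"    # vMotion is idle in the first example
  $total  = ($active | ForEach-Object { $shares[$_] } | Measure-Object -Sum).Sum
  foreach ($type in $active) {
      # Each active type gets its proportion of the uplink during contention
      "{0,-16} {1:N1} Gbps" -f $type, ($uplinkGbps * $shares[$type] / $total)
  }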

 

Design Choices

Limits have their uses, but they are hard to manage and really hard to diagnose… Imagine coming into a vSphere environment where limits are in place but you did not know; it could take a week to figure out that was causing the issues.   My vote: use them sparingly.   NIOC, on the other hand, should be used in almost every environment with Enterprise Plus licenses.   It really has no drawbacks and provides control over traffic.

Deep Dive: vSphere Network Load Balancing

In vSphere load balancing is a hot topic.   As the load per physical host increases, so does the need for more bandwidth.  In a traditional sense this was done with etherchannel or LACP, which bond together multiple links so they look and act like a single link.  This also helps avoid loops.

What the heck is a loop?

A loop is any time two layer 2 (Ethernet) endpoints have multiple connections to each other.

 

It is possible with two virtual switches to create a bridged loop if care is not taken, but virtual switches by default will not create loops.   On the physical switch side, protocols like spanning tree were created to solve this loop issue.  STP disables a link if a loop is detected.  If the enabled link goes down, STP turns on the disabled link.   This process works for redundancy but does not do anything if link 1 is not a big enough pipe to handle the full load.    VMware has provided a number of load balancing algorithms to provide more bandwidth.

Options

  • Route Based on Originating virtual port (Default)
  • Route Based on IP Hash
  • Route Based on Source MAC Hash
  • Route Based on Physical NIC Load (LBT)
  • Use Explicit Failover Order

 

In order to explain each of these options, assume we have an ESXi host with two physical network cards called nic1 and nic2.   It's important to understand that the load balancing options can be configured at the virtual switch or port group level, allowing for lots of different load balancing policies on the same server.

Route Based on Originating virtual port (Default)

The physical NIC to be used is determined by the ID of the virtual port to which the VM is connected.  Each virtual machine is connected to a virtual switch which has a number of virtual ports, and each port has a number.   Once assigned, the port does not change unless the virtual machine moves to another ESXi host.  This number is the virtual port ID.   I don't know the exact method used, but I assume it's something as simple as odds and evens for two NICs: odd port IDs go to one uplink while even port IDs go to the other.  This method has the lowest processing overhead on the virtual switch and works with any network configuration.  It does not require any special physical switch configuration.  You can see, though, that it does not really load balance.  Let's assume you have a lot of port groups, each with a single virtual machine sitting on port 0.  In this case all virtual machines would use the same uplink, leaving the other unused.

Route Based on IP Hash

The physical NIC to be used is determined by a hash of the source and destination IP addresses.   This method provides load balancing across multiple physical network cards for a single virtual machine.  It's the only method that allows a single virtual machine to use the bandwidth of multiple physical NICs.  It has one major drawback: the physical switches must be configured for etherchannel (802.3ad link aggregation) so they present both network links as a single logical link, otherwise you will have problems.   This is a major design choice.  It also does not provide perfect load balancing.  Let's assume that you have an application server that does 80% of its traffic with a database server.  Their communication will always happen across the same link; they will never use the bandwidth of two links, because their hash will always assign them the same link.  In addition, this method uses a lot of CPU.

  • When using etherchannel only a single switch may be used
  • Beacon probing is not supported on IP Hash
  • vDS is required for LACP
  • Troubleshooting is difficult because each source/destination combination may take a different path (some virtual machine paths may work while others do not, in an inconsistent pattern)

Route Based on Source Mac Hash

The physical NIC to be used is determined by a hash created from the virtual machine's source MAC address.  This method provides a more balanced approach than route based on originating virtual port.  Each virtual machine will always use only a single link, but load will be distributed.  This method has a low CPU overhead and does not require any physical switch configuration.

Route Based on Physical NIC Load (Distributed Virtual Switch Required also called LBT)

The physical NIC to be used is determined by load.  The NICs are used in order (nic1 then nic2): no traffic will be moved to nic2 until nic1 is utilized above 75% capacity for 30 seconds.  Once this is reached, traffic flows are moved to the next available NIC.  They will stay on that NIC until another LBT event moves traffic.   LBT does require the dVS and some CPU overhead.  It does not allow a single virtual machine to gain more than 100% of a single link's speed, but it does balance traffic among all links during times of contention.

Use Explicit Failover

The physical NIC to be used is determined by which NIC is highest on the list of available NICs.  The others will not be used unless the first NIC is unavailable.  This method does no load balancing and should only be used in very special cases (like multi-NIC vMotion).

 

Design Advice

Which one should you use?  It depends on your need.   Recently a friend told me they never changed the default because they never get close to saturating a single link.   While this approach has merit, and I wish more people understood their network metrics, you may need to plan for the future.  There are two questions I use to determine which to use:

  • Do you have any virtual machines that alone require more than a single link's bandwidth?  (If yes, then the only option is IP hash with etherchannel or LACP.)
  • Do you have vDSs?  (If yes, use route based on physical NIC load; if no, use the default or source MAC.)

Simply put, LBT is a lot more manageable and easier to configure.
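A hedged PowerCLI sketch of both answers (the host, switch, and port group names are examples, and the distributed switch cmdlets assume the VDS module in newer PowerCLI builds, so verify the cmdlet names and policy values against Get-Help first):

  # Standard switch: IP hash (only when the physical side is configured for etherchannel)
  Get-VirtualSwitch -VMHost esx01.example.local -Name vSwitch0 | Get-NicTeamingPolicy |
    Set-NicTeamingPolicy -LoadBalancingPolicy LoadBalanceIP

  # Distributed switch: route based on physical NIC load (LBT)
  Get-VDPortgroup -Name "Production-VMs" | Get-VDUplinkTeamingPolicy |
    Set-VDUplinkTeamingPolicy -LoadBalancingPolicy LoadBalanceLoadBased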

Do IT certifications really matter?

Twice in the last week people in IT have asked me this question.  My answer has been: it depends.  When I first started my career I hated certifications.  This is mostly because in college I attended a Microsoft certification course.   It was a "memorize the content, don't worry if you don't understand it" type of course and test.   It seemed pointless to me… I passed the test and still had never worked with half the stuff I was tested on.   The memorized information was soon lost and nothing other than a piece of paper was gained.   This tainted my view toward certifications.  For many years I did not see the point and avoided them.   A few years ago an employer encouraged me to get a VMware certification.  They also offered to pay.   So I took them up on the offer and got the VCP certification.   The required course for the certification was good because it allowed a lot of time for question and answer sessions.   The instructor knew the material very well.   It was a good course.  With a little additional study I passed the test and had another IT certification.

What did I learn?

Knowing I was going to have to take the VCP test made my course learning more meaningful.   I was able to learn with intent.   I now realized that certifications might not have value but the knowledge did…  So since that time I have used certifications to motivate myself to learn.

Wait… certifications should translate into more money right?

While it is true my jobs continue to pay more as time goes along I do not believe this is because of my certifications.  I think it’s because of what I learned while doing the certifications.   Will certifications ensure more money?  Not always.   But more knowledge and skills will translate to more ability to do.

So you convinced me … what certs should I do?

Well, here is the tough one.  I can tell you what certifications I see a lot on resumes and in job postings:

  • ITIL – This one is on every resume.  Buy a book off Amazon and take the test… it’s not hard and people want it a lot.
  • VMware certification – Virtualization is hot… but only a few places have virtualization-only admins.  VCP is normally enough; VCAP and above are not seen much on job postings.  (Don't get me wrong, I am all about geeking out with VMware certs, as shown by my VCDX, but in translation to jobs a VCAP will not help you more than a VCP… a VCDX will, but it's a long journey.)  The most fun test on that journey is the VCAP-DCA (it's a live test that makes you actually do things… so much fun).
  • RedHat certification (normally RHCE) – Red Hat is still the leader in enterprise Linux, and their cert is a practical exam that requires you to do things, not just know them.
  • Windows Certification – They are a lot better than they used to be and look great for Windows jobs
  • PMP – if you want to get into technical project management this is the cert.
  • CCNA – If you are interested in networking start here… even if you don’t have Cisco in your shop.

 

Live Tests

My final note is a shout out to all testing systems that require you to work with a real environment like the VCAP-DCA, CCNA or RHCE.  These tests require you know how to do things and are awesome.  No pointless memorization required.  We need more IT tests like this…

Deep Dive: Virtual Switch Security settings and Port Binding

Security Settings:

Three options are available on a virtual switch.  These settings can be set at the switch layer then overwritten on individual port groups.

  • Promiscuous Mode – When set to accept, this allows the guest adapter to see all frames passed on the vSwitch that are in the same VLAN as the guest, which allows for packet sniffing.  This is not port mirroring; the guest only sees traffic in its own VLAN on that host.
  • MAC Address Changes – Allows the guest to change its MAC address.  If set to reject, all frames destined for a MAC not in the .vmx file are dropped at the switch.
  • Forged Transmits – If set to reject, all frames sent from the guest with a source MAC address that does not match the .vmx file are dropped.

Security settings advice:

Set all three to reject on the switch, keeping your operating system admins in a box while protecting shared resources.   Then add individual policies to each port group as needed.   If you are wondering where it's needed, one of the use cases is nested virtualization, which requires all three to be set to accept.
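A hedged PowerCLI sketch of that advice for a standard switch (host, switch, and port group names are examples, and the Get-SecurityPolicy/Set-SecurityPolicy cmdlets only exist in newer PowerCLI releases, so treat this as a sketch):

  # Reject all three at the vSwitch level
  Get-VirtualSwitch -VMHost esx01.example.local -Name vSwitch0 | Get-SecurityPolicy |
    Set-SecurityPolicy -AllowPromiscuous $false -MacChanges $false -ForgedTransmits $false

  # Override only where a workload genuinely needs it (for example a nested ESXi lab port group)
  Get-VirtualPortGroup -VMHost esx01.example.local -Name "NestedLab" | Get-SecurityPolicy |
    Set-SecurityPolicy -AllowPromiscuous $true -MacChanges $true -ForgedTransmits $true
  # Depending on the PowerCLI version you may also have to clear the matching *Inherited flags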

Port Binding:

Port binding is a setting that allows you to determine how and when the ports on a virtual switch are allocated.  Currently there are three port binding options:

  • Static binding (default)
  • Dynamic binding
  • Ephemeral binding

Static Binding – A port is allocated to a virtual machine when it is added to the port group.  Once allocated, the virtual machine continues to use the port until it is removed from the port group (via deletion or a move to another port group).  Network stats with static binding are kept through power off and vMotion.

Dynamic Binding – will be removed in the near future. Ports are allocated only when a virtual machine is powered on and the virtual network card is connected.  They are dynamically allocated when needed.  Network stats are kept through vMotion but not power off.

Ephemeral Binding – This is a lot like a standard vSwitch: it can be managed from vCenter or directly on the ESXi host.  Ports are allocated when the virtual machine is powered on and its NIC is connected.  One major difference is that dvPorts are created on demand, while the other binding types create them when the port group is created.  This process takes more RAM and processor power, so there are limits on the number of ephemeral ports available.  Ephemeral ports are used for recovery when vCenter is down and may help with vCenter availability.  All stats are lost when you vMotion or power off the virtual machine.

Port Group Type advice:

I would use static binding on almost everything.  Ephemeral has a high cost and does not scale.  I do personally use ephemeral for vCenter because I use 100% dVS switches.  If you are using standard switches just use static across the board.
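A hedged PowerCLI sketch of creating an ephemeral port group for vCenter recovery (the switch name, port group name, and VLAN are examples, and the -PortBinding parameter is my assumption about the current New-VDPortgroup syntax, so check Get-Help New-VDPortgroup for your build):

  # An ephemeral port group lets you attach the vCenter VM directly from a host when vCenter is down
  Get-VDSwitch -Name "dvSwitch-Prod" |
    New-VDPortgroup -Name "vCenter-Recovery" -VlanId 10 -PortBinding Ephemeral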

 

Deep Dive: Standard Virtual Switch vs Distributed Virtual Switch

Let the wars begin.  This article will discuss the current state of affairs between virtual switches in ESXi.   I have not included any third-party switches because I believe they are quickly becoming a non-factor with NSX.

 

What's the deal with these virtual switches?

Virtual switches are a lot like ethernet layer 2 switches.  They have a lot of the same common features.  Both switch types feature the following configurable items:

  • Uplinks – connections from the virtual switch to the outside world – physical network cards
  • Port Groups – groups of virtual ports with similar configuration

In addition both switch types support:

  • Layer 2 traffic handling
  • VLAN segmentation
  • 802.1Q tagging
  • NIC teaming
  • Outbound traffic shaping

So the first question everyone asks: if two virtual machines are in the same VLAN and on the same server, does their communication leave the server?

No… two VMs on the same ESXi host and VLAN can communicate without the traffic ever leaving the virtual switch.

 

Port Groups: what are they?

Much like the name suggests, port groups are groups of ports.  They can best be described as a number of virtual ports (think physical ports 1-10) that are configured the same.  Port groups have a defined number of ports and can be expanded at will (like a 24-port or 48-port switch).  There are two generic types of port groups:

  • Virtual machine
  • VMkernel

Virtual machine port groups are for guest virtual machines.  VMkernel port groups are for ESXi management functions and storage.  The following are valid uses for VMkernel ports:

  • Management Traffic
  • Fault Tolerance Traffic
  • IP based storage
  •  vMotion traffic

You can have one or many port groups for VMkernel but each requires a valid IP address that can reach other VMkernel ports in the cluster.
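For example, a hedged PowerCLI sketch of adding a vMotion VMkernel port (the host, switch, port group name, and addresses are examples):

  New-VMHostNetworkAdapter -VMHost (Get-VMHost esx01.example.local) -VirtualSwitch vSwitch0 `
    -PortGroup "vMotion" -IP 192.168.50.11 -SubnetMask 255.255.255.0 -VMotionEnabled $true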

At the time of writing (5.5) the following maximums apply:

  • Total switch ports per host: 4096
  • Maximum active ports: 1016
  • Port groups per standard switch: 512
  • Port groups per distributed switch: 6500
  • VSS port groups per host: 1000

So as you can see vDS scales a lot higher.

Standard Virtual Switch

The standard switch has one real advantage: it does not require Enterprise Plus licensing to use.  It has a lot fewer features and some drawbacks, including:

  • No configuration sync – you have to create all port groups exactly the same on each host or lots of things will fail (even an upper case vs. lower case difference will cause failures)

Where do standard switches make sense?  In small shops with a single port group they make a lot of sense.  If you need to host 10 virtual machines on the same subnet, standard switches will work fine.

Advice

  • Use scripts to deploy switches and keep them in sync to avoid manual errors (see the sketch after this list)
  • Always try vMotions between all hosts before and after each change to ensure nothing is broken
  • Don’t go complex on your networking design – it will not pay off
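A minimal sketch of that scripting idea in PowerCLI (the cluster, switch, and port group details are examples; the point is that every host gets an identical, case-exact definition):

  # Define the port groups once, then push the same definition to every host in the cluster
  $portGroups = @(
      @{ Name = "Prod-Web"; VlanId = 100 },
      @{ Name = "Prod-DB" ; VlanId = 110 }
  )
  foreach ($vmhost in Get-Cluster "Prod" | Get-VMHost) {
      $vss = Get-VirtualSwitch -VMHost $vmhost -Name vSwitch0
      foreach ($pg in $portGroups) {
          # Only create the port group if it is missing on this host
          if (-not (Get-VirtualPortGroup -VirtualSwitch $vss -Name $pg.Name -ErrorAction SilentlyContinue)) {
              New-VirtualPortGroup -VirtualSwitch $vss -Name $pg.Name -VLanId $pg.VlanId
          }
      }
  }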

Distributed Virtual Switch

Well, the distributed virtual switch is a different animal.  It is configured in vCenter and deployed to each ESXi host, and the configuration stays in sync.  It has the following additional features:

  • Inbound Traffic Shaping – Throttle incoming traffic to the switch – useful to slow down traffic to a bad neighbor
  • VM network port block – Block the port
  • Private VLANs – This feature requires physical switches that support PVLANs, and lets you create isolated VLANs inside a VLAN
  • Load-Based Teaming – the best available load balancing (another article on this topic later)
  • Network vMotion – Because the dVS is owned by vCenter traffic stats and information can move between hosts when a virtual machine moves… on a standard switch that information is lost with a vMotion
  • Per port policy – dVS allows you to define policy at the port level instead of port group level
  • Link Layer Discovery Protocol – LLDP enables virtual-to-physical port discovery (your network admins can see info on your virtual switches and you can see network port info – great for troubleshooting and documentation)
  • User-defined Network I/O Control – you can shape outgoing traffic to help avoid starvation
  • Netflow – dVS can output netflow traffic
  • Port Mirroring – ports can be configured to mirror for diagnostic and security purposes

As you can see there are a lot of features on the vDS, with two drawbacks:

  • Requires enterprise plus licensing
  • Requires vCenter to make any changes

The last drawback has driven a number of hybrid solutions over the years.  At this point VMware has created a workaround with the ephemeral port group type and the network recovery features of the console.

Advice in using:

  • Back up your switch with PowerCLI (there are a number of good scripts out there; see the sketch after this list)
  • Don't go crazy just because you can; if you don't need a feature, don't use it
  • Test your vCenter to confirm you can recover from a failure
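A hedged sketch of the backup advice (Export-VDSwitch is available in newer PowerCLI releases against vSphere 5.1+; if your build lacks it, the same idea can be scripted against the API):

  # Export the dVS configuration to a zip file you can restore or clone from later
  Get-VDSwitch -Name "dvSwitch-Prod" | Export-VDSwitch -Destination "C:\backups\dvSwitch-Prod.zip"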

So, to get to the point, which one should I use?

Well to take the VCDX model here are the elements of design:

Availability

  • VSS – deployed and defined on each ESXi host no external requirements + for availability
  • dVS – deployed and defined by vCenter and requires it to provision new ports/ port groups – for availability

Manageability

  • VSS – pain to manage in most environments and does not scale with lots of port groups or complex solutions – for manageability
  • dVS – Central management can be deployed to multiple hosts or clusters at the same datacenter + for manageability

Performance

  • VSS – performance is fine no effect on quality
  • dVS – performance is fine no effect on quality other than it can scale up a lot larger

Recoverability

  • VSS – is deployed to each host and stored on each host… if you lose it you have to rebuild from scratch and manually add VMs to the new switch – for recoverability
  • dVS – is deployed from vCenter and you always have it as long as you have vCenter.  If you lose vCenter you have to start from scratch and cannot add new hosts.  (Don't remove your vCenter; it's a very bad idea.)  + as long as you have a way to never lose your vCenter (does not exist yet)

Security

  • VSS – Offers basic security features not much more
  • dVS – Wider range of security features + for security

 

End Result:

dVS is better in most ways but costs more money.   If you want to use a dVS it might be best to host vCenter on another cluster or otherwise ensure its availability.