What is Skype for Business Resiliency?

Skype for Business (S4B) / Lync 2013 Resiliency Outline:  S4B and Lync 2013 resiliency pools are similar to other highly available (HA) and fault-tolerant solutions.  If one pool fails (e.g., because of a server or network disruption), its clients automatically connect to the second pool.  Even better, the Lync clients maintain their client-to-client sessions (e.g., VoIP and IM) after connecting to the backup pool, over any of these media paths:
  • UDP Direct.
  • UDP NAT.
  • UDP Relay.
  • TCP Relay.
When the primary server pool fails (e.g., a server goes down), its clients experience a brief hiccup (e.g., a one-second delay) as they connect to the backup pool.  Although this near-HA solution is useful, it is not without flaws.  The caveat is that clients enter a limited resiliency mode upon connecting to the backup pool.  Clients in resiliency mode operate with limited functionality:
  • New users connect in resiliency mode.
  • Scheduling features are unavailable.
  • Presence state displays as unknown.
  • The Contact list and Address book are not available (searches for individuals still work, though).
Network administrators can later use PowerShell to manually fail over the remaining Lync services.  This process returns clients from resiliency mode to the fully functional regular mode.  A talented Lync administrator can pair a network monitoring application with a custom PowerShell script to automate the entire failover process.
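As a rough illustration of that automation idea, the sketch below probes the primary pool and, after several consecutive failed probes, calls the two failover cmdlets in dry-run mode.  All host names, the probe interval, and the failure threshold are hypothetical placeholders, and an ICMP ping is only a crude stand-in for a real monitoring application.  Treat it as a starting point under those assumptions, not a production script.

```powershell
# Hypothetical values; replace with names from your own topology.
$PrimaryPoolFqdn   = 'pool-a.example.local'   # assumed primary pool FQDN
$BackupSqlFqdn     = 'fe02.example.local'     # assumed backup server hosting the CMS copy
$BackupSqlInstance = 'RTC'
$FailuresAllowed   = 3                        # consecutive failed probes before acting
$failures          = 0

while ($failures -lt $FailuresAllowed) {
    # Test-Connection -Quiet returns $true when the host answers the ping.
    if (Test-Connection -ComputerName $PrimaryPoolFqdn -Count 2 -Quiet) {
        $failures = 0          # primary is reachable again; reset the counter
    } else {
        $failures++
    }
    Start-Sleep -Seconds 30
}

# -WhatIf turns both calls into dry runs. Remove it only after testing in a lab.
# The CMS failover is needed only if the failed pool hosted the active CMS.
Invoke-CsManagementServerFailover -BackupSqlServerFqdn $BackupSqlFqdn `
    -BackupSqlInstanceName $BackupSqlInstance -Force -WhatIf
Invoke-CsPoolFailOver -PoolFqdn $PrimaryPoolFqdn -DisasterMode -WhatIf
```

Requiring several failed probes in a row is a deliberate choice: a single dropped ping should not trigger a pool failover.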

What is the Lync CMS?  The Central Management Store (CMS) is a SQL Server database (SQL Server Express on Standard Edition) that stores configuration data in XML format.  The CMS is hosted in the SQL instance named RTC.  Lync server tools make changes to the CMS database:
  • Lync PowerShell Console.
  • Lync Server Control Panel (LsCP).
  • Lync Topology builder.  
Each pool has a single CMS database; however only one of the CMS instances is active.

What is the replication database?  The CMS uses a master-slave replication model.  All changes to the active CMS in the RTC instance are replicated to the RTCLOCAL databases.  Changes are replicated across pools.  Lync uses the RTCLOCAL instance, not the master RTC instance, for its client services.  This distinction is important for resiliency purposes.
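One way to see this replication at work is to query the replica status from the Lync Server Management Shell.  The following is a minimal sketch; it assumes the shell is running with sufficient rights and that the status objects expose the ReplicaFqdn, UpToDate, and LastStatusReport properties.

```powershell
# List every replica and whether it has received the latest CMS changes.
Get-CsManagementStoreReplicationStatus |
    Select-Object ReplicaFqdn, UpToDate, LastStatusReport |
    Sort-Object ReplicaFqdn

# Show only the replicas that are lagging behind the master.
Get-CsManagementStoreReplicationStatus |
    Where-Object { -not $_.UpToDate }
```

An empty result from the second command suggests that every local replica is current, which is what you want to see before relying on a backup pool.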

What is the Lync Front End Pool?  The Lync Front End pool consists of a single Front End server (Lync Standard) or multiple Front End servers (Lync Enterprise) that connect to an associated back-end CMS database.  There is only one pool per CMS instance.  Users are assigned to a single pool; never both.

When users sign in to Lync, their clients automatically connect to a Front End server that resides in the users' assigned Front End pool.  The Front End server uses the replication database located in its local pool.

What are Lync Resiliency Pools?  The Lync Resiliency pool consists of exactly two Front End pools (e.g., pool_A and pool_B).  This 1:1 pool ratio provides one primary pool and one backup pool.  If one pool of the pair becomes unavailable, the Front End server in the surviving pool prevents any change to its CMS database.  However, that server continues to operate using its replica database, which holds a static configuration from the time of the loss.  Clients that reconnect use resiliency mode because the back-end service is in a read-only state.  Consider:

  • The resiliency pool is considered active/passive when individuals are all assigned to the same pool.  
  • The resiliency pool is considered active/active when individual Lync users are assigned to separate Front End pools.  
The difference between the two models is significant during the failover process.  In an active/active deployment, if one of the Lync servers fails, the clients assigned to the failed pool enter resiliency mode while the others remain in standard mode.
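To check which model a deployment follows, you can count users per pool.  This is a small sketch that assumes it runs in the Lync Server Management Shell and that each enabled user object carries a RegistrarPool property.

```powershell
# Group enabled users by their assigned registrar (Front End) pool.
# The string interpolation keeps the grouping key a plain string.
Get-CsUser |
    Group-Object -Property { "$($_.RegistrarPool)" } |
    Select-Object Name, Count
```

If all users appear under one pool name, the pair is effectively active/passive; if they are split across both pools, it is active/active.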

What is Lync High Availability?  Lync Enterprise separates the front-end and back-end roles to provide truly highly available services within a single pool.  Each pool allows up to twelve Front End servers and supports up to 80,000 simultaneous users.  In addition, Microsoft SQL Server 2012 hosts the CMS database using either mirroring or shared-storage clustering.  The disruption caused by losing any single server within the pool is negligible, so the pool is considered highly available.

Lync 2013 Standard Edition differs because it consolidates roles; both the Front End and the CMS install on the same server.  Nonetheless, Lync Standard remains a robust solution.  It scales up to 5,000 users, is simple to deploy, and supports all workloads (e.g., VoIP and IM).  For Standard Edition, Microsoft provides an alternative to HA through the use of resiliency pools.

How do Lync resiliency pools work?

Back-End Server Perspective:

  1. There are only two resiliency pools (e.g., pool_A and pool_B).
  2. There is only one CMS master database per pool.
  3. The CMS instance (RTC) replicates to the RTCLOCAL database.
  4. Lync servers always use the RTCLOCAL replica, never the CMS master.
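
The back-end view above can be inspected from the Lync Server Management Shell.  In the sketch below the pool FQDN is a hypothetical placeholder, and the exact property names in the output may differ between versions.

```powershell
# Hypothetical pool name; replace with one from your topology.
$PoolFqdn = 'pool-a.example.local'

# Show which pool is paired with this one as its backup.
Get-CsPoolBackupRelationship -PoolFqdn $PoolFqdn

# Show which server currently holds the active CMS master.
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus
```

Running both before an outage gives you a record of the pairing and the active master, which is useful later when the primary pool cannot answer queries.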
Front-End Server Perspective:

  1. Front End servers belong to a single pool.  Each pool has at least one Front End server.
  2. Users are assigned to a single pool; never both.
  3. Active/Active pools assign users to both pools.
  4. Active/Passive pools assign users to a single pool.

Lync Client Perspective:

  1. Users connect to Front End servers (i.e., pools) based on administrative configuration.
  2. Clients from the primary pool communicate seamlessly with clients from the secondary pool, and vice versa.
  3. Pools are not important to the end users unless the client is forced into resiliency mode.

What happens when a Lync server goes down?

  • Changes cannot be made to the CMS database from any pool when an outage occurs.
  • Clients access the data rather than modify it.  The Lync FE server uses a second database, its local replica, to keep serving them.
  • Users associated with the unavailable server enter resiliency mode.

How do we fail over Lync clients?

  1. Determine which Front End pool hosts the CMS:

      Get-CsService -CentralManagement

     The command will most likely return an error because the primary pool, which normally holds the active CMS, is unavailable.  However, the CMS can run on either pool, so we can at least determine whether the secondary pool is the active CMS host.  Look for the active attribute in the host Identity field.

  2. Let's assume the primary pool hosted the active CMS and is no longer available.  Fail over the CMS to the secondary pool:

      Invoke-CsManagementServerFailover -BackupSqlServerFqdn FE02.lab.local -BackupSqlInstanceName RTC -Force

     Verify that the secondary pool hosts the active CMS:

      Get-CsService -CentralManagement

     Look for the active attribute in the host Identity field for confirmation.

  3. Once we've confirmed that the secondary pool hosts the active CMS, we can fail over the replica to the secondary pool.  The -PoolFqdn parameter names the failed pool:

      Invoke-CsPoolFailOver -PoolFqdn lync01.domain.com -DisasterMode -Verbose

     Lync clients seamlessly exit resiliency mode and enter standard mode.  The "Limited Functionality" warning disappears!
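
Once the primary pool has been repaired, the reverse operation is a manual step as well.  The sketch below uses Invoke-CsPoolFailback with a hypothetical pool name; it assumes the primary pool is healthy again and that the command is run from the Lync Server Management Shell.

```powershell
# Hypothetical name of the recovered primary pool.
$PrimaryPoolFqdn = 'lync01.example.local'

# Dry run first: -WhatIf reports what would happen without moving anything.
Invoke-CsPoolFailback -PoolFqdn $PrimaryPoolFqdn -WhatIf

# When the dry run looks right, fail the users back to the primary pool.
Invoke-CsPoolFailback -PoolFqdn $PrimaryPoolFqdn -Verbose
```

Starting with the dry run is a cautious choice here, because a failback moves every user of the pool in one operation.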


1. And what happens if you recover the primary pool before committing the failover?  Is it possible to move the users back to the primary pool to recover full functionality?  Is it an automatic process or a manual one?

