BGP Troubleshooting Artifact

Artifact:  Demonstrate the value of ICT methodology.
By Steven Jordan on 11/26/2013.

Abstract:  This document identifies a BGP routing problem that resulted in unstable peering relationships, routing instability, and excessive route flaps.  The document also serves as an artifact to demonstrate the value of ICT methodology.  

Background:  Customers, all from the same organization, complained of poor web site usability.  They described the symptoms as slow-loading web pages, limited functionality, and excessive delays in data enumeration.  To be clear, the problem only affected this particular organization.

Solution:  Troubleshooting, using ICT methodology, identified a problem at the customer's location.  I sent the following email, which outlined the resolution process:


Dear Kathy,
 
It was a pleasure speaking with you on the telephone.  I wish it were under better circumstances.  I am writing to outline the network problem that prevents staff at your location from connecting to our network.

My research indicates there may be a problem with your organization’s local Internet connection, along with related routing issues.  There are two contributing factors:



     1.  Network tests sourced from your organization indicate packet loss well before traffic reaches our servers.  The problem may originate within your organization or at an upstream ISP.
  
·         A ping test measured packets of data sent and received from your location to our hosted web site.  We found approximately 25% of the data was dropped en route.
 
·         A traceroute documented the path the packets traveled across the Internet.  We found there were over 10 separate network hops from your organization to our servers (not unusual).  Each “hop” represents a router along the path, typically operated by an Internet Service Provider (ISP).

 
·         A separate ping test to each hop indicated that 25% of the data was lost.  We are reasonably certain that 25% of all data sent from your location is lost, beginning no later than the CenturyLink/Qwest hop.  Our server farm is located several hops upstream from the Qwest network.

     2.  There may be BGP routing problems with your organization's network.  There were multiple BGP route announcements sourced from your network today.

The Internet uses a routing protocol called Border Gateway Protocol (BGP).  Your organization uses BGP to host a single IP subnet with multiple ISPs.  If the primary ISP is unavailable, the network will fail over to the secondary ISP while still using the same IP subnet.

I determined there were hundreds of route updates and withdrawals that continued up until 6/30/2013 at 3:21 PM PST.  The following graph is a snapshot of the multiple paths to your organization’s network from external ISPs:





There were excessive route flaps and instability during this time frame.  I suspect there are still unintended BGP peering relationships to your organization’s network.  Multiple inbound paths from different ISPs may cause data loss.  I cannot provide further details because I am not familiar with your organization’s private network.  I can confirm that the problem is severe enough to interfere with our service.  Unless the problem is resolved, your staff will begin to notice issues with other web sites as well.

Earlier today, I spoke with Bob from your IT department’s help desk.  Bob was very helpful and assisted with some of the network tests.  It is my understanding that Bob planned to escalate the issue based on the ping and traceroute tests.  I also encourage you to forward this email to the appropriate network department to assist the escalation process.  Re-announcing the BGP routes to upstream ISPs should resolve the network instability.

Please contact me with any questions.  This issue is important to me and I will do whatever I can to help make our service accessible to your staff.  Please call my cell phone any time this evening if I can be of assistance.

Sincerely,


Steven M. Jordan
Network Administrator
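
The route instability described in the email can be quantified by counting how often each prefix changes state.  The following is a minimal Python sketch, not the tool used during the investigation.  It assumes the BGP updates have already been exported as simple records; the RouteEvent type, its field names, and the definition of a "flap" as any change between announce and withdraw are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class RouteEvent(NamedTuple):
    """Hypothetical record for one BGP update seen for a prefix."""
    timestamp: float  # seconds since the epoch
    prefix: str       # e.g. "192.0.2.0/24" (documentation range)
    kind: str         # "announce" or "withdraw"


def count_flaps(events: Iterable[RouteEvent]) -> Counter:
    """Count state changes (announce <-> withdraw) per prefix.

    Assumption: a "flap" is any change of state relative to the previous
    event for the same prefix. Repeated announcements of an already
    announced prefix are not counted.
    """
    last_kind = {}
    flaps = Counter()
    # Process events in time order so transitions are detected correctly.
    for event in sorted(events, key=lambda e: e.timestamp):
        previous = last_kind.get(event.prefix)
        if previous is not None and previous != event.kind:
            flaps[event.prefix] += 1
        last_kind[event.prefix] = event.kind
    return flaps
```

A prefix with a high count over a short window is a candidate for the kind of instability described above.  What counts as "excessive" depends on the network and is not defined here.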
