BGP Troubleshooting Artifact
Artifact: Demonstrate the value of ICT methodology.
By Steven Jordan on 11/26/2013.
Abstract:
This document identifies a BGP routing problem that resulted in unstable
peering relationships, routing instability, and excessive route
flaps. The document also serves as an artifact to demonstrate the value of ICT methodology.
Background:
Customers, all from the same organization, complained of poor web site usability. They described the symptoms as slow-loading web pages, limited functionality, and excessive delays in data enumeration. To be clear, the problem affected only this particular organization.
Solution:
Troubleshooting, using ICT methodology, identified a problem at the customer's location. I sent the following email, which outlines the resolution process:
Dear Kathy,
It was a pleasure speaking with you on the telephone. I
wish it were under better circumstances. I am writing to outline the
network problem that prevents staff at your location from connecting to our
network.
My research indicates there may be a problem with your organization’s
local Internet connection and related routing issues. There are two influencing
factors:
1. Network tests sourced from your organization indicate packet loss well before traffic reaches our servers. The problem may originate within your organization or at an upstream ISP.
· A ping test measured the data sent and received between your location and our hosted web site. We found that approximately 25% of the data was dropped en route.
· A trace route documented the path the packets traveled across the Internet. We found more than 10 separate network hops from your organization to our servers (not unusual). Each “hop” represents a different router along the path, typically operated by an Internet Service Provider (ISP).
· A separate ping test to each hop indicated that 25% of the data was lost. We are reasonably certain that 25% of all data sent from your location is lost, starting from at least the CenturyLink/Qwest hop. Our server farm is located several hops upstream from the Qwest network.
2. There may be BGP routing problems with your organization's network. There were multiple BGP route announcements sourced from your network today.
The Internet uses a routing protocol called Border Gateway Protocol (BGP). Your organization uses BGP to advertise a single IP subnet through multiple ISPs. If the primary ISP is unavailable, the network will fail over to the secondary ISP while still using the same IP subnet.
I determined there were hundreds of route updates and withdrawals that continued until 6/30/2013 at 3:21 PM PST. The following graph is a snapshot of the multiple paths to your organization’s network from external ISPs:
There were excessive route flaps and instability during this time frame. I suspect there are still unintended BGP peering relationships to your organization’s network. Multiple inbound paths from different ISPs may cause data loss. I cannot provide further details because I am not familiar with your organization’s private network. I can confirm that the problem is severe enough to interfere with our service. Unless the problem is resolved, your staff will begin to notice issues with other web sites as well.
Earlier today, I spoke with Bob from your IT department’s help desk. Bob was very helpful and assisted with some of the network tests. It is my understanding that Bob planned to escalate the issue based on the ping and trace route tests. I also encourage you to forward this email to the appropriate network department to assist the escalation process. Re-announcing the BGP routes to upstream ISPs should resolve the network instability.
Please contact me with any
questions. This issue is important to me and I will do whatever I can to
help make our service accessible to your staff. Please call my cell phone
any time this evening if I can be of assistance.
Sincerely,
Steven M. Jordan
Network Administrator
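The two measurements behind the email, packet loss and route flaps, can be expressed as simple calculations. The following is a minimal Python sketch and not the tooling used during this investigation. The function names (`loss_percent`, `count_flaps`), the `(timestamp, prefix, kind)` event layout, and the prefix 203.0.113.0/24 (a documentation-only range) are all assumptions made for illustration.

```python
from collections import Counter


def loss_percent(sent: int, received: int) -> float:
    """Return the percentage of probes that received no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


def count_flaps(events):
    """Count announce/withdraw state changes per prefix.

    events: iterable of (timestamp, prefix, kind) tuples, where kind is
    "announce" or "withdraw". This layout is hypothetical and does not
    match the output of any particular BGP monitoring tool.
    """
    last_kind = {}      # most recent state seen for each prefix
    flaps = Counter()   # number of state changes for each prefix
    for _timestamp, prefix, kind in sorted(events):
        if kind not in ("announce", "withdraw"):
            raise ValueError(f"unknown event kind: {kind!r}")
        previous = last_kind.get(prefix)
        # A flap is a transition between announced and withdrawn.
        if previous is not None and previous != kind:
            flaps[prefix] += 1
        last_kind[prefix] = kind
    return dict(flaps)


# Illustrative data only: 4 probes sent, 3 replies received.
print(loss_percent(4, 3))

# Illustrative update stream for a single documentation prefix.
sample_events = [
    (1, "203.0.113.0/24", "announce"),
    (2, "203.0.113.0/24", "withdraw"),
    (3, "203.0.113.0/24", "announce"),
    (4, "203.0.113.0/24", "announce"),
    (5, "203.0.113.0/24", "withdraw"),
]
print(count_flaps(sample_events))
```

A repeated announcement with no withdrawal in between (timestamps 3 and 4 in the sample) is not counted as a flap here. That is a design choice of this sketch. A stricter definition could also count path changes between consecutive announcements.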