What's changed?

In Episode 134 of the The Hedge podcast, Russ White reminisces on his experience in Cisco TAC. Russ states that clients would often try to suggest they've encountered bugs/defects causing major outages, when – in reality – they were often Network changes gone wrong.

Whatever it was you 'didn't change', change it back
- Russ White
Hedge 134: Ten Things
One of the many reasons engineers should work for a vendor, consulting company, or someone other than a single network operator at some point in their career is to develop a larger view of network …
Here is the episode in question. I highly recommend you subscribe to his Podcast.

In my personal experience, I've found this insight to be true. Most network outages seem to be caused by lack of peer review, lack of testing, careless implementation, lack of change control oversight, and various other lacks of control over operations.

Example Scenario

Let me make a theoretical example to demonstrate. Picture an early afternoon on a beautiful July day. The NOC was out of their chairs, chatting, laughing, and discussing their activities for the upcoming weekend. Suddenly, the board went red. Moments later, tickets started to roll through the floodgates. Apps were down, customers were down –  there was now a Priority One incident in progress.

What's the problem? Just another day in Ops Life, no? Unfortunately, nobody had any clue what had happened. One eager operator scrolled through their alarm board and found no indication of fault; No LOS condition, adjacency lost, core device offline, etc.

What could have happened here?

Topology

Obviously not a 1:1 intepretation of a production network

To demonstrate this outage, I am using my GNS3 lab above, intended for my Segment Routing testing/refresher. This is typically a flat IS-IS topology; However, I've changed the adjacency between XRv-1 & XRv-2 from IS-IS to BGP.

I've also created a set amount of "Networks" to be advertised to the core via XRv-2. These networks are then summarized via an aggregate route. This method is not uncommon to advertise a set of prefixes from a DC, region, domain, etc.

RP/0/0/CPU0:XRv-1#show ip int br | i Loo  
Sat Jul  2 17:51:54.871 UTC
Loopback0                      10.255.255.1    Up              Up       default 
Loopback48                     10.48.48.1      Up              Up       default 
Loopback50                     10.50.50.1      Up              Up       default 
Loopback51                     10.50.51.1      Up              Up       default 
Loopback52                     10.50.52.1      Up              Up       default 
Loopback53                     10.50.53.1      Up              Up       default 
Loopback54                     10.50.54.1      Up              Up       default 
Loopback55                     10.50.55.1      Up              Up       default 
Loopback56                     10.50.56.1      Up              Up       default 
Loopback57                     10.50.57.1      Up              Up       default 
Loopback58                     10.50.58.1      Up              Up       default 
Loopback60                     10.60.60.1      Up              Up       default 
Loopback61                     10.60.61.1      Up              Up       default 
Loopback62                     10.60.62.1      Up              Up       default 
RP/0/0/CPU0:XRv-1#

RP/0/0/CPU0:XRv-1# show run router bgp 1 address-family ipv4 unicast     
Sat Jul  2 18:22:26.476 UTC
router bgp 1
 address-family ipv4 unicast
  aggregate-address 10.48.0.0/12 summary-only
  redistribute connected
 !
!

RP/0/0/CPU0:XRv-1#show bgp sessions
Sat Jul  2 18:24:15.598 UTC

Neighbor        VRF                   Spk    AS   InQ  OutQ  NBRState     NSRState
10.1.2.2        default                 0     2     0     0  Established  None
RP/0/0/CPU0:XRv-1#


RP/0/0/CPU0:XRv-2#show bgp ipv4 unicast neighbors 10.1.2.1 received routes | b Network
Sat Jul  2 18:27:02.247 UTC
   Network            Next Hop            Metric LocPrf Weight Path
*> 10.1.2.0/24        10.1.2.1                 0             0 1 ?
*> 10.48.0.0/12       10.1.2.1                               0 1 i
*> 10.255.255.1/32    10.1.2.1                 0             0 1 ?
*> 100.64.0.0/24      10.1.2.1                 0             0 1 ?

Processed 4 prefixes, 4 paths
RP/0/0/CPU0:XRv-2#

RP/0/0/CPU0:XRv-2#show run formal | i redist     
Sat Jul  2 18:27:36.445 UTC
Building configuration...
router isis zeal address-family ipv4 unicast redistribute bgp 2 level-2
router bgp 2 address-family ipv4 unicast redistribute isis zeal
RP/0/0/CPU0:XRv-2#

The summary-only configuration suppresses any prefixes, within the range of 10.48.0.0/12 [ 10.48.0.0 - 10.63.255.255 ], and advertises only the aggregate route.

The rest of the core sees this aggregate, due to XRv-2's redistribution. They can also reach the individual subnets contained within this aggregate.

RP/0/0/CPU0:XRv-10#show route 10.50.50.1
Sat Jul  2 17:51:33.993 UTC

Routing entry for 10.48.0.0/12
  Known via "isis zeal", distance 115, metric 40, type level-2
  Installed Jul  2 17:41:16.075 for 00:10:17
  Routing Descriptor Blocks
    10.5.10.5, from 10.255.255.2, via GigabitEthernet0/0/0/0, Protected
      Route metric is 40
    10.9.10.9, from 10.255.255.2, via GigabitEthernet0/0/0/1, Protected
      Route metric is 40
  No advertising protos. 
RP/0/0/CPU0:XRv-10#


RP/0/0/CPU0:XRv-10#ping 10.50.50.1 so lo0
Sat Jul  2 17:52:22.050 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.50.50.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/9 ms
RP/0/0/CPU0:XRv-10#
RP/0/0/CPU0:XRv-10#
RP/0/0/CPU0:XRv-10#ping 10.50.55.1 so lo0
Sat Jul  2 17:52:26.869 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.50.55.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/9/29 ms
RP/0/0/CPU0:XRv-10#


RP/0/0/CPU0:XRv-10#traceroute 10.50.50.1 so lo0 probe 1         
Sat Jul  2 17:53:18.956 UTC

Type escape sequence to abort.
Tracing the route to 10.50.50.1

 1  10.5.10.5 9 msec 
 2  10.4.5.4 0 msec 
 3  10.3.4.3 0 msec 
 4  10.2.3.2 0 msec 
 5  10.1.2.1 0 msec 
RP/0/0/CPU0:XRv-10#

So what happened?

Now begins the story of our outage lottery winner, John Doe. John Doe's blunder is very much inspired by a similar real-world scenario I witnessed. The topology, impact, and actions are changed, for the sake of discretion.

John Doe is in the provisioning team. He received a request to decommission the network of a service that had been cancelled. John Doe believed that the change approval process was cumbersome. Thus, he decided that he would quickly delete this network and update the relevant documentation.

John logged into XRv-1 and manually deleted the network. He then went to his long overdue lunch, as it was nearly 2 pm in the afternoon.

John did not inform the NOC he made this change. He did not utilize monitoring systems to see the impact he caused. He also did not validate the configuration he was implementing.

RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-1#configure 
Sat Jul  2 17:53:43.904 UTC
RP/0/0/CPU0:XRv-1(config)#no interface loopback 5*
RP/0/0/CPU0:XRv-1(config)#commit
Sat Jul  2 17:54:00.703 UTC
RP/0/0/CPU0:XRv-1(config)#end
RP/0/0/CPU0:XRv-1#

John was in and out of that configuration in ~17 seconds.

What was the blunder? John's intended Network to delete was the one attached to Lo58. What did he actually type? Loopback 5* . This '*' acts like a wildcard character. That means John unintentionally deleted networks 50 - 59 , along with network 58.

John's shift key just caused some major impact.

Troubleshooting

The NOC figures there is a bug, a failed ASIC, a hung router, etc. They engage escalation resource, Jenny Bourne. Jenny decides to trace from a relevant PE router, XRv-10, to the defined networks in outage.

RP/0/0/CPU0:XRv-10#traceroute 10.50.50.1 so lo0 probe 1 
Sat Jul  2 17:55:17.378 UTC

Type escape sequence to abort.
Tracing the route to 10.50.50.1

 1  10.5.10.5 0 msec 
 2  10.4.5.4 0 msec 
 3  10.3.4.3 0 msec 
 4  10.2.3.2 0 msec 
 5   * 
 6   * 
 7   * 
 8   * 
 9  
RP/0/0/CPU0:XRv-10#
RP/0/0/CPU0:XRv-10#ping 10.50.55.1 so lo0                          
Sat Jul  2 17:54:19.641 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.50.55.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
RP/0/0/CPU0:XRv-10#ping 10.50.50.1 so lo0
Sat Jul  2 17:54:35.460 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.50.50.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
RP/0/0/CPU0:XRv-10#

This trace shows that, in hop 4, packets get to the link between XRv-3 and XRv-2. Then it seems like XRv-2 or XRv-1 was dropping the traffic.

She checked Router 2:

RP/0/0/CPU0:XRv-2#show cef 10.50.50.0 detail  | i "version|via|Inter|Giga"
Sat Jul  2 18:35:02.247 UTC
10.48.0.0/12, version 51, internal 0x1000001 0x0 (ptr 0xa143c1f4) [1], 0x0 (0x0), 0x0 (0x0)
   via 10.1.2.1/32, 2 dependencies, recursive, bgp-ext [flags 0x6020]
    next hop 10.1.2.1/32 via 10.1.2.1/32
    Hash  OK  Interface                 Address
    0     Y   GigabitEthernet0/0/0/0    10.1.2.1       
RP/0/0/CPU0:XRv-2

This looks alright. XRv-2 had a valid route to XRv-1 . She then checked router 1.

RP/0/0/CPU0:XRv-1#show route 10.50.50.1
Sat Jul  2 18:03:16.485 UTC

Routing entry for 10.48.0.0/12
  Known via "bgp 1", distance 200, metric 0, type locally generated
  Installed Jul  2 17:41:15.775 for 00:22:00
  Routing Descriptor Blocks
    directly connected, via Null0
      Route metric is 0
  No advertising protos. 
RP/0/0/CPU0:XRv-1#

Next-hop via Null0 , that means this router is throwing packets in the garbage. She then checked to see what's changed on the router.

RP/0/0/CPU0:XRv-1#show configuration commit list | utility egrep -C1 "~"  
Sat Jul  2 17:57:38.308 UTC
SNo. Label/ID              User      Line                Client      Time Stamp
~~~~ ~~~~~~~~              ~~~~      ~~~~                ~~~~~~      ~~~~~~~~~~
1    1000000064            jdoe    con0_0_CPU0         CLI         Sat Jul  2 17:54:00 2022
RP/0/0/CPU0:XRv-1#

Resolution

It appears her colleague, John, made a change. She checked to see what it entailed.

RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-1#configure 
Sat Jul  2 17:59:24.871 UTC
RP/0/0/CPU0:XRv-1(config)#
RP/0/0/CPU0:XRv-1(config)#load rollback changes last 1
Building configuration...
Loading.
611 bytes parsed in 1 sec (593)bytes/sec
RP/0/0/CPU0:XRv-1(config)#show commit changes diff 
Sat Jul  2 17:59:46.179 UTC
Building configuration...
!! IOS XR Configuration 6.1.3
+  interface Loopback50
+   ipv4 address 10.50.50.1 255.255.255.255
   !
+  interface Loopback51
+   ipv4 address 10.50.51.1 255.255.255.255
   !
+  interface Loopback52
+   ipv4 address 10.50.52.1 255.255.255.255
   !
+  interface Loopback53
+   ipv4 address 10.50.53.1 255.255.255.255
   !
+  interface Loopback54
+   ipv4 address 10.50.54.1 255.255.255.255
   !
+  interface Loopback55
+   ipv4 address 10.50.55.1 255.255.255.255
   !
+  interface Loopback56
+   ipv4 address 10.50.56.1 255.255.255.255
   !
+  interface Loopback57
+   ipv4 address 10.50.57.1 255.255.255.255
   !
+  interface Loopback58
+   ipv4 address 10.50.58.1 255.255.255.255
   !
end

RP/0/0/CPU0:XRv-1(config)#commit
Sat Jul  2 17:59:52.409 UTC
RP/0/0/CPU0:XRv-1(config)#end
RP/0/0/CPU0:XRv-1#

She observed that John deleted the networks above. Since customers & the NOC were complaining, and she recognized the networks to be valid for production, she did not hesistate to restore the prefixes.

Immediately, after the commit, she checked connectivity.

RP/0/0/CPU0:XRv-10#traceroute 10.50.50.1 so lo0 probe 1 
Sat Jul  2 17:59:57.298 UTC

Type escape sequence to abort.
Tracing the route to 10.50.50.1

 1  10.5.10.5 0 msec 
 2  10.4.5.4 0 msec 
 3  10.3.4.3 0 msec 
 4  10.2.3.2 0 msec 
 5  10.1.2.1 0 msec 
RP/0/0/CPU0:XRv-10#ping 10.50.50.1 so lo0               
Sat Jul  2 18:00:02.288 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.50.50.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/9 ms
RP/0/0/CPU0:XRv-10#  

Jenny confirmed with the NOC that the board is going green and in-band connectivity appeared restored.

Post-Mortem

Mr. Doe came back from a delicious lunch to find he caused quite a headache for his colleagues. They schedule a meeting and discuss some things that could be done better:

John could've done the following

  • Informed anyone in operations, if not the NOC, that he is about to make a change. The details of the change do not matter, as long as someone knows he is working on the network.
  • He could've ran that same show commit changes diff command Jenny had utilized, prior to hitting commit. That would've exposed the gravity of his typo.
  • He could've added a comment to his change. In the event that this change was more complex, and Jenny had doubts that his change was the clear impact, she could've reviewed the change request / documentation related to his config to see what the scope and impact should be:
RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-1#configure 
Sat Jul  2 18:35:14.153 UTC
RP/0/0/CPU0:XRv-1(config)#no interface loopback 5*
RP/0/0/CPU0:XRv-1(config)#commit comment "CRQ1234"
Sat Jul  2 18:35:35.252 UTC
RP/0/0/CPU0:XRv-1(config)#end
RP/0/0/CPU0:XRv-1#


RP/0/0/CPU0:XRv-1#show configuration commit list 1 detail 
Sat Jul  2 18:36:12.599 UTC

   1) CommitId: 1000000069                 Label: NONE
      UserId:   jdoe                       Line:  con0_0_CPU0
      Client:   CLI                        Time:  Sat Jul  2 18:35:35 2022
      Comment:   "CRQ1234" 
RP/0/0/CPU0:XRv-1#

RP/0/0/CPU0:XRv-1

!!! Or: !!!

RP/0/0/CPU0:XRv-1#configure 
Sat Jul  2 18:38:13.711 UTC
RP/0/0/CPU0:XRv-1(config)#no interface loopback 5*
RP/0/0/CPU0:XRv-1(config)#commit comment "Decom Subscriber Net 58"
Sat Jul  2 18:38:25.040 UTC
RP/0/0/CPU0:XRv-1(config)#
RP/0/0/CPU0:XRv-1(config)#
RP/0/0/CPU0:XRv-1(config)#end
RP/0/0/CPU0:XRv-1#

RP/0/0/CPU0:XRv-1#show configuration commit list 1 detail 
Sat Jul  2 18:38:41.769 UTC

   1) CommitId: 1000000073                 Label: NONE
      UserId:   jdoe                       Line:  con0_0_CPU0
      Client:   CLI                        Time:  Sat Jul  2 18:38:25 2022
      Comment:   "Decom Subscriber Net 58" 
RP/0/0/CPU0:XRv-1#

What could the organization do better?

  • Well, it would be great if the NOC received notifications when someone makes a change on a core device. That would've provided a timestamp & change location for the NOC to trace down, without Jenny's help. This could be SNMP traps, or Slack hooks.
  • Make it clear to operators that the Network is a controlled utility, not a cowboy ranch. John should've created an implementation plan and reviewed it's contents. He then should've submitted a change request to be approved.
  • The org should think harder about how they control changes. Maybe it's time to consider a CI/CD pipeline to decom provisioned resources. The pipeline could even add a comment with the job ID, or the commit id.... Of course, that introduces it's own set of problems and technical debt.
RP/0/0/CPU0:XRv-1#show configuration commit list 1 detail 
Sat Jul  2 18:38:41.769 UTC

   1) CommitId: 1000000073                 Label: NONE
      UserId:   ansible                       Line:  con0_0_CPU0
      Client:   CLI                        Time:  Sat Jul  2 18:38:25 2022
      Comment:   "0ea3815881" 
RP/0/0/CPU0:XRv-1#
Likely the client would be something more like netconf

Shoulda'-Woulda'-Coulda'. The only thing for certain would be that John owed Jenny a stiff drink.