What's changed?
In Episode 134 of the The Hedge podcast, Russ White reminisces on his experience in Cisco TAC. Russ states that clients would often try to suggest they've encountered bugs/defects causing major outages, when – in reality – they were often Network changes gone wrong.
Whatever it was you 'didn't change', change it back
- Russ White
In my personal experience, I've found this insight to be true. Most network outages seem to be caused by lack of peer review, lack of testing, careless implementation, lack of change control oversight, and various other lacks of control over operations.
Example Scenario
Let me make a theoretical example to demonstrate. Picture an early afternoon on a beautiful July day. The NOC was out of their chairs, chatting, laughing, and discussing their activities for the upcoming weekend. Suddenly, the board went red. Moments later, tickets started to roll through the floodgates. Apps were down, customers were down – there was now a Priority One incident in progress.
What's the problem? Just another day in Ops Life, no? Unfortunately, nobody had any clue what had happened. One eager operator scrolled through their alarm board and found no indication of fault; No LOS condition, adjacency lost, core device offline, etc.
What could have happened here?
Topology
To demonstrate this outage, I am using my GNS3 lab above, intended for my Segment Routing testing/refresher. This is typically a flat IS-IS topology; However, I've changed the adjacency between XRv-1
& XRv-2
from IS-IS to BGP.
I've also created a set amount of "Networks" to be advertised to the core via XRv-2
. These networks are then summarized via an aggregate route. This method is not uncommon to advertise a set of prefixes from a DC, region, domain, etc.
RP/0/0/CPU0:XRv-1#show ip int br | i Loo
Sat Jul 2 17:51:54.871 UTC
Loopback0 10.255.255.1 Up Up default
Loopback48 10.48.48.1 Up Up default
Loopback50 10.50.50.1 Up Up default
Loopback51 10.50.51.1 Up Up default
Loopback52 10.50.52.1 Up Up default
Loopback53 10.50.53.1 Up Up default
Loopback54 10.50.54.1 Up Up default
Loopback55 10.50.55.1 Up Up default
Loopback56 10.50.56.1 Up Up default
Loopback57 10.50.57.1 Up Up default
Loopback58 10.50.58.1 Up Up default
Loopback60 10.60.60.1 Up Up default
Loopback61 10.60.61.1 Up Up default
Loopback62 10.60.62.1 Up Up default
RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-1# show run router bgp 1 address-family ipv4 unicast
Sat Jul 2 18:22:26.476 UTC
router bgp 1
address-family ipv4 unicast
aggregate-address 10.48.0.0/12 summary-only
redistribute connected
!
!
RP/0/0/CPU0:XRv-1#show bgp sessions
Sat Jul 2 18:24:15.598 UTC
Neighbor VRF Spk AS InQ OutQ NBRState NSRState
10.1.2.2 default 0 2 0 0 Established None
RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-2#show bgp ipv4 unicast neighbors 10.1.2.1 received routes | b Network
Sat Jul 2 18:27:02.247 UTC
Network Next Hop Metric LocPrf Weight Path
*> 10.1.2.0/24 10.1.2.1 0 0 1 ?
*> 10.48.0.0/12 10.1.2.1 0 1 i
*> 10.255.255.1/32 10.1.2.1 0 0 1 ?
*> 100.64.0.0/24 10.1.2.1 0 0 1 ?
Processed 4 prefixes, 4 paths
RP/0/0/CPU0:XRv-2#
RP/0/0/CPU0:XRv-2#show run formal | i redist
Sat Jul 2 18:27:36.445 UTC
Building configuration...
router isis zeal address-family ipv4 unicast redistribute bgp 2 level-2
router bgp 2 address-family ipv4 unicast redistribute isis zeal
RP/0/0/CPU0:XRv-2#
The summary-only
configuration suppresses any prefixes, within the range of 10.48.0.0/12
[ 10.48.0.0 - 10.63.255.255
], and advertises only the aggregate route.
The rest of the core sees this aggregate, due to XRv-2
's redistribution. They can also reach the individual subnets contained within this aggregate.
RP/0/0/CPU0:XRv-10#show route 10.50.50.1
Sat Jul 2 17:51:33.993 UTC
Routing entry for 10.48.0.0/12
Known via "isis zeal", distance 115, metric 40, type level-2
Installed Jul 2 17:41:16.075 for 00:10:17
Routing Descriptor Blocks
10.5.10.5, from 10.255.255.2, via GigabitEthernet0/0/0/0, Protected
Route metric is 40
10.9.10.9, from 10.255.255.2, via GigabitEthernet0/0/0/1, Protected
Route metric is 40
No advertising protos.
RP/0/0/CPU0:XRv-10#
RP/0/0/CPU0:XRv-10#ping 10.50.50.1 so lo0
Sat Jul 2 17:52:22.050 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.50.50.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/9 ms
RP/0/0/CPU0:XRv-10#
RP/0/0/CPU0:XRv-10#
RP/0/0/CPU0:XRv-10#ping 10.50.55.1 so lo0
Sat Jul 2 17:52:26.869 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.50.55.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/9/29 ms
RP/0/0/CPU0:XRv-10#
RP/0/0/CPU0:XRv-10#traceroute 10.50.50.1 so lo0 probe 1
Sat Jul 2 17:53:18.956 UTC
Type escape sequence to abort.
Tracing the route to 10.50.50.1
1 10.5.10.5 9 msec
2 10.4.5.4 0 msec
3 10.3.4.3 0 msec
4 10.2.3.2 0 msec
5 10.1.2.1 0 msec
RP/0/0/CPU0:XRv-10#
So what happened?
Now begins the story of our outage lottery winner, John Doe. John Doe's blunder is very much inspired by a similar real-world scenario I witnessed. The topology, impact, and actions are changed, for the sake of discretion.
John Doe is in the provisioning team. He received a request to decommission the network of a service that had been cancelled. John Doe believed that the change approval process was cumbersome. Thus, he decided that he would quickly delete this network and update the relevant documentation.
John logged into XRv-1
and manually deleted the network. He then went to his long overdue lunch, as it was nearly 2 pm in the afternoon.
John did not inform the NOC he made this change. He did not utilize monitoring systems to see the impact he caused. He also did not validate the configuration he was implementing.
RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-1#configure
Sat Jul 2 17:53:43.904 UTC
RP/0/0/CPU0:XRv-1(config)#no interface loopback 5*
RP/0/0/CPU0:XRv-1(config)#commit
Sat Jul 2 17:54:00.703 UTC
RP/0/0/CPU0:XRv-1(config)#end
RP/0/0/CPU0:XRv-1#
John was in and out of that configuration in ~17 seconds.
What was the blunder? John's intended Network to delete was the one attached to Lo58
. What did he actually type? Loopback 5*
. This '*'
acts like a wildcard character. That means John unintentionally deleted networks 50 - 59
, along with network 58
.
John's shift
key just caused some major impact.
Troubleshooting
The NOC figures there is a bug, a failed ASIC, a hung router, etc. They engage escalation resource, Jenny Bourne. Jenny decides to trace from a relevant PE router, XRv-10
, to the defined networks in outage.
RP/0/0/CPU0:XRv-10#traceroute 10.50.50.1 so lo0 probe 1
Sat Jul 2 17:55:17.378 UTC
Type escape sequence to abort.
Tracing the route to 10.50.50.1
1 10.5.10.5 0 msec
2 10.4.5.4 0 msec
3 10.3.4.3 0 msec
4 10.2.3.2 0 msec
5 *
6 *
7 *
8 *
9
RP/0/0/CPU0:XRv-10#
RP/0/0/CPU0:XRv-10#ping 10.50.55.1 so lo0
Sat Jul 2 17:54:19.641 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.50.55.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
RP/0/0/CPU0:XRv-10#ping 10.50.50.1 so lo0
Sat Jul 2 17:54:35.460 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.50.50.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
RP/0/0/CPU0:XRv-10#
This trace shows that, in hop 4, packets get to the link between XRv-3
and XRv-2
. Then it seems like XRv-2
or XRv-1
was dropping the traffic.
She checked Router 2:
RP/0/0/CPU0:XRv-2#show cef 10.50.50.0 detail | i "version|via|Inter|Giga"
Sat Jul 2 18:35:02.247 UTC
10.48.0.0/12, version 51, internal 0x1000001 0x0 (ptr 0xa143c1f4) [1], 0x0 (0x0), 0x0 (0x0)
via 10.1.2.1/32, 2 dependencies, recursive, bgp-ext [flags 0x6020]
next hop 10.1.2.1/32 via 10.1.2.1/32
Hash OK Interface Address
0 Y GigabitEthernet0/0/0/0 10.1.2.1
RP/0/0/CPU0:XRv-2
This looks alright. XRv-2
had a valid route to XRv-1
. She then checked router 1.
RP/0/0/CPU0:XRv-1#show route 10.50.50.1
Sat Jul 2 18:03:16.485 UTC
Routing entry for 10.48.0.0/12
Known via "bgp 1", distance 200, metric 0, type locally generated
Installed Jul 2 17:41:15.775 for 00:22:00
Routing Descriptor Blocks
directly connected, via Null0
Route metric is 0
No advertising protos.
RP/0/0/CPU0:XRv-1#
Next-hop via Null0
, that means this router is throwing packets in the garbage. She then checked to see what's changed on the router.
RP/0/0/CPU0:XRv-1#show configuration commit list | utility egrep -C1 "~"
Sat Jul 2 17:57:38.308 UTC
SNo. Label/ID User Line Client Time Stamp
~~~~ ~~~~~~~~ ~~~~ ~~~~ ~~~~~~ ~~~~~~~~~~
1 1000000064 jdoe con0_0_CPU0 CLI Sat Jul 2 17:54:00 2022
RP/0/0/CPU0:XRv-1#
Resolution
It appears her colleague, John, made a change. She checked to see what it entailed.
RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-1#configure
Sat Jul 2 17:59:24.871 UTC
RP/0/0/CPU0:XRv-1(config)#
RP/0/0/CPU0:XRv-1(config)#load rollback changes last 1
Building configuration...
Loading.
611 bytes parsed in 1 sec (593)bytes/sec
RP/0/0/CPU0:XRv-1(config)#show commit changes diff
Sat Jul 2 17:59:46.179 UTC
Building configuration...
!! IOS XR Configuration 6.1.3
+ interface Loopback50
+ ipv4 address 10.50.50.1 255.255.255.255
!
+ interface Loopback51
+ ipv4 address 10.50.51.1 255.255.255.255
!
+ interface Loopback52
+ ipv4 address 10.50.52.1 255.255.255.255
!
+ interface Loopback53
+ ipv4 address 10.50.53.1 255.255.255.255
!
+ interface Loopback54
+ ipv4 address 10.50.54.1 255.255.255.255
!
+ interface Loopback55
+ ipv4 address 10.50.55.1 255.255.255.255
!
+ interface Loopback56
+ ipv4 address 10.50.56.1 255.255.255.255
!
+ interface Loopback57
+ ipv4 address 10.50.57.1 255.255.255.255
!
+ interface Loopback58
+ ipv4 address 10.50.58.1 255.255.255.255
!
end
RP/0/0/CPU0:XRv-1(config)#commit
Sat Jul 2 17:59:52.409 UTC
RP/0/0/CPU0:XRv-1(config)#end
RP/0/0/CPU0:XRv-1#
She observed that John deleted the networks above. Since customers & the NOC were complaining, and she recognized the networks to be valid for production, she did not hesistate to restore the prefixes.
Immediately, after the commit, she checked connectivity.
RP/0/0/CPU0:XRv-10#traceroute 10.50.50.1 so lo0 probe 1
Sat Jul 2 17:59:57.298 UTC
Type escape sequence to abort.
Tracing the route to 10.50.50.1
1 10.5.10.5 0 msec
2 10.4.5.4 0 msec
3 10.3.4.3 0 msec
4 10.2.3.2 0 msec
5 10.1.2.1 0 msec
RP/0/0/CPU0:XRv-10#ping 10.50.50.1 so lo0
Sat Jul 2 18:00:02.288 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.50.50.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/9 ms
RP/0/0/CPU0:XRv-10#
Jenny confirmed with the NOC that the board is going green and in-band connectivity appeared restored.
Post-Mortem
Mr. Doe came back from a delicious lunch to find he caused quite a headache for his colleagues. They schedule a meeting and discuss some things that could be done better:
John could've done the following
- Informed anyone in operations, if not the NOC, that he is about to make a change. The details of the change do not matter, as long as someone knows he is working on the network.
- He could've ran that same
show commit changes diff
command Jenny had utilized, prior to hitting commit. That would've exposed the gravity of his typo. - He could've added a comment to his change. In the event that this change was more complex, and Jenny had doubts that his change was the clear impact, she could've reviewed the change request / documentation related to his config to see what the scope and impact should be:
RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-1#configure
Sat Jul 2 18:35:14.153 UTC
RP/0/0/CPU0:XRv-1(config)#no interface loopback 5*
RP/0/0/CPU0:XRv-1(config)#commit comment "CRQ1234"
Sat Jul 2 18:35:35.252 UTC
RP/0/0/CPU0:XRv-1(config)#end
RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-1#show configuration commit list 1 detail
Sat Jul 2 18:36:12.599 UTC
1) CommitId: 1000000069 Label: NONE
UserId: jdoe Line: con0_0_CPU0
Client: CLI Time: Sat Jul 2 18:35:35 2022
Comment: "CRQ1234"
RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-1
!!! Or: !!!
RP/0/0/CPU0:XRv-1#configure
Sat Jul 2 18:38:13.711 UTC
RP/0/0/CPU0:XRv-1(config)#no interface loopback 5*
RP/0/0/CPU0:XRv-1(config)#commit comment "Decom Subscriber Net 58"
Sat Jul 2 18:38:25.040 UTC
RP/0/0/CPU0:XRv-1(config)#
RP/0/0/CPU0:XRv-1(config)#
RP/0/0/CPU0:XRv-1(config)#end
RP/0/0/CPU0:XRv-1#
RP/0/0/CPU0:XRv-1#show configuration commit list 1 detail
Sat Jul 2 18:38:41.769 UTC
1) CommitId: 1000000073 Label: NONE
UserId: jdoe Line: con0_0_CPU0
Client: CLI Time: Sat Jul 2 18:38:25 2022
Comment: "Decom Subscriber Net 58"
RP/0/0/CPU0:XRv-1#
What could the organization do better?
- Well, it would be great if the NOC received notifications when someone makes a change on a core device. That would've provided a timestamp & change location for the NOC to trace down, without Jenny's help. This could be SNMP traps, or Slack hooks.
- Make it clear to operators that the Network is a controlled utility, not a cowboy ranch. John should've created an implementation plan and reviewed it's contents. He then should've submitted a change request to be approved.
- The org should think harder about how they control changes. Maybe it's time to consider a CI/CD pipeline to decom provisioned resources. The pipeline could even add a comment with the job ID, or the commit id.... Of course, that introduces it's own set of problems and technical debt.
Shoulda'-Woulda'-Coulda'. The only thing for certain would be that John owed Jenny a stiff drink.