Your Router Trusts You Too Much
When You Make An Uh-Oh
You've probably been there at least once. You're in a maintenance window. You make a change and the router is suddenly unreachable. You wait and wait, but it does not come back. You finally accept defeat & start the process for a ride-in, a DC tech, a truck roll, etc., etc.
It's not a good feeling. But, surely, it can be prevented. There are typically two ways this occurs:
1. The firmware/software upgrade.
Let's say you staged this upgrade – you should've, if you could've, anyway – but it breaks in production. Just another one for the Network Voodoo... move on and laugh it off.
You can also use it as a lame excuse to buy out your competitors.
2. A config change....
This is one you really should try your best to prevent. Lab your topology changes as much as possible. Have strong peer review processes... but for everything else, there's another way.
Podcast Feedback
In 2021 I submitted this feedback to one of my favourite podcasts, 2.5 Admin, regarding a fail-safe mechanism for most platforms:
Joe, Jim & Alan do great work with this podcast. If you're a Network Professional, you should expand your horizons and tune in.
The Mechanism
This is often called a confirmed commit. You make a change on a device. The device waits a set period, expecting you to confirm that things are alright. If no confirmation arrives in time, the device rolls the configuration back on its own.
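To make the idea concrete, here is a toy model of that behaviour in Python. It is only a sketch: the class name `ConfirmedCommitDevice`, its methods and the explicit `now` argument are all made up for illustration, and no real platform exposes this API. The clock is passed in by hand so the example can run without sleeping.

```python
import copy


class ConfirmedCommitDevice:
    """Toy model of a device that supports confirmed commits (hypothetical)."""

    def __init__(self, config):
        self.active = copy.deepcopy(config)  # configuration currently in effect
        self._previous = None                # snapshot to restore on timeout
        self._deadline = None                # time at which the rollback fires

    def commit_confirmed(self, new_config, timeout_s, now):
        """Activate new_config, but keep the old one until the change is confirmed."""
        self._previous = self.active
        self.active = copy.deepcopy(new_config)
        self._deadline = now + timeout_s

    def tick(self, now):
        """Run by the device itself; rolls back once the deadline has passed."""
        if self._deadline is not None and now >= self._deadline:
            self.active = self._previous
            self._previous = None
            self._deadline = None
            return True   # a rollback happened
        return False

    def confirm(self, now):
        """Make the change permanent, unless the timer has already expired."""
        self.tick(now)
        if self._deadline is None:
            return False  # nothing left to confirm
        self._previous = None
        self._deadline = None
        return True


# The operator deletes an interface, gets locked out and never confirms.
dev = ConfirmedCommitDevice({"ge-0/0/0": "10.4.6.1/31"})
dev.commit_confirmed({}, timeout_s=60, now=0)
dev.tick(now=61)
print(dev.active)  # the interface is back
```

The important design point is that the timer lives on the device, not in your SSH session, so the rollback still fires after you have lost access.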
Let's look at an example from the platform this feature is best known from, JunOS:
root@BB6> configure
Entering configuration mode
[edit]
root@BB6# delete interfaces ge-0/0/0
[edit]
root@BB6# show | compare
[edit interfaces]
-   ge-0/0/0 {
-       description "PRODUCTION BREAKING CHANGE!!";
-       unit 0 {
-           family inet {
-               address 10.4.6.1/31;
-           }
-           family iso;
-           family mpls;
-       }
-   }
[edit]
root@BB6# commit confirmed 1
commit confirmed will be automatically rolled back in 1 minutes unless confirmed
commit complete
# commit confirmed will be rolled back in 1 minute
[edit]
root@BB6#
Let's say that, as the description PRODUCTION BREAKING CHANGE!! implied, this breaks production and locks you out of the box. The mechanism will save you and automatically roll back the change:
[edit]
Broadcast Message from root@BB6
(no tty) at 0:56 UTC...
Commit was not confirmed; automatic rollback complete.
[edit]
root@BB6# exit
Exiting configuration mode
root@BB6> show configuration | display set | match 0/0/0
set interfaces ge-0/0/0 description "PRODUCTION BREAKING CHANGE!!"
set interfaces ge-0/0/0 unit 0 family inet address 10.4.6.1/31
set interfaces ge-0/0/0 unit 0 family iso
set interfaces ge-0/0/0 unit 0 family mpls
set protocols mpls interface ge-0/0/0.0
set protocols isis interface ge-0/0/0.0 point-to-point
root@BB6>
This convention also exists on IOS-XR:
RP/0/0/CPU0:BB4#configure
RP/0/0/CPU0:BB4(config)#no router isis BB
RP/0/0/CPU0:BB4(config)#show commit changes diff
Building configuration...
!! IOS XR Configuration 6.1.3
- router isis BB
- apply-group GROUP_TILFA
- is-type level-2-only
- net 49.0000.0000.0000.0004.00
- address-family ipv4 unicast
- metric-style wide
- segment-routing mpls sr-prefer
!
- interface Loopback0
- passive
- address-family ipv4 unicast
- prefix-sid index 4
!
!
- interface GigabitEthernet0/0/0/0
- point-to-point
- address-family ipv4 unicast
!
- interface GigabitEthernet0/0/0/2
- point-to-point
- address-family ipv4 unicast
!
- interface GigabitEthernet0/0/0/4
- point-to-point
- address-family ipv4 unicast
!
!
end
RP/0/0/CPU0:BB4(config)#commit confirmed ?
<30-65535> Seconds until rollback unless there is a confirming commit
minutes Specify the rollback timer in the minutes
show-error Displays commit failures immediately
<cr> Commit the configuration changes via pseudo-atomic operation
RP/0/0/CPU0:BB4(config)#commit confirmed 30
RP/0/0/CPU0:BB4(config)#
RP/0/0/CPU0:BB4#show log | i "commit changes"
RP/0/0/CPU0:Jan 31 00:59:07.425 : config[65741]: %MGBL-CONFIG-6-DB_COMMIT : Configuration committed by user 'kazaii'. Use 'show configuration commit changes 1000000034' to view the changes.
RP/0/0/CPU0:Jan 31 00:59:44.622 : cfgmgr_trial_confirm[65743]: %MGBL-CONFIG-6-DB_COMMIT : Configuration committed by user 'kazaii'. Use 'show configuration commit changes 1000000035' to view the changes.
RP/0/0/CPU0:BB4#
RP/0/0/CPU0:BB4#show isis adja
IS-IS BB Level-2 adjacencies:
System Id      Interface        SNPA           State Hold Changed  NSF IPv4 IPv6
                                                                       BFD  BFD
BB3            Gi0/0/0/0        *PtoP*         Up    28   00:02:06 No   None None
BB6            Gi0/0/0/4        *PtoP*         Up    18   00:02:06 Yes  None None
BB5            Gi0/0/0/2        *PtoP*         Up    29   00:02:06 No   None None
Total adjacency count: 3
RP/0/0/CPU0:BB4#
For these platforms, and several others, things really are that elegant: you can use this feature to get out of Uh-Oh scenarios quickly, with limited impact.
You can also set it for a much broader timeframe – say, for the duration of your maintenance window. You can keep testing your environment... knowing that the system will roll back for you if you never confirm.
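If you want the timer to span the whole window, a small helper can work out the number for you. The sketch below is an assumption-laden illustration: the function name `commit_confirmed_for_window` is made up, and the only platform fact it relies on is the `<30-65535>` seconds range from the IOS-XR help output shown above. Other platforms have their own units and limits.

```python
from datetime import datetime, timedelta

# Range taken from the IOS-XR help output above: <30-65535> seconds.
MIN_SECONDS = 30
MAX_SECONDS = 65535


def commit_confirmed_for_window(now, window_end):
    """Build a 'commit confirmed <seconds>' command lasting until window_end.

    Hypothetical helper. If the window is longer than the platform allows,
    the timer is capped, which means the rollback fires before the window ends.
    """
    remaining = int((window_end - now).total_seconds())
    if remaining < MIN_SECONDS:
        raise ValueError("less than 30 seconds left in the maintenance window")
    return f"commit confirmed {min(remaining, MAX_SECONDS)}"


start = datetime(2024, 1, 31, 0, 0)
print(commit_confirmed_for_window(start, start + timedelta(hours=2)))
# commit confirmed 7200
```

Note the cap: 65535 seconds is a little over 18 hours, so a longer window would need the change to be confirmed and re-armed along the way.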
Caveats
Sadly, sometimes it's less of a snapshot and more of a slapshot...
On other platforms (classic IOS, IOS-XE, VyOS, EdgeOS, and several others) the feature is implemented by rebooting the box after a set interval.
For IOS, you would perform the command:
reload in 15
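– and the box would reload in 15 minutes, booting back into the saved startup configuration as long as you have not saved the change. If everything checks out, reload cancel calls off the reboot.
Why is this such a stark difference? Because if you merely locked yourself out of management while inline customer traffic was still flowing, the reboot turns a management problem into a more serious outage. A truck roll might have been preferable... Here is what that looks like on VyOS: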
vyos@BB3# delete interfaces ethernet eth0
[edit]
vyos@BB3# show | compare
[edit interfaces]
-ethernet eth0 {
-    address 10.2.3.3/31
-    hw-id 0c:be:97:ba:00:00
-}
[edit]
vyos@BB3# commit-confirm 1
commit-confirm will automatically reboot in 1 minutes unless changes are confirmed.
Proceed? [y]y
Reboot scheduled for commit-confirm. Confirm your changes to cancel the reboot.
[edit]
vyos@BB3#
[edit]
vyos@BB3# [ OK ] Stopped /usr/bin/sg vyatta…/archive/config.boot-rollback.
[ OK ] Stopped /usr/bin/sg vyatta…/archive/config.boot-rollback.
Stopping Session 1 of user vyos.
[ OK ] Removed slice system-modprobe.slice.
[ OK ] Stopped target Graphical Interface.
[ OK ] Stopped target Timers.
[ OK ] Stopped Periodic ext4 Onli…ata Check for All Filesystems.
[ OK ] Stopped Discard unused blocks once a week.
[ OK ] Stopped Daily rotation of log files.
My Thoughts
The best method is really to take your time & do things right:
- Give yourself time.
- Lab things up.
- Test scenarios.
- Draw things out.
- Have peer reviews.
- Before you commit, check the diff.
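On platforms without a built-in compare command, or inside automation, the diff step from the list above can be approximated with Python's standard `difflib`. This is a rough sketch, assuming you already have the running and candidate configurations as plain text; the function name `config_diff` is made up for illustration.

```python
import difflib


def config_diff(running, candidate):
    """Return a unified diff between two configurations given as text."""
    lines = difflib.unified_diff(
        running.splitlines(),
        candidate.splitlines(),
        fromfile="running",
        tofile="candidate",
        lineterm="",
    )
    return "\n".join(lines)


running = """set interfaces ge-0/0/0 unit 0 family inet address 10.4.6.1/31
set interfaces ge-0/0/0 unit 0 family iso
set interfaces ge-0/0/0 unit 0 family mpls"""

candidate = """set interfaces ge-0/0/0 unit 0 family inet address 10.4.6.1/31
set interfaces ge-0/0/0 unit 0 family iso"""

diff = config_diff(running, candidate)
print(diff if diff else "no changes")
```

An empty diff when you expected changes, or a non-empty one when you expected none, is a cheap signal to stop before you commit.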
Certifications spend way too much time focusing on configuring things & configuring them as fast as possible. There really should be more focus on "... if I hit <Enter> now, what do I expect to happen?"
Maybe you should spend more time building tests & validation into your automation, before your automation becomes a distributed outage bot.
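One way to wire that into automation is to pair the confirmed commit with post-change checks, and only send the confirmation when every check passes. The sketch below is deliberately abstract: `guarded_change` and its three arguments are hypothetical placeholders for whatever your tooling uses to push config and run checks, not functions from any real library.

```python
def guarded_change(apply_confirmed, checks, confirm):
    """Push a change with a rollback timer; confirm only if all checks pass.

    All arguments are hypothetical callables supplied by your own automation:
      apply_confirmed()  pushes the change with a confirmed-commit timer
      checks             list of (name, callable returning True/False)
      confirm()          sends the confirming commit
    Returns the names of the failed checks (empty list on success).
    """
    apply_confirmed()
    failed = []
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that blows up counts as a failure
        if not ok:
            failed.append(name)
    if failed:
        # Deliberately do nothing: the device's own timer undoes the change.
        return failed
    confirm()
    return []


events = []
failed = guarded_change(
    apply_confirmed=lambda: events.append("commit confirmed"),
    checks=[
        ("isis adjacencies up", lambda: True),
        ("customer ping", lambda: False),
    ],
    confirm=lambda: events.append("commit"),
)
print(failed)  # ['customer ping']
print(events)  # ['commit confirmed']
```

The failure path relies on the device rather than on the script, so the rollback still happens if the automation host itself loses reachability.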
It's too dangerous to commit alone! Use a confirmed commit.