I've been a volunteer Network Consultant with DBIUA for nearly five years. As for how I got involved, and what being a remote volunteer from 4300 kilometers away is like... I'll leave that for another post.
Of what I've achieved in those roughly five years, two things come to mind:
- Maintain & guide our monitoring and management strategy
- Overhaul our routing from static to dynamic
Although the first one is the bigger topic, since it's something that consumes most of my cycles, I'd like to talk about the second point.
What was the routing like before?
DBIUA was founded by several crafty & innovative individuals, but they really went hard mode with their routing scheme in the beginning. Everything was static, and everything was well planned. When I built my first network, I would say the opposite was true.
Chris, our pioneer founder & Jefe Augustus, explains the old setup well in the video below, from 2017.
What's the problem with this network?
Well, for starters, it was complex. That's a lot of artificial hops from one PoP to another. But this was the design that served the many functions DBIUA needed via a collapsed access PoP.
The network was also tedious... When you consider the number of subnets displayed there, and remember that nothing was dynamic, well, that's a lot of static routes to populate. We could mostly solve this with summarization: each hub/access PoP would get subnets allocated from a larger /16 (10.X.x.x). Yet this is still tedious, as every block needs to be programmed on every router to achieve a true full mesh.
The second tedious element is the operator intervention required. If a PoP had two potential paths to the upstream, an operator would have to switch the default route to the alternate path should the primary path lose service. Floating static routes wouldn't work, because the local radio was our next hop, and that interface would not go down even if the upstream PoP went down entirely. IP SLA (or BFD) could've been an option here... but wouldn't that just be additional complexity for something that's built into OSPF?
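To put rough numbers on that, here is a minimal Python sketch of the per-PoP /16 scheme and the static route count a full mesh implies. The PoP IDs, PoP count, and blocks-per-PoP figures are hypothetical, not DBIUA's real inventory, and the count assumes every router carries an explicit route to every other PoP's blocks.

```python
import ipaddress

def pop_block(pop_id: int) -> ipaddress.IPv4Network:
    """Return the summary /16 for a PoP, following the 10.X.0.0/16 pattern."""
    return ipaddress.ip_network(f"10.{pop_id}.0.0/16")

def static_routes_needed(num_pops: int, blocks_per_pop: int = 1) -> int:
    """Static routes for a full mesh: each PoP needs a route to every
    block owned by every other PoP."""
    return num_pops * (num_pops - 1) * blocks_per_pop

# Hypothetical example: PoP 12 and the first few /24s carved from its /16.
block = pop_block(12)
first_subnets = [str(s) for s in list(block.subnets(new_prefix=24))[:3]]
print(block, first_subnets)

# Hypothetical network of 10 PoPs with 8 subnets each.
print(static_routes_needed(10, blocks_per_pop=8))  # 720 without summarization
print(static_routes_needed(10))                    # 90 with one /16 per PoP
```

Summarization cuts the count a lot in this toy case, but every new PoP still means touching every existing router by hand.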
In Comes OSPF
Many grey beards will tell you, "OSPF is easy to deploy, but hard to master." This is mainly because the default setup of a flat single area takes very little configuration, yet is greater than the sum of its parts: it provides keepalives, failover, and network state in 6 lines or less, per device.
The first step to making this work was to change the radios from participating in routing to doing what they do best: being bridges between routers (our EdgePoint routers). We would reconfigure one radio at a time and have the downstream interface on our router inherit its IP. This remained static at first, as the downstream radio still didn't speak OSPF. But once we cut over the downstream radio, OSPF was alive, one PoP at a time.
When all networks were live, we finally had what we needed. It felt great to rip out all the static routing. All of those routes still lived in the routing table, but with a lot less stress on the command line. It was also great to experience our first automatic failover via an alternate path.
So, what was missing?
OSPF's defaults sometimes aren't so smart. Let's say we have a PoP with two potential paths to the internet, and these are equal cost. The problem is that these links may not be created equal. Because we need to thread the needle by pointing radios from home to home, some of them are pointed in not-so-ideal ways. For example, we have one path that uses a low-placed radio shooting across the water. When the tide is low, this works great. But when the tide is high, the water enters the Fresnel zone of the link, causing packet loss and making the link less ideal.
The solution? Static costing. What we typically do is let links that are fine default to a cost of 10; for links we truly prefer to carry the bulk of the traffic, we might set the cost to 1. But for links we want as a backup path, well, we give them a cost of 30. This way, only when the primary path goes down will OSPF recalculate and use the backup path as its main path to the upstream.
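For a sense of scale on the tide problem, the radius of the first Fresnel zone can be estimated with the standard formula r = sqrt(λ · d1 · d2 / (d1 + d2)). The sketch below is illustrative only: the 5 GHz frequency and 2 km link length are assumptions, not the actual parameters of our over-water link.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def fresnel_radius_m(freq_hz: float, d1_m: float, d2_m: float) -> float:
    """Radius of the first Fresnel zone at a point d1 from one radio
    and d2 from the other, in metres."""
    wavelength = SPEED_OF_LIGHT / freq_hz
    return math.sqrt(wavelength * d1_m * d2_m / (d1_m + d2_m))

# Hypothetical link: 5 GHz, 2 km long, measured at the midpoint.
radius = fresnel_radius_m(5e9, 1000.0, 1000.0)
print(f"{radius:.2f} m")
```

With those assumed numbers the zone is several metres wide at mid-path, so a radio mounted low over tidal water can easily lose clearance as the water rises.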
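The effect of those costs can be sketched with a small shortest-path calculation in Python. This is a simplification, not how our routers are configured: the three-node topology and node names are hypothetical, and real OSPF costs are set per interface and per direction, whereas this sketch treats each link as symmetric.

```python
import heapq

def shortest_path(links, src, dst):
    """Dijkstra over an undirected graph; links maps (a, b) -> cost.
    Returns (total_cost, path) or None if dst is unreachable."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    best = {src: 0}
    queue = [(0, src, [src])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, link_cost in graph.get(node, []):
            new_cost = cost + link_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, path + [nxt]))
    return None

# Hypothetical PoP with a good link (cost 10) and an over-water backup (cost 30).
links = {
    ("pop", "hub_a"): 10,
    ("hub_a", "upstream"): 10,
    ("pop", "hub_b"): 30,
    ("hub_b", "upstream"): 10,
}
primary = shortest_path(links, "pop", "upstream")

# Simulate the primary link failing by removing it from the topology.
after_failure = {k: v for k, v in links.items() if k != ("pop", "hub_a")}
backup = shortest_path(after_failure, "pop", "upstream")

print(primary)  # (20, ['pop', 'hub_a', 'upstream'])
print(backup)   # (40, ['pop', 'hub_b', 'upstream'])
```

While both links are up, the cost-30 link never wins; once the primary disappears, the backup is the only path left and is chosen without any operator involvement.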
We also have two upstreams to the internet, via our two providers, and they are not equal in quality either. One is a not-so-reliable microwave backhaul to the mainland. The other is a fiber connection that we shoot our own 24 GHz radio to, which is much more reliable in quality and capacity. Thus, both routers have `default-information originate` set, but one of them we try to keep as a warm standby. Thus we utilize a failover setup via WAN Load-Balancing.
The other thing to consider with this method is that you are applying a policy on traffic ingress to an interface. Thus you must exempt traffic that is internal to your network from this policy. We did this via these simple lines:
If we didn't do this, then traffic would be forced into routing table `101` via the `action modify` line, and that table is essentially empty and vacant of our OSPF routes. But when we say the destination is in RFC 1918 space, we use `action accept`, which essentially means just pass the traffic as you normally would... utilizing the forwarding plane we have populated via OSPF.
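The logic of that exemption can be modelled in a few lines of Python. This is only a sketch of the decision the policy makes, not router configuration: the function name and return strings are made up for illustration, and it assumes the rules are evaluated in order with the RFC 1918 match first.

```python
import ipaddress

# The three RFC 1918 private ranges.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def classify(dst: str) -> str:
    """Mimic the policy order: internal destinations are accepted and
    routed normally, everything else is steered into table 101."""
    addr = ipaddress.ip_address(dst)
    if any(addr in net for net in RFC1918):
        return "accept: main table (OSPF routes)"
    return "modify: table 101"

print(classify("10.4.2.1"))    # internal PoP address
print(classify("172.32.0.1"))  # just outside 172.16.0.0/12
print(classify("8.8.8.8"))     # internet destination
```

Only the first address stays on the OSPF-populated table; the other two fall through to the rule that steers traffic into table 101.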
Please tune in for part two where I go over the fundamentals of OSPF & how we configured it in detail.