OT / ICS security · field research & frameworks

Writing · 2025-07-01 · 9 min read · networks · SHIP book

Why Your Network Only Breaks at 2 AM on Sundays

The dairy-plant spanning-tree disaster that explains why industrial networks fail on nights and weekends. Best-effort IT Ethernet versus deterministic OT, and the case for designing to fail gracefully. (SHIP Framework, Chapter 1.)

River Caudle · rivercaudle.com

The Story That Changes Everything

Settle in now, boys and girls. It's time for a story about networks that seem perfectly fine until they're absolutely not.

About a year ago, I was called to a large dairy plant facing a problem that was much more than a nuisance. Every so often, certain buckets of the MCC (Motor Control Center) and all their connected gear would simply vanish from the network. When this happened, it didn't just interrupt operations. It shut down production entirely, causing major losses and requiring frantic manual intervention to keep things moving. To make matters worse, these incidents almost always happened on nights and weekends, when support was thin and the stakes were high.

On the surface, the network seemed as steady as Dr. Jekyll: calm, predictable, and quietly doing its job. But out of nowhere, usually when everyone was at home or off the clock, a different side would emerge. Segments dropped offline, alarms sounded, and the production line ground to a halt. Then, just as suddenly, everything would return to normal. This left everyone scratching their heads and dreading the next incident.

One of the biggest challenges was the lack of historical logs. The switches weren't set up to store logs persistently, so every reboot wiped the slate clean. There was also no centralized log collection. Even though the critical spanning tree events were being logged in real time, they disappeared before anyone could see them. No one noticed the root cause until I arrived on site and started manually digging through what little was left in memory.
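To make the logging gap concrete, here is a minimal sketch of what a central collector has to do once switch messages reach it: stamp and persist every line, and pull out the spanning tree events. It assumes the Cisco IOS-style `%FACILITY-SEVERITY-MNEMONIC` message tag. The function names and the sample mnemonic in the usage note are hypothetical, and a real deployment would use an established syslog server rather than this script.

```python
import re
from datetime import datetime, timezone

# Matches an IOS-style "%FACILITY-SEVERITY-MNEMONIC" tag inside a syslog line.
TAG = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+)"
)


def persist(lines, path):
    """Append each line to a file with a UTC receive timestamp.

    Writing to disk on a separate host is what lets the record
    survive a switch reboot.
    """
    with open(path, "a", encoding="utf-8") as handle:
        for line in lines:
            stamp = datetime.now(timezone.utc).isoformat()
            handle.write(f"{stamp} {line}\n")


def spanning_tree_events(lines, facility="SPANTREE"):
    """Return (severity, mnemonic, line) for each line tagged with `facility`."""
    events = []
    for line in lines:
        match = TAG.search(line)
        if match and match.group("facility") == facility:
            events.append(
                (int(match.group("severity")), match.group("mnemonic"), line)
            )
    return events
```

Fed an illustrative line such as `%SPANTREE-2-EXAMPLE_BLOCK: ...` (an invented mnemonic, not a real message), `spanning_tree_events` returns it with severity 2, while unrelated facilities are ignored.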

Years of experience taught me to look for subtle patterns. Eventually, I traced the outages to a spanning tree protocol mismatch. The Cisco Catalyst distribution switches were running MSTP, while the Stratix 5200 access switches were using PVST+. Most days, this mismatch lay dormant. However, when someone plugged in a laptop running virtualization software that broadcast its own BPDUs, or when a topology change occurred, the Cisco switch would take over as root. The Stratix switches, seeing an unfamiliar BPDU, would lock down and cut off entire MCC sections until things settled down. Production stopped, losses mounted, and someone had to drop everything to fix it, almost always at the worst possible time.
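Why can one stray BPDU rearrange a whole plant? Spanning tree elects as root the bridge with the lowest bridge ID, comparing priority first and MAC address second. The sketch below shows only that comparison. The device names, priorities, and MAC addresses are invented for illustration and are not the dairy plant's actual values.

```python
def bridge_id(priority, mac):
    """Return a comparable bridge ID: priority first, MAC address as tie-breaker."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(bridges):
    """Pick the root bridge.

    `bridges` maps a device name to (priority, mac).
    The lowest bridge ID wins the election.
    """
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


# Hypothetical topology: a distribution switch tuned to be root,
# and an access switch left at the common default priority of 32768.
plant = {
    "distribution": (4096, "00:11:22:33:44:55"),
    "access": (32768, "00:11:22:33:44:66"),
}

# A laptop whose virtual switch advertises priority 0 sends a superior BPDU.
with_laptop = dict(plant, laptop=(0, "02:00:00:00:00:01"))
```

With these made-up values, `elect_root(plant)` picks the distribution switch, and `elect_root(with_laptop)` picks the laptop. Every switch that hears the superior BPDU has to reconverge, which is the window in which ports block and devices drop off.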

Once the diagnosis was clear, the solution was straightforward. I brought the switch configurations into alignment, making sure both sides spoke the same protocol and followed best practices. Afterward, production outages stopped. The network returned to a reliable, stable state and the plant could finally operate smoothly.

But here's the real lesson: this wasn't an equipment failure or a design flaw. It was a fundamental misunderstanding of what industrial networks need to do and how they differ from the office networks that most IT professionals understand.

Why IT and OT Were Never Designed for the Same World

When most people think about networking, they think about the internet. Email, web browsing, file transfers, video calls. In this world, if your email takes an extra second to load, you might be mildly annoyed. If your video call drops, you redial. If a file transfer fails, you try again.

This is the world that standard Ethernet was designed for. Born in the 1970s for connecting computers in offices, Ethernet's fundamental philosophy is "best effort" delivery. It tries its hardest to get your data where it needs to go, and if something goes wrong along the way, higher-level protocols will figure it out. TCP will retransmit lost packets. Applications will handle delays gracefully. Users will adapt to the occasional hiccup.

But on the plant floor, an extra second isn't mildly annoying; it can be catastrophic. A delayed command to a safety system can mean the difference between a controlled shutdown and an explosion. A lost packet in a high-speed motion control application can destroy product or damage equipment. A network hiccup during a critical process step can ruin an entire batch worth millions of dollars.

Operational Technology networks weren't designed for "good enough." They were designed for predictable, guaranteed, deterministic communication. When a PLC sends a command to stop a conveyor belt, that command needs to arrive within a specific time window, not when the network gets around to it.

This is why, for decades, the plant floor ran on specialized industrial protocols. ControlNet, DeviceNet, Profibus, and Foundation Fieldbus weren't just different cables and connectors. They were fundamentally different approaches to communication, built from the ground up for deterministic, real-time control.

The Promise and Problem of Determinism

Deterministic communication means predictable timing. When you send a message, you know the latest moment it can arrive and how little that arrival time varies from one cycle to the next. If the system is designed for 10-millisecond cycles, every message arrives within that window, every time, no exceptions.
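One way to make "within that window, every time" measurable is to record arrival timestamps and check the worst interval and the jitter against the cycle budget. This is a minimal sketch under that assumption. The function name and the sample timestamps are hypothetical, and times are integers in microseconds to avoid floating-point surprises.

```python
def timing_report(arrivals_us, cycle_us):
    """Summarize inter-arrival timing for a cyclic message.

    arrivals_us: arrival timestamps in microseconds, in order (at least two).
    cycle_us:    the designed cycle time in microseconds.
    """
    intervals = [later - earlier for earlier, later in zip(arrivals_us, arrivals_us[1:])]
    worst = max(intervals)
    return {
        "worst_interval_us": worst,
        "jitter_us": worst - min(intervals),  # spread between slowest and fastest cycle
        "within_budget": worst <= cycle_us,   # True only if no cycle overran
    }


# A steady 10 ms stream, and one where a single message arrives 40 ms late.
steady = timing_report([0, 10_000, 20_000, 30_000], cycle_us=10_000)
late = timing_report([0, 10_000, 60_000, 70_000], cycle_us=10_000)
```

The point of the check is the "no exceptions" part: the late stream fails on a single slow interval even though its other intervals are on time.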

Traditional industrial networks achieved this through carefully controlled access to the communication medium. ControlNet, for example, used a token-passing scheme where only one device could transmit at a time, eliminating collisions and ensuring predictable timing. DeviceNet used a priority-based system where more critical messages always took precedence.
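The priority scheme is easy to see in miniature. DeviceNet is built on CAN, where nodes that start transmitting at the same moment resolve the conflict bit by bit: a 0 (dominant) on the bus overrides a 1 (recessive), so the lowest identifier wins and nothing is lost to a collision. The sketch below simulates that arbitration for 11-bit identifiers. The function name and identifier values are illustrative only.

```python
def arbitrate(identifiers, width=11):
    """Simulate CAN bitwise arbitration and return the winning identifier.

    Each contender sends its identifier most significant bit first.
    A node that sends recessive (1) but sees dominant (0) on the bus backs off.
    """
    contenders = set(identifiers)
    for bit in range(width - 1, -1, -1):
        # The bus carries 0 if any remaining contender sends 0.
        bus = min((ident >> bit) & 1 for ident in contenders)
        contenders = {ident for ident in contenders if (ident >> bit) & 1 == bus}
    # Identifiers are unique on a real bus, so exactly one contender remains.
    (winner,) = contenders
    return winner


# Three hypothetical messages contend; the lowest identifier gets the bus.
winner = arbitrate([0x65, 0x12, 0x3A])
```

The winner transmits without delay and the losers retry afterward, which is why the most critical message has a bounded wait regardless of how busy the bus is.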

This determinism was the foundation of industrial automation. It allowed engineers to design precise, coordinated systems where multiple devices worked together with split-second timing. Robots could work alongside each other without colliding. Conveyor systems could transfer products between stations seamlessly. Safety systems could respond to emergencies within guaranteed time limits.

But determinism came with a cost: complexity, expense, and isolation. Each protocol required specialized hardware, specialized training, and specialized troubleshooting skills. Worse, these networks were islands: difficult to connect to business systems, impossible to monitor with standard IT tools, and resistant to the kind of integrated, data-driven manufacturing that modern competition demands.

Why Determinism Died (And Why We Needed to Kill It)

By the early 2000s, the writing was on the wall. Standard Ethernet was everywhere, fast, cheap, and supported by an entire industry ecosystem. Every IT professional understood it. Every building was wired for it. Every device manufacturer was building Ethernet interfaces by default.

Meanwhile, industrial protocols were becoming expensive anachronisms. ControlNet modules cost thousands of dollars while Ethernet switches cost hundreds. Finding technicians who understood DeviceNet was getting harder every year. Connecting these systems to emerging business intelligence and MES systems required expensive gateways and custom integration.

The industry made a choice: abandon determinism in favor of connectivity. EtherNet/IP, Profinet, Modbus TCP, and other "industrial Ethernet" protocols emerged, running on standard Ethernet infrastructure and, for the most part, on the standard TCP/IP stack. Suddenly, plant floor devices could share the same network as office computers. IT departments could manage industrial networks with familiar tools. Enterprise systems could access real-time production data without complex integration.

But we lost something critical in the transition: guaranteed timing. Standard Ethernet is inherently non-deterministic. Classic shared Ethernet operated on a "carrier sense multiple access with collision detection" (CSMA/CD) basis, meaning devices listened to the network, transmitted when they thought it was clear, and dealt with collisions through "back-off" algorithms when they occurred. Switched, full-duplex links removed the collisions but not the uncertainty: frames still compete for the same switch ports, and congestion is handled by buffering, drops, and higher-layer retransmission, all of which introduce unpredictable delays.
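The back-off behavior shows where the unpredictability comes from. In classic CSMA/CD, after the n-th collision a station waits a random number of slot times between 0 and 2^min(n, 10) - 1, and gives up on the frame after 16 attempts. The sketch below implements that rule. The function names are hypothetical, and the slot time of 512 bit times applies to 10 and 100 Mb/s Ethernet.

```python
import random

SLOT_TIME_BITS = 512  # one slot is 512 bit times on 10/100 Mb/s Ethernet
BACKOFF_LIMIT = 10    # the exponent stops growing after 10 collisions
MAX_ATTEMPTS = 16     # the frame is discarded after 16 failed attempts


def backoff_slots(collision_count, rng=random):
    """Return the random number of slot times to wait after a collision."""
    if collision_count >= MAX_ATTEMPTS:
        raise RuntimeError("excessive collisions: frame discarded")
    exponent = min(collision_count, BACKOFF_LIMIT)
    return rng.randrange(2 ** exponent)  # uniform in [0, 2**exponent - 1]


def worst_case_wait_us(collision_count, bit_rate_mbps=10):
    """Longest possible single back-off, in microseconds, after a collision."""
    exponent = min(collision_count, BACKOFF_LIMIT)
    return (2 ** exponent - 1) * SLOT_TIME_BITS / bit_rate_mbps
```

At 10 Mb/s the longest single wait grows from about 51 microseconds after the first collision to roughly 52 milliseconds after the tenth. The actual wait is random by design, which is the opposite of a guaranteed delivery time.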

For many industrial applications, this was acceptable. A temperature reading that arrives 50 milliseconds late instead of 10 milliseconds is rarely a problem. An HMI screen that updates every 100 milliseconds instead of every 50 milliseconds is barely noticeable to human operators.

But for high-performance applications such as synchronized motion control, safety systems, and high-speed packaging, the loss of determinism was a real limitation. Engineers found themselves having to work around the network's unpredictability, adding delays and safety margins that reduced overall system performance.

How We Reanimated Determinism: The Ingenious Hack

The genius of modern industrial networking isn't that we went back to proprietary protocols. It's that we figured out how to layer deterministic behavior on top of standard, non-deterministic Ethernet infrastructure.

This resurrection happened through a combination of clever engineering, standards development, and sheer determination to make the impossible work. The solution came in several layers:

Layer 1: Time Synchronization

You can't have deterministic communication without a common understanding of time. IEEE 1588 Precision Time Protocol (PTP) provides sub-microsecond time synchronization across Ethernet networks. Every device (controllers, drives, I/O modules, switches) maintains a synchronized clock. This creates the foundation for everything else.
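The arithmetic behind PTP's delay request-response exchange is short enough to show. Four timestamps are collected: t1 when the master sends Sync, t2 when the device receives it, t3 when the device sends Delay_Req, and t4 when the master receives it. Assuming the path delay is the same in both directions, the device can solve for its clock offset. This is a sketch of that calculation only; the function name and sample values are illustrative.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Compute (offset, delay) from one Sync / Delay_Req exchange.

    t1: master clock time when Sync was sent
    t2: device clock time when Sync was received
    t3: device clock time when Delay_Req was sent
    t4: master clock time when Delay_Req was received

    Assumes a symmetric path. A positive offset means the device
    clock is ahead of the master.
    """
    forward = t2 - t1  # path delay + offset
    reverse = t4 - t3  # path delay - offset
    delay = (forward + reverse) / 2
    offset = (forward - reverse) / 2
    return offset, delay


# Hypothetical exchange in nanoseconds: the device clock runs 1500 ns ahead
# and the one-way path delay is 400 ns.
offset_ns, delay_ns = ptp_offset_and_delay(t1=1000, t2=2900, t3=5000, t4=3900)
```

The symmetry assumption matters in practice: any difference between the forward and reverse path delay shows up directly as offset error, which is one reason PTP-aware switches are part of the picture.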

Layer 2: Scheduled Traffic

With synchronized time, you can create schedules. IEEE 802.1Qbv (Time-Aware Shaper) divides network bandwidth into time slots. Critical, time-sensitive traffic gets dedicated windows where no other traffic is allowed. During these windows, the network behaves deterministically because there's no competition for bandwidth.
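A time-aware schedule can be pictured as a repeating list of windows, each naming which traffic classes have an open gate. The sketch below models that idea in simplified form. The class numbering follows the eight 802.1Q traffic classes (0 to 7), but the `GateWindow` type, the 1 ms cycle, and the window sizes are assumptions for illustration, not a vendor configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateWindow:
    duration_us: int          # how long this window lasts
    open_classes: frozenset   # traffic classes allowed to transmit


def open_classes_at(schedule, t_us):
    """Return the traffic classes whose gates are open at time t_us.

    The schedule repeats forever, so the time is first reduced to a
    position inside one cycle.
    """
    cycle_us = sum(window.duration_us for window in schedule)
    position = t_us % cycle_us
    for window in schedule:
        if position < window.duration_us:
            return window.open_classes
        position -= window.duration_us
    raise AssertionError("unreachable: position is always inside the cycle")


# Hypothetical 1 ms cycle: 200 us reserved for class 7 (control traffic),
# then 800 us shared by everything else.
schedule = [
    GateWindow(200, frozenset({7})),
    GateWindow(800, frozenset(range(7))),
]
```

Because every device shares the same clock, every switch on the path can open the class 7 gate at the same point in the cycle, and the control frame never waits behind a file transfer.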

Layer 3: Redundant Paths

For the most critical applications, even a single lost packet is unacceptable. IEEE 802.1CB (Frame Replication and Elimination for Reliability) sends the same critical frame over multiple paths simultaneously. The receiver gets the first frame that arrives and discards duplicates. If one path fails, communication continues seamlessly over the alternate path.
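The elimination half of that mechanism comes down to sequence numbers: the sender tags each frame, copies travel over separate paths, and the receiver passes along the first copy of each number and drops the rest. Here is a minimal sketch of that logic. It keeps an unbounded set of seen numbers for simplicity, whereas a real implementation tracks a bounded window, and the function name and frame tuples are hypothetical.

```python
def eliminate_duplicates(frames):
    """Deliver the first copy of each sequence number and drop later copies.

    frames: iterable of (sequence_number, path, payload) in arrival order.
    Returns the delivered frames in the order they were accepted.
    """
    seen = set()
    delivered = []
    for sequence_number, path, payload in frames:
        if sequence_number in seen:
            continue  # a copy already arrived over another path
        seen.add(sequence_number)
        delivered.append((sequence_number, path, payload))
    return delivered


# Hypothetical arrivals: path A loses frame 2, path B delivers everything.
arrivals = [
    (1, "A", "start"),
    (1, "B", "start"),
    (2, "B", "run"),
    (3, "B", "stop"),
    (3, "A", "stop"),
]
delivered = eliminate_duplicates(arrivals)
```

In this example the application sees frames 1, 2, and 3 exactly once and never learns that path A dropped one. That is also a reason to monitor each path separately, since a silent loss of redundancy looks perfectly healthy from the receiving end.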

Layer 4: Intelligent Switching

Modern industrial Ethernet switches aren't just dumb hubs. They understand industrial protocols, can prioritize traffic based on application requirements, and participate in the time-synchronized scheduling that makes determinism possible.
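Traffic prioritization is the simplest of these behaviors to sketch. Under strict priority queuing, a port always sends the most urgent waiting frame first and keeps arrival order within each priority level. The class below is a toy model of one egress port, assuming a hypothetical scheme where a higher number means more urgent. Real switches map 802.1Q priority values onto a small number of hardware queues.

```python
import heapq
from itertools import count


class StrictPriorityQueue:
    """Toy egress queue: highest priority first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def enqueue(self, priority, frame):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), frame))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


# Hypothetical mix of traffic arriving at one port.
port = StrictPriorityQueue()
port.enqueue(0, "file transfer chunk")
port.enqueue(6, "drive setpoint")
port.enqueue(0, "web page")
port.enqueue(6, "I/O update")
sent = [port.dequeue() for _ in range(len(port))]
```

Priority alone reorders the queue but cannot interrupt a frame that is already on the wire, which is why scheduled windows are layered on top for the tightest timing requirements.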

Together, these technologies create Time-Sensitive Networking (TSN): standard Ethernet that can provide deterministic performance when needed while remaining completely compatible with standard IT infrastructure.

The Raw Reality: Why Your Network Still Breaks at 2 AM

Here's what the vendors don't tell you: even with all this technology, industrial networks are more complex than ever. And complexity is the enemy of reliability.

The spanning tree disaster at the dairy plant? That was a perfect example. Two different switch lines running different flavors of a supposedly "standard" protocol. Documentation that didn't explain the interaction. No logging to track down the problem when it occurred. And a mismatch that only manifested under specific conditions that happened to occur during off-shifts.

This is the reality of modern industrial networking: You're running deterministic protocols on non-deterministic infrastructure, managed by IT departments that understand enterprise requirements but not operational constraints, using equipment from vendors who sometimes interpret standards differently, in facilities where a single configuration mistake can shut down million-dollar production lines.

The technology works. TSN is real, EtherNet/IP is proven, and properly designed networks can provide both determinism and connectivity. But the gap between "the technology works" and "your network works reliably" is filled with human decisions, vendor implementations, configuration management, and organizational politics.

That's why networks break at 2 AM on Sundays. Not because the technology is bad, but because we've created systems that require everything to work perfectly together, all the time, in environments where perfect is impossible and "good enough" isn't good enough.

The next chapters will show you how to build networks that account for this reality: networks that provide the determinism your operations need while being robust enough to survive contact with the real world.

But first, you need to understand that network reliability isn't just a technical problem. It's a human problem, an organizational problem, and a business problem. The most elegant TSN implementation in the world won't save you if your IT department doesn't understand why determinism matters, or if your vendor integrates switches with incompatible spanning tree implementations.

The goal isn't perfect networks. The goal is networks that fail gracefully, recover quickly, and give you the information you need to prevent problems before they become 2 AM emergencies.

Because in industrial networking, paranoia isn't a bug, it's a feature.

River Caudle · river@riverman.io · Houston, Texas