How Complex Operations Benefit From Process Variable Monitoring

Widespread power outages often dominate headlines, especially when millions are left in the dark and a root cause isn’t immediately clear. These incidents invite speculation: was it a technical malfunction, an operational lapse, a cyberattack, sabotage, or even a weather-related anomaly? The fact that such a wide range of causes is routinely investigated highlights just how large, complex and fragile today’s energy infrastructure is. While pinpointing the cause is crucial to preventing a recurrence, these disruptions also spark a broader conversation: how can grid operators better anticipate, detect and mitigate anomalies before they escalate into major service interruptions?

Keeping the Global Lights On: Wide Area Synchronous Grids  

In most parts of the world today, electricity is generated, transmitted, distributed and consumed through what are called “wide area synchronous grids” (WASGs): interconnected grids that cover a state, a country or an entire region. These harmonized grids provide several advantages: the more interconnected they are, the easier it is to trade electricity, balance load, pool resources and improve resiliency. This interconnectedness, together with built-in fail-safes, also makes them more resilient than isolated power grids such as North Korea’s, which is known for repeated outages.

Synchronous grids operate on alternating current (AC), and all participants must run at the same frequency, measured in hertz (Hz). The two common frequencies around the world are 50 Hz (used, for example, in the European Union) and 60 Hz (used in the U.S.). Inside a given WASG, the frequency throughout all parts of the system must stay within a tight tolerance of the nominal value. Whenever one part of the system needs more power, it draws more energy and pulls the frequency below nominal. Conversely, whenever demand eases, the frequency rises. Many parts of the WASG act automatically (or semiautomatically) to account for fluctuating demand. These supply and demand fluctuations translate into price fluctuations in energy markets such as the European Energy Exchange (EEX), which operates as a platform for buying and selling energy and related commodities.
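
To make the supply-and-demand relationship concrete, here is a minimal sketch in Python of how a power imbalance moves grid frequency and how primary frequency response pulls it back. It assumes a heavily simplified swing-equation model; the inertia and droop constants, the 50 Hz base and the 300 MW demand jump are all illustrative assumptions, not values from any real grid.

```python
# Minimal sketch of how a supply/demand imbalance moves grid frequency.
# Heavily simplified swing-equation model; all constants below are
# illustrative assumptions, not values from any real grid.

NOMINAL_HZ = 50.0       # e.g., a European-style grid
INERTIA_S = 6.0         # assumed aggregate inertia constant (seconds)
DROOP_GAIN = 0.2        # assumed primary response (per-unit MW per Hz)

def frequency_step(freq_hz: float, generation_mw: float, demand_mw: float,
                   base_mw: float, dt: float = 1.0) -> float:
    """Advance frequency one time step given the current power imbalance."""
    imbalance_pu = (generation_mw - demand_mw) / base_mw
    # Surplus generation pushes frequency up; excess demand pulls it down.
    return freq_hz + imbalance_pu * NOMINAL_HZ / (2 * INERTIA_S) * dt

def primary_response_mw(freq_hz: float, base_mw: float) -> float:
    """Governors raise output when frequency sags, and lower it when it rises."""
    return -DROOP_GAIN * (freq_hz - NOMINAL_HZ) * base_mw

freq, base = NOMINAL_HZ, 10_000.0    # 10 GW base, illustrative
demand = 10_300.0                    # sudden 300 MW demand jump
for _ in range(10):
    generation = 10_000.0 + primary_response_mw(freq, base)
    freq = frequency_step(freq, generation, demand, base)
    print(f"{freq:.3f} Hz")          # sags below 50 Hz, then stabilizes
```

In this toy model the frequency dips when demand jumps and settles just below nominal once governors have raised generation, which is the same balancing act markets and operators perform continuously at scale.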

Today’s electricity grids may also include wind turbines and solar panels which, while valuable, can’t easily tune their production (and thus influence the frequency). In contrast, a traditional hydroelectric power plant can adjust its output by bringing generator groups on- and offline and by regulating its intake of water.

The energy grid is in constant flux: homes, businesses and industries turning devices on and off create continual surges and drops in power demand. At the same time, shifting weather patterns, drifting clouds, storms and changing sunlight levels cause renewable energy sources like solar and wind to fluctuate unpredictably. Aligning nuclear, solar, hydro, wind and traditional sources precisely with demand is a delicate, real-time orchestration. Maintaining this intricate equilibrium keeps the grid frequency at its nominal value, safeguarding grid stability and uninterrupted energy.

Islanding: How the WASG Seeks to Maintain Stability

An important concept in WASG operations is “islanding,” a condition where a portion of the grid becomes electrically isolated from the main grid but continues to be energized, forming a self-sufficient island. When intentional, such as to perform maintenance or another planned operation, this can be a good thing. Unintentional islanding, however, can wreak havoc on power grids, especially if undetected. It often results from a series of events touched off by a significant disturbance that overwhelms the grid's ability to maintain stability.

WASGs are marvels that must continuously balance power generation and consumption. It’s frankly amazing how well they do this most of the time, but a variety of disturbances can challenge these efforts and cause imbalances in supply, demand or frequency. An abrupt load imbalance can cause the all-important AC frequency in the affected area to deviate from its nominal value. A large enough frequency swing can trigger automatic protection systems, which seek to prevent damage by disconnecting generators and may thereby cause unintentional islanding. A similar reaction can occur when voltage levels fluctuate too much. In each of these scenarios, an islanded area can quickly degrade into a blackout area.
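
The automatic protection described above can be pictured as simple threshold logic on the measured frequency. The sketch below is an illustration only: the 49 Hz and 51 Hz trip limits and the three-sample delay are assumptions chosen for the example, not real relay settings.

```python
# Illustrative under/over-frequency trip logic. Thresholds and timing are
# assumptions for the example, not real protection settings.

NOMINAL_HZ = 50.0
UNDER_FREQ_TRIP = 49.0    # assumed under-frequency trip threshold
OVER_FREQ_TRIP = 51.0     # assumed over-frequency trip threshold
TRIP_DELAY_SAMPLES = 3    # require a sustained excursion before tripping

def should_trip(freq_samples: list[float]) -> bool:
    """Trip if the last few samples all sit outside the safe frequency band."""
    recent = freq_samples[-TRIP_DELAY_SAMPLES:]
    return len(recent) == TRIP_DELAY_SAMPLES and all(
        f < UNDER_FREQ_TRIP or f > OVER_FREQ_TRIP for f in recent
    )

# A sustained sag below 49 Hz disconnects the generator, which is exactly
# the kind of automatic action that can leave a region islanded.
print(should_trip([49.9, 49.4, 48.9, 48.8, 48.7]))  # True: sustained excursion
print(should_trip([49.9, 49.4, 48.9, 49.2, 49.5]))  # False: frequency recovered
```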

Monitoring for Anomalies with Process Variable Monitoring

Islanding is the WASG’s autonomous attempt to maintain stability (once load balancing and other automatic measures have failed) by isolating the problematic area. Avoiding this reaction requires operators to know about the conditions that lead to islanding before those conditions trigger the autonomous response. In other words, power grid operators need to monitor the entire grid to detect threats and anomalies outside of defined parameters, including process variable spikes that may touch off cascading failures.

Process variable monitoring gets to the heart of what operational technology (OT) engineers mean when they tell IT professionals that “OT is different.” Industrial control systems rely on OT (and increasingly Internet of Things, or IoT) devices to control physical processes. Depending on complexity, a physical industrial process may have tens or hundreds of thousands of process variables controlling flow, pressure, temperature and level, each one configurable. To ensure system security, reliability and high availability, you can’t just monitor devices and network communications, as you do with IT. You must also monitor the physical processes themselves, by watching for anomalous readings in their variables.
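
As a rough illustration of what monitoring the variables themselves looks like, the sketch below checks individual readings against each variable’s defined operating range. The tag names, units and limits are hypothetical.

```python
# Minimal sketch of per-variable range checking across many process
# variables (tags). Tag names, units and limits are hypothetical.

from dataclasses import dataclass

@dataclass
class ProcessVariable:
    tag: str        # e.g., "SUBSTATION_7.BUS_A.FREQ"
    unit: str
    low: float      # engineering low limit
    high: float     # engineering high limit

def check_reading(pv: ProcessVariable, value: float) -> str | None:
    """Return an alert string when a reading leaves its defined range."""
    if value < pv.low or value > pv.high:
        return f"ANOMALY {pv.tag}: {value} {pv.unit} outside [{pv.low}, {pv.high}]"
    return None

pvs = [
    ProcessVariable("SUBSTATION_7.BUS_A.FREQ", "Hz", 49.8, 50.2),
    ProcessVariable("PLANT_3.TURBINE_1.PRESSURE", "bar", 10.0, 85.0),
]
readings = {"SUBSTATION_7.BUS_A.FREQ": 49.3, "PLANT_3.TURBINE_1.PRESSURE": 42.0}
for pv in pvs:
    alert = check_reading(pv, readings[pv.tag])
    if alert:
        print(alert)   # only the out-of-range frequency reading is flagged
```

A real deployment would apply this kind of check, and much richer ones, across thousands of tags at once; the point is simply that the readings, not just the devices, are what get monitored.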

Now consider this: a WASG is essentially a massive industrial control system. For instance, an alarm in a control center or substation might be triggered if measured frequency values fall outside safe operating thresholds, prompting immediate islanding to preserve the resiliency of the rest of the connected grid. Until investigated, the cause could be a malicious actor tampering with a value, operator error, or a natural event such as lightning, tree branches on lines or atmospheric conditions. In any case, the threat is real.

Behavior-based Anomaly Detection in Process Variable Analysis

To minimize both cyber and operational risk, industrial environments need comprehensive risk monitoring that combines rules-based threat detection with behavior-based anomaly detection. While rule-based (or signature-based) methods detect known threats, behavior-based detection is the only way to catch operational anomalies as well as unknown threats, including zero-days.
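
The difference between the two approaches can be sketched in a few lines. In the example below, a rule-based check looks for a known-bad condition (a value outside a fixed, pre-defined range), while a behavior-based check compares each new reading against that variable’s own recent baseline. The thresholds, the baseline window and the readings are illustrative assumptions.

```python
# Sketch contrasting rule-based and behavior-based detection on a stream
# of process variable readings. All parameters are illustrative.

from statistics import mean, stdev

def rule_based_alert(value: float, low: float = 49.5, high: float = 50.5) -> bool:
    """Fires only on a known-bad condition: a value outside a fixed range."""
    return value < low or value > high

def behavior_based_alert(history: list[float], value: float, k: float = 3.0) -> bool:
    """Fires when a reading deviates sharply from the variable's own baseline."""
    if len(history) < 10:
        return False   # not enough data yet to form a baseline
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(value - mu) > k * sigma

history = [50.00, 50.01, 49.99, 50.02, 50.00, 49.98, 50.01, 50.00, 49.99, 50.02]
print(rule_based_alert(50.1))               # False: still inside the fixed range
print(behavior_based_alert(history, 50.1))  # True: far outside this tag's baseline
```

In this example the rule never fires because 50.1 Hz is still inside the fixed range, yet the behavioral check flags it as a sharp departure from the variable’s normal pattern, which is exactly the gap behavior-based detection is meant to close.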

Coming Back Online: Another Intricate Dance

Coming back from a blackout, or even reconnecting after an islanding event, is a slow, complex, step-by-step process that requires teams with multiple advanced skillsets to collaborate on dangerous, intricate procedures. During this phase, operators must maintain synchronization between islands, balance load and generation, manage circuit breakers and carry out possibly thousands of other grid control operations. Restarting a large electric grid requires large generating facilities to provide the initial boost of energy and then regulate the supply efficiently. All hands are usually on deck, with everyone focused on recovery efforts, leveraging the tools, data and processes they have on hand.

During a crisis or incident, while the root cause is still unknown, all factors are usually considered in the initial investigation. In today's hyper-connected landscape, determining whether cyber, natural events or something else contributed to the disruption is often the first challenge to solve. Knowing at a high level whether an outage in the energy grid, on a factory floor, in a water treatment facility or in a transportation system can be traced to cyber helps isolate and identify the problem. To eliminate cyber as a root cause, organizations must quickly demonstrate that all of their systems were functioning normally, without anomalies, and that no cyber intrusions had occurred. That is not the time to begin deploying tools or rushing projects to start monitoring; it’s game time.

Organizations that have their security solutions fully operationalized ahead of an incident are better equipped to accelerate recovery efforts. That preparation typically includes the following steps:

  1. Understand the operational process that’s happening
  2. Understand how cyber plays a role in that process
  3. Develop a detailed asset inventory of all cyber assets
  4. Fully map all network communications between all assets
  5. Prioritize network anomalies, assess risks, investigate
  6. Monitor the cyber layer (wired and wireless networks, endpoints) for process parameters
  7. Identify and map all process parameters to assets and to network communications (see the sketch after this list)
  8. Understand the criticalities of the identified processes
  9. Tune the monitoring and alerting for the anomalies identified within the process variables
  10. Continue to monitor for anomalies at all levels, integrating findings with operations and cybersecurity teams
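
As a rough illustration of step 7, the sketch below shows one way the mapping between process parameters, assets and network communications might be represented. The asset names, IP addresses and the IEC 60870-5-104 conversation are hypothetical examples, not a prescription for how any particular product models this.

```python
# Hypothetical sketch of mapping a process parameter to the asset that
# produces it and the network conversations that carry it (step 7).

from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    ip: str
    role: str                      # e.g., "RTU", "PLC", "SCADA server"

@dataclass
class ParameterMapping:
    parameter: str                 # process variable tag
    source: Asset                  # asset that measures or controls it
    conversations: list[tuple[str, str, str]] = field(default_factory=list)
    # each conversation: (src_ip, dst_ip, protocol)

rtu = Asset("SUBSTATION_7_RTU", "10.20.0.15", "RTU")
scada = Asset("CONTROL_CENTER_SCADA", "10.10.0.5", "SCADA server")

mapping = ParameterMapping(
    parameter="SUBSTATION_7.BUS_A.FREQ",
    source=rtu,
    conversations=[(rtu.ip, scada.ip, "IEC 60870-5-104")],
)

# With this mapping in place, an anomalous frequency reading can be tied
# immediately to a device and to the traffic that should (or should not)
# be carrying it.
print(mapping.parameter, "->", mapping.source.name, mapping.conversations)
```

Maintaining this kind of mapping is what lets the later steps, prioritizing anomalies, tuning alerts and integrating findings across operations and security teams, happen quickly when an incident is underway.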