Chaos Grid
Future
The grid of the future will be an increasingly complex entity, as discussed in the previous post. Numerous entities are responsible for parts of the functionality, with a growing number of startups and new players controlling assets. This increases the likelihood of errors creeping in. No one has a full understanding of the interdependencies and the resulting behaviour, and the interdependencies keep changing daily as consumers pick offers from different players.
At the same time the grid needs to stay in balance at all times: equal amounts of energy must be generated and consumed.
Figure. Energy system of the near future
All of this added complexity requires a lot of additional software from a wide range of actors from all over the world. Add to the mix the fact that complexity is likely to increase rather than decrease in the future, as new technologies mature and decentralised production grows.
The most critical layer will be the national grid. Its complexity will increase fastest, as it is the central entity providing services and information to all the parties in the grid, and it is responsible for the overall upkeep of the whole.
All software contains errors. The more software there is, the more latent bugs reside in the system. These manifest themselves when rare events happen: the first event reveals some latent error, which leads to a cascade. When thinking in terms of decades, such rare sequences are almost inevitable. The system needs to prepare for the unknown, and we need to think of ways of protecting ourselves.
One approach is certainly that, over time, the energy system develops into a fleet of independent energy islands that can generate all of their energy needs (at least for a limited time) with a mix of technologies like geothermal, small modular nuclear, wind and solar. These energy cells will trade with each other through the grid and purchase energy from the grid when the price is low enough. At minimum they should have enough storage to outlast outages a few days long.
Big power plants and the national grid will still exist, mainly to provide power to large-scale industrial users: mines, mills for aluminum, steel and other metals, manufacturing plants (car factories and the like), chemical factories, concrete production, railways and so on. They will provide the base layer of energy generation. The grid is also used to transfer excess energy to large energy stores, whether these be water pumped to higher ground or electricity turned into ammonia, synthetic fuels or hydrogen (the flavour of the month right now). Ammonia and synthetic fuels can in turn be used as fuel for combustion engines in vessels, airplanes, lorries, cars and so on.
During the transition period the energy system will be increasingly complex and hard to manage. This may lead to an increase in outages or at least extreme swings in hourly rates.
To manage this transition, one can take advice from cloud data centres, where everything fails all the time yet services for customers remain up.
Let’s examine.
Complexity Management in the Cloud
When analysing major outages in cloud services after a data centre collapses, they tend to start with one small error that has an unexpected effect on some latent error, which causes a series of new, more serious errors, which in turn trigger further latent errors until the system goes pear-shaped.
Often the first corrective actions, taken in near panic, make the situation even worse by triggering further previously undetected small errors that can finally cause a total breakdown. The resulting error state can last for days before the service is restored, sometimes with permanent loss of customer data. These kinds of unexpected consequences are a fact of life in complex systems.
To live with these realities, one major cloud-based service, Netflix, originally developed a way of thinking called chaos engineering (now renamed resilience engineering, because descriptive names are out of fashion).
Chaos Engineering
Software in the cloud runs on cheap, unreliable servers. Reliability is built at the software level, and it can be built up in many ways. The most common is horizontal scaling, where multiple processes of the same service run on multiple machines. Should any machine fail, the processes on the remaining machines continue serving customers. To add further reliability, services can run in multiple data centres (DCs) to protect against the loss of an entire DC. The core concept is replication.
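To make the replication idea concrete, here is a minimal sketch of client-side failover across replicas. The replica names and the request callable are illustrative placeholders, not any particular provider's API.

```python
import random

# Minimal sketch of replication-based failover; names are hypothetical.
REPLICAS = ["dc1-host1", "dc1-host2", "dc2-host1"]  # same service on several machines and DCs

class AllReplicasFailed(Exception):
    """Raised when no replica of the service could answer."""

def call_with_failover(request, replicas=REPLICAS):
    """Try replicas in random order and return the first successful response."""
    for host in random.sample(replicas, len(replicas)):
        try:
            return request(host)   # any exception counts as a failed replica
        except Exception:
            continue               # move on to the next machine
    raise AllReplicasFailed("no replica answered")
```

As long as at least one machine in one data centre answers, the customer never notices the failures underneath.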
At the software level, a so-called micro-services architecture emerged to speed up development. Large teams have been split into smaller teams of 6-8 people, each responsible for a well-defined chunk of the functionality with a well-defined interface to other teams. Each team is self-organising and has self-determination. They are responsible for design, implementation and keeping their thingy up and working, usually also for its sales and marketing.
The end result is that previously monolithic services that operated on expensive, fault-tolerant and reliable hardware are broken into a multitude of independent parts running on a fleet of cheap but less reliable servers. Making changes to a small, well-defined part is easier than changing a large body of code, and upgrading is also much easier and can therefore be done more frequently. The frequent small changes allow teams to quickly adapt to new information about customer needs. These micro-services depend on each other in complex ways.
Drawing the interdependency graph reveals the true complexity of what is sometimes called the "death star" pattern.
https://twitter.com/codingfabian/status/543383413177454592
Figure. Dependency graph between micro-services.
Understanding such a graph in detail is not possible. Since different teams have different people with different backgrounds (junior and senior), it is impossible to be sure that everyone follows commonly agreed principles, and even seasoned professionals make occasional mistakes. Making mistakes and learning from them is how you become a star in any field, after all.
For an Internet-based service provider, the service is the only source of revenue that keeps the company afloat. If it fails for long enough, the company will lose its customers and go out of business.
Since the underlying hardware components fail all the time, the software components regularly contain errors, and the resulting interdependency graph is too complex for anyone to understand, the only practical alternative left is to observe the system's behaviour, spot where the problem is, inform the responsible team and roll back to the previous working version until a fixed new version is available, preferably in an automated way.
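A hedged sketch of what such an automated observe-and-roll-back loop might look like follows; the hooks get_error_rate, deploy and notify_team are stand-ins for whatever monitoring and deployment tooling is actually in place, and the threshold is an arbitrary example value.

```python
import time

ERROR_THRESHOLD = 0.05   # assumed: roll back when more than 5% of requests fail
CHECK_INTERVAL_S = 60

def watch_and_roll_back(service, last_good_version,
                        get_error_rate, deploy, notify_team):
    """Observe a service; on sustained errors, alert the owning team and roll back."""
    while True:
        rate = get_error_rate(service)           # e.g. failed requests / total requests
        if rate > ERROR_THRESHOLD:
            notify_team(service, rate)           # tell the responsible team what was observed
            deploy(service, last_good_version)   # automated rollback to the known-good version
            return
        time.sleep(CHECK_INTERVAL_S)
```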
Some errors are very rare, but they may have very severe consequences. Latent errors that depend on a failure somewhere else are normally not caught by ordinary testing (unit and integration tests). The only way to detect them, and to ensure that services will recover, is to test them in production.
The way to smoke these out sounds counterintuitive: there needs to be a system in place that generates errors randomly, on purpose. A system that goes around causing havoc, picking a random, innocent micro-service and stopping it, taking messages between services and dropping them coldly on the floor, or delaying them unnecessarily.
The resulting behaviour of the system is observed: are failed services restarted, are lost messages resent, and do client services handle the errors gracefully? If the service does not recover, we have found a weak spot and can inform the responsible team.
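A minimal sketch of one round of such deliberate havoc, in the spirit of Chaos Monkey, is shown below. The hooks list_services, stop_service and the message_bus object are hypothetical handles into the platform's orchestration and messaging layers, and the probabilities are example values.

```python
import random

def chaos_round(list_services, stop_service, message_bus,
                kill_probability=0.01, drop_probability=0.001, max_delay_s=5.0):
    """One pass of randomised havoc: maybe kill a service, maybe mangle messages."""
    if random.random() < kill_probability:
        victim = random.choice(list_services())
        stop_service(victim)                     # pick an innocent micro-service and stop it

    for msg in message_bus.pending():
        roll = random.random()
        if roll < drop_probability:
            message_bus.drop(msg)                # drop the message coldly on the floor
        elif roll < 2 * drop_probability:
            message_bus.delay(msg, random.uniform(0.0, max_delay_s))  # or just delay it
```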
One such toolset is Netflix's Simian Army, with tools like Chaos Monkey and Chaos Gorilla.
Note also that in agile development, faults are not blamed on an individual; the whole team is responsible. Most commonly at least two people need to read through and test each change before it is approved to staging, and after successful staging it goes to production (mileage varies between teams), but several people have always read and approved each change. All deliveries are team efforts.
Chaos Grid to the Rescue
As you can see, the complex cloud and the complex grid have similar characteristics, even though the fault types are different. And the method for making sure that services are resistant to errors is simply to test them in production, all the time. This testing needs to be randomised.
The grid is physical and software is virtual, so how could benevolent chaos help the grid?
Here is one idea. If the grid operator wants to take service availability seriously, they need to start testing the operational grid all the time. This means randomly tripping lines here and there and observing how the grid behaves. These deliberate line cuts need to happen at different hierarchy levels: in the national grid, at the distribution operator level, and perhaps even at the final consumer level, although the last is debatable.
If the network is reliable, it should stay up, or at least recover fast. Any failures found this way are actually good news, as they can now be fixed, and this prevents a large-scale outage in the future. This way the grid stays fit rather than decaying into a fragile grid as complexity and intermittent generation are added to it and reliable generation is removed.
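As a thought experiment, a randomised line-trip test might look like the sketch below. The hierarchy levels, element names and the trip, restore and is_supplied hooks into the operator's control systems are all illustrative assumptions, not a description of any real SCADA interface.

```python
import random

TEST_ELEMENTS = {
    "national grid":         ["tx-line-A", "tx-line-B"],
    "distribution operator": ["feeder-12", "feeder-34"],
    # the final consumer level is left out here, as noted above it is debatable
}

def run_line_trip_test(trip, restore, is_supplied, observe_seconds=300):
    """Trip one randomly chosen element and check whether customers stay supplied."""
    level = random.choice(list(TEST_ELEMENTS))
    element = random.choice(TEST_ELEMENTS[level])
    trip(element)                                 # deliberate, scheduled line cut
    try:
        recovered = is_supplied(observe_seconds)  # did the grid reroute, island or recover in time?
    finally:
        restore(element)                          # always put the element back in service
    return {"level": level, "element": element, "recovered": recovered}
```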
Testing is naturally done at off-peak times, not in the middle of winter (or whatever your peak is). Optionally, one month each year is selected for randomised resilience testing so people can be mentally prepared for it. Since not all users will be ready for this, those who are not need to be prepared to pay extra for a lower-quality service. In other words, if outages ever happen, resilient customers can continue in island mode while non-resilient customers have their power cut, and for this the non-resilient customers are charged a premium.
Such testing can only start once the existing network has been analysed and the necessary improvements have been made to make it resilient against failures. The idea behind testing is not to humiliate anyone but to ensure that future changes do not hurt the fault tolerance of the system.
In addition to testing the hardware aspects of the grid, the added software needs similar testing. In the vital infrastructure we will all depend on in the future, Chaos Monkey and Chaos Gorilla should have a field day taking services down, removing messages that parties send to each other, delaying messages and so on. If we do not let them run amok, bad things will happen. This is like exposing the grid to stress to make it stronger. Exercise makes the grid grow muscles.
Be Your Own Worst Enemy
As the grid is such a central piece of core infrastructure, it needs more rigorous testing than other domains, in all areas of its operation.
Some software mistakes made during development can create security holes: bugs that enable adversaries to penetrate services or internal networks and access critical product, market or financial information. Companies cannot rely entirely on external parties to secure their core assets. Even though there are good tools such as virus scanners, firewalls and intrusion detection systems, their presence does not help when they fail. You are ultimately in charge.
Logically, there needs to be a similar practice to chaos engineering on the security side: a comprehensive set of tools and people that constantly try to break into the systems, detect problems and flag them.
A company might set up a complete honeypot system: a replica of some service with relaxed firewall rules, to see whether anyone gains access during such a misconfiguration, and then observe the intruders' behaviour to see what type of information they are after.
Or they can actively let a penetration-testing company try to break in. These tests can take many forms:
Drones that fly close to the facility premises, try to eavesdrop on Wi-Fi traffic and get in via it. Penetration testers might also install rogue cell base stations (so-called Stingray base stations) that try to force phones and other user equipment to disable their encryption, making eavesdropping possible.
Repair crews: people in yellow vests boldly walking in as repair persons to see which meeting rooms, floors or server cabinets they can poke into, then plugging a laptop into the nearest Ethernet port and seeing what they can discover.
Specially prepared USB sticks or other USB-based attack devices left at random places in company meeting rooms, to see if someone plugs them in and gets infected.
Phishing emails sent to employees.
Eavesdropping at a nearby café or restaurant to see whether company employees reveal any sensitive information.
Why go to all this trouble? Core infrastructure is already under constant attack from state-level players and, to a lesser degree, criminals. To understand the level of exposure and prepare for it, organisations that want to stay afloat need to start testing themselves, slowly driving up employee awareness and fixing issues as they pop up.
The wall calendar also plays a role in these activities. The best time for rogue play is New Year's Eve, when the regular staff are marinating their brains and contemplating the experiences of the past year with large quantities of their beverage of preference, and involuntary junior temps are manning the mission. That is when the pros launch attacks to ring in the new year, and these attacks travel around the world, dutifully following the clock. Nothing better than starting the new year with a nice little security breach after spending the night out on the town. Nothing like putting what's left of the little grey cells into a working mood at the first dawn of the new year.
Providing this type of testing as a service is also a big opportunity for service companies. Selling bad mood and misery as a service is one of the sunshine businesses of the future.
Summary
The electric grid is becoming a complex network with a multitude of generation technologies, including a large number of independent small-scale producers who make independent decisions. This can make the network prone to unpredictable emergent behaviour. Ways to mitigate this include price signals, demand response systems and chaos engineering.