Addendum: When society’s KPIs are not met. Alarms and Tickets
Concept and alarm handling process
Let’s pause the regular programming and return to a previous stream of posts on public-sector services.
There we introduced the concept of a Safety Investigation Authority: a team that looks into problems when things go badly at the society level, tries to find the root causes, and gives recommendations on how to avoid them.
On second thought, there are a lot of problems that are perhaps not life-threatening but still waste money or cause harm in other ways. And these are caused by political decisions or inaction.
Perhaps there should be a way of continuously monitoring things and a way of fixing them when they go pear-shaped?
Earlier, the idea was introduced of defining key performance indicators (KPIs) for each proposal. After implementation, the KPIs would be monitored and published.
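To make this concrete, here is a minimal sketch of what a published KPI record attached to a proposal could look like; the fields, names and numbers are illustrative assumptions, not a fixed format.
```python
# A minimal sketch (hypothetical names and values) of a KPI record
# attached to a policy proposal, to be published and reviewed later.
from dataclasses import dataclass
from datetime import date

@dataclass
class PolicyKPI:
    proposal: str        # which proposal this KPI belongs to
    name: str            # what is measured
    unit: str            # e.g. "persons", "EUR million", "%"
    target: float        # value promised before the change
    review_date: date    # when the outcome is checked and published

kpi = PolicyKPI(
    proposal="Example employment reform",
    name="new jobs created",
    unit="persons",
    target=10_000,
    review_date=date(2027, 1, 1),
)
print(f"{kpi.proposal}: {kpi.name} target {kpi.target} {kpi.unit} by {kpi.review_date}")
```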
Today, most often some estimates are made before a policy change, the opposition criticises the numbers as unrealistic, and finally the change is introduced anyway. The estimates promise many good things, but rarely is there any follow-up. And if the media write about subpar results, that does not lead to much either.
But what should happen if the targets are not met? There are several alternatives.
In industry, equipment failures tend to raise an alarm. Alarms are classified into severity levels such as information, warning, minor, major and critical.
Faults are visualised for a group of people monitoring the system. In this case “the system” would be the whole country, region, city, community or project that is our focus. The monitoring team uses a variety of troubleshooting tools to find out what is causing the issue and whether it can be fixed easily.
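As a rough illustration, a monitoring system could map how far a KPI falls short of its target onto the severity levels above; the thresholds below are invented for the sake of the example.
```python
# Sketch: map the relative shortfall of a KPI against its target to the
# alarm severities mentioned above. The thresholds are invented examples.
SEVERITIES = ["information", "warning", "minor", "major", "critical"]

def classify(actual: float, target: float) -> str:
    """Return an alarm severity based on how far 'actual' falls short of 'target'."""
    if target == 0:
        return "information"
    shortfall = max(0.0, (target - actual) / abs(target))  # 0.0 means target met
    if shortfall < 0.05:
        return "information"
    if shortfall < 0.15:
        return "warning"
    if shortfall < 0.30:
        return "minor"
    if shortfall < 0.50:
        return "major"
    return "critical"

print(classify(actual=6_500, target=10_000))   # 35% shortfall -> "major"
```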
New people who join the team have detailed instructions on how to do the checks and perform corrective actions. These documents are called “runbooks”, and today most of the actions they contain are already automated.
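In its simplest form, a runbook could be represented as an ordered list of checks, each with an automated corrective action attached; the checks and fixes below are hypothetical placeholders.
```python
# Sketch of a runbook as an ordered list of (check, corrective action) pairs.
# All names, checks and actions here are hypothetical placeholders.
from typing import Callable, NamedTuple

class RunbookStep(NamedTuple):
    description: str
    check: Callable[[], bool]               # returns True if this step found the cause
    corrective_action: Callable[[], None]   # automated fix for that cause

def run(runbook: list[RunbookStep]) -> bool:
    """Walk the runbook top to bottom; stop at the first step whose check matches."""
    for step in runbook:
        if step.check():
            print(f"Cause found: {step.description}, applying fix")
            step.corrective_action()
            return True
    return False  # no quick fix found -> escalate to the next level

runbook = [
    RunbookStep("data feed missing", lambda: False, lambda: print("restart feed")),
    RunbookStep("stale statistics", lambda: True, lambda: print("refresh statistics")),
]
run(runbook)
```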
If there is no quick fix, the task of analysis and resolution is moved to a different team; typically, a trouble ticket is created. This team has more know-how and can use a more detailed set of tools. They may, for example, have a simulation environment where they try to replicate the problem using real data or by generating synthetic data.
Often there are several upper-level teams handling issues. If the second level cannot fix an issue, they pass the ticket to the third level.
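A hedged sketch of the escalation mechanics: a ticket carries its history with it as it moves up through the tiers. The tier names and the logic are invented for illustration.
```python
# Minimal sketch of a ticket escalating through support tiers.
# Tier names and the resolution logic are invented for illustration.
from dataclasses import dataclass, field

TIERS = ["first line", "second level", "third level"]

@dataclass
class Ticket:
    issue: str
    tier: int = 0
    history: list[str] = field(default_factory=list)

    def escalate(self) -> None:
        """Pass the ticket to the next tier, keeping its history with it."""
        self.history.append(f"unresolved at {TIERS[self.tier]}")
        self.tier = min(self.tier + 1, len(TIERS) - 1)

ticket = Ticket(issue="KPI 'new jobs created' far below target")
ticket.escalate()          # the first line had no quick fix
print(ticket.tier, TIERS[ticket.tier], ticket.history)
```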
Finally, when a solution is found, a rollout follows. Before the final rollout, the correction is tested in a lab and then in a small field trial. For non-critical issues, the troubleshooting team may have to create a change request for a separate change board before the rollout can happen; critical fixes may be implemented right away. The actual rollout also starts with only part of the system being updated and is then gradually extended to the whole system.
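The gradual rollout could look roughly like this; the stage sizes and the health check are invented, and in reality the decision to continue would involve people rather than just code.
```python
# Sketch of a staged rollout: apply the correction to a growing share of the
# system and stop if a health check fails. Stages and the check are invented.
STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of the system updated at each stage

def rollout(apply_fix, healthy) -> bool:
    """Roll out gradually; halt (and leave the rest untouched) on a bad health check."""
    for fraction in STAGES:
        apply_fix(fraction)
        if not healthy(fraction):
            print(f"Halting rollout at {fraction:.0%}")
            return False
    return True

ok = rollout(
    apply_fix=lambda f: print(f"fix applied to {f:.0%} of the system"),
    healthy=lambda f: True,
)
print("rollout complete" if ok else "rollout halted")
```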
Throughout, all stakeholders are told that there is a problem, how the correction is progressing, and eventually how the issue got fixed. Issues are archived for later analysis.
This model is sometimes criticised because information is lost whenever tickets bounce around the organisation. We’ve all been pestered with unnecessary clarification questions from some ticket-handling team about things we’ve already told them. This may be due to lack of context, or perhaps more often due to another common mistake: policies requiring an answer within a given service level agreement. People meet their SLAs by asking customers fake questions.
The alternative being proposed is that experts should be embedded with the first line, so that whenever a tough case pops up, the first line can immediately ask someone with more know-how for help. Most of the hard issues are then resolved on the spot, making customers happy, and as an additional benefit the first-line people build their skills and can take on bigger and bigger responsibilities over time.
This is usually not done because management fears the cost. And it may not be suitable in all environments, as the real experts might live in another country.
When dealing with critical issues, the way to respond is to gather the true experts together and let them work on the problem until a resolution is reached.
At the Society Level
So what should happen at the society level when some change in law does not deliver the promised results?
The minimum is to compare the various organisations that made predictions before the change. How accurate were they? Over time it becomes obvious which organisations aim for accuracy and which are swayed by political sensibilities. The average error of each organisation can also be calculated and shown next to its future predictions.
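One simple way to do this is to score each organisation by its mean absolute percentage error across past policy changes; the organisation names and numbers below are made up.
```python
# Sketch: score each forecasting organisation by its mean absolute percentage
# error (MAPE) across past policy changes. All names and numbers are invented.
def mape(predictions: list[float], outcomes: list[float]) -> float:
    """Average of |prediction - outcome| / |outcome| over past policy changes."""
    return sum(abs(p - o) / abs(o) for p, o in zip(predictions, outcomes)) / len(outcomes)

history = {
    "Ministry unit A":  ([10_000, 200.0], [6_500, 150.0]),   # (predictions, outcomes)
    "Research inst. B": ([7_000, 160.0],  [6_500, 150.0]),
}

for org, (pred, outcome) in history.items():
    print(f"{org}: average error {mape(pred, outcome):.0%}")
```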
Prediction-making organisations also need to be given the opportunity to improve their accuracy. If their predictive models have been inaccurate, this is the right moment to analyse why and correct them so that accuracy improves over time.
Of course, units that predicted a policy would fail now have an incentive to sabotage the results so that their prediction comes true. Some mechanism to dampen this incentive needs to be developed.
The feedback process is basically just one more of society’s systems: it can be misused, or it can lead to secondary effects no one foresaw. It needs the same kind of design, simulation and gameplay to get right.
The actual policy that went sideways needs fixing as well. It’s hard to say anything general, as policies and their impacts vary greatly. The whole purpose of the troubleshooting process is to look at how important the failure is and to propose corrective actions. Then the decision is made in the same process as the original one (i.e. inside companies there is some kind of change management board when the scope is limited, but at the society/community level it is best to feed the proposal back into the standard governance “pipe”).