Making errors is part of the job. We don’t know what or when, but we know it will happen. It can happen many times a day or a week, at work as well as in our everyday lives. Some say that the only way to avoid mistakes is to do nothing. I disagree. Sometimes, not taking action is the mistake. And it’s fine.
Errors are part of life, of our learning process, of our evolution. Mutations are errors in DNA replication. Some mutations improve adaptability to the environment. Many profound discoveries and inventions started as mistakes or accidents made by people on other quests (e.g. penicillin, the pacemaker, chocolate chip cookies).
Creating a safe environment for teams, especially in IT, doesn’t mean just running retrospectives and hearing everyone’s voice. Safety is not just about trust, autonomy, giving some slack or time for experimentation. Some aspects play out at the organizational level, while other dynamics play out at the team level.
It is important to establish the correct team culture so that everyone is supportive and comfortable not only when succeeding but also in the middle of critical moments. You need your team at their best, focused and positive, while responding to incidents, restoring systems after failure, reacting to attacks, or working under external pressure or internal difficulties. This doesn’t happen in a day.
You need to take every occasion to coach the team and train them to respond in a productive way without losing time, confidence and trust in each other. It can be that your DevOps engineer, QA, SRE or developer broke the build, lost the latest progress, deleted a configuration, a database or a branch, set the wrong feature flag, didn’t activate a cloud policy, or did any of the other millions of possible actions and operations that, statistically, go wrong every day somewhere in the world.
What I usually teach the team is:
- Pause, Think, Ask.
- Verify whether there really is a problem, or whether the mistake is assuming there is one. Do not necessarily stop at the first answer. Some people exclude the possibility of having an issue just because it is unlikely. Pandemics are also unlikely but they happen. It’s OK to double-check. No need to be paranoid. You will learn in time who in the team and the organization is able to provide reliable answers quickly.
- If there is an issue, share the news. No one is comfortable giving bad news, but guess what, someone has to do it. If the office is on fire, everyone would like to know ASAP. Without alarming too many people, share the information about the incident or the concerning fact with the right people. Depending on the type of suspected incident, the level of risk, and the impacted assets and resources, the people you need to contact may be different: a team lead, a manager, an executive, the IT department, the Scrum Master, the Product Owner, legal, a dedicated incident-report email alias, the office manager, firefighters… No need to inform everyone. Good company policies should define a lightweight, clear, easy procedure to take action, including who should be informed and how. It is important that policies don’t turn into bureaucratic madness where people need to fill in three forms, open service desk tickets, or call fictional people such as Mr. Wolf.
- Depending on the type of issue, the fix may or may not be easy. It could require an automated or manual restore. It could require time and many people, or neither. Sometimes a short chain of fast chat messages among the right people can solve issues in production systems in less than 5 minutes and without too much fuss.
- Once the issue is solved, it is very important to follow up. Run a post-mortem of the incident. How did it happen? What was everyone doing at that moment? What preconditions, assumptions, habits or messages led to the error? Some people are inclined to minimise. Some exaggerate. Some will blame, some will feel judged. It is important, and OK, to admit responsibility. Openness and courage are Scrum values, and transparency is one of the pillars of Scrum; accountability goes hand in hand with them. Not by chance, but on purpose.
- More important than the who, though, are the what, the why and the how. The team needs to think, sweat, work and brainstorm on all possible and even creative ways to make the system or the process more robust, more automated, more resilient. What can we do to minimise a risk or the impact of an error? What monitoring measures can we take? Can we aim to have a self-detecting or semi self-repairing system? Think big, look for experimentation and innovation. Encourage the whole team to fully cooperate in the exercise, including people who may not feel directly involved. How can risks be shared in a better way?
- Stay away from manual approvers, change advisory boards (CABs), bottlenecks, stupid rules.
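As one way to keep the "share the news with the right people" step lightweight, the policy can live in a small lookup table instead of a stack of forms. The sketch below is only an illustration under stated assumptions: the incident kinds, risk levels, contact names and the helpers `contacts_for` and `notification` are all hypothetical, not taken from any real policy or tool.

```python
from dataclasses import dataclass

# Hypothetical routing table: incident kind -> risk level -> who to inform.
# All kinds, levels and contact names are illustrative assumptions.
ROUTING = {
    "broken_build": {
        "low": ["team-lead"],
        "high": ["team-lead", "scrum-master"],
    },
    "data_loss": {
        "low": ["team-lead", "it-department"],
        "high": ["team-lead", "it-department", "manager", "legal"],
    },
}

# Fallback when the table has no entry: tell one person rather than no one.
DEFAULT_CONTACTS = ["team-lead"]


@dataclass(frozen=True)
class Incident:
    kind: str      # e.g. "broken_build"
    risk: str      # "low" or "high"
    summary: str   # one-line description of what was observed


def contacts_for(incident: Incident) -> list[str]:
    """Return the people to inform, without alarming everyone."""
    by_risk = ROUTING.get(incident.kind, {})
    return by_risk.get(incident.risk, DEFAULT_CONTACTS)


def notification(incident: Incident) -> str:
    """Build a short message that names the issue and the recipients."""
    recipients = ", ".join(contacts_for(incident))
    return (
        f"[{incident.risk.upper()}] {incident.kind}: "
        f"{incident.summary} -> notify {recipients}"
    )
```

For example, `notification(Incident("broken_build", "low", "main is red"))` produces `[LOW] broken_build: main is red -> notify team-lead`. An unknown kind of incident falls back to the default contact, so the procedure never blocks on a missing rule.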
The aim is not to spend a fortune in time, money, tools, resources to have perfect systems, perfect days for perfect teams, in perfect organizations.
The goal is to learn, reduce costs and footprints, improve, increase awareness.
After going through post-mortem meetings, as a Scrum Master, Agile Coach, Manager or Facilitator, check on the team. Look for signs.
Did the failure lead to scapegoating or retaliation? Are the messengers “shot” or neglected? Is there any bullying or mocking attitude? Are there excessive or repetitive jokes about someone’s mistake or weakness? Are the people responsible apologizing and asking forgiveness over and over for a previous mistake?
Talk to them. One-to-one and as a team. They need to support each other and move on.
A generative organizational culture, one that is based on high trust and emphasizes information flow, is predictive of software delivery performance and organizational performance in technology. The idea that a culture optimized for information flow predicts good outcomes is based on research by sociologist Dr. Ron Westrum. Westrum’s research covered human factors in system safety, particularly in the context of accidents in technological domains such as aviation and healthcare.
In summary:
- Mistakes happen. More often than we think.
- Encourage bridging and cooperation. Even if someone can solve it alone, they should still pass the ball and share the info. Secrecy, or sweeping issues under the carpet, is poison.
- After the issue is fixed, run a post-mortem. Ask the team not to look only at the past. What improvements can be proposed? Tell them to document findings.
- Move on. No self-flagellation. Genuine reasonable mistakes are part of the business. Risks and responsibilities are shared across the organization.
- Remind people that life doesn’t have to be without mistakes. Some of them bring innovation and evolution.
- The only crime is not learning from mistakes.
- Fear, silence and tension do no good. Provide support, clarity, safety and guidance to the team.
- Lead by example. When you make a mistake, acknowledge it, take responsibility, share it, run a root-cause analysis and seek feedback. Move on better than yesterday, till the next endeavour.
- Be a challenger, a pioneer, an explorer. Sometimes things go smoothly, sometimes they don’t.
- It takes a collaborative effort, a growth mindset, a generative culture and perseverance to be successful.
* * *
I write about organizational patterns, transformational leadership, healthy businesses, high-performing teams, the future of the workplace, culture, mindset, biases and more. My focus is on leading, training, and coaching teams and organizations in improving their agile adoption. Articles are the result of my ideas, studies, reading, research, courses, and learning. The postings on this site and any social profile are my own and do not represent or relate to the postings, strategies, opinions, events, situations of any current or former employer.
This article has been published for the first time on danieledavi.com by the author Daniele Davi’.
© Daniele Davi’, 2021. No part of this article or the materials available through this website may be copied, photocopied, reproduced, translated, distributed, transmitted, displayed, published, broadcast or reduced to any electronic medium, human or machine-readable form, in whole or in part, without prior written consent of the author, Daniele Davi’.