Auditing an engineering organisation

Written on 2021-01-22

History

Auditing is an ancient practice used by bean-counters and bookkeepers since antiquity, to ensure accounts are balanced, and rules are being followed. In some business contexts, for example publicly listed companies, accounting audits are a regular occurrence.

A fun bit of trivia: the role “chancellor of the exchequer” is Norman in origin. A table with a chessboard pattern was used to simplify calculations and budget balancing, like a primitive spreadsheet. The Norman/French word for chessboard is “échiquier”, and so the process of auditing the books was called “exchequer”.

Context

In software engineering and information technology there are several common auditing practices:

The above audits are usually conducted because of a contractual obligation or compliance process. In this post I am writing about a more general, less formal, internal process for getting a grip on an engineering organisation, and writing up findings to improve decision making and focus people’s minds.

Before reading further, I recommend skimming a UK National Audit Office summary report. I find these reports interesting: they often surface observations and insights that one would not expect, and equally often they observe practices that are very common in tech/IT. The NAO usually produces a well-written, concise summary report which can, if nothing else, improve one’s writing ability.

What and why

A general engineering audit could, depending on organisation size, take only a few days, or stretch to many months. Likewise, you could conduct an audit in a very formal and rigid style, or audit an engineering organisation more casually and informally.

I think the following (non-exhaustive) situations are good reasons to audit:

Surfacing problems/risks isn’t the only reason to do an audit. Dispelling uncertainty about whether you’re doing a good job is an equally valid reason. In fact, the main two reasons for not doing an audit are:

  1. you did an audit recently

  2. you have conflicting priorities or no capacity for new work

Why

Let’s individually consider the three scenarios listed above.

Recently joined

Conducting an audit when you have recently joined an engineering organisation is an ideal time to do so, and an audit can serve two purposes:

  1. You are a fresh pair of eyes, and you are well placed to spot anomalies, inefficiencies, or problems that others have accepted due to slow erosion over time

  2. By doing an audit, you will very quickly learn how things are done across the wider organisation, rather than just your team. You may spot opportunities for convergence or situations where you can avoid re-inventing the wheel

Technical debt

If you have running software or hardware somewhere, inevitably you will have made trade-offs or accrued some technical debt. You:

There may be inherent problems with the current solution, software might:

Hardware might:

These are not necessarily immediate problems, and they may never actually affect the service or product. However, it is important to track them, and organisations often have a “risk register” for this purpose. Cynical readers will imagine such registers as spreadsheets where improvements go to languish, people acknowledge risks, and the act of acknowledging them is seen as a mitigation.

Doing an audit allows you to gather (perhaps many) risks and present them together. When compiled together in a novel format which groups and quantifies any risks (more on this later) it can paint a sobering picture that focuses the minds of leadership/management/decision makers. This picture can be less easy to ignore than a periodically updated spreadsheet or wiki page.

Changing the size of the engineering organisation

Probabilistically, it is rare that an organisation’s size remains constant. People get promoted, switch jobs, have children, become ill, perish, etc.

Startups can grow quietly, or rapidly, sometimes achieving 50-100% growth year-over-year. Governments and large public companies can decide to downsize or split departments.

When an organisation decides to grow or shrink, hopefully there is a goal:

Whether this goal can be achieved is an assumption, and depending on the cost of doing so, it can be prudent to attempt to pre-emptively de-risk this assumption with some analysis.

An engineering audit can clarify such an assumption by identifying:

If engineers are firefighting incidents or are not keeping up with a backlog of patches, it would not be prudent to reduce the number of engineers.

If the software development cycle is slow, or capacity to onboard new engineers is low because of high demand for product development then it may be difficult to onboard new engineers. Priorities and delivery deadlines may have to change to ensure new engineers are happy and productive.

An audit may completely contradict an assumption that reducing headcount is an intelligent allocative decision. An audit could equally contradict an assumption that the organisation should increase headcount by 50%: if it takes an engineer 6 months to get up to speed, should you hire 10 people, or hire 3 engineers and spend 3 months improving the onboarding situation?
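To make that trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not a real model of any organisation: it assumes a 12 month horizon, that a new engineer contributes nothing until fully up to speed, and that 3 months of onboarding work cuts the ramp-up time to 2 months. That last figure, and every function and parameter name, is invented for illustration.

```python
def productive_months(hires: int, start_month: int, ramp_up_months: int,
                      horizon_months: int = 12) -> int:
    """Engineer-months of fully productive time within the horizon.

    Simplifying assumption: a new engineer contributes nothing until ramped up.
    """
    productive_per_hire = max(0, horizon_months - start_month - ramp_up_months)
    return hires * productive_per_hire


# Option A: hire 10 engineers now, with the current 6 month ramp-up.
option_a = productive_months(hires=10, start_month=0, ramp_up_months=6)

# Option B: spend 3 months improving onboarding, then hire 3 engineers.
# The improved ramp-up of 2 months is a made-up figure.
option_b = productive_months(hires=3, start_month=3, ramp_up_months=2)

print(f"Option A: {option_a} engineer-months, option B: {option_b} engineer-months")
```

The sketch deliberately ignores salary cost and the time existing engineers spend mentoring new joiners. Those are the inputs an audit can supply, and they may well change which option looks better.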

Alternatively, if your organisation assumes it needs more engineers because of a lack of capability to do X, an audit looking at the learning and development investment and uptake may identify an alternative course of action: ensure engineers have budget and time to participate in learning and development activities.

How to audit

Who is your audience

Before you start, work out who is going to look at your audit results:

How you will communicate with your audience

How you present your results will influence how you proceed:

What data is available to you

If you are in a mature software delivery organisation then you may have access to:

These can all be useful inputs into an audit, answering questions such as:

If you are in a less mature organisation, or in an organisation that has chronically underinvested in good practice, then you may have none of the above. However you still have the people who build and operate the software/hardware/product/service. You can ask your colleagues to:

Depending on the situation, you may wish to use qualitative or quantitative methods:

If you are going to consult with your colleagues during your audit, you can view this consultation as a form of user research. The GOV.UK service manual guidance on user research is illuminating.

If you are sending out a survey, you should ensure that it can be filled out quickly and ergonomically. Asking your participants to download a spreadsheet is not going to engender happy thoughts, and you will be making work for yourself later. You may wish, depending on the content of the survey, to allow for anonymous submission. Some topics to look at include:

Informal 1:1 interviews where attribution is anonymised can produce high quality information (which you should independently verify) at low cost, providing your colleagues trust you. Explaining up front why you are conducting an audit is important: people can fear for their product or job when in fact all you are doing is attempting to provide clarity and reprioritise work to reduce risk. You should take detailed notes or, if you have consent, record the interviews for future reference.

A very informal interview could consist of:

How to write a report

Assuming you are writing a report, you may wish to structure your report with some of the following headings:

Abstract

Imagine you are going through hundreds of documents in an organisation’s shared drive, where there is no information management system or informal hierarchy. The abstract should tell such a reader what is in the report, and why they should read it.

For example:

This report is an informal engineering audit conducted in January 2021. Individuals in the engineering and product disciplines were informally interviewed, and a number of risks were identified. Risks include: lack of new widgets, surplus of old widgets, lack of widget handling automation. Recommendations include: adoption of widget management software, and engineer training.

Introduction

This may not be necessary depending on how the abstract is written. You may wish to introduce the scope of the audit, and the reasons why you decided to audit.

For example:

As a new engineering manager joining the organisation I decided to conduct an audit over a 2 week period so that I could familiarise myself with the organisation’s ways of working and the different product areas. I wanted to understand how engineers who report to me conducted their work, and to work out which areas I would best be placed to help.

Alternatively:

The organisation is, over the next annual period, seeking to expand the engineering organisation by 50%. In order to do this effectively I felt that we needed more information about where we need engineers the most, and to identify any bottlenecks or blockers that would prevent hiring and onboarding so many new engineers.

Methodology

Depending on how you conduct your audit, you may not have much to describe in a methodology section, and any information about the methodology could be added to the introduction.

If you are using any quantitative or statistical methods which require explanation, you should do so in this section.

If you have specific reasons for choosing a specific methodology, ensure that it is written in the report somewhere, preferably in the methodology section.

For example:

The audit was conducted over a 2 week period. I informally interviewed 2 people from every team (5 teams, 10 people total). Before I conducted the interviews I sent out a survey to the engineers, the results of which I used to corroborate the information gathered in interviews. For a statistical summary of the survey results, refer to the appendix.

The questions asked during the interviews included: what are we doing well as an org, what are we doing badly as an org, what product presents the biggest risk to the org.

An informal interview was chosen because of the short timeline and the small number of participants.
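The statistical summary of survey results mentioned in the example above can be very simple. A minimal sketch, assuming answers were collected on a 1 to 5 agreement scale; the question text and the responses are invented for illustration:

```python
from collections import Counter
from statistics import median

# Hypothetical responses: question -> list of scores (1 = strongly disagree,
# 5 = strongly agree).
responses = {
    "I can deploy my changes without help": [4, 5, 3, 4, 2, 5, 4],
    "Our alerts are usually actionable": [2, 1, 3, 2, 4, 2, 1],
}


def summarise(scores: list[int]) -> dict:
    """Return the count, median, and per-score distribution for one question."""
    counts = Counter(scores)
    return {
        "n": len(scores),
        "median": median(scores),
        "distribution": {score: counts.get(score, 0) for score in range(1, 6)},
    }


summary = {question: summarise(scores) for question, scores in responses.items()}
for question, stats in summary.items():
    print(f"{question}: n={stats['n']}, median={stats['median']}, "
          f"distribution={stats['distribution']}")
```

The median and the full distribution are reported here rather than a mean, because agreement scales are ordinal and a handful of extreme answers can make an average misleading.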

Structure of the organisation

Organisations change size and shape over time. For future readers it can be useful to summarise the size/shape of the organisation at the time of the audit. Significant changes in organisational structure may invalidate recommendations made in the audit report. The audit report may have an influence on the shape of the organisation in future, and such detail can be informative for future readers reviewing the report retrospectively.

For example:

The engineering organisation is divided into two teams: the development team, and the operations team. Each team has their own team lead, and each team lead reports directly to the CEO. The development team’s KPI is the cadence at which new features can be released, and the operation team’s KPI is the number of incidents reported by users.

Such detail could provide valuable context for a reader who arrives at the organisation a year or so later, and finds the organisation in a different shape.

Description of processes and practices

Depending on the audience, you may need to spend a considerable amount of time and paragraphs explaining the various processes which people follow. This can be further exacerbated by the tech sector’s penchant for acronyms or buzzwords which may do more to obscure than to clarify.

You may wish to clarify what is meant by commonly overloaded terms such as:

Explaining what the process is designed to do is useful for a reader, so is why the process was adopted (if a reason could be found). What can be especially important is how long a process takes: in the abstract all code should be reviewed by another peer, but if it takes 2 days to review a one line change then the value of code review may be diminished.
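How long a process takes can often be measured rather than estimated. A minimal sketch, assuming you can export when each change was opened for review and when it was approved; the records below are hypothetical and do not follow any particular tool's export format:

```python
from datetime import datetime
from statistics import median

# Hypothetical export: (opened_for_review, approved) as ISO 8601 timestamps.
reviews = [
    ("2021-01-04T09:00", "2021-01-04T11:30"),
    ("2021-01-05T14:00", "2021-01-07T14:00"),
    ("2021-01-06T10:00", "2021-01-06T16:00"),
]


def hours_to_review(opened: str, approved: str) -> float:
    """Elapsed wall-clock hours between opening a review and its approval."""
    delta = datetime.fromisoformat(approved) - datetime.fromisoformat(opened)
    return delta.total_seconds() / 3600


durations = [hours_to_review(opened, approved) for opened, approved in reviews]
print(f"median: {median(durations):.1f}h, slowest: {max(durations):.1f}h")
```

Reporting the slowest case alongside the median helps show whether a 2 day review is typical or an outlier.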

It can be tempting to be very subjective or opinionated in this section, especially when the processes employed are ridiculous or resemble tragedy and farce. However, overzealous rhetoric can damage the effectiveness of an allegedly objective report.

An attempt at a generic, objective example:

The development team cuts a new release when they have new features to put in front of users, or if they have produced any fixes for bugs reported by users or internal testers. A developer cuts a new release by running a script on their laptop, then pushing to the release branch in their version control system.

The operations team deploy either when they have configuration changes, operating system updates, or auxiliary software patches to install, or when they have a new release given to them by the development team. Due to the lack of a pre-production environment, they first deploy to half of the servers, then after 30 minutes to the other half.

List of good things

A purpose of this section is to validate changes or procedures previously adopted, and provide evidence that they are working. For example:

Previously an engineering team spent 6 months optimising the continuous delivery pipeline and reducing the number of unreliable tests. Most of the engineers reported that they are more productive today than they were a year ago, and many interview participants mentioned this work as having a positive effect on their productivity.

Compared to a year ago, the number of out-of-hours pages has been reduced by 70% due to the change from alerting on symptoms to alerting on user behaviour. On-call engineers are happier with their work/life balance, and are more confident that any alert received is not a false positive.

Another reason for this section is to avoid the report appearing too negative. It is highly unlikely that everything is in a precarious or risky state of affairs, and it is very easy to dismiss a document which appears biased, rhetorical, or shallowly one-sided.

If the situation is dire, and you were not able to find any good things at all, then you can reduce ambiguity by explicitly including such an observation:

The audit procedure attempted to discover good practices and procedures, as well as to identify risks. Unfortunately it was not possible to observe any practice that could be described as effective.

As with any document which describes actions done by people, it is important to be mindful of people’s thoughts and feelings, and the agile prime directive still applies:

“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”

List of risks and bad things

This section is the obverse of the previous section, and may be considerably longer or shorter. Depending on the size of the organisation, and any systems covered in your audit, it may be helpful to group risks together. For example:

Once you have grouped together risks you’ve identified, describe each one individually and, if you can, include an example or some context of what that risk really means. For example:

Security risk: 20% of the virtual machines in the production environment are running Ubuntu Trusty which, since April 2019, no longer receives security updates. Not applying security updates could be construed as negligence or malpractice in the eyes of our shareholders and accreditors.

Another example:

Commercial risk: vendor X has not produced new features or marketing materials for software Y which we use for Z. Discussions with our account manager indicate that X’s new focus is on a different product line. If X chooses not to continue to develop Y, then we may have to switch to an alternative product at short notice, for example because a security vulnerability is not being fixed.
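A figure like the 20% in the security risk example above can usually be derived from an inventory rather than estimated. A minimal sketch, assuming a mapping of hosts to their operating system release is available; the host names and the shape of the inventory are hypothetical, and the set of unsupported releases should be filled in from the vendor's published support schedule:

```python
# Releases treated as out of support. "trusty" comes from the example above;
# extend this set from the vendor's published support schedule.
UNSUPPORTED_RELEASES = {"trusty"}

# Hypothetical inventory: host name -> OS release codename.
inventory = {
    "web-1": "bionic",
    "web-2": "bionic",
    "db-1": "trusty",
    "queue-1": "xenial",
    "batch-1": "focal",
}


def unsupported_hosts(hosts: dict[str, str]) -> list[str]:
    """Return the sorted names of hosts running an unsupported release."""
    return sorted(
        name for name, release in hosts.items() if release in UNSUPPORTED_RELEASES
    )


at_risk = unsupported_hosts(inventory)
share = 100 * len(at_risk) / len(inventory)
print(f"{len(at_risk)} of {len(inventory)} hosts ({share:.0f}%) are unsupported: {at_risk}")
```

Keeping the list of affected hosts, not just the percentage, gives you something concrete to put in an appendix.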

Recommendations

Once you’ve grouped risks together and enumerated them, you should be in a position to:

For example:

A number of security risks were identified including lack of prompt patching and lax identity and access controls. Both risks could precipitate a data breach, which cost similar organisations £X last year.

I recommend we ask 3 engineers and 1 product manager to spend 6 months improving our ability to patch our software, with a target of reducing the average time taken to patch from the current 30 days to 7 days.

I recommend we ask 2 engineers to spend 3 months tightening up our identity and access management systems and processes. Moving service X and Y to use our single sign-on system will reduce the number of places we administer access controls.

Do not feel the need to provide a solution to every risk identified. Some risks cannot be mitigated and must be accepted or transferred. Some risks may not be risks which can be solved by you or your team. You may wish, for completeness, to enumerate which risks were identified but did not have specific recommendations.
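One way to keep track of which risks have a recommendation, and which are simply accepted or transferred, is to hold the findings as structured data before writing them up. A minimal sketch; the categories, field names, and sample risks are all hypothetical, loosely based on the examples above:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Risk:
    category: str  # e.g. "security", "commercial"
    summary: str
    recommendation: Optional[str] = None  # None: accepted or transferred


risks = [
    Risk("security", "Unsupported OS on production VMs", "Improve patching capability"),
    Risk("security", "Lax identity and access controls", "Move services to single sign-on"),
    Risk("commercial", "Vendor may stop developing a product we rely on"),
]


def group_by_category(items: list[Risk]) -> dict[str, list[Risk]]:
    """Group risks by category, preserving the order they were recorded in."""
    grouped: dict[str, list[Risk]] = defaultdict(list)
    for risk in items:
        grouped[risk.category].append(risk)
    return dict(grouped)


grouped = group_by_category(risks)
without_recommendation = [r.summary for r in risks if r.recommendation is None]

for category, items in grouped.items():
    print(f"{category}: {len(items)} risk(s)")
print("No specific recommendation:", without_recommendation)
```

The same structure can feed both the grouped list of risks and the list of risks that were identified without a specific recommendation.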

Conclusion

Depending on how thorough your recommendations were, writing a conclusion may be superfluous. You may wish to sum up some of the recommendations, or write down what you intend to do with the report in the near future.

Appendices

If your audit has been particularly thorough or fruitful, you may have large quantities of useful anecdotes and data which inform the risks and recommendations in your audit.

If so, you may wish to include these as appendices in your audit. This will add weight to your conclusions, or provide context in case you have misinterpreted something.

If you have examined historical incidents, you may have useful diagrams, graphs, tables, and other records to which you may wish to link.

Making it useful

Once you’ve compiled the results of your audit, you should use it to achieve the outcome you desire, which may be reprioritisation, or adoption of a new technology. Alternatively, the audit may have validated that your current approach is good and your course does not need to be adjusted.

Unless you are in the privileged position of making all the decisions, you probably need to have some conversations with others, and your audit results will be a useful input into those conversations.

Before you circulate the report more widely, you should share your report with some of the people you communicated with while writing it. If you surveyed or interviewed people, you should check that you’re not misrepresenting their thoughts and feelings. If you compiled an audit based on quantitative information like budgets, invoices, or configuration manifests then you may wish to get your peers to check your work: your report will be most effective if you ensure that it is accurate.

Once you’re confident that your findings are accurate, that you have accurately interpreted the data, and that you are authentically representing the thoughts and feelings of your colleagues, then you should share the audit more widely, and store it in a place where it can be most useful for future readers.

You may wish to schedule a meeting or workshop with relevant people to specifically address the risks and recommendations: maybe you need a budget increase to address a commercial risk, or you need another team to help you. As always, make sure that any meeting or workshop has actions assigned at the end, and that a single person (e.g. you) is going to ensure that they get done by the assignee.

Conversely, if you are in an organisation where you are individually empowered to address many of the risks you have identified then you can use specific sections of the report as context for code changes or procedure changes that you are making. For example, consider linking to a specific section of your audit in a pull request or user story.

Closing thoughts

I fear that my rather dry explanation of how to audit an engineering organisation may put you off. Indeed, the act of methodically going through processes and practices is not enjoyable for everyone, however you are likely to:

Therefore, it might be worth your while.