The 6 levels of autonomous unit-testing explained

Preface

The trend of AI-generated code has been ramping up since 2021, with GitHub Copilot (powered by OpenAI Codex) as a milestone. Coders who experiment with code-generation tools may be wowed every once in a while by their capabilities; however, the integrity of the generated code is generally low, and there is a long path to traverse on the way to fully autonomous coding. Hence, as we get closer to harnessing the technology, the ability to generate test code is needed more than ever. We believe that the path of code generation is intertwined with the path of test-code generation.

Testing – essential but hard

Today’s leading industry standard for validating software behavior is tests that sample the expected behavior of a particular software module (unit tests, micro tests, component tests, etc.). These tests are, for the most part, manually coded by the developer who wrote the code under test. However, manually written tests have several downsides:

1. The quality of test code, and developers’ devotion to writing it, fluctuate between developers and with their mood and priorities.
2. While test writing is relatively mundane work, it consumes a significant share of a professional developer’s time, inflating software projects’ budgets.
3. Systematically mapping complex code behaviors is a challenging task for developers. There is usually a high degree of correlation between what they sample in the test and what they coded, so tests tend to miss the unintended behaviors.

Various industry sources agree that more than 20% of developers’ time in software companies is spent on writing tests and maintaining high code integrity for new and refactored code. Although manually coding tests may be frustrating, we understand why many spend precious time writing them. For example, the effort of refactoring code is not uniform: the existence of meaningful tests with high code coverage (i.e., high code integrity) usually makes the refactoring process much smoother. To reduce the manual test-coding effort, AI code generators are being suggested, and they have started to show promising results.

The 6 Levels of Autonomous Code Integrity

Autonomous code integrity (ACI) is the ability to automatically generate tests and measure correctness. In this post, we present the 6 levels of our Autonomous Code Integrity (ACI) model. It is inspired by the parallel effort to build autonomous vehicles (AVs) and is designed to set the framework for building systems that autonomously take care of code integrity.


The levels of autonomous code integrity (unit testing)
*Note that the term “automated unit testing” is sometimes used to describe manually written tests that run automatically with every build; in this post, we talk about unit tests that are automatically generated.

The 4 pillars of ACI

The philosophy behind the levels model is derived from what we believe are the 4 pillars of ACI:

  1. Code integrity – Includes the task of unit-test generation and its corresponding correctness/confidence metrics. Advances in this pillar include the ability to generate better tests (targeted at edge cases and vulnerable areas) as well as more reliable correctness/confidence metrics. This is the equivalent of executing steering and acceleration in the autonomous-vehicle domain.
  2. Code context – A key component of succeeding at the autonomous integrity task is extracting the right context for unit-test generation, i.e., understanding relations, deducing which parts of the code are relevant, etc. In AV, a similar pillar is directed toward awareness of the driving environment.
  3. Code complexity – Refers to the ability to support multiple scenarios: mainly different levels of code complexity, but also multiple languages, frontend vs. backend code, etc. The AV parallel is driving modes (urban, highway, etc.).
  4. Strategy – A unit-test strategy is a key component of unit testing, whether manual or automatic. It combines the system’s ability to answer two questions: 1. How important is this component, and consequently, what correctness score do we wish to achieve for it? 2. Given limited resources, what is the optimal suite for this component? This has no direct equivalent in the AV world; although optimizations for AV systems are critical, they aren’t directly considered part of the driving task, unlike in the unit-testing domain.

Notice the resemblance between these 4 pillars and the 4 pillars of autonomous driving.

The ACI levels in depth

Level 0: No unit-testing automation

Unit tests are manually coded. The programmer provides the entire test suite for a selection of modules, though there may be systems in place to help them. An example would be a CI/CD automation testing system that provides coverage metrics for the test suite; but since it doesn’t directly affect the code integrity, it does not qualify as automation.
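For a concrete baseline, here is a minimal sketch of what level 0 looks like in practice, written in pytest; `parse_price` and its module are hypothetical, used only for illustration. The developer writes every test by hand, and the CI system merely reports coverage on top.

```python
# test_pricing.py -- a manually written unit test suite (level 0).
import pytest

from pricing import parse_price  # hypothetical module under test


def test_parse_price_happy_path():
    # The developer samples the behavior they expect.
    assert parse_price("$19.99") == 19.99


def test_parse_price_rejects_garbage():
    # Edge cases are covered only if the developer thinks of them.
    with pytest.raises(ValueError):
        parse_price("not a price")
```

A CI job might then run something like `pytest --cov=pricing` (via the pytest-cov plugin) to report coverage, but the tests themselves remain entirely human-written.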

Level 1: Unit-testing assistance

This is the lowest level of automation. Some early adopters of code-generation solutions may already have started to generate level 1 test code using this emerging technology. While some simple context may be extracted, the developer still needs to provide additional information to get good results. Level 1 systems can only be used on simple pieces of code and, in most cases, create “happy path” tests or naive boilerplate tests, as in the sketch below.
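As an illustration (both the function and the generated test are hypothetical), this is the flavor of output a level 1 assistant tends to produce for a simple, self-contained function:

```python
# The function under test, written by the developer.
def clamp(value: float, low: float, high: float) -> float:
    """Restrict `value` to the inclusive range [low, high]."""
    return max(low, min(value, high))


# The kind of naive "happy path" test a level 1 tool tends to emit:
# it exercises the obvious cases but ignores trickier inputs
# (e.g., low > high, NaN), which the developer must still cover by hand.
def test_clamp_happy_path():
    assert clamp(5, 0, 10) == 5
    assert clamp(-3, 0, 10) == 0
    assert clamp(42, 0, 10) == 10
```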

Level 2: Partial unit-testing automation

At level 2, we can generate test code for a larger variety of code. The tests that are created are still somewhat naive in their ability to probe interesting edge cases, but the awareness of the code environment increases significantly. The system can scan the repository and find the needed context, though it may still rely on the developer to provide additional context. The developer still writes the vast majority of the test code.
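To give a feel for what context scanning involves, here is a minimal sketch (an assumed design, not any product’s actual internals) that collects the source of same-module definitions a function calls, so the test generator sees its immediate dependencies:

```python
import ast
import inspect
import textwrap


def collect_local_context(func) -> dict:
    """Map names that `func` calls to their source within the same module."""
    module_src = inspect.getsource(inspect.getmodule(func))
    tree = ast.parse(module_src)

    # Names that `func` calls directly (simple `name(...)` calls only).
    called = {
        node.func.id
        for node in ast.walk(ast.parse(textwrap.dedent(inspect.getsource(func))))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }

    # Pull the source segment of each called definition from the module.
    return {
        node.name: ast.get_source_segment(module_src, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)) and node.name in called
    }
```

A real level 2 system would also follow imports across files and rank candidates by relevance; this sketch only looks within a single module, which is why the developer may still need to fill in the gaps.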

Level 3: Conditional unit-testing automation

At this level, the developer no longer has to provide context. The generated test code is now capable of surfacing interesting edge cases and supporting many code scenarios, and our code integrity metrics are reliable. However, the developer still needs to manage the testing strategy, as the system has no awareness of the importance of each component.
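To illustrate the jump from level 1, here is a hypothetical pair of edge-case tests that a level 3 system might surface for the toy `clamp` function from the level 1 sketch, pinning down behaviors the happy-path suite misses:

```python
import math


def clamp(value: float, low: float, high: float) -> float:
    # Same toy function as in the level 1 sketch.
    return max(low, min(value, high))


def test_clamp_inverted_bounds():
    # With low > high, the max/min combination silently returns `low`;
    # a test should pin this surprising behavior down (or reject it).
    assert clamp(5, 10, 0) == 10


def test_clamp_nan_is_coerced_to_low():
    # NaN comparisons are always False, so min/max fall through and the
    # result is `low` -- an unintended behavior a hand-written suite
    # would likely never sample.
    assert clamp(math.nan, 0.0, 10.0) == 0.0
```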

Level 4: High unit-testing automation

Level 4 allows most developers, in most scenarios, to get a fully automated test suite without their intervention. While the tests may not cover all the code behaviors, they will cover most of them, including challenging edge cases. On high-complexity code, the system may fail or ask the developer to provide feedback. As our ability to measure code integrity and infer code importance improves, testing strategies can be applied.
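To make the strategy pillar concrete at this level, here is a minimal sketch, with entirely assumed importance and cost inputs, showing how a system might allocate a limited test-generation budget across modules:

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    importance: float  # inferred, e.g., from change frequency and caller count
    cost: float        # estimated effort to generate and run its suite


def plan_suite(modules: list[Module], budget: float) -> list[str]:
    """Greedily select modules by importance per unit of cost."""
    ranked = sorted(modules, key=lambda m: m.importance / m.cost, reverse=True)
    selected, spent = [], 0.0
    for m in ranked:
        if spent + m.cost <= budget:
            selected.append(m.name)
            spent += m.cost
    return selected


print(plan_suite(
    [Module("payments", 0.9, 3.0),
     Module("logging", 0.2, 1.0),
     Module("auth", 0.8, 2.0)],
    budget=4.0,
))  # -> ['auth', 'logging']
```

A greedy ratio heuristic is just one possible policy; the point is that once importance can be inferred, the test suite becomes an optimization target rather than an afterthought.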

Level 5: Full unit-testing automation

Level 5 unit-testing automation requires no human attention in the test-creation process. It supports numerous software scenarios and code complexities, as long as the entire relevant code base is provided. The ability of level 5 systems to extract the entirety of the code’s behaviors exceeds the performance of an experienced programmer. Resource optimization is performed automatically, taking the inferred code importance into account and resulting in a concise test suite.

State of the industry

In the AV domain, full automation is considered a significant improvement because it determines, in a binary way, whether on-the-road time can be used for other tasks. When driving a car, losing attention for a fraction of a second can have lethal results, so fully relying on semi-autonomous cars could have lethal implications. In the ACI domain, by contrast, humans can benefit from lower levels of autonomy: time is freed whenever a complex task that now requires full human attention is converted into one in which the human only validates the correctness of the results. Furthermore, the optimal level of autonomy isn’t necessarily the highest one! Developers may tune how much of their workday they devote to collaborating with the system according to their preferences.

While the future of automatic generation of code and unit tests is promising and exciting, the technologies that enable it (such as large language models) are still in their infancy. We are just now taking our first steps from level 0 to level 1, and yet the possibilities are endless. It’s fair to say that autonomous unit testing will not be widely accepted by developers unless they are confident that it will increase their code integrity and, more importantly, make them happier by allowing them to focus on the more creative parts of their day.
