Test-driven development

Test-driven development (TDD) is a popular and efficient style of programming. In short, the developer cycles repeatedly through three phases to build, iteratively and reliably, code that matches the requirements, with close to 100% test coverage.

Each phase should be as short as possible, producing just enough code to prepare for the next phase.

In the first phase, often called Red, the programmer writes just enough of a test to make it fail (a compilation error counts as failing). In the next phase, called Green, the programmer writes just enough production code to make the test pass, without worrying about code quality. In the final phase, Refactor, the programmer tidies up the code that was just written so that it is clean and adheres to coding standards. After this, the cycle is repeated with the next test case.
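As a small illustration, one trip around the cycle could look like the following Python sketch. The function name `is_leap_year` and the choice of leap years as the problem are hypothetical, picked only for this example.

```python
# Red: these tests are written first. Until is_leap_year exists, calling
# them raises a NameError, which counts as a failing test.
def test_year_divisible_by_four_is_leap():
    assert is_leap_year(2024)


def test_other_year_is_not_leap():
    assert not is_leap_year(2023)


# Green: just enough production code to make both tests pass.
# The century rules are deliberately missing; they would be driven out
# by the next failing test (for example, one for the year 1900).
def is_leap_year(year: int) -> bool:
    return year % 4 == 0


# Refactor: there is nothing to tidy yet, so the cycle restarts with the
# next test case. Running the tests confirms that everything is green.
test_year_divisible_by_four_is_leap()
test_other_year_is_not_leap()
```

Note that the production code is intentionally incomplete: it only does what the tests so far demand.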

Generative AI

During 2023, the development of generative AI led, for the first time, to AI tools that promise to significantly increase programmer productivity by auto-generating production-quality code from directives in natural language (prompts). This area is developing very fast, with GitHub Copilot as one of the forerunners. Although promising, this technology has one serious drawback: trust. There are no guarantees whatsoever that the code the AI tools produce actually works or is free of subtle bugs, since the AI doesn't understand the code that it is writing. It is also difficult to tell the AI tool to produce code in the exact style or coding conventions that are required by the team. This implies that all code the AI tool produces has to be scrutinized carefully before it is merged into the production repository. No doubt, these tools could lead to huge productivity gains for many developers, but this drawback will make it difficult to scale their use in a reliable way.

Generative test-driven development

With this in mind, I want to propose a synthesis of these two coding techniques, TDD and generative AI, that combines the reliability of TDD with the productivity boost of generative AI in a way that scales.

The main idea is to replace the manual green phase with generative AI. But instead of letting the AI "guess" the best answer to a series of prompts, the task of the AI is to produce the simplest possible code change that passes all given tests (an application of Occam's razor), using the nomenclature given in the specification (the tests).

In the red phase, the programmer focuses on providing test cases in a logical order (from simple and specific to more elaborate and general), with carefully selected terms to guide the code design and the naming of classes, methods and variables. Here the given-when-then style of writing tests (used in behavior-driven development, BDD), as well as test frameworks that promote this style (for example XspecT), can be useful.
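To make the idea of "the simplest code change that passes all given tests" more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the shopping-cart domain, the test names, the fixed list `CANDIDATES` (standing in for whatever a generative model might propose) and the use of source length as a crude proxy for simplicity.

```python
from typing import Callable, List, Optional

# The specification: given-when-then style tests, ordered from simple and
# specific to more general. Each test receives a candidate implementation
# of `total` and returns True when the candidate satisfies it.
def given_empty_cart_when_totalled_then_zero(total: Callable) -> bool:
    return total([]) == 0


def given_one_item_when_totalled_then_its_price(total: Callable) -> bool:
    return total([5]) == 5


def given_several_items_when_totalled_then_their_sum(total: Callable) -> bool:
    return total([5, 7, 1]) == 13


TESTS = [
    given_empty_cart_when_totalled_then_zero,
    given_one_item_when_totalled_then_its_price,
    given_several_items_when_totalled_then_their_sum,
]

# Hypothetical candidates, kept as source text so that "simplest" can be
# approximated by the length of the source.
CANDIDATES = [
    "lambda prices: 0",
    "lambda prices: prices[0] if prices else 0",
    "lambda prices: sum(prices)",
    "lambda prices: sum(p for p in prices if p is not None)",
]


def simplest_passing(candidates: List[str], tests: List[Callable]) -> Optional[str]:
    """Return the shortest candidate that passes every test, or None."""
    # eval is acceptable here only because the candidates are fixed,
    # trusted strings written for this illustration.
    passing = [src for src in candidates if all(t(eval(src)) for t in tests)]
    return min(passing, key=len, default=None)


print(simplest_passing(CANDIDATES, TESTS[:1]))
print(simplest_passing(CANDIDATES, TESTS))
```

With only the first test in place, the constant `0` is the simplest passing answer; adding the later tests forces the more general `sum(prices)`. This mirrors how the order and wording of the tests steer the generated design.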

The refactor phase still needs to be performed manually, or perhaps semi-automated. This phase can also double as a learning phase for the AI. The goal of the AI should be to produce the best possible code during the automated phase, to minimize the amount of manual refactoring needed.

The vision

A seasoned programmer, powered with a generative TDD-tool, working in an agile team, would work something like this:

First, the requirements for the next top-priority feature are carefully analyzed and documented together with the product owner.

Given the requirements, some high-level designs are sketched out together with the team. Based on these, the feature can be broken down into smaller and narrower user stories with well-defined acceptance criteria, each of which can be deployed to production when completed.

Given a user story and the associated requirements and acceptance criteria, the programmer powers up their AI-augmented IDE and writes the first failing test. As the programmer is typing, the test is first indicated with gray (incomplete), but soon enough turns red, at which point the programmer stops typing. Now the AI takes over (indicated both with the color red and a spinner). After a few seconds, the test turns green, but the spinner continues while the AI attempts to refactor the code while keeping all tests green. After another few seconds, the test turns blue and the spinner stops. The programmer now knows that the AI has not only written the code to make all tests pass, but also made its best effort to refactor the code. The programmer navigates to the newly altered production code and inspects the result, perhaps adjusting it here and there where the AI didn't get it completely right. Finally, the programmer runs the tests again, to make sure none of them got broken during the manual refactoring, before starting to write the next test case.
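The color sequence described above can be read as a small state machine. The following Python sketch is one possible encoding of it. The names `Status`, `Event` and `advance` are hypothetical, and the rule that editing a test sends it back to gray is an assumption that the text does not state.

```python
from enum import Enum, auto


class Status(Enum):
    GRAY = auto()   # the test is still being typed and is incomplete
    RED = auto()    # the test is complete and failing; the AI is generating code
    GREEN = auto()  # all tests pass; the AI is attempting a refactoring
    BLUE = auto()   # all tests pass and the refactoring attempt is finished


class Event(Enum):
    TEST_COMPLETED = auto()
    ALL_TESTS_PASS = auto()
    REFACTOR_FINISHED = auto()
    TEST_EDITED = auto()


TRANSITIONS = {
    (Status.GRAY, Event.TEST_COMPLETED): Status.RED,
    (Status.RED, Event.ALL_TESTS_PASS): Status.GREEN,
    (Status.GREEN, Event.REFACTOR_FINISHED): Status.BLUE,
}


def advance(status: Status, event: Event) -> Status:
    """Return the next status; events that do not apply leave it unchanged."""
    if event is Event.TEST_EDITED:
        # Assumption: any edit to the test makes it incomplete again.
        return Status.GRAY
    return TRANSITIONS.get((status, event), status)


def spinner_visible(status: Status) -> bool:
    """The spinner is shown while the AI is working (red and green)."""
    return status in (Status.RED, Status.GREEN)
```

A test would then move through the states like this:

```python
status = Status.GRAY
for event in (Event.TEST_COMPLETED, Event.ALL_TESTS_PASS, Event.REFACTOR_FINISHED):
    status = advance(status, event)
# status is now Status.BLUE and the spinner has stopped
```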

After having used this AI tool for a while, the programmer notices that the AI almost always gets the code right (sometimes even improving the design beyond what the programmer is capable of), so the programmer starts working on the next test immediately after the AI signals that it is done, only occasionally checking the result. At other times, however, the programmer fails to provide a good next test case, so the AI gets stuck trying to find a solution. After perhaps 30 seconds the AI times out, and the programmer has to backtrack and provide a better next test, one that allows the AI to take a shorter leap to the solution. If two test cases contradict each other so that there is no solution, the AI should find this inconsistency immediately and provide an error message.
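Both failure modes, the timeout and the contradiction, can be sketched in a few lines of Python. The sketch rests on strong simplifying assumptions: a test case is reduced to a hypothetical `(input, expected output)` pair, only the most obvious kind of contradiction is detected (the same input with two different expected outputs), and candidate solutions arrive from some iterable. The names `find_contradiction` and `search_with_budget` are invented for this example.

```python
import time
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

Case = Tuple[Hashable, object]  # (input, expected output)


def find_contradiction(cases: List[Case]) -> Optional[Tuple[Case, Case]]:
    """Return the first pair of cases that demand different outputs
    for the same input, or None if no such pair exists."""
    seen = {}
    for given, expected in cases:
        if given in seen and seen[given] != expected:
            return (given, seen[given]), (given, expected)
        seen[given] = expected
    return None


def search_with_budget(
    candidates: Iterable,
    passes_all: Callable[[object], bool],
    budget_seconds: float = 30.0,
):
    """Return the first candidate that passes all tests.

    Raises TimeoutError when the time budget runs out first, which is the
    signal for the programmer to backtrack and write a smaller next test.
    Returns None if the candidates are exhausted without a match.
    """
    deadline = time.monotonic() + budget_seconds
    for candidate in candidates:
        if time.monotonic() > deadline:
            raise TimeoutError("no passing candidate found within the time budget")
        if passes_all(candidate):
            return candidate
    return None
```

Running the cheap contradiction check before the search starts is what would let a tool report an inconsistency immediately instead of waiting for the timeout.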

The future

A generative TDD tool would shift the craft of the programmer from writing data structures and algorithms to producing specifications. This might lead to the next major paradigm shift in programming, where language development focuses on expressing executable requirements more elegantly and efficiently, increasing the productivity of the software engineer even further and also inviting people with skill sets different from the traditional coder's to generate production code. Given the ever-increasing shortage of skilled developers, this kind of productivity boost could be exactly what the market needs, and an IDE with generative TDD capabilities could take a very dominant role.