Unlocking AI’s Potential With Code Testing

Key insights:

  • To realize the potential of every engineer you hire, remove the bottlenecks in getting code into production.
  • Companies that create the right conditions for engineers to succeed outperform competitors by four to five times.
  • The advent of generative AI has made coding far faster, but its impact has been largely limited to prototyping and version 1.0 software.
  • Generating code that’s syntactically correct is easy, especially for AI, and these technologies are improving rapidly.
  • But making changes in a complex code base with many decision trees, and getting correct test results in every case, is still critically hard.
  • A great engineer may push tens of changes per day, but AI copilots may soon be proposing 10,000 changes daily and running 10,000 tests each time.
  • In this world, testing is the bottleneck in both time and cost, and many organizations could see CI/CD costs alone skyrocket into the millions annually.
  • The future gets even brighter when LLMs fine-tuned on an organization’s code and business goals can revise code on their own based on automated test results.
  • Solving this problem is the key to a drastic acceleration of how software can improve our world.

Launching a new app or SaaS product has never been easier. With the advent of generative AI tools, the cycles for spinning up a prototype are shrinking rapidly. AI can lower the coding lift or automate stages of production.

But automation isn’t speeding up the pace of innovation in any meaningful way. In fact, for more established companies, those with products already in the market, AI has yet to be meaningfully incorporated. That’s because companies don’t get bogged down in the early stages of innovation; they get bogged down when evolving an existing product with real customers, where security, reliability, and UX are all at stake. Here, the issue isn’t code generation; it’s testing.

Testing is the sticky wicket in innovation today. It’s the only way to ensure code goes into production bug-free and integrates seamlessly with the existing code base, so the stakes are high: according to a report from Synopsys and CISQ, software quality issues cost US businesses $2.41 trillion in 2022. Testing can’t be overlooked and shouldn’t be conducted hastily, but the current process has become cumbersome and inefficient, impeding innovation and overall business success. Companies that don’t get this right stand to lose: according to a McKinsey study, those that remove barriers to development, such as antiquated, cumbersome testing processes, outperform others in the market by four to five times.

Code generation is getting easier — testing is getting harder

AI alone isn’t the answer. Certainly, automation will make code generation easier and more efficient, but just like humans, it can make mistakes, and at scale that makes code review more important than ever. The problem is in how reviews are conducted. Most engineering teams have test suites that run every test on every code change, even when that change could only have affected a small fraction of the massive suite. This would not be an issue if the tests were fast and reliable, but usually they’re not. Sometimes a test fails and blocks an engineer from landing a change, yet the failure is a false alarm from a flaky test that has nothing to do with the engineer’s change. Imagine how much worse all of this will get with AI pushing thousands of changes every day.

And it’s not just the automated tests that cause slowdowns. Engineers are also often bottlenecked by human reviewers tasked with backstopping automated tests in case they missed something. Eventually, well-trained LLMs will reduce the amount of manual review required. But that doesn’t address the problem of overly complex, arbitrary testing, and automation only increases the need for intelligent testing that runs just what is necessary, and does so efficiently.
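One way to attack this bottleneck is test-impact selection: run only the tests whose dependencies overlap with the files a change touched. Below is a minimal sketch, assuming a hypothetical mapping from each test to the source files it exercises; real tools derive such a map from coverage data or a build graph rather than hand-writing it.

```python
# Sketch of test-impact selection: given the files touched by a change,
# pick only the tests whose dependencies overlap with those files.
# The dependency map is hypothetical, for illustration only.

def select_tests(changed_files, test_deps):
    """Return the sorted subset of tests affected by the changed files."""
    changed = set(changed_files)
    return sorted(
        test for test, deps in test_deps.items()
        if changed & set(deps)  # any overlap means the test may be affected
    )

# Hypothetical dependency map: test name -> source files it exercises.
TEST_DEPS = {
    "test_checkout": ["cart.py", "payments.py"],
    "test_search": ["search.py", "index.py"],
    "test_login": ["auth.py"],
}

if __name__ == "__main__":
    # A change to payments.py should trigger only the checkout tests.
    print(select_tests(["payments.py"], TEST_DEPS))  # ['test_checkout']
```

The trade-off is safety versus speed: a stale or incomplete dependency map can skip a test that would have caught a bug, which is why production systems rebuild the map continuously from observed coverage.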

This complicated chart shows how updating code impacts many steps in an existing product’s logic tree.

Throwing engineer bodies at a process problem

Faced with a slowdown in getting code into production, companies often throw bodies at the problem, believing that more engineers means more code, even if each one is doing less than they could.

The inefficiencies are not immediately apparent from the outside. Consider the engineering headcount at X before and after its acquisition by Elon Musk. You could argue that the pace of innovation slowed after Musk issued extensive staff cuts. But if you asked most end users, it’s likely they hardly noticed that X went from more than 7,000 engineers to fewer than 1,000. The output of those 6,000 was so low that most people didn’t notice.

Picking up the pace of engineering output

Google has been studying the performance of software development teams in an effort called DevOps Research and Assessment (DORA). The study identified five key metrics for success: deployment frequency, lead time for changes, change failure rate, time to restore service, and, added in 2023, reliability. Ultimately, the study concluded that engineering productivity depends largely on shipping changes into production faster and more reliably. 
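For concreteness, two of the DORA metrics can be computed directly from deployment records. The sketch below uses an illustrative log format that isn’t tied to any specific tool; real pipelines would pull these fields from their CI/CD and incident systems.

```python
# Sketch of computing two DORA metrics -- deployment frequency and
# change failure rate -- from a hypothetical deployment log.
from datetime import date

# Illustrative log: one record per deployment, flagged if it caused
# a failure in production.
deployments = [
    {"day": date(2024, 1, 1), "failed": False},
    {"day": date(2024, 1, 2), "failed": True},
    {"day": date(2024, 1, 2), "failed": False},
    {"day": date(2024, 1, 8), "failed": False},
]

def deployment_frequency(deploys, days_in_window):
    """Average number of deployments per day over the window."""
    return len(deploys) / days_in_window

def change_failure_rate(deploys):
    """Fraction of deployments that caused a production failure."""
    failures = sum(1 for d in deploys if d["failed"])
    return failures / len(deploys)

print(deployment_frequency(deployments, 7))  # 4 deploys over 7 days
print(change_failure_rate(deployments))      # 1 failure out of 4 -> 0.25
```

The other metrics (lead time for changes, time to restore service, reliability) follow the same pattern: they are simple aggregates over events, but they are only meaningful if deployments and failures are recorded consistently.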

Reliability and speed have been at the heart of efforts to improve engineering output — just not at the same time. The future holds the possibility of writing code and iterating at scale by revamping code review.

Reducing the risks in innovation

Improving the test-and-build process requires tools that give engineers fast, reliable feedback on the code they’re developing, with insight into how it will integrate with the existing code base. The faster and more reliable that feedback is, the faster engineering teams can move, regardless of whether humans or AI are writing the code. In a survey of 5,000 engineers in 2023, GitHub found that AI may hold part of the answer: almost two-thirds believe AI coding tools will give them an advantage by improving code quality, shortening completion times, and helping them resolve incidents. But the benefits of AI won’t be fully realized until that code-integration feedback loop is reliable, fast, and inexpensive.

ChatGPT itself isn’t the solution. But the way it has been built, as a large language model trained on specific data, offers a model for improving the build-and-test process. With an LLM trained specifically on a company’s code base, the AI can serve as a copilot, evaluating whether new code will work and identifying which tests might fail and why. Code generation is thought to be the near-term opportunity in LLMs, but the greater potential is in feeding test results back into the LLM so it can fix its own mistakes; that is only possible if tests become faster and cheaper by orders of magnitude.
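That feedback loop can be sketched as a simple generate-test-revise cycle. Everything below is hypothetical scaffolding: `propose_patch` stands in for a fine-tuned LLM and `run_tests` for a fast CI harness; the point is the control flow, not the components.

```python
# Sketch of a test-driven repair loop: run the tests, and if any fail,
# feed the failures back to a code-revising model until the suite passes
# or attempts run out. Both callables are hypothetical stand-ins.

def repair_loop(code, propose_patch, run_tests, max_attempts=3):
    """Iteratively revise `code` until `run_tests` reports no failures."""
    for attempt in range(max_attempts):
        failures = run_tests(code)
        if not failures:
            return code, attempt       # success: code passes the suite
        # Feed the failing-test output back as context for the next patch.
        code = propose_patch(code, failures)
    return code, max_attempts          # gave up; return best effort

# Toy stand-ins: the "bug" is the substring 'BUG'; the "model" removes it.
def run_tests(code):
    return ["test_no_bug failed"] if "BUG" in code else []

def propose_patch(code, failures):
    return code.replace("BUG", "")

fixed, attempts = repair_loop("x = 1  # BUG", propose_patch, run_tests)
print(attempts)  # fixed on the first revision
```

Each iteration of this loop runs the tests, which is exactly why the essay’s argument holds: if a single test run is slow or expensive, an autonomous revise-and-retest cycle multiplies that cost on every attempt.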

A future of faster code

We believe that solving this important problem is the key unlock to a drastic acceleration of how software can improve our world. To keep up with the coming onslaught of code, testing and integration need to be fast and cheap. Getting good at testing also creates a powerful flywheel effect, because if you can determine whether changes to existing code will break anything or have unintended consequences, then you can more confidently unleash code-generating AI models. 

While everybody else is focused on how AI can revolutionize code generation, we want to make sure that code gets into production in a way that improves the experience for developers and helps to grow businesses. 

What does a world in which engineers are 10 times more productive than they are today look like? I don’t know exactly, but I’m eager to find out.  
