Your AI code generators are your new net-negative developers
...and they can generate bad code far faster than any human...
In case you've missed it, there is a new buzzword going around: Artificial Intelligence, aka "AI". If you believe the hype, these clever little learning programs are the solution to everything, from creating fine art to curing cancer to writing software. They vacuum in existing data (the 'training set'), then use this knowledge dump to decide how to solve the problems they are given.
"It writes code? Great!", I can hear you say. "I can use this to write my code faster!". Certainly this is a comment I am hearing from managers, and also inexperienced developers. Not so fast, folks...
Several experienced developers, including myself, have been poking around these code generation tools to see if they live up to the hype. I can assure you they don't. I am not an AI expert, but I strongly suspect this is a function of how these tools work. Think about it - the learning network needs to be trained on existing data. A lot of it. The better the quality of that data, the better the quality of the subsequent outputs. The AI then uses this information to predict the most likely answer to a given question. So where are these coding tools getting their data demonstrating coding best practice? You guessed it - from the widest, most publicly available database of code examples out there: the internet. Stack Overflow, personal blogs and training materials are all being sucked in (plagiarised? Therein lies another conversation...) to provide the background for what "good code" looks like. Anyone who has been around a while knows that the majority of this information is <ahem> "suboptimal" compared to best practice.
So, in summary: these AI code generators are slurping an already mostly dodgy data set off the internet to learn how to code, and then producing the average of it. I think you can see the issue here. Don't believe me? The studies are starting to come in:
"We find disconcerting trends for maintainability. Code churn -- the percentage of lines that are reverted or updated less than two weeks after being authored -- is projected to double in 2024 compared to its 2021, pre-AI baseline. We further find that the percentage of 'added code' and 'copy/pasted code' is increasing in proportion to 'updated,' 'deleted,' and 'moved 'code. In this regard, AI-generated code resembles an itinerant contributor, prone to violate the DRY-ness [don't repeat yourself] of the repos visited."
from the "Coding on Copilot" whitepaper from GitClear
Interesting that they liken the AI output to an "itinerant contributor", but more on that later...
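To make GitClear's churn metric concrete, here is a rough sketch of the calculation as I read their definition - my own Python illustration, not their tooling (which mines real git history):

```python
from datetime import date, timedelta

# Hypothetical, simplified churn calculation: the share of authored
# lines that were reverted or updated within two weeks. Each record
# here is simply (date_authored, date_changed_or_None).
def churn_percentage(line_records, window_days=14):
    if not line_records:
        return 0.0
    churned = sum(
        1 for authored, changed in line_records
        if changed is not None
        and (changed - authored) <= timedelta(days=window_days)
    )
    return 100.0 * churned / len(line_records)

# Two of these four lines were rewritten within a fortnight -> 50% churn.
records = [
    (date(2024, 1, 1), date(2024, 1, 5)),   # updated after 4 days: churn
    (date(2024, 1, 1), date(2024, 1, 10)),  # updated after 9 days: churn
    (date(2024, 1, 1), None),               # never touched again
    (date(2024, 1, 1), date(2024, 3, 1)),   # updated after 60 days
]
print(churn_percentage(records))  # 50.0
```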
Here's a fun example. The AI (no names, no pack drill, no lawsuits for defamation...) was asked for an algorithm to calculate the week number of a given date in the year. Clue: what it produced is not how you do it. Would you have caught this serious bug?
(For anyone interested, here's how to solve it)
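The original exchange was a screenshot, so the snippet below is a hypothetical reconstruction of the shape of the bug - a naive "divide the day of the year by seven" calculation - alongside the ISO 8601 answer the standard library already provides:

```python
from datetime import date

# Hypothetical reconstruction of the naive approach (not a quote of the
# actual AI output, which isn't reproduced here):
def naive_week_number(d: date) -> int:
    # Assumes week 1 starts on 1 January and every week is a clean
    # seven days - which is not how week numbering works.
    return (d.timetuple().tm_yday - 1) // 7 + 1

# ISO 8601 weeks run Monday to Sunday, and week 1 is the week containing
# the year's first Thursday. The standard library already knows this:
def iso_week_number(d: date) -> int:
    return d.isocalendar()[1]

d = date(2016, 1, 1)  # a Friday
print(naive_week_number(d))  # 1
print(iso_week_number(d))    # 53 - it belongs to the last ISO week of 2015
```

Plausible-looking code, confidently wrong at the edges - exactly the kind of bug that sails through review.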
Now this is by no means the first time bad code generation has crept into software development. Quite the opposite. I suspect this has been happening to a degree ever since software became ubiquitous and more people were needed to physically write code. The issue is summed up by this post by @sleepyfox from 2012 about "Net-negative Developers". Ward Cunningham and others have also noticed this effect. Briefly, some developers - both inexperienced ones and others (see, for example, the concept of the "Expert Beginner") - generate so much cruft that fixing it exceeds the value of what they do produce. This net negative has to be made good, which takes someone's time. Fortunately, these "NNPPs" can only generate spoilage at human speeds. Now imagine this bad code being generated at machine speeds - it could take several, maybe even tens, of people to make good all this cruft.
Remember what I was pointing out earlier? That AIs are trained on second-rate examples, and look for the average response based on that? And they will spew this out at machine speeds if allowed to - that is a lot of bad code being generated, all of which will look plausible to everyone but your experienced engineers. Guess who will end up having to refactor it into something sensible? Those same experienced engineers, distracting them from moving your company forward (and likely thoroughly annoying them, since they will be doing avoidable remedial work).
So all in all I believe that AI code generation needs to be approached with extreme caution at the moment. I do not consider it mature enough for mainstream use, certainly not production use. Currently I consider the technology to be actively damaging to code quality and maintainability.
Don't get me wrong - I think AI does have the potential to generate decent source code one day. With 10-50 years of heavily funded research we might begin to develop a version that has a basic understanding of software engineering and codecraft. But we need to be wary of today's hype-driven adoption; from what I am seeing, AI code tools have barely graduated beyond a glorified autocomplete with delusions of grandeur.
Hi Chris ... I mostly agree! And ... I've found the cursed things useful for looking up details of complicated APIs. And, in just a few hours of fiddling, had it tell me three or four things that just weren't true. Mixed blessing? Curse? Why not both!
Code generators have been a thing since I started programming in the 1980s, and the code they generate is always (to the human eye) rubbish. The question is: does it matter? In a world where AI writes the code and maintains the code, it is untouched by human hands (or brains).
There are three questions I want to see answered:
1. What is the power consumption of the generated code? If it is as bad as that of the AI itself, can we afford it - both in terms of £$€ and CO2?
2. Where is the testing? If we don't want lots of ethnic minorities going to gaol then we need to test as fast as the machine can code. Machine coding means more code, so more work for testers. And if we take a TDD/BDD-style, test-first approach then maybe a different AI approach (genetic mutation) would be better.
3. Finally, the big question: what is the maintenance life cycle? Who will fix this code - machines or people? And since AIs are always changing, the code you generate today will be very different to the code from last week, even if they do the same thing. How are you going to diff that? You may fix one bug and introduce another.
Nothing stops a developer from using TDD with AI, asking it to write or add tests for new or existing code.
True. A developer can technically get AI to write unit tests, either first or retrospectively against existing code. But this is not TDD, or anything close. TDD - Test *Driven* Development - is a design technique where the tests communicate where the design needs to go next: the next small step that is required and makes sense. This requires *context*, which LLMs simply do not have. I would argue that inventing a prompt pattern that would generate _exactly_ what your intent is for a test would need something very close to a 5GL.
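To illustrate what *driven* means here, a minimal red-green sketch (my own toy example, with a hypothetical Basket domain):

```python
import unittest

# The test comes first and names the next small behaviour we want;
# the design follows from making it pass.
class TestBasket(unittest.TestCase):
    def test_empty_basket_totals_zero(self):
        self.assertEqual(Basket().total(), 0)

    def test_total_sums_item_prices(self):
        basket = Basket()
        basket.add(price=250)
        basket.add(price=100)
        self.assertEqual(basket.total(), 350)

# Only now do we write the simplest Basket that satisfies the tests;
# the next failing test (discounts? currencies?) drives the next step.
class Basket:
    def __init__(self):
        self._prices = []

    def add(self, price):
        self._prices.append(price)

    def total(self):
        return sum(self._prices)

if __name__ == "__main__":
    unittest.main()
```

The point is the order: the failing tests existed before Basket did, and deciding what the *next* failing test should be is exactly the contextual judgement an LLM lacks.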
As for retrofitting tests: again, this is about *context*. What is the item being tested doing in terms of the domain language? Again, LLMs cannot provide that context unless a lot of effort is expended getting the prompt **absolutely exact**, by which time the developer may as well have simply written the test themselves.
Unfortunately these two antipatterns are appearing in codebases, and more experienced developers are having to spend time carefully unpicking, refactoring and fixing both the generated tests and the resulting production code - the exact definition of the net-negative producer. That said, I don't blame people for trying to use LLMs like this - the hype makes it sound awfully seductive - but it does show how inexperience and naivety can be dangerous when faced with aggressive, credible, but ultimately bullshit, marketing.