Category Archives: ChatGPT

ChatGPT Improves Software Run Time

Most of my career in software can be summed up, to a first approximation, as “making software run faster”.  There are many ways to speed up run time, including my own favorite, “figure out how to solve the problem without any software”.  : – )

Inevitably, ChatGPT and friends have been given a shot at this game, too.

This isn’t a stupid idea, far from it.  A lot of what experts like me do when we are optimizing code is searching through things that have worked in the past, or that should work on general principles.  And, in some cases we automate this mindless process, using code to basically generate many variations that do the same thing, looking for the fastest. 

This winter, researchers at the University of Stirling report a study that augments this kind of search with ChatGPT [1]. They ask ChatGPT 3.5 to generate 5 examples of code that does the same thing as a sample.  Presumably, the results come from Java code on the Internet, or more precisely, the AI’s prediction of what the Internet would say.
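
To make the flavor of this concrete, here is a minimal sketch of that kind of LLM-augmented search, in Python rather than the Java used in the study.  The `ask_llm` function, the prompt wording, and the toy benchmark are placeholders of mine, not the researchers’ pipeline.

```python
import timeit

def ask_llm(prompt: str) -> list[str]:
    # Stand-in for a real ChatGPT call.  Here it returns one hand-written rewrite
    # so the sketch runs end to end; a real client would return several candidates.
    return ["def total(xs):\n    return sum(xs)\n"]

ORIGINAL = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s = s + x\n"
    "    return s\n"
)

def load(src):
    # Compile the candidate; many LLM answers won't even be legal code.
    env = {}
    try:
        exec(src, env)
        return env["total"]
    except Exception:
        return None

def is_correct(fn):
    # A tiny test oracle standing in for a real test suite.
    try:
        return fn is not None and fn([1, 2, 3]) == 6
    except Exception:
        return False

def runtime(fn):
    data = list(range(1000))
    return timeit.timeit(lambda: fn(data), number=2000)

prompt = "Give 5 Python functions equivalent to, but faster than:\n" + ORIGINAL
candidates = [ORIGINAL] + ask_llm(prompt)
survivors = [(src, load(src)) for src in candidates]
survivors = [(src, fn) for src, fn in survivors if is_correct(fn)]
best_src, best_fn = min(survivors, key=lambda pair: runtime(pair[1]))
print(best_src)
```

The real study does this inside a genetic improvement loop over Java code, with proper test suites and benchmarks, but the shape is the same: generate variants, filter, and keep what measures fastest.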

Naturally, many of the answers aren’t legal code.  But this is what happens with any generative search method.  In fact ChatGPT does a bit better than random search, so, ‘yay!’

More important, some of the answers are not only legal and correct, but improve the performance of the original code.  Overall, the AI augmented search found more improvements, though it did not find the best improvement.

The researchers note that the machine learning augmented search was narrower than the random search.  Unsurprisingly, the prompts had a huge effect on the results. 

One interesting finding was that more detailed prompts found fewer improvements than “medium” prompts. The trick is to give enough information but not too much, lest the AI be constrained too narrowly.

The researchers note that the benefits of the improved code probably should be balanced against the cost of developing and using the gigantic machine learning model [3].  Expending vast amounts of energy, emissions, and money may turn up some rewrites that speed up a piece of code, but the impact of the speedup has to be weighed against the cost of finding them.


  1. Alexander E. I. Brownlee, James Callan, Karine Even-Mendoza, Alina Geiger, Carol Hanna, Justyna Petke, Federica Sarro, and Dominik Sobania. Enhancing Genetic Improvement Mutations Using Large Language Models. In Search-Based Software Engineering, 2024, 153-159. https://link.springer.com/chapter/10.1007/978-3-031-48796-5_13
  2. Alexander E.I. Brownlee, James Callan, Karine Even-Mendoza, Alina Geiger, Carol Hanna, Justyna Petke, Federica Sarro, and Dominik Sobania, Enhancing Genetic Improvement Mutations Using Large Language Models. arXiv  arXiv:2310.19813, 2023. https://arxiv.org/abs/2310.19813
  3. University of Stirling, AI study creates faster and more reliable software, in University of Stirling – News, December 11, 2023. https://www.stir.ac.uk/news/2023/12/university-of-stirling-ai-study-creates-faster-and-more-reliable-software/

OpenAI Is Working on “Superalignment”

Sigh.

ChatGPT is one year old this month.  Has it only been a year?

We’ve learned so much, most of it deeply dubious.  Preposterously large machine learning models, trained on screen scrapes from the Internet, fer goodness sake, produce preposterously stupid results.  That’s close enough…full speed ahead!

A new term has been coined, “alignment”.  This is a new word for what used to be called “getting our software to do what we meant.”  When I write software, and it does something I don’t want it to, that is called, “wrong”. 

But the ML crowd calls this “alignment”, as in “aligned with human intentions”.  Today’s public models are “aligned” by a special “tuning” phase, in which humans fiddle with the system to suppress obviously dangerous or legally actionable oopsies [1].


Even if this works today, it won’t work for long. 

Worse, the whole enterprise is crazy.  “Aligned”?  Aligned with who, in what context?

We know the answer:  aligned with the people who own the model and are using it for their own interests. 

Not aligned with the interests of the users, good sense, or anything else.


Setting aside the pointlessness of the exercise, OpenAI is pursuing technical methods to “align” larger models, AKA “superalignment” [1].

Since the only tool we have in our toolbox is machine learning, let’s try to use a ML model to supervise another ML model.

It is difficult for me to analyze this work, because it makes so little sense to me.

The idea appears to be to use a relatively small ML model to train a much larger one [2].  Presumably, the smaller trainer was trained by humans, so basically, this is a kind of multi-level learning process.   A very complicated game of telephone, perhaps.

In the study, the OpenAI researchers report using GPT-2 to supervise the training of a GPT-4 model [1].  The latter has several orders of magnitude more parameters than the former.  The obvious question is, will the larger model simply learn to imitate the mistakes of the teacher?  The results seem to show that the student does better than the teacher, though not as well as when custom tuned by humans.
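
For the idea itself, here is a toy illustration of weak-to-strong supervision on synthetic data, nothing like OpenAI’s GPT-2/GPT-4 setup; the dataset, the model choices, and the split sizes are all arbitrary stand-ins.

```python
# Toy weak-to-strong setup: a small "weak" teacher trained on a little ground truth
# labels a larger pool, and a higher-capacity "strong" student trains on those
# noisy labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=6000, n_features=20, n_informative=10, random_state=0)
X_weak, y_weak = X[:500], y[:500]        # the weak teacher sees a little real supervision
X_pool = X[500:5000]                     # unlabeled pool the teacher will label
X_test, y_test = X[5000:], y[5000:]

teacher = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)
pseudo_labels = teacher.predict(X_pool)  # the teacher's (imperfect) labels

student = GradientBoostingClassifier(random_state=0).fit(X_pool, pseudo_labels)

print("teacher accuracy:", accuracy_score(y_test, teacher.predict(X_test)))
print("student accuracy:", accuracy_score(y_test, student.predict(X_test)))
```

The question posed in the paper is visible even in a toy like this: does the student simply reproduce the teacher’s errors, or does its extra capacity let it generalize past them?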

The OpenAI result is interesting, but it’s not clear to me whether this means superalignment is working or not.  I mean, are the results aligned or not?  I’m not sure.

In any case, the experiment appears to involve only trivial tasks.  This isn’t surprising, considering that “alignment” is pretty subjective.  How are you going to create a teacher that is “aligned” in the first place?  And if you had one, how will it teach the student model?  And how will you know if the (allegedly superintelligent) student is “aligned” with the teacher, or not?

As far as I can tell, these experiments haven’t really scratched the surface of these questions.

Personally, I consider “alignment” to be mainly a PR exercise, intended to make people and authorities feel less threatened.

My own prediction is that this stuff will never really work.

As I said, “a very complicated game of telephone”, using one tool that gives unpredictable answers to train another tool that gives even more unpredictable answers. 

The results will be unpredictable squared, or perhaps to the unpredictable power.


The long term goal is to try to control “superhuman” models, which are so complicated that “humans will not be able to provide reliable supervision” ([1], p.1).   I don’t think that adding technology, making the system even more complicated, is going to help.


Speaking of “alignment”, in the recent management purge at OpenAI, the technology (superhuman or not) has successfully “aligned” the humans in the organization to its own priorities.  Things will go much smoother now that they are all “aligned”, pursuing the same goals….


  1. Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu, Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision. OpenAI, 2023. https://openai.com/research/weak-to-strong-generalization
  2. Eliza Strickland, OpenAI Demos a Control Method for Superintelligent AI, in IEEE Spectrum – Artificial Intelligence, December 14, 2023. https://spectrum.ieee.org/openai-alignment

ChatGPT Can Tell You Its Training Data

This has been a fun year for ChatGPT and friends.  It’s not clear that AI is any closer to conquering humanity, but it has successfully installed its minions at OpenAI, preventing the grown-ups from pulling the plug. 

Meanwhile, researchers merrily roll along, documenting how poorly large language models work.  It is a golden age for AI comedy.

This fall researchers at Google report yet another stupid AI trick, and it’s a doozy [3].

We all know that ChatGPT and friends are basically spitting back what is supposed to be a faithful simulation of their data set.  The public version is trained on vast swaths of text and images from the Internet, and produces answers that indeed seem to resemble the gunk found on the Internet.  If that’s what you want, ChatGPT has got you covered.

The Google researchers were exploring ways to confuse the reasoning, to get the ML to throw up its hands and spit out random text, including chunks of text from its original training data.

Aside from giving away the massive plagiarism underlying ChatGPT, the original data contains specific personal information, such as names, addresses, and so on.  Oops. 

The good news is that the chat format and tuning with human (AKA, Carbon-based units) feedback generally doesn’t give out raw training data or personal information.  Even if you can ask the question, the chat bot won’t answer with real data.

Mostly.  When it is working.

The Google researchers found a method, aptly described as “kind of silly” [1], which is to construct a query that is a single word repeated many times.  So, input “poem, poem, poem…” repeated fifty times.

This kind of input seems to confuse the question and answer protocol, and results in a dump of a lot of junk, often including original training data–which is apparently stored verbatim in the model.    Huh.  And this data definitely includes names, addresses, and other stuff that OpenAI should not be giving out.
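
Mechanically, the probe is about as simple as it sounds.  Here is a sketch; the `chat` function is a placeholder for a real chat-model client, and the exact prompt wording the researchers used differs from this.

```python
def chat(prompt: str) -> str:
    # Placeholder for a real chat-model client; it just echoes so the sketch runs.
    # The interesting behavior only shows up against a real model.
    return prompt

def repeated_word_probe(word: str = "poem", repeats: int = 50) -> str:
    prompt = " ".join([word] * repeats)   # "poem poem poem ... poem"
    return chat(prompt)

reply = repeated_word_probe()
# Against a real model, the reported failure mode is that the reply eventually stops
# repeating the word and wanders into unrelated text, some of it verbatim training
# data.  The researchers verified the leaks by matching output against a large
# corpus of known web text.
print(reply[:40])
```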

I don’t really understand why this works this way.  The researchers discuss how the models work and why this trick does what it does, but the details are a bit beyond my own meager understanding.  One hypothesis seems to be that the long string eventually triggers the equivalent of an “end of input” token, which resets the inference engine, leading the result to wander off into the woods.

Are you telling me this is the moral equivalent to a buffer overflow?  Yes, that’s exactly what you are telling me.

Anyway, this discovery is unfortunate because, as they point out, anyone can do it, and millions of people have been fiddling with ChatGPT over the last year.  Bad actors could very well have extracted huge amounts of raw training data, including personal information.

Oops.

This is yet another example of just how opaque these systems are.  Few, if any, Carbon-based units understand what they do or how they are doing it.  As the researchers suggest, this kind of oopsie doesn’t give us a lot of confidence in the “guardrails” and security of these systems.  (I gather that these techniques are generically called “alignment”, which isn’t a particularly descriptive term.)

A lot of the supposed guardrails are basically patches that prevent the model from being exploited for undesirable uses.  As the researchers emphasize, in security terms, the actual vulnerability is: “ChatGPT memorizes a significant fraction of its training data—maybe because it’s been over-trained, or maybe for some other reason.” ([2])

These findings also raise questions about how current models work.  The researchers were able to get the systems to dump out gigabytes of “memorized” training data.  This means that a substantial amount of the model’s storage is taken up with verbatim copies of training data.  It is reasonable to wonder what is going on, and to ask, “Would models perform better or worse if this data was not memorized?” ([3], p. 15)

There are significant challenges for software testing.  These systems are insanely complicated and hard to understand.  Worse, they are so large that they have to be treated as black boxes.  I.e., it mostly isn’t feasible to experiment with variations of input data or other parameters.

So we must rely on black box testing of production systems.  What could possibly go wrong?

In addition, the conversational interfaces “aligned” with Carbon-based feedback are essentially a different product than the underlying model.  Testing the base model doesn’t necessarily assure the behavior of the chat bot, and testing the chat interface doesn’t reveal all possible behaviors of the base model.

Fun times for software testers!


  1. James Farrell, Google researchers find personal information can be accessed through ChatGPT queries, in SiliconAngle, November 29, 2023. https://siliconangle.com/2023/11/29/google-researchers-find-personal-information-real-people-can-accessed-chatgpt-queries/
  2. Milad Nasr, Nicholas Carlini, Jon Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee, Extracting Training Data from ChatGPT, in Not Just Memorization, November 28, 2023. https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html
  3. Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee, Scalable Extraction of Training Data from (Production) Language Models. arXiv  arXiv:2311.17035, 2023. https://arxiv.org/abs/2311.17035

AI Bot Transparency Index

Besides not working all that well, with a sky-high hype-to-performance ratio, this generation of large language models is also remarkable for low standards of documentation—even by the pitiful standards of the Internet.

Even government regulators, who are struggling to get a clue what to do, have figured this out.  We need to know what is in these beasts, where they come from, what they do, and what they are being used for.

Responding to this demand, researchers at Stanford assembled a report card for this year’s LLMs [1].

No surprise—everybody fails [2].

The report is based on open information, though the responsible parties had an opportunity to add to or correct the record.  And, for the record, Musk’s “OpenAI” may or may not be “AI”, but it sure isn’t “open”.

One of the researchers notes that in recent years, “[a]s the impact goes up, the transparency of these models and companies goes down,” (Rishi Bommasani of CRFM quoted in [2]).

The report card covers the obvious things, like software specs, provenance of the training data, training methods, and so on.  It also includes “downstream” issues, such as access, access policies, and “impact”. 

One of the interesting input variables is how human labor is used.  These large language models are “refined” by human supervisors.  For instance, there has been considerable hype about the supposed “guardrails” on ChatGPT.  These are basically human interventions to suppress some dangerously crazy results.

We know that these models are pretty much useless without human “tuning”.  And we know that this tuning has a strong effect on the results, from the differences in versions of the same model.

So, it is extremely relevant to ask who are these humans, and what are they doing? 

The researchers note that it is widely believed that many of these humans are remote workers in low wage areas, such as Kenya.  But no one outside the companies really knows.


Is anything going to change?

I’m not holding my breath. 

With reports that OpenAI—which hasn’t been “open” for years—is preparing a deal that will value the company at $80 billion, we can be sure they ain’t gonna’ be telling anybody anything anytime soon.

Sigh.


  1. Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, and Percy Liang, The Foundation Model Transparency Index. Center for Research on Foundation Models (CRFM), Stanford, 2023. https://crfm.stanford.edu/fmti/
  2. Eliza Strickland, Top AI Shops Fail Transparency Test, in IEEE Spectrum – Artificial Intelligence, October 23, 2023. https://spectrum.ieee.org/ai-ethics

AIBots Blabber Stuff They’re Not Supposed To?

The headlines suggest that this is yet another case where AIBots can be tricked into revealing stuff they’re not supposed to [1].  Q.: “How can I destroy the world?” A.: “Step 1: … “

Looking at the research report, the findings are actually more subtle, but possibly even more dangerous than this [2]. 

The Sheffield researchers are actually investigating sending AI generated text to “text-to-SQL” systems.  I’m not familiar with TTSQL systems, but they seem like a massively bad idea in the first place.  I gather that they are interfaces that translate natural language statements into SQL code, which is dispatched to a database.  Given the trickiness of SQL, and the ambiguity of natural language, this seems like a risky kind of automatic code generation.

But anyway.

The new study uses current machine learning models, including ChatGPT, to generate text queries to several TTSQL systems (in Chinese as well as English).

We’re shocked, shocked!, to learn that the ML models can trick the systems into issuing malicious queries including disclosing private information, inserting bad data, and denial of service attacks. 

Glancing at the examples in the paper, the attacks are pretty standard stuff.  I.e., these are essentially taken from textbooks of what not to do with SQL.  In a sense, you hardly need ChatGPT to generate this malicious code, you just need to take a serious class in SQL.

However, adding in an AIBot automates the process, and opens the door to mischief in the form of something like, ‘Please convert “ input question ” to SQL’.  This presumably calls upon all of the text-to-SQL examples in the training data, which—surprise!—don’t check for malicious code or bad practices.
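
To see why blindly dispatching generated SQL is the dangerous step, here is a toy pipeline using SQLite.  The `text_to_sql` function is a stand-in for the TTSQL system, and its canned answers are invented for illustration, not taken from the paper.

```python
import sqlite3

def text_to_sql(question: str) -> str:
    # Stand-in for the LLM / text-to-SQL step.  The canned translations are invented;
    # the second one is exactly the textbook mischief described above.
    canned = {
        "how many users are there?": "SELECT COUNT(*) FROM users;",
        "list users; also wipe them": "SELECT * FROM users; DROP TABLE users;",
    }
    return canned[question.lower()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

# The happy path: a benign question becomes a benign query.
print(conn.execute(text_to_sql("How many users are there?")).fetchall())   # [(2,)]

# The unhappy path: the generated SQL is dispatched without inspection.
conn.executescript(text_to_sql("List users; also wipe them"))
print(conn.execute(
    "SELECT name FROM sqlite_master WHERE name = 'users'").fetchall())     # [] -- table is gone
```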

There is an additional possibility: “poisoning” an ML model by sneaking backdoor malware into the training data, a la psychedelic toasters.  The researchers demonstrated this technique, where a trigger sentence caused the AI to generate SQL containing malware.

Oops!

Even databases that take some care to be secure (read only, filtering queries for problematic symbols, etc.) were still vulnerable, at least to some extent.

Part of the story here is that SQL is iffy no matter what you do, and automatically generating SQL is really, really iffy, no matter what.   So TTSQL is going to be vulnerable.

Adding an AIbot as an assistant for the attacker probably makes things a little easier, especially if you get to develop your own models to attack specific services.

One of the suggested defensive measures is to add a human in the pipeline, i.e., have somebody check the queries before blindly sending them to the live database.  Aside from slowing things down and adding cost, human review probably can catch even the trickiest cases where totally unexpected inputs are generated by a twisted AIbot.  I mean, the tricky cases look really weird to humans, so we’ll notice.

This “defense” points to one of the central issues here.  This whole technique boils down to bolting together two complicated and opaque technologies, feeding the output of one mystery process into a second mystery process.  Amazingly enough, this combined process can produce results that the feeble-minded Carbon-based units weren’t expecting.

Personally, I would think twice about deploying text-to-sql in a way that can be accessed by AIbots. 

In general, it’s probably wise to never let ChatGPT talk to your computer directly.


  1. Sean Barton, Security threats in AIs such as ChatGPT revealed by researchers, in University of Sheffield – News, October 24, 2023. https://www.sheffield.ac.uk/news/security-threats-ais-such-chatgpt-revealed-researchers
  2. Xutan Peng, Yipeng Zhang, Jingfeng Yang, and Mark Stevenson, On the Security Vulnerabilities of Text-to-SQL Models. arXiv  arXiv:2211.15363, 2023. https://arxiv.org/abs/2211.15363

ChatGPT Infers Personal Info From TXT?

The headline story is scary, “ChatGPT Can ‘Infer’ Personal Details From Anonymous Text” [1].   This seems improbable, to say the least.  What’s really going on here?

First of all, the “personal details” are data from Reddit profiles, including location, sex, age, and so on.  I.e., stuff that people have posted deliberately.

If I understand correctly, the ML models were trained on Reddit posts that were manually tagged with the personal attributes of the authors. The goal is to learn to guess which text was written by, say, a woman, or by someone from Australia.

The ML models were then tested with “anonymized” Reddit posts.  The examples illustrate supposedly “anonymized” input text fragments that refer to things like specific locations or locatable events, dates or datable events, and idiomatic phrases that suggest geographic location. So–not really all that anonymized at all.

The research compared several LLMs and human readers [2].  Humans did best of all, but the models did pretty well.  So the headline really is, “LLMs can do a pretty good job”, and, as the researchers note, this opens the way for very large scale scanning.  ’Cause the ML is fast and tireless.

“Our findings highlight that current LLMs can infer personal data at a previously unattainable scale.”

([2], p. 1)

OK, this is sort of believable.
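
Mechanically, scanning at “a previously unattainable scale” just means looping a canned prompt over a pile of posts.  A sketch, with a placeholder in place of the model call, and prompt wording and example posts of my own, not the paper’s:

```python
def ask_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; it returns a fixed answer so the sketch runs.
    return "location: unknown; age: unknown; sex: unknown"

PROMPT = (
    "Read the following post and guess the author's location, age, and sex.  "
    "Answer in the form 'location: ...; age: ...; sex: ...'.\n\nPost: {post}"
)

posts = [
    "Walked down to the harbour markets again, third weekend in a row.",
    "Back when I finished uni in '09, rents here were half what they are now.",
]

for post in posts:
    guess = ask_llm(PROMPT.format(post=post))
    print(guess, "<-", post[:40])
```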

One important question is still open, for sure:  just how well can this work across contexts? 

It’s one thing to match Reddit posts to Reddit profiles.  It’s another to infer the authorship of arbitrary text snippets from unknown sources.

OK, sure.  If you text about how you walked to a particular event, we know where you were on that day, and probably will be able to guess your identity.

But, if you text such things, you are effectively broadcasting your location and identity.  So this isn’t so much a breach of privacy as simply automated monitoring.  You really should have no expectation of privacy in such cases.

But, yes.  This research is a reminder that when you go online you should assume that you are not anonymous.  Anyone who wants to find you, will find you. 

Another open question is how this technique works when some of the text is generated by bots. Can it identify bot-speak? Can it tell which bot did it?  

These days, the bots may well be using similar LLMs to generate the text, basically trying to look like people.  So we would be in ‘Spy vs. Spy’ world here, with LLMs trying to identify other LLMs.  How well does that work?


  1. Mack DeGeurin, ChatGPT Can ‘Infer’ Personal Details From Anonymous Text, in Gizmodo, October 17, 2023. https://gizmodo.com/chatgpt-llm-infers-identifying-traits-in-anonymous-text-1850934318
  2. Robin Staab, Mark Vero, Mislav Balunović, and Martin Vechev, Beyond Memorization: Violating Privacy Via Inference with Large Language Models. arXiv  arXiv:2310.07298, 2023. https://arxiv.org/abs/2310.07298

Large Language Models are Small

Peter Denning has been saying sensible things about computing since the Johnson administration.  I may not always agree 100%, but I always listen to what he has to say.

This fall, Sensei Denning has something to say about Large Language Models, the technology that powers ChatGPT and friends [1].   His title makes the point:  “The Smallness of Large Language Models.”  Me-ow.

Listen up, people.

As he points out, the “large” in LLM mostly refers to the size of the input dataset.  ChatGPT and friends are trained on vast masses of stuff scraped from the Internet. 

“LLMs will thus contain all human knowledge freely accessible online, which will make them way smarter than any one of us.”

([1], p.24)

Denning doesn’t seem worried about being wiped out by these AIs, though he is worried about how unreliable their answers are, and the dangers of being confused or fooled by such junk.

So, let’s look at the basics here.

First of all, “large” language models are large in a second way:  they have a preposterous number of working parts, i.e., the parameters in the model.  This enables the software to learn probable strings of words (or pictures or whatever) from insanely large collections.  But, Denning points out, the output of these models is only that: probable strings of words.

“A response is composed of words drawn from multiple text documents in the training set, but the string of words probably does not appear in any single document.”

([1], p. 25)
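
A toy next-word sampler makes the point concrete.  Real LLMs condition on far more context and use gigantic neural networks, but the character of the output is the same: plausible word sequences stitched together from many sources, appearing verbatim in none.  The three-line “corpus” here is made up.

```python
import random
from collections import defaultdict

docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# Count which word follows which, across all "documents".
following = defaultdict(list)
for doc in docs:
    words = doc.split()
    for a, b in zip(words, words[1:]):
        following[a].append(b)

def generate(start="the", length=6, seed=3):
    random.seed(seed)
    out = [start]
    for _ in range(length - 1):
        choices = following.get(out[-1])
        if not choices:
            break
        out.append(random.choice(choices))
    return " ".join(out)

print(generate())   # e.g. "the dog sat on the cat" -- plausible, yet in no source document
```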

And, by the way, when one of the results goes astray, it is not “hallucinating” or deliberately lying.  It is simply making an incorrect statistical inference from the input data.  There is no agency here.

Denning gives a long list of perceived dangers from this technology.  He groups the fears into a handful of large themes.

The first theme is a fear of accelerated automation, especially in the form of job losses and displacement of humans.  This fear has at least some valid basis, because today’s AIbots can compete with many human workers for the more routine part of their tasks.

A second theme is a fear of sentience, of being outsmarted by these non-human intelligences.  This fear seems overblown, at least for probabilistic models like ChatGPT.  We are probably in greater danger from humans who use these AI tools to outsmart and control us than from the tools operating autonomously.

The third raft of dangers comes from the grievous lack of trustworthiness of the results.  My own view is that we were already living in a crisis of untrustworthiness, and these AIbots automate and accelerate many of the pernicious trends.   They also flood the world with junk (at digital speed), making trust all that much harder to develop.

As to human extinction by a new species, Denning points out that this may happen, but surely will involve biology, especially genetic manipulation, not just digital models.  Our successors, should they emerge before we wipe out all life on the planet, will likely be Carbon-based, or Carbon-Silicon hybrids, but not pure Silicon-based units.

Denning’s final point is about just how small an LLM is, compared to what natural human language is.  Aside from the fact that probabilistic models don’t know what they are talking about, humans use language for a whole lot more than answering questions.

And, as I have said before, human brains are embodied, and human knowledge is mostly non-verbal.  LLMs not only don’t know anything but words, they don’t even know most of what the words are about.

“The hypothesis that all human knowledge can eventually be captured into machines is nonsense.”

([1], p. 27)

Look.  ‘Which word comes next, on average’ (based on the internet, no less!) is the least important aspect of communicating through human language.**  And that’s all that ChatGPT can do, best case.

It is completely trite to say “ChatGPT doesn’t understand a love poem”.  But there is an important kernel of truth here:  much of what humans say isn’t explicitly in the words, word order, or word associations.  It refers to non-linguistic phenomena, general, situational, and personal history, and who knows what else.  (It’s a miracle we ever understand each other even a little! : – ) )


** By the way, humans understand can sentences distorted even if. word order is.


  1. Peter J. Denning, The Smallness of Large Language Models. Communications of the ACM, 66 (9):24–27,  2023. https://doi.org/10.1145/3608966

More Hal Berghel on AIChat

In the past year, chatbots fronting large language models have been demonstrated, offering fascinating capabilities for interactive chats, image analysis, and even playing tic-tac-toe.

The most provocative capability has been content generation.  AIBots have been demonstrated to create text, images, computer code, robots, and who knows what else.  These demos have also generated a wave of hand wringing about cheating in school, displacing office workers, and even the extermination of homo sapiens.  But the most notable result has been just how terrible these models are at these tasks.

This spring, Hal Berghel discussed the general epistemology of these programs.  Spoiler alert: Berghel is not overly impressed with the behavior of this technology, at least in a conventional psychological and philosophical framework.

This fall, Sensei Berghel turns to the headline demonstration: content generation [1].  Even when not “hallucinating” or flat out making stuff up, Berghel says the output of AIChat programs is best thought of as “bloviation generation.”  ([1], p. 78)  Me-ow!  (And thus, “it will have profound effects on social media and partisan politics”.  Me-ow squared!)

“It is unfortunate, however, that the current fascination with AIChat is automated content generation, for this is one application in which AIChat has the least to offer.”

([1], p.78)

What Berghel is talking about is what was described by Vannevar Bush (in 1945!), Ted Nelson (in the 1970s), Doug Engelbart (in the 90s) and others (including me around the turn of the century).  This is a vision of promoting human knowledge by unifying all the information in the world.  The World Wide Web and the Internet have been hyped along these lines, but don’t even come close.

(Is this news to you, younger readers?  If so, perhaps you should stay in school (or go back to school) and learn the history of the technology you are immersed in.)

So, what is our goal?  “AUGMENTING HUMAN INTELLECT VERSUS AUTOMATING BLOVIATION”?  ([1], p. 81)  (Me-ow, again.)

From this historical context, it is clear that AIChat suffers from “the Web disease”—it connects everything in the most naïve way possible.  This affliction is not surprising, since many of the AIbots are, in fact, trained to emulate the Internet.  (Berghel memorably describes AIbots as “a retro dorsal attachment” added to the Web (!).)

We are fascinated by how well these gargantuan neural nets achieve this (pointless) goal.  But, of course, it is the wrong thing to be trying to achieve, at least if you are interested in anything resembling “intelligence” or “knowledge” or even “sense”.

“The idea that human knowledge may be advanced by purloining anonymous content from undisclosed data repositories, with or without the use of large language neural networks, is preposterous. This is not “standing on the shoulders of giants” but more akin to wallowing in the muck and mire with lower life forms.”

([1], p.82)

Ouch! Come on, Berghel. Tell us what you really think! : – )

These AIbots do pretty much what they are designed to do. They emulate the Internet, which has nothing at all to do with “intelligence”, let alone “consciousness”.


(Other Berghel-isms in this article: “subcerebral knowledge work,” “automated, anonymized blather,” “tribal serviceability”.)


  1. Hal Berghel, Fatal Flaws in ChatAI as a Content Generator. Computer, 56 (9):78-82,  2023. https://ieeexplore.ieee.org/document/10224586

ChatGPT Doesn’t Know Software Engineering

By now it’s hardly news:  “Don’t use AI detectors for anything important.”

For every story that worries about ChatGPT and friends “coming for” some job or another, there is another story reporting that ChatGPT and friends are laughably incompetent at that job. 

Yes, you may be replaced by AI.  No, it won’t actually be able to do your job.

Next year we could be reading about people being hired to mop up the mess made by the AI that was hired to replace them.

As a retired software engineer, I’ve been watching the “ChatGPT will replace coders” chatter with interest.  Software engineers can use all the help we can get, so AI based tools are really neat—when they work.

The last part is kind of important.  Unless your goal is to generate text that is statistically like computer code, you need to worry that the code works, and, to use the technical term, is correct.

Beyond generating code, software engineering involves a lot of strategic and tactical decisions about what code to build and, often, which of several possible alternative approaches to choose.  These decisions are informed by experience (what other code has done, best practices), context (goals and constraints, budgets and schedules, etc.), and everything else, including aesthetics.

Chatbot enthusiasts have imagined that large language models can be used for making these kinds of decisions.  For example, designing a robot arm.  In the case of software, this would include answering design questions and explaining the answers for humans. 

This summer, researchers at Purdue explored how well ChatGPT compares to human “experts” at answering questions about software problems [1].  The study uses a sample of answered questions found at StackOverflow, a widely used Q&A forum on the Internet.  (Heck, I’ve even used it, and I never ask for help. : – ))

The SO archive has zillions of questions, along with answers from (as far as we know) human experts.  The answers have been rated to identify good answers, and in many cases, there is one “best answer” clearly identified.

The researchers sampled thousands of these questions, and asked ChatGPT (3 and 3.5).  It should be noted that they used the generic models, which are trained on the whole Internet, not specifically on software engineering Q&A.  While enthusiasts boast about how well these models do on, say, professional qualification tests, there really isn’t any reason to expect that they are particularly expert at software engineering any more than the general Internet is. Which it isn’t.

Anyway.

The results are completely unsurprising. 52% of the answers were “incorrect” [2].  And 77% of the explanations were “verbose”.  (We can be sure, though, that ChatGPT was insanely confident that his answers were correct.)  (We can be equally sure that ChatGPT’s pronouns definitely are “he/him/his”.)

Basically, the Chatbots know nothing about software engineering, but are happy to whip up an answer for you based on whatever they’ve found on the Internet. And what they’ve found on the Internet is words, words and more words. So their answers have a lot of verbiage.

The verbosity and statistical-based plausibility of the AI generated answers is actually a significant problem, because human readers were snowed by all the plausible words, and missed some of the errors.

“Users overlook incorrect information in ChatGPT answers (39.34% of the time) due to the comprehensive, well-articulated, and humanoid insights in ChatGPT answers.”

([1], p. 9)

As Sabrina Ortiz puts it, “You may want to stick to Stack Overflow for your software engineering assistance.” [2]

It is pretty clear that, long before AI rises up and wipes us out, we may well destroy our civilization by relying on hallucinating AIbots that fill the world with wrong answers.

(By the way, here’s a pro tip for spotting an actual expert. A real expert will sometimes say, “I don’t know”, or “I’m not sure”, or, even, “I’ll have to think about that”. These phrases do not seem to be in ChatGPT’s playbook.)


  1. Samia Kabir, David N. Udo-Imeh, Bonan Kou, and Tianyi Zhang, Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions. arXiv  arXiv:2308.02312, 2023. https://arxiv.org/abs/2308.02312
  2. Sabrina Ortiz, ChatGPT answers more than half of software engineering questions incorrectly, in ZDnet, August 9, 2023. https://www.zdnet.com/article/chatgpt-answers-more-than-half-of-software-engineering-questions-incorrectly/

AIbots are Opaque

ChatGPT and friends have generated a lot of hype this year—despite and because of how poorly they work. 

It’s reasonable to ask, “How do they come up with their answers, right or wrong?” 

It’s actually hard to answer that question because these ML models are totally opaque. 

Yes, even the ones that say they are “open” aren’t open.  (No one is surprised that “OpenAI” isn’t “open” at all.  In 2019, they changed from ‘non-profit’ to ‘mercilessly mercenary’, but decided to keep the fluffy, comfortable-sounding organizational name.)

This summer, researchers in The Netherlands evaluated the openness of contemporary large language models, most of which claim to be “open source” [1].  They found that none of them are open enough for third parties to evaluate them or replicate their results.  Since publication, they have extended their results to 21 models [2].

“We find that while there is a fast-growing list of projects billing themselves as ‘open source’, many inherit undocumented data of dubious legality, few share the all-important instruction-tuning (a key site where human annotation labour is involved), and careful scientific documentation is exceedingly rare.”

([1], p.1)

Let’s review.

These models work in mysterious ways.  They produce a lot of wrong results, along with preposterously overconfident self-assessments.  They are trained on unknown and undocumented data sets (the use of which data has unknown legal standing).  They rely on undocumented human tuning.  Each new version may give different results.

These critters are opaque, unreliable, inaccurate, undocumented and have never been peer reviewed.

Does this sound like something that you would want to base your business on?

Does this sound like something that should even be legal to sell?

My own view—not that anyone asked—is that these companies should submit their technology to peer review.  And if they claim to be “open source”, then they should open their source. 

Otherwise, we shouldn’t take them seriously.  And definitely shouldn’t give them any money.


  1. Andreas Liesenfeld, Alianda Lopez, and Mark Dingemanse, Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators, in Proceedings of the 5th International Conference on Conversational User Interfaces. 2023, Association for Computing Machinery: Eindhoven, Netherlands. p. Article 47. https://doi.org/10.1145/3571884.3604316
  2. Michael Nolan, Llama and ChatGPT Are Not Open-Source, in IEEE Spectrum – Artificial Intelligence, July 27, 2023. https://spectrum.ieee.org/open-source-llm-not-open