This year saw a lot of noise about AI-generated text in many domains. Basically, large machine learning models have one good trick in this area: they can learn to simulate a body of text, including, evidently, computer code. The same technology can be used as an assistant, a glorified “autocomplete” for coding.
These demos and experiments are beginning to paint a consistent picture. Contemporary text-based ML can create pretty convincing fake text, and can roughly match beginner-level programmers.
So, how much does this type of bot offer as an assistant?
Programmers are well aware that a lot of coding is pretty simple-minded. Even tricky and creative code has a lot of boilerplate—that’s how computer code works. Assistants can help a lot, accurately generating the boring stuff and leaving the human more cycles for the non-obvious parts of the problem.
But easier isn’t always better. For example, the ubiquitous availability of spell checkers has improved spelling in text, but has had pernicious side effects. Students rely on their software tools, and do not learn how to spell themselves. This results in unconsciously comical mistakes, when the computer guesses wrong and the human doesn’t know the correct spelling. There is no their, they’re.
This kind of mistake is no joke if the text is supposed to be code. If it runs at all, it might do entirely the wrong thing. And it might be really hard to tell that it’s broken.
This winter, researchers at Stanford examined the behavior of programmers with and without a machine-learning-based coding assistant [1]. They were interested not in the general quality of the code, but in its security. Correctness is necessary, but not sufficient, for secure code. In fact, most security breaches come from “correct” code that has unwanted side effects.
No one should be surprised at the basic finding: “Code-generating AI can introduce security vulnerabilities” [2]. Considering how easy it is to write code that has vulnerabilities, it’s not surprising that AI can match that dubious achievement!
I mean, many security vulnerabilities are created by novice programmers, which is about the level of competence of an AI coder.
In the study, some programmers had access to an ML-based assistant that could be queried for suggested code. The suggestion could be incorporated into their answer as given, or modified by the human. The control condition had no ML assistant.
The result, as noted, was far more security goofs from the programmers who used the assistant.
In part, this was due to the assistant returning “correct” but weak or incomplete answers. Many security vulnerabilities involve just these kinds of mistakes: using default parameters, leaving out rarely used options, or ignoring seemingly trivial details, any of which can lead to flaws that can be exploited.
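To make that concrete with an example of my own (not one from the paper): in Python, the convenient, “correct”-looking way to generate a token is also the predictable one.

```python
# A sketch of a "correct but weak" answer: both functions return a token,
# but the convenient default is not cryptographically secure.
import random
import secrets
import string

ALPHABET = string.ascii_letters + string.digits

def weak_token(n: int = 32) -> str:
    # Works, and looks fine in a demo, but `random` is a predictable
    # pseudo-random generator, so tokens can be guessed.
    return "".join(random.choice(ALPHABET) for _ in range(n))

def strong_token(n: int = 32) -> str:
    # Same interface, but backed by the OS's secure randomness.
    return "".join(secrets.choice(ALPHABET) for _ in range(n))
```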
Other security problems come from careless use of “correct” features, especially when manipulating text or data. It’s easy to whip together an SQL query; it’s hard to do it safely. And the same code that is OK in one context could be highly risky in another.
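A minimal sketch of the SQL point, using Python’s built-in sqlite3 module (the table and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

name = "alice' OR '1'='1"  # hostile input a user might supply

# Risky: string formatting splices the input straight into the SQL,
# so the hostile input changes the meaning of the query (SQL injection).
unsafe_query = f"SELECT email FROM users WHERE name = '{name}'"
print(conn.execute(unsafe_query).fetchall())  # returns rows it shouldn't

# Safer: a parameterized query keeps the input as data, not SQL.
safe_query = "SELECT email FROM users WHERE name = ?"
print(conn.execute(safe_query, (name,)).fetchall())  # returns nothing
```

Both versions are “correct” for friendly input; only one survives an unfriendly user.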
Another part of the problem was that programmers overestimated the quality of the suggested code, accepting it without sufficient checking or analysis. This is clearly seen in the self-reports: programmers who used the assistant believed that their code was more secure than that of unassisted programmers. In fact, it was less secure.
They also note that programmers who used the suggested code without modification, and who did not fiddle with the settings on the assistant, were the most likely to make mistakes. I.e., the more you trust the tool, the more likely you are to produce security vulnerabilities.
Intuitively, I would note that experienced programmers tend to be pretty paranoid about security of their code. We know that we have to be very careful, check everything we can check, and test carefully. Because we know we make mistakes, and the mistakes will be mercilessly exploited.
To the degree that an automated assistant gives us more confidence and reduces our paranoia, it’s going to lead to more errors.
Can an ML assistant learn more secure programming?
That’s actually an interesting question. To the degree that the answers need to be more “paranoid”, can ML learn to generate “paranoid” code? I’m not sure what a corpus of secure code examples would look like, but blindly using GitHub as the “gold standard” probably isn’t a great idea.
Maybe this involves narrowing the “correct code” to certain safe patterns, and also adding “unnecessary” extra checks. But how much of this is generic patterns, and how much is context-specific? ML can learn complex and subtle context, but can we get good samples to train from?
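As a small illustration of the kind of “paranoid” pattern I have in mind (the directory, limits, and function name are all invented for this sketch):

```python
import os

ALLOWED_DIR = "/srv/app/uploads"  # hypothetical application directory

def read_upload(filename: str) -> bytes:
    # "Unnecessary" checks a paranoid programmer adds even when the
    # caller is supposed to be trusted: reject empty or absurd names.
    if not filename or len(filename) > 255:
        raise ValueError("bad filename")
    # Resolve the path and make sure it stays inside the allowed
    # directory, defeating '../' traversal and absolute-path inputs.
    path = os.path.realpath(os.path.join(ALLOWED_DIR, filename))
    if not path.startswith(ALLOWED_DIR + os.sep):
        raise ValueError("path escapes upload directory")
    with open(path, "rb") as f:
        return f.read()
```

None of those checks are needed for the code to “work”; they only matter when someone is trying to break it.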
I’ll note that part of the “paranoia” of experienced programmers is based on implicit assumptions about the behavior of users and adversaries. These concepts inform estimates of risk and the potential value of countermeasures. But they don’t show up in the code or specifications, which is what the AI is learning from.
The researchers recommend giving the users more options, dials, etc., to control the behavior of the assistant. They also suggest better prompting to, you might say, make the user less certain of the answer. If the assistant acts more uncertain, this will force users to check its work more carefully.
I’ll note that some of the security vulnerabilities discussed in the paper might have been flagged by an assistant that learned to analyze certain code constructs. There is a huge body of practice for, say, using SQL safely, which ML could certainly learn.
Another example in the paper involved the use of a cryptographic library. An experienced programmer knows that you need to look up and study the use of these libraries, to make sure you don’t leave out any step or cross-check. An ML assistant could certainly provide messages to “read the manual” when security-critical libraries are called.
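To illustrate the point (this is my sketch, not the paper’s example), a high-level construction from the third-party `cryptography` package, assuming it is installed, leaves fewer steps to forget than assembling a cipher by hand:

```python
# Fernet bundles key generation, a random IV, and authentication,
# so there are fewer individual steps to leave out or get wrong.
from cryptography.fernet import Fernet

key = Fernet.generate_key()           # must still be stored and managed securely
f = Fernet(key)

token = f.encrypt(b"attack at dawn")  # authenticated ciphertext
plaintext = f.decrypt(token)          # raises InvalidToken if tampered with
assert plaintext == b"attack at dawn"
```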
And, for my money, I’d be happy to have a testing assistant that is trained to suggest security tests. Wherever the code comes from, better testing will improve it.
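As a sketch of what such a suggestion might look like, here is a pytest-style test; `lookup_email` is a stand-in defined inline to mirror the parameterized query sketched earlier, where in a real project it would be imported from the application code:

```python
import sqlite3
import pytest

def lookup_email(name: str) -> list:
    # Stand-in for the application's query function.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")
    return conn.execute(
        "SELECT email FROM users WHERE name = ?", (name,)
    ).fetchall()

HOSTILE_NAMES = [
    "alice' OR '1'='1",              # classic SQL injection probe
    "alice'; DROP TABLE users; --",  # stacked-statement attempt
    "a" * 10_000,                    # absurdly long input
]

@pytest.mark.parametrize("name", HOSTILE_NAMES)
def test_lookup_rejects_hostile_input(name):
    # Hostile lookups should return no rows, and must not raise a
    # database error from mangled SQL.
    assert lookup_email(name) == []
```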
- [1] Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh, “Do Users Write More Insecure Code with AI Assistants?”, arXiv, 2022. https://arxiv.org/abs/2211.03622
- [2] Kyle Wiggers, “Code-generating AI can introduce security vulnerabilities, study finds”, TechCrunch, December 28, 2022. https://techcrunch.com/2022/12/28/code-generating-ai-can-introduce-security-vulnerabilities-study-finds/