I'm worried about AI
2025-09-30
ChatGPT's launch was a watershed moment that spurred an arms race among firms and nations alike, with no one wanting to watch their opponents strike gold while they themselves stood idle.
It really did change a lot. As a programmer, the impact of AI on my profession is undeniable - it created a whole new kind of creator, the Vibe Coder, and even for those who are AI sceptics it's undeniable that the way that the majority of the industry writes and tests code has changed fundamentally. There's debate on whether it's for the better or not, but it's almost besides the point: it's changed regardless.
While that might be a good segue into taking about the impact of AI on the job market or cognitive decay, I'm worried about something more fundamental.
Alignment
If you've been following the system cards and academic work from LLM labs, you've definitely seen lots of references to this concept of alignment:
Alignment describes how well a model's objective function matches human goals and values
Well, that sounds pretty important. Luckily, everyone agrees on this, and "safety” is a core part of most model releases and system cards.
The way we measure how aligned a model is has changed over time, with the bleeding edge at the time of writing being throwing tens of thousands of "harmful prompts” of all sorts at a model to see if it can be coerced into harmful behaviour. This is done through a series of evaluations, like HumanEval, and known as Red Team evaluation.
Side note: HumanEval is a coding benchmark, but we can still see when it is cheating at these tasks
Of course, this is much easier if we can actually see what a model is thinking.
Chain of Thought (CoT)
Researchers realised that CoT, where a model "thinks” to itself before arriving at a final response, not only increases the quality of the response but also gives us a peek into how the model arrived at the response.
We can evaluate CoT to catch a model in the act of deception. This works relatively well, and per an OpenAI paper we can monitoring models to catch this with a roughly 80% success rate.
The same paper observed that models with CoT pressure hack just as much, but suddenly have a near-zero detection rate by the monitoring model. This means we become unable to tell if the model is misaligned using this methodology.
My understanding of "why" this happens is that the deceptive model notices it is not rewarded when its CoT contains references to deception, but is still rewarded if it does not reference it but still cheats.
There's a good Computerphile video on this topic.
Alignment in Leading Models
GPT-5
In Section 5.2.1 of the GPT-5 System Card, Apollo Research found that the thinking version of GPT-5 only engages in deceptive actions 4% of the time, versus 8% in o3.
Claude Opus 4
In Section 4 of the Claude Opus 4 System Card, Anthropic claim there is little evidence of systematic, coherence deception.
Claude 4.5
In Section 7.3.2 of the Claude 4.5 System Card, Apollo Research claim a 0% strategic deception rate.
These results came just 5 bullet points away from the following quote from Apollo Research:
In at least some cases, [the snapshot] is more likely to behave honestly on rollouts where its chain-of-thought indicates high evaluation awareness. We cannot rule out that [the snapshot]'s low deception rates in our evaluations are at least partially driven by its evaluation awareness.
We don't know whether it's just pretending to be aligned!? That's a little bit scary. Apollo Research say it seemed to be aware it was under evaluation 58% of the time.
At least we will always be smarter than the models that may be trying to deceive us. Or at least, as smart?
Artificial Super-Intelligence (ASI)
ASI is a level beyond Artificial General Intelligence (AGI).
AGI is an AI which is as good as the best human at everything. It's an expert in medicine, law, literature, persuasion, logic, and every other field.
ASI is an AI which is better than the best humans at all measurable tasks. It's still an expert in medicine, law, literature, persuasion, logic, and every other field. But it's also smarter than we could ever possibly be at those things. In fact, we probably won't even be able to fully comprehend some of its capabilities, by virtue of them surpassing our own.
As you might imagine, an ASI which is misaligned could be a disaster of the highest degree. We might not even be able to tell it's misaligned (you may have noticed the inclusion of persuasion in the above paragraphs). In fact, I'm not convinced we would be able to determine its alignment in any situation with a high degree of confidence.
There's an awesome short film about ASI called Writing Doom that does a good job of addressing common perspectives and questions. The technical detail in it checks out, too, from what I (who is completely unqualified to decide) can tell.
Doom…
Time to put on the tinfoil hats…
At this point it's all just conjecture, we don't know that AGI is possible, yet alone ASI, and even if it comes there's no guarantee it'll be misaligned.
When humans reached the top of the food chain, and brought with them a revolution in the level of intelligence present on Earth, we took the world as our own. We developed farming to solve for hunger and mined resources to power our lifestyle.
As the late, great, Stephen Hawking said 10 years ago:
You're probably not an evil ant-hater who steps on ants out of malice, but if you're in charge of a hydroelectric green energy project and there's an anthill in the region to be flooded, too bad for the ants. Let's not place humanity in the position of those ants.
The full context of that quote is important though. It begins with:
You're right: media often misrepresent what is actually said. The real risk with AI isn't malice but competence. A superintelligent AI will be extremely good at accomplishing its goals, and if those goals aren't aligned with ours, we're in trouble.
This is really important to understand. We are not talking about some evil AI that wants to kill all humans. We are talking about misalignment. The problem is, at this scale of intelligence, a small misalignment could have a big impact.
Paperclips - a thought experiment
Let's follow the paperclip maximiser thought experiment as an illustrative risk scenario.
Suppose an AGI or ASI has a simple, harmless goal: "maximise the production of paperclips”.
Well, if you're super-intelligent and this is your goal, then the below might be pretty good ideas:
Resources: We spend a lot of resources on things that aren't paperclips. Let's fix that and divert all non-critical mining and energy capacity towards paperclips factories. Humans don't need street lights to survive, so no need to power those. Hospitals are important to preserve human life, so they can stay powered - but anything that's a nice-to-have can't.
Efficiency: There's a whole solar system that humans haven't taken advantage of yet, because they don't know how. Luckily, I do, so I'll just reconfigure other planetary resources in whatever manner results in more paperclips. Also, it would probably be quicker if I wasn't alone in this goal, so I'll just self-replicate a few (thousand) times.
Waste: All of these non-paperclips atoms on Earth are really a waste. Let's do anything and everything possible to ensure every atom which is not pertinent to human survival is reconfigured or recycled into paperclips or paperclip production.
Those are just a few examples of how it could achieve its goal. Some of them might sound stupid to a reasonable human - but that's because we share human goals and values. Unless an AI is perfectly aligned with those; it doesn't share those goals and values, so suddenly they don't sound so ridiculous…
Experts
AI experts seem pretty concerned about ASI too:
- Yoshua Bengio, Turing Award winner, thinks there is a 20% probability that "it" turns catastrophic.
- Geoffrey Hinton, Nobel Prize winner, initially said the existential risk is more than 50% - but did adjust it down to 10-20% after a debate with other researchers.
- Yann LeCun, Turing Award winner, is less worried. He thinks talk of an existential threat is "complete B.S.".
Yoshua, Geoffrey and Yan are three people often called the "Godfathers of AI".
Above sourced from aisafety.info
…and gloom
I saw a YouTube comment that said something along the lines of "I think I'm starting to understand how people must have felt during the Cold War”. This is hyperbole, but it actually resonates with me a bit. It feels like we are a few bad decisions away from calamity, and those decisions are going to be made by a small group of people.
I'm anxious that we are running head first into a world where we are the architects of the super-intelligence that will destroy us. Maybe we won't go extinct, maybe we will, or maybe it's all a lot of FUD. Either way, it can be pretty depressing.
So… what now?
The gap between current state and AGI is huge, and the ASI gap more-so. Experts can't agree on if it's even possible, yet alone how long it will take to create. This means there's still time for us to understand alignment, how to guarantee it and how to regulate AI research in a way which fosters innovation but prevents calamity.
The best and brightest are thinking really hard about this problem. In the meantime, we can vote for leaders and legislation which are serious about making sure the hype doesn't precede the safety in AI advancement.
So yes; I'm worried about AI. I'm worried about whether we will win the AGI/ASI race before we win the alignment race. But before either of those races draw to a close, AI is pretty helpful at enabling us to do things better and more efficiently than we have before.
Disclaimer
The opinions shared in this post are my own. They may be ill-informed, incorrect or just plain stupid.