The AI Olympics Must Be Stopped

“Mirror, mirror on the wall, who’s the smartest AI of them all?” This, stripped of all its technical jargon and venture-capitalist bravado, has become the defining question of our age. A frantic, global race is on. Breathless articles are published, fiendishly complex tests are devised, and leaderboards are updated with the solemnity usually reserved for papal succession. We are all meant to watch, awestruck, as these magnificent silicon brains compete for the gold in the Great AI Olympics.

There’s just one problem: the entire spectacle is a farce. The games are rigged, the judges are confused, and the events themselves are contrived to measure everything except what actually matters. While the world’s tech media obsesses over crowning a champion, they’ve neglected to ask if the sport itself makes any sense. It’s time to stop cheering for a moment and point out that the emperors of AI have no clothes—or at the very least, that they are competing in events as absurd as they are irrelevant. It’s time to show them a better path.

The Arms Race for an Unbeatable Exam

The first event in our circus is the standardized test, an academic arms race born of wounded pride. It turns out that after researchers fed their AI models the sum total of human knowledge, the models got annoyingly good at passing tests based on that knowledge. Having been outsmarted by their own creations, who had, in essence, “crammed for the final,” the researchers were forced back to the lab to devise new, “uncheatable” exams.

This has led to a heroic, if deeply comical, escalation. We now have benchmarks like ZeroBench and EnigmaEval, collections of puzzles so sadistically complex that today’s most advanced AIs score a literal zero. Then there is the humbly titled “Humanity’s Last Exam,” a trivia night for demigods, featuring questions on the number of tendons supported by a tiny bone in a hummingbird’s tail and the translation of Palmyrene script from a Roman tombstone. The goal, it seems, is to create a test so esoteric that no machine could possibly answer correctly—a noble pursuit that completely ignores the fact that no human could, either.

The futility of this arms race was perfectly illustrated by ARC-AGI, a non-verbal reasoning test first introduced in 2019 and relaunched as a prize competition in 2024 precisely because it remained punishingly difficult, which was effectively “solved” by a new OpenAI model within six months of that relaunch. Undeterred, its creators released ARC-AGI-2 and have reportedly already begun work on version 3. It is a thrilling contest of man versus machine, where man keeps moving the goalposts while the machine simply evolves beyond the need for a playing field at all.


The AI Committee to Evaluate Itself

For those who find the standardized test too straightforward, a bold new methodology has been proposed: a system where AIs grade each other. This is a brilliant solution for anyone who feels the main problem with AI evaluation is the inconvenient presence of a human user. The process, known as “LLM Peer Grading,” is a masterclass in recursive absurdity. A human user, now relegated to the role of an unpaid intern, asks one AI to invent a prompt. The human then dutifully copies this prompt and pastes it into five other AIs. Finally, the human gathers the responses and feeds them back to the original AI, which then graciously grades the work of its peers.
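For the morbidly curious, the whole ritual fits in a few lines. The sketch below is only an illustration of the loop as just described, not anyone’s published code: ask(model, prompt) is a hypothetical stand-in for whatever chat window or API the human intern happens to be juggling, and the model names are placeholders.

```python
# A sketch of the "LLM Peer Grading" loop, written out so every round-trip is
# visible. ask() is a stub; in the real workflow the human performs each call
# by copy-and-paste.

def ask(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its reply."""
    return f"[{model}'s reply to: {prompt[:40]}...]"

HOST = "host-model"                                  # invents and grades the exam
PEERS = ["model-a", "model-b", "model-c", "model-d", "model-e"]

# Step 1: the human asks the host to invent a question.
question = ask(HOST, "Invent a challenging question to test other AIs.")

# Step 2: the human copies the question into each of the five peer models.
answers = {peer: ask(peer, question) for peer in PEERS}

# Step 3: the human pastes everything back so the host can grade its peers,
# a verdict exactly as reliable as the host itself.
transcript = "\n".join(f"{peer}: {reply}" for peer, reply in answers.items())
print(ask(HOST, f"Your question was:\n{question}\n\nGrade these answers:\n{transcript}"))
```

Three hops, six models, and one human courier later, you have a grade whose accuracy depends entirely on the model least qualified to give it.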

This Rube Goldberg machine of evaluation is not only cumbersome, but it also collapses under the slightest pressure. In a moment of unintentional hilarity, the very author proposing this system demonstrated its fatal flaw with a simple math problem about a snail in a well. When he tasked a “host” AI that was bad at math with creating and grading a math problem, the results were chaos. The host asked a nonsensical question and then couldn’t properly evaluate the answers. The author correctly concluded, “The evaluation proves impossible when there is a correct answer, but the LLM asking the question doesn’t know it.” After this stunning indictment of his own system’s integrity, he then, without a hint of irony, recommended that his readers continue to use it.

The Great Wall of Tools

Perhaps the most common event in the AI Olympics is the simple listicle. These exhaustive catalogs of “The Best AI Tools” present the digital landscape with the elegant simplicity of a shopping list, bravely ignoring the messy reality of how people actually work. One recent, painfully long example listed 55 tools across 25 distinct categories, a monument to decontextualized comparison.

The central irony is that the author of this list repeatedly praises individual tools for their seamless integration. Gemini is lauded for its fit within the Google ecosystem; Hiver for its use inside Gmail; and Asana for its connections to Slack. Yet the very structure of the article—a siloed list—renders this crucial insight invisible. While readers wade through this exhaustive catalog, a tool like Grammarly might be quietly fixing their typos directly in their browser—a miracle of integration that the article’s format makes impossible to appreciate. This “toolbox” fallacy encourages us to think about software as a collection of individual hammers and saws, when in reality the power lies in building an interconnected workshop where the tools talk to each other.

The IQ Test for Toasters

Just when the discourse threatened to become nuanced, a hero arrived with a single, beautiful number: the IQ score. In the most stunningly reductive entry into the AI Olympics yet, one outlet ranked top AI models by their performance on the Mensa Norway IQ test. This approach finally allows us to settle the debate with the scientific rigor of applying a human intelligence test to a silicon-based matrix of predictive algorithms.

OpenAI’s o3 model, we are told, scored a 135, placing it firmly in the “genius” category. But the article’s accidental stroke of genius was in its own data. It revealed a massive gap between the high-scoring text-only models and the low-scoring multimodal “vision” models. Rather than drawing the obvious conclusion—that these are different, specialized tools and that a linguistic test favors a linguistic model—the article blandly concludes that AI is just better at words than pictures. It stumbled upon the single most important truth of the entire debate—specialization—and, in its breathless race to declare a winner, ran right past it.


A Better Path

The great flaw in the AI Olympics is the question being asked. “Which AI is best?” is a meaningless query. It is a question for spectators, not for users. The better question, the only question that actually matters, is this: “Which AI is most useful for my specific tasks in my specific workflow?”

The path to answering this was hinted at in the very first source we examined. Researcher Simon Willison proposes a brilliantly simple method: keep a personal list of the queries and tasks that current AIs fail at for you. When a new model is released, test it against your personal list of failures. This approach is user-centric, task-oriented, and ruthlessly practical. It cares nothing for leaderboards or IQ scores. Its only benchmark is “Does this solve my problem?”
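What might that look like in practice? The sketch below is purely illustrative: the ask(model, prompt) helper stands in for whichever model interface you actually use, and the example prompts are invented placeholders to be replaced with your own failures.

```python
# A minimal sketch of a personal "failure file" benchmark in the spirit of
# Willison's suggestion. ask() is a stub; swap it for whatever API or chat
# client you actually use, and fill the list with your own stubborn prompts.

def ask(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its reply."""
    return f"[{model}'s reply to: {prompt[:40]}...]"

# Prompts that past models have failed at *for you*, each with a note on what
# a good answer would have to contain. (These examples are invented.)
MY_FAILURES = [
    {"prompt": "Summarize this 40-page vendor contract and flag unusual clauses.",
     "good_answer_must": "mention the one-sided indemnification clause"},
    {"prompt": "Write a SQL query joining orders to refunds by fuzzy email match.",
     "good_answer_must": "handle NULL emails without crashing"},
]

def retest(model: str) -> None:
    """Run every past failure against a newly released model and eyeball the results."""
    for case in MY_FAILURES:
        reply = ask(model, case["prompt"])
        print(f"PROMPT:   {case['prompt']}")
        print(f"LOOK FOR: {case['good_answer_must']}")
        print(f"REPLY:    {reply}\n")
        # The only benchmark that matters: does this solve your problem?

retest("new-model-of-the-week")  # hypothetical model name
```

No leaderboard, no IQ score, no committee of silicon peers: just your own unsolved problems, re-asked every time a new challenger appears.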

The goal is not to find the “smartest” AI so we can marvel at its intellect. The goal is to find the most useful tools to build better things, work more efficiently, and solve real problems. It’s time to stop being spectators at a surreal sporting event and become architects of our own digital workshops. The AI Olympics must be stopped.

