AI is copying, not thinking—The illusion of reasoning

Current large language models (LLMs) such as GPT-4o, Llama, Phi, Gemma, and Mistral do not engage in genuine reasoning. Instead of arriving at conclusions through original logical processes, these models merely replicate reasoning steps they have observed during their training.

This means that when faced with new or slightly altered problems, they don’t think independently or analytically. Instead, they fall back on patterns that were fed to them in their training data, imitating a thought process without actually understanding it.

This limitation becomes especially evident in tasks that require critical thinking or novel solutions. AI systems do well when tasks closely resemble their training examples, but they falter when asked to reason about unfamiliar or slightly altered information. This lack of genuine reasoning restricts their usefulness in real-world, unpredictable environments where logical thought is key.

How a tiny change can break its logic

A major challenge with today’s AI models is their vulnerability to small changes in the way a query is worded. Researchers found that even minimal modifications in phrasing could lead to greatly different answers, illustrating how brittle their “reasoning” is.

For example, when asked a mathematical question or a query involving multiple clauses, AI models can struggle to maintain accuracy.

Complex queries tend to exacerbate this issue. As the number of clauses in a question increases, performance deteriorates rapidly.

This is particularly true for mathematical reasoning, where the models’ grasp of logic is weakest. In one test, GPT-4o achieved a high accuracy rate of 94.9%, but that accuracy plummeted to 65.7% when the problem was made more complex by adding irrelevant statements.

Sensitivity to complexity raises concerns about the reliability of AI when faced with intricate, multi-step tasks in fields that demand precision, such as finance or law.
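
To make that failure mode concrete, here is a minimal sketch of the kind of probe behind the drop described above: the same word problem with and without an irrelevant clause. The wording paraphrases the widely cited kiwi example from Apple’s study, and the `ask_model()` stub is a hypothetical placeholder for whichever model client you use.

```python
# Two versions of the same word problem: one plain, one with an irrelevant
# clause. The wording paraphrases the widely cited kiwi example; ask_model()
# is a hypothetical stub, not a real client.

BASE = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "On Sunday he picks double the number he picked on Friday. "
    "How many kiwis does Oliver have?"
)

WITH_DISTRACTOR = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "On Sunday he picks double the number he picked on Friday, "
    "but five of them are a bit smaller than average. "
    "How many kiwis does Oliver have?"
)

EXPECTED = 44 + 58 + 2 * 44  # 190 in both cases; the extra clause changes nothing

def ask_model(prompt: str) -> int:
    """Hypothetical placeholder: wire in your own LLM client and parse its answer."""
    raise NotImplementedError

if __name__ == "__main__":
    for label, prompt in [("base", BASE), ("with distractor", WITH_DISTRACTOR)]:
        try:
            answer = ask_model(prompt)
            verdict = "correct" if answer == EXPECTED else f"wrong ({answer})"
        except NotImplementedError:
            verdict = f"expected answer: {EXPECTED}"
        print(f"{label}: {verdict}")
```

A model that reasons about the arithmetic returns 190 in both cases; the Apple team reported that models frequently subtract the five “smaller” kiwis, because the clause resembles a deduction seen in training data.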

AI only sees patterns, but it can’t really understand

A fundamental weakness of current LLMs is their reliance on pattern matching, which lets them generate answers that seem logical without truly understanding the context or meaning behind the information.

When these AI models “solve” problems, they are essentially recognizing and reproducing patterns from their training data rather than reasoning through the problem in a meaningful way.

Researchers have pointed out that this approach can lead to superficial understanding. AI models convert input into operations without fully grasping the nuances of what they are processing.

This means that while they may provide correct answers under simple conditions, they are prone to errors when a deeper understanding of context, logic, or subtlety is required. For example, AI can answer questions by replicating formulas but fails when asked to adapt to new or unexpected problem structures.

Can AI really reason? Apple’s new test gives us answers

Apple’s research team introduced GSM-Symbolic, a benchmarking tool designed specifically to test the reasoning capabilities of AI systems beyond the limitations of simple pattern matching.

Existing AI testing methods often fail to assess an AI’s capacity to apply logical reasoning to novel problems, as these models typically rely on reproducing observed patterns rather than understanding.

With GSM-Symbolic, Apple aims to push the boundaries of AI evaluation by creating more complex and nuanced tests—measuring an AI’s ability to perform logical reasoning rather than just finding patterns in data.

The goal is to make sure AI systems are tested in a way that reflects real-world challenges, where pure pattern recognition is insufficient for making sound decisions.
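
The core idea is easy to picture. The sketch below is a loose illustration, not Apple’s actual benchmark code or data: it turns one grade-school problem into a template and samples many surface variants whose underlying arithmetic is identical. A model that genuinely reasons should score the same on every variant, while a pattern matcher’s accuracy fluctuates from draw to draw.

```python
# Illustrative GSM-Symbolic-style templating: one problem, many surface variants
# with identical logic. The template, names, and value ranges are assumptions,
# not the benchmark's real data.
import random

TEMPLATE = (
    "{name} buys {n_packs} packs of pencils with {per_pack} pencils in each pack. "
    "{name} gives away {given} pencils. How many pencils does {name} have left?"
)

NAMES = ["Sofia", "Liam", "Priya", "Mateo"]

def sample_instance(rng: random.Random) -> tuple[str, int]:
    """Draw one surface variant plus its ground-truth answer."""
    n_packs = rng.randint(2, 9)
    per_pack = rng.randint(5, 20)
    total = n_packs * per_pack
    given = rng.randint(1, total - 1)
    question = TEMPLATE.format(
        name=rng.choice(NAMES), n_packs=n_packs, per_pack=per_pack, given=given
    )
    return question, total - given

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = sample_instance(rng)
        print(question, "->", answer)
    # Scoring a model across many such draws reveals how much its accuracy
    # depends on surface details rather than on the underlying logic.
```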

Gary Marcus: AI can’t be trusted to think logically

Logical fails and why AI can’t handle real-world complexity

Gary Marcus, a leading AI critic and professor emeritus at NYU, has been vocal about the inconsistencies in AI reasoning. He highlights the issue of logical consistency, pointing out that minor, irrelevant changes in input can produce vastly different outputs. This inconsistency makes it difficult to trust AI in situations requiring reliable decision-making.

Marcus points to an Arizona State University study showing that LLM performance declines as problem complexity increases, a reason for caution when using AI for more advanced tasks, since their ability to handle complex, real-world problems is far from proven.

AI can’t even play chess without breaking the rules

Further evidence of AI’s limitations comes from its performance in seemingly simple but rule-based tasks like chess. Despite having seen vast amounts of chess data during training, LLMs often make illegal moves, another indicator of their lack of true logical reasoning.

The inability to follow rules consistently in a structured environment like chess, where legality is trivial to check automatically (see the sketch below), further calls into question the readiness of LLMs for high-stakes real-world applications.
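
Checking legality is exactly the kind of guardrail a human-supervised pipeline can automate. The sketch below uses the open-source python-chess library to reject an illegal or malformed move before it is accepted; `get_model_move()` is a hypothetical placeholder for a call to an LLM.

```python
# Validate a model-proposed chess move before accepting it.
# get_model_move() is a hypothetical stub standing in for an LLM call.
import chess

def get_model_move(fen: str) -> str:
    """Hypothetical stand-in for asking an LLM for a move in UCI notation."""
    return "e2e4"  # placeholder answer for the starting position

board = chess.Board()  # standard starting position
proposed = get_model_move(board.fen())

try:
    move = chess.Move.from_uci(proposed)
except ValueError:
    move = None

if move is not None and move in board.legal_moves:
    board.push(move)
    print("accepted:", proposed)
else:
    print("rejected illegal or malformed move:", proposed)
```

The point is not that the check is hard; it is that the model cannot be relied on to respect the rules on its own.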

AI is just a tool and still needs a human brain

LLMs, despite their weaknesses, are still highly accurate when applied to simpler, well-defined tasks. GPT-4o, for example, delivers a 94.9% accuracy rate in straightforward problem-solving scenarios, showing that AI can be an excellent tool for augmenting human decision-making.

This accuracy, however, declines as complexity rises, making it key for humans to oversee its application.

Human oversight is especially important in making sure the AI is not derailed by irrelevant information or complex logic. By using AI as an adjunct tool rather than a standalone system, businesses can leverage its strengths in routine or data-heavy tasks while mitigating its weaknesses.
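
One way to operationalize the adjunct-tool idea is a simple gate that routes complex queries to a person instead of the model. The sketch below is only an illustration under assumed names and thresholds: it uses a crude clause count as a complexity proxy, and `run_llm()` and `enqueue_for_human()` are hypothetical stubs.

```python
# A crude complexity gate: simple queries go to the model, complex ones to a
# human reviewer. Threshold, heuristic, and stubs are illustrative assumptions.
import re

MAX_CLAUSES = 4  # assumed threshold; tune against your own error data

def clause_count(query: str) -> int:
    """Very rough proxy for complexity: count comma- and conjunction-separated clauses."""
    return len(re.split(r",|\band\b|\bbut\b|\bif\b", query))

def run_llm(query: str) -> str:
    return f"[model answer to: {query!r}]"          # hypothetical model call

def enqueue_for_human(query: str) -> str:
    return f"[queued for human review: {query!r}]"  # hypothetical review queue

def handle(query: str) -> str:
    if clause_count(query) > MAX_CLAUSES:
        return enqueue_for_human(query)
    return run_llm(query)

print(handle("What is our Q3 travel budget?"))
print(handle("If travel rises 10%, and software falls 5%, but headcount grows, "
             "and the office lease renews, what happens to total spend?"))
```

In practice the gate could use whatever signal your own error analysis shows is predictive; the clause count here simply echoes the finding that accuracy degrades as clauses are added.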

AI won’t replace us, but we’ll need new skills to control it

The inherent weaknesses in LLMs show that human oversight remains key in their deployment, particularly when logical errors need to be identified and corrected.

AI models, while highly capable in narrow contexts, cannot self-diagnose when they make mistakes due to their lack of genuine reasoning abilities. This suggests that human operators will also need to be skilled in new areas that go beyond traditional roles.

These operators will need to understand AI’s limitations and spot logical errors, a skill set quite different from the ones automation displaces. Businesses must therefore focus on upskilling their workforce so that AI is used effectively rather than relied on blindly.

What Apple’s AI research means for the future of technology

Accurate AI is essential for safety, but we’re not there yet

Mehrdad Farajtabar, an Apple researcher, emphasizes that understanding the reasoning capabilities of AI is key when deploying these models in safety-critical fields such as healthcare, education, and decision-making.

In these sectors, accuracy and consistency are non-negotiable, yet current LLMs fail to meet these standards in complex scenarios.

The research reiterates the importance of comprehensive evaluation methods that go beyond surface-level pattern recognition. AI needs to develop genuine logical reasoning capabilities before it can be trusted with decisions in high-stakes industries where errors could result in serious harm.

AI bias is real and training data could be sabotaging its logic

Apple’s findings also raise the concern that the training data used to build LLMs can carry inherent biases. These biases, stemming from datasets assembled by the organizations that fund and develop the models, can shape the AI’s logic in ways that may be neither ethical nor neutral.

As AI systems are adopted globally, these biases could perpetuate systemic issues rather than challenge them, reinforcing prejudice in fields like hiring, law enforcement, or healthcare.

The risk is not only that AI may fail to eliminate prejudice but that it might actively strengthen it by embedding existing societal biases into systems that impact millions.

There’s a pressing need for transparent AI development processes and comprehensive auditing of the datasets used in training these models to mitigate ethical risks.

Garbage in, garbage out still applies

AI failures in key sectors could lead to real-world disasters

AI’s inability to handle confusing or conflicting data is particularly dangerous in safety-critical applications like public transportation or autonomous vehicles. If the model misinterprets sensor data or is fed faulty information, the consequences could be severe, leading to accidents or other serious failures.

This issue highlights the continuing relevance of the adage “garbage in, garbage out.” In fields where lives are on the line, such as healthcare or transportation, the consequences of AI processing bad data could be catastrophic, making human oversight indispensable.
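
At a minimum, that oversight can include rejecting implausible inputs before they ever reach a model. The sketch below is a generic illustration; the field names and plausibility bounds are pure assumptions, not any real vehicle’s specification.

```python
# Flag implausible sensor readings before they reach a downstream model.
# Field names and bounds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SensorReading:
    speed_kmh: float
    distance_m: float

# Assumed plausibility bounds; a real system would derive these from the sensor spec.
BOUNDS = {"speed_kmh": (0.0, 300.0), "distance_m": (0.0, 500.0)}

def validate(reading: SensorReading) -> list[str]:
    """Return a list of problems; an empty list means the reading looks plausible."""
    problems = []
    for field, (lo, hi) in BOUNDS.items():
        value = getattr(reading, field)
        if not (lo <= value <= hi):
            problems.append(f"{field}={value} outside [{lo}, {hi}]")
    return problems

reading = SensorReading(speed_kmh=-12.0, distance_m=40.0)
issues = validate(reading)
if issues:
    print("flag for human review, do not feed to the model:", issues)
else:
    print("reading passed basic plausibility checks")
```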

Final thoughts

As AI continues to evolve, the question should be how you’ll integrate it wisely. Can your brand afford to trust AI’s surface-level reasoning without the human insight to steer it? Now is the time to rethink how you balance automation with human expertise to stay competitive while avoiding costly missteps. How will you make sure your business thrives in this delicate balance between innovation and oversight?

Tim Boesen

October 28, 2024
