Last week, researchers at the Allen Institute for Artificial Intelligence demonstrated in a new paper that an AI they’d designed could ace an eighth-grade multiple-choice science test with more than 90 percent correct answers — and do quite well on a 12th-grade science test, too, with more than 80 percent correct answers.
The system, called Aristo, took the New York Regents Science Exam (a standardized test for students across New York State), with a few limitations: it didn’t have to solve the problems that involved looking at diagrams. Nonetheless, the researchers tested the program on different versions of the test as well as on tests from different years and found that its performance was pretty consistent: It’s an A student.
Aristo demonstrates how quickly AI is advancing. As recently as 2016, the paper’s authors note, no one in the field could manage to score as well as 60 percent on a similar eighth-grade science exam.
But in the field of AI — and in particular in natural language processing, which was used for this task — a lot has happened since 2016. Researchers have developed new ways to structure an AI so that it can do natural language processing tasks better, enabling AI systems to produce natural-sounding human text and write news stories or poetry. Computer vision has improved dramatically, with AIs getting more sophisticated in their ability to generate fake faces or video, “enhance” real images, and identify objects and faces. AI systems have conquered online multiplayer strategy games. And investment has poured into the field, with the headline-grabbing projects of this year typically vastly more expensive than projects were just a few years ago.
Now, the tide of progress has brought us to AI systems capable of beating an eighth grader on a science test.
The rapid advances in AI have many experts struggling to anticipate what the field will do next and have left some of them predicting that human-level AI may be only 10 or 20 years away. (Others expect that such advances will still take hundreds of years of work.)
But as AI systems get more powerful, they’ll pose more challenges — and when they get to human-level capabilities, the risks of misspecified or badly designed programs could be catastrophic. Results like these are thrilling — and they’re also a reminder that our achievements in AI are blazing ahead perhaps faster than our understanding of AI policy and AI safety.
What Aristo can do — and what it can’t
A common criticism of projects like these is that the AI is just regurgitating information, not really thinking. A few years ago, this seemed like an accurate summary of what AI systems could do. They could memorize which words were associated with one another, but they couldn’t answer any questions that involved a deeper conceptual understanding. That’s been changing. State-of-the-art AI systems today still make conceptual errors, but far fewer of them.
A look at some of the questions on the New York Regents Science Exam (from the Allen Institute paper) makes it clear that to do well on this exam, you must be doing something like conceptual reasoning:
1. Which equipment will best separate a mixture of iron filings and black pepper? (1) magnet (2) filter paper (3) triple-beam balance (4) voltmeter
2. Which form of energy is produced when a rubber band vibrates? (1) chemical (2) light (3) electrical (4) sound
3. Because copper is a metal, it is (1) liquid at room temperature (2) nonreactive with other substances (3) a poor conductor of electricity (4) a good conductor of heat
4. Which process in an apple tree primarily results from cell division? (1) growth (2) photosynthesis (3) gas exchange (4) waste removal
These certainly aren’t just vocabulary questions. A skeptic can still take the stance that the AI may be solving these questions just by drawing word associations: for example, between “iron filings” and “magnet,” “vibrates” and “sound,” or “metal” and “good conductor of heat.”
“The language model will have captured statistical associations between words that allow it to answer the question without any real understanding whatsoever,” Melanie Mitchell argued in Wired.
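To make the skeptic's point concrete, here is a minimal sketch of what answering by word association alone could look like. It is a toy, not Aristo's method: the association scores are made up by hand for illustration, and the names `ASSOCIATIONS`, `association_score`, and `pick_answer` are hypothetical. A real language model learns associations like these from large amounts of text rather than from a hand-written table.

```python
import string

# Hypothetical, hand-written association strengths between a question word
# and an answer option. Illustration only; these are not learned values and
# this is not how Aristo works.
ASSOCIATIONS = {
    ("iron", "magnet"): 0.9,
    ("filings", "magnet"): 0.8,
    ("mixture", "filter paper"): 0.4,
    ("vibrates", "sound"): 0.9,
    ("rubber", "chemical"): 0.2,
}


def association_score(question, option):
    """Sum the association strength between each question word and the option."""
    words = [w.strip(string.punctuation) for w in question.lower().split()]
    return sum(ASSOCIATIONS.get((w, option), 0.0) for w in words)


def pick_answer(question, options):
    """Choose the option most strongly associated with the question's words."""
    return max(options, key=lambda option: association_score(question, option))


q1 = "Which equipment will best separate a mixture of iron filings and black pepper?"
q1_options = ["magnet", "filter paper", "triple-beam balance", "voltmeter"]

q2 = "Which form of energy is produced when a rubber band vibrates?"
q2_options = ["chemical", "light", "electrical", "sound"]

print(pick_answer(q1, q1_options))  # magnet
print(pick_answer(q2, q2_options))  # sound
```

Nothing in this sketch knows what a magnet or a sound is. It gets both questions right only because the right word pairs happen to score highly, which is exactly the behavior the skeptic describes.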
On the other hand, is that really so different from what the rest of us are doing when we learn science? Much of learning a concept is about learning that a relationship exists between that concept and other concepts you’ve learned about previously.
It’s not clear that the AI is doing something fundamentally different from what humans are doing. In fact, the more capable AI systems get, the less likely that interpretation seems.
While it’s easy to underrate achievements like these, it’s also easy to overstate them. Many outlets covered the Allen Institute paper with overwrought claims that are just wrong about what the new AI system could do. Headlines like “This AI Just Passed A Science Test and May Be Smarter Than An Eighth Grader” or “Artificial Intelligence Is Now As Smart As An Eighth Grader” are far from accurate. No AI system in the world has the general problem-solving skills of even a 2-year-old child, let alone an eighth grader.
AI systems like Aristo are narrow. They’re very good at what they do, and what they do is solve one, well-defined, highly specific problem. Aristo cannot solve problems other than multiple-choice science exams. It’s in that respect — our ability to take knowledge from one domain and apply it to entirely new problems in other areas — that humans still surpass computer systems.
We’ll see how long that lasts.