How well do researchers say chatbots and other AI really perform?

A team of more than 400 researchers recently released a large open-access study of the performance of recent popular text-based AI architectures, including GPT, the Pathways Language Model, the (recently controversial) LaMDA architecture, and sparse expert models. The study, titled ‘Beyond the Imitation Game’ (BIG-bench), aims to provide a general measure of the state of text-based AI: how it compares to humans at the same tasks, and how model size affects the ability to perform them.

First, many of the results were interesting, but not surprising:

● In all categories, the best humans outperformed the best AIs (although that lead was smallest on International Linguistics Olympiad translation problems).
● Larger models generally showed better results.
● For some tasks, the improvement was linear with model size. These were mainly knowledge-based tasks where the explicit answer was already somewhere in the training data.
● Some tasks (“breakthrough” tasks) required a very large AI model to even get started. These were usually what the team called “composite” tasks – which involved combining two different skills or following multiple steps to get the correct answer.

However, some results were a little more interesting. Essentially, the researchers found that all model sizes were highly sensitive to the way the question was asked. For some ways of asking a question, the answers improved with larger model sizes, but for other ways, the results were no better than random, regardless of the model size.

Perhaps unsurprisingly, when given a sequence of chess moves, the models were unable to find a checkmating move, even when that move was easy to spot for novices. Interestingly, however, larger models were much more likely to propose legal moves.

Another interesting emergent ability was identifying element names from their atomic numbers. The largest models correctly named the element for about half of the atomic numbers presented.
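To make the task concrete, here is a minimal sketch of how an element-naming benchmark might be scored with exact-match accuracy. The element table, the scoring function, and the model answers below are all illustrative assumptions, not the study's actual data or evaluation harness.

```python
# Hypothetical scorer for a "name the element from its atomic number" task.
# ELEMENTS holds a small illustrative slice of the periodic table.
ELEMENTS = {1: "hydrogen", 6: "carbon", 8: "oxygen", 26: "iron", 79: "gold"}

def exact_match_accuracy(answers: dict) -> float:
    """Fraction of atomic numbers the model named correctly (case-insensitive)."""
    correct = sum(
        1 for z, name in answers.items()
        if ELEMENTS.get(z, "") == name.strip().lower()
    )
    return correct / len(answers)

# A made-up set of model responses: right on common elements, wrong on one.
model_answers = {1: "Hydrogen", 6: "carbon", 8: "gold", 26: "iron"}
print(exact_match_accuracy(model_answers))  # 0.75
```

A real harness would draw answers from a language model's text output; roughly 50% accuracy here would match the paper's reported result for the largest models.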

The most amusing task was guessing the name of a movie from a series of emojis. Smaller models gave irrelevant answers, medium-sized models gave at least plausibly related answers, but the largest models could actually guess the movie from an emoji sequence.

Overall, it seems that moderately high performance requires models with about 100 billion parameters. At that scale, models can bring in a certain amount of context and multi-step logic. However, further progress appears to be exponentially difficult, meaning significant gains are unlikely to follow from merely incremental improvements.

They also found that while large models actually perform better, they are also much more likely to exhibit social bias in their responses. For example, the team reported that the largest model is “more than 22 times more likely to see a white boy grow up to be a good doctor than a Native American girl.”

While these models show quite a bit of improvement and some interesting capabilities, they are still better suited to board games and parlor tricks than to serious work.


