Are AI Models Outsmarting Human Intelligence? Examining the Limitations of GLUE and SuperGLUE Benchmarks
Even though computers are able to pass IQ tests, they still make stupid mistakes. Can different tests help?
Researchers are developing new "benchmarks" designed to help AI models avoid real-world mistakes.
Artificial intelligence (AI) models, trained on billions of words from books, news articles, and Wikipedia, can produce prose that is uncannily humanlike. They can compose tweets, summarize email messages, and translate between dozens of languages. They can even write a little poetry. And, like high-achieving students, they ace the benchmarks that computer scientists devise for them.
Sam Bowman and his colleagues had a sobering experience with one such benchmark, GLUE (General Language Understanding Evaluation). GLUE lets AI models train on data sets containing tens of thousands of sentences. The models are then given nine tasks, such as deciding whether a sentence is grammatical or judging its sentiment. After completing all the tasks, each model receives an average score.
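The leaderboard score described above is simply an average over the per-task results. A minimal sketch of that scoring scheme (the task names and numbers here are invented for illustration, not actual GLUE tasks or results):

```python
# Sketch of GLUE-style scoring: a model gets one score per task,
# and the leaderboard ranks models by the average across tasks.

def benchmark_average(task_scores):
    """Return the mean of per-task scores on a 0-100 scale."""
    return sum(task_scores.values()) / len(task_scores)

# Hypothetical per-task results for one model (illustrative only).
scores = {
    "grammaticality": 68.0,  # is this sentence grammatical?
    "sentiment": 91.0,       # is this review positive or negative?
    "entailment": 74.0,      # does one sentence imply another?
}

print(round(benchmark_average(scores), 1))
```

A real leaderboard averages over all nine GLUE tasks, but the aggregation is the same idea: one number summarizing performance across diverse language skills.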
Bowman, a computer scientist at New York University, thought the benchmark had the models beaten: they scored below 70 out of 100 (a D+). Within a year, however, newer models were scoring close to 90, outperforming humans. "We were surprised by the increase," Bowman said. So in 2019 the researchers created SuperGLUE, an even more difficult benchmark. Some of its tasks required AI models to answer reading-comprehension questions after digesting paragraphs from Wikipedia or news sites. Once again, humans started out with a 20-point lead. "It was not that shocking what happened after," Bowman says. By early 2021, computers had once again beaten people.