University of Maryland researchers created 1,213 questions in collaboration with computers to identify flaws in machine-learning language models
A holy grail of artificial intelligence is to build machines that truly understand human language and can interpret meaning from complex, nuanced passages. However, anyone who has tried to have a conversation with virtual assistants like Siri knows that current computers still have a long way to go.
Researchers from the University of Maryland say they are bringing that artificial intelligence (AI) goal closer with the development of a novel dataset of more than 1,200 AI-stumping questions, all generated by humans and computers working in conjunction. Their work is published in the journal Transactions of the Association for Computational Linguistics.
“Most question-answering computer systems don’t explain why they answer the way they do, but our work helps us see what computers actually understand,” said Jordan Boyd-Graber, associate professor of computer science at UMD and senior author of the paper. “In addition, we have produced a dataset to test on computers that will reveal if a computer language system is actually reading and doing the same sorts of processing that humans are able to do.”
Most current work to improve question-answering programs uses either human authors or computers to generate questions. The inherent challenge in these approaches is that when humans write questions, they don’t know which specific elements of their questions confuse the computer. When computers write the questions, they either write formulaic, fill-in-the-blank questions or make mistakes, sometimes generating nonsense.
To develop their novel approach of humans and computers generating questions together, Boyd-Graber and his team created a computer interface that reveals what a computer is “thinking” as a human writer types a question. The writer can then edit his or her question to exploit the computer’s weaknesses.
In the new interface, when a human author types a question, the computer’s guesses appear on the screen, ranked from most to least likely, and the words that led the computer to make its guesses are highlighted.
For example, if the author writes “What composer's Variations on a Theme by Haydn was inspired by Karl Ferdinand Pohl?” and the system correctly answers “Johannes Brahms,” the interface highlights the words “Ferdinand Pohl” to show that this phrase led it to the answer. Using that information, the author can edit the question to make it more difficult for the computer without altering the question’s meaning. In this example, the author replaced the name of the man who inspired Brahms, “Karl Ferdinand Pohl,” with a description of his job, “the archivist of the Vienna Musikverein,” and the computer was unable to answer correctly. However, human experts could still easily answer the edited question correctly.
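The loop described above, ranked guesses plus highlighted words, can be illustrated with a small sketch. This is not the researchers' system: the three-entry knowledge base, the keyword-matching scorer and all function names below are hypothetical, and the leave-one-out highlighting (remove a word, see how much the top guess's score drops) is only a simple stand-in for whatever interpretation method the real interface uses.

```python
import re
from fractions import Fraction

# Hypothetical toy knowledge base: answer -> keywords the "model" links to it.
# Deliberately shallow: it knows Brahms only through a few names.
KNOWLEDGE = {
    "Johannes Brahms": "composer german requiem karl ferdinand pohl",
    "Joseph Haydn": "composer haydn surprise symphony vienna",
    "Franz Schubert": "composer unfinished symphony trout quintet",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

DOCS = {answer: set(tokenize(text)) for answer, text in KNOWLEDGE.items()}

def weight(word):
    """Rarer keywords count more: 1 / (number of answers mentioning the word)."""
    df = sum(word in doc for doc in DOCS.values())
    return Fraction(1, df) if df else Fraction(0)

def score(words, answer):
    """Sum the weights of the distinct question words known for this answer."""
    return sum(weight(w) for w in set(words) if w in DOCS[answer])

def ranked_guesses(question):
    """Return all candidate answers, best-scoring first."""
    words = tokenize(question)
    return sorted(DOCS, key=lambda a: score(words, a), reverse=True)

def influential_words(question):
    """Leave-one-out: how much does the top guess's score drop without each word?"""
    words = tokenize(question)
    top = ranked_guesses(question)[0]
    base = score(words, top)
    drops = {}
    for w in set(words):
        drop = base - score([x for x in words if x != w], top)
        if drop > 0:
            drops[w] = drop
    return top, sorted(drops, key=drops.get, reverse=True)

ORIGINAL = ("What composer's Variations on a Theme by Haydn "
            "was inspired by Karl Ferdinand Pohl?")
EDITED = ("What composer's Variations on a Theme by Haydn "
          "was inspired by the archivist of the Vienna Musikverein?")

if __name__ == "__main__":
    for question in (ORIGINAL, EDITED):
        top, words = influential_words(question)
        print(top, "| most influential words:", words[:3])
```

In this toy setup the original question is answered through the name alone, so the highlighted words are "karl," "ferdinand" and "pohl." Once the author swaps the name for the job description, the matcher has nothing left to link the question to Brahms and is pulled toward another answer by surface keywords. A real model fails in subtler ways, but the feedback the author sees has the same shape.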
The researchers then used the 1,213 questions they developed through this approach in a trivia competition pitting computers against experienced human players—from junior varsity high school trivia teams to “Jeopardy!” champions. Even the weakest human team defeated the strongest computer system.
“For three or four years, people have been aware that computer question-answering systems are very brittle and can be fooled very easily,” said Shi Feng, a UMD computer science graduate student and a co-author of the paper. “But this is the first paper we are aware of that actually uses a machine to help humans break the model itself.”
The researchers say these questions will serve not only as a new dataset for computer scientists to better understand where natural language processing fails, but also as a training dataset for developing improved machine learning algorithms. The questions revealed six different language phenomena that consistently stump computers.
These six phenomena fall into two categories. In the first category are linguistic phenomena: paraphrasing (such as saying “leap from a precipice” instead of “jump from a cliff”), distracting language or unexpected contexts (such as a reference to a political figure appearing in a clue about something unrelated to politics). The second category includes reasoning skills: clues that require logic and calculation, mental triangulation of elements in a question, or putting together multiple steps to form a conclusion.
“Humans are able to generalize more and to see deeper connections,” Boyd-Graber said. “They don’t have the limitless memory of computers, but they still have an advantage in being able to see the forest for the trees. Cataloguing the problems computers have helps us understand the issues we need to address, so that we can actually get computers to begin to see the forest through the trees and answer questions in the way humans do.”
There is a long way to go before that happens, added Boyd-Graber, who also has co-appointments at the University of Maryland Institute for Advanced Computer Studies (UMIACS) as well as UMD’s College of Information Studies and Language Science Center. But this work provides an exciting new tool to help computer scientists achieve that goal.
“This paper is laying out a research agenda for the next several years so that we can actually get computers to answer questions well,” he said.
Videos of this work are available online. Additional co-authors of the research paper from UMD include computer science graduate student Pedro Rodriguez and Eric Wallace (B.S. '18, computer engineering). The paper, "Trick Me if You Can: Human-in-the-loop Generation of Adversarial Question Answering Examples," by Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada and Jordan Boyd-Graber, was published in the 2019 issue of Transactions of the Association for Computational Linguistics.