Excerpt below from https://www.nngroup.com/articles/ai-magic-8-ball/
Hallucinations
GenAI tools consistently make mistakes when performing even simple tasks; the consequences of these mistakes range from benign to disastrous. According to a study by Varun Magesh and colleagues at Stanford, AI-driven legal tools report inaccurate or false information between 17% and 33% of the time. According to a Salesforce genAI dashboard, inaccuracy rates of leading chat tools range between 13.5% and 19% in the best cases. Mistakes like these in genAI outputs are often termed “hallucinations.”
Some notable instances of genAI hallucinations include:
- Google’s AI Overviews feature recommended adding a quarter cup of nontoxic glue as a pizza ingredient to keep the cheese from sliding off after cooking.
- The National Eating Disorders Association deployed a chatbot that recommended disordered eating habits to its users.
- According to Alex Cranz from The Verge, Meta’s AI image-generation tool repeatedly produced pictures of masculine-looking, bearded people when asked to create pictures of the author, who identifies as a woman.
Hallucinations stem from how large language models (LLMs, the technology behind most genAI tools) work. LLMs are prediction engines built as probabilistic models: they look at the words that came before and select the most likely word to come next, based on patterns in the data they were trained on.
For example, if I were to ask you to finish the following sentence, you could probably do so quite well:
“The quick brown fox jumps over the lazy …”
If you speak English, you would most likely answer with the word “dog,” by far the most probable ending for this sentence, which famously uses every letter of the English alphabet. However, there is a small chance that the word I expected at the end of this sentence was “cat,” not “dog.”
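A minimal Python sketch makes this prediction step concrete. The probabilities below are invented for illustration (a real LLM assigns a probability to every token in a vocabulary of tens of thousands), but the selection logic is the same idea: pick, or sample, the next word according to how likely the model estimates it to be.

```python
import random

# Hypothetical next-word probabilities a model might assign after
# "The quick brown fox jumps over the lazy ..." (numbers invented for illustration).
next_word_probs = {
    "dog": 0.92,   # overwhelmingly likely, given the famous pangram
    "cat": 0.04,   # plausible, but far rarer in the training data
    "fox": 0.02,
    "man": 0.02,
}

def greedy_pick(probs):
    """Always choose the single most likely next word."""
    return max(probs, key=probs.get)

def sample_pick(probs):
    """Sample a next word in proportion to its probability,
    so less likely words ("cat") are occasionally chosen."""
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

print(greedy_pick(next_word_probs))   # always "dog"
print(sample_pick(next_word_probs))   # usually "dog", sometimes "cat"
```

Neither function has any notion of whether the chosen word is true or appropriate; the output is only ever as good as the probabilities behind it.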
Since LLMs are exposed to training data from all corners of the internet (copyrighted or otherwise), they learn associations between words and word fragments from the enormous number of sentences people have written throughout history, which gives them their impressive ability to respond flexibly to user requests. However, these models have no built-in understanding of truth, of the world, or even of the meanings of the words they generate; instead, they are built to produce fluent strings of words, with little incentive to communicate or verify the accuracy of their predictions.
The term “hallucination” is often used to describe the mistakes made by LLMs, but it may be more precise to describe these mistakes as “creative gap-filling” or “confabulation,” as Benj Edwards suggests in Ars Technica.
Clearly, AI models’ inability to assess truth is a problem. Tech executives and developers alike have little confidence that the AI-hallucination problem will be solved soon. While these models are extremely powerful, their probabilistic, generative, and predictive structure makes them inherently susceptible to hallucinations. Although hallucination rates have decreased as improved models have been released, proposed solutions to the hallucination problem have so far fallen short.
As such, it falls to the users of genAI products to judge the accuracy and validity of each output, lest they act on a piece of confabulated information that “looked like the right answer.” Depending on the context of use, the consequences of a hallucination might be trivial or severe. When humans uncritically accept AI-generated information, that’s magic-8-ball thinking.