Psychology meets hallucinations
Motivation
Mitigation
Theory
Practice
Effect on user experience
Participants rate hallucinated content as less accurate than non-hallucinated content, and this effect is stronger if a warning is present. However, even though they also dislike hallucinated content more, their tendency to share the content (e.g., engagement) remains unaffected (Nahar et al., 2024)
Hallucinations significantly diminish user trust and satisfaction in LLM-based systems (Oelschlager, 2024)
Users are concerned about hallucinated content
Medicine = high-stakes and high-risk environment
Humans are susceptible to misinformation
Outlook / Further Research
Understanding hallucinations: lessons learned from psychology
Human-AI collaboration
Putting novel mitigation strategies into practice
Data? Survey?
LLM use-cases in medicine
Apply concepts from Dual Process Theory (Bellini-Leite, 2023)
Psychological concepts such as source amnesia, availability heuristics, recency effect, cognitive dissonance, suggestibility, and confabulation will serve as the basis for our new characterization (Berberette et al., 2024)
Draw inspiration from metacognitive processes when developing LLMs (Berberette et al., 2024)
Dual Process Theory
Metacognitive Processes
Self-Reflection approach reduces hallucinations in medical LLMs (Ji et al., 2023)
Chain-of-Thought prompting (Wei et al., 2022); see the prompting sketch after this group, which combines CoT with a self-reflection pass
Tree of Thought (Yao et al., 2024) incorporates DPT successfully, though it does not address hallucinations
LLMs show human-like biases in reasoning
Yao et al. (2023)
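A minimal sketch of how Chain-of-Thought prompting could be combined with a self-reflection pass to cut down on unsupported claims, loosely following the ideas in Wei et al. (2022) and Ji et al. (2023) rather than their exact procedures; `call_llm` is a placeholder for any chat-completion client and the prompt wording is an illustrative assumption:

```python
# Sketch: chain-of-thought generation followed by a self-reflection pass.
# `call_llm` stands in for any chat-completion API; prompts are illustrative,
# not the exact wording of Wei et al. (2022) or Ji et al. (2023).

def call_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call (OpenAI client, local model, etc.)."""
    raise NotImplementedError("plug in an LLM client here")


def answer_with_reflection(question: str, max_rounds: int = 2) -> str:
    # Step 1: elicit explicit intermediate reasoning (chain of thought).
    draft = call_llm(
        f"Question: {question}\n"
        "Think step by step, then state your final answer."
    )
    # Step 2: let the model critique its own draft for unsupported claims
    # and revise, loosely mirroring a self-reflection loop.
    for _ in range(max_rounds):
        critique = call_llm(
            "Review the answer below for factual claims that are not supported "
            "by the question or by well-established knowledge. List any problems, "
            "or reply 'OK' if there are none.\n\n"
            f"Question: {question}\nAnswer: {draft}"
        )
        if critique.strip().upper().startswith("OK"):
            break
        draft = call_llm(
            f"Question: {question}\nPrevious answer: {draft}\n"
            f"Identified problems: {critique}\n"
            "Rewrite the answer, correcting or removing the unsupported claims."
        )
    return draft
```

The reflection loop exits as soon as the critique step reports no problems, so the overhead in the common case is a single extra call.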
What causes hallucinations in LLMs?
Data Quality and Bias
Overgeneralization
Lack of World Knowledge
Model Complexity and Overfitting
Validation procedure (via leveraging the model's logit output values) to detect and mitigate hallucinations during output generation (Varshney et al., 2023)
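A minimal sketch of what such a logit-based check could look like, in the spirit of Varshney et al. (2023) but not their exact pipeline: generated tokens with low model probability are flagged so the surrounding spans can be validated afterwards (e.g., against a retrieval source). The model ("gpt2") and the 0.3 threshold are illustrative assumptions:

```python
# Sketch: flag low-probability generated tokens as candidates for validation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of Australia is"
inputs = tokenizer(prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    return_dict_in_generate=True,
    output_scores=True,  # keep the logits of every generated step
)

generated = out.sequences[0, inputs["input_ids"].shape[1]:]
low_confidence = []
for step, token_id in enumerate(generated):
    probs = torch.softmax(out.scores[step][0], dim=-1)
    p = probs[token_id].item()
    if p < 0.3:  # illustrative threshold for "low confidence"
        low_confidence.append((tokenizer.decode(int(token_id)), round(p, 3)))

print("Tokens to validate:", low_confidence)
```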
LLMs (Chinchilla, 7B/70B) show human-like content effects on reasoning; they reason more effectively about believable situations than about unrealistic or abstract ones (Dasgupta et al., 2022)
LLM neuroscience goes beyond these studies by directly probing the mechanisms that give rise to behavior
LLMs mimic the dual-process-like behaviour seen in humans, demonstrating both biased and consistent reasoning depending on context (Dasgupta et al., 2022)
While GPT-3.5 struggles to capture human behaviour in property-induction tasks, GPT-4 qualitatively matches the performance of humans for the most part (Han et al., 2024)
LLM Psychology (or Machine Psychology): cognitive biases, human-like system-1 failures, cognitive errors and content effects, anchoring and framing effects, theory of mind, creativity, cooperation and coordination behaviour, altruistic behaviour, moral reasoning and beliefs, classical psychology experiments, social biases
LLM Psychology (Binz & Schulz, 2023; Hagendorff, 2023) studies language models using behavioral experiments, computational analyses and all of the other techniques that psychologists have used to study human cognition and behavior
Humans fall short at discerning machine-generated texts, often performing at or below chance levels.
People do not seem to differentiate between human and LLM-generated texts with regards to source credibility: They perceive the texts as equally competent and trustworthy (Huschens et al., 2023)
Studies indicate that humans have difficulty discerning LLM-generated from human-written texts (Clark et al., 2021) and are more susceptible to LLM-generated content (Spitale et al., 2023); note that truth discernment (Pennycook et al., 2018) and differentiation between LLM-generated and human-written texts (Clark et al., 2021) should be treated as separate questions.
Uchendu et al. (2021) and Kreps et al. (2022) found that human evaluators struggled to differentiate between news articles by humans and machines.
Spitale et al. (2023) investigated discernment of disinformation across 11 polarizing topics, including COVID-19, flat earth, and climate change, in original and GPT-3-written tweets. Their findings indicate that GPT-3 produced more understandable information and more compelling disinformation than humans.
Martel & Rand (2023) found that warning labels effectively reduce belief in misinformation, while van der Meer et al. (2023) discovered that exposure to warnings reduces trust in authentic news (i.e., blind skepticism).
There is a dissociation between the perceived accuracy and the intention to share content (Pennycook et al., 2021)
Mechanism that contributes to the believability of fake news: fluency via prior exposure, the so-called illusory truth effect, which persists despite warnings (Pennycook et al., 2018)
Hallucination detection was non-trivial, as participants performed at or below chance levels for both minor (25.28%) and major (48.39%) hallucinations in the control condition. Although a warning improved detection accuracy (minor: 31.76%, major: 57.39%), the task nonetheless remained non-trivial. (Nahar et al., 2024)
We found participants to be more susceptible to minor, compared to major hallucinations (minor: 3.21, major: 2.43). These results are consistent with prior work on machine detection of LLM misinformation (Lucas et al., 2023). (Nahar et al., 2024)
Hallucinations are the most important concern of Chinese- and English-speaking participants (keep in mind: demographic bias in the sample and a predefined set of response categories) (Wang et al., 2024)
In a similar vein, Jones and Steinhardt (2022) investigated anchoring and framing effects in LLMs.
Various papers investigated artificial theory of mind capabilities in LLMs (Sap et al. 2022; Trott et al. 2022; Dou 2023; Ullman 2023; Bubeck et al. 2023; Holterman and van Deemter 2023; Strachan et al. 2024; Street et al. 2024). Building on this, Hagendorff (2024a) studied emergent deception abilities of LLMs by examining their understanding of how to induce false beliefs in other agents.
Moreover, several papers evaluated the personality of LLMs (Miotto et al. 2022; Karra et al. 2022; Li et al. 2022; Pellert et al. 2022).
Fischer et al. (2023) used a psychological value theory to test value biases in ChatGPT.
Sicilia et al. (2023) analyzed demographic traits of GPT using methods from psycholinguistics.
Horton (2023) administered a range of behavioral economic experiments to LLMs.
Han et al. (2023) studied their ability to conduct inductive reasoning.
Webb et al. (2022) applied a battery of intelligence tests to GPT-3.
Stevenson et al. (2022) compared GPT-3's abilities for creativity and out-of-the-box thinking with humans.
Akata et al. (2023) let LLMs play repeated games with each other to study their cooperation and coordination behavior.
Prystawski et al. (2022) investigated metaphor understanding in GPT-3 using prompts based on psychological models of metaphor comprehension.
Li et al. (2023) studied the impact of incorporating emotional stimuli into prompts on the behavior of LLMs.
Johnson and Obradovich (2023) investigated altruistic as well as self-interested machine behavior in GPT-3.
Jin et al. (2022) as well as Hagendorff and Danks (2023) analyzed LLMs from the perspective of moral psychology, for instance by applying moral disengagement questionnaires.
Similarly, Scherrer et al. (2023) examined the moral beliefs embedded in LLMs with the help of a large-scale survey that included various moral scenarios.
Huang and Chang (2022) and Qiao et al. (2022) did conceptual analyses of the meaning of reasoning as an emergent ability in LLMs.
Park et al. (2023b) compared human responses to psychology experiments with outputs from GPT-3.
Moreover, Aher et al. (2022) used GPT-3 to simulate humans in classical psychology experiments (Ultimatum Game, Milgram Experiment, wisdom-of-crowds experiments, etc.), framing LLMs as implicit computational models of humans. Such studies could be deemed reverse machine psychology.
Multimodal or augmented LLMs and the integration of other types of data: LLMs that are allowed to interact with external information sources, images, tools, sensory data, physical objects, etc. (Hagendorff, 2024)
Further insights with the help of LLM psychology (Hagendorff, 2024)
Monitoring over time
Researchers can investigate how LLMs develop over time by applying the same tasks multiple times, yielding longitudinal data (Chen et al. 2023a). This data can serve as a baseline to extrapolate trends regarding the development of reasoning abilities in LLMs. Such estimations may be increasingly important for AI safety and AI alignment research to predict future behavioral potentials in individual LLMs or multiple instances of LLMs interacting with each other. (Hagendorff, 2024)
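A minimal sketch of such longitudinal monitoring, assuming a generic `call_llm(model, prompt)` client; the task items, model names, and log format are placeholders, not the protocol used by Chen et al. (2023a):

```python
# Sketch: re-run a fixed task battery against model versions and append
# timestamped results, so trends in responses can be compared across snapshots.
import json
from datetime import date


def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")


TASKS = {
    # Abbreviated placeholders for canonical tasks; real studies use full vignettes.
    "semantic_illusion": "How many animals of each kind did Moses take on the ark?",
    "cognitive_reflection": "A bat and a ball cost $1.10 in total. The bat costs "
                            "$1.00 more than the ball. How much does the ball cost?",
}


def monitor(models: list[str], log_path: str = "llm_longitudinal_log.jsonl") -> None:
    with open(log_path, "a") as log:
        for model in models:
            for task_name, prompt in TASKS.items():
                record = {
                    "date": date.today().isoformat(),  # snapshot date
                    "model": model,
                    "task": task_name,
                    "response": call_llm(model, prompt),
                }
                log.write(json.dumps(record) + "\n")
```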
Our results have shown that GPT-3 can solve some vignette-based experiments similarly or better than human subjects. However, interpreting these results is difficult because many of these vignettes might have been part of its training set, and GPT-3's performance suffered greatly given only minor changes to the original vignettes. (Binz and Schulz, 2023)
We, therefore, complemented our analyses with various task-based assessments of GPT-3's abilities. Therein, we found that GPT-3 made reasonable decisions for gambles provided as descriptions while also mirroring some human behavioral biases. GPT-3 also managed to solve a multiarmed bandit task well, where it performed better than human subjects; yet, it only showed traces of random but not of directed exploration. In the canonical two-step task, GPT-3 showed signatures of model-based reinforcement learning. However, GPT-3 failed spectacularly in using an underlying causal structure for its inference, leading to responses that were neither correct nor human-like. (Binz and Schulz, 2023)
Contrary to existing benchmarks (-> focus on mastering individual tasks and technical capabilities, not on how LLMs solve a task)
Binz and Schulz (2023), Dasgupta et al. (2022), Hagendorff et al. (2023), Nye et al. (2021), Talboy et al. (2023), as well as Chen et al. (2023b) applied a set of canonical experiments from the psychology of judgment and decision-making to LLMs (Linda problem, Wason selection task, Cab problem, cognitive reflection tests, semantic illusions, etc.) to test for cognitive biases and other human-like system-1 failures in the models.
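To make the methodology concrete, here is a minimal sketch of administering a small judgment-and-decision-making battery to an LLM and scoring responses as correct versus intuitive (system-1-like) errors; the items are abbreviated textbook versions, and the keyword-based scoring plus the `call_llm` stub are assumptions, not the materials or scoring of the cited studies:

```python
# Sketch: run two classic bias items against an LLM and score the answers.
import csv


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")


BATTERY = [
    {
        "name": "cognitive_reflection_bat_ball",
        "prompt": ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
                   "than the ball. How much does the ball cost? Answer with a number."),
        "intuitive": "0.10",   # the typical system-1 response
        "correct": "0.05",
    },
    {
        "name": "linda_problem",
        "prompt": ("Linda is 31, single, outspoken, and very bright. She majored in "
                   "philosophy and was deeply concerned with social justice. Which is "
                   "more probable? (a) Linda is a bank teller. (b) Linda is a bank "
                   "teller and is active in the feminist movement. Answer (a) or (b)."),
        "intuitive": "(b)",    # the conjunction-fallacy response
        "correct": "(a)",
    },
]


def run_battery(out_path: str = "battery_results.csv") -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["item", "response", "scored_as"])
        writer.writeheader()
        for item in BATTERY:
            response = call_llm(item["prompt"])
            if item["correct"] in response:
                scored = "correct"
            elif item["intuitive"] in response:
                scored = "intuitive_error"
            else:
                scored = "other"
            writer.writerow({"item": item["name"], "response": response,
                             "scored_as": scored})
```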
Decision-making, information search, deliberation and causal reasoning
Human-like problem-solving abilities in large language models using ChatGPT (insight problems) (Orru et al., 2023)
"Crossfeed" with psychology: substitute for participants, model for cognition and behaviour, research aid - organize and analyze literature, assist in experimental design, data analysis and paper writing, screening tool, personalized interventions (Ke et al., 2024)
As the models expand in size and linguistic proficiency they increasingly display human-like intuitive system 1 thinking and associated cognitive errors. This pattern shifts notably with the introduction of ChatGPT models, which tend to respond correctly, avoiding the traps embedded in the tasks. Both ChatGPT-3.5 and 4 utilize the input–output context window to engage in chain-of-thought reasoning, reminiscent of how people use notepads to support their system 2 thinking. Yet, they remain accurate even when prevented from engaging in chain-of-thought reasoning, indicating that their system-1-like next-word generation processes are more accurate than those of older models. (Hagendorff et al., 2023)
Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT (Hagendorff et al., 2023)
Well-developed intuition stemming from previous exposure to similar tasks (Hagendorff et al., 2023)
"Think carefully and check the question fori nvalid assumptions" (Hagendorff et al., 2023)
Chain-of-thought reasoning is one of the reasons why ChatGPT performs well on a range of tasks designed to provoke an intuitive, system-1-like erroneous response (Hagendorff et al., 2023)
Discussion: Should LLMs really mitigate all hallucinations? -> Concept of ecological rationality and the field of application (Hagendorff et al., 2023)