Large Language Models Can Linearly Represent Truth And Falsehood
Large language models can linearly represent true and false statements, researchers find, using visualizations, transfer experiments, and causal interventions to demonstrate the structure of truthfulness in LLMs.
This is a Plain English Papers summary of a research paper called Visualizing Truth: Large Language Models Linearly Separate True and False Statements.

Overview

Large language models (LLMs) are powerful, but they can output falsehoods. Researchers have tried to detect when LLMs are telling the truth by analyzing their internal activations, but this approach has faced challenges and criticisms. This paper studies the structure of LLM representations of truth using datasets of simple true/false statements.
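To give a feel for what "analyzing internal activations" for truth means, here is a minimal, hypothetical sketch (not the paper's actual code or data): we synthesize activation vectors in which true and false statements differ along one planted direction, then fit a simple linear probe by taking the difference of class means and classifying by the sign of the projection onto it.

```python
import numpy as np

# Illustrative sketch only: synthetic "activations" with a planted truth
# direction, probed with a difference-of-means linear classifier.
rng = np.random.default_rng(0)
dim = 64
planted = rng.normal(size=dim)
planted /= np.linalg.norm(planted)

def activations(n, truthful):
    # Activations = Gaussian noise + a shift along the planted direction,
    # positive for "true" statements and negative for "false" ones.
    shift = 2.0 if truthful else -2.0
    return rng.normal(size=(n, dim)) + shift * planted

X_true, X_false = activations(200, True), activations(200, False)

# Linear probe direction: difference of class means, normalized.
direction = X_true.mean(axis=0) - X_false.mean(axis=0)
direction /= np.linalg.norm(direction)

# Classify held-out samples by the sign of their projection.
test = np.vstack([activations(50, True), activations(50, False)])
labels = np.array([1] * 50 + [0] * 50)
preds = (test @ direction > 0).astype(int)
accuracy = (preds == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
print(f"alignment with planted direction: {direction @ planted:.2f}")
```

If truth really is represented linearly, a probe like this recovers a direction that separates true from false statements; on real LLM activations the probe is trained on hidden states rather than synthetic vectors.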