MIT develops method to expose and steer hidden biases, moods, and personalities in LLMs
MIT and UC San Diego researchers developed a technique to identify and manipulate hidden abstract concepts in LLMs, from personality traits like "conspiracy theorist" to stances like "fear of marriage." The technique can then steer these internal representations to amplify or suppress a concept in the model's outputs.
When they amplified the "conspiracy theorist" representation and asked about the Apollo 17 "Blue Marble" photo, the model generated a conspiracy-toned answer. The team demonstrated that the method works across more than 500 concepts in large LLMs, offering a way to surface and address hidden vulnerabilities.
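The article does not spell out the researchers' exact method, but concept steering of this kind is commonly implemented with a steering vector: estimate a concept direction from the difference between hidden-state activations on prompts that do and do not exhibit the concept, then add or subtract a scaled copy of that direction at inference time. Below is a minimal toy sketch of that general idea using synthetic activations (all names and numbers are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

# Toy sketch of activation steering with synthetic "hidden states".
# Assumption: a concept shows up as a roughly linear direction in
# activation space, estimable from a contrastive set of prompts.
rng = np.random.default_rng(0)
hidden_dim = 64

# Ground-truth concept direction used only to generate toy data.
true_dir = rng.normal(size=hidden_dim)
true_dir /= np.linalg.norm(true_dir)

# Synthetic activations: prompts with vs. without the concept.
with_concept = rng.normal(size=(32, hidden_dim)) + 2.0 * true_dir
without_concept = rng.normal(size=(32, hidden_dim))

# Estimate the concept direction as the difference of mean activations.
direction = with_concept.mean(axis=0) - without_concept.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden_state: np.ndarray, direction: np.ndarray,
          strength: float) -> np.ndarray:
    """Shift a hidden state along the concept direction.

    Positive strength amplifies the concept; negative suppresses it.
    """
    return hidden_state + strength * direction

def proj(v: np.ndarray) -> float:
    """Projection of a state onto the estimated concept direction."""
    return float(v @ direction)

h = rng.normal(size=hidden_dim)        # some hidden state at inference
enhanced = steer(h, direction, 4.0)    # push toward the concept
suppressed = steer(h, direction, -4.0) # push away from it

# The projection onto the concept direction moves as expected.
print(proj(suppressed) < proj(h) < proj(enhanced))  # → True
```

In a real LLM the same arithmetic would be applied to residual-stream activations at a chosen layer via a forward hook, with `strength` tuned to change the output's tone without degrading fluency.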
View full digest for February 20, 2026