The first dimension we emphasize is interpretability and explainability. Neural networks, as the fundamental building blocks of LLMs, remain largely opaque; the complex interactions between weights and activations make both training dynamics and inference behavior difficult to understand (Gilpin et al., 2018). Yet understanding these systems is critical for ensuring the alignment of LLMs with human values and objectives. We distinguish between two complementary goals: interpretability, which focuses on understanding how LLMs operate in general settings, and explainability, which seeks causal explanations for why LLMs produce specific behaviors, decisions, or outcomes. Both are essential for human-centered applications, but they serve different purposes. Interpretability provides a better understanding of LLM internals, which can help address undesired behaviors such as hallucinations, vulnerabilities to adversarial attacks, and encoded biases. Explainability, by contrast, provides users with comprehensible justification for individual outputs, informing appropriate trust and enabling contestability.
Current Approaches to Interpretability
Three interconnected areas of modern interpretability research.
First, work on understanding internal mechanisms has revealed that transformer feed-forward layers can function as interpretable key-value memories (Geva et al., 2021) and has begun to uncover how LLMs represent multilingual knowledge (Tang et al., 2024; Zhang et al., 2024). Second, these mechanistic insights have enabled practical interventions on model behavior, such as inference-time steering methods (Li et al., 2023; Turner et al., 2023; Wu et al., 2024; Zou et al., 2023), while model editing and machine unlearning techniques allow for targeted removal of undesirable traits (Ilharco et al., 2023; Liu et al., 2024; Meng et al., 2023). Third, interpretability serves as a diagnostic tool for safety, helping researchers understand jailbreaking vulnerabilities (Arditi et al., 2024; Kirch et al., 2024) and identify adversarial attack vectors (Jain et al., 2024; Łucki et al., 2024; Yu et al., 2024).
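To make the idea of inference-time steering concrete, the following is a minimal sketch of one common recipe: take the difference of mean activations between two contrastive sets of examples and add a scaled copy of that direction to a hidden state. It is not the exact procedure of any of the cited papers. The three-dimensional "activations", the function names, and the scale `alpha` are illustrative assumptions; in practice the vectors would be read from a chosen layer of a real model.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of means: mean(positive) - mean(negative)."""
    mu_pos, mu_neg = mean(positive), mean(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add a scaled steering direction to a single hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical): two examples with the desired behavior
# and two examples without it.
with_behavior = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
without_behavior = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = steering_vector(with_behavior, without_behavior)
steered = steer([0.5, 0.5, 0.5], direction, alpha=0.5)
```

In this toy setting `direction` is `[2.0, -2.0, 0.0]` and `steered` is `[1.5, -0.5, 0.5]`. The sign and magnitude of `alpha` control whether the behavior is promoted or suppressed, and how strongly.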
Interpretability methods for human-centered purposes.
It is important to understand why certain model behaviors arise, such as sycophancy or deception; however, this cannot be done simply by examining model outputs in a post-hoc fashion (Hubinger et al., 2024; Sharma et al., 2024). This shortcoming motivates applying interpretability methods for human-centered purposes. Building on theories such as the linear representation hypothesis (Park et al., 2023), the platonic representation hypothesis (Huh et al., 2024), and universal feature representations across LLMs (Lan et al., 2024), interpretability has been used as a tool to understand different model biases, including social biases (Y. Liu et al., 2024), cultural biases (H. Yu et al., 2025), and cultural knowledge (Veselovsky et al., 2025). Additionally, recent work has sought to identify models’ internal representations of important behaviors, finding that models encode harmfulness and refusal separately (Zhao et al., 2025) and that three related behaviors (sycophantic agreement, sycophantic praise, and genuine agreement) are encoded along different linear directions in latent space, each of which can be amplified or suppressed without affecting the others (Vennemeyer et al., 2025).
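The claim that a behavior encoded along a linear direction can be amplified or suppressed independently can be illustrated with a small sketch. It rescales the component of a hidden state along one direction and leaves the rest untouched. The vectors and the names `agreement_dir` and `praise_dir` are hypothetical, and the two directions are assumed to be orthogonal; with non-orthogonal directions, editing one would also shift the projection onto the other.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def unit(v: Vector) -> Vector:
    """Normalize a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def component(hidden: Vector, direction: Vector) -> float:
    """Signed length of the projection of `hidden` onto `direction`."""
    return dot(hidden, unit(direction))


def rescale_component(hidden: Vector, direction: Vector, factor: float) -> Vector:
    """Multiply the component along `direction` by `factor`.

    factor = 0 removes the component (ablation), factor > 1 amplifies it,
    and factor = 1 leaves the hidden state unchanged.
    """
    u = unit(direction)
    c = dot(hidden, u)
    return [h + (factor - 1.0) * c * ui for h, ui in zip(hidden, u)]


# Hypothetical, orthogonal behavior directions and a toy hidden state.
agreement_dir = [2.0, 0.0, 0.0]
praise_dir = [0.0, 1.0, 0.0]
hidden = [3.0, 4.0, 5.0]

ablated = rescale_component(hidden, agreement_dir, factor=0.0)
amplified = rescale_component(hidden, agreement_dir, factor=2.0)
```

Here `ablated` has no component along `agreement_dir`, while its component along `praise_dir` is the same as in `hidden`, which is the kind of independence the cited finding describes.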
Another recent application of interpretability for HCLLM is better modeling of human-AI interactions. For LLMs to act as helpful assistants, they need not only to understand the user query but also to develop an understanding of the user’s latent traits and needs. A misalignment between a model’s representation of the user and the user’s true needs can lead to various harmful outcomes, ranging from conversational grounding failures (Shaikh et al., 2023) to sycophancy and deception. For example, to make a model’s user representation more transparent, Chen et al. (2024) design a system that extracts data related to a user’s demographic features, along with a dashboard that displays this representation. Choi et al. (2025) similarly extract latent representations of users in LLMs, and these methods have also been applied to predict the behaviors of personalized LLMs (Karny et al., 2025).
Current Approaches to Explainability
Modern explainability research for LLMs pursues several complementary goals.
Feature attribution methods identify which inputs most influence outputs, natural language rationales provide human-readable justifications, and counterfactual explanations show how minimal input changes would alter predictions (H. Zhao et al., 2024). Unlike interpretability, which seeks general understanding of internal mechanisms, explainability focuses on justifying individual predictions in terms that users and stakeholders can act upon. This goal has proven challenging, as traditional explainable AI (XAI) techniques such as LIME (Ribeiro et al., 2016) and SHAP become computationally impractical at the scale of billions of parameters, while LLM-specific approaches such as chain-of-thought reasoning and post-hoc citation generation often prioritize plausibility over faithfulness (Lanham et al., 2023; Turpin et al., 2023). For a comprehensive taxonomy of explainability techniques for LLMs, we refer readers to H. Zhao et al. (2024).
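As a concrete instance of feature attribution, the sketch below implements leave-one-out (occlusion) attribution: each token is replaced by a mask and the resulting drop in the model's score is taken as that token's importance. It is a deliberately simple stand-in, not LIME or SHAP, and the toy scorer `toy_score`, its `WEIGHTS` table, and the `[MASK]` placeholder are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def occlusion_attribution(
    tokens: List[str],
    score: Callable[[List[str]], float],
    mask: str = "[MASK]",
) -> List[Tuple[str, float]]:
    """Attribute a score to each token by masking it and measuring the drop."""
    base = score(tokens)
    attributions = []
    for i, token in enumerate(tokens):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        attributions.append((token, base - score(occluded)))
    return attributions


# Hypothetical stand-in for a model: a bag-of-words sentiment score.
WEIGHTS: Dict[str, float] = {"great": 2.0, "terrible": -2.0}


def toy_score(tokens: List[str]) -> float:
    return sum(WEIGHTS.get(token, 0.0) for token in tokens)


attributions = occlusion_attribution(["the", "movie", "was", "great"], toy_score)
```

For this input only "great" receives a non-zero attribution (2.0). Note that the procedure needs one extra forward pass per token, which hints at why perturbation-based methods become expensive for long inputs and large models.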
How explainability methods can be used for human-centered purposes.
Explainability serves as a foundational element for building user trust and enabling accountability in LLM systems. The ability to assign responsibility for model decisions is essential not only for developing transparent systems but also for supporting downstream regulatory efforts, such as those concerning AI in hiring systems, compensation for content creators, and copyright law (Guha et al., 2024). These concerns have motivated legislative action: for instance, the EU AI Act, which entered into force in 2024 with obligations phasing in over the following years, establishes explainability as a legal requirement in critical domains (Smuha, 2025).
For end users, trust fundamentally depends on calibration (i.e., whether models can reliably express what they know and don’t know). Models often struggle to convey uncertainty, both through log-probabilities and through linguistic hedging, although recent work has made progress on both fronts (X. L. Li et al., 2024; Tian et al., 2023). Closely related is the problem of citation and attribution. Effective attribution can provide causal explanations for LLM behavior, but current approaches have significant limitations. While retrieval-augmented generation (RAG) systems supply LLMs with relevant context, there is no guarantee that models actually use that context to generate responses (Du et al., 2024; D. Li et al., 2023). Post-hoc citation generation similarly suffers from severe faithfulness issues (N. F. Liu et al., 2023), motivating work on parametric attribution (Khalifa et al., 2024) and on measuring training data influence more broadly (Grosse et al., 2023; Guu et al., 2023; S. M. Park et al., 2023).
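The notion of calibration introduced at the start of this paragraph can be made measurable. One widely used summary statistic is the expected calibration error (ECE), which bins predictions by stated confidence and averages the gap between confidence and accuracy in each bin. The sketch below assumes confidences in [0, 1] and equal-width bins; the function name, the bin count, and the toy data are illustrative choices and are not taken from the cited work.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float],
    correct: List[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    total = len(confidences)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than out of range.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data: a model that says "90% sure" but is right only half the time,
# and one that says "50% sure" and is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
```

The overconfident toy model has an ECE of about 0.4, while the calibrated one has an ECE of 0. The confidences could come from token log-probabilities or from verbalized estimates, the two channels mentioned above.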
Chain-of-thought (CoT) reasoning represents a particularly contested approach to explainability. On one hand, CoT outputs provide an accessible window into model reasoning that users can inspect without technical expertise. On the other hand, research has shown that these explanations can systematically misrepresent the true reasons for a model’s predictions (Lanham et al., 2023; Turpin et al., 2023). This creates a paradox for human-centered design: CoT explanations may increase user trust precisely because they appear plausible, even if they fail to faithfully reflect a model’s decision process.
Finally, LLMs have shown potential for advancing explainability in other scientific domains. For instance, LLM-powered simulations have enabled HCI designers to explore counterfactual scenarios and reason about design decisions (J. S. Park et al., 2022), while LLM-inspired approaches can extract interpretable biological features from protein language models (Simon & Zou, 2024).
Looking Forward
Providing understanding for model developers.
The black-box nature of LLMs, particularly closed-source ones, makes it difficult to predict and control how models behave. For example, when models provide unsolicited affirmation to the user, it is unclear what causes that affirmation. As the range of questions and interactions becomes increasingly complex and open-ended, interpretability becomes a key tool for answering questions such as: How can we determine whether a model is personalized? Does the model truly understand a user’s intent? Developing an understanding of a model’s representation of the user is especially important as people use LLMs for personal questions and even as AI companions. Without an understanding of models’ behaviors, model builders risk harming users’ well-being (Cheng et al., 2025).
Uncovering unintended effects of post-training.
Another key application of interpretability for HCLLM is a better understanding of post-training procedures like preference alignment (Ferrao et al., 2025; Movva et al., 2025). Without more interpretable approaches, post-training can lead to various unintended effects (e.g., sycophancy) that can be difficult to monitor or mitigate post-hoc. Building on existing approaches that refine our understanding of what preference alignment actually optimizes for, model providers can better control and steer behaviors towards desirable directions. For example, representation finetuning and steering (Rimsky et al., 2024; Wu, Arora, et al., 2025; Wu et al., 2024; Wu, Yu, et al., 2025) have been shown to be a promising way to control an LLM’s behaviors, and these methods can be applied to elicit behaviors that are aligned with users’ long-term development. A key first step towards making models safer and more steerable towards long-term beneficial objectives is to understand how they work.