Interpretable and Explainable HCLLMs

The first dimension we emphasize is interpretability and explainability. Neural networks, as the fundamental building blocks of LLMs, remain largely opaque; the complex interactions between weights and activations make both training dynamics and inference behavior difficult to understand (Gilpin et al., 2018). Yet understanding these systems is critical for ensuring the alignment of LLMs with human values and objectives. We distinguish between two complementary goals: interpretability, which focuses on understanding how LLMs operate in general settings; and explainability, which seeks causal explanations for why LLMs produce specific behaviors, decisions, or outcomes. Both are essential for human-centered applications, but serve different purposes. Interpretability provides a better understanding of LLM internals, which can help address undesired behaviors such as hallucinations, vulnerabilities to adversarial attacks, and encoded biases. Explainability, by contrast, provides users with comprehensible justification for individual outputs, informing appropriate trust and enabling contestability.

Current Approaches to Interpretability

Three interconnected areas of modern interpretability research.

First, work on understanding internal mechanisms has revealed that transformer feed-forward layers can function as interpretable key-value memories (Geva et al., 2021) and has begun to uncover how LLMs represent multilingual knowledge (Tang et al., 2024; Zhang et al., 2024). Second, these mechanistic insights have enabled practical interventions on model behavior, such as inference-time steering methods (K. Li et al., 2023; Turner et al., 2023; Wu et al., 2024; Zou et al., 2023), while model editing and machine unlearning techniques allow for targeted removal of undesirable traits (Ilharco et al., 2023; S. Liu et al., 2024; Meng et al., 2023). Third, interpretability serves as a diagnostic tool for safety, helping researchers understand jailbreaking vulnerabilities (Arditi et al., 2024; Kirch et al., 2024) and identify adversarial attack vectors (Jain et al., 2024; Łucki et al., 2024; L. Yu et al., 2024).
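To make the idea of inference-time steering concrete, the following is a minimal sketch of the core arithmetic: a direction vector, scaled by a coefficient, is added to a hidden state. It is an illustration under simplifying assumptions rather than the procedure of any cited work; the function name `add_steering_vector`, the toy 4-dimensional hidden state, and the coefficient `alpha` are all hypothetical.

```python
from typing import List

Vector = List[float]


def add_steering_vector(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Return hidden + alpha * direction, computed elementwise."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy values (assumptions for illustration only): a 4-dimensional hidden state
# and a hypothetical direction associated with some target behavior.
hidden_state = [0.5, -1.0, 0.25, 2.0]
steer_direction = [1.0, 0.0, 0.0, -1.0]

# A larger alpha pushes the hidden state further along the direction.
steered = add_steering_vector(hidden_state, steer_direction, alpha=2.0)
print(steered)
```

In a real model the same addition would be applied to activations at a chosen layer during the forward pass; which layer, which direction, and which coefficient to use are empirical choices that this sketch does not address.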

Interpretability methods for human-centered purposes.

It is important to understand why model behaviors such as sycophancy or deception arise; however, examining model outputs post hoc is not sufficient for this (Hubinger et al., 2024; Sharma et al., 2024). This shortcoming motivates applying interpretability methods for human-centered purposes. Building on ideas such as the linear representation hypothesis (K. Park et al., 2023), the platonic representation hypothesis (Huh et al., 2024), and universal feature spaces across LLMs (Lan et al., 2024), interpretability has been used to study what models encode about society and culture, including social biases (Y. Liu et al., 2024), cultural biases (H. Yu et al., 2025), and cultural knowledge (Veselovsky et al., 2025). Additionally, recent work has sought to identify models' internal representations of important behaviors, finding that models encode harmfulness and refusal separately (J. Zhao et al., 2025) and that three related behaviors (sycophantic agreement, sycophantic praise, and genuine praise) are encoded along different linear directions in latent space, each of which can be amplified or suppressed without affecting the others (Vennemeyer et al., 2025).
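The notion of a behavior being "encoded along a linear direction" can be illustrated with a generic recipe: estimate a direction as the normalized difference between the mean activations of two contrasting sets of examples, then suppress it by projecting it out of a hidden state. The sketch below is a toy illustration of that recipe and is not claimed to reproduce the method of any cited paper; the helper names and the 2-dimensional activations are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Elementwise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def difference_of_means_direction(pos: List[Vector], neg: List[Vector]) -> Vector:
    """Unit vector pointing from the mean of `neg` to the mean of `pos`."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two sets have identical means; no direction found")
    return [x / norm for x in diff]


def ablate_direction(hidden: Vector, direction: Vector) -> Vector:
    """Remove the component of `hidden` along the unit vector `direction`."""
    coeff = sum(h * d for h, d in zip(hidden, direction))
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy activations (assumed): examples that do / do not show the behavior.
with_behavior = [[2.0, 1.0], [4.0, 1.0]]
without_behavior = [[0.0, 1.0], [0.0, 1.0]]

direction = difference_of_means_direction(with_behavior, without_behavior)
ablated = ablate_direction([5.0, 7.0], direction)
print(direction, ablated)
```

If two behaviors lie along orthogonal directions, ablating one leaves the component along the other unchanged, which is the geometric intuition behind suppressing one behavior without affecting another.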

Another recent application of interpretability for HCLLMs is to better model human-AI interactions. For LLMs to act as helpful assistants, they need not only to understand the user's query but also to develop an understanding of the user's latent traits and needs. A misalignment between a model's representation of the user and the user's true needs can lead to various harmful outcomes, ranging from conversational grounding failures (Shaikh et al., 2023) to sycophancy and deception. For example, to make a model's user representation more transparent, Chen et al. (2024) design a system that extracts the model's representation of a user's demographic features, together with a dashboard that displays this representation. Choi et al. (2025) similarly extract latent representations of users in LLMs, and these methods have also been applied to predict the behaviors of personalized LLMs (Karny et al., 2025).

Current Approaches to Explainability

Modern explainability research for LLMs pursues several complementary goals.

Feature attribution methods identify which inputs most influence outputs, natural language rationales provide human-readable justifications, and counterfactual explanations show how minimal input changes would alter predictions (H. Zhao et al., 2024). Unlike interpretability, which seeks general understanding of internal mechanisms, explainability focuses on justifying individual predictions in terms that users and stakeholders can act upon. This goal has proven challenging, as traditional explainable AI (XAI) techniques such as LIME (Ribeiro et al., 2016) and SHAP become computationally impractical at the scale of billions of parameters, while LLM-specific approaches such as chain-of-thought reasoning and post-hoc citation generation often prioritize plausibility over faithfulness (Lanham et al., 2023; Turpin et al., 2023). For a comprehensive taxonomy of explainability techniques for LLMs, we refer readers to H. Zhao et al. (2024).
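As a simple illustration of feature attribution, the sketch below implements leave-one-out (occlusion) attribution: each token's importance is the drop in a score when that token is removed. This is one basic perturbation-based scheme, chosen for brevity, and is neither LIME nor SHAP; the scoring function `toy_score` and its word weights are hypothetical stand-ins for a real model.

```python
from typing import Callable, List, Tuple


def occlusion_attribution(
    tokens: List[str], score_fn: Callable[[List[str]], float]
) -> List[Tuple[str, float]]:
    """Attribute to each token the score drop caused by deleting it."""
    base = score_fn(tokens)
    return [
        (tok, base - score_fn(tokens[:i] + tokens[i + 1:]))
        for i, tok in enumerate(tokens)
    ]


def toy_score(tokens: List[str]) -> float:
    """Hypothetical stand-in for a model: sums fixed per-word weights."""
    weights = {"great": 2.0, "not": -1.5}
    return sum(weights.get(tok, 0.0) for tok in tokens)


attributions = occlusion_attribution(["the", "movie", "was", "great"], toy_score)
for token, importance in attributions:
    print(f"{token}: {importance:+.1f}")
```

Even this simple scheme needs one scoring call per input token plus one for the baseline, which hints at why perturbation-based attribution becomes expensive when each call is a forward pass through a very large model.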

How explainability methods can be used for human-centered purposes.

Explainability serves as a foundational element for building user trust and enabling accountability in LLM systems. The ability to assign responsibility for model decisions is essential not only for developing transparent systems but also for supporting downstream regulatory efforts in areas such as AI-assisted hiring, compensation for content creators, and copyright law (Guha et al., 2024). These concerns have motivated legislative action: for instance, the EU AI Act, which entered into force in 2024, establishes explainability as a legal requirement in critical domains (Smuha, 2025).

For end users, trust fundamentally depends on calibration (i.e., whether models can reliably express what they know and don't know). Models often struggle to convey uncertainty, both through log-probabilities and through linguistic hedging (Zhou et al., 2023), although recent work has made progress on both fronts (X. L. Li et al., 2024; Tian et al., 2023). Closely related is the problem of citation and attribution. Effective attribution can provide causal explanations for LLM behavior, but current approaches have significant limitations. While RAG systems supply LLMs with relevant context, there is no guarantee that models actually use that context to generate responses (Du et al., 2024; D. Li et al., 2023). Post-hoc citation generation similarly suffers from severe faithfulness issues (N. F. Liu et al., 2023), motivating work on parametric attribution (Khalifa et al., 2024) and on measuring training data influence more broadly (Grosse et al., 2023; Guu et al., 2023; S. M. Park et al., 2023).
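Calibration can be quantified in several ways; one widely used measure is expected calibration error (ECE), which bins predictions by stated confidence and averages the gap between confidence and accuracy in each bin. The sketch below is a minimal version with equal-width bins, offered as an illustration and not as the metric used by any particular cited work; the confidence values and correctness labels are made up.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """ECE with equal-width bins over confidences in [0, 1]."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence-accuracy gap by its share of examples.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Made-up example: the model claims 90% confidence but is right half the time.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
print(round(ece, 3))
```

A perfectly calibrated model would score 0; the overconfident toy model above scores roughly 0.4, the gap between its stated confidence and its observed accuracy.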

Chain-of-thought (CoT) reasoning represents a particularly contested approach to explainability. On one hand, CoT outputs provide an accessible window into model reasoning that users can inspect without technical expertise. On the other hand, research has shown that these explanations can systematically misrepresent the true reasons for a model’s predictions (Lanham et al., 2023; Turpin et al., 2023). This creates a paradox for human-centered design: CoT explanations may increase user trust precisely because they appear plausible, even if they fail to faithfully reflect a model’s decision process.

Finally, LLMs have shown potential for advancing explainability in other scientific domains. For instance, LLM-powered simulations have enabled HCI designers to explore counterfactual scenarios and reason about design decisions (J. S. Park et al., 2022), while LLM-inspired approaches can extract interpretable biological features from protein language models (Simon & Zou, 2024).

Looking Forward

Providing understanding for model developers.

The black-box nature of LLMs, particularly the closed-source ones, makes it difficult to predict and control how models behave. For example, when models provide unsolicited affirmation to the user, it is unclear what causes the model to provide that affirmation. As the range of questions and interactions becomes more and more complex and open-ended, interpretability becomes a key tool to answer questions like: how can we determine if a model is personalized? Does the model truly understand a user’s intent? Developing an understanding of a model’s representation of the user is especially important as people use LLMs for personal questions and even as AI companions. Without an understanding of models’ behaviors, model builders risk harming users’ well-being (Cheng et al., 2025).

Uncovering unintended effects of post-training.

Another key application of interpretability for HCLLMs is a better understanding of post-training procedures such as preference alignment (Ferrao et al., 2025; Movva et al., 2025). Without more interpretable approaches, post-training can lead to various unintended effects (e.g., sycophancy) that can be difficult to monitor or mitigate post hoc. Building on existing approaches that refine our understanding of what preference alignment actually optimizes for, model providers can better control and steer behaviors towards desirable directions. For example, representation finetuning and steering (Rimsky et al., 2024; Wu, Arora, et al., 2025; Wu et al., 2024; Wu, Yu, et al., 2025) have been shown to be promising ways to control an LLM's behaviors, and these methods can be applied to elicit behaviors that are aligned with users' long-term development. The key first step towards making models safer and more steerable towards long-term beneficial objectives is to understand how they work.

Arditi, A., Obeso, O. B., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in Language Models Is Mediated by a Single Direction. ICML 2024 Workshop on Mechanistic Interpretability. https://openreview.net/forum?id=EqF16oDVFf

Chen, Y., Wu, A., DePodesta, T., Yeh, C., Li, K., Marin, N. C., Patel, O., Riecke, J., Raval, S., Seow, O., & others. (2024). Designing a dashboard for transparency and control of conversational AI. arXiv Preprint arXiv:2406.07882.

Cheng, M., Lee, C., Khadpe, P., Yu, S., Han, D., & Jurafsky, D. (2025). Sycophantic AI Decreases Prosocial Intentions and Promotes Dependence. arXiv Preprint arXiv:2510.01395.

Choi, D., Huang, V., Schwettmann, S., & Steinhardt, J. (2025). Scalably Extracting Latent Representations of Users. https://transluce.org/user-modeling

Du, K., Snæbjarnarson, V., Stoehr, N., White, J., Schein, A., & Cotterell, R. (2024). Context versus Prior Knowledge in Language Models. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13211–13235). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.714

Ferrao, J., van der Lende, M., Lichkovski, I., & Neo, C. (2025). The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features. arXiv Preprint arXiv:2509.12934.

Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). Transformer Feed-Forward Layers Are Key-Value Memories. In M.-F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 5484–5495). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.446

Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., & Kagal, L. (2018). Explaining Explanations: An Overview of Interpretability of Machine Learning. 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), 80–89. https://doi.org/10.1109/DSAA.2018.00018

Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., Steiner, B., Li, D., Durmus, E., Perez, E., Hubinger, E., Lukošiūtė, K., Nguyen, K., Joseph, N., McCandlish, S., Kaplan, J., & Bowman, S. R. (2023). Studying Large Language Model Generalization with Influence Functions. In ArXiv preprint: Vol. abs/2308.03296. https://arxiv.org/abs/2308.03296

Guha, N., Lawrence, C. M., Gailmard, L. A., Rodolfa, K. T., Surani, F., Bommasani, R., Raji, I. D., Cuéllar, M.-F., Honigsberg, C., Liang, P., & Ho, D. E. (2024). AI Regulation Has Its Own Alignment Problem: The Technical and Institutional Feasibility of Disclosure, Registration, Licensing, and Auditing. George Washington Law Review. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4634443

Guu, K., Webson, A., Pavlick, E., Dixon, L., Tenney, I., & Bolukbasi, T. (2023). Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs. In ArXiv preprint: Vol. abs/2303.08114. https://arxiv.org/abs/2303.08114

Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., Ziegler, D. M., Maxwell, T., Cheng, N., Jermyn, A., Askell, A., Radhakrishnan, A., Anil, C., Duvenaud, D., Ganguli, D., Barez, F., Clark, J., Ndousse, K., … Perez, E. (2024). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. In ArXiv preprint: Vol. abs/2401.05566. https://arxiv.org/abs/2401.05566

Huh, M., Cheung, B., Wang, T., & Isola, P. (2024). Position: The Platonic Representation Hypothesis. Forty-First International Conference on Machine Learning. https://openreview.net/forum?id=BH8TYy0r6u

Ilharco, G., Ribeiro, M. T., Wortsman, M., Schmidt, L., Hajishirzi, H., & Farhadi, A. (2023). Editing models with task arithmetic. The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. https://openreview.net/pdf?id=6t0Kwf8-jrj

Jain, S., Lubana, E. S., Oksuz, K., Joy, T., Torr, P., Sanyal, A., & Dokania, P. K. (2024). What Makes and Breaks Safety Fine-tuning? A Mechanistic Study. ICML 2024 Workshop on Mechanistic Interpretability. https://openreview.net/forum?id=BS2CbUkJpy

Karny, S., Baez, A., & Pataranutaporn, P. (2025). Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI. arXiv Preprint arXiv:2511.00230.

Khalifa, M., Wadden, D., Strubell, E., Lee, H., Wang, L., Beltagy, I., & Peng, H. (2024). Source-Aware Training Enables Knowledge Attribution in Language Models. In ArXiv preprint: Vol. abs/2404.01019. https://arxiv.org/abs/2404.01019

Kirch, N. M., Field, S., & Casper, S. (2024). What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks. In ArXiv preprint: Vol. abs/2411.03343. https://arxiv.org/abs/2411.03343

Lan, M., Torr, P., Meek, A., Khakzar, A., Krueger, D., & Barez, F. (2024). Sparse autoencoders reveal universal feature spaces across large language models. ArXiv Preprint, abs/2410.06981. https://arxiv.org/abs/2410.06981

Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., Lukošiūtė, K., Nguyen, K., Cheng, N., Joseph, N., Schiefer, N., Rausch, O., Larson, R., McCandlish, S., Kundu, S., … Perez, E. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. In ArXiv preprint: Vol. abs/2307.13702. https://arxiv.org/abs/2307.13702

Li, D., Rawat, A. S., Zaheer, M., Wang, X., Lukasik, M., Veit, A., Yu, F., & Kumar, S. (2023). Large Language Models with Controllable Working Memory. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023 (pp. 1774–1793). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.112

Li, K., Patel, O., Viégas, F. B., Pfister, H., & Wattenberg, M. (2023). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/81b8390039b7302c909cb769f8b6cd93-Abstract-Conference.html

Li, X. L., Khandelwal, U., & Guu, K. (2024). Few-Shot Recalibration of Language Models. In ArXiv preprint: Vol. abs/2403.18286. https://arxiv.org/abs/2403.18286

Liu, N. F., Zhang, T., & Liang, P. (2023). Evaluating verifiability in generative search engines. arXiv Preprint arXiv:2304.09848.

Liu, S., Yao, Y., Jia, J., Casper, S., Baracaldo, N., Hase, P., Yao, Y., Liu, C. Y., Xu, X., Li, H., Varshney, K. R., Bansal, M., Koyejo, S., & Liu, Y. (2024). Rethinking Machine Unlearning for Large Language Models. In ArXiv preprint: Vol. abs/2402.08787. https://arxiv.org/abs/2402.08787

Liu, Y., Liu, Y., Chen, X., Chen, P.-Y., Zan, D., Kan, M.-Y., & Ho, T.-Y. (2024). The devil is in the neurons: Interpreting and mitigating social biases in language models. The Twelfth International Conference on Learning Representations.

Łucki, J., Wei, B., Huang, Y., Henderson, P., Tramèr, F., & Rando, J. (2024). An adversarial perspective on machine unlearning for AI safety. arXiv Preprint arXiv:2409.18025.

Meng, K., Sharma, A. S., Andonian, A. J., Belinkov, Y., & Bau, D. (2023). Mass-Editing Memory in a Transformer. The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. https://openreview.net/pdf?id=MkbcAHIYgyS

Movva, R., Milli, S., Min, S., & Pierson, E. (2025). What’s In My Human Feedback? Learning Interpretable Descriptions of Preference Data. arXiv Preprint arXiv:2510.26202.

Park, J. S., Popowski, L., Cai, C., Morris, M. R., Liang, P., & Bernstein, M. S. (2022). Social simulacra: Creating populated prototypes for social computing systems. Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, 1–18.

Park, K., Choe, Y. J., & Veitch, V. (2023). The Linear Representation Hypothesis and the Geometry of Large Language Models. Causal Representation Learning Workshop at NeurIPS 2023. https://openreview.net/forum?id=T0PoOJg8cK

Park, S. M., Georgiev, K., Ilyas, A., Leclerc, G., & Madry, A. (2023). TRAK: Attributing Model Behavior at Scale. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, & J. Scarlett (Eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA (Vol. 202, pp. 27074–27113). PMLR. https://proceedings.mlr.press/v202/park23c.html

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, D. Shen, & R. Rastogi (Eds.), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016 (pp. 1135–1144). ACM. https://doi.org/10.1145/2939672.2939778

Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., & Turner, A. (2024). Steering Llama 2 via contrastive activation addition. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 15504–15522.

Shaikh, O., Gligorić, K., Khetan, A., Gerstgrasser, M., Yang, D., & Jurafsky, D. (2023). Grounding or guesswork? Large language models are presumptive grounders. ArXiv Preprint, abs/2311.09144. https://arxiv.org/abs/2311.09144

Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Durmus, E., Hatfield-Dodds, Z., Johnston, S. R., Kravec, S. M., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., & Perez, E. (2024). Towards Understanding Sycophancy in Language Models. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=tvhaxkMKAn

Simon, E., & Zou, J. (2024). InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders. bioRxiv, 2024–11.

Smuha, N. A. (2025). Regulation 2024/1689 of the Eur. Parl. & Council of June 13, 2024 (EU Artificial Intelligence Act). International Legal Materials, 64(5), 1234–1381. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

Tang, T., Luo, W., Huang, H., Zhang, D., Wang, X., Zhao, X., Wei, F., & Wen, J.-R. (2024). Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 5701–5715). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.309

Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., & Manning, C. (2023). Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 5433–5442). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.330

Turner, A. M., Thiergart, L., Udell, D., Leech, G., Mini, U., & MacDiarmid, M. (2023). Activation Addition: Steering Language Models Without Optimization. ArXiv Preprint, abs/2308.10248. https://arxiv.org/abs/2308.10248

Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/ed3fea9033a80fea1376299fa7863f4a-Abstract-Conference.html

Vennemeyer, D., Duong, P. A., Zhan, T., & Jiang, T. (2025). Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs. arXiv Preprint arXiv:2509.21305.

Veselovsky, V., Argin, B., Stroebl, B., Wendler, C., West, R., Evans, J., Griffiths, T. L., & Narayanan, A. (2025). Localized Cultural Knowledge is Conserved and Controllable in Large Language Models. arXiv Preprint arXiv:2504.10191.

Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., & Potts, C. (2025). AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. arXiv Preprint arXiv:2501.17148.

Wu, Z., Arora, A., Wang, Z., Geiger, A., Jurafsky, D., Manning, C. D., & Potts, C. (2024). ReFT: Representation finetuning for language models. ArXiv Preprint, abs/2404.03592. https://arxiv.org/abs/2404.03592

Wu, Z., Yu, Q., Arora, A., Manning, C. D., & Potts, C. (2025). Improved Representation Steering for Language Models. arXiv Preprint arXiv:2505.20809.

Yu, H., Jeong, S., Pawar, S., Shin, J., Jin, J., Myung, J., Oh, A., & Augenstein, I. (2025). Entangled in representations: Mechanistic investigation of cultural biases in large language models. arXiv Preprint arXiv:2508.08879.

Yu, L., Do, V., Hambardzumyan, K., & Cancedda, N. (2024). Robust LLM safeguarding via refusal feature adversarial training. In ArXiv preprint: Vol. abs/2409.20089. https://arxiv.org/abs/2409.20089

Zhang, R., Yu, Q., Zang, M., Eickhoff, C., & Pavlick, E. (2024). The Same But Different: Structural Similarities and Differences in Multilingual Language Modeling. In ArXiv preprint: Vol. abs/2410.09223. https://arxiv.org/abs/2410.09223

Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., Wang, S., Yin, D., & Du, M. (2024). Explainability for Large Language Models: A Survey. ACM Trans. Intell. Syst. Technol., 15(2). https://doi.org/10.1145/3639372

Zhao, J., Huang, J., Wu, Z., Bau, D., & Shi, W. (2025). LLMs encode harmfulness and refusal separately. arXiv Preprint arXiv:2507.11878.

Zhou, K., Jurafsky, D., & Hashimoto, T. (2023). Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence Affect Language Models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 5506–5524). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.335

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., … Hendrycks, D. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. In ArXiv preprint: Vol. abs/2310.01405. https://arxiv.org/abs/2310.01405
