Publications

Selected publications in reverse chronological order. For an exhaustive list, check out Google Scholar.

2025

  1. delta.png
    RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?
    Yiyou Sun, Shawn Hu, Georgia Zhou, Ken Zheng, Hannaneh Hajishirzi, Nouha Dziri*, and Dawn Song*
    In Arxiv, Sep 2025
  2. omega.png
    OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
    Yiyou Sun, Shawn Hu, Georgia Zhou, Ken Zheng, Hannaneh Hajishirzi, Nouha Dziri*, and Dawn Song*
    In NeurIPS, 2025
  3. openagentsafety.png
    OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety
    Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, and Maarten Sap
    In Arxiv, Jul 2025
  4. hivemind.png
    Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
    Liwei Jiang, Chai Yuanjun, Margaret Li, Mickel Liu, Raymond Fok, Maarten Sap, Yulia Tsvetkov, Nouha Dziri, and Yejin Choi
    In NeurIPS (Oral), 2025
  5. columbia.jpg
    A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety
    Camille François, Ludovic Pérán, Ayah Bdeir, Nouha Dziri, Will Hawkins, Yacine Jernite, Sayash Kapoor, Juliet Shen, Heidy Khlaaf, Kevin Klyman, Nik Marda, Marie Pellat, Deb Raji, Divya Siddarth, Aviya Skowron, Joseph Spisak, Madhulika Srikumar, Victor Storchan, Audrey Tang, and Jen Weedon
    In Arxiv, Jun 2025
  6. singapore.png
    The Singapore Consensus on Global AI Safety Research Priorities
    Yoshua Bengio, Tegan Maharaj, Luke Ong, Stuart Russell, Dawn Song, ..., Jeff Clune, Juntao Dai, Agnes Delaborde, Nouha Dziri, Francisco Eiras, Joshua Engels, Jinyu Fan, Adam Gleave, Noah Goodman, ..., Wei Xu, Rongwu Xu, Yi Zeng, HongJiang Zhang, and Djordje Žikelić
    In Arxiv, Jun 2025
  7. reasoningladder.png
    Climbing the Ladder of Reasoning: What LLMs Can-and Still Can’t-Solve after SFT?
    Yiyou Sun, Georgia Zhou, Hao Wang, Dacheng Li, Nouha Dziri, and Dawn Song
    In Arxiv, Apr 2025
  8. trustgen.png
    On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
    Yue Huang, Chujie Gao, Siyuan Wu, Haoran Wang, Xiangqi Wang, Yujun Zhou, Yanbo Wang, Jiayi Ye, Jiawen Shi, Qihui Zhang, Yuan Li, Han Bao, Zhaoyi Liu, Tianrui Guan, Dongping Chen, Ruoxi Chen, Kehan Guo, Andy Zou, Bryan Hooi Kuen-Yew, Caiming Xiong, Elias Stengel-Eskin, Hongyang Zhang, Hongzhi Yin, Huan Zhang, Huaxiu Yao, Jaehong Yoon, Jieyu Zhang, Kai Shu, Kaijie Zhu, Ranjay Krishna, Swabha Swayamdipta, Taiwei Shi, Weijia Shi, Xiang Li, Yiwei Li, Yuexing Hao, Zhihao Jia, Zhize Li, Xiuying Chen, Zhengzhong Tu, Xiyang Hu, Tianyi Zhou, Jieyu Zhao, Lichao Sun, Furong Huang, Or Cohen Sasson, Prasanna Sattigeri, Anka Reuel, Max Lamparth, Yue Zhao, Nouha Dziri, Yu Su, Huan Sun, Heng Ji, Chaowei Xiao, Mohit Bansal, Nitesh V. Chawla, Jian Pei, Jianfeng Gao, Michael Backes, Philip S. Yu, Neil Zhenqiang Gong, Pin-Yu Chen, Bo Li, and Xiangliang Zhang
    In Arxiv, Feb 2025
  9. olmo2.png
    2 OLMo 2 Furious
    Evan Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James Validad Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Jake Poznanski, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, and Hannaneh Hajishirzi
    In COLM, 2025
  10. tulu3.png
    Tulu 3: Pushing Frontiers in Open Language Model Post-Training
    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi
    In COLM, 2025
  11. safetyanalyst.png
    SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
    Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne GE Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, and Sydney Levine
    In ICML, 2025
  12. olmosafety.png
    To Err Is AI: A Case Study Informing LLM Flaw Reporting Practices
    Sean McGregor, Allyson Ettinger, Nick Judd, Paul Albee, Liwei Jiang, Kavel Rao, William H. Smith, Shayne Longpre, Avijit Ghosh, Christopher Fiorelli, Michelle Hoang, Sven Cattell, and Nouha Dziri
    In AAAI, 2025
  13. creativity_index.png
    AI as Humanity’s Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text
    Ximing Lu, Melanie Sclar, Skyler Hallinan, Niloofar Mireshghallah, Jiacheng Liu, Seungju Han, Allyson Ettinger, Liwei Jiang, Khyathi Chandu, Nouha Dziri, and Yejin Choi
    In ICLR (Oral), 2025
  14. rel.png
    Rel-AI: An Interaction-Centered Approach To Measuring Human-LM Reliance
    Kaitlyn Zhou, Jena D Hwang, Xiang Ren, Nouha Dziri, Dan Jurafsky, and Maarten Sap
    In NAACL (Best Paper Runner Up), 2025
  15. wildbench.png
    WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
    Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi
    In ICLR, 2025

2024

  1. wildguard.png
    WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs
    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri
    In NeurIPS, 2024
  2. wildteaming.png
    WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri
    In NeurIPS, 2024
  3. reward.png
    RewardBench: Evaluating reward models for language modeling
    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A Smith, and Hannaneh Hajishirzi
    In Arxiv, Mar 2024
  4. plua.png
    A roadmap to pluralistic alignment
    Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, and Yejin Choi
    In ICML, 2024
  5. inductive.png
    Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement
    Linlu Qiu, Liwei Jiang, Ximing Lu, Melanie Sclar, Valentina Pyatkin, Chandra Bhagavatula, Bailin Wang, Yoon Kim, Yejin Choi, Nouha Dziri, and Xiang Ren
    In ICLR (Oral), 2024
  6. paradox.png
    The Generative AI Paradox: What It Can Create, It May Not Understand
    Peter West*, Ximing Lu*, Nouha Dziri*, Faeze Brahman*, Linjie Li, Jena D Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, Pang Wei Koh, Allyson Ettinger, and Yejin Choi
    In ICLR, 2024
  7. urial.png
    The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
    Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi
    In ICLR, 2024
  8. kaleido.png
    Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
    Taylor Sorensen, Liwei Jiang, Jena Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, Ximing Lu, Kavel Rao, Chandra Bhagavatula, Maarten Sap, John Tasioulas, and Yejin Choi
    In AAAI, Sep 2024
  9. ewr.png
    Elastic weight removal for faithful and abstractive dialogue generation
    Nico Daheim, Nouha Dziri, Mrinmaya Sachan, Iryna Gurevych, and Edoardo M Ponti
    In NAACL, 2024

2023

  1. faith.png
    Faith and Fate: Limits of Transformers on Compositionality
    Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi
    In NeurIPS (Spotlight), Jun 2023
  2. rlhf.png
    Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
    Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah Smith, Mari Ostendorf, and Hannaneh Hajishirzi
    In NeurIPS (Spotlight), Jun 2023
  3. defeasible.png
    What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts and Rationales for Disambiguating Defeasible Social and Moral Situations
    In EMNLP, May 2023
  4. ipa.png
    Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning
    Ximing Lu, Faeze Brahman, Peter West, Jaehun Jang, Khyathi Chandu, Abhilasha Ravichander, Lianhui Qin, Prithviraj Ammanabrolu, Liwei Jiang, Sahana Ramnath, Nouha Dziri, Jillian Fisher, Bill Yuchen Lin, Skyler Hallinan, Xiang Ren, Sean Welleck, and Yejin Choi
    In EMNLP, May 2023
  5. refine.png
    Self-refine: Iterative refinement with self-feedback
    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark
    In NeurIPS, Mar 2023
  6. champagne.png
    CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
    Seungju Han, Jack Hessel, Nouha Dziri, Yejin Choi, and Youngjae Yu
    In ICCV, Mar 2023
  7. qa.png
    Evaluating Open-Domain Question Answering in the Era of Large Language Models
    Ehsan Kamalloo, Nouha Dziri, Charles Clarke, and Davood Rafiei
    In ACL (Oral), Jul 2023

2022

  1. begin.png
    Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark
    Nouha Dziri, Hannah Rashkin, Tal Linzen, and David Reitter
    TACL, May 2022
  2. faithdial.png
    FaithDial: A Faithful Benchmark for Information-Seeking Dialogue
    Nouha Dziri, Ehsan Kamalloo, Sivan Milton, Osmar Zaiane, Mo Yu, Edoardo Ponti, and Siva Reddy
    TACL, Apr 2022
  3. hall.png
    On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?
    Nouha Dziri, Sivan Milton, Mo Yu, Osmar Zaiane, and Siva Reddy
    In NAACL, Jul 2022

2021

  1. nph.png
    Neural path hunter: Reducing hallucination in dialogue systems via path grounding
    Nouha Dziri, Andrea Madotto, Osmar Zaiane, and Avishek Joey Bose
    In EMNLP, Nov 2021
  2. demi.png
    Decomposed mutual information estimation for contrastive representation learning
    Alessandro Sordoni*, Nouha Dziri*, Hannes Schulz*, Geoff Gordon, Philip Bachman, and Remi Tachet Des Combes
    In ICML, Jul 2021

2019

  1. eval.png
    Evaluating Coherence in Dialogue Systems using Entailment
    Nouha Dziri, Ehsan Kamalloo, Kory Mathewson, and Osmar Zaiane
    In NAACL, Jun 2019
  2. THRED.png
    Augmenting Neural Response Generation with Context-Aware Topical Attention
    Nouha Dziri, Ehsan Kamalloo, Kory Mathewson, and Osmar Zaiane
    In Proceedings of the First Workshop on NLP for Conversational AI (NLP4ConvAI) at ACL 2019, Aug 2019

2018

  1. emotion.png
    Automatic Dialogue Generation with Expressed Emotions
    Chenyang Huang, Osmar Zaïane, Amine Trabelsi, and Nouha Dziri
    In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Jun 2018