| Time | Event |
|-------------|-------|
| 9:00–9:10 | Opening Remarks |
| 9:10–10:30 | Oral Session 1 |
| | Quality and Quantity of Machine Translation References for Automatic Metrics<br>Vilém Zouhar and Ondřej Bojar |
| | Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French<br>Ayla Rigouts Terryn and Miryam de Lhoneux |
| | Adding Argumentation into Human Evaluation of Long Document Abstractive Summarization: A Case Study on Legal Opinions<br>Mohamed Elaraby, Huihui Xu, Morgan Gray, Kevin Ashley and Diane Litman |
| | A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian<br>Aleksandra Miletić and Filip Miletić |
| 10:30–11:00 | Coffee Break |
| 11:00–11:45 | Invited Talk 1 |
| | Beyond Performance: The Evolving Landscape of Human Evaluation<br>Sheila Castilho |
| 11:45–13:00 | ReproNLP Shared Task Session 1 |
| 13:00–14:00 | Lunch |
| 14:00–14:45 | Oral Session 2 |
| | Insights of a Usability Study for KBQA Interactive Semantic Parsing: Generation Yields Benefits over Templates but External Validity Remains Challenging<br>Ashley Lewis, Lingbo Mo, Marie-Catherine de Marneffe, Huan Sun and Michael White |
| | Extrinsic evaluation of question generation methods with user journey logs<br>Elie Antoine, Eléonore Besnehard, Frederic Bechet, Geraldine Damnati, Eric Kergosien and Arnaud Laborderie |
| | Towards Holistic Human Evaluation of Automatic Text Simplification<br>Luisa Carrer, Andreas Säuberli, Martin Kappus and Sarah Ebling |
| 14:45–16:00 | ReproNLP Shared Task Session 2 |
| 16:00–16:30 | Coffee Break |
| 16:30–17:15 | Invited Talk 2 |
| | All That Agrees Is Not Gold: Evaluating Ground Truth and Conversational Safety<br>Mark Diaz |
| 17:15–18:00 | Oral Session 3 |
| | Decoding the Metrics Maze: Navigating the Landscape of Conversational Question Answering System Evaluation in Procedural Tasks<br>Alexander Frummet and David Elsweiler |
| 18:00–18:05 | Closing Remarks |

ReproNLP Shared Task Papers (presented in Shared Task Sessions 1 and 2)

| Paper | Authors |
|-------|---------|
| The 2024 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results | Anya Belz and Craig Thomson |
| Once Upon a Replication: It is Humans’ Turn to Evaluate AI’s Understanding of Children’s Stories for QA Generation | Andra-Maria Florescu, Marius Micluta-Campeanu and Liviu P. Dinu |
| Exploring Reproducibility of Human-Labelled Data for Code-Mixed Sentiment Analysis | Sachin Sasidharan Nair, Tanvi Dinkar and Gavin Abercrombie |
| Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques | Michela Lorandi and Anya Belz |
| ReproHum #0033-03: How Reproducible Are Fluency Ratings of Generated Text? A Reproduction of August et al. 2022 | Emiel van Miltenburg, Anouck Braggaar, Nadine Braun, Martijn Goudbeek, Emiel Krahmer, Chris van der Lee, Steffen Pauws and Frédéric Tomas |
| ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text | Tanvi Dinkar, Gavin Abercrombie and Verena Rieser |
| ReproHum #0927-3: Reproducing The Human Evaluation Of The DExperts Controlled Text Generation Method | Javier González Corbelle, Ainhoa Vivel Couso, Jose Maria Alonso-Moral and Alberto Bugarín-Diz |
| ReproHum #1018-09: Reproducing Human Evaluations of Redundancy Errors in Data-To-Text Systems | Filip Klubička and John D. Kelleher |
| ReproHum #0043: Human Evaluation Reproducing Language Model as an Annotator: Exploring Dialogue Summarization on AMI Dataset | Vivian Fresen, Mei-Shin Wu-Urbanek and Steffen Eger |
| ReproHum #0712-01: Human Evaluation Reproduction Report for “Hierarchical Sketch Induction for Paraphrase Generation” | Mohammad Arvan and Natalie Parde |
| ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation | Lewis N. Watson and Dimitra Gkatzia |
| ReproHum #0043-4: Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility | Mateusz Lango, Patricia Schmidtova, Simone Balloccu and Ondrej Dusek |
| ReproHum #0033-3: Comparable Relative Results with Lower Absolute Values in a Reproduction Study | Yiru Li, Huiyuan Lai, Antonio Toral and Malvina Nissim |
| ReproHum #0124-03: Reproducing Human Evaluations of end-to-end approaches for Referring Expression Generation | Saad Mahamood |
| ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations | Tyler Loakman and Chenghua Lin |
| ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG | Irene Mondella, Huiyuan Lai and Malvina Nissim |
| ReproHum #0087-01: A Reproduction Study of the Human Evaluation of the Coverage of Fact Checking Explanations | Mingqi Gao, Jie Ruan and Xiaojun Wan |
| ReproHum #0866-04: Another Evaluation of Readers’ Reactions to News Headlines | Zola Mahlaza, Toky Hajatiana Raboanary, Kyle Seakgwa and C. Maria Keet |