The Fourth Workshop on Human Evaluation of NLP Systems (HumEval)



PROGRAM

 9:00–9:10 Opening Remarks
 9:10–10:30 Oral Session 1
 Quality and Quantity of Machine Translation References for Automatic Metrics
Vilém Zouhar and Ondřej Bojar
 Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French
Ayla Rigouts Terryn and Miryam de Lhoneux
 Adding Argumentation into Human Evaluation of Long Document Abstractive Summarization: A Case Study on Legal Opinions
Mohamed Elaraby, Huihui Xu, Morgan Gray, Kevin Ashley and Diane Litman
 A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian
Aleksandra Miletić and Filip Miletić
 10:30–11:00 Coffee Break
 11:00–11:45 Invited Talk 1
Beyond Performance: The Evolving Landscape of Human Evaluation
Sheila Castilho
 11:45–13:00 ReproNLP Shared Task Session 1
 13:00–14:00 Lunch
 14:00–14:45 Oral Session 2
 Insights of a Usability Study for KBQA Interactive Semantic Parsing: Generation Yields Benefits over Templates but External Validity Remains Challenging
Ashley Lewis, Lingbo Mo, Marie-Catherine de Marneffe, Huan Sun and Michael White
 Extrinsic evaluation of question generation methods with user journey logs
Elie Antoine, Eléonore Besnehard, Frederic Bechet, Geraldine Damnati, Eric Kergosien and Arnaud Laborderie
 Towards Holistic Human Evaluation of Automatic Text Simplification
Luisa Carrer, Andreas Säuberli, Martin Kappus and Sarah Ebling
 14:45–16:00 ReproNLP Shared Task Session 2
 16:00–16:30 Coffee Break
 16:30–17:15 Invited Talk 2
All That Agrees Is Not Gold: Evaluating Ground Truth and Conversational Safety
Mark Diaz
 17:15–18:00 Oral Session 3
 Decoding the Metrics Maze: Navigating the Landscape of Conversational Question Answering System Evaluation in Procedural Tasks
Alexander Frummet and David Elsweiler
 18:00–18:05 Closing Remarks
 ReproNLP Shared Task Papers
 The 2024 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz and Craig Thomson
 Once Upon a Replication: It is Humans’ Turn to Evaluate AI’s Understanding of Children’s Stories for QA Generation
Andra-Maria Florescu, Marius Micluta-Campeanu and Liviu P. Dinu
 Exploring Reproducibility of Human-Labelled Data for Code-Mixed Sentiment Analysis
Sachin Sasidharan Nair, Tanvi Dinkar and Gavin Abercrombie
 Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
Michela Lorandi and Anya Belz
 ReproHum: #0033-03: How Reproducible Are Fluency Ratings of Generated Text? A Reproduction of August et al. 2022
Emiel van Miltenburg, Anouck Braggaar, Nadine Braun, Martijn Goudbeek, Emiel Krahmer, Chris van der Lee, Steffen Pauws and Frédéric Tomas
 ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text
Tanvi Dinkar, Gavin Abercrombie and Verena Rieser
 ReproHum #0927-3: Reproducing The Human Evaluation Of The DExperts Controlled Text Generation Method
Javier González Corbelle, Ainhoa Vivel Couso, Jose Maria Alonso-Moral and Alberto Bugarín-Diz
 ReproHum #1018-09: Reproducing Human Evaluations of Redundancy Errors in Data-To-Text Systems
Filip Klubička and John D. Kelleher
 ReproHum#0043: Human Evaluation Reproducing Language Model as an Annotator: Exploring Dialogue Summarization on AMI Dataset
Vivian Fresen, Mei-Shin Wu-Urbanek and Steffen Eger
 ReproHum #0712-01: Human Evaluation Reproduction Report for “Hierarchical Sketch Induction for Paraphrase Generation”
Mohammad Arvan and Natalie Parde
 ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation
Lewis N. Watson and Dimitra Gkatzia
 ReproHum #0043-4: Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility
Mateusz Lango, Patricia Schmidtova, Simone Balloccu and Ondrej Dusek
 ReproHum #0033-3: Comparable Relative Results with Lower Absolute Values in a Reproduction Study
Yiru Li, Huiyuan Lai, Antonio Toral and Malvina Nissim
 ReproHum #0124-03: Reproducing Human Evaluations of end-to-end approaches for Referring Expression Generation
Saad Mahamood
 ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations
Tyler Loakman and Chenghua Lin
 ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG
Irene Mondella, Huiyuan Lai and Malvina Nissim
 ReproHum #0087-01: A Reproduction Study of the Human Evaluation of the Coverage of Fact Checking Explanations
Mingqi Gao, Jie Ruan and Xiaojun Wan
 ReproHum #0866-04: Another Evaluation of Readers’ Reactions to News Headlines
Zola Mahlaza, Toky Hajatiana Raboanary, Kyle Seakgwa and C. Maria Keet