| Time | Event |
|-------------|-------|
| 9:00–9:10 | Opening Remarks |
| 9:10–10:30 | Oral Session 1 |
| | Quality and Quantity of Machine Translation References for Automatic Metrics<br>Vilém Zouhar and Ondřej Bojar |
| | Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French<br>Ayla Rigouts Terryn and Miryam de Lhoneux |
| | Adding Argumentation into Human Evaluation of Long Document Abstractive Summarization: A Case Study on Legal Opinions<br>Mohamed Elaraby, Huihui Xu, Morgan Gray, Kevin Ashley and Diane Litman |
| | A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian<br>Aleksandra Miletić and Filip Miletić |
| 10:30–11:00 | Coffee Break |
| 11:00–11:45 | Invited Talk 1 |
| | Beyond Performance: The Evolving Landscape of Human Evaluation<br>Sheila Castilho |
| 11:45–13:00 | ReproNLP Shared Task Session 1 |
| 13:00–14:00 | Lunch |
| 14:00–14:45 | Oral Session 2 |
| | Insights of a Usability Study for KBQA Interactive Semantic Parsing: Generation Yields Benefits over Templates but External Validity Remains Challenging<br>Ashley Lewis, Lingbo Mo, Marie-Catherine de Marneffe, Huan Sun and Michael White |
| | Extrinsic evaluation of question generation methods with user journey logs<br>Elie Antoine, Eléonore Besnehard, Frederic Bechet, Geraldine Damnati, Eric Kergosien and Arnaud Laborderie |
| | Towards Holistic Human Evaluation of Automatic Text Simplification<br>Luisa Carrer, Andreas Säuberli, Martin Kappus and Sarah Ebling |
| 14:45–16:00 | ReproNLP Shared Task Session 2 |
| 16:00–16:30 | Coffee Break |
| 16:30–17:15 | Invited Talk 2 |
| | All That Agrees Is Not Gold: Evaluating Ground Truth and Conversational Safety<br>Mark Diaz |
| 17:15–18:00 | Oral Session 3 |
| | Decoding the Metrics Maze: Navigating the Landscape of Conversational Question Answering System Evaluation in Procedural Tasks<br>Alexander Frummet and David Elsweiler |
| 18:00–18:05 | Closing Remarks |

ReproNLP Shared Task Papers (presented in Shared Task Sessions 1 and 2)

| Paper | Authors |
|-------|---------|
| The 2024 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results | Anya Belz and Craig Thomson |
| Once Upon a Replication: It is Humans’ Turn to Evaluate AI’s Understanding of Children’s Stories for QA Generation | Andra-Maria Florescu, Marius Micluta-Campeanu and Liviu P. Dinu |
| Exploring Reproducibility of Human-Labelled Data for Code-Mixed Sentiment Analysis | Sachin Sasidharan Nair, Tanvi Dinkar and Gavin Abercrombie |
| Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques | Michela Lorandi and Anya Belz |
| ReproHum #0033-03: How Reproducible Are Fluency Ratings of Generated Text? A Reproduction of August et al. 2022 | Emiel van Miltenburg, Anouck Braggaar, Nadine Braun, Martijn Goudbeek, Emiel Krahmer, Chris van der Lee, Steffen Pauws and Frédéric Tomas |
| ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text | Tanvi Dinkar, Gavin Abercrombie and Verena Rieser |
| ReproHum #0927-3: Reproducing The Human Evaluation Of The DExperts Controlled Text Generation Method | Javier González Corbelle, Ainhoa Vivel Couso, Jose Maria Alonso-Moral and Alberto Bugarín-Diz |
| ReproHum #1018-09: Reproducing Human Evaluations of Redundancy Errors in Data-To-Text Systems | Filip Klubička and John D. Kelleher |
| ReproHum #0043: Human Evaluation Reproducing Language Model as an Annotator: Exploring Dialogue Summarization on AMI Dataset | Vivian Fresen, Mei-Shin Wu-Urbanek and Steffen Eger |
| ReproHum #0712-01: Human Evaluation Reproduction Report for “Hierarchical Sketch Induction for Paraphrase Generation” | Mohammad Arvan and Natalie Parde |
| ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation | Lewis N. Watson and Dimitra Gkatzia |
| ReproHum #0043-4: Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility | Mateusz Lango, Patricia Schmidtova, Simone Balloccu and Ondrej Dusek |
| ReproHum #0033-3: Comparable Relative Results with Lower Absolute Values in a Reproduction Study | Yiru Li, Huiyuan Lai, Antonio Toral and Malvina Nissim |
| ReproHum #0124-03: Reproducing Human Evaluations of end-to-end approaches for Referring Expression Generation | Saad Mahamood |
| ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations | Tyler Loakman and Chenghua Lin |
| ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG | Irene Mondella, Huiyuan Lai and Malvina Nissim |
| ReproHum #0087-01: A Reproduction Study of the Human Evaluation of the Coverage of Fact Checking Explanations | Mingqi Gao, Jie Ruan and Xiaojun Wan |
| ReproHum #0866-04: Another Evaluation of Readers’ Reactions to News Headlines | Zola Mahlaza, Toky Hajatiana Raboanary, Kyle Seakgwa and C. Maria Keet |