Papers
Quality and Quantity of Machine Translation References for Automatic Metrics
Vilém Zouhar and Ondřej Bojar
pp. 1–11

Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French
Ayla Rigouts Terryn and Miryam de Lhoneux
pp. 12–27

Adding Argumentation into Human Evaluation of Long Document Abstractive Summarization: A Case Study on Legal Opinions
Mohamed Elaraby, Huihui Xu, Morgan Gray, Kevin Ashley and Diane Litman
pp. 28–35

A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian
Aleksandra Miletić and Filip Miletić
pp. 36–46

Insights of a Usability Study for KBQA Interactive Semantic Parsing: Generation Yields Benefits over Templates but External Validity Remains Challenging
Ashley Lewis, Lingbo Mo, Marie-Catherine de Marneffe, Huan Sun and Michael White
pp. 47–62

Extrinsic evaluation of question generation methods with user journey logs
Elie Antoine, Eléonore Besnehard, Frederic Bechet, Geraldine Damnati, Eric Kergosien and Arnaud Laborderie
pp. 63–70

Towards Holistic Human Evaluation of Automatic Text Simplification
Luisa Carrer, Andreas Säuberli, Martin Kappus and Sarah Ebling
pp. 71–80

Decoding the Metrics Maze: Navigating the Landscape of Conversational Question Answering System Evaluation in Procedural Tasks
Alexander Frummet and David Elsweiler
pp. 81–90

The 2024 ReproNLP Shared Task on Reproducibility of Evaluations in NLP: Overview and Results
Anya Belz and Craig Thomson
pp. 91–105

Once Upon a Replication: It is Humans’ Turn to Evaluate AI’s Understanding of Children’s Stories for QA Generation
Andra-Maria Florescu, Marius Micluta-Campeanu and Liviu P. Dinu
pp. 106–113

Exploring Reproducibility of Human-Labelled Data for Code-Mixed Sentiment Analysis
Sachin Sasidharan Nair, Tanvi Dinkar and Gavin Abercrombie
pp. 114–124

Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques
Michela Lorandi and Anya Belz
pp. 125–131

ReproHum #0033-03: How Reproducible Are Fluency Ratings of Generated Text? A Reproduction of August et al. 2022
Emiel van Miltenburg, Anouck Braggaar, Nadine Braun, Martijn Goudbeek, Emiel Krahmer, Chris van der Lee, Steffen Pauws and Frédéric Tomas
pp. 132–144

ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text
Tanvi Dinkar, Gavin Abercrombie and Verena Rieser
pp. 145–152

ReproHum #0927-3: Reproducing The Human Evaluation Of The DExperts Controlled Text Generation Method
Javier González Corbelle, Ainhoa Vivel Couso, Jose Maria Alonso-Moral and Alberto Bugarín-Diz
pp. 153–162

ReproHum #1018-09: Reproducing Human Evaluations of Redundancy Errors in Data-To-Text Systems
Filip Klubička and John D. Kelleher
pp. 163–198

ReproHum #0043: Human Evaluation Reproducing Language Model as an Annotator: Exploring Dialogue Summarization on AMI Dataset
Vivian Fresen, Mei-Shin Wu-Urbanek and Steffen Eger
pp. 199–209

ReproHum #0712-01: Human Evaluation Reproduction Report for “Hierarchical Sketch Induction for Paraphrase Generation”
Mohammad Arvan and Natalie Parde
pp. 210–220

ReproHum #0712-01: Reproducing Human Evaluation of Meaning Preservation in Paraphrase Generation
Lewis N. Watson and Dimitra Gkatzia
pp. 221–228

ReproHum #0043-4: Evaluating Summarization Models: investigating the impact of education and language proficiency on reproducibility
Mateusz Lango, Patricia Schmidtova, Simone Balloccu and Ondrej Dusek
pp. 229–237

ReproHum #0033-3: Comparable Relative Results with Lower Absolute Values in a Reproduction Study
Yiru Li, Huiyuan Lai, Antonio Toral and Malvina Nissim
pp. 238–249

ReproHum #0124-03: Reproducing Human Evaluations of end-to-end approaches for Referring Expression Generation
Saad Mahamood
pp. 250–254

ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations
Tyler Loakman and Chenghua Lin
pp. 255–260

ReproHum #0892-01: The painful route to consistent results: A reproduction study of human evaluation in NLG
Irene Mondella, Huiyuan Lai and Malvina Nissim
pp. 261–268

ReproHum #0087-01: A Reproduction Study of the Human Evaluation of the Coverage of Fact Checking Explanations
Mingqi Gao, Jie Ruan and Xiaojun Wan
pp. 269–273

ReproHum #0866-04: Another Evaluation of Readers’ Reactions to News Headlines
Zola Mahlaza, Toky Hajatiana Raboanary, Kyle Seakgwa and C. Maria Keet
pp. 274–280