Challenges in evaluating natural language generation systems

  • Speaker: Mohit Iyyer (University of Massachusetts Amherst)
  • Time: Friday 11 June 2021, 13:00–14:00
  • Venue: Virtual (Zoom)

If you have a question about this talk, please contact Huiyuan Xie.

Note unusual time

Join Zoom Meeting

Meeting ID: 919 0039 6241 Passcode: 127570

Recent advances in neural language modeling have opened up a variety of exciting new text generation applications. However, evaluating systems built for these tasks remains difficult. Most prior work relies on a combination of automatic metrics such as BLEU (which are often uninformative) and crowdsourced human evaluation (which is also often uninformative, especially when conducted without careful task design). In this talk, I focus on two specific applications: (1) unsupervised sentence-level style transfer and (2) long-form question answering. I will go over our recent work on building models for these tasks and then describe the ensuing struggles to properly compare them to baselines. In both cases, we identify (and propose solutions for) issues with existing evaluations, including improper aggregation of multiple metrics, missing control experiments with simple baselines, and high cognitive load placed on human evaluators. I'll conclude by briefly discussing our work on machine-in-the-loop text generation systems, in which both humans and machines participate in the generation process, and where reliable human evaluation becomes much more feasible.
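To see why n-gram metrics like BLEU can be uninformative for generation tasks, here is a minimal sketch of the modified n-gram precision underlying sentence-level BLEU (a toy illustration, not the speaker's evaluation code; real implementations such as sacrebleu add smoothing and corpus-level aggregation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=4):
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions times a brevity penalty. No smoothing, single reference."""
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if overlap == 0:
            return 0.0  # one zero precision collapses the geometric mean
        log_prec += math.log(overlap / total)
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec / max_n)

# An exact copy scores 1.0, but a perfectly acceptable paraphrase that
# shares no 4-gram with the reference scores 0.0 — the metric cannot
# distinguish it from gibberish.
print(bleu("the cat sat on the mat", "the cat sat on the mat"))      # 1.0
print(bleu("the cat sat on the mat", "a cat was sitting on the mat"))  # 0.0
```

The paraphrase example is exactly the failure mode the abstract alludes to: surface n-gram overlap rewards lexical copying, not meaning preservation, which is why style transfer and long-form QA need more careful evaluation design.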

This talk is part of the NLIP Seminar Series.

© 2006–2024, University of Cambridge.