BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Multimodality and Human Alignment in Vision-Language Models -
  Uri Berger (University of Melbourne & Hebrew University of Jerusalem)
DTSTART:20260127T110000Z
DTEND:20260127T120000Z
UID:TALK243343@talks.cam.ac.uk
CONTACT:Lucas Resck
DESCRIPTION:Vision–language models (VLMs) achieve impressive
  performance across many multimodal tasks\, yet their outputs often
  diverge from human behavior. In this talk\, I outline a path toward
  understanding the sources of these differences and toward making VLM
  outputs more human-like\, following three complementary
  steps.\n\nFirst\, I examine the semantic structures that emerge when
  models are trained with both visual and linguistic input. I show
  that\, much like in humans\, the categories learned from multimodal
  data tend to be scene-based (e.g.\, “water-related objects\,”
  “tree-related objects”)\, in contrast to the taxonomic categories
  (e.g.\, “animals\,” “vehicles”) that arise in text-only
  training.\n\nSecond\, I investigate how pragmatic cues\, such as
  salient visual categories and speakers’ cultural backgrounds\,
  influence image descriptions. I demonstrate that visual features
  shape the syntactic form of the generated description\, and that
  cultural background strongly affects which entities speakers choose
  to mention.\n\nFinally\, I present our efforts to make VLM outputs
  more human-aligned. I introduce reformulation feedback\, a technique
  inspired by parents’ feedback to their children\, and show that
  applying it to captioning models at inference time significantly
  improves human judgments of caption quality. I then survey current
  evaluation practices for image captioning models\, highlight that
  the field relies on five widely used metrics that correlate poorly
  with human ratings\, and propose directions for substantially
  improving these correlations.\n\nBio: I'm a final-year PhD candidate
  in a joint program at the University of Melbourne and the Hebrew
  University of Jerusalem\, under the supervision of Lea Frermann\,
  Omri Abend and Gabriel Stanovsky. Before that\, I did my MSc at the
  Hebrew University of Jerusalem\, working with Ari Rappoport on
  spiking neural networks. I am interested in learning in
  non–text-only environments\, particularly those involving
  multimodality or interactivity.
LOCATION:GR05 (English Faculty Building\, 9 West Road\, Sidgwick Site) and
  online (https://teams.microsoft.com/meet/3674017464325?p=04vmZBjTtuQwz5Rq
 vk)
END:VEVENT
END:VCALENDAR
