Robust Alignment of Large Language Models

Dr. Sangwoong Yoon (UCL)
Friday 23 May 2025, 12:00-13:00
ONLINE ONLY. Here is the Zoom link: https://cam-ac-uk.zoom.us/j/4751389294?pwd=Z2ZOSDk0eG1wZldVWG1GVVhrTzFIZz09.

If you have a question about this talk, please contact Suchir Salhan.

The alignment of large language models (LLMs) can often be brittle when faced with the complexities of real-world deployment. In this talk, I will share our investigations into two scenarios where special care is required to ensure robust alignment.

The first scenario is multi-objective alignment, where balancing competing objectives is particularly challenging. Our recent work, Robust Multi-Objective Decoding (RMOD), is an inference-time alignment algorithm that adaptively adjusts the weights of different objectives during response generation so that none is neglected. RMOD provides principled robustness with minimal overhead, consistently outperforming existing methods across several alignment benchmarks.

In the second part of the talk, I will address preference model misspecification in self-play alignment. While self-play is a promising alignment approach, naive implementations are vulnerable to inaccuracies in the preference model. To address this, our Regularized Self-Play Policy Optimization (RSPO) framework offers a versatile and modular method for regularizing the self-play alignment process. RSPO’s ability to combine various regularizers results in strong performance gains on multiple evaluation sets, such as AlpacaEval-2 and Arena-Hard.

As a bonus, I will briefly introduce our recent investigation into the robustness of Mixture-of-Agents (MoA) systems, a popular multi-agent paradigm. We show that even a single malicious agent introduced into the mixture can nullify the benefits of the entire system.

This talk is part of the NLIP Seminar Series.
