From Pose Estimation to Fine Grained Activity Recognition

If you have a question about this talk, please contact Microsoft Research Cambridge Talks Admins.

This event may be recorded and made available internally or externally. Microsoft will own the copyright of any recordings made. If you do not wish to have your image or voice recorded, please consider this before attending.

Human pose estimation and activity recognition in monocular images are challenging problems, especially when these tasks must be solved in unconstrained environments such as street scenes. The major sources of complexity are cluttered and dynamically changing backgrounds and the presence of multiple people that often partially or fully occlude each other.

While previous work has largely neglected interactions between people, we show that modeling them is crucial for good performance. In the first part of the talk I will demonstrate this for two tasks: detecting people in crowded street scenes and monocular 3D pose estimation. For people detection, we propose a new occlusion-aware detector that exploits the patterns emerging from person-person occlusions, and we quantify its performance on several publicly available benchmarks, where it improves over the state of the art. For human pose estimation, we propose to incorporate interactions at two levels. The 2D poses of people are inferred with a multi-person pictorial structures model that captures interactions between subjects. The 3D poses are then recovered by lifting the 2D poses to 3D using a learned joint prior model of human poses and motion. We demonstrate that including interactions between subjects in both 2D and 3D improves pose estimation results.
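For readers unfamiliar with pictorial structures, a minimal sketch of the classic single-person, tree-structured formulation may help. It is an illustrative assumption, not the multi-person model presented in the talk. Each body part has a unary appearance score at every candidate location, each child part is tied to its parent by a quadratic deformation penalty around an expected offset, and the best configuration is found exactly by max-sum dynamic programming from the leaves to the root. The function name `pictorial_structures_map`, the discrete location grid, and the quadratic penalty are hypothetical choices made for this example. Note how a part with no image evidence, such as an occluded torso, is still placed by its neighbours.

```python
from typing import Dict, List, Sequence, Tuple

Loc = Tuple[int, int]  # (x, y) position of a body part on a discrete grid


def pictorial_structures_map(
    parents: Sequence[int],
    locations: Sequence[Loc],
    unary: Sequence[Sequence[float]],
    offsets: Sequence[Tuple[float, float]],
    weight: float = 1.0,
) -> List[int]:
    """Exact MAP inference for a single-person, tree-structured model.

    Hypothetical helper (illustrative assumption, not the talk's
    multi-person model). parents[0] must be -1 (the root) and
    parents[i] < i for every other part, so parents precede children.
    unary[i][k] scores part i at locations[k] (higher is better).
    offsets[i] is the expected (dx, dy) of part i relative to its parent;
    deviations from it cost weight * squared distance.
    Returns the chosen location index for every part.
    """
    n_parts, n_locs = len(parents), len(locations)
    # score[i][k]: best score of the subtree rooted at part i, with part i at k
    score = [list(unary[i]) for i in range(n_parts)]
    # best_loc[i][kp]: best location of part i given its parent is at kp
    best_loc: Dict[int, List[int]] = {}

    # Leaves-to-root pass: children have larger indices, so they finish first.
    for i in range(n_parts - 1, 0, -1):
        p = parents[i]
        dx0, dy0 = offsets[i]
        best_loc[i] = []
        for kp, (px, py) in enumerate(locations):
            best_val, best_k = float("-inf"), -1
            for k, (x, y) in enumerate(locations):
                deviation = (x - px - dx0) ** 2 + (y - py - dy0) ** 2
                val = score[i][k] - weight * deviation
                if val > best_val:
                    best_val, best_k = val, k
            best_loc[i].append(best_k)
            score[p][kp] += best_val  # max-sum message from child i to parent

    # Root-to-leaves pass: fix the root, then read off each child's best location.
    assignment = [0] * n_parts
    assignment[0] = max(range(n_locs), key=lambda k: score[0][k])
    for i in range(1, n_parts):
        assignment[i] = best_loc[i][assignment[parents[i]]]
    return assignment


# Toy example: head (root) -> torso -> legs on a 3x3 grid, index = 3 * y + x.
grid = [(x, y) for y in range(3) for x in range(3)]
head = [1.0 if loc == (1, 0) else 0.0 for loc in grid]
torso = [0.0] * len(grid)  # occluded torso: no image evidence at all
legs = [1.0 if loc == (1, 2) else 0.0 for loc in grid]
result = pictorial_structures_map(
    parents=[-1, 0, 1],
    locations=grid,
    unary=[head, torso, legs],
    offsets=[(0, 0), (0, 1), (0, 1)],
)
print(result)  # → [1, 4, 7]: head (1, 0), torso (1, 1), legs (1, 2)
```

The brute-force message computation above costs O(L²) per edge for L candidate locations; with quadratic deformation costs, generalized distance transforms can reduce this to linear time. Interactions between people, as in the talk, would add terms coupling the parts of different subjects, which this single-person sketch does not model.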

In the second part of the talk I will focus on the challenge of fine-grained activity recognition, where the goal is to recognize a large number of visually similar activities, such as those performed during a complex medical procedure, device maintenance or cooking. I will use cooking activities as a working example and describe our recently introduced dataset, which contains over 65 cooking activities and about 9 hours of video footage. I will present initial results on the dataset and discuss open questions related to the use of pose estimation for fine-grained activity recognition.

This talk is part of the Microsoft Research Cambridge, public talks series.


© 2006-2024, University of Cambridge.