Learning from Video: Recognizing Actions and Localizing Moments with Natural Language
This talk will describe two works in video understanding. In the first part, I will describe ActionVLAD, a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of a video. In the second part, I will describe an approach that retrieves a specific temporal segment (moment) from a video given a natural language text description. We address the lack of video datasets for this task by collecting the Distinct Describable Moments (DiDeMo) dataset, which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions.
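To make the aggregation idea concrete, here is a minimal NumPy sketch of VLAD-style soft-assignment pooling of local features against a set of learned cluster anchors, the kind of operation at the heart of ActionVLAD. The function name, the `alpha` sharpness parameter, and the normalization details are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def vlad_aggregate(features, anchors, alpha=10.0):
    """VLAD-style aggregation of local features (simplified sketch).

    features: (N, D) local convolutional features gathered across the
              video's spatio-temporal extent.
    anchors:  (K, D) cluster centers ("action words"); learned in the
              real model, fixed here for illustration.
    Returns a flattened, L2-normalized (K*D,) video descriptor.
    """
    # Soft-assign each feature to the K clusters via a scaled softmax.
    logits = alpha * features @ anchors.T           # (N, K)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)     # rows sum to 1

    # Accumulate assignment-weighted residuals:
    # vlad[k] = sum_n assign[n, k] * (features[n] - anchors[k])
    vlad = assign.T @ features - assign.sum(axis=0)[:, None] * anchors

    # Intra-normalize per cluster, then L2-normalize the whole vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.flatten()
    return vlad / (np.linalg.norm(vlad) + 1e-12)
```

Because the soft assignment is differentiable, an aggregation layer like this can be trained end-to-end with the underlying convolutional network, which is what distinguishes this style of pooling from classic hard-assignment VLAD.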
Bryan Russell is currently a Research Scientist at Adobe Research in San Francisco, CA. He received his Ph.D. from MIT in the Computer Science and Artificial Intelligence Laboratory and was a post-doctoral fellow in the INRIA Willow team in Paris, France. He was a Research Scientist with Intel Labs as part of the Intel Science and Technology Center for Visual Computing (ISTC-VC) and was Affiliate Faculty at the University of Washington.