Progress on Joint Vision and Language Understanding
In this talk, I will introduce some of our recent efforts on joint vision and language understanding at Facebook AI Research. First, we study the "bottom-up attention" features that currently dominate state-of-the-art VQA systems, and show that vanilla convolutional feature maps, or grid features, can perform comparably well while offering significant speed-ups to the pipeline. This finding not only gives us a better understanding of the problem and the models, but also opens up new opportunities in vision and language research in general. Second, I will briefly mention some other new developments in this research direction, including 1) reasoning with both objects and text in the scene; and 2) making the system more robust to question rephrasing through generative modeling and cycle-consistency training. Finally, I will introduce Pythia, the open-sourced library from FAIR that helped us win two VQA challenges last year, which we hope will further facilitate research for the community.
I am a Research Scientist at Facebook AI Research, Menlo Park.
I was a PhD student at the Language Technologies Institute, Carnegie Mellon University, from September 2012 to February 2018, working mainly with Prof. Abhinav Gupta on computer vision, computational linguistics, and the combination of both. During my time at CMU, I also worked with Prof. Tom Mitchell. In spring 2014, I did an internship at MSR with Prof. C. Lawrence Zitnick. Then in summer 2016, I did an internship in Prof. William T. Freeman's VisCAM group at Google. Before graduation, I also spent time on the Google Cloud AI team working with Prof. Fei-Fei Li and Dr. Jia Li.
I graduated with a bachelor's degree in computer science from Zhejiang University, China. During my undergraduate study, I was mainly under the supervision of Prof. Deng Cai in the State Key Laboratory of CAD & CG. I was a summer intern at UCLA in 2011, working mainly with Prof. Jenn Wortman Vaughan.