Course Introduction
Feature representation of different modalities is the main focus of current cross-modal information retrieval research. Existing models typically project texts and images into the same embedding space. In this talk, we will introduce the basic ideas of text and image modeling and show how cross-modal relations can be built with deep learning models. In particular, we will discuss a joint model that uses metric learning to minimize the distance between representations of the same content from different modalities. We will also introduce some recent research developments in image captioning and visual question answering (VQA).
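To make the joint-model idea concrete, the sketch below pairs a tiny CNN image encoder with a word-embedding text encoder and trains both with a hinge-based ranking loss, so that matched image-text pairs end up close in the shared space. This is a minimal PyTorch sketch under assumed names and sizes, not the speaker's actual model.

```python
# Minimal sketch of a joint image-text embedding trained with a
# metric-learning objective. Hypothetical architecture and sizes; the
# talk's actual model may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Tiny CNN that maps an image into the shared embedding space."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, images):                  # (B, 3, H, W)
        h = self.conv(images).flatten(1)        # (B, 64)
        return F.normalize(self.fc(h), dim=-1)  # unit-length embeddings

class TextEncoder(nn.Module):
    """Word embeddings + mean pooling, projected into the same space."""
    def __init__(self, vocab_size=10000, embed_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, 300)
        self.fc = nn.Linear(300, embed_dim)

    def forward(self, token_ids):               # (B, L)
        h = self.emb(token_ids).mean(dim=1)     # (B, 300)
        return F.normalize(self.fc(h), dim=-1)

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """For each image, its matched caption (the diagonal of the score
    matrix) must outscore every other caption in the batch by `margin`."""
    scores = img_emb @ txt_emb.t()              # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)            # (B, 1) matched-pair scores
    cost = (margin + scores - pos).clamp(min=0)
    # zero out the diagonal so matched pairs contribute no cost
    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost.masked_fill(eye, 0.0).mean()
```

Because both encoders emit unit-length vectors in the same space, cross-modal retrieval afterwards reduces to nearest-neighbor lookup by cosine similarity.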
【Workshop Outline】
1. The semantic gap
2. Image modeling and CNNs
3. Text modeling and word embeddings
4. Joint models
5. Automatic annotation
6. Text generation
7. Visual question answering
Learning Objectives
Learn about cutting-edge research in deep learning: how to use deep learning to jointly model image and text information, and how to build cross-modal semantic search and image question answering systems.
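The semantic-search objective can be illustrated with a small retrieval helper: once gallery images are embedded with the encoders sketched above, a text query is embedded into the same space and the top matches are read off by cosine similarity. `search_images`, `text_encoder`, and `image_embs` are hypothetical names continuing that sketch, not an established API.

```python
# Hypothetical cross-modal semantic search on top of the shared space:
# rank precomputed image embeddings against one tokenized text query.
import torch

@torch.no_grad()
def search_images(text_encoder, query_ids, image_embs, k=5):
    """Return indices of the k gallery images closest to the text query."""
    q = text_encoder(query_ids)             # (1, D), already unit-length
    sims = (image_embs @ q.t()).squeeze(1)  # cosine similarity per image
    return torch.topk(sims, k).indices
```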
Target Audience