Designing multimodal deep architectures for Visual Question Answering

By Matthieu Cord

Appears in collection: 2019 - T1 - WS3 - Imaging and machine learning

Multimodal representation learning for text and images has been studied extensively in recent years. One of the most popular tasks in this field today is Visual Question Answering (VQA). I will introduce this complex multimodal task, which aims to answer a question about an image. Solving it requires deep visual and textual network models, and the high-level interactions between the two modalities must be carefully designed into the model for it to produce the right answer. This projection from the unimodal spaces into a multimodal one is meant to extract and model the relevant correlations between the two spaces. In addition, the model must be able to understand the full scene, focus its attention on the relevant visual regions, and discard information that is irrelevant to the question.
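
To make the two ingredients named in the abstract concrete (projecting both modalities into a shared space, and attending to the relevant image regions), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the architecture presented in the talk: the fusion used here is a simple elementwise product of projected vectors, the weights are random and untrained, and all names (`fuse`, `vqa_forward`, `params`, the dimensions) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def fuse(q, v, Wq, Wv):
    """Project question and visual vectors into a shared space, then
    combine them elementwise (an assumed, deliberately simple fusion)."""
    pq = matvec(Wq, q)
    pv = matvec(Wv, v)
    return [math.tanh(a) * math.tanh(b) for a, b in zip(pq, pv)]


def vqa_forward(q, regions, params):
    # 1. Score every image region against the question.
    fused = [fuse(q, r, params["Wq_att"], params["Wv_att"]) for r in regions]
    scores = [sum(w * f for w, f in zip(params["w_att"], fr)) for fr in fused]
    attention = softmax(scores)

    # 2. Attention-weighted sum of region features.
    dim_v = len(regions[0])
    v_att = [sum(a * r[i] for a, r in zip(attention, regions))
             for i in range(dim_v)]

    # 3. Fuse the question with the attended visual vector and classify.
    joint = fuse(q, v_att, params["Wq_cls"], params["Wv_cls"])
    logits = matvec(params["W_out"], joint)
    return attention, softmax(logits)


rng = random.Random(0)
dim_q, dim_v, dim_joint, n_regions, n_answers = 8, 6, 5, 4, 3

params = {
    "Wq_att": random_matrix(dim_joint, dim_q, rng),
    "Wv_att": random_matrix(dim_joint, dim_v, rng),
    "w_att": [rng.uniform(-0.1, 0.1) for _ in range(dim_joint)],
    "Wq_cls": random_matrix(dim_joint, dim_q, rng),
    "Wv_cls": random_matrix(dim_joint, dim_v, rng),
    "W_out": random_matrix(n_answers, dim_joint, rng),
}

# Stand-ins for a question embedding and per-region image features.
q = [rng.gauss(0, 1) for _ in range(dim_q)]
regions = [[rng.gauss(0, 1) for _ in range(dim_v)] for _ in range(n_regions)]

attention, answer_probs = vqa_forward(q, regions, params)
```

Here `attention` holds one weight per image region and `answer_probs` one probability per candidate answer. In a real system, the question vector would come from a text encoder, the region features from a visual network, and the fusion operator is precisely the part that has to be designed with care.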

Information about the video

Date of recording 04/04/2019
Date of publication 10/05/2019
Institution IHP
Licence CC BY-NC-ND
Language English
Format MP4
Venue Institut Henri Poincaré

Domain(s)

Computer Science
