Imaging and machine learning

Collection Imaging and machine learning

Date(s) 14/05/2024

Designing multimodal deep architectures for Visual Question Answering

By Matthieu Cord

Multimodal representation learning for text and images has been studied extensively in recent years. One of the most popular tasks in this field is currently Visual Question Answering (VQA). I will introduce this complex multimodal task, which aims to answer a question about an image. Solving it requires deep visual and textual models, and the high-level interactions between the two modalities must be carefully designed into the model for it to give the right answer. This projection from the unimodal spaces to a multimodal one is meant to extract and model the relevant correlations between the two spaces. In addition, the model must be able to understand the full scene, focus its attention on the relevant visual regions, and discard information that is irrelevant to the question.
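
To make the two ideas in the abstract concrete (projecting both modalities into a shared space, and attending to the image regions relevant to the question), here is a minimal sketch in plain Python. It is an illustration only and is not the architecture presented in the talk: the element-wise product fusion, the random weights, the dimensions, and every name (`fuse`, `attend`, `w_q`, `w_v`, `w_att`) are assumptions chosen for the example. In a real system the question and region vectors would come from trained text and image networks, and the weights would be learned.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(question, visual, w_q, w_v):
    """Project each modality into a shared space, then combine the two
    projections with an element-wise product (an assumed, simple fusion)."""
    q_proj = matvec(w_q, question)
    v_proj = matvec(w_v, visual)
    return [a * b for a, b in zip(q_proj, v_proj)]


def attend(question, regions, w_q, w_v, w_att):
    """Score every image region against the question, normalise the scores
    with a softmax, and return (weights, weighted sum of region features)."""
    scores = [
        sum(w * f for w, f in zip(w_att, fuse(question, region, w_q, w_v)))
        for region in regions
    ]
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [
        sum(wt * region[i] for wt, region in zip(weights, regions))
        for i in range(dim)
    ]
    return weights, attended


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned projection weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, chosen arbitrarily for the example.
Q_DIM, V_DIM, JOINT_DIM, N_REGIONS = 4, 6, 5, 3
rng = random.Random(0)

w_q = random_matrix(JOINT_DIM, Q_DIM, rng)    # question -> shared space
w_v = random_matrix(JOINT_DIM, V_DIM, rng)    # region   -> shared space
w_att = [rng.uniform(-1.0, 1.0) for _ in range(JOINT_DIM)]  # scoring vector

# Stand-ins for a question embedding and per-region image features.
question = [rng.uniform(-1.0, 1.0) for _ in range(Q_DIM)]
regions = [[rng.uniform(-1.0, 1.0) for _ in range(V_DIM)] for _ in range(N_REGIONS)]

weights, attended = attend(question, regions, w_q, w_v, w_att)
joint = fuse(question, attended, w_q, w_v)    # multimodal vector for a classifier

print("attention weights:", [round(w, 3) for w in weights])
print("joint embedding size:", len(joint))
```

The attention weights sum to one, so regions with low scores contribute little to `attended`; this is one simple way to read "discard information that is irrelevant to the question". The final `joint` vector is what an answer classifier would typically take as input.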
