PhD 'Multimodal Multi-Hop Reasoning for Video Analysis' F/H - Orange

Caouënnec-Lanvézéac (22) - Fixed-term contract (CDD)
Posted 3 days ago. Be among the first to apply.

The company: Orange

Orange Innovation brings together the research and innovation activities and expertise of the Group's entities and countries. We work every day to ensure that Orange is recognized as an innovative operator by its customers and we create value for the Group in each of our projects. With 720 researchers, thousands of marketers, developers, designers and data analysts, it is the expertise of our 6,000 employees that fuels this ambition every day. Orange Innovation anticipates technological breakthroughs and supports the Group's countries and entities in making the best technological choices to meet the needs of our consumer and business customers.
Within Orange Innovation, you will join a cutting-edge research team specializing in AI. The team conducts activities in the field of natural language processing (NLP), covering a wide range of topics such as agentic AI, deep research, language modeling, multimodality, semantic analysis, information extraction, document processing, knowledge management, human-machine dialogue, and more. You will be part of a research ecosystem, working alongside Data Scientists and developers to support the practical application of the concepts studied. The team belongs to the Data & AI Department, whose mission is to consolidate key skills to support the company's transformation, develop use cases, enrich services, and improve workflows by leveraging data and its processing, notably through Artificial Intelligence.

Job description

Your role is to pursue a PhD thesis on 'Multimodal multi-hop reasoning for video analysis'.
Multimodal reasoning represents a major shift in AI, going beyond single-modality approaches to jointly process visual, linguistic and auditory information. The main challenge is to integrate these heterogeneous sources, which differ in structure and representation. Recently, unified 'omni' models have emerged that can process multiple modalities simultaneously, but how they actually use each modality remains poorly understood.
Videos particularly illustrate this complexity: they combine visual, audio and sometimes textual content (subtitles) and constitute a demanding evaluation domain. Multi-hop video reasoning must link cues dispersed across different segments while ensuring temporal alignment, semantic coherence and robust intermodal fusion in the presence of asynchronous signals.
The thesis goal is to study the interaction between modalities in video analysis and to improve multi-hop reasoning across distinct segments. Determining when and how multiple modalities contribute to reasoning represents just part of the challenge. Current models fail to guarantee consistent use of the full modality set, with some multimodal configurations underperforming unimodal reasoning. These findings suggest dataset biases, 'modality collapse' phenomena, and fundamental limitations in modality alignment and exploitation.
Research directions will be organized along two axes.
Axis 1: evaluation, robustness and interpretability. This will involve characterizing the conditions under which models truly exploit multiple modalities and when they fall back on a single one, using probing, systematic analyses, modality ablations and controlled data manipulations (synthetic data, counterfactual examples, physics-informed scenarios). Robustness protocols (noise, suppression or misalignment of modalities) will make it possible to diagnose the causal role of each signal; a minimal ablation sketch is given below.
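For illustration, here is a minimal Python/PyTorch sketch of such a modality-ablation protocol. The model interface, dataset format and accuracy metric are hypothetical placeholders chosen for this example, not an existing Orange codebase or public benchmark API; the point is only to compare full-input performance with performance after suppressing one modality at a time.

    import torch

    @torch.no_grad()
    def accuracy(model, dataset):
        # Fraction of examples where the top prediction matches the label.
        hits = 0
        for inputs, label in dataset:
            logits = model(**inputs)  # inputs: {"video": ..., "audio": ..., "text": ...}
            hits += int(logits.argmax(-1).item() == label)
        return hits / len(dataset)

    def ablate(inputs, modality):
        # Copy the input dict with one modality suppressed (zeroed out).
        out = dict(inputs)
        out[modality] = torch.zeros_like(out[modality])
        return out

    def modality_report(model, dataset, modalities=("video", "audio", "text")):
        # Compare full-input accuracy against each single-modality ablation.
        # A negligible drop suggests the modality is ignored ('modality collapse');
        # a large drop indicates a causal contribution of that signal.
        report = {"full": accuracy(model, dataset)}
        for m in modalities:
            report[f"no_{m}"] = accuracy(model, [(ablate(x, m), y) for x, y in dataset])
        return report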
Axis 2: solutions and training of truly multimodal models. Based on the identified challenges, the thesis will aim to design and train architectures and learning procedures that promote collaboration between modalities (attention or routing mechanisms, intermodal coherence constraints, temporal grounding objectives); a toy example of such a mechanism is sketched below. The ambition is to obtain truly multimodal, robust, efficient and interpretable multi-hop video reasoning models that outperform their unimodal counterparts.
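As a deliberately simplified illustration of the attention-plus-routing idea, here is a toy PyTorch fusion layer. All names, dimensions and the gating design are assumptions made for the example, not a published or Orange-internal architecture; the returned routing weights also hint at the interpretability angle of Axis 1, since they expose how much each modality token contributes to the fused representation.

    import torch
    import torch.nn as nn

    class GatedCrossModalFusion(nn.Module):
        # Segment embeddings from all modalities attend to one another
        # (intermodal attention); a learned gate then routes weight across tokens.
        def __init__(self, dim: int, n_heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.gate = nn.Linear(dim, 1)  # one routing logit per token

        def forward(self, video, audio, text):
            # Each input: (batch, n_segments, dim) embeddings for one modality.
            tokens = torch.cat([video, audio, text], dim=1)
            fused, _ = self.attn(tokens, tokens, tokens)
            weights = torch.softmax(self.gate(fused), dim=1)  # sums to 1 over tokens
            pooled = (weights * fused).sum(dim=1)             # (batch, dim)
            return pooled, weights  # inspect weights to measure modality usage

    fusion = GatedCrossModalFusion(dim=256)
    v, a, t = torch.randn(2, 16, 256), torch.randn(2, 16, 256), torch.randn(2, 8, 256)
    pooled, weights = fusion(v, a, t)
    print(pooled.shape, weights.shape)  # torch.Size([2, 256]) torch.Size([2, 40, 1])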

Candidate profile

Hard and soft skills required for the position
Proficiency in Deep Learning techniques (text, image, audio or video processing).
Programming skills, particularly in Python, with experience in deep learning frameworks such as PyTorch or TensorFlow.
Ability to analyze and interpret complex data, with strong analytical skills.
Personal qualities: scientific rigor, autonomy, curiosity, initiative, ability to work in a team.
Strong oral and written English skills for presenting research findings and drafting publications and research reports.
Ability to present results clearly and pedagogically to different audiences.
Required education (master's degree, engineering diploma, doctorate, scientific and technical field, etc.)
You hold a professional or research master's degree or have graduated from an engineering school in computer science or applied mathematics, preferably with a specialization in one or more fields of artificial intelligence.
Desired experience
Prior experience in research projects or internships in video processing or multimodality.
Experience with vision-language models (VLMs) and/or multimodal LLMs (MLLMs).
Experience in Natural Language Processing (NLP).
In-depth understanding of LLMs and reasoning models.
Participation in scientific publications or presentations in the field is a plus.

Salary and benefits

CSE (Comité Social et Économique works council benefits)

Apply at Orange

for the position of PhD 'Multimodal Multi-Hop Reasoning for Video Analysis' F/H - CDD.

Reference: 2026-51582