Study Project: Vision-Language Models: Architectures, Alignment, and Self-Supervised Foundations (Part I)
Lecturers: Prof. Dr. Elia Bruni, Mohamad Ballout, M.Sc., Serwan Jassim
Course type: Study project
Location: 93/E01
Times: Thu 14:00 - 16:00 (weekly)
Description: Dear students, please read carefully:
We are implementing a new approach to study project distribution. Registering for individual study projects directly is not possible. Please enroll in the course "Cognitive Science Study Project Distribution" (https://go.uos.de/H58P0) and apply for up to three study projects. You will find course descriptions and self-assessments in the Vips. The deadline is Sunday, 5.10.2025, 23:59.
Course Description:
This study project explores the rapidly evolving field of Vision-Language Models (VLMs), focusing on the architectural, training, and evaluation challenges in building systems that understand and generate both visual and textual modalities. Students will examine a range of vision encoders (e.g., CNNs, ViTs, SWIN, CLIP-style encoders), and contrast compositional (e.g., dual-tower) vs. monolithic (e.g., end-to-end fusion) VLM architectures.
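To make the compositional (dual-tower) design concrete, here is a minimal sketch of the idea: two separate encoders map images and text into a shared embedding space, where matched pairs should score highly under a similarity measure. The encoders below are hypothetical random linear projections standing in for real backbones (e.g. a ViT and a text Transformer), and all dimensions are illustrative assumptions, not taken from any specific model.

```python
# Minimal dual-tower (compositional) VLM sketch.
# Each "tower" is a stand-in projection; real systems would use
# pretrained vision and text encoders here.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 32  # shared embedding dimension (illustrative choice)

# Hypothetical projection weights for each tower.
W_img = rng.normal(size=(64, EMBED_DIM))  # image features -> shared space
W_txt = rng.normal(size=(48, EMBED_DIM))  # text features  -> shared space

def encode(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project features into the shared space and L2-normalize,
    so dot products become cosine similarities."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Toy batch: 4 image/text pairs of raw features.
img_feats = rng.normal(size=(4, 64))
txt_feats = rng.normal(size=(4, 48))

img_emb = encode(img_feats, W_img)
txt_emb = encode(txt_feats, W_txt)

# CLIP-style similarity matrix: entry (i, j) scores image i against caption j.
similarity = img_emb @ txt_emb.T
print(similarity.shape)  # (4, 4)
```

A monolithic (end-to-end fusion) design, by contrast, would feed both modalities through one joint network rather than comparing independently computed embeddings.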
Key topics include:
- Vision encoder backbones and their integration into multimodal systems
- Compositional vs. monolithic design trade-offs in VLMs
- The problem of vision-language alignment, including representational collapse and semantic grounding
- Evaluation challenges, including robustness, generalization, and downstream task performance
- Strategies for generating high-quality visual representations in a self-supervised or weakly supervised setting
- Scaling laws and data curation strategies for multimodal training
- Emerging paradigms in VLMs, such as image-to-text generation, visual reasoning, and vision-based prompting of LLMs
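The vision-language alignment problem listed above is typically addressed with a contrastive objective. As an illustration, the sketch below implements the symmetric InfoNCE-style loss popularized by CLIP-style training: matched image-text pairs (the diagonal of the similarity matrix) are treated as positives, all other pairings in the batch as negatives. The embeddings are random stand-ins, and the temperature value is an assumed placeholder, not a quoted hyperparameter.

```python
# Sketch of a symmetric contrastive (InfoNCE-style) alignment objective.
import numpy as np

rng = np.random.default_rng(1)

def log_softmax(x: np.ndarray) -> np.ndarray:
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.
    Diagonal entries (matched pairs) are the positives."""
    logits = (img_emb @ txt_emb.T) / temperature
    n = logits.shape[0]
    idx = np.arange(n)
    loss_i2t = -log_softmax(logits)[idx, idx].mean()   # image -> text
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

def normalize(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Unit-normalized toy embeddings for 4 image-text pairs.
img_emb = normalize(rng.normal(size=(4, 32)))
txt_emb = normalize(rng.normal(size=(4, 32)))
loss = contrastive_loss(img_emb, txt_emb)
```

Representational collapse, also listed above, is exactly the failure mode this objective tries to prevent: the in-batch negatives penalize a model that maps all inputs to the same embedding.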
Through weekly readings, discussions, group projects, and a final presentation or report, students will gain a critical understanding of the design space and limitations of current vision-language systems and will be equipped to propose or prototype novel architectures or training regimes.
Learning Objectives:
By the end of this course, students will...
… understand the foundations and current trends in the design of vision-language models.
… be able to critically analyze architectural choices and training paradigms for VLMs.
… gain hands-on experience developing or evaluating VLM components through collaborative projects.
… learn to effectively communicate findings in multimodal AI research via presentations and reports.
Prerequisites:
Some prior experience with machine learning and deep learning is expected. Familiarity with neural network architectures (e.g., CNNs, Transformers) and basic NLP is recommended.
Course Format:
- Delivery Method: in-person or online lectures with recordings available
- Type of Contact & Contact Hours: weekly seminar sessions
- Selection Process: limited number of participants; selection via motivation letter
Assessment and Grading:
- Active participation in discussions and weekly readings
- Group project and presentation during the semester
- Final report or presentation on a selected sub-topic in vision-language modeling
Required Texts and Materials or Further Resources:
- Relevant papers, codebases, and reading material will be provided over the course of the semester via Stud.IP or direct links.
Important Dates:
- Registration Deadline on HisInOne/EXA: 31.10.2025
- De-Registration Deadline on HisInOne/EXA: 06.02.2026
Will this class be offered again/regularly?
This course may be offered again, depending on instructor availability and demand. It is part of a rotating series of “Introductory and Advanced Topics in NLP.”
