Pinned Repositories
3d-attention-video-understanding
Using a 3D Nearby Self-Attention Transformer to leverage the spatiotemporal nature of video for representation learning.
cars196-classifier
cross-modal-speech-segment-retrieval
Learning a common representation space from speech and text for cross-modal retrieval given textual queries and speech files.
hierarchical-language-modeling
We learn contextualized word, sentence, and document representations with a hierarchical language model that stacks Transformer-based encoders first at the sentence level and then at the document level and performs masked token prediction.
joint-nas-hpo
Automatically improving and analyzing the performance of a neural network on a fashion classification dataset. Instead of considering the architecture and hyperparameters separately, we build a system that optimizes them jointly.
kg-augmented-lm
Leveraging knowledge graphs to learn a more factually grounded language model for downstream retrieval and question-answering tasks.
multimodal-self-distillation
A generalized self-supervised training paradigm for unimodal and multimodal alignment and fusion.
seminar_multimodal_dl
https://slds-lmu.github.io/seminar_multimodal_dl/
marcomoldovan's Repositories
marcomoldovan/hierarchical-language-modeling
We learn contextualized word, sentence, and document representations with a hierarchical language model that stacks Transformer-based encoders first at the sentence level and then at the document level and performs masked token prediction.
marcomoldovan/multimodal-self-distillation
A generalized self-supervised training paradigm for unimodal and multimodal alignment and fusion.
marcomoldovan/3d-attention-video-understanding
Using a 3D Nearby Self-Attention Transformer to leverage the spatiotemporal nature of video for representation learning.
marcomoldovan/cars196-classifier
marcomoldovan/cross-modal-speech-segment-retrieval
Learning a common representation space from speech and text for cross-modal retrieval given textual queries and speech files.
marcomoldovan/joint-nas-hpo
Automatically improving and analyzing the performance of a neural network on a fashion classification dataset. Instead of considering the architecture and hyperparameters separately, we build a system that optimizes them jointly.
marcomoldovan/kg-augmented-lm
Leveraging knowledge graphs to learn a more factually grounded language model for downstream retrieval and question-answering tasks.
marcomoldovan/seminar_multimodal_dl
https://slds-lmu.github.io/seminar_multimodal_dl/