
This dataset provides utterance-level annotations for multimodal sentiment analysis derived from publicly available YouTube videos. Each data instance corresponds to a single utterance and aligns three modalities: the transcribed text, the corresponding audio segment, and visual information extracted as video keyframes. Sentiment labels are manually annotated at the utterance level to capture fine-grained affective expressions within conversational contexts. The dataset is designed to support research in multimodal learning, affective computing, and large language model (LLM)-based sentiment analysis. It can be used for benchmarking sentiment classification models, evaluating multimodal fusion strategies, and exploring zero-shot or fine-tuning approaches with vision–language and audio–text models. All data are provided for research and educational purposes only.
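
The description above does not specify a file format or field schema, so the sketch below illustrates one plausible per-utterance record layout in Python. All field names (`utterance_id`, `audio_path`, `keyframe_paths`, etc.) and the example values are assumptions made for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UtteranceInstance:
    """One utterance-level record with aligned multimodal fields.

    Field names are illustrative; the dataset's actual schema may differ.
    """
    utterance_id: str          # unique identifier for the utterance
    video_id: str              # identifier of the source YouTube video
    transcript: str            # transcribed text of the utterance
    audio_path: str            # path to the utterance's audio segment
    keyframe_paths: List[str]  # paths to the extracted video keyframes
    sentiment: str             # manually annotated utterance-level label

# Example instance with placeholder values (hypothetical paths and label).
example = UtteranceInstance(
    utterance_id="vid001_utt03",
    video_id="vid001",
    transcript="I really enjoyed this part of the talk.",
    audio_path="audio/vid001_utt03.wav",
    keyframe_paths=["frames/vid001_utt03_f0.jpg", "frames/vid001_utt03_f1.jpg"],
    sentiment="positive",
)
print(example.sentiment)
```

Storing the keyframes as a list keeps the visual modality variable-length per utterance, which suits keyframe extraction where the number of frames depends on utterance duration.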
multimodal sentiment analysis, video sentiment, utterance-level annotation, large language models, affective computing
