Grounding Language in Images and Videos.
- Material Type
- Dissertation
- Control Number
- 0017161970
- International Standard Book Number
- 9798382652443
- Dewey Decimal Classification Number
- 004
- Main Entry-Personal Name
- Sadhu, Arka.
- Publication, Distribution, etc. (Imprint)
- [S.l.] : University of Southern California, 2024
- Publication, Distribution, etc. (Imprint)
- Ann Arbor : ProQuest Dissertations & Theses, 2024
- Physical Description
- 244 p.
- General Note
- Source: Dissertations Abstracts International, Volume: 85-11, Section: A.
- General Note
- Advisor: Nevatia, Ramakant.
- Dissertation Note
- Thesis (Ph.D.)--University of Southern California, 2024.
- Summary, Etc.
- While machine learning research has traditionally explored image, video, and text understanding as separate fields, the surge in multi-modal content in today's digital landscape underscores the importance of computational models that adeptly navigate complex interactions between text, images, and videos. This dissertation addresses the challenge of grounding language in visual media: the task of associating linguistic symbols with perceptual experiences and actions. The overarching goal is to bridge the gap between language and vision as a means to a "deeper understanding" of images and videos, enabling models capable of reasoning over longer time horizons such as hour-long movies, collections of images, or even multiple videos. A pivotal contribution of my work is the use of Semantic Roles for images, videos, and text. Unlike previous works that primarily focused on recognizing single entities or generating holistic captions, Semantic Roles facilitate a fine-grained understanding of "who did what to whom" in a structured format. They retain the flexibility of free-form language phrases while remaining as comprehensive and complete as entity recognition, thus enriching a model's interpretive capabilities. This thesis introduces the vision-language tasks developed during my Ph.D.: grounding unseen words, spatio-temporal localization of entities in a video, video question answering, visual semantic role labeling in videos, reasoning across more than one image or video, and, finally, weakly supervised open-vocabulary object detection. Each task investigates a particular phenomenon of image or video understanding in isolation and is accompanied by dedicated datasets, model frameworks, and evaluation protocols robust to data priors. The resulting models can be used for downstream tasks such as extracting common-sense knowledge graphs from instructional videos, or to drive end-user applications such as retrieval, question answering, and captioning. By facilitating the deeper integration of language and vision, this dissertation represents a step forward toward machine learning models capable of a finer understanding of the world around us. (A minimal illustrative sketch of such a semantic-role frame follows this record.)
- Subject Added Entry-Topical Term
- Computer science.
- Subject Added Entry-Topical Term
- Computer engineering.
- Subject Added Entry-Topical Term
- Linguistics.
- Index Term-Uncontrolled
- Computer vision
- Index Term-Uncontrolled
- Image understanding
- Index Term-Uncontrolled
- Machine learning
- Index Term-Uncontrolled
- Natural language processing
- Index Term-Uncontrolled
- Video understanding
- Added Entry-Corporate Name
- University of Southern California Computer Science
- Host Item Entry
- Dissertations Abstracts International. 85-11A.
- Electronic Location and Access
- This material is available after login.
- Control Number
- joongbu:657024
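
The abstract's "who did what to whom" structure is essentially a semantic-role frame attached to a visual event. As a rough, hypothetical illustration only (the dissertation defines its own datasets and annotation formats, which are not reproduced here), such a frame might pair a verb with role-labeled free-form phrases, each optionally grounded to boxes in video frames:

```python
# Hypothetical sketch, not code or a format from the dissertation: one way to
# represent a grounded semantic-role frame ("who did what to whom") for a
# video event.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RoleArgument:
    role: str    # PropBank-style label, e.g. "Arg0" (agent), "Arg1" (patient)
    phrase: str  # free-form language phrase filling the role
    # Per-frame boxes (frame_index, x1, y1, x2, y2), coordinates in [0, 1];
    # empty when the argument is mentioned but not visually grounded.
    boxes: List[Tuple[int, float, float, float, float]] = field(default_factory=list)


@dataclass
class EventFrame:
    verb: str       # the predicate, e.g. "throw"
    start_sec: float  # event start time within the video
    end_sec: float    # event end time within the video
    arguments: List[RoleArgument] = field(default_factory=list)


# Example event: "a man throws a ball to a dog in the park"
event = EventFrame(
    verb="throw",
    start_sec=3.2,
    end_sec=5.0,
    arguments=[
        RoleArgument("Arg0", "a man", boxes=[(96, 0.10, 0.20, 0.45, 0.90)]),
        RoleArgument("Arg1", "a ball", boxes=[(96, 0.48, 0.30, 0.55, 0.40)]),
        RoleArgument("Arg2", "to a dog"),
        RoleArgument("ArgM-LOC", "in the park"),
    ],
)

for arg in event.arguments:
    grounded = "grounded" if arg.boxes else "ungrounded"
    print(f"{event.verb} / {arg.role}: {arg.phrase} ({grounded})")
```

The role labels here follow PropBank convention; the box lists are what distinguish visual semantic role labeling from purely textual SRL, since each linguistic phrase is tied to a spatio-temporal region of the video rather than only to a span of text.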