Grounding Language in Images and Videos.
Material Type  
 Thesis/Dissertation
Control Number  
0017161970
International Standard Book Number  
9798382652443
Dewey Decimal Classification Number  
004
Main Entry-Personal Name  
Sadhu, Arka.
Publication, Distribution, etc. (Imprint)  
[S.l.] : University of Southern California., 2024
Publication, Distribution, etc. (Imprint)  
Ann Arbor : ProQuest Dissertations & Theses, 2024
Physical Description  
244 p.
General Note  
Source: Dissertations Abstracts International, Volume: 85-11, Section: A.
General Note  
Advisor: Nevatia, Ramakant.
Dissertation Note  
Thesis (Ph.D.)--University of Southern California, 2024.
Summary, Etc.  
While machine learning research has traditionally explored image, video, and text understanding as separate fields, the surge in multi-modal content in today's digital landscape underscores the importance of computational models that adeptly navigate the complex interactions between text, images, and videos. This dissertation addresses the challenge of grounding language in visual media: the task of associating linguistic symbols with perceptual experiences and actions. The overarching goal of this dissertation is to bridge the gap between language and vision as a means to a "deeper understanding" of images and videos, enabling models capable of reasoning over longer time horizons such as hour-long movies, a collection of images, or even multiple videos.

A pivotal contribution of this work is the use of Semantic Roles for images, videos, and text. Unlike previous works that primarily focused on recognizing single entities or generating holistic captions, Semantic Roles facilitate a fine-grained understanding of "who did what to whom" in a structured format. They retain the advantages of free-form language phrases while also being comprehensive and complete like entity recognition, thus enriching the model's interpretive capabilities.

This thesis introduces the various vision-language tasks developed during the author's Ph.D.: grounding unseen words, spatio-temporal localization of entities in a video, video question answering, visual semantic role labeling in videos, reasoning across more than one image or video, and finally, weakly-supervised open-vocabulary object detection. Each task is accompanied by the creation and development of dedicated datasets, evaluation protocols, and model frameworks. These tasks aim to investigate a particular phenomenon inherent in image or video understanding in isolation, develop corresponding datasets and model frameworks, and outline evaluation protocols robust to data priors. The resulting models can be used for other downstream tasks, such as obtaining common-sense knowledge graphs from instructional videos, or can drive end-user applications like retrieval, question answering, and captioning. By facilitating the deeper integration of language and vision, this dissertation represents a step forward in machine learning models capable of a finer understanding of the world around us.
Subject Added Entry-Topical Term  
Computer science.
Subject Added Entry-Topical Term  
Computer engineering.
Subject Added Entry-Topical Term  
Linguistics.
Index Term-Uncontrolled  
Computer vision
Index Term-Uncontrolled  
Image understanding
Index Term-Uncontrolled  
Machine learning
Index Term-Uncontrolled  
Natural language processing
Index Term-Uncontrolled  
Video understanding
Added Entry-Corporate Name  
University of Southern California Computer Science
Host Item Entry  
Dissertations Abstracts International. 85-11A.
Electronic Location and Access  
This material is available after logging in.
Control Number  
joongbu:657024

Holdings  
Reg No.: TQ0033242 / Call No.: T / Location: Online resource / Status: Available to view and print / Loan Info: Available to view and print