Adapting Pre-trained Models and Leveraging Targeted Multilinguality for Under-Resourced and Endangered Language Processing.
- Material Type
- Thesis/Dissertation
- Control Number
- 0017162352
- International Standard Book Number
- 9798383223857
- Dewey Decimal Classification Number
- 401
- Main Entry-Personal Name
- Downey, C. M.
- Publication, Distribution, etc. (Imprint)
- [S.l.] : University of Washington, 2024
- Publication, Distribution, etc. (Imprint)
- Ann Arbor : ProQuest Dissertations & Theses, 2024
- Physical Description
- 132 p.
- General Note
- Source: Dissertations Abstracts International, Volume: 86-01, Section: B.
- General Note
- Advisor: Levow, Gina-Anne; Steinert-Threlkeld, Shane.
- Dissertation Note
- Thesis (Ph.D.)--University of Washington, 2024.
- Summary, Etc.
- Advances in Natural Language Processing (NLP) over the past decade have largely been driven by the scale of data and computation used to train large neural network-based models. However, these techniques are inapplicable to the vast majority of the world's languages, which lack the large digitized text datasets available for English and a few other very high-resource languages. In this dissertation, we present three case studies for extending NLP applications to under-resourced languages: unsupervised morphological segmentation for extremely low-resource languages via multilingual training and transfer, optimizing the vocabulary of a pre-trained cross-lingual model for specific target language(s), and specializing a pre-trained model for a low-resource language family (Uralic). Based on these case studies, we argue for three broad guiding principles for extending NLP applications to under-resourced languages. First: where possible, robustly pre-trained models and representations should be leveraged. Second: components of pre-trained models that are not optimized for new languages should be substituted or substantially adapted. Third: targeted multilingual training provides a middle ground between the lack of adequate data to train models for individual under-resourced languages on one hand, and the diminishing returns of "massively multilingual" training on the other.
- Subject Added Entry-Topical Term
- Linguistics.
- Subject Added Entry-Topical Term
- Computer science.
- Subject Added Entry-Topical Term
- Language.
- Index Term-Uncontrolled
- Natural Language Processing
- Index Term-Uncontrolled
- Uralic
- Index Term-Uncontrolled
- Multilinguality
- Index Term-Uncontrolled
- Vocabulary
- Added Entry-Corporate Name
- University of Washington. Linguistics
- Host Item Entry
- Dissertations Abstracts International. 86-01B.
- Electronic Location and Access
- This resource is available after logging in.
- Control Number
- joongbu:655700