Adapting Pre-trained Models and Leveraging Targeted Multilinguality for Under-Resourced and Endangered Language Processing.
Detailed Information
- 자료유형
- 학위논문
- Control Number
- 0017162352
- International Standard Book Number
- 9798383223857
- Dewey Decimal Classification Number
- 401
- Main Entry-Personal Name
- Downey, C. M.
- Publication, Distribution, etc. (Imprint)
- [S.l.] : University of Washington, 2024
- Publication, Distribution, etc. (Imprint)
- Ann Arbor : ProQuest Dissertations & Theses, 2024
- Physical Description
- 132 p.
- General Note
- Source: Dissertations Abstracts International, Volume: 86-01, Section: B.
- General Note
- Advisor: Levow, Gina-Anne; Steinert-Threlkeld, Shane.
- Dissertation Note
- Thesis (Ph.D.)--University of Washington, 2024.
- Summary, Etc.
- Advances in Natural Language Processing (NLP) over the past decade have largely been driven by the scale of data and computation used to train large neural network-based models. However, these techniques are inapplicable to the vast majority of the world's languages, which lack the vast digitized text datasets available for English and a few other very high-resource languages. In this dissertation, we present three case studies for extending NLP applications to under-resourced languages. These case studies include conducting unsupervised morphological segmentation for extremely low-resource languages via multilingual training and transfer, optimizing the vocabulary of a pre-trained cross-lingual model for specific target language(s), and specializing a pre-trained model for a low-resource language family (Uralic). Based on these case studies, we argue for three broad, guiding principles in extending NLP applications to under-resourced languages. First: where possible, robustly pre-trained models and representations should be leveraged. Second: components of pre-trained models that are not optimized for new languages should be substituted or substantially adapted. Third: targeted multilingual training provides a middle ground between the lack of adequate data to train models for individual under-resourced languages on one hand, and the diminishing returns of "massively multilingual" training on the other. (A rough illustrative sketch of the vocabulary-adaptation idea follows this record.)
- Subject Added Entry-Topical Term
- Linguistics.
- Subject Added Entry-Topical Term
- Computer science.
- Subject Added Entry-Topical Term
- Language.
- Index Term-Uncontrolled
- Natural Language Processing
- Index Term-Uncontrolled
- Uralic
- Index Term-Uncontrolled
- Multilinguality
- Index Term-Uncontrolled
- Vocabulary
- Added Entry-Corporate Name
- University of Washington. Linguistics
- Host Item Entry
- Dissertations Abstracts International. 86-01B.
- Electronic Location and Access
- This material can be viewed after logging in.
- Control Number
- joongbu:655700
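
The abstract above mentions optimizing the vocabulary of a pre-trained cross-lingual model for specific target languages. The sketch below is a rough, hypothetical illustration of that general idea, not the dissertation's actual method or code: it assumes the Hugging Face transformers library, uses xlm-roberta-base as a stand-in base model, and takes a placeholder corpus and vocabulary size. It trains a compact target-language tokenizer from the existing one, then rebuilds the embedding matrix, copying vectors for tokens shared with the old vocabulary and mean-initializing the rest.

```python
# Hypothetical sketch of vocabulary adaptation for a pre-trained
# multilingual model (assumed setup; not the dissertation's code).
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
old_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Placeholder: in practice, an iterator over a real target-language corpus.
target_corpus = ["Example target-language sentence one.",
                 "Example target-language sentence two."]

# Train a smaller tokenizer on the target language, reusing the original
# tokenizer's algorithm and special tokens. vocab_size=8000 is an assumption.
new_tokenizer = old_tokenizer.train_new_from_iterator(target_corpus,
                                                      vocab_size=8000)

# Build the new embedding matrix: copy vectors for tokens that also exist
# in the old vocabulary; initialize the rest to the mean old embedding.
old_emb = model.get_input_embeddings().weight.data
new_emb = old_emb.mean(dim=0, keepdim=True).repeat(len(new_tokenizer), 1)
old_vocab = old_tokenizer.get_vocab()
for token, new_id in new_tokenizer.get_vocab().items():
    if token in old_vocab:
        new_emb[new_id] = old_emb[old_vocab[token]]

# Swap in the new vocabulary size and embeddings; re-tie the LM head so the
# output layer shares the re-initialized input embeddings.
model.resize_token_embeddings(len(new_tokenizer))
model.get_input_embeddings().weight.data.copy_(new_emb)
model.tie_weights()
```

Continued pre-training on target-language text would then adapt the remaining weights, in the spirit of the abstract's second principle: components of a pre-trained model that are not optimized for the new languages should be substituted or substantially adapted.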
MARC
■008250224s2024 us ||||||||||||||c||eng d
■001000017162352
■00520250211152002
■006m o d
■007cr#unu||||||||
■020 ▼a9798383223857
■035 ▼a(MiAaPQ)AAI31330087
■040 ▼aMiAaPQ▼cMiAaPQ
■0820 ▼a401
■1001 ▼aDowney, C. M.
■24510▼aAdapting Pre-trained Models and Leveraging Targeted Multilinguality for Under-Resourced and Endangered Language Processing.
■260 ▼a[S.l.]▼bUniversity of Washington. ▼c2024
■260 1▼aAnn Arbor▼bProQuest Dissertations & Theses▼c2024
■300 ▼a132 p.
■500 ▼aSource: Dissertations Abstracts International, Volume: 86-01, Section: B.
■500 ▼aAdvisor: Levow, Gina-Anne;Steinert-Threlkeld, Shane.
■5021 ▼aThesis (Ph.D.)--University of Washington, 2024.
■520 ▼aAdvances in Natural Language Processing (NLP) over the past decade have largely been driven by the scale of data and computation used to train large neural network-based models. However, these techniques are inapplicable to the vast majority of the world's languages, which lack the vast digitized text datasets available for English and a few other very high-resource languages. In this dissertation, we present three case studies for extending NLP applications to under-resourced languages. These case studies include conducting unsupervised morphological segmentation for extremely low-resource languages via multilingual training and transfer, optimizing the vocabulary of a pre-trained cross-lingual model for specific target language(s), and specializing a pre-trained model for a low-resource language family (Uralic). Based on these case studies, we argue for three broad, guiding principles in extending NLP applications to under-resourced languages. First: where possible, robustly pre-trained models and representations should be leveraged. Second: components of pre-trained models that are not optimized for new languages should be substituted or substantially adapted. Third: targeted multilingual training provides a middle ground between the lack of adequate data to train models for individual under-resourced languages on one hand, and the diminishing returns of "massively multilingual" training on the other.
■590 ▼aSchool code: 0250.
■650 4▼aLinguistics.
■650 4▼aComputer science.
■650 4▼aLanguage.
■653 ▼aNatural Language Processing
■653 ▼aUralic
■653 ▼aMultilinguality
■653 ▼aVocabulary
■690 ▼a0290
■690 ▼a0984
■690 ▼a0679
■71020▼aUniversity of Washington▼bLinguistics.
■7730 ▼tDissertations Abstracts International▼g86-01B.
■790 ▼a0250
■791 ▼aPh.D.
■792 ▼a2024
■793 ▼aEnglish
■85640▼uhttp://www.riss.kr/pdu/ddodLink.do?id=T17162352▼nKERIS▼zThe full text of this material is provided by KERIS (Korea Education and Research Information Service).