Deep Learning Models of Polyadenylation Regulation and Applications in Genetics.
Details
- Material Type
- Dissertation
- Control Number
- 0017163566
- International Standard Book Number
- 9798384019442
- Dewey Decimal Classification Number
- 574
- Main Entry-Personal Name
- Stroup, Emily Kunce.
- Publication, Distribution, etc. (Imprint)
- [S.l.] : Northwestern University, 2024
- Publication, Distribution, etc. (Imprint)
- Ann Arbor : ProQuest Dissertations & Theses, 2024
- Physical Description
- 230 p.
- General Note
- Source: Dissertations Abstracts International, Volume: 86-02, Section: B.
- General Note
- Includes supplementary digital materials.
- General Note
- Advisor: Ji, Zhe.
- Dissertation Note
- Thesis (Ph.D.)--Northwestern University, 2024.
- Summary, Etc.
- RNA processing tightly couples transcription to the 5' capping, intron splicing, and 3'-end formation of pre-mRNA to produce a mature mRNA prepared for export to the cytoplasm and downstream translation. During formation of the 3'-end, the pre-mRNA undergoes endonucleolytic cleavage and the polyA tail is synthesized. This process is carried out by a collection of 20 core proteins that comprise the polyadenylation machinery and more than 60 auxiliary proteins that may interact with or modulate the activity of this complex. The core proteins assemble into several subcomplexes that recognize cis-regulatory elements surrounding a polyA site. In mammals, the polyadenylation complex is well-characterized; yet in other model species, such as yeast and plants, there remain many open questions regarding the controlling factors of polyadenylation activity.
  The signals that mediate polyadenylation activity must be optimized genome-wide to prevent spurious polyadenylation that may produce non-functional transcripts or proteins. For example, the canonical human polyadenylation signal AAUAAA and its close variants occur across the human genome, yet few of these motifs correspond to active polyA sites. In this work, we explore the formation, strength, and heterogeneity of cleavage sites in human, yeast, and plants. In doing so, we seek to better understand the regulation of polyadenylation in human health and disease, and across species.
  Using 3' Region Extraction and Deep Sequencing (3'READS) data for more than 100 human samples, we identify polyA sites genome-wide at single-nucleotide resolution and use them to develop a suite of deep learning models that interrogate the polyadenylation code. First, the PolyaID model uses the pre-mRNA sequence to classify a polyA site from the genomic sequence and simultaneously predict the cleavage profile in the immediate vicinity. Simultaneously, the PolyaStrength model predicts the sequence strength of a polyA site, as measured by the relative usage within each gene. We then systematically disrupt motifs surrounding polyA sites to identify the cis-regulatory elements that determine site classification and strength. This technique reveals the unique motif families that form polyA sites and their position-dependent importance.
  However, the sequence alone does not fully explain the use of alternative polyadenylation (APA) sites. We develop APA-specific logistic regression models for composite intronic, skipped intronic, and terminal exonic sites to predict their usage based on the surrounding gene structure. During intronic APA, polyadenylation competes with splicing and tends to occur when splice sites are weakened. In the terminal exons and 3'UTRs, the distance between proximal and distal sites and the distance to the downstream antisense gene govern site use. Finally, we apply our suite of models to the genetic databases ClinVar, GWAS Catalog, and UK BioBank and discover thousands of variants expected to alter APA dynamics.
  We extend this approach to S. cerevisiae, S. pombe, and A. thaliana, developing highly accurate PolyaClassifier models to distinguish cleavage sites at nucleotide resolution. Using motif disruption, we reveal the unbiased motif families that define polyA sites in these species and discover that they are highly species-specific despite the conservation of many proteins in the polyadenylation complex. In these non-metazoan species, the polyadenylation machinery tolerates significant degeneracy in these key motifs compared to humans.
  In summary, we develop a collection of species-specific deep learning models that reveal the motifs mediating polyA site formation, heterogeneity, and strength in diverse species and their quantitative contributions to these processes. These models capture the combinatorial regulation of polyadenylation activity in healthy conditions and are sufficiently accurate and sensitive to examine the dysregulation of APA during disease and stress.
  Supplementary Materials
  Several supplementary data files are part of this dissertation and referenced in the text where appropriate. The names, contents, and purpose of these files are described here:
  • Supplementary File 1 contains the human polyA sites identified using the 3'READS datasets. Each polyA site is annotated with the gene and gene sub-region where the site is found, the dominant PAS signal, the number of supporting 3'READS, the relative usage compared to other sites in the gene, and the vector of surrounding 3'READS ± 50 nt upstream and downstream around the maximum cleavage site.
  • Supplementary File 2 contains the cis-regulatory motifs regulating human polyadenylation activity. Each motif is annotated with the relevant motif family, the nucleotide content of the motif, the max and mean disruption position, and the sum importance score at each position ± 120 nt upstream and downstream relative to the maximum cleavage site.
  • Supplementary File 3 contains predictions for the PAS pA-QTLs from Shulman and Elkon (2020). Each variant is annotated with the allele where the variant is located, the effect of the variant, the PAS signal for each allele, and the human PolyaID and PolyaStrength predictions.
  • Supplementary File 4 contains variants from ClinVar, the UK BioBank, and GWAS Catalog expected to impact polyadenylation activity. Variants are split into tabs based on the source of the variant and the location in terminal exons or introns. Each variant is annotated with the variant ID and location, the reference and alternative allele sequences, the gene and sub-gene region where the site is found, the position of and distance to the polyA site, the PolyaID and PolyaStrength predictions, the composite or skipped intronic usage prediction (if relevant), the polyadenylation-associated motifs affected by the variant, and relevant clinical or trait information from the original database.
  • Supplementary File 5 contains the motifs showing significant importance scores from the PolyaClassifier model in S. cerevisiae. Results for significant 4mers, 5mers, and 6mers are included. Each motif is annotated with the relevant motif family, the mean motif importance score and 95% confidence interval in key regions, and the sum importance profile ± 250 nt upstream and downstream of the cleavage site.
  • Supplementary File 6 contains the polyA sites identified in S. cerevisiae. The locations and summary characteristics for clustered polyA sites are shown. Each polyA site is annotated with the genomic location, the representative max cleavage site, the gene and sub-gene region where the site is located, the relative position of the site, the supporting 3'READS, and relative usage compared to other sites in the same gene.
  • Supplementary File 7 contains the motifs showing significant importance scores from the PolyaStrength model in S. cerevisiae. Results for significant 6mers are included. Each motif is annotated with the relevant motif family, the mean motif importance score and 95% confidence interval in key regions, and the sum importance profile ± 250 nt upstream and downstream of the cleavage site.
  • Supplementary File 8 contains the polyA sites identified in S. pombe. The locations and summary characteristics for clustered polyA sites are shown. Each polyA site is annotated with the genomic location, the representative max cleavage site, the gene and sub-gene region where the site is located, the relative position of the site, the supporting 3'READS, and relative usage compared to other sites in the same gene.
  • Supplementary File 9 contains the motifs showing significant importance scores from the PolyaClassifier model in S. pombe. Results for significant 4mers, 5mers, and 6mers are included. Each motif is annotated with the relevant motif family, the mean motif importance score and 95% confidence interval in key regions, and the sum importance profile ± 250 nt upstream and downstream of the cleavage site.
  • Supplementary File 10 contains the polyA sites identified in A. thaliana. The locations and summary characteristics for clustered polyA sites are shown. Each polyA site is annotated with the genomic location, the representative max cleavage site, the gene and sub-gene region where the site is located, the relative position of the site, the supporting 3'READS, and relative usage compared to other sites in the same gene.
  • Supplementary File 11 contains the motifs showing significant importance scores from the PolyaClassifier model in A. thaliana. Results for significant 4mers, 5mers, and 6mers are included. Each motif is annotated with the relevant motif family, the mean motif importance score and 95% confidence interval in key regions, and the sum importance profile ± 250 nt upstream and downstream of the cleavage site.
- Subject Added Entry-Topical Term
- Bioinformatics.
- Subject Added Entry-Topical Term
- Biology.
- Subject Added Entry-Topical Term
- Genetics.
- Index Term-Uncontrolled
- Computational modeling
- Index Term-Uncontrolled
- Deep learning
- Index Term-Uncontrolled
- Genomics
- Index Term-Uncontrolled
- Human genetics
- Index Term-Uncontrolled
- Machine learning
- Index Term-Uncontrolled
- Polyadenylation
- Added Entry-Corporate Name
- Northwestern University Driskill Graduate Training Program in Life Sciences
- Host Item Entry
- Dissertations Abstracts International. 86-02B.
- Electronic Location and Access
- This item is available after logging in.
- Control Number
- joongbu:658517
MARC
008250224s2024 us ||||||||||||||c||eng d
■001000017163566
■00520250211152725
■006m o d
■007cr#unu||||||||
■020 ▼a9798384019442
■035 ▼a(MiAaPQ)AAI31490052
■040 ▼aMiAaPQ▼cMiAaPQ
■0820 ▼a574
■1001 ▼aStroup, Emily Kunce.▼0(orcid)0000-0001-8053-0295
■24510▼aDeep Learning Models of Polyadenylation Regulation and Applications in Genetics.
■260 ▼a[S.l.]▼bNorthwestern University. ▼c2024
■260 1▼aAnn Arbor▼bProQuest Dissertations & Theses▼c2024
■300 ▼a230 p.
■500 ▼aSource: Dissertations Abstracts International, Volume: 86-02, Section: B.
■500 ▼aIncludes supplementary digital materials.
■500 ▼aAdvisor: Ji, Zhe.
■5021 ▼aThesis (Ph.D.)--Northwestern University, 2024.
■590 ▼aSchool code: 0163.
■650 4▼aBioinformatics.
■650 4▼aBiology.
■650 4▼aGenetics.
■653 ▼aComputational modeling
■653 ▼aDeep learning
■653 ▼aGenomics
■653 ▼aHuman genetics
■653 ▼aMachine learning
■653 ▼aPolyadenylation
■690 ▼a0715
■690 ▼a0306
■690 ▼a0369
■690 ▼a0800
■71020▼aNorthwestern University▼bDriskill Graduate Training Program in Life Sciences.
■7730 ▼tDissertations Abstracts International▼g86-02B.
■790 ▼a0163
■791 ▼aPh.D.
■792 ▼a2024
■793 ▼aEnglish
■85640▼uhttp://www.riss.kr/pdu/ddodLink.do?id=T17163566▼nKERIS▼zThe full text of this item is provided by the Korea Education and Research Information Service (KERIS).