Deep Learning Models of Polyadenylation Regulation and Applications in Genetics.

Details

Material Type  
 Thesis
Control Number  
0017163566
International Standard Book Number  
9798384019442
Dewey Decimal Classification Number  
574
Main Entry-Personal Name  
Stroup, Emily Kunce.
Publication, Distribution, etc. (Imprint)  
[S.l.] : Northwestern University, 2024
Publication, Distribution, etc. (Imprint)  
Ann Arbor : ProQuest Dissertations & Theses, 2024
Physical Description  
230 p.
General Note  
Source: Dissertations Abstracts International, Volume: 86-02, Section: B.
General Note  
Includes supplementary digital materials.
General Note  
Advisor: Ji, Zhe.
Dissertation Note  
Thesis (Ph.D.)--Northwestern University, 2024.
Summary, Etc.  
RNA processing tightly couples transcription to the 5' capping, intron splicing, and 3'-end formation of pre-mRNA to produce a mature mRNA prepared for export to the cytoplasm and downstream translation. During formation of the 3'-end, the pre-mRNA undergoes endonucleolytic cleavage and the polyA tail is synthesized. This process is carried out by a collection of 20 core proteins that comprise the polyadenylation machinery and more than 60 auxiliary proteins that may interact with or modulate the activity of this complex. The core proteins assemble into several subcomplexes that recognize cis-regulatory elements surrounding a polyA site. In mammals, the polyadenylation complex is well-characterized; yet in other model species, such as yeast and plants, there remain many open questions regarding the controlling factors of polyadenylation activity.

The signals that mediate polyadenylation activity must be optimized genome-wide to prevent spurious polyadenylation that may produce non-functional transcripts or proteins. For example, the canonical human polyadenylation signal AAUAAA and its close variants occur across the human genome, yet few of these motifs correspond to active polyA sites. In this work, we explore the formation, strength, and heterogeneity of cleavage sites in human, yeast, and plants. In doing so, we seek to better understand the regulation of polyadenylation in human health and disease, and across species.

Using 3' Region Extraction and Deep Sequencing (3'READS) data for more than 100 human samples, we identify polyA sites genome-wide at single-nucleotide resolution and use them to develop a suite of deep learning models that interrogate the polyadenylation code. First, the PolyaID model uses the pre-mRNA sequence to classify a polyA site from the genomic sequence and simultaneously predict the cleavage profile in the immediate vicinity. In parallel, the PolyaStrength model predicts the sequence strength of a polyA site, as measured by the relative usage within each gene. We then systematically disrupt motifs surrounding polyA sites to identify the cis-regulatory elements that determine site classification and strength. This technique reveals the unique motif families that form polyA sites and their position-dependent importance.

However, the sequence alone does not fully explain the use of alternative polyadenylation (APA) sites. We develop APA-specific logistic regression models for composite intronic, skipped intronic, and terminal exonic sites to predict their usage based on the surrounding gene structure. During intronic APA, polyadenylation competes with splicing and tends to occur when splice sites are weakened. In the terminal exons and 3'UTRs, the distance between proximal and distal sites and the distance to the downstream antisense gene govern site use. Finally, we apply our suite of models to the genetic databases ClinVar, GWAS Catalog, and UK Biobank and discover thousands of variants expected to alter APA dynamics.

We extend this approach to S. cerevisiae, S. pombe, and A. thaliana, developing highly accurate PolyaClassifier models to distinguish cleavage sites at nucleotide resolution. Using motif disruption, we reveal the unbiased motif families that define polyA sites in these species and discover that they are highly species-specific despite the conservation of many proteins in the polyadenylation complex. In these non-metazoan species, the polyadenylation machinery tolerates significant degeneracy in these key motifs compared to humans.

In summary, we develop a collection of species-specific deep learning models that reveal the motifs mediating polyA site formation, heterogeneity, and strength in diverse species and their quantitative contributions to these processes. These models capture the combinatorial regulation of polyadenylation activity in healthy conditions and are sufficiently accurate and sensitive to examine the dysregulation of APA during disease and stress.

Supplementary Materials

Several supplementary data files are part of this dissertation and referenced in the text where appropriate. The names, contents, and purpose of these files are described here:

• Supplementary File 1 contains the human polyA sites identified using the 3'READS datasets. Each polyA site is annotated with the gene and gene sub-region where the site is found, the dominant PAS signal, the number of supporting 3'READS, the relative usage compared to other sites in the gene, and the vector of surrounding 3'READS ± 50 nt upstream and downstream around the maximum cleavage site.
• Supplementary File 2 contains the cis-regulatory motifs regulating human polyadenylation activity. Each motif is annotated with the relevant motif family, the nucleotide content of the motif, the max and mean disruption position, and the sum importance score at each position ± 120 nt upstream and downstream relative to the maximum cleavage site.
• Supplementary File 3 contains predictions for the PAS pA-QTLs from Shulman and Elkon (2020). Each variant is annotated with the allele where the variant is located, the effect of the variant, the PAS signal for each allele, and the human PolyaID and PolyaStrength predictions.
• Supplementary File 4 contains variants from ClinVar, the UK Biobank, and GWAS Catalog expected to impact polyadenylation activity. Variants are split into tabs based on the source of the variant and the location in terminal exons or introns. Each variant is annotated with the variant ID and location, the reference and alternative allele sequences, the gene and sub-gene region where the site is found, the position of and distance to the polyA site, the PolyaID and PolyaStrength predictions, the composite or skipped intronic usage prediction (if relevant), the polyadenylation-associated motifs affected by the variant, and relevant clinical or trait information from the original database.
• Supplementary File 5 contains the motifs showing significant importance scores from the PolyaClassifier model in S. cerevisiae. Results for significant 4mers, 5mers, and 6mers are included. Each motif is annotated with the relevant motif family, the mean motif importance score and 95% confidence interval in key regions, and the sum importance profile ± 250 nt upstream and downstream of the cleavage site.
• Supplementary File 6 contains the polyA sites identified in S. cerevisiae. The locations and summary characteristics for clustered polyA sites are shown. Each polyA site is annotated with the genomic location, the representative max cleavage site, the gene and sub-gene region where the site is located, the relative position of the site, the supporting 3'READS, and relative usage compared to other sites in the same gene.
• Supplementary File 7 contains the motifs showing significant importance scores from the PolyaStrength model in S. cerevisiae. Results for significant 6mers are included. Each motif is annotated with the relevant motif family, the mean motif importance score and 95% confidence interval in key regions, and the sum importance profile ± 250 nt upstream and downstream of the cleavage site.
• Supplementary File 8 contains the polyA sites identified in S. pombe. The locations and summary characteristics for clustered polyA sites are shown. Each polyA site is annotated with the genomic location, the representative max cleavage site, the gene and sub-gene region where the site is located, the relative position of the site, the supporting 3'READS, and relative usage compared to other sites in the same gene.
• Supplementary File 9 contains the motifs showing significant importance scores from the PolyaClassifier model in S. pombe. Results for significant 4mers, 5mers, and 6mers are included. Each motif is annotated with the relevant motif family, the mean motif importance score and 95% confidence interval in key regions, and the sum importance profile ± 250 nt upstream and downstream of the cleavage site.
• Supplementary File 10 contains the polyA sites identified in A. thaliana. The locations and summary characteristics for clustered polyA sites are shown. Each polyA site is annotated with the genomic location, the representative max cleavage site, the gene and sub-gene region where the site is located, the relative position of the site, the supporting 3'READS, and relative usage compared to other sites in the same gene.
• Supplementary File 11 contains the motifs showing significant importance scores from the PolyaClassifier model in A. thaliana. Results for significant 4mers, 5mers, and 6mers are included. Each motif is annotated with the relevant motif family, the mean motif importance score and 95% confidence interval in key regions, and the sum importance profile ± 250 nt upstream and downstream of the cleavage site.
Subject Added Entry-Topical Term  
Bioinformatics.
Subject Added Entry-Topical Term  
Biology.
Subject Added Entry-Topical Term  
Genetics.
Index Term-Uncontrolled  
Computational modeling
Index Term-Uncontrolled  
Deep learning
Index Term-Uncontrolled  
Genomics
Index Term-Uncontrolled  
Human genetics
Index Term-Uncontrolled  
Machine learning
Index Term-Uncontrolled  
Polyadenylation
Added Entry-Corporate Name  
Northwestern University. Driskill Graduate Training Program in Life Sciences
Host Item Entry  
Dissertations Abstracts International. 86-02B.
Electronic Location and Access  
This item is available after logging in.
Control Number  
joongbu:658517

MARC

■008250224s2024        us  ||||||||||||||c||eng  d
■001000017163566
■00520250211152725
■006m          o    d                
■007cr#unu||||||||
■020    ▼a9798384019442
■035    ▼a(MiAaPQ)AAI31490052
■040    ▼aMiAaPQ▼cMiAaPQ
■0820  ▼a574
■1001  ▼aStroup,  Emily  Kunce.▼0(orcid)0000-0001-8053-0295
■24510▼aDeep  Learning  Models  of  Polyadenylation  Regulation  and  Applications  in  Genetics.
■260    ▼a[S.l.]▼bNorthwestern  University.  ▼c2024
■260  1▼aAnn  Arbor▼bProQuest  Dissertations  &  Theses▼c2024
■300    ▼a230  p.
■500    ▼aSource:  Dissertations  Abstracts  International,  Volume:  86-02,  Section:  B.
■500    ▼aIncludes  supplementary  digital  materials.
■500    ▼aAdvisor:  Ji,  Zhe.
■5021  ▼aThesis  (Ph.D.)--Northwestern  University,  2024.
■590    ▼aSchool  code:  0163.
■650  4▼aBioinformatics.
■650  4▼aBiology.
■650  4▼aGenetics.
■653    ▼aComputational  modeling
■653    ▼aDeep  learning
■653    ▼aGenomics
■653    ▼aHuman  genetics
■653    ▼aMachine  learning
■653    ▼aPolyadenylation
■690    ▼a0715
■690    ▼a0306
■690    ▼a0369
■690    ▼a0800
■71020▼aNorthwestern  University▼bDriskill  Graduate  Training  Program  in  Life  Sciences.
■7730  ▼tDissertations  Abstracts  International▼g86-02B.
■790    ▼a0163
■791    ▼aPh.D.
■792    ▼a2024
■793    ▼aEnglish
■85640▼uhttp://www.riss.kr/pdu/ddodLink.do?id=T17163566▼nKERIS▼zThe  full  text  of  this  item  is  provided  by  the  Korea  Education  and  Research  Information  Service.

Holdings
Registration Number  
TQ0034838
Call Number  
T
Location  
Full-text material
Status  
Viewable/Printable
Loan Info  
Viewable/Printable