Model-Free Algorithms for Constrained Reinforcement Learning in Discounted and Average Reward Settings.

Detailed Information

Material Type  
 Dissertation (thesis)
Control Number  
0017165052
International Standard Book Number  
9798346578185
Dewey Decimal Classification Number  
519
Main Entry-Personal Name  
Bai, Qinbo.
Publication, Distribution, etc. (Imprint)  
[S.l.] : Purdue University, 2024
Publication, Distribution, etc. (Imprint)  
Ann Arbor : ProQuest Dissertations & Theses, 2024
Physical Description  
140 p.
General Note  
Source: Dissertations Abstracts International, Volume: 86-05, Section: A.
General Note  
Advisor: Aggarwal, Vaneet.
Dissertation Note  
Thesis (Ph.D.)--Purdue University, 2024.
Summary, Etc.  
Reinforcement learning (RL), which aims to train an agent to maximize its accumulated reward over time, has attracted much attention in recent years. Mathematically, RL is modeled as a Markov decision process, in which the agent interacts with the environment step by step. In practice, RL has been applied to autonomous driving, robotics, recommendation systems, and financial management. Although RL has been studied extensively in the literature, most proposed algorithms are model-based, which requires estimating the transition kernel. To this end, we study sample-efficient model-free algorithms under different settings. First, we propose a conservative stochastic primal-dual algorithm for the infinite-horizon discounted reward setting. The proposed algorithm converts the original problem from the policy space to the occupancy-measure space, which makes the non-convex problem linear. We then advocate a randomized primal-dual approach that achieves O(ε^{-2}) sample complexity, matching the lower bound. However, in the infinite-horizon average reward setting the problem becomes more challenging: the environment interaction never ends and cannot be reset, so reward samples are no longer independent. To address this, we design an epoch-based policy-gradient algorithm. In each epoch, the whole trajectory is divided into multiple sub-trajectories separated by intervals long enough that the reward samples are asymptotically independent. By controlling the lengths of the sub-trajectories and the intervals, we obtain a good gradient estimator and prove that the proposed algorithm achieves an O(T^{3/4}) regret bound. (Illustrative sketches of the occupancy-measure linear program and the epoch-based sampling scheme follow the record details below.)
Subject Added Entry-Topical Term  
Linear programming.
Subject Added Entry-Topical Term  
Exploitation.
Subject Added Entry-Topical Term  
Recommender systems.
Subject Added Entry-Topical Term  
Markov analysis.
Subject Added Entry-Topical Term  
Robotics.
Subject Added Entry-Topical Term  
Information science.
Added Entry-Corporate Name  
Purdue University.
Host Item Entry  
Dissertations Abstracts International. 86-05A.
Electronic Location and Access  
This material is available after logging in.
Control Number  
joongbu:655474
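
The abstract above first reformulates the constrained, discounted-reward problem over occupancy measures, where it becomes a linear program even though it is non-convex over policies. The following sketch is my own Python illustration of that reformulation, not the thesis's algorithm (which is model-free and primal-dual and never forms the LP from a known kernel); the two-state MDP and every quantity in it (P, r, g, b, mu0, gamma) are assumed purely for illustration.

# Toy occupancy-measure LP for a constrained, discounted MDP.
# All data below is a hand-made example; nothing here comes from the thesis.
import numpy as np
from scipy.optimize import linprog

S, A, gamma = 2, 2, 0.9
P = np.zeros((S, A, S))                  # deterministic toy kernel: action a moves to state a
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
r = np.array([[1.0, 0.0], [1.0, 0.0]])   # reward is earned only by action 0
g = np.array([[0.0, 1.0], [0.0, 1.0]])   # constraint utility is earned only by action 1
b = 0.3                                  # require expected utility >= 0.3
mu0 = np.array([0.5, 0.5])               # initial state distribution
n = S * A                                # one LP variable per (state, action) pair

# Bellman-flow constraints on the (normalized) occupancy measure q(s, a):
#   sum_a q(s', a) - gamma * sum_{s, a} P[s, a, s'] q(s, a) = (1 - gamma) * mu0(s')
A_eq = np.zeros((S, n))
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] += (s == sp) - gamma * P[s, a, sp]
b_eq = (1 - gamma) * mu0

# Utility constraint  sum_{s,a} g(s,a) q(s,a) >= b, written as an upper bound for linprog.
A_ub, b_ub = -g.reshape(1, n), np.array([-b])

# Maximize expected reward  <=>  minimize its negation; q >= 0 elementwise.
res = linprog(c=-r.reshape(n), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
q = res.x.reshape(S, A)
pi = q / q.sum(axis=1, keepdims=True)    # recover a policy from the optimal occupancy measure
print("reward:", r.reshape(n) @ res.x, "utility:", g.reshape(n) @ res.x)  # about 0.7 and 0.3

On this toy instance the solver puts exactly the required 0.3 occupancy mass on the utility-earning action and the remaining 0.7 on the rewarding action, which is what the constraint forces; the same trade-off is what the thesis's model-free primal-dual method negotiates without ever constructing the LP.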

MARC

 008250224s2024        us  ||||||||||||||c||eng  d
■001000017165052
■00520250211153118
■006m          o    d                
■007cr#unu||||||||
■020    ▼a9798346578185
■035    ▼a(MiAaPQ)AAI31732968
■035    ▼a(MiAaPQ)Purdue27173229
■040    ▼aMiAaPQ▼cMiAaPQ
■0820  ▼a519
■1001  ▼aBai, Qinbo.
■24510▼aModel-Free Algorithms for Constrained Reinforcement Learning in Discounted and Average Reward Settings.
■260    ▼a[S.l.]▼bPurdue University.▼c2024
■260  1▼aAnn Arbor▼bProQuest Dissertations & Theses▼c2024
■300    ▼a140 p.
■500    ▼aSource: Dissertations Abstracts International, Volume: 86-05, Section: A.
■500    ▼aAdvisor: Aggarwal, Vaneet.
■5021  ▼aThesis (Ph.D.)--Purdue University, 2024.
■520    ▼aReinforcement learning (RL), which aims to train an agent to maximize its accumulated reward through time, has attracted much attention in recent years. Mathematically, RL is modeled as a Markov Decision Process, where the agent interacts with the environment step by step. In practice, RL has been applied to autonomous driving, robotics, recommendation systems, and financial management. Although RL has been greatly studied in the literature, most proposed algorithms are model-based, which requires estimating the transition kernel. To this end, we begin to study the sample efficient model-free algorithms under different settings. Firstly, we propose a conservative stochastic primal-dual algorithm in the infinite horizon discounted reward setting. The proposed algorithm converts the original problem from policy space to the occupancy measure space, which makes the non-convex problem linear. Then, we advocate the use of a randomized primal-dual approach to achieve O(ε^{-2}) sample complexity, which matches the lower bound. However, when it comes to the infinite horizon average reward setting, the problem becomes more challenging since the environment interaction never ends and can't be reset, which makes reward samples not independent anymore. To solve this, we design an epoch-based policy-gradient algorithm. In each epoch, the whole trajectory is divided into multiple sub-trajectories with an interval between each two of them. Such intervals are long enough so that the reward samples are asymptotically independent. By controlling the length of trajectory and intervals, we obtain a good gradient estimator and prove the proposed algorithm achieves O(T^{3/4}) regret bound.
■590    ▼aSchool code: 0183.
■650  4▼aLinear programming.
■650  4▼aExploitation.
■650  4▼aRecommender systems.
■650  4▼aMarkov analysis.
■650  4▼aRobotics.
■650  4▼aInformation science.
■690    ▼a0771
■690    ▼a0723
■690    ▼a0796
■71020▼aPurdue University.
■7730  ▼tDissertations Abstracts International▼g86-05A.
■790    ▼a0183
■791    ▼aPh.D.
■792    ▼a2024
■793    ▼aEnglish
■85640▼uhttp://www.riss.kr/pdu/ddodLink.do?id=T17165052▼nKERIS▼zThe full text of this material is provided by the Korea Education and Research Information Service (KERIS).
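
The ■520 abstract above also describes how the average-reward setting is handled: each epoch rolls one uninterrupted trajectory and cuts it into sub-trajectories separated by discarded gaps, so the kept segments become asymptotically independent once the gaps exceed the chain's mixing time. The sketch below is my own simplified reading of that sampling structure only, not the thesis's algorithm or its exact estimator: it omits the constraint handling, and the per-segment term is a crude REINFORCE-style stand-in. The tabular MDP, softmax policy, and all constants are assumed purely for illustration.

# Epoch-based data collection for average-reward policy gradient: one long rollout,
# burn-in gaps discarded, kept segments contribute score-function gradient terms.
# Everything below is a made-up toy instance; nothing comes from the thesis.
import numpy as np

rng = np.random.default_rng(1)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))      # toy kernel P[s, a, s']
r = rng.uniform(size=(S, A))                    # toy reward table

def policy_probs(theta, s):
    """Softmax policy pi(.|s) with one parameter per (state, action)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def epoch_gradient(theta, s, n_sub=8, sub_len=50, gap=20):
    """Collect one epoch from a single uninterrupted rollout: discard `gap` steps
    between the `n_sub` kept segments of length `sub_len`, and average one
    score-function gradient term per kept segment."""
    grad = np.zeros_like(theta)
    for _ in range(n_sub):
        for _ in range(gap):                    # burn-in gap: advance, record nothing
            a = rng.choice(A, p=policy_probs(theta, s))
            s = rng.choice(S, p=P[s, a])
        seg_return, seg_score = 0.0, np.zeros_like(theta)
        for _ in range(sub_len):                # kept sub-trajectory
            probs = policy_probs(theta, s)
            a = rng.choice(A, p=probs)
            seg_return += r[s, a]
            seg_score[s] -= probs               # grad of log pi(a|s) w.r.t. theta[s, :]
            seg_score[s, a] += 1.0
            s = rng.choice(S, p=P[s, a])
        grad += (seg_return / sub_len) * seg_score / n_sub
    return grad, s                              # hand the final state to the next epoch

theta, s = np.zeros((S, A)), 0
for _ in range(5):                              # a few gradient-ascent epochs, no resets
    grad, s = epoch_gradient(theta, s)
    theta += 0.1 * grad
print("per-epoch gradient estimate has shape", grad.shape)

Nothing from the gap loop is recorded; its only role is to let the chain keep running between kept segments, which is what makes those segments approximately independent and the averaged gradient estimate usable.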

Holdings
Reg No.      Call No.   Location              Status                 Lend Info
TQ0031496    T          Full-text materials   Viewable / printable   Viewable / printable