Model-Free Algorithms for Constrained Reinforcement Learning in Discounted and Average Reward Settings.
Detail Information
- Material Type
- Dissertation
- Control Number
- 0017165052
- International Standard Book Number
- 9798346578185
- Dewey Decimal Classification Number
- 519
- Main Entry-Personal Name
- Bai, Qinbo.
- Publication, Distribution, etc. (Imprint)
- [S.l.] : Purdue University, 2024
- Publication, Distribution, etc. (Imprint)
- Ann Arbor : ProQuest Dissertations & Theses, 2024
- Physical Description
- 140 p.
- General Note
- Source: Dissertations Abstracts International, Volume: 86-05, Section: A.
- General Note
- Advisor: Aggarwal, Vaneet.
- Dissertation Note
- Thesis (Ph.D.)--Purdue University, 2024.
- Summary, Etc.
- Reinforcement learning (RL), which aims to train an agent to maximize its accumulated reward over time, has attracted much attention in recent years. Mathematically, RL is modeled as a Markov Decision Process, in which the agent interacts with the environment step by step. In practice, RL has been applied to autonomous driving, robotics, recommendation systems, and financial management. Although RL has been studied extensively in the literature, most proposed algorithms are model-based, which requires estimating the transition kernel. To this end, we study sample-efficient model-free algorithms under different settings. First, we propose a conservative stochastic primal-dual algorithm in the infinite-horizon discounted-reward setting. The proposed algorithm converts the original problem from the policy space to the occupancy-measure space, which makes the non-convex problem linear. We then advocate a randomized primal-dual approach to achieve O(ε^{-2}) sample complexity, which matches the lower bound. However, in the infinite-horizon average-reward setting the problem becomes more challenging, since the environment interaction never ends and cannot be reset, so the reward samples are no longer independent. To address this, we design an epoch-based policy-gradient algorithm. In each epoch, the whole trajectory is divided into multiple sub-trajectories with an interval between consecutive sub-trajectories. These intervals are long enough that the reward samples are asymptotically independent. By controlling the lengths of the sub-trajectories and the intervals, we obtain a good gradient estimator and prove that the proposed algorithm achieves an O(T^{3/4}) regret bound. (An illustrative sketch of the occupancy-measure linear program follows the detail fields below.)
- Subject Added Entry-Topical Term
- Linear programming.
- Subject Added Entry-Topical Term
- Exploitation.
- Subject Added Entry-Topical Term
- Recommender systems.
- Subject Added Entry-Topical Term
- Markov analysis.
- Subject Added Entry-Topical Term
- Robotics.
- Subject Added Entry-Topical Term
- Information science.
- Added Entry-Corporate Name
- Purdue University.
- Host Item Entry
- Dissertations Abstracts International. 86-05A.
- Electronic Location and Access
- This material is available after logging in.
- Control Number
- joongbu:655474
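The occupancy-measure reformulation mentioned in the summary above is the standard linear-programming view of a discounted constrained MDP. The LaTeX sketch below is illustrative only and is not taken from the thesis; λ denotes the discounted occupancy measure, r the reward, g the constraint cost, b the constraint threshold, ρ the initial-state distribution, P the transition kernel, and γ the discount factor.

% Illustrative occupancy-measure LP for a discounted constrained MDP
% (a sketch under generic notation, not the thesis's exact formulation).
\begin{align*}
\max_{\lambda \ge 0}\quad & \sum_{s,a} \lambda(s,a)\, r(s,a) \\
\text{s.t.}\quad & \sum_{s,a} \lambda(s,a)\, g(s,a) \ge b, \\
& \sum_{a} \lambda(s',a) = (1-\gamma)\,\rho(s') + \gamma \sum_{s,a} P(s' \mid s,a)\, \lambda(s,a) \qquad \forall s'.
\end{align*}
% Both the objective and the constraints are linear in lambda, which is why
% moving from the policy space to the occupancy-measure space turns the
% non-convex problem into a linear program, as the summary describes.

A policy can then be recovered from an optimal occupancy measure via π(a | s) ∝ λ(s, a).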
MARC
■008250224s2024 us ||||||||||||||c||eng d
■001000017165052
■00520250211153118
■006m o d
■007cr#unu||||||||
■020 ▼a9798346578185
■035 ▼a(MiAaPQ)AAI31732968
■035 ▼a(MiAaPQ)Purdue27173229
■040 ▼aMiAaPQ▼cMiAaPQ
■0820 ▼a519
■1001 ▼aBai, Qinbo.
■24510▼aModel-Free Algorithms for Constrained Reinforcement Learning in Discounted and Average Reward Settings.
■260 ▼a[S.l.]▼bPurdue University. ▼c2024
■260 1▼aAnn Arbor▼bProQuest Dissertations & Theses▼c2024
■300 ▼a140 p.
■500 ▼aSource: Dissertations Abstracts International, Volume: 86-05, Section: A.
■500 ▼aAdvisor: Aggarwal, Vaneet.
■5021 ▼aThesis (Ph.D.)--Purdue University, 2024.
■520 ▼aReinforcement learning (RL), which aims to train an agent to maximize its accumulated reward over time, has attracted much attention in recent years. Mathematically, RL is modeled as a Markov Decision Process, in which the agent interacts with the environment step by step. In practice, RL has been applied to autonomous driving, robotics, recommendation systems, and financial management. Although RL has been studied extensively in the literature, most proposed algorithms are model-based, which requires estimating the transition kernel. To this end, we study sample-efficient model-free algorithms under different settings. First, we propose a conservative stochastic primal-dual algorithm in the infinite-horizon discounted-reward setting. The proposed algorithm converts the original problem from the policy space to the occupancy-measure space, which makes the non-convex problem linear. We then advocate a randomized primal-dual approach to achieve O(ε^{-2}) sample complexity, which matches the lower bound. However, in the infinite-horizon average-reward setting the problem becomes more challenging, since the environment interaction never ends and cannot be reset, so the reward samples are no longer independent. To address this, we design an epoch-based policy-gradient algorithm. In each epoch, the whole trajectory is divided into multiple sub-trajectories with an interval between consecutive sub-trajectories. These intervals are long enough that the reward samples are asymptotically independent. By controlling the lengths of the sub-trajectories and the intervals, we obtain a good gradient estimator and prove that the proposed algorithm achieves an O(T^{3/4}) regret bound.
■590 ▼aSchool code: 0183.
■650 4▼aLinear programming.
■650 4▼aExploitation.
■650 4▼aRecommender systems.
■650 4▼aMarkov analysis.
■650 4▼aRobotics.
■650 4▼aInformation science.
■690 ▼a0771
■690 ▼a0723
■690 ▼a0796
■71020▼aPurdue University.
■7730 ▼tDissertations Abstracts International▼g86-05A.
■790 ▼a0183
■791 ▼aPh.D.
■792 ▼a2024
■793 ▼aEnglish
■85640▼uhttp://www.riss.kr/pdu/ddodLink.do?id=T17165052▼nKERIS▼zThe full text of this material is provided by the Korea Education and Research Information Service (KERIS).