Model-Free Algorithms for Constrained Reinforcement Learning in Discounted and Average Reward Settings.
Detail Information
- Material Type
- Dissertation
- Control Number
- 0017165052
- International Standard Book Number
- 9798346578185
- Dewey Decimal Classification Number
- 519
- Main Entry-Personal Name
- Bai, Qinbo.
- Publication, Distribution, etc. (Imprint)
- [S.l.] : Purdue University, 2024
- Publication, Distribution, etc. (Imprint)
- Ann Arbor : ProQuest Dissertations & Theses, 2024
- Physical Description
- 140 p.
- General Note
- Source: Dissertations Abstracts International, Volume: 86-05, Section: A.
- General Note
- Advisor: Aggarwal, Vaneet.
- Dissertation Note
- Thesis (Ph.D.)--Purdue University, 2024.
- Summary, Etc.
- Reinforcement learning (RL), which aims to train an agent to maximize its accumulated reward over time, has attracted much attention in recent years. Mathematically, RL is modeled as a Markov Decision Process, in which the agent interacts with the environment step by step. In practice, RL has been applied to autonomous driving, robotics, recommendation systems, and financial management. Although RL has been studied extensively in the literature, most proposed algorithms are model-based, which requires estimating the transition kernel. To this end, we study sample-efficient model-free algorithms under different settings. First, we propose a conservative stochastic primal-dual algorithm in the infinite-horizon discounted-reward setting. The proposed algorithm converts the original problem from the policy space to the occupancy-measure space, which makes the non-convex problem linear. We then advocate a randomized primal-dual approach to achieve O(ε^{-2}) sample complexity, which matches the lower bound. However, in the infinite-horizon average-reward setting the problem becomes more challenging, since the environment interaction never ends and cannot be reset, so the reward samples are no longer independent. To address this, we design an epoch-based policy-gradient algorithm. In each epoch, the whole trajectory is divided into multiple sub-trajectories with an interval between consecutive sub-trajectories. These intervals are long enough that the reward samples are asymptotically independent. By controlling the lengths of the sub-trajectories and the intervals, we obtain a good gradient estimator and prove that the proposed algorithm achieves an O(T^{3/4}) regret bound. (An illustrative sketch of the occupancy-measure linear program follows the detail fields below.)
- Subject Added Entry-Topical Term
- Linear programming.
- Subject Added Entry-Topical Term
- Exploitation.
- Subject Added Entry-Topical Term
- Recommender systems.
- Subject Added Entry-Topical Term
- Markov analysis.
- Subject Added Entry-Topical Term
- Robotics.
- Subject Added Entry-Topical Term
- Information science.
- Added Entry-Corporate Name
- Purdue University.
- Host Item Entry
- Dissertations Abstracts International. 86-05A.
- Electronic Location and Access
- This material is available after logging in.
- Control Number
- joongbu:655474
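The occupancy-measure reformulation mentioned in the summary above is the standard linear-programming view of a discounted constrained MDP. The LaTeX sketch below is illustrative only and is not taken from the thesis; λ denotes the discounted occupancy measure, r the reward, g the constraint cost, b the constraint threshold, ρ the initial-state distribution, P the transition kernel, and γ the discount factor.

% Illustrative occupancy-measure LP for a discounted constrained MDP
% (a sketch under generic notation, not the thesis's exact formulation).
\begin{align*}
\max_{\lambda \ge 0}\quad & \sum_{s,a} \lambda(s,a)\, r(s,a) \\
\text{s.t.}\quad & \sum_{s,a} \lambda(s,a)\, g(s,a) \ge b, \\
& \sum_{a} \lambda(s',a) = (1-\gamma)\,\rho(s') + \gamma \sum_{s,a} P(s' \mid s,a)\, \lambda(s,a) \qquad \forall s'.
\end{align*}
% Both the objective and the constraints are linear in lambda, which is why
% moving from the policy space to the occupancy-measure space turns the
% non-convex problem into a linear program, as the summary describes.

A policy can then be recovered from an optimal occupancy measure via π(a | s) ∝ λ(s, a).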
MARC
■008250224s2024 us ||||||||||||||c||eng d
■001000017165052
■00520250211153118
■006m o d
■007cr#unu||||||||
■020 ▼a9798346578185
■035 ▼a(MiAaPQ)AAI31732968
■035 ▼a(MiAaPQ)Purdue27173229
■040 ▼aMiAaPQ▼cMiAaPQ
■0820 ▼a519
■1001 ▼aBai, Qinbo.
■24510▼aModel-Free Algorithms for Constrained Reinforcement Learning in Discounted and Average Reward Settings.
■260 ▼a[S.l.]▼bPurdue University. ▼c2024
■260 1▼aAnn Arbor▼bProQuest Dissertations & Theses▼c2024
■300 ▼a140 p.
■500 ▼aSource: Dissertations Abstracts International, Volume: 86-05, Section: A.
■500 ▼aAdvisor: Aggarwal, Vaneet.
■5021 ▼aThesis (Ph.D.)--Purdue University, 2024.
■520 ▼aReinforcement learning (RL), which aims to train an agent to maximize its accumulated reward over time, has attracted much attention in recent years. Mathematically, RL is modeled as a Markov Decision Process, in which the agent interacts with the environment step by step. In practice, RL has been applied to autonomous driving, robotics, recommendation systems, and financial management. Although RL has been studied extensively in the literature, most proposed algorithms are model-based, which requires estimating the transition kernel. To this end, we study sample-efficient model-free algorithms under different settings. First, we propose a conservative stochastic primal-dual algorithm in the infinite-horizon discounted-reward setting. The proposed algorithm converts the original problem from the policy space to the occupancy-measure space, which makes the non-convex problem linear. We then advocate a randomized primal-dual approach to achieve O(ε^{-2}) sample complexity, which matches the lower bound. However, in the infinite-horizon average-reward setting the problem becomes more challenging, since the environment interaction never ends and cannot be reset, so the reward samples are no longer independent. To address this, we design an epoch-based policy-gradient algorithm. In each epoch, the whole trajectory is divided into multiple sub-trajectories with an interval between consecutive sub-trajectories. These intervals are long enough that the reward samples are asymptotically independent. By controlling the lengths of the sub-trajectories and the intervals, we obtain a good gradient estimator and prove that the proposed algorithm achieves an O(T^{3/4}) regret bound.
■590 ▼aSchool code: 0183.
■650 4▼aLinear programming.
■650 4▼aExploitation.
■650 4▼aRecommender systems.
■650 4▼aMarkov analysis.
■650 4▼aRobotics.
■650 4▼aInformation science.
■690 ▼a0771
■690 ▼a0723
■690 ▼a0796
■71020▼aPurdue University.
■7730 ▼tDissertations Abstracts International▼g86-05A.
■790 ▼a0183
■791 ▼aPh.D.
■792 ▼a2024
■793 ▼aEnglish
■85640▼uhttp://www.riss.kr/pdu/ddodLink.do?id=T17165052▼nKERIS▼zThe full text of this material is provided by the Korea Education and Research Information Service (KERIS).