Electronic Science and Technology ›› 2024, Vol. 37 ›› Issue (11): 47-54. doi: 10.16180/j.cnki.issn1007-7820.2024.11.007


Motion Planning of Manipulator Based on Improved Soft Actor-Critic Algorithm

TANG Chao, ZHANG Fan

  1. School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  • Received: 2022-04-08 Online: 2024-11-15 Published: 2024-11-21
  • About the authors: TANG Chao (born 1997), male, master's student. Research interest: deep reinforcement learning.
    ZHANG Fan (born 1980), female, Ph.D., associate professor. Research interests: theory of mechanisms and parallel robot theory.
  • Supported by:
    Shanghai Municipal Science and Technology Commission Science and Technology Support Program in the Field of Biomedicine (17441901200)

Abstract:

Deep reinforcement learning algorithms applied to manipulator motion planning under high-dimensional state spaces and high precision requirements suffer from low exploration efficiency, slow convergence, and in some cases failure to converge. To address these problems, this study builds on the SAC (Soft Actor-Critic) algorithm, introduces an asynchronous advantage mechanism, and proposes the AA-SAC (Asynchronous Advantage Soft Actor-Critic) algorithm. The algorithm replaces the original V network with a Qtarget network, which effectively reduces the variance of the Q network, and n independent processes can be trained in parallel, which improves training efficiency. The experience replay pool of AA-SAC is divided into two parts so that high-quality experience data are stored and sampled separately, which improves the utilization of effective experience data. Simulation results show that AA-SAC performs best in convergence speed, success rate, and stability. Compared with the SAC algorithm, AA-SAC converges 3,000 episodes earlier. After convergence, the success rate of AA-SAC reaches 96%, which is 6% higher than that of SAC and 26% higher than that of the DDPG (Deep Deterministic Policy Gradient) algorithm.

Key words: deep reinforcement learning, asynchronous advantage, SAC algorithm, experience replay pool, manipulator, motion planning, minimally invasive surgery, CoppeliaSim
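
For orientation, replacing a separate V network with target Q networks is also how the later, widely used formulation of SAC builds its critic target. The expression below is that standard form, offered as a reference point and not as the exact AA-SAC update, which the abstract does not spell out. The symbols are assumptions not defined above: reward r, discount factor γ, terminal flag d, entropy temperature α, policy π_φ, and two target critics with parameters θ̄_1 and θ̄_2.

```latex
y = r + \gamma \,(1 - d) \left( \min_{i=1,2} Q_{\bar{\theta}_i}(s', a') - \alpha \log \pi_\phi(a' \mid s') \right),
\qquad a' \sim \pi_\phi(\cdot \mid s')
```

Each Q network is then regressed toward y, so no separate state-value network has to be trained.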

CLC Number:

  • TP241
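
The two-part experience replay pool described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the class name `SplitReplayBuffer`, the use of a reward threshold to decide what counts as high-quality experience, and the fixed fraction of each batch drawn from the high-quality part. The abstract states only that high-quality data are stored and sampled separately, so the paper's actual criterion and sampling ratio may differ.

```python
import random
from collections import deque


class SplitReplayBuffer:
    """Replay pool split into a regular part and a high-quality part (illustrative)."""

    def __init__(self, capacity, reward_threshold, good_fraction=0.25, seed=None):
        self.regular = deque(maxlen=capacity)   # ordinary transitions
        self.good = deque(maxlen=capacity)      # high-quality transitions
        self.reward_threshold = reward_threshold  # assumed quality criterion
        self.good_fraction = good_fraction        # assumed share of each batch
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Store a transition in the part selected by the reward threshold."""
        transition = (state, action, reward, next_state, done)
        if reward >= self.reward_threshold:
            self.good.append(transition)
        else:
            self.regular.append(transition)

    def sample(self, batch_size):
        """Draw from each part separately, then mix the two draws into one batch."""
        n_good = min(int(batch_size * self.good_fraction), len(self.good))
        n_regular = min(batch_size - n_good, len(self.regular))
        batch = self.rng.sample(list(self.good), n_good)
        batch += self.rng.sample(list(self.regular), n_regular)
        self.rng.shuffle(batch)
        return batch

    def __len__(self):
        return len(self.regular) + len(self.good)


# Usage: 20 ordinary transitions and 8 high-reward ones, then a batch of 8.
buf = SplitReplayBuffer(capacity=100, reward_threshold=1.0, good_fraction=0.25, seed=0)
for i in range(20):
    buf.add(i, 0, 0.0, i + 1, False)
for i in range(8):
    buf.add(100 + i, 0, 1.0, 101 + i, True)
batch = buf.sample(8)
```

With these settings, two of the eight sampled transitions come from the high-quality part. If that part is still empty early in training, the whole batch falls back to the regular part.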