Axis-Space Framework For Cable-Driven Soft Continuum Robot Control Via Reinforcement Learning
https://doi.org/10.1038/s44172-023-00110-2
Cable-driven soft continuum robots are important tools in minimally invasive surgery (MIS) that reduce lesions, pain, and the risk of infection. The feasibility of employing deep reinforcement learning (DRL) for controlling cable-driven continuum robots has been investigated; however, a considerable gap between simulations and the real world exists. Here we introduce a deep reinforcement learning-based method, the Axis-Space (AS) framework, which accelerates the computational speed and improves the accuracy of robotic control by reducing the sample complexity (SC) and the number of training steps. In this framework, the SC was reduced through the design of the state space and action space. We demonstrate that our framework can control a cable-driven soft continuum robot with four tendons per section. Compared with the Double Deep Q-learning Network (DDQN) controller, the proposed controller increased the convergence speed by more than 11-fold and reduced the positioning error by over 10-fold. This framework provides a robust method for soft robotics control.
1 Tsinghua Shenzhen International Graduate School (SIGS), Tsinghua University, 518055 Shenzhen, China. 2 Tsinghua-Berkeley Shenzhen Institute (TBSI),
518055 Shenzhen, China. 3These authors contributed equally: Dehao Wei, Jiaqi Zhou. ✉email: [email protected]
Minimally invasive surgery (MIS) involves the use of surgical instruments to access desired surgical targets inside the human body through a small incision hole. In this way, MIS offers reduced risks and injuries for patients. Soft continuum robots have distal dexterity and structural compliance, which play an important role in MIS1,2. Many kinds of soft continuum robots have been designed for MIS3. At present, the control of these robots is inaccurate due to the nonlinear behaviors of the flexible manipulator, including structural deformations, interactions with soft tissues, and collisions with other instruments4. Many control methods have been proposed to improve the motion accuracy of continuum robots; they can be categorized into model-based and model-free methods.

For model-based control methods, the piecewise constant curvature assumption is commonly adopted in existing models of continuum robots. It allows computationally efficient closed-form solutions for many continuum robots in an ideal environment by ignoring environmental interactions and other nonlinear factors5. The Cosserat rod model allows external factors to be taken into consideration but is computationally expensive for real-time control6. There are also other model-based methods for continuum robot control based on finite element principles7. However, the unknown disturbances on the robot in the soft-tissue environment can often be an obstacle to accurate navigation, whereas integrating a flexible force sensor with a robot is hard to implement due to the narrow spaces and high cost. Although fiber Bragg grating optical sensors have been used to acquire the external force or shape of a continuum robot, they are limited by small tensile strength and high cost8.

In contrast, model-free control methods for soft continuum robots do not need prior knowledge of the underlying mechanics or dynamics of the robot9. Machine learning techniques, such as neural networks, extreme learning machines, and Gaussian mixture regression, have been implemented to model soft robot controllers10. However, these methods suffer from poor reliability when addressing real-world challenges that a robot has never encountered in training. Data-driven and empirical methods have been utilized to update these controller models during real-time control. For example, a hybrid robot controller combining inverse kinematics and a PID controller was developed to compensate for the positioning error caused by external disturbances11. Thuruthel et al. reported a machine learning-based approach for the closed-loop kinematic control of continuum manipulators in a task space12. However, the control performance still heavily depends on the huge training data size and sample efficiency.

Deep reinforcement learning (DRL) has been applied to robot control, but the gap between simulation and the real world often limits the application and efficiency of DRL controllers. DRL has recently been investigated for soft continuum manipulator control. Generally, DRL methods can be divided into value-based and policy-based DRL13. Value-based DRL algorithms, such as the deep Q-learning network (DQN)14, have been applied to the 3-dimensional (3D) motion of a soft continuum robot14. When applying DRL to robotics control, many researchers choose to pretrain on a simulator and then transfer the policy to the real world4,15,16. However, there is a gap between the simulation and the real world.

Here, we introduced an easily implemented methodology, the Axis-Space (AS) framework, to expedite convergence in DRL by reducing the SC and training steps. The feasibility and augmentation of using AS framework-based DRL have been investigated on DQN and DDQN controllers for a cable-driven soft continuum robot17,18. Here, we investigate a cable-driven soft continuum robot with four tendons per section19, which serves as the experimental platform. More cable-driven soft continuum robots can be found in Supplementary Table 1. The AS-based controllers, that is, the ASDQN and ASDDQN controllers, were proven to have increased performance, including a largely reduced sample data size and high position-tracking precision, compared with their non-AS counterparts under various circumstances that might be encountered during MIS.

Results
Formulation of ASDDQN controller. The gap between simulation and reality was caused by discrepancies20 (Fig. 1a). Although many studies have focused on the simulation-to-real-world transition to narrow the reality gap in various robot control problems21–24, there is still a long way to go before filling this gap. The mismatch between the simulated and real-world environments raises the demand for simulation-to-real-world transfer of the knowledge acquired in simulations25. However, this transfer would hardly work, especially when applied to a complex robot with high degrees of freedom (DoFs), because the gap between the real environment and the simulator becomes remarkable16. In addition, directly implementing DRL training with robots possessing high DoFs in the real world requires a high SC and large training episodes. The response time of robot movement results in a protracted training time when training directly in the real world in comparison with simulation-based training. Moreover, owing to the constraints on the fatigue strength of a soft continuum robot, the robot arm fractures when the requisite training steps surpass the maximum threshold it can endure. Therefore, training robots with high DoFs using DRL in the real world faces considerable challenges (Fig. 1b).

The implementation of the AS framework, featuring a revamped state space and action space, yields a reduction in SC, facilitating real-world robot training and expediting convergence rates. As shown in Fig. 1c, six subspaces, i.e., the X+, X−, Y+, Y−, Z+, and Z− subspaces, namely subspaces 1–6, were constructed, and a different DRL model was developed for each subspace. During implementation, every target position within the robot's 3D workspace was decomposed into vectors along the three axes. These vectors served as the targets for the DDQN of the corresponding model. The six DDQN models in the six subspaces comprised the ASDDQN controller. In the ASDDQN model, the controller utilizes the end-tip position of the robot, obtained from the motion capture system (MCS), along with the target position as inputs to generate action signals, which are then transmitted to the soft continuum robot (Fig. 1c).

Compared with the DDQN controller, the AS framework-enabled DDQN controller, ASDDQN, can largely reduce the sample complexity (SC), which plays an important role in training performance26; the optimal Q-function can be estimated ε-accurately entrywise with a sample complexity of $\frac{|S||A|}{(1-\gamma)^{3}\varepsilon^{2}}$ for a γ-discounted finite-horizon MDP with state space $S$ and action space $A$. In the DDQN controller, the size of the action space $A_{\mathrm{DDQN}}$ is 10, comprising $x_{S1}^{+}$, $x_{S1}^{-}$, $y_{S1}^{+}$, $y_{S1}^{-}$, $x_{S2}^{+}$, $x_{S2}^{-}$, $y_{S2}^{+}$, $y_{S2}^{-}$, $z^{+}$, and $z^{-}$, where, for example, $x_{S1}^{+}$ means that section 1 of the robot arm bends towards the direction of the $x^{+}$ axis. In the ASDDQN controller, the action space of the model in the $z^{+}$ or $z^{-}$ subspace ($A_1$) is 2 ($z^{+}$, $z^{-}$), and the action space of the model in the $x^{+}$, $x^{-}$, $y^{+}$ or $y^{-}$ subspace ($A_2$) is 4 ($x_{S1}^{+}$, $x_{S1}^{-}$, $x_{S2}^{+}$, $x_{S2}^{-}$ or $y_{S1}^{+}$, $y_{S1}^{-}$, $y_{S2}^{+}$, $y_{S2}^{-}$). The state space ($S$) is defined as the set of vectors from the real-time end-tip position of the soft continuum robot to the target position. Here, both the real-time end-tip positions and the target positions can be considered a collection of pixel dots with a diameter of 0.5 mm due to the accuracy of the MCS (NOKOV Inc., China) used. Therefore, the state space contains a finite number of vectors. In the workspace, assuming there are 2n (n ∈ N*) states along the x and y axes and 2m (m ∈ N*) states along the z axis, there are $2\pi mn^{2}$ states in total.
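To make the subspace decomposition concrete, here is a minimal Python sketch, under stated assumptions, of how a tip-to-target displacement could be routed to the six per-subspace models; the SUBSPACE_AXES table, the decompose function, and the sign convention are illustrative stand-ins rather than the paper's released implementation.

```python
import numpy as np

# Hypothetical registry: one model per subspace (X+, X-, Y+, Y-, Z+, Z-),
# indexed 0-5 as in the paper; each model only sees its own 1D target.
SUBSPACE_AXES = [(0, +1), (0, -1), (1, +1), (1, -1), (2, +1), (2, -1)]

def decompose(tip: np.ndarray, target: np.ndarray):
    """Split the tip-to-target vector into per-subspace scalar targets.

    Returns a list of (subspace_index, magnitude) pairs; only subspaces
    whose sign matches the displacement component are activated."""
    delta = target - tip
    tasks = []
    for idx, (axis, sign) in enumerate(SUBSPACE_AXES):
        component = delta[axis] * sign
        if component > 0:  # this subspace's model is responsible
            tasks.append((idx, component))
    return tasks

# Example: target 3 mm along x+ and 2 mm along z- from the current tip.
print(decompose(np.array([0.0, 0.0, 0.0]), np.array([3.0, 0.0, -2.0])))
# -> [(0, 3.0), (5, 2.0)]  (subspace 1 handles x+, subspace 6 handles z-)
```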
Fig. 1 Comparison of normal value-based deep reinforcement learning algorithm (e.g., DDQN) and AS framework-based DDQN controllers, i.e., ASDDQN. a The gap between simulation and real-world scenarios. b Direct robot training in the real world poses challenges such as long setup time and the risk of robot arm damage. c Illustration of the design of action and state spaces for DDQN and ASDDQN controllers, and comparison of sample complexity between the two controllers. The action space of the DDQN controller is 10, while that of the ASDDQN controller is 2, 2, 4, 4, 4, 4 for the six subspaces. The state space of DDQN is $2\pi mn^{2}$, whereas that of the ASDDQN controller is 4n + 2m. A comparison of the sample complexity of the DDQN and ASDDQN controllers reveals that the latter requires less sample complexity, resulting in faster convergence than the former (x, y, z, S1, S2 represent x-axis, y-axis, z-axis, first segment, and second segment).
In our study, n is larger than 20 while m is larger than 50. The number of states in the DDQN controller is $2\pi mn^{2}$, approximately $1.3 \times 10^{5}$ in our study. In the ASDDQN controller, the state space of the model in the $z^{+}$ or $z^{-}$ subspace ($S_1$) is m and the state space of the model in the $x^{+}$, $x^{-}$, $y^{+}$ or $y^{-}$ subspace ($S_2$) is n. Therefore, $SC_{\mathrm{DDQN}}$ is as follows:

$$SC_{\mathrm{DDQN}} = \frac{|S||A|}{(1-\gamma)^{3}\varepsilon^{2}} = \frac{2\pi mn^{2} \times 10}{(1-\gamma)^{3}\varepsilon^{2}} = \frac{20\pi mn^{2}}{(1-\gamma)^{3}\varepsilon^{2}} \qquad (1)$$

and $SC_{\mathrm{ASDDQN}}$ is as follows:

$$SC_{\mathrm{ASDDQN}} = \frac{\sum |S||A|}{(1-\gamma)^{3}\varepsilon^{2}} = \frac{2|S_1||A_1| + 4|S_2||A_2|}{(1-\gamma)^{3}\varepsilon^{2}} = \frac{4m + 16n}{(1-\gamma)^{3}\varepsilon^{2}} \qquad (2)$$

According to Eqs. (1) and (2), we have

$$\frac{SC_{\mathrm{DDQN}}}{SC_{\mathrm{ASDDQN}}} = \frac{5\pi mn^{2}}{m + 4n} \qquad (3)$$

which in our study is approximately $2.4 \times 10^{3}$. The results indicate that a larger workspace confers greater advantages to the ASDDQN controller. The induction above indicates that the AS framework requires fewer samples to obtain the same accuracy compared with DDQN and realizes fast convergence with a small training dataset instead of high data acquisition costs and time consumption.
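Equations (1)–(3) can be checked numerically. The short Python sketch below plugs in the study's lower bounds n = 20 and m = 50; the common factor $(1-\gamma)^{-3}\varepsilon^{-2}$ cancels in the ratio, so only the numerators are computed.

```python
import math

def sc_ratio(n: int, m: int) -> float:
    """Eq. (3): SC_DDQN / SC_ASDDQN = 5*pi*m*n^2 / (m + 4n)."""
    sc_ddqn = 20 * math.pi * m * n ** 2   # Eq. (1) numerator: |S||A| = 2*pi*m*n^2 * 10
    sc_asddqn = 4 * m + 16 * n            # Eq. (2) numerator: 2*m*2 + 4*n*4
    return sc_ddqn / sc_asddqn

print(f"states in DDQN: {2 * math.pi * 50 * 20**2:.2e}")  # 2*pi*m*n^2, ~1.3e5
print(f"SC ratio:       {sc_ratio(20, 50):.2e}")          # ~2.4e3, as in the text
```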
The robot system design. A 47.2-mm-long and 7-mm-diameter continuum surgical robot arm allowing active distal dexterous manipulation was developed (Fig. 2a–e). It had a central lumen with an inner diameter of 1 mm, sufficient to accommodate one instrument, such as a miniature camera, surgical forceps, or a suction irrigation tube (Supplementary Fig. 1a–c and Supplementary Table 2). It has two distal bending sections, defined as S1 and S2 (Supplementary Fig. 2), consisting of a series of 16 disks evenly distributed along the axis, with two cylindrical rods connecting every two disks to function as the supporting backbone and the bending joint. Adjacent joints were arranged orthogonally to each other to achieve 2-DoF bending motion. The overall experimental setup and workflow are shown in Fig. 2f. The MCS measured the tip position of the continuum robot arm via the fiducial markers attached to the proximal and distal ends of the robot (Fig. 3a). The hyperparameters used are shown in Table 1.
Fig. 2 Overview of the cable-driven 5-DOF soft continuum robot and its control system. a The side view of the manipulator tip with two segments. b–e Multiple views of the robot: b top view, c exploded view, d left view, and e triangular view of the robot. f The control system consists of a motion capture system, an end robot arm, and a deep reinforcement learning controller. The tip position is obtained by the motion capture system and transferred to the deep reinforcement learning controller as the input. The controller outputs the action signals to the motor driver and controls the robot ($(x, y, z)$ represents the current robot arm end-tip position, $(x_d, y_d, z_d)$ represents the desired target position, $(x_t, y_t, z_t)$ represents the robot arm end-tip position at time t, and $Q(S_t, a_t)$ represents the action-value function at time t).
Table 1 Hyperparameters

Hyperparameter                        Value
Discount factor γ                     0.95
Replay buffer size                    10,000
Learning rate α                       0.0001
Batch size                            64
Max episodes                          80
Initial ϵ                             0.4
ϵ decay rate                          0.0001
Target network update frequency C     200

"T" pattern point tracking comparison: DQN, DDQN, ASDQN, and ASDDQN. In this section, the models of DQN, DDQN, ASDQN, and ASDDQN, after training for 80 episodes with 25 steps per episode, were evaluated by comparing their performance under the condition that the robot could move freely in the workspace when tracking "T" patterns. The start position (red triangle) was fixed, and every subsequent step was taken toward the next target position. The colored dots and curves in Fig. 3b–e represent the tracked trajectories of stepwise locomotion from the four different controllers, while the black dots represent the target positions. In Fig. 3b, under the DQN controller, the robot tip was positioned with high precision from the starting point to the fifth trajectory point, but from the sixth to the ninth point, it kept moving straight upwards instead of turning left or right to comply with the trajectory command. It formed a significantly deformed "T" pattern when referring to the target "T" pattern. A similar presentation is shown in Fig. 3c for the DDQN controller. The root mean square error (RMSE) of tracking dislocation was defined as the distance between the target point and the real point the robot reached. The RMSE of the DQN controller over the whole trajectory was 7.12 mm, while the RMSE of the DDQN controller over the whole trajectory was 6.12 mm. These results presented the below-satisfactory performance of the DQN and DDQN controllers on this robot due to the limited training dataset.

The limitations of both DQN and DDQN were overcome by using the AS framework strategy. Figure 3f shows the significantly improved performance of the ASDQN and ASDDQN controllers when compared with their non-AS counterparts. The RMSE values for the whole trajectories were 7.12 and 6.12 mm under the DQN and DDQN controllers but were reduced to 0.55 and 0.61 mm under the ASDQN and ASDDQN controllers, respectively (Table 2). For both algorithms, the AS-based controller achieved an over 10-fold reduction in RMSE. The obtained results demonstrated the effectiveness of the AS framework in generalization across diverse value-based DRL algorithms.

Comparison of DDQN and ASDDQN learning performance. Real-world robot training with reinforcement learning (RL) has been challenging due to the immense amount of training data required. To assess the efficacy of the ASDDQN controller in reducing training steps and accelerating convergence, in the training process, each AS-based model underwent 80 episodes consisting of T = 25 iteration steps each, and an episode could be ended in advance if the distance between the target and the actual location did not exceed 0.5 mm. The rewards of the last steps in each of eight episodes were recorded, and the curves were judged as to whether they converged. Here, the six neural network models of the ASDDQN controller were selected, corresponding to the AS-based subspaces. The DDQN controller was tested under the same conditions. During the evaluation, the stop condition was set such that the robot had moved for 25 steps or reached the goal within an error not exceeding 0.5 mm. As shown in Fig. 3g, the reward curves of the six models in the ASDDQN controller converged within 2000 training steps, whereas the reward curve of the DDQN controller did not converge even after 97,000 steps of training (Fig. 3i).
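A minimal sketch of this training procedure follows, assuming a generic agent/environment interface in the usual DQN style; the sample/learn/reset/step names are hypothetical placeholders (the actual controller runs on PARL 2.0.3), shown only to make the episode structure, the Table 1 limits, and the 0.5-mm early stop concrete.

```python
MAX_EPISODES, MAX_STEPS = 80, 25   # Table 1: max episodes; T = 25 iteration steps
GOAL_TOL_MM = 0.5                  # early-stop threshold on tip-to-target distance

def train(agent, env):
    """Train one AS-based model: up to 80 episodes of 25 steps each,
    ending an episode early once the tip is within 0.5 mm of the target."""
    last_step_rewards = []
    for episode in range(MAX_EPISODES):
        state = env.reset()
        reward = 0.0
        for step in range(MAX_STEPS):
            action = agent.sample(state)                   # epsilon-greedy pick
            next_state, reward, dist_mm = env.step(action)
            agent.learn(state, action, reward, next_state)
            state = next_state
            if dist_mm <= GOAL_TOL_MM:                     # goal reached: stop early
                break
        last_step_rewards.append(reward)  # last-step rewards, for convergence (Fig. 3g)
    return last_step_rewards
```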
Fig. 3 Comparison of the performance of DQN and DDQN with ASDQN and ASDDQN. a Experimental setup to evaluate the continuum robot
performance under ASDDQN. b–e Evaluation of trajectory precision with b DQN (RMSE: >6.00 mm), c DDQN (RMSE: >6.00 mm), d ASDQN (RMSE:
<0.61 mm), and e ASDDQN (RMSE: <0.61 mm) (SP start position of robot, TP target position, RP real position). f The trajectory tracking RMSE under the
DQN, DDQN, ASDQN, and ASDDQN controllers (X_RMSE root mean square position error of x-axis). g Training performance of the ASDDQN controller
within 2000 steps (M1–M6 denote the six models in the ASDDQN controller). h The evaluation performance of the DQN, DDQN, ASDQN and
ASDDQN controllers. i Training performance of the DDQN controller, which did not converge after 97,000 steps of training.
Fig. 4 Evaluation of different controllers under complex circumstances. a, b Sin (a) and square (b) function shape tracking under ASDDQN, DDQN and
IK controllers. c–f Complicated pattern tracking under the ASDDQN controller, c circle, d 3D spiral, e 3D flower, and f saddle surface (TP target position, RP
real position, SP start position of robot, IK inverse kinematic).
The sin- and square-function shape tracking results under the ASDDQN controller (Fig. 4a, b) were markedly closer to the reference trajectories than those from the DDQN controller. This proved that the ASDDQN controller had dramatically improved performance in tracking complex patterns when compared with its non-AS counterpart and the IK controller.
Irregular trajectory and complicated 3D point tracking with ASDDQN controller. In the context of a soft continuum robot executing an MIS task, it is imperative that its tip movement exhibit both precision and robustness. Such requirements are essential in ensuring the soft continuum robot can effectively respond to commands for irregular trajectories, regardless of whether it is operating in an open or tissue-enclosed environment that emulates the architecture of a biological body. To test the 3D trajectory tracking performance, the robot was commanded to track different reference shapes on two patterns that cohesively reflected the complexity of 3D movement (Fig. 4c, circle, and Fig. 4d, 3D spiral shape). The tracking RMSEs of the two patterns were 0.87 and 0.55 mm, respectively. The robot was then commanded to track different reference points on two patterns that cohesively reflected the complexity of 3D movement (Fig. 4e, conical helix, and Fig. 4f, circle) to evaluate its 3D point tracking performance. The tracking RMSEs of the two patterns were 0.87 and 0.55 mm, respectively. The robot was also commanded to track reference points on a 3D sine function pattern, and the tracking RMSE was 0.65 mm (see Supplementary Fig. 3). The results showed that ASDDQN had submillimeter errors in tracking 3D shapes, which implies its significance in real surgery applications.

Robust ASDDQN control under external payload. To simulate the scenarios where the continuum robot performs clinical surgical operations on biological tissues or carries a load, such as an endoscopic camera or forceps at its tip, a weight of approximately 10 g was hung at the robot arm tip (illustrated in Supplementary Movie 1). It was then commanded to track the trajectory of a 3D-hat shape. To evaluate the performance of the proposed ASDDQN controller, the IK controller was developed for point tracking and served as the comparison group. In the point tracking experiment, success was defined as the tip reaching the target position within a threshold of 0.4 mm (Fig. 5a). The performance of the ASDDQN controller (RMSE = 0.63 mm for the 3D-hat shape) was significantly better than that of the IK controller (RMSE = 4.85 mm for the 3D-hat shape). The inaccuracy of the IK controller in capturing robot motion under a large external payload was caused by the inverse kinematics model being developed under the constant curvature assumption, which is not applicable in complicated scenarios where curvature becomes irregular and highly variant. The obtained results serve to demonstrate the robustness of our ASDDQN controller, a quality which can be attributed to two key factors. Firstly, it was observed that the addition of payload did not remarkably impact the performance of our robot arm. This is largely due to the inherent robustness of the arm itself, enabling it to effectively execute movements in response to instructions during operation. Furthermore, the findings also highlight the anti-interference capabilities of our ASDDQN controller, as it was observed to readily adapt its strategy when the state of the system was altered by external factors, such as the impact generated at the end of the robot arm. As a result of these robust features, the imposed load had a minimal impact on the overall performance of the system.

The performance of ASDDQN controller in surgery simulation by using soft obstacles. The robot arm was placed between a pair of foams that served as an environmental constraint. It resembled the soft-tissue analogs in the gastrointestinal wall or the inner epidermis of the uterus, aiming to investigate the online learning capability of the tracking performance in the presence of soft perturbations. The robot arm was commanded to follow a straight-line trajectory that crossed the peaks and valleys of the foam under the control of ASDDQN, which ensured the robot was in contact with and pressed the soft obstacle (Fig. 5b). ASDDQN achieved performance with an RMSE of 0.49 mm, consistent with the RMSE values presented in load-free circumstances. This suggests the robustness of ASDDQN in real-world practice.
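As an illustration of how the metrics reported in these experiments can be computed, the following Python sketch evaluates the trajectory RMSE and the point-tracking success rate under the 0.4-mm threshold; the function names and the synthetic example data are hypothetical, not taken from the paper's codebase.

```python
import numpy as np

def tracking_rmse(targets: np.ndarray, reached: np.ndarray) -> float:
    """RMSE of tracking dislocation: root mean square of the Euclidean
    distances between each target point and the point actually reached."""
    dists = np.linalg.norm(targets - reached, axis=1)  # per-point error (mm)
    return float(np.sqrt(np.mean(dists ** 2)))

def success_rate(targets, reached, threshold_mm=0.4):
    """Fraction of points whose final tip error is within the threshold."""
    dists = np.linalg.norm(np.asarray(targets) - np.asarray(reached), axis=1)
    return float(np.mean(dists <= threshold_mm))

# Hypothetical example: 10 target points (mm) and noisy reached positions.
rng = np.random.default_rng(0)
targets = rng.uniform(-20, 20, size=(10, 3))
reached = targets + rng.normal(scale=0.3, size=(10, 3))
print(f"RMSE: {tracking_rmse(targets, reached):.2f} mm")
print(f"success rate (<=0.4 mm): {success_rate(targets, reached):.2f}")
```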
Fig. 5 Performance evaluation in different simulation scenarios. a 3D hat tracking with an external payload under the ASDDQN controller. The tip was
loaded with a 10 g weight. b Illustration of robot point tracking for endometrial repair application. c Time-lapse capture of real-world tracking (SP start
position of robot, TP target position, RP real position, IK inverse kinematic). Scale bar, 5 cm.
In Fig. 5c, a point-tracking experiment was implemented under the soft obstacles (Supplementary Movie 2). After reaching the final target position, the robot tip returned to its start position. The final target point was tracked 10 times, and the end-tip position error was 0.53 mm. This precision level meets the requirement for most MIS surgeries26.

Discussion
In this study, an AS framework was proposed to reduce the SC and speed up the convergence. The superiority of controlling the cable-driven soft continuum robot using AS framework-based DRL was presented on a cable-driven soft continuum robot with four tendons per section, and compared with normal DRL and a traditional inverse kinematics-based controller.

The time and computation consumption were dramatically reduced due to the reduction of SC using the user-friendly AS-based DRL methods: convergence of the AS-based DRL controller was rapidly reached after 9000 steps, compared with the circumstance that nonconvergence persisted even after 97,000 steps of training in the non-AS DRL controller. It also eliminates the dependence on the virtual simulation environment in real-world DRL training of multi-DoF soft continuum robots.

Both algorithmic and robotic generalization were promising for the AS framework employed in this study. The implementation of the AS-based controller on diverse value-based RL algorithms, i.e., DQN and DDQN, led to noteworthy enhancements in both convergence speed and control accuracy of soft robot movements. Currently, we have exclusively applied the AS framework to the cable-driven soft continuum robot. However, the AS framework holds potential for broader application across other soft continuum robots, provided they possess partitionable task spaces.
The physical experiments demonstrated that AS framework-based RL consistently yielded submillimetre tracking accuracy and high stability under different test environments, including different tracking patterns and tracking under external payloads and soft obstacles. The navigation accuracy in space by AS augmentation was increased by over 10-fold when compared with the counterpart DRL methods and over 5-fold versus the inverse kinematics-based methods.

Further work involves developing a more stable multi-DoF soft continuum robot that can undergo larger training episodes to meet the needs of more actual applications. Notably, there is an urgent demand to develop open-loop AS-framework-based DRL controllers that incorporate end-tip orientation for soft continuum robots capable of meeting the requirements of clinical applications. Although the soft robot demonstrated robust performance under payload or when obstructed by a soft-tissue analog, its robust performance under different real anatomic conditions remains to be examined. Eventually, it may pave the way for clinically applicable robots that hold promise to revolutionize medical operations. The motion capture implemented by the MCS in this study is transferable to medical radiography for motion capture and guidance.
Methods
Robot design. The robot arm was 3D-printed using stereolithography of sterilizable high-strength nylon to allow potential integration with imaging modalities, such as magnetic resonance imaging (MRI). It was driven by six Gathering wire ropes (Hongxiang Company, China) via six customized motor bushings (Hongtai Company, China). Six 42 stepper motors (Markerbase Company, China) were connected with the motor bushings as the actuators. Motors 1, 2, and 5 were responsible for the pitch motion (the x–z plane). Motors 3, 4, and 6 were responsible for the yaw motion (the y–z plane). The linear guide was responsible for the z-axis movement (Supplementary Fig. 2, Supplementary Table 2, and Movies 3 and 4).

Experimental platform setup. A closed-loop feedback position control of the robot was used with continuous positional data from the MCS. The MCS collected the sensor data, while the control program was implemented in Arduino and Python 3. The communication between the Arduino microcontroller and the computer was realized by the Transmission Control Protocol/Internet Protocol (TCP/IP). The Arduino microcontroller received action signals, delivered the corresponding signals to control the motors, then read and sent sensor data to the computer. The computer received the sensor data as the input to the DRL algorithms implemented on PARL 2.0.3. The data were processed, and the best action of the current policy was chosen and sent to the Arduino microcontroller.
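The computer-side half of this loop can be pictured with a minimal Python sketch of a TCP client; the host address, port, and one-byte-action/three-float-position message format are hypothetical placeholders, since the paper does not specify its wire protocol.

```python
import socket
import struct

HOST, PORT = "192.168.1.10", 5005  # hypothetical Arduino endpoint

def control_step(sock: socket.socket, action_id: int) -> tuple:
    """Send one action index, then read back the tip position (x, y, z).

    Assumes a toy protocol: a 1-byte action code out, three little-endian
    float32 coordinates (mm) back. The real message format may differ."""
    sock.sendall(struct.pack("<B", action_id))
    data = sock.recv(12)           # 3 x 4-byte floats from the MCS reading
    return struct.unpack("<3f", data)

# Usage sketch: connect once, then alternate action -> sensor reading.
# with socket.create_connection((HOST, PORT)) as sock:
#     tip = control_step(sock, action_id=3)
#     print("tip position (mm):", tip)
```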
Markov decision process (MDP). In RL, the Markov property of Markov decision processes (MDPs) is commonly exploited by RL agents. The simplest MDP can be represented as a tuple <S, A, P, R, γ>, where S, A, P, R, and γ represent the state, action, transition model, reward, and discount factor12. The specific definitions of the task in our control settings were defined as follows:

● State (S): states were defined as $(\xi_1, \xi_2, \xi_3) = (x_{target} - x_{tip},\ y_{target} - y_{tip},\ z_{target} - z_{tip})$, where $\xi_1$, $\xi_2$ and $\xi_3$ are the components of the vector from the robot's tip position to the target position.

● Action (A): a total of 10 actions were defined. Actions 1–4 were responsible for the bending along the X+ and X− axes, actions 5–8 for Y+ and Y−, and actions 9–10 for Z+ and Z−. Each action was implemented by a specific displacement of a cable connecting to the motor. Supplementary Fig. 2 shows the correspondence between each action and cable. Actions 9 and 10 represented the linear guide moving upwards and downwards by a specific displacement, respectively. The settings for actions 1–10 in this study were adjusted according to the parameters of the robots, e.g., the minimum step of a stepping motor. At each iteration, one agent selected an action $a \in A$, where $a$ represented the cable displacement in mm and $A = \{a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8, a_9, a_{10}\}$ is a predefined action set.

● Reward (R): the reward function provided a scalar numerical signal as feedback to the agents. The reward function is defined as follows:

$$\mathrm{dist}[i] = \sqrt{(p_{target}[i] - p_{tip}[i])^{2}} \qquad (4)$$

$$r = -\mathrm{dist}[i]\, e^{\mathrm{dist}[i]} \qquad (5)$$

where i = 0, 1, 2, 3, 4, and 5 represent subspaces 1–6, respectively. Equation (4) represents the spatial Euclidean distance between the target and actual position of the end tip of the manipulator. Equation (5) ensured that the reward would be 0 if the manipulator tip reached the target position, and the continuity of the reward was also guaranteed. Soft robots are made of materials with a limited fatigue strength, such as nylon (fatigue strength, 12,400 psi) in this study. In real-world robot training with DRL, the robot would fracture if bending were repeated beyond the specific limit. Equation (5) was set as an exponential function with the exponent being the spatial Euclidean distance between the target position and the actual end-tip position of the manipulator. This provided a considerable punishment when the robot was far away from the target position.
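As a concrete reading of Eqs. (4) and (5), a minimal Python sketch of the per-step reward follows; the negative sign reflects the punishment interpretation described above, and the variable names are illustrative rather than from the released code.

```python
import numpy as np

def reward(p_target: np.ndarray, p_tip: np.ndarray) -> float:
    """Reward of Eqs. (4)-(5): zero at the target, steeply negative far away.

    dist: Euclidean distance between target and tip (Eq. 4);
    r = -dist * exp(dist) (Eq. 5), so the penalty grows exponentially,
    discouraging long excursions that fatigue the nylon arm."""
    dist = float(np.linalg.norm(p_target - p_tip))
    return -dist * np.exp(dist)

# Example: zero reward exactly at the target, a large penalty 2 mm away.
print(reward(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])))  # -0.0
print(reward(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0])))  # ~ -14.8
```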
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Code availability
Code can be accessed on GitHub at https://github.com/shaddock-123/Axis-space-AS. To utilize the code, it is necessary to first download the SEEKER software in order to obtain the position information of the optical marker.

Received: 11 October 2022; Accepted: 17 August 2023;

References
1. Burgner-Kahrs, J., Rucker, D. C. & Choset, H. Continuum robots for medical applications: a survey. IEEE Trans. Robot. 31, 1261–1280 (2015).
2. Runciman, M., Darzi, A. & Mylonas, G. P. Soft robotics in minimally invasive surgery. Soft Robot. 6, 423–443 (2019).
3. Ji, G. L. et al. Towards safe control of continuum manipulator using shielded multiagent reinforcement learning. IEEE Robot. Autom. Lett. 6, 7461–7468 (2021).
4. Webster, R. J. & Jones, B. A. Design and kinematic modeling of constant curvature continuum robots: a review. Int. J. Robot. Res. 29, 1661–1683 (2010).
5. Abu Alqumsan, A., Khoo, S. & Norton, M. Robust control of continuum robots using Cosserat rod theory. Mech. Mach. Theory 131, 48–61 (2019).
6. Grazioso, S., Di Gironimo, G. & Siciliano, B. A geometrically exact model for soft continuum robots: the finite element deformation space formulation. Soft Robot. 6, 790–811 (2019).
7. Di, H., Xin, Y. & Jian, J. Review of optical fiber sensors for deformation measurement. Optik 168, 703–713 (2018).
8. Morimoto, R., Nishikawa, S., Niiyama, R. & Kuniyoshi, Y. Model-free reinforcement learning with ensemble for a soft continuum robot arm. In 2021 IEEE 4th International Conference on Soft Robotics (RoboSoft) 141–148 (IEEE, 2021).
9. Xu, W., Chen, J., Lau, H. Y. K. & Ren, H. Data-driven methods towards learning the highly nonlinear inverse kinematics of tendon-driven surgical manipulators. Int. J. Med. Robot. Comput. Assist. Surg. 13, e1774 (2016).
10. Wang, Z. et al. Hybrid adaptive control strategy for continuum surgical robot under external load. IEEE Robot. Autom. Lett. 6, 1407–1414 (2021).
11. Thuruthel, T. G. et al. Learning closed loop kinematic controllers for continuum manipulators in unstructured environments. Soft Robot. 4, 285–296 (2017).
12. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).
13. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
14. Satheeshbabu, S., Uppalapati, N. K., Chowdhary, G. & Krishnan, G. Open loop position control of soft continuum arm using deep reinforcement learning. In IEEE International Conference of Robotics and Automation (ICRA) 5133–5139 (IEEE, 2019).
15. Miki, T. et al. Learning robust perceptive locomotion for quadrupedal robots in the wild. Sci. Robot. 7, eabk2822 (2022).
16. Ding, Q., Lu, Y., Kyme, A. & Cheng, S. S. Towards a multi-imager compatible continuum robot with improved dynamics driven by modular SMA. In 2021 IEEE International Conference on Robotics and Automation (ICRA) 11930–11937 (IEEE, 2021).
17. Zhang, T. et al. Millimeter-scale soft continuum robots for large-angle and high-precision manipulation by hybrid actuation. Adv. Intell. Syst. 3, 2000189 (2021).
18. Ibarz, J. et al. How to train your robot with deep reinforcement learning: lessons we have learned. Int. J. Robot. Res. 40, 698–721 (2021).
19. Yeshmukhametov, A., Koganezawa, K. & Yamamoto, Y. A novel discrete wire-driven continuum robot arm with passive sliding disc: design, kinematics and passive tension control. Robotics 8, 51 (2019).
20. Tan, J. et al. Sim-to-real: learning agile locomotion for quadruped robots. In 2018 Proceedings of Robotics: Science and Systems (RSS, 2018).
21. Peng, X. B., Andrychowicz, M., Zaremba, W. & Abbeel, P. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA) 3803–3810 (IEEE, 2018).
22. Chebotar, Y. et al. Closing the sim-to-real loop: adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA) 8973–8979 (IEEE, 2019).
23. Arndt, K., Hazara, M., Ghadirzadeh, A. & Kyrki, V. Meta reinforcement learning for sim-to-real domain adaptation. In 2020 IEEE International Conference on Robotics and Automation (ICRA) 2725–2731 (IEEE, 2020).
24. Zhao, W. S., Queralta, J. P. & Westerlund, T. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI) 737–744 (IEEE, 2020).
25. Agarwal, A., Kakade, S. & Yang, L. F. Model-based reinforcement learning with a generative model is minimax optimal. In Proceedings of Thirty Third Conference on Learning Theory 67–83 (PMLR, 2019).
26. Shi, C. Y. et al. Shape sensing techniques for continuum robots in minimally invasive surgery: a survey. IEEE Trans. Bio-Med. Eng. 64, 1665–1678 (2017).

Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant numbers: 61971255, 82341019 and 82111530212), the Natural Science Foundation of Guangdong Province (grant number: 2021B1515020092), the Shenzhen Science and Technology Innovation Commission (grant numbers: WDZC20200821141349001, RCYX20200714114736146, RCBS20200714114911104, and KCXFZ20200201101050887), and the Shenzhen Bay Laboratory Fund (grant number: SZBL2020090501014).

Author contributions
S.M.: conceptualization, supervision, writing—original draft, and writing—review and editing; D.W.: investigation, methodology, mechatronics and control system development, data analysis, visualization, and writing—original draft; J.Z.: methodology, formal analysis, visualization; Y.Z.: methodology and organoid culture; J.M.: typesetting and proofreading.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s44172-023-00110-2.

Correspondence and requests for materials should be addressed to Shaohua Ma.

Peer review information Communications Engineering thanks the anonymous reviewers for their contribution to the peer review of this work. Primary handling editors: Thanh Do, Rosamund Daw, Mengying Su. A peer review file is available.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2023