Cognitive Manipulation: Semi-supervised Visual Representation and Classroom-to-real Reinforcement Learning for Assembly in Semi-structured Environments (2024)

Chuang Wang, Lie Yang, Ze Lin, Yizhi Liao, Gang Chen, and Longhan Xie. This work was supported by the National Key Research and Development Program of China (Grant No. 2021YFB3301400) and by a research project funded by the National Natural Science Foundation of China (Grant No. 52075177). (Corresponding authors: Gang Chen and Longhan Xie.) Chuang Wang is with the South China University of Technology, Guangzhou, China (e-mail: wichuangwang@mail.scut.edu.cn). Lie Yang is with the School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore (e-mail: lie.yang@ntu.edu.sg). Gang Chen is with the South China University of Technology, Guangzhou, China (e-mail: gangchen@scut.edu.cn). Longhan Xie is with the South China University of Technology, Guangzhou, China (e-mail: melhxie@scut.edu.cn).

Abstract

Assembling a slave object into a fixture-free master object represents a critical challenge in flexible manufacturing. Existing deep reinforcement learning-based methods, while benefiting from visual or operational priors, often struggle with small-batch precise assembly tasks due to their reliance on insufficient priors and costly model development. To address these limitations, this paper introduces a cognitive manipulation and learning approach that utilizes skill graphs to integrate learning-based object detection and fine manipulation models into a cohesive modular policy. This approach enables detection of the master object from both global and local perspectives, to accommodate positional uncertainties and variable backgrounds, and a parametric residual policy, to handle pose errors and intricate contact dynamics effectively. Leveraging the skill graph, our method supports knowledge-informed learning: semi-supervised learning for object detection and classroom-to-real reinforcement learning for fine manipulation. Simulation experiments on a gear-assembly task demonstrate that the skill-graph-enabled coarse-operation planning and visual attention are essential for efficient learning and robust manipulation, showing improvements of 13% in success rate and 15.4% in the number of completion steps over competing methods. Real-world experiments further validate that our system is highly effective for robotic assembly in semi-structured environments.

Index Terms:

Robotic assembly, Semi-structured environment, Object detection, Semi-supervised learning, Residual reinforcement learning.

I Introduction

Flexible manufacturing systems aim to swiftly adapt to market demands and individual customer requirements, facilitating a quick and cost-effective response to new tasks [1, 2]. In contemporary industrial robotics, flexibility is primarily achieved through automated end-effector changes, efficient robot programming, and the utilization of component-specific fixtures [3]. In low-volume batch production, robotic assembly systems tailored for flexible manufacturing must handle objects that are randomly positioned and unsecured by fixtures, thereby enhancing flexibility and adaptability across various product types at the hardware level [4, 5]. While this less structured approach reduces the need for developing specific fixtures, saving both time and cost, it introduces significant software challenges, particularly the need to precisely identify randomly located objects within a defined workspace and to accurately control force during precision assembly tasks.

[Figure 1]

Human operators, leveraging task-specific knowledge and prior experience, intuitively manage these complexities to execute high-precision assembly tasks in such environments. Conversely, designing or learning an efficient and robust policy that enables robots to perform similarly in semi-structured environments poses a considerable challenge. Early solutions involved multi-stage methods that utilized vision systems to estimate pose errors and guide robots via visual servoing, complemented by force/torque-based algorithms to correct both visual and positional inaccuracies during the insertion process [6]. More recently, deep reinforcement learning (DRL) has emerged as a promising alternative, formulating effective policies through trial-and-error learning without relying on precise modeling and sensors [7]. However, these methods typically require extensive empirical knowledge or substantial training data, which can be time-consuming and labor-intensive, involving tasks such as image labeling, parameter tuning, or costly interactions.

To address this issue, recent advancements in robotic manipulation have focused on integrating multi-stage methods and deep reinforcement learning to obtain robust and efficient policies [8, 9]. Such neural-symbolic frameworks combine the convenience of traditional methods with the flexibility of RL, compensating for the inaccuracy of traditional positioning methods and the low sample efficiency of RL. Despite these advancements, significant challenges remain in exploiting the complementary advantages of the various modules for fixture-less precise assembly tasks. The primary issue is that a single visual or operational representation fails to provide the comprehensive understanding of robots, tasks, and environments that robot learning needs to handle large position uncertainty and complex contact dynamics [10, 11]. Additionally, both hand-designed and learning-based prior visual and operational representations often require substantial engineering effort or extensive training data, complicating their application in real-world scenarios [12].

A key differentiator between human and non-human cognition is the capacity for structured knowledge representation, which has proven essential in addressing these challenges [13]. Such representations, linked with visual [14, 15, 16] or operational priors [17, 18, 19] and with knowledge-guided learning processes [20], have successfully captured a wide range of human cognitive processes, enabling efficient autonomous learning with minimal supervisory intervention [21, 22]. Drawing inspiration from the human approach to learning and manipulation, this paper introduces the Cognitive Manipulation Method for Robotic Assembly in Semi-Structured Environments (CM4RASSE). This method utilizes graph-based structural prior knowledge to integrate learning-based object detection and fine manipulation models into a cohesive modular policy, promoting self-directed learning with minimal human oversight.

The proposed approach begins by establishing a skill graph that consolidates spatial, temporal, and causal information, creating a structured yet flexible cognitive manipulation architecture that integrates multiple modules. This framework complements the general object detection model as a rich visual representation for fine manipulation by providing operationally relevant positional and visual attention information. Additionally, a rich operational representation based on the skill graph and planning enables the transition from global observation and coarse operation to local observation and fine operation, addressing the challenges of assembly tasks in semi-structured environments. The proposed knowledge-informed developmental training method mirrors the human cognitive process of observing before acting and of mastering skills in controlled settings before tackling real-world scenarios. Initially, we employ the skill graph to collect diverse samples from pick-and-place interactions and use minimal manual labeling for calibration and automated labeling, facilitating semi-supervised object detection learning and cost-effective hand-eye-task calibration. This phase yields a rich vision representation that is robust to position uncertainty and variable backgrounds. Subsequently, the skill graph integrates visual perception with coarse operation planning, providing structured data that includes spatial locations and task-focused perspectives and enabling the agent to learn a residual manipulation policy instead of learning from scratch. Moreover, this structured approach facilitates the transfer of learned skills to semi-structured environments, where variability and unpredictability are more pronounced. The efficacy of CM4RASSE is first studied in simulation and subsequently evaluated on a real robot in a high-precision, jigless peg-in-hole and gear assembly task. The results show that the vision and manipulation models can be learned efficiently for new tasks under prior-knowledge guidance, and that the integrated policy performs effectively in the semi-structured environment.

This work presents a cognitive manipulation method, leveraging a skill graph to integrate learning-based visual perception and fine manipulation to facilitate efficient and effective learning of new tasks. Our primary contributions are as follows:

  1. A novel neural-symbolic framework: We develop a skill graph that serves as a common-sense structure to integrate the various modules, thereby handling large position uncertainty and complex contact dynamics.

  2. Semi-supervised visual representation learning: We leverage the skill graph to collect a diverse array of samples with minimal manual labeling for cost-effective hand-eye-task calibration and automated labeling, enabling semi-supervised learning of object detection.

  3. Classroom-to-real residual reinforcement learning: Our skill graph, combining visual perception with trajectory planning, supports the learning of fine manipulation policies within a structured environment and enables a seamless transition of the acquired skills to a semi-structured environment, effectively navigating the inherent challenges of precise assembly tasks.

  4. Comparative and comprehensive studies: We conduct comparative and comprehensive studies to assess the effectiveness of each component and of the integrated policy in terms of learning efficiency and assembly performance.

II Related work

Robotic systems require precise state feedback and sophisticated control policies to effectively address specific tasks in semi-structured environments. This section reviews cutting-edge methodologies and significant advancements in robotic assembly within such environments, highlighting visual and operational representations for robot learning as well as concepts derived from human cognitive systems.

II-A Robotic Assembly in Semi-structured Environments

Multi-stage methods have been pivotal in precision assembly tasks, employing an integrated robotic system across three distinct phases [2]: 1) Initial Approach: Utilizing an eye-to-hand camera, the system employs position-based visual servoing (PBVS) to navigate towards the master object. 2) Alignment: A force/torque-based local search method corrects alignment errors to ensure precise fitting. 3) Insertion Execution: Discrepancies in position and orientation are rectified using a force/torque control algorithm, ensuring successful component insertion. To enhance the efficiency of local search for assembling components with complex geometries, the part’s geometry itself guides the alignment process through image-based visual servoing (IBVS) with an eye-in-hand camera [23]. Hybrid strategies that combine both eye-to-hand and eye-in-hand cameras merge the benefits of different visual servoing techniques, providing comprehensive visual cues that maintain target visibility throughout the operation [24, 25]. However, the success of these systems heavily depends on the precise selection of features and the strategic design of control methodologies, necessitating meticulous calibration and parameter adjustments to minimize errors and ensure operational stability.

Innovative approaches have been introduced to bolster robustness against variations in surface geometry and lighting conditions and to simplify vision, calibration, and control processes. For instance, Haugaard et al. [26] introduced a deep learning approach for pin and hole point estimation in multi-camera setups, facilitating visual servoing for initial alignment. Mou et al. [27] devised a technique for more precise and efficient position estimation of manipulated connectors, leveraging YOLO-based relevant region detection. The authors of [28] propose 6D pose estimation of template geometries to which manipulated objects should be connected. To reduce design complexity and improve policy quality, reinforcement learning (RL) [7] presents an alternative that favors trial-and-error learning over precise modeling. To improve robustness against environmental variations, spatial attention point network models [29] have been introduced, employing visual attention to extract pertinent image features for motion controllers and utilizing offline training to enhance sample efficiency. However, it is expensive to collect sufficient experience data from real-world scenarios. Furthermore, the low clearance and contact dynamics in precision assembly tasks complicate demonstrations [30], simulation-to-real transfer [31], and offline training.

Several studies have also leveraged model-based and learning-based modules to construct robust and efficient policies. Building on multi-stage learning, Lee et al. [32] employed vision-based uncertainty estimation to differentiate between free-space and contact-rich regions, applying model-based methods in free space for minimal environmental interaction and RL techniques to navigate inaccuracies in the perception/action pipeline. Zhao et al. [9] proposed a fine positioning policy learned by DRL under an eye-in-hand camera view, combined with a traditional coarse positioning method and impedance control. Based on residual learning, Shi et al. [8] combined an eye-in-hand vision-based fixed policy with a contact force-based parametric policy to enhance the robustness and efficiency of the RL algorithm. Besides force-based trajectory generators, [33] introduced an image-based trajectory generator trained by DRL to enable a robot to adapt to assembly parts with different shapes. Similarly, [34] proposed a residual high-level visual policy that determines the robot pose increment in Cartesian space through deep RL. However, these methods, which directly couple the modules, cannot handle large position uncertainty and complex contact dynamics simultaneously, and they are difficult to adapt quickly to new tasks because they do not fully exploit the advantages of the individual modules.

II-B Visual Representation for Robot Learning

The integration of prior visual models into the RL framework has shown promise in enhancing learning efficiency and generalization in unstructured settings through detection [14], pose estimation [35, 16], and visual affordances [15]. Unsupervised learning [13], self-supervised learning [36], and hybrid observation synthesis [37] have been applied to learn prior visual models for different robotic skills. Specifically for grasping, [38] proposes self-supervised visual affordance models grounded in real human behavior from teleoperated play data, driving a model-based planner to the vicinity of afforded regions and guiding a local grasping RL policy toward the same object regions favored by people. Building on this prior work, this study introduces a semi-supervised visual representation that provides structured information, including spatial location and task attention, for assembly skill learning.

II-C Operational Representation for Robot Learning

Operational representation can reduce the complexity of the solution space of a given manipulation problem by applying a well-designed yet flexible structure, such as formal methods for task- and domain-specific knowledge [18], stochastic graphs [39], switching functions [40], manipulation primitives [41], and graph-based skill formalisms [17, 19]. In particular, [19] uses temporal abstraction and task decomposition as the higher-level policy in a hierarchical reinforcement learning method to reduce problem complexity. Building on this work, we further extend the skill graph to fixture-less assembly tasks by integrating object detection, coarse operation planning, and a residual fine manipulation policy.

II-D Cognitive Systems and Learning Mechanisms

Research in cognitive robotics aims to emulate human intelligence, paving the way for the development of human-level artificial intelligence through cognitive architectures that leverage core capabilities such as sensing, cognition, learning, and control [42]. Existing theories offer crucial insights for creating foundational elements and learning strategies for cognitive systems, such as hybrid neural-symbolic models [20] and top-down learning [43]. In particular, [44] adopted a connectionist approach for object recognition and compliant motion learning based on adaptive resonance theory (ART), aiming to design robotic agents for assembly tasks. This study employs a skill graph to integrate the neural models, enabling human-like operation and learning.

III Problem Statement

This work focuses on locating the master object and inserting the slave object when the master object is randomly positioned within the workspace, as shown in Fig. 2.

[Figure 2]

The task can be formulated as a Markov Decision Process (MDP) with a transition function $P(S_{t+1}|S_t,A_t)$, which is a probability distribution over the next state $S_{t+1}$ conditioned on the execution of a certain action $A_t$ in the current state $S_t$. We want to find a policy $\pi(A_t|S_t)$ that dictates the probability over actions conditioned on a given state. The complexity of the transition function $P$ is determined by the degree of structure of the environment, which directly affects the difficulty of designing or learning the policy $\pi$. Unlike the environment in which humans live, more structured industrial scenarios can provide more prior knowledge for the learning process [45, 46, 32, 47]. Therefore, the assumption of known partial knowledge of the state $S_t$ and transition function $P$ is exploited as follows:

  1. Semi-structured industrial scenarios, such as flat workbenches and limited work areas, add constraints to the robot's behavior and can also simplify the robot's perception and operation requirements. In this work, the dimensions of the master object's pose are categorized into constrained and unconstrained parts. The constrained dimensions $z$, $rx$, and $ry$ are restricted by the workspace, influenced by the task and workspace shape. Conversely, the unconstrained dimensions $x$, $y$, and $rz$ vary within a specific range $S_{uncon}:[(x_{min},x_{max}),(y_{min},y_{max}),(rz_{min},rz_{max})]$.

  2. General knowledge about manipulation tasks can guide strategy design and learning, considering the spatial separation between the master and slave objects. The manipulation process can be divided into contact-free $S_{cf}$ and contact-rich $S_{cr}$ regions based on task geometry, with attention to uncertainties in pose estimation $E_r$ and contact dynamics $F^{max}$ in the contact-rich region $S_{cr}$ for precise assembly tasks. Humans utilize global and local fields of view sequentially to enhance fine manipulation, addressing the constraints of a single camera.

  3. The geometric parameters and even the forward and inverse kinematics of the robot are often available from the robot supplier. This allows us to design and learn the manipulation policy in task space. We can also use it to obtain geometric information about tools, platforms, and tasks by demonstration.

The challenge is to incorporate general prior knowledge and a learning-based model to address task-specific uncertainty. This work aims to propose a neural-symbolic cognitive manipulation method for assembly skill learning, enabling the use of prior knowledge to train a visual representation and a fine manipulation policy that handle the uncertainties of the robot, environment, and task.

IV Method

This work introduces a novel cognitive manipulation method for solving assembly tasks in semi-structured environments, as shown in Fig. 3. Central to our approach is the skill graph, which orchestrates multiple modules within a mixed-strategy framework, driving three key modes of operation: manipulation in semi-structured environments, vision model training, and fine manipulation training. This section presents the proposed methodology in three parts: 1) Cognitive manipulation architecture: a neural-symbolic framework that integrates multimodal and scalable information through a combination of model-based and learning-based methods. 2) Semi-supervised visual representation learning: a cost-effective method for hand-eye-task calibration and object detection training, enhancing the visual representation capabilities essential for precise manipulation. 3) Classroom-to-real residual reinforcement learning: training the residual policy within a specially designed classroom environment and executing the task in semi-structured settings.

[Figure 3]

IV-A Cognitive Manipulation Architecture

The cognitive manipulation architecture is designed to enable effective contact-rich manipulation in semi-structured environments. It leverages a skill graph to integrate multiple modules, including an object detection module to manage positional uncertainties, a model-based system for trajectory planning and compliance settings, and a residual policy for managing pose errors and complex contact dynamics. These components work together to support effective manipulation control, as depicted in Fig. 4 and Alg. 1.

[Figure 4]

IV-A1 Skill Graph Based on General Knowledge and Task Specification

The abstract expert knowledge of assembly tasks in semi-structured environments can be harnessed to segment the manipulation process into distinct stages and components, as depicted in Fig. 5. The skill graph integrates symbolic and subsymbolic representations according to general versus task-specific knowledge and their degree of accessibility, yielding a well-designed yet flexible structure.

We define general knowledge using a partial model that encapsulates spatial, temporal, and causal information. Initially, we consider the spatial information concerning the end-effector (EE) pose in the manipulation, which includes the home position ${}^{B}X_E^s$, the assembly bottleneck pose ${}^{B}X_E^m$, and the assembly goal pose ${}^{B}X_E^g$ in the robot base frame. Temporally, the process is segmented into four stages: reaching ${}^{B}X_E^s$ for global perception to estimate ${}^{B}X_E^g$; planning a coarse operation from ${}^{B}X_E^s$ to ${}^{B}X_E^m$ and then to ${}^{B}X_E^g$; executing the coarse operation to reach ${}^{B}X_E^m$; and performing the fine operation for insertion at ${}^{B}X_E^g$. Causal transition conditions between these stages are defined based on the positional relationships and contact states between the peg and hole. This partial model is illustrated in Eqn. (1) and Fig. 5. Although the introduction of sequential logic and causality is crucial for operating in semi-structured environments, enhancing safety and reducing learning costs under potential interference from other agents (robots or humans), this paper primarily focuses on the sequential logic and causal enhancement of robot learning methods and does not extensively address multiple exceptional states and their management strategies.

$\begin{cases} n_1: f_{gp}(\theta_1;\Omega_1) & c_1: {}^{B}X_O \in S_{uncon} \\ n_2: f_{pl}(\theta_2;\Omega_2) & c_2: \mathrm{done} \\ n_3: f_{cf}(\theta_3;\Omega_3) & c_3: X - {}^{B}X_E^m \in E_{th} \\ n_4: f_{cr}(\theta_4;\Omega_4) & c_4: X - {}^{B}X_E^g \in E_{th} \;\&\; F_z > F_z^{max} \end{cases}$   (1)
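To make the stage and transition structure of Eqn. (1) concrete, the sketch below encodes the four-stage skill graph as a simple sequential state machine; the stage modules and transition predicates are illustrative placeholders rather than the actual implementation.

```python
# Minimal sketch of the four-stage skill graph in Eqn. (1) as a state machine.
# Stage modules and transition-condition callables are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Stage:
    name: str                       # e.g. "n1_global_perception"
    run: Callable[[dict], dict]     # executes the module f(theta; Omega) and updates the context
    done: Callable[[dict], bool]    # causal transition condition c_i
    next_stage: Optional[str]       # name of the following stage, None if terminal

def execute_skill_graph(stages: Dict[str, Stage], start: str, context: dict) -> dict:
    """Run stages sequentially; each stage repeats until its transition condition holds."""
    current = start
    while current is not None:
        stage = stages[current]
        while not stage.done(context):
            context = stage.run(context)
        current = stage.next_stage
    return context
```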

[Figure 5]

Despite the generic nature of temporal and causal information in assembly tasks, task-specific information remains essential. For the manipulation process, the bottleneck pose ${}^{B}X_E^m$ and goal pose ${}^{B}X_E^g$ of the EE are determined by the task's geometry and assembly relationships, the master object pose, the grasp pose for the slave object, and the tool center point (TCP) offset. An Object-Embodiment-Centric (OEC) geometry representation is derived from demonstrations to enable direct prediction of key waypoints in the operational process from the master object pose estimate, as outlined in our prior work [48]. The OEC representation selects a grasping point on the task board as the coordinate origin ${}^{B}X_O$, extracts the bottleneck pose ${}^{B}X_E^m$ and goal pose ${}^{B}X_E^g$ of the EE in the robot base frame from teaching, and then transforms these to ${}^{O}X_E^m$ and ${}^{O}X_E^g$ in the task frame. In the semi-structured environment, the master object is randomly placed within a workspace of range $S_{uncon}$, as specified by a human operator. The home position of the EE, ${}^{B}X_E^s$, is strategically set outside this region to prevent occlusion of the eye-to-hand camera's field of view. With the partial constraints of the workspace, the pose can be determined by estimating the $x$, $y$, and $rz$ dimensions.
The uncertainties introduced by the pose estimation and demonstration system, as well as contact dynamics, are also considered for precise contact-rich task execution. In summary, building on this general knowledge base, both planning-based and learning-based methods are utilized to develop a composite policy, which includes object detection for pose estimation and task-related visual attention extraction, spatially dependent trajectory planning as the base strategy, and task-specific residual strategies for handling uncertainties. A graph structure is devised to reflect the required sequence of motions and modules necessary to complete the task, as illustrated in Fig. 4.
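As a rough illustration of the OEC waypoint transfer described above, the sketch below stores demonstrated end-effector waypoints relative to the task frame and re-expresses them in the robot base frame once the master object pose is re-estimated; representing poses as 4x4 homogeneous transforms is our simplification (the paper works with 6-DoF pose vectors).

```python
import numpy as np

# Sketch of the Object-Embodiment-Centric (OEC) waypoint transfer.
# Poses are 4x4 homogeneous transforms; helper names are illustrative only.

def to_task_frame(T_B_O: np.ndarray, T_B_E: np.ndarray) -> np.ndarray:
    """Express a demonstrated EE pose ^B X_E in the task frame O: ^O X_E = (^B X_O)^-1 @ ^B X_E."""
    return np.linalg.inv(T_B_O) @ T_B_E

def to_base_frame(T_B_O_new: np.ndarray, T_O_E: np.ndarray) -> np.ndarray:
    """Recover the EE waypoint in the base frame for a newly estimated master object pose."""
    return T_B_O_new @ T_O_E

# Usage: store ^O X_E^m and ^O X_E^g once from teaching, then at run time compute
# ^B X_E^m = to_base_frame(T_B_O_estimated, T_O_E_m) for coarse operation planning.
```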

IV-A2 Compliance Controller

Employing a virtual force-driven spring-mass-damper model and the robot kinematics, we utilize a modified Cartesian parallel position and force controller as the low-level controller for robot learning, generating velocity commands. The control law for the joint velocities $\dot{q}$ is expressed as:

$\dot{q} = \frac{J^{-1}M^{-1}}{s + M^{-1}B}\left[K(X_d - X) - (F_d - F)\right]$   (2)

where the Jacobian matrix $J$ relates end-effector and joint velocities. The desired pose $X_d$ and force $F_d$ govern the behavior, with the stiffness matrix $K$ balancing the six-dimensional tracking errors for position/orientation and force/torque. The inertia matrix $M$ and damping matrix $B$ influence the response speed and stability.
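As one possible discrete-time reading of Eqn. (2), the sketch below replaces the transfer function $1/(s + M^{-1}B)$ with a forward-Euler admittance filter ($M\dot{v} + Bv = u$) and maps the filtered Cartesian velocity to joint velocities; this is an illustrative implementation, not the authors' controller code.

```python
import numpy as np

class CartesianComplianceController:
    """Sketch of the parallel position/force compliance law in Eqn. (2),
    discretized with a forward-Euler admittance filter (M v_dot + B v = u)."""

    def __init__(self, M: np.ndarray, B: np.ndarray, K: np.ndarray, dt: float):
        self.M_inv = np.linalg.inv(M)   # 6x6 virtual inertia
        self.B = B                      # 6x6 virtual damping
        self.K = K                      # 6x6 stiffness
        self.dt = dt
        self.v = np.zeros(6)            # filtered Cartesian velocity state

    def step(self, J: np.ndarray, X: np.ndarray, X_d: np.ndarray,
             F: np.ndarray, F_d: np.ndarray) -> np.ndarray:
        """Return the joint velocity command q_dot for the current measurements."""
        u = self.K @ (X_d - X) - (F_d - F)                  # parallel position/force error
        self.v += self.dt * self.M_inv @ (u - self.B @ self.v)
        return np.linalg.pinv(J) @ self.v                   # pseudo-inverse in case J is non-square
```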

IV-A3 Object Detection for Pose Estimation and Task Attention

Object detection is a popular technique used to locate objects in an image or video stream. It predicts multiple bounding boxes for objects in the image $I$, and each bounding box contains the predicted values for the object's position $(x, y)$, size $(w, h)$, confidence $c_{con}$, and category $c_{cate}$, as shown in (3). With a pre-defined confidence threshold, the effective predicted bounding box for the object is selected.

$[c_{cate}, x, y, w, h, c_{con}] = \mathrm{detect}(I)$   (3)

To cover the entire workspace and accurately detect the object of interest, we attach an eye-to-hand RGB camera at the top of the workspace to capture the 2D image $I_{eth}$, as shown in Fig. 4 (b). An object detection-based coarse perception system generates one bounding box around the object of interest to obtain its location $(x_0, y_0)$ and two further bounding boxes around predefined feature structures to obtain the locations $(x_1, y_1)$ and $(x_2, y_2)$. According to the eye-to-hand transformation ${}^{r}T_c$, the estimated points $(x_i', y_i')$ are transformed to the robot frame as shown in (4). Considering the partial pose information of the object, including $rx_{con}$, $ry_{con}$, and $z_{con}$, the pose ${}^{B}X_O$ can be determined as in (5). Global perception is used in the first stage $n_1$ to determine whether the master assembly object is ready for the assembly operation on the one hand, and to provide location information for the assembly operation on the other.

$(x_i', y_i') = \mathrm{transform}(x_i, y_i), \quad i = 0, 1, 2$   (4)
${}^{B}X_O = [x_0', y_0', z_{con}, rx_{con}, ry_{con}, \arctan(\frac{y_2 - y_1}{x_2 - x_1})]$   (5)
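As a concrete reading of Eqns. (4) and (5), the sketch below maps the three detected bounding-box centers to the robot frame with the planar transform of Eqn. (12) and assembles the partial pose; the helper names and the use of arctan2 (for numerical robustness) are our assumptions.

```python
import numpy as np

# Sketch of the coarse pose estimate in Eqns. (4)-(5): pixel detections are mapped
# to the robot frame with a planar affine transform (the 2x3 matrix of Eqn. (12)),
# and the constrained z/rx/ry values are filled in from the workspace model.
# Names (affine_px_to_robot, z_con, ...) are illustrative, not the authors' API.

def estimate_master_pose(centers_px, affine_px_to_robot, z_con, rx_con, ry_con):
    """centers_px: [(x0, y0), (x1, y1), (x2, y2)] bounding-box centers in pixels."""
    pts = []
    for (x, y) in centers_px:
        xr, yr = affine_px_to_robot @ np.array([x, y, 1.0])   # Eqn. (4)/(12)
        pts.append((xr, yr))
    (x0, y0), (x1, y1), (x2, y2) = pts
    rz = np.arctan2(y2 - y1, x2 - x1)                          # in-plane orientation
    return np.array([x0, y0, z_con, rx_con, ry_con, rz])       # ^B X_O of Eqn. (5)
```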

The second model uses the image $I_{eih}$ from an eye-in-hand camera to provide local task detection as attention, enhancing the ability to differentiate the task from the environment, as indicated in Fig. 4 (b). Object detection provides a bounding box as a region of interest (ROI) identifying the specific structure crucial for vision-based precise alignment in assembly tasks. This work uses a simple attention strategy that utilizes this bounding box $(x_a, y_a, w_a, h_a)$ to crop and resize the task-related area from the input image $I_{eih}$ and generate an attention-guided observation $I_{atten}$ for fine manipulation, enabling the residual policy to concentrate on the specific structure amidst varying environments.

$I_{atten} = \mathrm{crop}((x_a, y_a, w_a, h_a), I_{eih})$   (6)
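A minimal sketch of the crop-and-resize operation in Eqn. (6), assuming the bounding box is given as a center point with width and height and a fixed 64x64 policy input (both assumptions for illustration), is:

```python
import numpy as np

# Sketch of the attention crop in Eqn. (6): the detected ROI is cut out of the
# eye-in-hand image and resized to the policy's input resolution.

def crop_attention(image: np.ndarray, box, out_size=(64, 64)) -> np.ndarray:
    x_a, y_a, w_a, h_a = box                      # bounding box: center (x, y), width, height
    x0 = int(max(x_a - w_a / 2, 0)); y0 = int(max(y_a - h_a / 2, 0))
    x1 = int(min(x_a + w_a / 2, image.shape[1])); y1 = int(min(y_a + h_a / 2, image.shape[0]))
    roi = image[y0:y1, x0:x1]
    # nearest-neighbour resize via index sampling, to avoid extra dependencies
    ys = np.linspace(0, roi.shape[0] - 1, out_size[0]).astype(int)
    xs = np.linspace(0, roi.shape[1] - 1, out_size[1]).astype(int)
    return roi[np.ix_(ys, xs)]
```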

IV-A4 Coarse Operation Planning

With the partial model and the coarse perception system, we can plan a coarse operation as the second stage $n_2$. We first obtain the assembly goal pose ${}^{B}X_E^g$ and bottleneck pose ${}^{B}X_E^m$ from the estimated pose ${}^{B}X_O$ and the OEC geometry information ${}^{O}X_E^m$ and ${}^{O}X_E^g$, which divide the operation into contact-free and contact-rich manipulation.

The uncertainty due to pose estimation and compliance control can be ignored in the contact-free region $S_{cf}$. A fast minimum-jerk trajectory $\tau_{cf}$ between the home pose ${}^{B}X_E^s$ and the bottleneck pose ${}^{B}X_E^m$ can be generated. In addition, a high stiffness $K_{cf}$ of the compliance controller is used to ensure acceptable position tracking errors. Together, the trajectory and stiffness provide the coarse operation for the contact-free region, which can be defined as Eqn. (7).

$\pi_H^{cf} \sim (\tau_{cf}(t), K_{cf})$   (7)
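The trajectory $\tau_{cf}$ above is only described as a fast minimum-jerk motion; a minimal sketch, assuming the standard quintic minimum-jerk blend and simple linear interpolation of all six pose dimensions (orientation would normally be interpolated on SO(3)), is:

```python
import numpy as np

# Sketch of a minimum-jerk trajectory tau_cf(t) between two end-effector poses,
# using the standard quintic blend s(tau) = 10 tau^3 - 15 tau^4 + 6 tau^5.

def min_jerk_trajectory(x_start: np.ndarray, x_goal: np.ndarray, duration: float, dt: float):
    """Yield intermediate 6-DoF waypoints from x_start to x_goal."""
    steps = max(1, int(duration / dt))
    for k in range(steps + 1):
        tau = k / steps
        s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5   # smooth 0 -> 1 blend, zero velocity/acceleration at both ends
        yield x_start + s * (x_goal - x_start)
```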

However, the uncertainty cannot be ignored in the contact-rich region. A slow trajectory $\tau_{cr}$ and a small stiffness are used in the contact-rich region $S_{cr}$. We define an exploration space $W$ to represent the offset range of the compliant robot when disturbed by safe external forces $F_{max}$. It should cover the assembly depth and the error range to ensure safe contact and effective error compensation, as shown in Eqn. (8). Furthermore, the small stiffness matrix $K_{cr}$ of the compliance controller is obtained from the estimated exploration space $W$ and the maximum contact force $F_{max}$, which can be defined as,

$W = E_r + \mathrm{abs}({}^{B}X_E^m - {}^{B}X_E^g)$
$K_{cr} = F_{max} \cdot \mathrm{diag}(W)^{-1}$
$\pi_H^{cr} \sim (\tau_{cr}(t), K_{cr})$   (8)
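A direct reading of Eqn. (8) in code, with all numeric values purely illustrative, might be:

```python
import numpy as np

# Sketch of the contact-rich compliance parameters in Eqn. (8): the exploration
# space W combines the expected perception error E_r with the bottleneck-to-goal
# offset, and the stiffness is scaled so that the safe force F_max produces at
# most a deflection of W in each dimension.

def contact_rich_stiffness(E_r: np.ndarray, X_m: np.ndarray, X_g: np.ndarray, F_max: float):
    W = E_r + np.abs(X_m - X_g)                 # per-dimension exploration range
    K_cr = F_max * np.linalg.inv(np.diag(W))    # F_max * diag(W)^-1
    return W, K_cr

# Example (hypothetical numbers): a uniform 2 mm error bound and a 10 N safe force.
# W, K_cr = contact_rich_stiffness(np.full(6, 2e-3), X_m, X_g, F_max=10.0)
```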

IV-A5 Residual Policy for Fine Manipulation

The manipulation is divided into two phases according to the planned coarse operation and is carried out by the compliance controller in (2). In the third stage $n_3$, the end-effector of the robot is moved from the home pose ${}^{B}X_E^s$ to the bottleneck pose ${}^{B}X_E^m$ with the planned efficient policy $\pi_H^{cf}$.

Since the contact-rich assembly manipulation in the fourth stage $n_4$ requires a higher level of accuracy than conventional robot and vision systems provide, the planned policy is switched to the safe contact-rich policy $\pi_H^{cr}$ and the residual policy $\pi_\theta$ is enabled to refine the initial policy for precise localization and complex force control. In addition to guidance from the fixed policy, the residual policy also receives attention-guided observations $I_{atten}$ from object detection. The force/torque reading $F$ from the wrist-mounted sensor and the relative pose $R_p = {}^{B}X_E - {}^{B}X_E^g$ of the end-effector serve as additional observations of the contact dynamics. The residual policy generates the desired force and torque $F_d$ in Eqn. (9), which together with $\pi_H^{cr}$ serve as inputs to the compliance controller in Eqn. (2).

$F_d = \pi_\theta(I_{atten}, R_p, F)$   (9)
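To illustrate how stage $n_4$ combines these pieces, the sketch below (reusing the compliance-controller sketch from Section IV-A2 and treating the residual policy as an opaque callable) computes one control step; none of the names are the authors' API.

```python
# Sketch of one control step in the contact-rich stage n4: the planned coarse
# policy supplies the reference pose X_d and stiffness K_cr, the residual policy
# supplies the corrective wrench F_d (Eqn. (9)), and both feed the compliance
# controller of Eqn. (2). `residual_policy` and `controller` are placeholders.

def fine_manipulation_step(controller, residual_policy, tau_cr, K_cr,
                           t, I_atten, X_E, X_E_goal, F, J):
    X_d = tau_cr(t)                         # reference pose from the slow coarse trajectory
    R_p = X_E - X_E_goal                    # relative pose observation
    F_d = residual_policy(I_atten, R_p, F)  # residual wrench, Eqn. (9)
    controller.K = K_cr                     # low stiffness for safe contact
    return controller.step(J, X_E, X_d, F, F_d)   # joint velocity command, Eqn. (2)
```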

With the help of coarse operation planning, the learning of cognitive manipulation is carried out in separate stages. The object detection models and the hand-eye-task calibration are obtained via semi-supervised visual representation learning in Subsection IV-B. The residual policy is trained by classroom-to-real residual reinforcement learning in Subsection IV-C.

IV-B Semi-supervised Visual Representation Learning

This section delves into training two object detection models and calibrating hand-eye-task relationships using collected samples based on the geometric model of a specific assembly task and the robot’s kinematics. Our approach enables the gathering of a varied sample set through a carefully planned coarse operation in pick-and-place. To address the challenges related to accurate data labeling, we suggest a streamlined calibration and labeling process that significantly reduces the engineering effort. Furthermore, fine-tuning from a pre-trained model is utilized to reduce the reliance on extensive sample volumes.

[Figure 6]

IV-B1 Data Collection via Coarse Operation

The master object may appear at different points $P$ throughout the workspace $S_w$, requiring global localization; moreover, after global localization a relative pose error $R_P$ may remain within the range $E_r$, requiring further local perception as visual attention for fine manipulation. To ensure data diversity for training the position and attention models, we first uniformly sample $m$ points within the workspace where the task board may appear in the real scene, denoted by $P = [x^r, y^r]$. Second, we apply offsets to the bottleneck pose ${}^{B}X_E^m$ and generate $n$ points that cover the uncertainty space while avoiding collisions, defined as $R_P = [\Delta x^r, \Delta y^r, \Delta z^r]$. The sampling points and poses are shown in Fig. 6 (b). We collect eye-in-hand and eye-to-hand images, together with the corresponding positions and relative poses, using a hand-designed trajectory for data diversity, as shown in Eqn. (10).

$(I_i^{eth}, [x_i^r, y_i^r]), \quad i = 1, 2, \ldots, m \times n$
$(I_i^{eih}, [\Delta x_i^r, \Delta y_i^r, \Delta z_i^r]), \quad i = 1, 2, \ldots, m \times n$   (10)
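A sketch of the sampling loop behind Eqn. (10) is shown below; the robot and camera interfaces (place_task_board, move_to_bottleneck_offset, capture) are hypothetical placeholders for whatever drivers the system provides.

```python
import itertools

# Sketch of the sampling scheme behind Eqn. (10): m workspace positions for the
# task board combined with n bottleneck-pose offsets, yielding m x n paired
# eye-to-hand / eye-in-hand samples.

def collect_samples(workspace_xy, offsets, robot, eth_cam, eih_cam):
    """workspace_xy: m task-board positions; offsets: n (dx, dy, dz) perturbations."""
    eth_samples, eih_samples = [], []
    for (x, y), (dx, dy, dz) in itertools.product(workspace_xy, offsets):
        robot.place_task_board(x, y)                   # pick-and-place via the coarse operation
        eth_samples.append((eth_cam.capture(), [x, y]))
        robot.move_to_bottleneck_offset(dx, dy, dz)    # perturbed local viewpoint
        eih_samples.append((eih_cam.capture(), [dx, dy, dz]))
    return eth_samples, eih_samples
```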

IV-B2 Calibration and Automated Labeling

For the eye-to-hand camera, the transformations between the camera and the robot, ${}^{c}T_r$ and ${}^{r}T_c$, need to be calibrated for automated labeling and vision-based localization in robot control. Considering only the $x$ and $y$ dimensions, the transforms can be formulated as Eqns. (11) and (12).

$[x_i^c, y_i^c]^T = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} [x_i^r, y_i^r, 1]^T$   (11)
$[x_i^r, y_i^r]^T = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix} [x_i^c, y_i^c, 1]^T$   (12)

For the eye-in-hand camera, we focus on the mapping between the relative robot-task motion expressed in the robot coordinate system and in the pixel coordinate system, which enables semi-automatic annotation. Because the image Jacobian can be considered constant within a limited space, the transform $J$ can be formulated as Eqn. (13).

$\begin{bmatrix}{}^{l}\Delta x_i^{c}\\{}^{l}\Delta y_i^{c}\\{}^{r}\Delta x_i^{c}\\{}^{r}\Delta y_i^{c}\end{bmatrix}=\begin{bmatrix}c_{11}&c_{12}&c_{13}&c_{14}\\c_{21}&c_{22}&c_{23}&c_{24}\\c_{31}&c_{32}&c_{33}&c_{34}\\c_{41}&c_{42}&c_{43}&c_{44}\end{bmatrix}\begin{bmatrix}\Delta x_i^{r}\\\Delta y_i^{r}\\\Delta z_i^{r}\\1\end{bmatrix}$   (13)

Using a small number of pixel-frame coordinates $L_{eth}^{i}=[x_i^{c}, y_i^{c}]$ and $L_{eih}^{i}=[{}^{l}\Delta x_i^{c}, {}^{l}\Delta y_i^{c}, {}^{r}\Delta x_i^{c}, {}^{r}\Delta y_i^{c}]$ provided by manual annotation, together with the robot-frame coordinates recorded during sampling, the transformations between the robot base frame and the eye-to-hand pixel frame, ${}^{c}T_{r}$ and ${}^{r}T_{c}$, as well as the relative-motion mapping $J$ for the eye-in-hand camera, can be estimated. The calibrated transforms are then used to automatically label the remaining images. In addition, the transform ${}^{r}T_{c}$ from the pixel frame to the robot base frame serves as the hand-eye-task calibration, estimating the assembly pose of the robot's end-effector for localization by the eye-to-hand camera.
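As an illustration of how the calibration of Eqns. (11)-(13) can be fitted from a handful of manual labels, the sketch below solves for the affine matrices by linear least squares and then auto-labels the remaining samples. Variable names, shapes, and the synthetic example data are our own assumptions, not the authors' code.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares fit of dst ~ A @ [src, 1]^T, as in Eqns. (11)-(13).

    src: (N, d_in) coordinates (e.g. robot-frame positions)
    dst: (N, d_out) coordinates (e.g. pixel-frame positions)
    returns A with shape (d_out, d_in + 1)
    """
    src_h = np.hstack([src, np.ones((src.shape[0], 1))])   # homogeneous coordinates
    A, *_ = np.linalg.lstsq(src_h, dst, rcond=None)        # shape (d_in+1, d_out)
    return A.T

def auto_label(A, src):
    """Apply a fitted transform to label the remaining, un-annotated samples."""
    src_h = np.hstack([src, np.ones((src.shape[0], 1))])
    return src_h @ A.T

# Example: fit ^cT_r from 8 manually labeled eye-to-hand images (synthetic data here).
robot_xy = np.random.rand(8, 2)                                         # [x^r, y^r] from sampling
pixel_xy = robot_xy @ np.array([[500.0, 0.0], [0.0, -500.0]]) + 320.0   # manual pixel labels
cTr = fit_affine(robot_xy, pixel_xy)

labels = auto_label(cTr, np.random.rand(100, 2))                        # labels for remaining images
residual_std = np.std(auto_label(cTr, robot_xy) - pixel_xy)             # calibration quality index
```

The same `fit_affine` call can be reused for ${}^{r}T_{c}$ (swapping source and destination) and for the 4x3 image Jacobian $J$ of the eye-in-hand camera, with the residual standard deviation playing the role of the quantitative index discussed in Sec. V-B1.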

IV-B3 Fine-tuning from Pre-trained Model

This work adopts a one-stage real-time object detection approach, YOLO (You Only Look Once), to estimate the assembly goal pose and visual attention; YOLO is known for its robustness and fast inference. In addition, a model pre-trained on the ImageNet dataset provides initial parameters for training on the custom dataset, which requires far fewer samples. The images and labels are divided into training and test sets for model training and evaluation.
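As a hedged illustration only: assuming the detector is implemented with the Ultralytics YOLO package (the paper states only that YOLO is used, not the version or training interface), fine-tuning from a pre-trained checkpoint on the auto-labeled dataset could look like the following. The checkpoint name, dataset YAML, and hyperparameters are placeholders.

```python
from ultralytics import YOLO

# Start from a pre-trained checkpoint so that few custom samples are needed.
model = YOLO("yolov8n.pt")  # assumed checkpoint; the paper only says "YOLO"

# "gear_assembly.yaml" is a placeholder dataset description listing the
# train/test splits produced by the semi-automatic annotation step.
model.train(data="gear_assembly.yaml", epochs=100, imgsz=640)

metrics = model.val()   # precision, recall, mAP@.5, mAP@.5:.95 on the test split
print(metrics.box.map)  # mAP@.5:.95
```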

IV-C Classroom-to-Real Residual Reinforcement Learning

In this subsection, we present a practical residual reinforcement learning scheme for fine manipulation that addresses the challenges of exploration efficiency and safety in semi-structured environments. Classroom-to-real learning trains the residual policy within a simplified structured environment and subsequently transfers it to semi-structured environments. The visual representation and coarse operation provide a base policy and task-relevant features for context generalization, facilitating effective learning and seamless transfer.


IV-C1 Curriculum Residual Learning

In a structured environment, the key poses of the end-effector can be obtained through demonstration. The coarse operation is generated as a base policy, and local task detection is loaded as visual attention to initialize the cognitive manipulation. This work formulates the combination of the base and residual sub-policies on top of the task-space compliance controller as follows:

$\dot{q}=f_{cr}(\pi_{H},\pi_{\theta})$   (14)

To increase the robustness of the residual policy to the perceptual uncertainty of the pose estimator, a random error is injected into the trajectory. However, the error range affects learning efficiency, since it governs both exploration difficulty and sample diversity, as shown in Fig. 7. Therefore, this work automatically controls the task difficulty by adding or subtracting $\varepsilon$ from the guidance error range $E_{r0}$ so as to keep the success rate $s_r$ within the desired interval $[\alpha,\beta]$, as shown in Eqn. (15).

$E_r=E_{r0}+\varepsilon\cdot 1_{s_r>\beta}-\varepsilon\cdot 1_{s_r<\alpha}$   (15)
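A minimal sketch of the curriculum rule in Eqn. (15) is given below, using the 2 mm initial range, 0.5 mm step, and [0.5, 0.7] interval stated in Sec. V-A2; the clipping bounds are our assumption.

```python
def update_error_range(error_range, success_rate, step=0.0005,
                       alpha=0.5, beta=0.7, min_range=0.0, max_range=0.02):
    """Curriculum of Eqn. (15): widen the injected pose error when the task is
    too easy (s_r > beta), shrink it when it is too hard (s_r < alpha)."""
    if success_rate > beta:
        error_range += step
    elif success_rate < alpha:
        error_range -= step
    return min(max(error_range, min_range), max_range)

# Example: starting from 2 mm and adjusting by 0.5 mm per evaluation window.
er = 0.002
for sr in [0.8, 0.75, 0.4, 0.65]:
    er = update_error_range(er, sr)
    print(f"success rate {sr:.2f} -> error range {er * 1000:.1f} mm")
```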

IV-C2 Reward Shaping

This work normalizes the Euclidean distance between the end-effector's current pose $X$ and the target pose ${}^{B}X_{E}^{g}$ to create a guidance reward $R_{guid}$ that steers the end-effector toward the target pose, as in Eqn. (16). The force penalty $R_{forc}$ is defined according to the interaction force $F$ to encourage smooth operation, as in Eqn. (17). Additionally, if the transition condition $c_4$ is satisfied, a positive reward $R_{succ}$ is granted, as in Eqn. (18). The resulting multi-objective reward function is defined in Eqn. (19), where the weights $\lambda_1$, $\lambda_2$, and $\lambda_3$ balance the sub-objectives.

$R_{guid}=\left\|\mathrm{diag}(W)^{-1}\left(X-{}^{B}X_{E}^{g}\right)\right\|$   (16)
$R_{forc}=\left\|\mathrm{diag}(F^{\max})^{-1}F\right\|$   (17)
$R_{succ},\,d=\begin{cases}100,\,1 & \text{if } c_4\\ 0,\,0 & \text{otherwise}\end{cases}$   (18)
$r(s)=\lambda_{1}R_{guid}+\lambda_{2}R_{forc}+\lambda_{3}R_{succ}$   (19)
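The reward terms of Eqns. (16)-(19) can be computed as in the sketch below. The weight values follow Sec. V-A2, but the normalization vector $W$ and the sign convention (distance and force terms entering as penalties so that the reward grows near the goal) are our assumptions.

```python
import numpy as np

def reward(X, X_goal, F, W, F_max, c4, lambdas=(1.0, 0.8, 1.0)):
    """Multi-objective reward of Eqns. (16)-(19).

    X, X_goal: current and target end-effector poses (6-vectors, numpy arrays)
    F:         measured interaction wrench (6-vector)
    W, F_max:  normalization vectors for pose error and wrench
    c4:        boolean transition (success) condition
    """
    R_guid = np.linalg.norm((X - X_goal) / W)       # Eqn. (16), normalized pose error
    R_forc = np.linalg.norm(F / F_max)              # Eqn. (17), normalized contact wrench
    R_succ, done = (100.0, True) if c4 else (0.0, False)   # Eqn. (18)

    l1, l2, l3 = lambdas
    # Eqn. (19); guidance and force terms are treated here as penalties (negated),
    # which is our reading of "increases as it gets closer to the target pose".
    r = l1 * (-R_guid) + l2 * (-R_forc) + l3 * R_succ
    return r, done
```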

IV-C3 Soft Actor-Critic

A model-free DRL algorithm, soft actor-critic (SAC), is introduced to achieve a real-time optimal control strategy for intricate fine manipulation. Unlike pure RL, residual learning improves the performance of the entire policy by optimizing only its residual parameterized part. The state and action of the residual policy, along with the reward of the overall policy, are gathered over multiple recurring episodes and stored in a replay buffer for off-policy learning. Training is carried out in a structured environment, and the policy is then transferred to a semi-structured environment. The cognitive manipulation procedure is depicted in Algorithm 1, where Lines 1-2 acquire the coarse operations $\pi^{cf}_{H}$ and $\pi^{cr}_{H}$ from the estimated localization ${}^{B}X_{E}^{g}$, Lines 3-9 obtain the desired pose $X_d$, stiffness $K$, and desired force/torque $F_d$, Line 6 drives the robot to the assembly bottleneck pose, and Lines 11-18 perform fine manipulation for the precise assembly task.

Algorithm 1: Cognitive manipulation

Require: $\Delta P_g$, $\Delta P_p$, $X$, $F$, $I_{eth}$, $I_{eih}$
Ensure: $X_d$, $F_d$, $K$
1: Estimate localization $P$ with Eqn. (1-4)
2: Plan motion guidance $\pi^{cf}_{H}$ and $\pi^{cr}_{H}$ with Eqn. (6-9)
3: for each time step do
4:   $X_d, K \leftarrow \pi^{cf}_{H}$
5:   $F_d \leftarrow 0$
6:   Apply action $X_d$, $F_d$, and $K$ to the robot controller
7:   if $|X-P_p|\leq E_c$ then
8:     break
9:   end if
10: end for
11: for each time step do
12:   Estimate attention $I_{atten}$ with Eqn. (5)
13:   $X_d, K \leftarrow \pi^{cr}_{H}$
14:   $F_d \leftarrow \pi_{\theta}$
15:   Apply action $X_d$, $F_d$, and $K$ to the robot controller
16:   if $|X-P_g|\leq E_c$ then
17:     break
18:   end if
19: end for
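The sketch below mirrors the two phases of Algorithm 1 in Python. The interfaces `robot`, `cameras`, `base_policy_cf`, `base_policy_cr`, and `residual_policy` are hypothetical placeholders for the modules described above, not an API defined in the paper.

```python
import numpy as np

def cognitive_manipulation(robot, cameras, base_policy_cf, base_policy_cr,
                           residual_policy, P_p, P_g, E_c, max_steps=120):
    """Two-phase execution mirroring Algorithm 1: contact-free coarse motion
    to the bottleneck pose, then attention-guided fine manipulation."""
    # Phase 1 (Lines 3-10): contact-free guidance toward the bottleneck pose P_p.
    for _ in range(max_steps):
        X_d, K = base_policy_cf(robot.pose())        # desired pose and stiffness
        F_d = np.zeros(6)                            # no desired contact wrench yet
        robot.apply(X_d, F_d, K)
        if np.linalg.norm(robot.pose() - P_p) <= E_c:
            break

    # Phase 2 (Lines 11-19): the residual policy adds force commands around the base motion.
    for _ in range(max_steps):
        attention = cameras.eye_in_hand_attention()          # task-focused ROI, Eqn. (5)
        X_d, K = base_policy_cr(robot.pose(), attention)
        F_d = residual_policy(attention, robot.wrench())     # learned residual action
        robot.apply(X_d, F_d, K)
        if np.linalg.norm(robot.pose() - P_g) <= E_c:        # assembly goal reached
            break
```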

V Experiment

This section delineates the experimental validation of the proposed cognitive manipulation method, specifically designed for robotic assembly tasks within semi-structured environments. First, we introduce the robot hardware and software and establish several baselines to compare the proposed method with existing methodologies. Second, comparison and ablation experiments are performed in simulation to validate the hand-eye calibration and auto-annotation methods using small amounts of manually annotated data, the effect of embodied data acquisition on the object detection models, and the effect of object detection, through its location estimates and visual attention, on the training and performance of fine manipulation in semi-structured environments. Finally, a comprehensive evaluation on two real tasks underscores the practical applicability of our approach for robotic assembly within semi-structured environments.

V-A Experiment Setup

V-A1 Hardware and Software

The experiments are conducted on a computer equipped with an Nvidia GeForce RTX 2060 GPU and an Intel i7-9700 CPU. The Robot Operating System (ROS) is utilized as the middleware, facilitating seamless communication between the learning algorithms, control modules, and the robotic system.

V-A2 Partial Model, Data Collection and Reward Design

The geometric information for data collection in the semi-supervised object detection models and for classroom-to-real fine manipulation policy training is obtained by demonstration. The maximum contact force $F^{max}$ is set to 10 N in the x, y, and z directions and 0.1 N·m about Rx, Ry, and Rz. The hybrid policy updates the pose and force commands to the controller at 5 Hz, while the controller outputs target joint velocities for the robot at 120 Hz. Each experimental episode is capped at 120 steps, with the policy networks undergoing 200 gradient updates per episode. The reward weights $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 1, 0.8, and 1, respectively, as determined by preliminary experiments to balance operation speed and smoothness. The curriculum increases or decreases the error range, starting from 2 mm, in steps of 0.5 mm to keep the success rate within the desired interval [0.5, 0.7].
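For reference, the settings listed above can be collected into a single configuration. This is simply a restatement of the values in the text, not the authors' actual configuration file.

```python
# Training settings from Sec. V-A2 (values restated from the text).
TRAINING_CONFIG = {
    "max_contact_force_N": 10.0,          # x, y, z directions
    "max_contact_torque_Nm": 0.1,         # Rx, Ry, Rz directions
    "policy_rate_hz": 5,                  # pose/force commands to the controller
    "controller_rate_hz": 120,            # joint-velocity output
    "max_episode_steps": 120,
    "gradient_updates_per_episode": 200,
    "reward_weights": (1.0, 0.8, 1.0),    # lambda_1, lambda_2, lambda_3
    "curriculum_step_mm": 0.5,
    "initial_error_range_mm": 2.0,
    "success_rate_interval": (0.5, 0.7),
}
```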

V-A3 Baselines for Comparative Study

To underscore the advantages of our cognitive manipulation architecture in terms of learning efficiency and context generalization, we compare our method in assembly tasks against the following baselines:

  1. Baseline 1 [28]: This baseline directly predicts the desired final poses of the slave object for manipulation using only raw RGB-D images and implements a simple open-loop control scheme on a real robot. The training data for pose estimation is synthetically generated, facilitating easy manipulation of geometry and texture.

  2. Baseline 2 [49, 27]: This approach leverages direct teaching to specify global master object poses and plans robotic motion accordingly. It utilizes hand-mounted cameras and visual classifiers to predict and correct positioning errors, enabling precise attachment of master and slave objects without calibration by means of a search policy.

  3. Baseline 3 [32]: This baseline introduces a perception system with uncertainty estimates to delineate regions where the model-based policy is reliable from those where it may be flawed or undefined, blending the strengths of model-based and learning-based methods.

  4. Baseline 4 [8]: This work combines a vision-based fixed policy with a contact-based residual parametric policy, enhancing the robustness and efficiency of the RL algorithm.

  5. Baseline 5 [50]: This baseline employs similar residual learning techniques with Cartesian impedance control, utilizing visual inputs for larger error adjustments during contact-rich manipulation.

V-B Simulation Experiment for Comparative and Ablation Study

Our proposed approach seeks to enhance the sampling efficiency and reduce the engineering effort required for policy reconfiguration in contact-rich tasks within semi-structured environments. This is accomplished by integrating semi-supervised learning of object detection with classroom-to-real residual reinforcement learning of fine manipulation. To facilitate a comparative and ablation study, we have developed a simulation environment based on Gazebo, which allows for dynamic loading and deletion of objects at any position within the defined space, thereby constructing a robot assembly task in a semi-structured setting. The application of our proposed cognitive manipulation framework to a new assembly task is structured into four distinct stages: Embodied hand-eye-task calibration and semi-automatic annotation, supervised fine-tuning of the object detection models, residual reinforcement learning of fine manipulation, and application of the integrated strategy to a semi-structured environment. Each stage is progressively compared with established baselines to evaluate the advantages of the proposed methodologies.

V-B1 Embodied Hand-Eye-Task Calibration and Semi-Automatic Annotation

The dataset is curated based on prior knowledge of uncertainty to ensure sample diversity. The initial stage aims to minimize the costs associated with manual labeling and hand-eye calibration while generating high-quality labeling data and accurately estimating the target assembly pose of the end-effector. This stage encompasses four critical processes: data acquisition, manual labeling, hand-eye-task relationship fitting, and semi-automatic labeling. We explore the dependency on the proportion of manually annotated samples and its advantages over purely manual annotation. The standard deviation of the hand-eye relationship fitting serves as a quantitative index for evaluating the calibration and annotation.

Number of manual labels | ${}^{c}T_{r}$ STD (Manual) | ${}^{c}T_{r}$ STD (All) | $J$ STD (Manual) | $J$ STD (All) | ${}^{r}T_{c}$ STD
8  | 0.0039 | 0.0007 | 0.0062 | 0.0010 | 0.0020
19 | 0.0043 | 0.0010 | 0.0064 | 0.0014 | 0.0025
38 | 0.0041 | 0.0013 | 0.0068 | 0.0021 | 0.0023


The number of manually labeled samples and the corresponding accuracy of ${}^{c}T_{r}$, $J$, and ${}^{r}T_{c}$, represented by standard deviation, as well as the standard deviation of ${}^{c}T_{r}$ and $J$ over all data, are shown in Table III. The results demonstrate that a small number of manually labeled samples can effectively establish the hand-eye calibration relations and enable semi-automatic labeling of the remaining images, reducing the standard deviation of data annotation. Examples of both manually and automatically annotated data are illustrated in Fig. 12. The semi-automatic annotation reduces the cost of manual annotation by 97.4%.

V-B2 Supervised Fine-tuning for Object Detection

In this phase, we examine the impact of sample diversity on model performance by designing various sampling schemes and comparing models trained on datasets of different sizes. Two factors in the sampling process, pick-and-place position and robot posture, jointly determine sample size and diversity, yielding datasets of 10 (5x2), 30 (10x3), 60 (14x5), 120 (15x8), 180 (18x10), and 384 (24x16) samples. In addition, the 10-sample (5x2) set augmented to 300 images serves as a baseline for comparing embodied data acquisition against traditional augmentation based on image transformations. 20% of the 385 embodied images are held out as the validation set to represent the complex states encountered during assembly. Precision, recall, mAP@.5, and mAP@.5:.95 are employed to assess the influence of sample quantity and augmentation method on the performance of the object detection models.


Models | Datasets | Precision | Recall | mAP@.5 | mAP@.5:.95
Pose estimation | 10 | 0.233 | 0.448 | 0.156 | 0.109
 | 10-aug | 0.999 | 1.0 | 0.995 | 0.935
 | 30 | 0.750 | 0.741 | 0.774 | 0.374
 | 60 | 0.999 | 1.0 | 0.995 | 0.950
 | 120 | 0.999 | 1.0 | 0.995 | 0.994
 | 180 | 0.999 | 1.0 | 0.995 | 0.993
 | 384 | 0.999 | 1.0 | 0.995 | 0.990
Task attention | 10 | 0.022 | 0.051 | 0.012 | 0.003
 | 10-aug | 0.999 | 1.0 | 0.995 | 0.714
 | 30 | 0.686 | 0.980 | 0.815 | 0.539
 | 60 | 0.999 | 1.0 | 0.995 | 0.879
 | 120 | 0.999 | 1.0 | 0.995 | 0.936
 | 180 | 0.998 | 1.0 | 0.995 | 0.964
 | 384 | 1.0 | 1.0 | 0.995 | 0.989


The results indicate that data acquisition based on prior knowledge significantly enhances the environmental perception capabilities of the detection models. The mAP@.5:.95 values for the two object detection models during training are displayed in Fig. 12, highlighting the importance of sample diversity and the limitations of augmentation based solely on image transformations. Regarding the effect of sample quantity, too few samples (fewer than 60) cause training to fail to converge. As the number of samples increases, the trained models obtain higher precision, recall, mAP@.5, and mAP@.5:.95 values, owing to the improved sample diversity from greater background variation and more robot-task relative postures. In particular, the performance of the local task attention model is more sensitive to sample diversity because of unavoidable occlusion during contact-rich operations. Comparing sample-generation methods, although image-transformation-based augmentation increases the number of samples and mitigates overfitting, it lacks diversity: the position estimation model achieved acceptable results, while the task attention model performed poorly. Embodied data collection increases mAP@.5:.95 by 5.5% in global perception and 27.5% in local perception compared to the existing data augmentation method.

V-B3 Residual Reinforcement Learning of Fine Manipulation

This stage trains a residual policy, supported by a hand-designed base policy and a task-focused view, with the master object in a fixed setting. Physical contact states are identified using LSTM networks that encode time-series data from touch and proprioception sensors, combined with visual feedback processed through a CNN to handle large position errors. An MLP then integrates the low-dimensional latent features from the LSTM and CNN to generate residual actions. We contrast this procedure with baselines 3, 4, and 5 to highlight the advantages of knowledge-informed learning.
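A minimal PyTorch sketch of the multimodal encoder described above (CNN for the task-focused image, LSTM for the force/proprioception time series, MLP head producing the residual action) is given below. Layer sizes, input shapes, and the bounded-output choice are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ResidualPolicyNet(nn.Module):
    """CNN (attention image) + LSTM (wrench/proprioception sequence) -> MLP residual action."""

    def __init__(self, seq_dim=12, action_dim=6, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                       # encodes the ROI attention image
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden), nn.ReLU(),
        )
        self.lstm = nn.LSTM(seq_dim, hidden, batch_first=True)  # encodes touch/proprioception history
        self.head = nn.Sequential(                      # fuses latents into a residual action
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # bounded residual force/torque command
        )

    def forward(self, image, sequence):
        img_feat = self.cnn(image)                      # (B, hidden)
        _, (h, _) = self.lstm(sequence)                 # h: (num_layers, B, hidden)
        return self.head(torch.cat([img_feat, h[-1]], dim=-1))

# Example forward pass with assumed shapes: a 96x96 ROI image and 10 steps of 12-D sensor data.
policy = ResidualPolicyNet()
action = policy(torch.zeros(1, 3, 96, 96), torch.zeros(1, 10, 12))
```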


The results suggest that our approach facilitates more efficient and effective learning by concentrating on task-relevant details while addressing intricate contact dynamics and positional uncertainties. The success rate and error range throughout the training of the manipulation policy are presented in Fig. 11. In baseline 3, without the base policy, pure RL struggles in precise insertion tasks due to a local optimum created by penalizing contact forces. Combining the base policy with an RL-based residual policy in baseline 4 can succeed, but the curriculum only reaches an 8 mm error range in force-based residual policy training, because the limited observations of contact force and vision-based pose estimation create a partially observable Markov decision process (POMDP). In baseline 5, using raw visual information improves the observations, and the curriculum reaches around 20 mm. The ROI-based attention in our approach restricts the policy to a limited set of features and introduces perturbations due to finite detection accuracy, but this only marginally affects learning efficiency.

V-B4 Cognitive Manipulation in Semi-structured Environment

After the individual training phases, the integrated policy is evaluated in terms of success rate and completion steps for assembly tasks in a semi-structured environment. The manipulator grasps the slave object, a gear, while the master object, a task board, is randomly positioned within a confined workspace of 350 × 350 mm. We carry out 16 trials to compare our method with the other baselines, excluding the non-convergent baseline 3.


Methods | Success rate | Completion steps
Baseline 1 | 0.125 | 64.03 ± 18.45
Baseline 2 | 0.313 | 104.3 ± 23.63
Baseline 4 | 0.87 | 85.3 ± 19.11
Baseline 5 | 0.67 | 73.4 ± 32.86
Ours | 1.0 | 54.4 ± 17.17

The results indicate that our approach exhibits superior performance on this challenging assembly task, achieving a perfect success rate and significantly fewer completion steps, as outlined in Table III. Baseline 1 completes only 12.5% of trials with an average of 64 steps, while baseline 2 completes only 31.3% of trials with an average of 104 steps, close to the maximum step limit. Their object detection, single-camera setups, and simple model-based control fall short of the task's accuracy requirements and the environment's uncertainty. Although random residual actions can help compensate for perception errors, the semi-structured environment poses additional challenges due to movable objects: the contact force generated during a random search can displace the task board, leading to larger errors or even causing the gear to slip off the peg. Baseline 4 outperforms baseline 2 in both success rate and completion steps, because force-based agents can enhance the search policy by regulating the contact force and position reference based on the contact state estimated from interaction forces. Although baseline 5 is more robust than baseline 4 during training, it performs only 13.95% better in completion steps and even worse in success rate; raw visual information lets the residual policy compensate for larger errors in training, but its performance degrades at different locations with different backgrounds. In comparison, the proposed method uses visual attention to help the agent focus on the task, resulting in a success rate of 1.0 with an average of 54.4 completion steps. In conclusion, the success rate is increased by 13% and the number of steps is reduced by 15.4% compared to the competing methods.

V-C Comprehensive Evaluation on Real Tasks

The primary objective of this research is to develop and validate a cognitive manipulation framework suitable for robot learning in real-world robotic applications. To assess the effectiveness of our proposed architecture, we conducted experiments using a UR5 robot on two precision assembly tasks: peg-in-hole and gear-insertion. These tasks, depicted in Fig. 13(a), are designed to test the robot's ability to handle complex manipulations in real settings. The robot was programmed to perform tasks based on geometric information derived from a teaching phase, which was used to construct a skill graph that encapsulates common assembly knowledge. Critical points, including the grasp and bottleneck poses, were identified in semi-structured environments to facilitate hand-eye-task calibration and semi-supervised fine-tuning of object detection, as shown in Fig. 13(b) and (c). In structured environments, critical points including the grasp and assembly goal poses guided the learning of contact-rich fine manipulation, as shown in Fig. 13(d). An Object-Embodiment-Centric (OEC) task representation, incorporating home, grasp, bottleneck, and assembly goal points, was employed to reconstruct the basic operational strategy. This strategy was integrated with the visual and fine manipulation models to accomplish assembly tasks within a confined area of 500 × 500 mm, as shown in Fig. 13(e).


The training process was optimized based on insights from the simulation experiments, focusing on minimizing training cost while maximizing operational efficiency. For object detection, 5 points were sampled from the workspace to improve environmental robustness. We captured five images from a global perspective at each point for pose estimation and an additional 18 images per point from a local view to enhance task-specific attention. This embodied data collection strategy ensured diversity, and data augmentation was further applied to enhance robustness against robot pose variations during manipulation. The fine manipulation training extended over 150 episodes, which proved adequate for achieving resilience against uncertainties in the base policy arising from pose estimation errors and unknown contact dynamics. We evaluated the success rate and completion time of the assembly tasks, using these metrics to benchmark our proposed architecture against two baselines.

Task | Methods | Success rate | Completion time
Peg-in-hole | Baseline 1 | 0.18 | 8.13 ± 4.96
 | Baseline 2 | 0.437 | 18.06 ± 6.65
 | Ours | 0.937 | 6.11 ± 1.32
Gear-insertion | Baseline 1 | 0.06 | 9.2 ± 3.65
 | Baseline 2 | 0.313 | 19.10 ± 5.75
 | Ours | 0.875 | 7.04 ± 1.42

The results, detailed in Table IV, indicate that our approach significantly outperformed the baselines on both tasks. Baselines 1 and 2 struggled with the tasks, primarily due to inaccuracies in pose estimation and control and their inability to handle the semi-structured environment. In addition, traditional search-based methods proved ineffective because of the mobility of the master object, often causing the robot to get stuck on the object's surface. Their performance also relies heavily on expert experience and parameter tuning: it took approximately 8 hours and multiple attempts to gather samples and fine-tune policy and controller parameters for a new task, whereas our method required only 2.58 hours and minimal human intervention. Both the global localization and local attention models can be trained within 0.58 hours, including 20 minutes for sampling and 15 minutes for training the two models. Learning a residual policy robust to a 15 mm error took 150 episodes, consuming 2 hours. These experimental results underscore the practical applicability of our cognitive manipulation framework in complex real-world environments, improving learning efficiency, reducing engineering effort, and demonstrating precise and efficient manipulation for robotic assembly.

VI Discussion

The architecture presented in this study leverages a skill graph that merges the generalization capabilities of a pre-trained object detection model with the optimization ability of reinforcement learning, facilitating efficient learning with minimal reliance on extensive human knowledge and interaction data. Separate training of the different components has proven more practical for real robots in precise assembly tasks than end-to-end training methods [29]. This method mirrors human learning, in which theoretical knowledge is acquired first and practiced in controlled settings before tackling complex real-world tasks. The skill graph not only enhances the efficiency of object detection learning by providing explicit prior knowledge for sample collection and annotation, but also enables the system to estimate the assembly target pose, similar to [28]. Motion planning guided by the skill graph enables diverse data collection from various perspectives and locations, reducing the need for manual labeling through low-cost calibration and automated labeling. The large pre-trained model benefits from this setup, generalizing effectively after fine-tuning with implicit prior knowledge. Furthermore, the learned visual model excels in providing interpretable spatial location and task correlation information, surpassing structured visual representations [13] and uncertainty-aware pose estimation [32]. This information is essential for guiding and constraining the exploration process in reinforcement learning, enabling the system to efficiently learn about contact dynamics and pose uncertainty. The residual policy, guided by the base policy and focused multimodal observation, is optimized through a multi-objective reward, enhancing the capability to tackle complex tasks and generalize across contexts without the need for fixtures. Therefore, separating the learning tasks and guiding them with prior knowledge significantly enhances learning efficiency in controlled environments.

Compared to existing combinations of model-based and learning-based approaches [32, 8, 9], our cognitive manipulation method excels in semi-structured environments. It mimics the human approach of transitioning from global to local perception and from coarse to fine manipulation. In contact-free regions, object detection resolves the positional uncertainty of the master object caused by the absence of fixtures. The skill graph enables global perception beyond the workspace to avoid occlusion and directs coarse operations with rich geometric information, facilitating flexible and safe robot movement. In contact-rich regions, object detection provides visual attention on the task and suppresses the variable background interference caused by other dynamic objects. The residual policy integrates task-focused visual and tactile information to handle pose estimation error and complex contact dynamics.

The partial models within our method are utilized to address diverse configurations in semi-structured environments. While this study primarily focuses on knowledge-driven robot learning and experiments validate the impact of such learning on efficiency and strategy robustness, this approach can be adapted to different environments by acquiring geometric information through teaching and adjusting temporal logic and transition conditions within the partial model.

VII Conclusion

This study introduces a novel cognitive manipulation framework for robotic assembly tasks in semi-structured environments. The framework employs a skill graph that integrates object detection, coarse operation planning, and fine operation execution. The training process, guided by the skill graph and coarse-operation planning in controlled environments, involves semi-supervised learning for object detection and residual reinforcement learning of a multimodal fine-operation policy. The cognitive manipulation models are subsequently transferred to a semi-structured environment, where object detection and coarse operation, enhanced by the skill graph, handle the uncertainty of the environment and guide the residual policy in addressing pose estimation and contact dynamics uncertainty. Simulation results demonstrate that our cognitive manipulation reduces manual annotation cost by 97.4% and enables learning an assembly task involving a 20 mm error and a 0.1 mm clearance within 300 episodes, showing significant progress in semi-structured environments, where existing methods struggle, with a 13% increase in success rate and a 15.4% reduction in completion steps. The practicality of the method was further confirmed in real-world experiments.

While learning efficiency and generalization in semi-structured environments have been substantially enhanced, challenges persist. Our method effectively utilizes prior knowledge to streamline the learning of contact-rich manipulation, in particular simplifying the reinforcement learning challenges associated with uncertainties in pose estimation and contact dynamics, yet there is room to improve learning efficiency further. Future work will focus on advancing learning efficiency through offline enhancement methods, including sim-to-real transfer and meta-learning, to streamline the residual reinforcement learning process. Efficient learning from prior knowledge also opens the possibility of contact-rich manipulation multitasking or meta-learning on real robots. In addition, our approach, which utilizes object detection and skill graphs, aims to mitigate uncertainties in semi-structured environments, but further generalization to diverse environments and tasks remains a goal. Future research could explore more sophisticated 3D or 6D pose estimation techniques, develop more precise quality estimation and monitoring methods based on visuo-tactile fusion, and incorporate large language models (LLMs) for common-sense reasoning. Considering complex state and exception handling, such as addressing failure and success scenarios, could reduce assumptions about semi-structured environments and enhance the quality and reliability of robot operations.

References

  • [1] A. Perzylo, M. Rickert, B. Kahl, N. Somani, C. Lehmann, A. Kuss, S. Profanter, A. B. Beck, M. Haage, M. R. Hansen et al., “Smerobotics: Smart robots for flexible manufacturing,” IEEE Robotics & Automation Magazine, vol. 26, no. 1, pp. 78–90, 2019.
  • [2]J.Hughes, K.Gilday, L.Scimeca, S.Garg, and F.Iida, “Flexible, adaptive industrial assembly: driving innovation through competition: Flexible manufacturing,” Intelligent Service Robotics, vol.13, pp. 169–178, 2020.
  • [3]K.Dharmara, R.P. Monfared, P.S. Ogun, and M.R. Jackson, “Robotic assembly of threaded fasteners in a non-structured environment,” The International Journal of Advanced Manufacturing Technology, vol.98, pp. 2093–2107, 2018.
  • [4]K.Nottensteiner, A.Sachtler, and A.Albu-Schäffer, “Towards autonomous robotic assembly: Using combined visual and tactile sensing for adaptive task execution,” Journal of Intelligent & Robotic Systems, vol. 101, no.3, p.49, 2021.
  • [5]F.Suárez-Ruiz, X.Zhou, and Q.-C. Pham, “Can robots assemble an ikea chair?” Science Robotics, vol.3, no.17, p. eaat6385, 2018.
  • [6]H.Chen, G.Zhang, H.Zhang, and T.A. Fuhlbrigge, “Integrated robotic system for high precision assembly in a semi-structured environment,” Assembly Automation, vol.27, no.3, pp. 247–252, 2007.
  • [7]J.Luo, O.Sushkov, R.Pevceviciute, W.Lian, C.Su, M.Vecerik, N.Ye, S.Schaal, and J.Scholz, “Robust multi-modal policies for industrial assembly via reinforcement learning and demonstrations: A large-scale study,” arXiv preprint arXiv:2103.11512, 2021.
  • [8]Y.Shi, Z.Chen, H.Liu, S.Riedel, C.Gao, Q.Feng, J.Deng, and J.Zhang, “Proactive action visual residual reinforcement learning for contact-rich tasks using a torque-controlled robot,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).IEEE, 2021, pp. 765–771.
  • [9]J.Zhao, Z.Wang, L.Zhao, and H.Liu, “A learning-based two-stage method for submillimeter insertion tasks with only visual inputs,” IEEE Transactions on Industrial Electronics, 2023.
  • [10]S.Stevsic, S.Christen, and O.Hilliges, “Learning to assemble: Estimating 6d poses for robotic object-object manipulation,” IEEE Robotics and Automation Letters, p. 1159–1166, Apr 2020.
  • [11]Q.Yu, C.Hao, J.Wang, W.Liu, L.Liu, Y.Mu, Y.You, H.Yan, and C.Lu, “Manipose: A comprehensive benchmark for pose-aware object manipulation in robotics,” 2024.
  • [12]M.Köhler, M.Eisenbach, and H.-M. Gross, “Few-shot object detection: A comprehensive survey,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21, 2023.
  • [13]F.Zhang, Y.Chen, H.Qiao, and Z.Liu, “Surrl: Structural unsupervised representations for robot learning,” IEEE Transactions on Cognitive and Developmental Systems, vol.15, no.2, p. 819–831, Jun 2023. [Online]. Available: http://dx.doi.org/10.1109/tcds.2022.3187186
  • [14] S. Demura, K. Sano, W. Nakajima, K. Nagahama, K. Takeshita, and K. Yamazaki, “Picking up one of the folded and stacked towels by a single arm robot,” in 2018 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2018, pp. 1551–1556.
  • [15]L.Yen-Chen, A.Zeng, S.Song, P.Isola, and T.-Y. Lin, “Learning to see before learning to act: Visual pre-training for manipulation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).IEEE, 2020, pp. 7286–7293.
  • [16]Y.-Z. Hsieh, F.-X. Xu, and S.-S. Lin, “Deep convolutional generative adversarial network for inverse kinematics of self-assembly robotic arm based on the depth sensor,” IEEE Sensors Journal, vol.23, no.1, pp. 758–765, 2022.
  • [17]L.Johannsmeier, M.Gerchow, and S.Haddadin, “A framework for robot manipulation: Skill formalism, meta learning and adaptive control,” 2019 International Conference on Robotics and Automation (ICRA), pp. 5844–5850, 2018.
  • [18]X.Li, Z.T. Serlin, G.Yang, and C.A. Belta, “A formal methods approach to interpretable reinforcement learning for robotic planning,” Science Robotics, vol.4, 2019.
  • [19]X.Liu, G.Wang, Z.Liu, Y.T. Liu, Z.Liu, and P.Huang, “Hierarchical reinforcement learning integrating with human knowledge for practical robot skill learning in complex multi-stage manipulation,” IEEE Transactions on Automation Science and Engineering, 2023.
  • [20]R.Sun, “Dual-process theories, cognitive architectures, and hybrid neural-symbolic models,” Neurosymbolic Artificial Intelligence, pp. 1–9, 03 2024.
  • [21]C.Yang, C.Chen, W.He, R.Cui, and Z.Li, “Robot learning system based on adaptive neural control and dynamic movement primitives,” IEEE Transactions on Neural Networks and Learning Systems, vol.30, pp. 777–787, 2019.
  • [22]R.Rayyes, H.Donat, J.J. Steil, and M.Spranger, “Interest-driven exploration with observational learning for developmental robots,” IEEE Transactions on Cognitive and Developmental Systems, vol.15, pp. 373–384, 2023.
  • [23]H.-C. Song, Y.-L. Kim, and J.-B. Song, “Automated guidance of peg-in-hole assembly tasks for complex-shaped parts,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems.IEEE, 2014, pp. 4517–4522.
  • [24]M.G. Krishnan, A.T. Vijayan, and A.Sankar, “Performance enhancement of two-camera robotic system using adaptive gain approach,” Industrial Robot: The International Journal of Robotics Research and Application, vol.47, no.1, pp. 45–56, 2020.
  • [25]Y.-C. Peng, D.Jivani, R.J. Radke, and J.Wen, “Comparing position-and image-based visual servoing for robotic assembly of large structures,” in 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE).IEEE, 2020, pp. 1608–1613.
  • [26]R.Haugaard, J.Langaa, C.Sloth, and A.Buch, “Fast robust peg-in-hole insertion with continuous visual servoing,” in Conference on Robot Learning.PMLR, 2021, pp. 1696–1705.
  • [27]F.Mou, H.Ren, B.Wang, and D.Wu, “Pose estimation and robotic insertion tasks based on yolo and layout features,” Engineering Applications of Artificial Intelligence, vol. 114, p. 105164, 2022.
  • [28] S. Stevsic, S. Christen, and O. Hilliges, “Learning to assemble: Estimating 6d poses for robotic object-object manipulation,” IEEE Robotics and Automation Letters, vol. 5, pp. 1159–1166, 2020.
  • [29]A.Y. Yasutomi, H.Ichiwara, H.Ito, H.Mori, and T.Ogata, “Visual spatial attention and proprioceptive data-driven reinforcement learning for robust peg-in-hole task under variable conditions,” IEEE Robotics and Automation Letters, vol.8, pp. 1834–1841, 2023.
  • [30]G.Schoettler, A.Nair, J.Luo, S.Bahl, J.A. Ojea, E.Solowjow, and S.Levine, “Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards,” 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5548–5555, 2019.
  • [31]Y.Wang, L.Zhao, Q.Zhang, R.Zhou, L.Wu, J.Ma, B.Zhang, and Y.Zhang, “Alignment method of combined perception for peg-in-hole assembly with deep reinforcement learning,” J. Sensors, vol. 2021, pp. 5 073 689:1–5 073 689:12, 2021.
  • [32]M.A. Lee, C.Florensa, J.Tremblay, N.Ratliff, A.Garg, F.Ramos, and D.Fox, “Guided uncertainty-aware policy optimization: Combining learning and model-based strategies for sample-efficient policy learning,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).IEEE, 2020, pp. 7505–7512.
  • [33]K.Ahn, M.-W. Na, and J.-B. Song, “Robotic assembly strategy via reinforcement learning based on force and visual information,” Robotics Auton. Syst., vol. 164, p. 104399, 2023.
  • [34]Z.Zhang, Y.Wang, Z.Zhang, L.Wang, H.Huang, and Q.Cao, “A residual reinforcement learning method for robotic assembly using visual and force information,” Journal of Manufacturing Systems, 2024.
  • [35]P.Chen and W.Lu, “Deep reinforcement learning based moving object grasping,” Information Sciences, vol. 565, pp. 62–76, 2021.
  • [36]P.Jin, Y.Lin, Y.Song, T.Li, and W.Yang, “Vision-force-fused curriculum learning for robotic contact-rich assembly tasks,” Frontiers in Neurorobotics, vol.17, 2023.
  • [37] H. Chen, W. Wan, M. Matsushita, T. Kotaka, and K. Harada, “Automatically prepare training data for yolo using robotic in-hand observation and synthesis,” IEEE Transactions on Automation Science and Engineering, 2023.
  • [38]J.Borja-Diaz, O.Mees, G.Kalweit, L.Hermann, J.Boedecker, and W.Burgard, “Affordance learning from play for sample-efficient policy learning,” in 2022 International Conference on Robotics and Automation (ICRA).IEEE, 2022, pp. 6372–6378.
  • [39]C.Xiong, N.Shukla, W.Xiong, and S.-C. Zhu, “Robot learning with a spatial, temporal, and causal and-or graph,” 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2144–2151, 2016.
  • [40]X.Yu, B.Li, W.He, Y.Feng, L.Cheng, and C.Silvestre, “Adaptive-constrained impedance control for human–robot co-transportation,” IEEE Transactions on Cybernetics, vol.52, no.12, pp. 13 237–13 249, 2022.
  • [41]M.Tavassoli, S.Katyara, M.Pozzi, N.Deshpande, D.G. Caldwell, and D.Prattichizzo, “Learning skills from demonstrations: A trend from motion primitives to experience abstraction,” IEEE Transactions on Cognitive and Developmental Systems, vol.16, pp. 57–74, 2022.
  • [42] J. Li, Z. Li, F. Chen, A. Bicchi, Y. Sun, and T. Fukuda, “Combined sensing, cognition, learning, and control for developing future neuro-robotics systems: A survey,” IEEE Transactions on Cognitive and Developmental Systems, vol. 11, no. 2, pp. 148–161, 2019.
  • [43]Z.Xie and Y.Jin, “An extended reinforcement learning framework to model cognitive development with enactive pattern representation,” IEEE Transactions on Cognitive and Developmental Systems, vol.10, no.3, p. 738–750, Sep 2018. [Online]. Available: http://dx.doi.org/10.1109/tcds.2018.2796940
  • [44] I. Lopez-Juarez, J. Corona-Castuera, M. Peña-Cabrera, and K. Ordaz-Hernandez, “On the design of intelligent robotic agents for assembly,” Information Sciences, pp. 377–402, May 2005. [Online]. Available: http://dx.doi.org/10.1016/j.ins.2004.09.011
  • [45]H.Chen, G.Zhang, H.B. Zhang, and T.A. Fuhlbrigge, “Integrated robotic system for high precision assembly in a semi‐structured environment,” Assembly Automation, vol.27, pp. 247–252, 2007.
  • [46]F.von Drigalski, K.Kasaura, C.C. Beltran-Hernandez, M.Hamaya, K.Tanaka, and T.Matsubara, “Uncertainty-aware manipulation planning using gravity and environment geometry,” IEEE Robotics and Automation Letters, vol.7, no.4, pp. 11 942–11 949, 2022.
  • [47]J.Eßer, N.Bach, C.Jestel, O.Urbann, and S.Kerner, “Guided reinforcement learning: A review and evaluation for efficient and effective real-world robotics [survey],” IEEE Robotics & Automation Magazine, vol.30, pp. 67–85, 2023.
  • [48]C.Wang, C.Su, B.Sun, G.Chen, and L.Xie, “Extended residual learning with one-shot imitation learning for robotic assembly in semi-structured environment,” Frontiers in Neurorobotics, vol.18, 2024.
  • [49]J.Zhang, W.Wan, N.Tanaka, M.Fujita, K.Takahashi, and K.Harada, “Integrating a pipette into a robot manipulator with uncalibrated vision and tcp for liquid handling,” IEEE Transactions on Automation Science and Engineering, 2023.
  • [50]A.Ranjbar, N.A. Vien, H.Ziesche, J.Boedecker, and G.Neumann, “Residual feedback learning for contact-rich manipulation tasks with uncertainty,” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2383–2390, 2021.
Chuang Wang received the B.S. degree from the School of Mechanical and Power Engineering, Zhengzhou University, China, in 2017. He is currently pursuing the Ph.D. degree with the Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, China. His research interests include robotic manipulation, compliance control, deep reinforcement learning, and assembly robots.
Lie Yang received the Ph.D. degree from the Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, China, in 2021. He was a lecturer in the School of Computer Science and Technology, Hainan University, in 2022. He is currently a Research Fellow with the Department of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore. His research interests mainly focus on deep learning, computer vision, pattern recognition, driver state monitoring, and brain-computer interfaces.
Ze Lin is currently pursuing the B.S. degree in Intelligent Manufacturing Engineering with the Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, China. His current research interests include robotic manipulation, machine vision, and assembly robots.
Yizhi Liao received an M.S. degree in Advanced Computer Science from the University of Sheffield, United Kingdom. He is currently pursuing a second M.S. degree in Information Technology at the University of Melbourne, Australia. His current research interests include robotic manipulation and computer vision.
Gang Chen received the bachelor’s and master’s degrees in mechanical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2012 and 2015, respectively, and the Ph.D. degree in mechanical and aerospace engineering from the University of California, Davis, Davis, CA, in 2020. He was a research fellow with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, from 2020 to 2021. He is currently an associate professor at the Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, China. His research interests include machine learning, formal methods, control, signal processing, and fault diagnosis.
Longhan Xie received the B.S. and M.S. degrees in mechanical engineering in 2002 and 2005, respectively, from Zhejiang University, and the Ph.D. degree in mechanical and automation engineering in 2010 from the Chinese University of Hong Kong. From 2010 to 2016, he was an Assistant Professor and then an Associate Professor in the School of Mechanical and Automotive Engineering at the South China University of Technology. Since 2017, he has been a Professor in the Shien-Ming Wu School of Intelligent Engineering at the same university. His research interests include biomedical engineering and robotics. He is a member of ASME and IEEE.