Cognitive Manipulation: Semi-supervised Visual Representation and Classroom-to-real Reinforcement Learning for Assembly in Semi-structured Environments (2024)

Chuang Wang, Lie Yang, Ze Lin, Yizhi Liao, Gang Chen, and Longhan Xie. This work was supported by the National Key Research and Development Program of China (Grant No. 2021YFB3301400) and by a research project funded by the National Natural Science Foundation of China (Grant No. 52075177). (Corresponding authors: Gang Chen and Longhan Xie.) Chuang Wang is with the South China University of Technology, Guangzhou, China (e-mail: wichuangwang@mail.scut.edu.cn). Lie Yang is with the School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore (e-mail: lie.yang@ntu.edu.sg). Gang Chen is with the South China University of Technology, Guangzhou, China (e-mail: gangchen@scut.edu.cn). Longhan Xie is with the South China University of Technology, Guangzhou, China (e-mail: melhxie@scut.edu.cn).

Abstract

Assembling a slave object into a fixture-free master object represents a critical challenge in flexible manufacturing. Existing deep reinforcement learning-based methods, while benefiting from visual or operational priors, often struggle with small-batch precise assembly tasks due to their reliance on insufficient priors and costly model development. To address these limitations, this paper introduces a cognitive manipulation and learning approach that utilizes skill graphs to integrate learning-based object detection and fine manipulation models into a cohesive modular policy. This approach enables detection of the master object from both global and local perspectives, to accommodate positional uncertainties and variable backgrounds, and a parametric residual policy, to handle pose errors and intricate contact dynamics effectively. Leveraging the skill graph, our method supports knowledge-informed learning: semi-supervised learning for object detection and classroom-to-real reinforcement learning for fine manipulation. Simulation experiments on a gear-assembly task demonstrate that the skill-graph-enabled coarse-operation planning and visual attention are essential for efficient learning and robust manipulation, showing improvements of 13% in success rate and 15.4% in the number of completion steps over competing methods. Real-world experiments further validate that our system is highly effective for robotic assembly in semi-structured environments.

Index Terms:

Robotic assembly, Semi-structured environment, Object detection, Semi-supervised learning, Residual reinforcement learning.

I Introduction

Flexible manufacturing systems aim to swiftly adapt to market demands and individual customer requirements, facilitating a quick and cost-effective response to new tasks [1, 2]. In contemporary industrial robotics, flexibility is primarily achieved through automated end-effector changes, efficient robot programming, and the utilization of component-specific fixtures [3]. In low-volume batch production, robotic assembly systems tailored for flexible manufacturing must handle objects that are randomly positioned and unsecured by fixtures, thereby enhancing flexibility and adaptability across various product types at the hardware level [4, 5]. While this less structured approach reduces the need for developing specific fixtures, saving both time and cost, it introduces significant software challenges, particularly the need to precisely identify randomly located objects within a defined workspace and to accurately control force during precision assembly tasks.

[Figure 1]

Human operators, leveraging task-specific knowledge and prior experience, intuitively manage these complexities to execute high-precision assembly tasks in such environments. Conversely, designing or learning an efficient and robust policy that enables robots to perform similarly in semi-structured environments poses a considerable challenge. Early solutions involved multi-stage methods that utilized vision systems to estimate pose errors and guide robots via visual servoing, complemented by force/torque-based algorithms to correct both visual and positional inaccuracies during the insertion process [6]. More recently, deep reinforcement learning (DRL) has emerged as a promising alternative, formulating effective policies through trial-and-error learning without relying on precise modeling and sensors [7]. However, these methods typically require extensive empirical knowledge or substantial training data, which can be time-consuming and labor-intensive, involving tasks such as image labeling, parameter tuning, or costly interactions.

To address this issue, recent advancements in robotic manipulation have focused on integrating multi-stage methods and deep reinforcement learning to obtain robust and efficient policies [8, 9]. Such neural-symbolic frameworks combine the convenience of traditional methods with the flexibility of RL, compensating for the inaccuracy of traditional positioning methods and the low sample efficiency of RL. Despite these advancements, significant challenges remain in exploiting the complementary advantages of the various modules for fixture-less precise assembly tasks. The primary issue is that a single visual or operational representation fails to provide the comprehensive understanding of robots, tasks, and environments that robot learning needs to handle large position uncertainty and complex contact dynamics [10, 11]. Additionally, both hand-designed and learning-based prior visual and operational representations often require substantial engineering effort or extensive training data, complicating their application in real-world scenarios [12].

A key differentiator between human and non-human cognition is the capacity for structured knowledge representation, which has proven essential in addressing these challenges [13]. Such representations, linked with visual [14, 15, 16] or operational priors [17, 18, 19] and with knowledge-guided learning processes [20], have successfully captured a wide range of human cognitive processes, enabling efficient autonomous learning with minimal supervisory intervention [21, 22]. Drawing inspiration from the human approach to learning and manipulation, this paper introduces the Cognitive Manipulation Method for Robotic Assembly in Semi-Structured Environments (CM4RASSE). This method utilizes graph-based structural prior knowledge to integrate learning-based object detection and fine manipulation models into a cohesive modular policy, promoting self-directed learning with minimal human oversight.

The proposed approach begins by establishing a skill graph that consolidates spatial, temporal, and causal information, creating a structured yet flexible cognitive manipulation architecture that integrates multiple modules. This framework complements the general object detection model as a rich visual representation for fine manipulation by providing operationally relevant positional and visual attention information. Additionally, a rich operational representation based on the skill graph and planning enables the transition from global observation and coarse operation to local observation and fine operation, addressing the challenges of assembly tasks in semi-structured environments. The proposed knowledge-informed developmental training method mirrors the human cognitive process of observing before acting and of mastering skills in controlled settings before tackling real-world scenarios. Initially, we employ the skill graph to collect diverse samples from pick-and-place interactions and use minimal manual labeling for calibration and automated labeling, facilitating semi-supervised object detection learning and cost-effective hand-eye-task calibration. This phase yields a rich vision representation that is robust to position uncertainty and variable backgrounds. Subsequently, the skill graph integrates visual perception with coarse operation planning, providing structured data that includes spatial locations and task-focused perspectives and enabling the agent to learn a residual manipulation policy instead of learning from scratch. Moreover, this structured approach facilitates the transfer of learned skills to semi-structured environments, where variability and unpredictability are more pronounced. The efficacy of CM4RASSE is first studied in simulation and subsequently evaluated on a real robot in a high-precision, jigless peg-in-hole and gear assembly task. The results show that the vision and manipulation models can be learned efficiently for new tasks under prior-knowledge guidance, and that the integrated policy performs effectively in the semi-structured environment.

This work presents a cognitive manipulation method, leveraging a skill graph to integrate learning-based visual perception and fine manipulation to facilitate efficient and effective learning of new tasks. Our primary contributions are as follows:

  1. A novel neural-symbolic framework: We develop a skill graph that serves as a common-sense structure to integrate the various modules, thereby handling large position uncertainty and complex contact dynamics.

  2. Semi-supervised visual representation learning: We leverage the skill graph to collect a diverse array of samples with minimal manual labeling for cost-effective hand-eye-task calibration and automated labeling, enabling semi-supervised learning of object detection.

  3. Classroom-to-real residual reinforcement learning: Our skill graph, combining visual perception with trajectory planning, supports the learning of fine manipulation policies within a structured environment and enables a seamless transition of the acquired skills to a semi-structured environment, effectively navigating the inherent challenges of precise assembly tasks.

  4. Comparative and comprehensive studies: We conduct comparative and comprehensive studies to assess the effectiveness of each component and of the integrated policy in terms of learning efficiency and assembly performance.

II Related work

Robotic systems require precise state feedback and sophisticated control policies to effectively address specific tasks in semi-structured environments. This section reviews cutting-edge methodologies and significant advancements in robotic assembly within such environments, highlighting visual and operational representations for robot learning as well as concepts derived from human cognitive systems.

II-A Robotic Assembly in Semi-structured Environments

Multi-stage methods have been pivotal in precision assembly tasks, employing an integrated robotic system across three distinct phases [2]: 1) Initial Approach: Utilizing an eye-to-hand camera, the system employs position-based visual servoing (PBVS) to navigate towards the master object. 2) Alignment: A force/torque-based local search method corrects alignment errors to ensure precise fitting. 3) Insertion Execution: Discrepancies in position and orientation are rectified using a force/torque control algorithm, ensuring successful component insertion. To enhance the efficiency of local search for assembling components with complex geometries, the part’s geometry itself guides the alignment process through image-based visual servoing (IBVS) with an eye-in-hand camera [23]. Hybrid strategies that combine both eye-to-hand and eye-in-hand cameras merge the benefits of different visual servoing techniques, providing comprehensive visual cues that maintain target visibility throughout the operation [24, 25]. However, the success of these systems heavily depends on the precise selection of features and the strategic design of control methodologies, necessitating meticulous calibration and parameter adjustments to minimize errors and ensure operational stability.

Innovative approaches have been introduced to bolster robustness against variations in surface geometry and lighting conditions and to simplify vision, calibration, and control processes. For instance, Haugaard et al. [26] introduced a deep learning approach for pin and hole point estimation in multi-camera setups, facilitating visual servoing for initial alignment. Mou et al. [27] devised a technique for more precise and efficient position estimation of manipulated connectors, leveraging YOLO-based relevant region detection. The authors of [28] propose 6D pose estimation of template geometries to which manipulated objects should be connected. To reduce design complexity and improve policy quality, reinforcement learning (RL) [7] presents an alternative that favors trial-and-error learning over precise modeling. To improve robustness against environmental variations, spatial attention point network models [29] have been introduced, employing visual attention to extract pertinent image features for motion controllers and utilizing offline training to enhance sample efficiency. However, it is expensive to collect sufficient experience data from real-world scenarios. Furthermore, the low clearance and contact dynamics in precision assembly tasks complicate demonstrations [30], simulation-to-real transfer [31], and offline training.

Several studies have also leveraged model-based and learning-based modules to construct robust and efficient policies. Building on multi-stage learning, Lee et al. [32] employed vision-based uncertainty estimation to differentiate between free-space and contact-rich regions, applying model-based methods in free space for minimal environmental interaction and RL techniques to navigate inaccuracies in the perception/action pipeline. Zhao et al. [9] proposed a fine positioning policy learned by DRL under an eye-in-hand camera view, combined with a traditional coarse positioning method and impedance control. Based on residual learning, Shi et al. [8] combined an eye-in-hand vision-based fixed policy with a contact force-based parametric policy to enhance the robustness and efficiency of the RL algorithm. Besides force-based trajectory generators, [33] introduced an image-based trajectory generator trained by DRL to enable a robot to adapt to assembly parts with different shapes. Similarly, [34] proposed a residual high-level visual policy that determines the robot pose increment in Cartesian space through deep RL. However, these methods, which directly couple the modules, cannot handle large position uncertainty and complex contact dynamics simultaneously, and they are difficult to adapt quickly to new tasks because they do not fully exploit the advantages of the individual modules.

II-B Visual Representation for Robot Learning

The integration of prior visual models into the RL framework has shown promise in enhancing learning efficiency and generalization in unstructured settings through detection [14], pose estimation [35, 16], and visual affordances [15]. Unsupervised learning [13], self-supervised learning [36], and hybrid observation synthesis [37] have been applied to learn prior visual models for different robotic skills. Specifically for grasping, [38] proposes self-supervised visual affordance models grounded in real human behavior from teleoperated play data, driving a model-based planner to the vicinity of afforded regions and guiding a local grasping RL policy toward the same object regions favored by people. Building on this prior work, this study introduces a semi-supervised visual representation that provides structured information, including spatial location and task attention, for assembly skill learning.

II-C Operational Representation for Robot Learning

Operational representation can reduce the complexity of the solution space of a given manipulation problem by applying a well-designed yet flexible structure, such as formal methods for task- and domain-specific knowledge [18], stochastic graphs [39], switching functions [40], manipulation primitives [41], and graph-based skill formalisms [17, 19]. In particular, [19] uses temporal abstraction and task decomposition as the higher-level policy in a hierarchical reinforcement learning method to reduce problem complexity. Building on this work, we further extend the skill graph to fixture-less assembly tasks by integrating object detection, coarse operation planning, and a residual fine manipulation policy.

II-D Cognitive Systems and Learning Mechanisms

Research in cognitive robotics aims to emulate human intelligence, paving the way for the development of human-level artificial intelligence through cognitive architectures that leverage core capabilities such as sensing, cognition, learning, and control [42]. Existing theories offer crucial insights for creating foundational elements and learning strategies for cognitive systems, such as hybrid neural-symbolic models [20] and top-down learning [43]. In particular, [44] adopted a connectionist approach for object recognition and compliant motion learning based on adaptive resonance theory (ART), aiming to design robotic agents for assembly tasks. This study employs a skill graph to integrate the neural models, enabling human-like operation and learning.

III Problem Statement

This work focuses on locating the master object and inserting the slave object when the master object is randomly positioned within the workspace, as shown in Fig. 2.

[Figure 2]

The task can be formulated as a Markov Decision Process (MDP) with a transition function $P(S_{t+1}|S_t,A_t)$, which is a probability distribution over the next state $S_{t+1}$ conditioned on the execution of a certain action $A_t$ in the current state $S_t$. We want to find a policy $\pi(A_t|S_t)$ that dictates the probability over actions conditioned on a given state. The complexity of the transition function $P$ is determined by the degree of structure of the environment, which directly affects the difficulty of designing or learning the policy $\pi$. Unlike the environment in which humans live, more structured industrial scenarios can provide more prior knowledge for the learning process [45, 46, 32, 47]. Therefore, the assumption of known partial knowledge of the state $S_t$ and transition function $P$ is exploited as follows:

  1. Semi-structured industrial scenarios, such as flat workbenches and limited work areas, add constraints to the robot's behavior and can also simplify the robot's perception and operation requirements. In this work, the dimensions of the master object's pose are categorized into constrained and unconstrained parts. The constrained dimensions $z$, $rx$, and $ry$ are restricted by the workspace, influenced by the task and workspace shape. Conversely, the unconstrained dimensions $x$, $y$, and $rz$ vary within a specific range $S_{uncon}:[(x_{min},x_{max}),(y_{min},y_{max}),(rz_{min},rz_{max})]$.

  2. General knowledge about manipulation tasks can guide strategy design and learning, considering the spatial separation between the master and slave objects. The manipulation process can be divided into contact-free $S_{cf}$ and contact-rich $S_{cr}$ regions based on task geometry, with attention to uncertainties in pose estimation $E_r$ and contact dynamics $F^{max}$ in the contact-rich region $S_{cr}$ for precise assembly tasks. Humans utilize global and local fields of view sequentially to enhance fine manipulation, addressing the constraints of a single camera.

  3. The geometric parameters and even the forward and inverse kinematics of the robot are often available from the robot supplier. This allows us to design and learn the manipulation policy in task space. We can also use it to obtain geometric information about tools, platforms, and tasks by demonstration.

The challenge is to incorporate general prior knowledge and a learning-based model to address task-specific uncertainty. This work aims to propose a neural-symbolic cognitive manipulation method for assembly skill learning, enabling the use of prior knowledge to train a visual representation and a fine manipulation policy that handle the uncertainties of the robot, environment, and task.

IV Method

This work introduces a novel cognitive manipulation method for solving assembly tasks in semi-structured environments, as shown in Fig. 3. Central to our approach is the skill graph, which orchestrates multiple modules within a mixed-strategy framework, driving three key modes of operation: manipulation in semi-structured environments, vision model training, and fine manipulation training. This section presents the proposed methodology in three parts: 1) Cognitive manipulation architecture: a neural-symbolic framework that integrates multimodal and scalable information through a combination of model-based and learning-based methods. 2) Semi-supervised visual representation learning: a cost-effective method for hand-eye-task calibration and object detection training, enhancing the visual representation capabilities essential for precise manipulation. 3) Classroom-to-real residual reinforcement learning: training the residual policy within a specially designed classroom environment and executing the task in semi-structured settings.

[Figure 3]

IV-A Cognitive Manipulation Architecture

The cognitive manipulation architecture is designed to enable effective contact-rich manipulation in semi-structured environments. It leverages a skill graph to integrate multiple modules, including an object detection module to manage positional uncertainties, a model-based system for trajectory planning and compliance settings, and a residual policy for managing pose errors and complex contact dynamics. These components work together to support effective manipulation control, as depicted in Fig. 4 and Alg. 1.

[Figure 4]

IV-A1 Skill Graph Based on General Knowledge and Task Specification

The abstract expert knowledge of assembly tasks in semi-structured environments can be harnessed to segment the manipulation process into distinct stages and components, as depicted in Fig. 5. The skill graph integrates symbolic and subsymbolic representations according to general versus task-specific knowledge and their degree of accessibility, yielding a well-designed yet flexible structure.

We define general knowledge using a partial model that encapsulates spatial, temporal, and causal information. Initially, we consider the spatial information concerning the end-effector (EE) pose in the manipulation, which includes the home position ${}^{B}X_E^s$, the assembly bottleneck pose ${}^{B}X_E^m$, and the assembly goal pose ${}^{B}X_E^g$ in the robot base frame. Temporally, the process is segmented into four stages: reaching ${}^{B}X_E^s$ for global perception to estimate ${}^{B}X_E^g$; planning a coarse operation from ${}^{B}X_E^s$ to ${}^{B}X_E^m$ and then to ${}^{B}X_E^g$; executing the coarse operation to reach ${}^{B}X_E^m$; and performing the fine operation for insertion at ${}^{B}X_E^g$. Causal transition conditions between these stages are defined based on the positional relationships and contact states between the peg and hole. This partial model is illustrated in Eqn. (1) and Fig. 5. Although the introduction of sequential logic and causality is crucial for operating in semi-structured environments, enhancing safety and reducing learning costs under potential interference from other agents (robots or humans), this paper primarily focuses on the sequential logic and causal enhancement of robot learning methods and does not extensively address multiple exceptional states and their management strategies.

$\begin{cases} n_1: f_{gp}(\theta_1;\Omega_1) & c_1: {}^{B}X_O \in S_{uncon} \\ n_2: f_{pl}(\theta_2;\Omega_2) & c_2: \mathrm{done} \\ n_3: f_{cf}(\theta_3;\Omega_3) & c_3: X - {}^{B}X_E^m \in E_{th} \\ n_4: f_{cr}(\theta_4;\Omega_4) & c_4: X - {}^{B}X_E^g \in E_{th} \;\&\; F_z > F_z^{max} \end{cases}$   (1)
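To make the stage and transition structure of Eqn. (1) concrete, the sketch below encodes the four-stage skill graph as a simple sequential state machine; the stage modules and transition predicates are illustrative placeholders rather than the actual implementation.

```python
# Minimal sketch of the four-stage skill graph in Eqn. (1) as a state machine.
# Stage modules and transition-condition callables are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Stage:
    name: str                       # e.g. "n1_global_perception"
    run: Callable[[dict], dict]     # executes the module f(theta; Omega) and updates the context
    done: Callable[[dict], bool]    # causal transition condition c_i
    next_stage: Optional[str]       # name of the following stage, None if terminal

def execute_skill_graph(stages: Dict[str, Stage], start: str, context: dict) -> dict:
    """Run stages sequentially; each stage repeats until its transition condition holds."""
    current = start
    while current is not None:
        stage = stages[current]
        while not stage.done(context):
            context = stage.run(context)
        current = stage.next_stage
    return context
```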

[Figure 5]

Despite the generic nature of temporal and causal information in assembly tasks, task-specific information remains essential. For the manipulation process, the bottleneck pose ${}^{B}X_E^m$ and goal pose ${}^{B}X_E^g$ of the EE are determined by the task's geometry and assembly relationships, the master object pose, the grasp pose for the slave object, and the tool center point (TCP) offset. An Object-Embodiment-Centric (OEC) geometry representation is derived from demonstrations to enable direct prediction of key waypoints in the operational process from the master object pose estimate, as outlined in our prior work [48]. The OEC representation selects a grasping point on the task board as the coordinate origin ${}^{B}X_O$, extracts the bottleneck pose ${}^{B}X_E^m$ and goal pose ${}^{B}X_E^g$ of the EE in the robot base frame from teaching, and then transforms these to ${}^{O}X_E^m$ and ${}^{O}X_E^g$ in the task frame. In the semi-structured environment, the master object is randomly placed within a workspace of range $S_{uncon}$, as specified by a human operator. The home position of the EE, ${}^{B}X_E^s$, is strategically set outside this region to prevent occlusion of the eye-to-hand camera's field of view. With the partial constraints of the workspace, the pose can be determined by estimating the $x$, $y$, and $rz$ dimensions.
The uncertainties introduced by the pose estimation and demonstration system, as well as contact dynamics, are also considered for precise contact-rich task execution. In summary, building on this general knowledge base, both planning-based and learning-based methods are utilized to develop a composite policy, which includes object detection for pose estimation and task-related visual attention extraction, spatially dependent trajectory planning as the base strategy, and task-specific residual strategies for handling uncertainties. A graph structure is devised to reflect the required sequence of motions and modules necessary to complete the task, as illustrated in Fig. 4.
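As a rough illustration of the OEC waypoint transfer described above, the sketch below stores demonstrated end-effector waypoints relative to the task frame and re-expresses them in the robot base frame once the master object pose is re-estimated; representing poses as 4x4 homogeneous transforms is our simplification (the paper works with 6-DoF pose vectors).

```python
import numpy as np

# Sketch of the Object-Embodiment-Centric (OEC) waypoint transfer.
# Poses are 4x4 homogeneous transforms; helper names are illustrative only.

def to_task_frame(T_B_O: np.ndarray, T_B_E: np.ndarray) -> np.ndarray:
    """Express a demonstrated EE pose ^B X_E in the task frame O: ^O X_E = (^B X_O)^-1 @ ^B X_E."""
    return np.linalg.inv(T_B_O) @ T_B_E

def to_base_frame(T_B_O_new: np.ndarray, T_O_E: np.ndarray) -> np.ndarray:
    """Recover the EE waypoint in the base frame for a newly estimated master object pose."""
    return T_B_O_new @ T_O_E

# Usage: store ^O X_E^m and ^O X_E^g once from teaching, then at run time compute
# ^B X_E^m = to_base_frame(T_B_O_estimated, T_O_E_m) for coarse operation planning.
```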

IV-A2 Compliance Controller

Employing a virtual force-driven spring-mass-damper model and the robot kinematics, we utilize a modified Cartesian parallel position and force controller as the low-level controller for robot learning, generating velocity commands. The control law for the joint velocities $\dot{q}$ is expressed as:

$\dot{q} = \frac{J^{-1}M^{-1}}{s + M^{-1}B}\left[K(X_d - X) - (F_d - F)\right]$   (2)

where the Jacobian matrix $J$ relates end-effector and joint velocities. The desired pose $X_d$ and force $F_d$ govern the behavior, with the stiffness matrix $K$ balancing the six-dimensional tracking errors for position/orientation and force/torque. The inertia matrix $M$ and damping matrix $B$ influence the response speed and stability.
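As one possible discrete-time reading of Eqn. (2), the sketch below replaces the transfer function $1/(s + M^{-1}B)$ with a forward-Euler admittance filter ($M\dot{v} + Bv = u$) and maps the filtered Cartesian velocity to joint velocities; this is an illustrative implementation, not the authors' controller code.

```python
import numpy as np

class CartesianComplianceController:
    """Sketch of the parallel position/force compliance law in Eqn. (2),
    discretized with a forward-Euler admittance filter (M v_dot + B v = u)."""

    def __init__(self, M: np.ndarray, B: np.ndarray, K: np.ndarray, dt: float):
        self.M_inv = np.linalg.inv(M)   # 6x6 virtual inertia
        self.B = B                      # 6x6 virtual damping
        self.K = K                      # 6x6 stiffness
        self.dt = dt
        self.v = np.zeros(6)            # filtered Cartesian velocity state

    def step(self, J: np.ndarray, X: np.ndarray, X_d: np.ndarray,
             F: np.ndarray, F_d: np.ndarray) -> np.ndarray:
        """Return the joint velocity command q_dot for the current measurements."""
        u = self.K @ (X_d - X) - (F_d - F)                  # parallel position/force error
        self.v += self.dt * self.M_inv @ (u - self.B @ self.v)
        return np.linalg.pinv(J) @ self.v                   # pseudo-inverse in case J is non-square
```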

IV-A3 Object Detection for Pose Estimation and Task Attention

Object detection is a popular technique used to locate objects in an image or video stream. It predicts multiple bounding boxes for objects in the image $I$, and each bounding box contains the predicted values for the object's position $(x, y)$, size $(w, h)$, confidence $c_{con}$, and category $c_{cate}$, as shown in (3). With a pre-defined confidence threshold, the effective predicted bounding box for the object is selected.

$[c_{cate}, x, y, w, h, c_{con}] = \mathrm{detect}(I)$   (3)

To cover the entire workspace and accurately detect the object of interest, we attach an eye-to-hand RGB camera at the top of the workspace to capture the 2D image $I_{eth}$, as shown in Fig. 4 (b). An object detection-based coarse perception system generates one bounding box around the object of interest to obtain its location $(x_0, y_0)$ and two further bounding boxes around predefined feature structures to obtain the locations $(x_1, y_1)$ and $(x_2, y_2)$. According to the eye-to-hand transformation ${}^{r}T_c$, the estimated points $(x_i', y_i')$ are transformed to the robot frame as shown in (4). Considering the partial pose information of the object, including $rx_{con}$, $ry_{con}$, and $z_{con}$, the pose ${}^{B}X_O$ can be determined as in (5). Global perception is used in the first stage $n_1$ to determine whether the master assembly object is ready for the assembly operation on the one hand, and to provide location information for the assembly operation on the other.

$(x_i', y_i') = \mathrm{transform}(x_i, y_i), \quad i = 0, 1, 2$   (4)
${}^{B}X_O = [x_0', y_0', z_{con}, rx_{con}, ry_{con}, \arctan(\frac{y_2 - y_1}{x_2 - x_1})]$   (5)
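As a concrete reading of Eqns. (4) and (5), the sketch below maps the three detected bounding-box centers to the robot frame with the planar transform of Eqn. (12) and assembles the partial pose; the helper names and the use of arctan2 (for numerical robustness) are our assumptions.

```python
import numpy as np

# Sketch of the coarse pose estimate in Eqns. (4)-(5): pixel detections are mapped
# to the robot frame with a planar affine transform (the 2x3 matrix of Eqn. (12)),
# and the constrained z/rx/ry values are filled in from the workspace model.
# Names (affine_px_to_robot, z_con, ...) are illustrative, not the authors' API.

def estimate_master_pose(centers_px, affine_px_to_robot, z_con, rx_con, ry_con):
    """centers_px: [(x0, y0), (x1, y1), (x2, y2)] bounding-box centers in pixels."""
    pts = []
    for (x, y) in centers_px:
        xr, yr = affine_px_to_robot @ np.array([x, y, 1.0])   # Eqn. (4)/(12)
        pts.append((xr, yr))
    (x0, y0), (x1, y1), (x2, y2) = pts
    rz = np.arctan2(y2 - y1, x2 - x1)                          # in-plane orientation
    return np.array([x0, y0, z_con, rx_con, ry_con, rz])       # ^B X_O of Eqn. (5)
```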

The second model uses the image $I_{eih}$ from an eye-in-hand camera to provide local task detection as attention, enhancing the ability to differentiate the task from the environment, as indicated in Fig. 4 (b). Object detection provides a bounding box as a region of interest (ROI) identifying the specific structure crucial for vision-based precise alignment in assembly tasks. This work uses a simple attention strategy that utilizes this bounding box $(x_a, y_a, w_a, h_a)$ to crop and resize the task-related area from the input image $I_{eih}$ and generate an attention-guided observation $I_{atten}$ for fine manipulation, enabling the residual policy to concentrate on the specific structure amidst varying environments.

$I_{atten} = \mathrm{crop}((x_a, y_a, w_a, h_a), I_{eih})$   (6)
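A minimal sketch of the crop-and-resize operation in Eqn. (6), assuming the bounding box is given as a center point with width and height and a fixed 64x64 policy input (both assumptions for illustration), is:

```python
import numpy as np

# Sketch of the attention crop in Eqn. (6): the detected ROI is cut out of the
# eye-in-hand image and resized to the policy's input resolution.

def crop_attention(image: np.ndarray, box, out_size=(64, 64)) -> np.ndarray:
    x_a, y_a, w_a, h_a = box                      # bounding box: center (x, y), width, height
    x0 = int(max(x_a - w_a / 2, 0)); y0 = int(max(y_a - h_a / 2, 0))
    x1 = int(min(x_a + w_a / 2, image.shape[1])); y1 = int(min(y_a + h_a / 2, image.shape[0]))
    roi = image[y0:y1, x0:x1]
    # nearest-neighbour resize via index sampling, to avoid extra dependencies
    ys = np.linspace(0, roi.shape[0] - 1, out_size[0]).astype(int)
    xs = np.linspace(0, roi.shape[1] - 1, out_size[1]).astype(int)
    return roi[np.ix_(ys, xs)]
```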

IV-A4 Coarse Operation Planning

With the partial model and the coarse perception system, we can plan a coarse operation as the second stage $n_2$. We first obtain the assembly goal pose ${}^{B}X_E^g$ and bottleneck pose ${}^{B}X_E^m$ from the estimated pose ${}^{B}X_O$ and the OEC geometry information ${}^{O}X_E^m$ and ${}^{O}X_E^g$, which divide the operation into contact-free and contact-rich manipulation.

The uncertainty due to pose estimation and compliance control can be ignored in the contact-free region $S_{cf}$. A fast minimum-jerk trajectory $\tau_{cf}$ between the home pose ${}^{B}X_E^s$ and the bottleneck pose ${}^{B}X_E^m$ can be generated. In addition, a high stiffness $K_{cf}$ of the compliance controller is used to ensure acceptable position tracking errors. Together, the trajectory and stiffness provide the coarse operation for the contact-free region, which can be defined as Eqn. (7).

$\pi_H^{cf} \sim (\tau_{cf}(t), K_{cf})$   (7)
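The trajectory $\tau_{cf}$ above is only described as a fast minimum-jerk motion; a minimal sketch, assuming the standard quintic minimum-jerk blend and simple linear interpolation of all six pose dimensions (orientation would normally be interpolated on SO(3)), is:

```python
import numpy as np

# Sketch of a minimum-jerk trajectory tau_cf(t) between two end-effector poses,
# using the standard quintic blend s(tau) = 10 tau^3 - 15 tau^4 + 6 tau^5.

def min_jerk_trajectory(x_start: np.ndarray, x_goal: np.ndarray, duration: float, dt: float):
    """Yield intermediate 6-DoF waypoints from x_start to x_goal."""
    steps = max(1, int(duration / dt))
    for k in range(steps + 1):
        tau = k / steps
        s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5   # smooth 0 -> 1 blend, zero velocity/acceleration at both ends
        yield x_start + s * (x_goal - x_start)
```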

However, the uncertainty cannot be ignored in the contact-rich region. A slow trajectory $\tau_{cr}$ and a small stiffness are used in the contact-rich region $S_{cr}$. We define an exploration space $W$ to represent the offset range of the compliant robot when disturbed by safe external forces $F_{max}$. It should cover the assembly depth and the error range to ensure safe contact and effective error compensation, as shown in Eqn. (8). Furthermore, the small stiffness matrix $K_{cr}$ of the compliance controller is obtained from the estimated exploration space $W$ and the maximum contact force $F_{max}$, which can be defined as,

$W = E_r + \mathrm{abs}({}^{B}X_E^m - {}^{B}X_E^g)$
$K_{cr} = F_{max} \cdot \mathrm{diag}(W)^{-1}$
$\pi_H^{cr} \sim (\tau_{cr}(t), K_{cr})$   (8)
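A direct reading of Eqn. (8) in code, with all numeric values purely illustrative, might be:

```python
import numpy as np

# Sketch of the contact-rich compliance parameters in Eqn. (8): the exploration
# space W combines the expected perception error E_r with the bottleneck-to-goal
# offset, and the stiffness is scaled so that the safe force F_max produces at
# most a deflection of W in each dimension.

def contact_rich_stiffness(E_r: np.ndarray, X_m: np.ndarray, X_g: np.ndarray, F_max: float):
    W = E_r + np.abs(X_m - X_g)                 # per-dimension exploration range
    K_cr = F_max * np.linalg.inv(np.diag(W))    # F_max * diag(W)^-1
    return W, K_cr

# Example (hypothetical numbers): a uniform 2 mm error bound and a 10 N safe force.
# W, K_cr = contact_rich_stiffness(np.full(6, 2e-3), X_m, X_g, F_max=10.0)
```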

IV-A5 Residual Policy for Fine Manipulation

The manipulation is divided into two phases according to the planned coarse operation and is carried out by the compliance controller in (2). In the third stage $n_3$, the end-effector of the robot is moved from the home pose ${}^{B}X_E^s$ to the bottleneck pose ${}^{B}X_E^m$ with the planned efficient policy $\pi_H^{cf}$.

Since the contact-rich assembly manipulation in the fourth stage $n_4$ requires a higher level of accuracy than conventional robot and vision systems provide, the planned policy is switched to the safe contact-rich policy $\pi_H^{cr}$ and the residual policy $\pi_\theta$ is enabled to refine the initial policy for precise localization and complex force control. In addition to guidance from the fixed policy, the residual policy also receives attention-guided observations $I_{atten}$ from object detection. The force/torque reading $F$ from the wrist-mounted sensor and the relative pose $R_p = {}^{B}X_E - {}^{B}X_E^g$ of the end-effector serve as additional observations of the contact dynamics. The residual policy generates the desired force and torque $F_d$ in Eqn. (9), which together with $\pi_H^{cr}$ serve as inputs to the compliance controller in Eqn. (2).

$F_d = \pi_\theta(I_{atten}, R_p, F)$   (9)
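To illustrate how stage $n_4$ combines these pieces, the sketch below (reusing the compliance-controller sketch from Section IV-A2 and treating the residual policy as an opaque callable) computes one control step; none of the names are the authors' API.

```python
# Sketch of one control step in the contact-rich stage n4: the planned coarse
# policy supplies the reference pose X_d and stiffness K_cr, the residual policy
# supplies the corrective wrench F_d (Eqn. (9)), and both feed the compliance
# controller of Eqn. (2). `residual_policy` and `controller` are placeholders.

def fine_manipulation_step(controller, residual_policy, tau_cr, K_cr,
                           t, I_atten, X_E, X_E_goal, F, J):
    X_d = tau_cr(t)                         # reference pose from the slow coarse trajectory
    R_p = X_E - X_E_goal                    # relative pose observation
    F_d = residual_policy(I_atten, R_p, F)  # residual wrench, Eqn. (9)
    controller.K = K_cr                     # low stiffness for safe contact
    return controller.step(J, X_E, X_d, F, F_d)   # joint velocity command, Eqn. (2)
```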

With the help of coarse operation planning, the learning of cognitive manipulation is carried out in separate stages. The object detection models and the hand-eye-task calibration are obtained via semi-supervised visual representation learning in Subsection IV-B. The residual policy is trained by classroom-to-real residual reinforcement learning in Subsection IV-C.

IV-B Semi-supervised Visual Representation Learning

This section delves into training two object detection models and calibrating hand-eye-task relationships using collected samples based on the geometric model of a specific assembly task and the robot’s kinematics. Our approach enables the gathering of a varied sample set through a carefully planned coarse operation in pick-and-place. To address the challenges related to accurate data labeling, we suggest a streamlined calibration and labeling process that significantly reduces the engineering effort. Furthermore, fine-tuning from a pre-trained model is utilized to reduce the reliance on extensive sample volumes.

[Figure 6]

IV-B1 Data Collection via Coarse Operation

The master object may appear at different points $P$ throughout the workspace $S_w$, requiring global localization; moreover, after global localization a relative pose error $R_P$ may remain within the range $E_r$, requiring further local perception as visual attention for fine manipulation. To ensure data diversity for training the position and attention models, we first uniformly sample $m$ points within the workspace where the task board may appear in the real scene, denoted by $P = [x^r, y^r]$. Second, we apply offsets to the bottleneck pose ${}^{B}X_E^m$ and generate $n$ points that cover the uncertainty space while avoiding collisions, defined as $R_P = [\Delta x^r, \Delta y^r, \Delta z^r]$. The sampling points and poses are shown in Fig. 6 (b). We collect eye-in-hand and eye-to-hand images, together with the corresponding positions and relative poses, using a hand-designed trajectory for data diversity, as shown in Eqn. (10).

$(I_i^{eth}, [x_i^r, y_i^r]), \quad i = 1, 2, \ldots, m \times n$
$(I_i^{eih}, [\Delta x_i^r, \Delta y_i^r, \Delta z_i^r]), \quad i = 1, 2, \ldots, m \times n$   (10)
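A sketch of the sampling loop behind Eqn. (10) is shown below; the robot and camera interfaces (place_task_board, move_to_bottleneck_offset, capture) are hypothetical placeholders for whatever drivers the system provides.

```python
import itertools

# Sketch of the sampling scheme behind Eqn. (10): m workspace positions for the
# task board combined with n bottleneck-pose offsets, yielding m x n paired
# eye-to-hand / eye-in-hand samples.

def collect_samples(workspace_xy, offsets, robot, eth_cam, eih_cam):
    """workspace_xy: m task-board positions; offsets: n (dx, dy, dz) perturbations."""
    eth_samples, eih_samples = [], []
    for (x, y), (dx, dy, dz) in itertools.product(workspace_xy, offsets):
        robot.place_task_board(x, y)                   # pick-and-place via the coarse operation
        eth_samples.append((eth_cam.capture(), [x, y]))
        robot.move_to_bottleneck_offset(dx, dy, dz)    # perturbed local viewpoint
        eih_samples.append((eih_cam.capture(), [dx, dy, dz]))
    return eth_samples, eih_samples
```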

IV-B2 Calibration and Automated Labeling

For the eye-to-hand camera, the transformations between the camera and the robot, ${}^{c}T_r$ and ${}^{r}T_c$, need to be calibrated for automated labeling and vision-based localization in robot control. Considering only the $x$ and $y$ dimensions, the transforms can be formulated as Eqns. (11) and (12).

$[x_i^c, y_i^c]^T = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} [x_i^r, y_i^r, 1]^T$   (11)
$[x_i^r, y_i^r]^T = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{bmatrix} [x_i^c, y_i^c, 1]^T$   (12)

For the eye-in-hand camera, we focus on the mapping between the relative robot-task motion expressed in the robot coordinate system and in the pixel coordinate system, which enables semi-automatic annotation. Because the image Jacobian can be considered constant within a limited space, the transform $J$ can be formulated as Eqn. (13).

$\begin{bmatrix}{}^{l}\Delta x_i^{c}\\{}^{l}\Delta y_i^{c}\\{}^{r}\Delta x_i^{c}\\{}^{r}\Delta y_i^{c}\end{bmatrix}=\begin{bmatrix}c_{11}&c_{12}&c_{13}&c_{14}\\c_{21}&c_{22}&c_{23}&c_{24}\\c_{31}&c_{32}&c_{33}&c_{34}\\c_{41}&c_{42}&c_{43}&c_{44}\end{bmatrix}\begin{bmatrix}\Delta x_i^{r}\\\Delta y_i^{r}\\\Delta z_i^{r}\\1\end{bmatrix}$   (13)

Using a small number of pixel-frame coordinates $L_{eth}^{i}=[x_i^{c}, y_i^{c}]$ and $L_{eih}^{i}=[{}^{l}\Delta x_i^{c}, {}^{l}\Delta y_i^{c}, {}^{r}\Delta x_i^{c}, {}^{r}\Delta y_i^{c}]$ provided by manual annotation, together with the robot-frame coordinates recorded during sampling, the transformations between the robot base frame and the eye-to-hand pixel frame, ${}^{c}T_{r}$ and ${}^{r}T_{c}$, as well as the relative-motion mapping $J$ for the eye-in-hand camera, can be estimated. The calibrated transforms are then used to automatically label the remaining images. In addition, the transform ${}^{r}T_{c}$ from the pixel frame to the robot base frame serves as the hand-eye-task calibration, estimating the assembly pose of the robot's end-effector for localization by the eye-to-hand camera.
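As an illustration of how the calibration of Eqns. (11)-(13) can be fitted from a handful of manual labels, the sketch below solves for the affine matrices by linear least squares and then auto-labels the remaining samples. Variable names, shapes, and the synthetic example data are our own assumptions, not the authors' code.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares fit of dst ~ A @ [src, 1]^T, as in Eqns. (11)-(13).

    src: (N, d_in) coordinates (e.g. robot-frame positions)
    dst: (N, d_out) coordinates (e.g. pixel-frame positions)
    returns A with shape (d_out, d_in + 1)
    """
    src_h = np.hstack([src, np.ones((src.shape[0], 1))])   # homogeneous coordinates
    A, *_ = np.linalg.lstsq(src_h, dst, rcond=None)        # shape (d_in+1, d_out)
    return A.T

def auto_label(A, src):
    """Apply a fitted transform to label the remaining, un-annotated samples."""
    src_h = np.hstack([src, np.ones((src.shape[0], 1))])
    return src_h @ A.T

# Example: fit ^cT_r from 8 manually labeled eye-to-hand images (synthetic data here).
robot_xy = np.random.rand(8, 2)                                         # [x^r, y^r] from sampling
pixel_xy = robot_xy @ np.array([[500.0, 0.0], [0.0, -500.0]]) + 320.0   # manual pixel labels
cTr = fit_affine(robot_xy, pixel_xy)

labels = auto_label(cTr, np.random.rand(100, 2))                        # labels for remaining images
residual_std = np.std(auto_label(cTr, robot_xy) - pixel_xy)             # calibration quality index
```

The same `fit_affine` call can be reused for ${}^{r}T_{c}$ (swapping source and destination) and for the 4x3 image Jacobian $J$ of the eye-in-hand camera, with the residual standard deviation playing the role of the quantitative index discussed in Sec. V-B1.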

IV-B3 Fine-tuning from Pre-trained Model

This work adopts a one-stage real-time object detection approach, YOLO (You Only Look Once), to estimate the assembly goal pose and visual attention; YOLO is known for its robustness and fast inference. In addition, a model pre-trained on the ImageNet dataset provides initial parameters for training on the custom dataset, which requires far fewer samples. The images and labels are divided into training and test sets for model training and evaluation.
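As a hedged illustration only: assuming the detector is implemented with the Ultralytics YOLO package (the paper states only that YOLO is used, not the version or training interface), fine-tuning from a pre-trained checkpoint on the auto-labeled dataset could look like the following. The checkpoint name, dataset YAML, and hyperparameters are placeholders.

```python
from ultralytics import YOLO

# Start from a pre-trained checkpoint so that few custom samples are needed.
model = YOLO("yolov8n.pt")  # assumed checkpoint; the paper only says "YOLO"

# "gear_assembly.yaml" is a placeholder dataset description listing the
# train/test splits produced by the semi-automatic annotation step.
model.train(data="gear_assembly.yaml", epochs=100, imgsz=640)

metrics = model.val()   # precision, recall, mAP@.5, mAP@.5:.95 on the test split
print(metrics.box.map)  # mAP@.5:.95
```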

IV-C Classroom-to-Real Residual Reinforcement Learning

In this subsection, we present a practical residual reinforcement learning scheme for fine manipulation that addresses the challenges of exploration efficiency and safety in semi-structured environments. Classroom-to-real learning trains the residual policy within a simplified structured environment and subsequently transfers it to semi-structured environments. The visual representation and coarse operation provide a base policy and task-relevant features for context generalization, facilitating effective learning and seamless transfer.


IV-C1 Curriculum Residual Learning

In a structured environment, the key poses of the end-effector can be obtained through demonstration. The coarse operation is generated as a base policy, and local task detection is loaded as visual attention to initialize the cognitive manipulation. This work formulates the combination of the base and residual sub-policies on top of the task-space compliance controller as follows:

$\dot{q}=f_{cr}(\pi_{H},\pi_{\theta})$   (14)

To increase the robustness of the residual policy to the perceptual uncertainty of the pose estimator, a random error is injected into the trajectory. However, the error range affects learning efficiency, since it governs both exploration difficulty and sample diversity, as shown in Fig. 7. Therefore, this work automatically controls the task difficulty by adding or subtracting $\varepsilon$ from the guidance error range $E_{r0}$ so as to keep the success rate $s_r$ within the desired interval $[\alpha,\beta]$, as shown in Eqn. (15).

$E_r=E_{r0}+\varepsilon\cdot 1_{s_r>\beta}-\varepsilon\cdot 1_{s_r<\alpha}$   (15)
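A minimal sketch of the curriculum rule in Eqn. (15) is given below, using the 2 mm initial range, 0.5 mm step, and [0.5, 0.7] interval stated in Sec. V-A2; the clipping bounds are our assumption.

```python
def update_error_range(error_range, success_rate, step=0.0005,
                       alpha=0.5, beta=0.7, min_range=0.0, max_range=0.02):
    """Curriculum of Eqn. (15): widen the injected pose error when the task is
    too easy (s_r > beta), shrink it when it is too hard (s_r < alpha)."""
    if success_rate > beta:
        error_range += step
    elif success_rate < alpha:
        error_range -= step
    return min(max(error_range, min_range), max_range)

# Example: starting from 2 mm and adjusting by 0.5 mm per evaluation window.
er = 0.002
for sr in [0.8, 0.75, 0.4, 0.65]:
    er = update_error_range(er, sr)
    print(f"success rate {sr:.2f} -> error range {er * 1000:.1f} mm")
```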

IV-C2 Reward Shaping

This work normalizes the Euclidean distance between the end-effector's current pose $X$ and the target pose ${}^{B}X_{E}^{g}$ to create a guidance reward $R_{guid}$ that steers the end-effector toward the target pose, as in Eqn. (16). The force penalty $R_{forc}$ is defined according to the interaction force $F$ to encourage smooth operation, as in Eqn. (17). Additionally, if the transition condition $c_4$ is satisfied, a positive reward $R_{succ}$ is granted, as in Eqn. (18). The resulting multi-objective reward function is defined in Eqn. (19), where the weights $\lambda_1$, $\lambda_2$, and $\lambda_3$ balance the sub-objectives.

$R_{guid}=\left\|\mathrm{diag}(W)^{-1}\left(X-{}^{B}X_{E}^{g}\right)\right\|$   (16)
$R_{forc}=\left\|\mathrm{diag}(F^{\max})^{-1}F\right\|$   (17)
$R_{succ},\,d=\begin{cases}100,\,1 & \text{if } c_4\\ 0,\,0 & \text{otherwise}\end{cases}$   (18)
$r(s)=\lambda_{1}R_{guid}+\lambda_{2}R_{forc}+\lambda_{3}R_{succ}$   (19)
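The reward terms of Eqns. (16)-(19) can be computed as in the sketch below. The weight values follow Sec. V-A2, but the normalization vector $W$ and the sign convention (distance and force terms entering as penalties so that the reward grows near the goal) are our assumptions.

```python
import numpy as np

def reward(X, X_goal, F, W, F_max, c4, lambdas=(1.0, 0.8, 1.0)):
    """Multi-objective reward of Eqns. (16)-(19).

    X, X_goal: current and target end-effector poses (6-vectors, numpy arrays)
    F:         measured interaction wrench (6-vector)
    W, F_max:  normalization vectors for pose error and wrench
    c4:        boolean transition (success) condition
    """
    R_guid = np.linalg.norm((X - X_goal) / W)       # Eqn. (16), normalized pose error
    R_forc = np.linalg.norm(F / F_max)              # Eqn. (17), normalized contact wrench
    R_succ, done = (100.0, True) if c4 else (0.0, False)   # Eqn. (18)

    l1, l2, l3 = lambdas
    # Eqn. (19); guidance and force terms are treated here as penalties (negated),
    # which is our reading of "increases as it gets closer to the target pose".
    r = l1 * (-R_guid) + l2 * (-R_forc) + l3 * R_succ
    return r, done
```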

IV-C3 Soft Actor-Critic

A model-free DRL algorithm, soft actor-critic (SAC), is introduced to achieve a real-time optimal control strategy for intricate fine manipulation. Unlike pure RL, residual learning improves the performance of the entire policy by optimizing only its residual parameterized part. The state and action of the residual policy, along with the reward of the overall policy, are gathered over multiple recurring episodes and stored in a replay buffer for off-policy learning. Training is carried out in a structured environment, and the policy is then transferred to a semi-structured environment. The cognitive manipulation procedure is depicted in Algorithm 1, where Lines 1-2 acquire the coarse operations $\pi^{cf}_{H}$ and $\pi^{cr}_{H}$ from the estimated localization ${}^{B}X_{E}^{g}$, Lines 3-9 obtain the desired pose $X_d$, stiffness $K$, and desired force/torque $F_d$, Line 6 drives the robot to the assembly bottleneck pose, and Lines 11-18 perform fine manipulation for the precise assembly task.

Algorithm 1: Cognitive manipulation

Require: $\Delta P_g$, $\Delta P_p$, $X$, $F$, $I_{eth}$, $I_{eih}$
Ensure: $X_d$, $F_d$, $K$
1: Estimate localization $P$ with Eqn. (1-4)
2: Plan motion guidance $\pi^{cf}_{H}$ and $\pi^{cr}_{H}$ with Eqn. (6-9)
3: for each time step do
4:   $X_d, K \leftarrow \pi^{cf}_{H}$
5:   $F_d \leftarrow 0$
6:   Apply action $X_d$, $F_d$, and $K$ to the robot controller
7:   if $|X-P_p|\leq E_c$ then
8:     break
9:   end if
10: end for
11: for each time step do
12:   Estimate attention $I_{atten}$ with Eqn. (5)
13:   $X_d, K \leftarrow \pi^{cr}_{H}$
14:   $F_d \leftarrow \pi_{\theta}$
15:   Apply action $X_d$, $F_d$, and $K$ to the robot controller
16:   if $|X-P_g|\leq E_c$ then
17:     break
18:   end if
19: end for
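The sketch below mirrors the two phases of Algorithm 1 in Python. The interfaces `robot`, `cameras`, `base_policy_cf`, `base_policy_cr`, and `residual_policy` are hypothetical placeholders for the modules described above, not an API defined in the paper.

```python
import numpy as np

def cognitive_manipulation(robot, cameras, base_policy_cf, base_policy_cr,
                           residual_policy, P_p, P_g, E_c, max_steps=120):
    """Two-phase execution mirroring Algorithm 1: contact-free coarse motion
    to the bottleneck pose, then attention-guided fine manipulation."""
    # Phase 1 (Lines 3-10): contact-free guidance toward the bottleneck pose P_p.
    for _ in range(max_steps):
        X_d, K = base_policy_cf(robot.pose())        # desired pose and stiffness
        F_d = np.zeros(6)                            # no desired contact wrench yet
        robot.apply(X_d, F_d, K)
        if np.linalg.norm(robot.pose() - P_p) <= E_c:
            break

    # Phase 2 (Lines 11-19): the residual policy adds force commands around the base motion.
    for _ in range(max_steps):
        attention = cameras.eye_in_hand_attention()          # task-focused ROI, Eqn. (5)
        X_d, K = base_policy_cr(robot.pose(), attention)
        F_d = residual_policy(attention, robot.wrench())     # learned residual action
        robot.apply(X_d, F_d, K)
        if np.linalg.norm(robot.pose() - P_g) <= E_c:        # assembly goal reached
            break
```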

V Experiment

This section delineates the experimental validation of the proposed cognitive manipulation method, specifically designed for robotic assembly tasks within semi-structured environments. First, we introduce the robot hardware and software and establish several baselines to compare the proposed method with existing methodologies. Second, comparison and ablation experiments are performed in simulation to validate the hand-eye calibration and auto-annotation methods using small amounts of manually annotated data, the effect of embodied data acquisition on the object detection models, and the effect of object detection, through its location estimates and visual attention, on the training and performance of fine manipulation in semi-structured environments. Finally, a comprehensive evaluation on two real tasks underscores the practical applicability of our approach for robotic assembly within semi-structured environments.

V-A Experiment Setup

V-A1 Hardware and Software

The experiments are conducted on a computer equipped with an Nvidia GeForce RTX 2060 GPU and an Intel i7-9700 CPU. The Robot Operating System (ROS) is utilized as the middleware, facilitating seamless communication between the learning algorithms, control modules, and the robotic system.

V-A2 Partial Model, Data Collection and Reward Design

The geometric information for data collection in the semi-supervised object detection models and for classroom-to-real fine manipulation policy training is obtained by demonstration. The maximum contact force $F^{max}$ is set to 10 N in the x, y, and z directions and 0.1 N·m about Rx, Ry, and Rz. The hybrid policy updates the pose and force commands to the controller at 5 Hz, while the controller outputs target joint velocities for the robot at 120 Hz. Each experimental episode is capped at 120 steps, with the policy networks undergoing 200 gradient updates per episode. The reward weights $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 1, 0.8, and 1, respectively, as determined by preliminary experiments to balance operation speed and smoothness. The curriculum increases or decreases the error range, starting from 2 mm, in steps of 0.5 mm to keep the success rate within the desired interval [0.5, 0.7].
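For reference, the settings listed above can be collected into a single configuration. This is simply a restatement of the values in the text, not the authors' actual configuration file.

```python
# Training settings from Sec. V-A2 (values restated from the text).
TRAINING_CONFIG = {
    "max_contact_force_N": 10.0,          # x, y, z directions
    "max_contact_torque_Nm": 0.1,         # Rx, Ry, Rz directions
    "policy_rate_hz": 5,                  # pose/force commands to the controller
    "controller_rate_hz": 120,            # joint-velocity output
    "max_episode_steps": 120,
    "gradient_updates_per_episode": 200,
    "reward_weights": (1.0, 0.8, 1.0),    # lambda_1, lambda_2, lambda_3
    "curriculum_step_mm": 0.5,
    "initial_error_range_mm": 2.0,
    "success_rate_interval": (0.5, 0.7),
}
```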

V-A3 Baselines for Comparative Study

To underscore the advantages of our cognitive manipulation architecture in terms of learning efficiency and context generalization, we compare our method in assembly tasks against the following baselines:

  1. Baseline 1 [28]: This baseline directly predicts the desired final poses of the slave object for manipulation using only raw RGB-D images and implements a simple open-loop control scheme on a real robot. The training data for pose estimation is synthetically generated, facilitating easy manipulation of geometry and texture.

  2. Baseline 2 [49, 27]: This approach leverages direct teaching to specify global master object poses and plans robotic motion accordingly. It utilizes hand-mounted cameras and visual classifiers to predict and correct positioning errors, enabling precise attachment of master and slave objects without calibration by means of a search policy.

  3. Baseline 3 [32]: This baseline introduces a perception system with uncertainty estimates to delineate regions where the model-based policy is reliable from those where it may be flawed or undefined, blending the strengths of model-based and learning-based methods.

  4. Baseline 4 [8]: This work combines a vision-based fixed policy with a contact-based residual parametric policy, enhancing the robustness and efficiency of the RL algorithm.

  5. Baseline 5 [50]: This baseline employs similar residual learning techniques with Cartesian impedance control, utilizing visual inputs for larger error adjustments during contact-rich manipulation.

V-B Simulation Experiment for Comparative and Ablation Study

Our proposed approach seeks to enhance the sampling efficiency and reduce the engineering effort required for policy reconfiguration in contact-rich tasks within semi-structured environments. This is accomplished by integrating semi-supervised learning of object detection with classroom-to-real residual reinforcement learning of fine manipulation. To facilitate a comparative and ablation study, we have developed a simulation environment based on Gazebo, which allows for dynamic loading and deletion of objects at any position within the defined space, thereby constructing a robot assembly task in a semi-structured setting. The application of our proposed cognitive manipulation framework to a new assembly task is structured into four distinct stages: Embodied hand-eye-task calibration and semi-automatic annotation, supervised fine-tuning of the object detection models, residual reinforcement learning of fine manipulation, and application of the integrated strategy to a semi-structured environment. Each stage is progressively compared with established baselines to evaluate the advantages of the proposed methodologies.

V-B1 Embodied Hand-Eye-Task Calibration and Semi-Automatic Annotation

The dataset is curated based on prior knowledge of uncertainty to ensure sample diversity. The initial stage aims to minimize the costs associated with manual labeling and hand-eye calibration while generating high-quality labeling data and accurately estimating the target assembly pose of the end-effector. This stage encompasses four critical processes: data acquisition, manual labeling, hand-eye-task relationship fitting, and semi-automatic labeling. We explore the dependency on the proportion of manually annotated samples and its advantages over purely manual annotation. The standard deviation of the hand-eye relationship fitting serves as a quantitative index for evaluating the calibration and annotation.

Number of manual labels | ${}^{c}T_{r}$ STD (Manual) | ${}^{c}T_{r}$ STD (All) | $J$ STD (Manual) | $J$ STD (All) | ${}^{r}T_{c}$ STD
8  | 0.0039 | 0.0007 | 0.0062 | 0.0010 | 0.0020
19 | 0.0043 | 0.0010 | 0.0064 | 0.0014 | 0.0025
38 | 0.0041 | 0.0013 | 0.0068 | 0.0021 | 0.0023


The number of manually labeled samples and the corresponding accuracy of ${}^{c}T_{r}$, $J$, and ${}^{r}T_{c}$, represented by standard deviation, as well as the standard deviation of ${}^{c}T_{r}$ and $J$ over all data, are shown in Table III. The results demonstrate that a small number of manually labeled samples can effectively establish the hand-eye calibration relations and enable semi-automatic labeling of the remaining images, reducing the standard deviation of data annotation. Examples of both manually and automatically annotated data are illustrated in Fig. 12. The semi-automatic annotation reduces the cost of manual annotation by 97.4%.

V-B2 Supervised Fine-tuning for Object Detection

In this phase, we examine the impact of sample diversity on model performance by designing various sampling schemes and comparing models trained on datasets of different sizes. Two factors in the sampling process, pick-and-place position and robot posture, jointly determine sample size and diversity, yielding datasets of 10 (5x2), 30 (10x3), 60 (14x5), 120 (15x8), 180 (18x10), and 384 (24x16) samples. In addition, the 10-sample (5x2) set augmented to 300 images serves as a baseline for comparing embodied data acquisition against traditional augmentation based on image transformations. 20% of the 385 embodied images are held out as the validation set to represent the complex states encountered during assembly. Precision, recall, mAP@.5, and mAP@.5:.95 are employed to assess the influence of sample quantity and augmentation method on the performance of the object detection models.


Models | Datasets | Precision | Recall | mAP@.5 | mAP@.5:.95
Pose estimation | 10 | 0.233 | 0.448 | 0.156 | 0.109
 | 10-aug | 0.999 | 1.0 | 0.995 | 0.935
 | 30 | 0.750 | 0.741 | 0.774 | 0.374
 | 60 | 0.999 | 1.0 | 0.995 | 0.950
 | 120 | 0.999 | 1.0 | 0.995 | 0.994
 | 180 | 0.999 | 1.0 | 0.995 | 0.993
 | 384 | 0.999 | 1.0 | 0.995 | 0.990
Task attention | 10 | 0.022 | 0.051 | 0.012 | 0.003
 | 10-aug | 0.999 | 1.0 | 0.995 | 0.714
 | 30 | 0.686 | 0.980 | 0.815 | 0.539
 | 60 | 0.999 | 1.0 | 0.995 | 0.879
 | 120 | 0.999 | 1.0 | 0.995 | 0.936
 | 180 | 0.998 | 1.0 | 0.995 | 0.964
 | 384 | 1.0 | 1.0 | 0.995 | 0.989


The results indicate that data acquisition based on prior knowledge significantly enhances the environmental perception capabilities of the detection models. The mAP@.5:.95 values for the two object detection models during training are displayed in Fig. 12, highlighting the importance of sample diversity and the limitations of augmentation based solely on image transformations. Regarding the effect of sample quantity, too few samples (fewer than 60) cause training to fail to converge. As the number of samples increases, the trained models obtain higher precision, recall, mAP@.5, and mAP@.5:.95 values, owing to the improved sample diversity from greater background variation and more robot-task relative postures. In particular, the performance of the local task attention model is more sensitive to sample diversity because of unavoidable occlusion during contact-rich operations. Comparing sample-generation methods, although image-transformation-based augmentation increases the number of samples and mitigates overfitting, it lacks diversity: the position estimation model achieved acceptable results, while the task attention model performed poorly. Embodied data collection increases mAP@.5:.95 by 5.5% in global perception and 27.5% in local perception compared to the existing data augmentation method.

V-B3 Residual Reinforcement Learning of Fine Manipulation

This stage trains a residual policy, supported by a hand-designed base policy and a task-focused view, with the master object in a fixed setting. Physical contact states are identified using LSTM networks that encode time-series data from touch and proprioception sensors, combined with visual feedback processed through a CNN to handle large position errors. An MLP then integrates the low-dimensional latent features from the LSTM and CNN to generate residual actions. We contrast this procedure with baselines 3, 4, and 5 to highlight the advantages of knowledge-informed learning.
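A minimal PyTorch sketch of the multimodal encoder described above (CNN for the task-focused image, LSTM for the force/proprioception time series, MLP head producing the residual action) is given below. Layer sizes, input shapes, and the bounded-output choice are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ResidualPolicyNet(nn.Module):
    """CNN (attention image) + LSTM (wrench/proprioception sequence) -> MLP residual action."""

    def __init__(self, seq_dim=12, action_dim=6, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                       # encodes the ROI attention image
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden), nn.ReLU(),
        )
        self.lstm = nn.LSTM(seq_dim, hidden, batch_first=True)  # encodes touch/proprioception history
        self.head = nn.Sequential(                      # fuses latents into a residual action
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # bounded residual force/torque command
        )

    def forward(self, image, sequence):
        img_feat = self.cnn(image)                      # (B, hidden)
        _, (h, _) = self.lstm(sequence)                 # h: (num_layers, B, hidden)
        return self.head(torch.cat([img_feat, h[-1]], dim=-1))

# Example forward pass with assumed shapes: a 96x96 ROI image and 10 steps of 12-D sensor data.
policy = ResidualPolicyNet()
action = policy(torch.zeros(1, 3, 96, 96), torch.zeros(1, 10, 12))
```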


The results suggest that our approach facilitates more efficient and effective learning by concentrating on task-relevant details while addressing intricate contact dynamics and positional uncertainties. The success rate and error range throughout the training of the manipulation policy are presented in Fig. 11. In baseline 3, without the base policy, pure RL struggles in precise insertion tasks due to a local optimum created by penalizing contact forces. Combining the base policy with an RL-based residual policy in baseline 4 can succeed, but the curriculum only reaches an 8 mm error range in force-based residual policy training, because the limited observations of contact force and vision-based pose estimation create a partially observable Markov decision process (POMDP). In baseline 5, using raw visual information improves the observations, and the curriculum reaches around 20 mm. The ROI-based attention in our approach restricts the policy to a limited set of features and introduces perturbations due to finite detection accuracy, but this only marginally affects learning efficiency.

V-B4 Cognitive Manipulation in Semi-structured Environment

After the individual training phases, the integrated policy is evaluated in terms of success rate and completion steps for assembly tasks in a semi-structured environment. The manipulator grasps the slave object, a gear, while the master object, a task board, is randomly positioned within a confined workspace of 350 × 350 mm. We carry out 16 trials to compare our method with the other baselines, excluding the non-convergent baseline 3.


Methods | Success rate | Completion steps
Baseline 1 | 0.125 | 64.03 ± 18.45
Baseline 2 | 0.313 | 104.3 ± 23.63
Baseline 4 | 0.87 | 85.3 ± 19.11
Baseline 5 | 0.67 | 73.4 ± 32.86
Ours | 1.0 | 54.4 ± 17.17

The results indicate that our approach exhibits superior performance on this challenging assembly task, achieving a perfect success rate and significantly fewer completion steps, as outlined in Table III. Baseline 1 completes only 12.5% of trials with an average of 64 steps, while baseline 2 completes only 31.3% of trials with an average of 104 steps, close to the maximum step limit. Their object detection, single-camera setups, and simple model-based control fall short of the task's accuracy requirements and the environment's uncertainty. Although random residual actions can help compensate for perception errors, the semi-structured environment poses additional challenges due to movable objects: the contact force generated during a random search can displace the task board, leading to larger errors or even causing the gear to slip off the peg. Baseline 4 outperforms baseline 2 in both success rate and completion steps, because force-based agents can enhance the search policy by regulating the contact force and position reference based on the contact state estimated from interaction forces. Although baseline 5 is more robust than baseline 4 during training, it performs only 13.95% better in completion steps and even worse in success rate; raw visual information lets the residual policy compensate for larger errors in training, but its performance degrades at different locations with different backgrounds. In comparison, the proposed method uses visual attention to help the agent focus on the task, resulting in a success rate of 1.0 with an average of 54.4 completion steps. In conclusion, the success rate is increased by 13% and the number of steps is reduced by 15.4% compared to the competing methods.

V-C Comprehensive Evaluation on Real Tasks

The primary objective of this research is to develop and validate a cognitive manipulation framework suitable for robot learning in real-world robotic applications. To assess the effectiveness of our proposed architecture, we conducted experiments using a UR5 robot on two precision assembly tasks: peg-in-hole and gear-insertion. These tasks, depicted in Fig. 13(a), are designed to test the robot's ability to handle complex manipulations in real settings. The robot was programmed to perform tasks based on geometric information derived from a teaching phase, which was used to construct a skill graph that encapsulates common assembly knowledge. Critical points, including the grasp and bottleneck poses, were identified in semi-structured environments to facilitate hand-eye-task calibration and semi-supervised fine-tuning of object detection, as shown in Fig. 13(b) and (c). In structured environments, critical points including the grasp and assembly goal poses guided the learning of contact-rich fine manipulation, as shown in Fig. 13(d). An Object-Embodiment-Centric (OEC) task representation, incorporating home, grasp, bottleneck, and assembly goal points, was employed to reconstruct the basic operational strategy. This strategy was integrated with the visual and fine manipulation models to accomplish assembly tasks within a confined area of 500 × 500 mm, as shown in Fig. 13(e).


The training process was optimized based on insights from the simulation experiments, focusing on minimizing training cost while maximizing operational efficiency. For object detection, 5 points were sampled from the workspace to improve environmental robustness. We captured five images from a global perspective at each point for pose estimation and an additional 18 images per point from a local view to enhance task-specific attention. This embodied data collection strategy ensured diversity, and data augmentation was further applied to enhance robustness against robot pose variations during manipulation. The fine manipulation training extended over 150 episodes, which proved adequate for achieving resilience against uncertainties in the base policy arising from pose estimation errors and unknown contact dynamics. We evaluated the success rate and completion time of the assembly tasks, using these metrics to benchmark our proposed architecture against two baselines.

Task | Methods | Success rate | Completion time
Peg-in-hole | Baseline 1 | 0.18 | 8.13 ± 4.96
 | Baseline 2 | 0.437 | 18.06 ± 6.65
 | Ours | 0.937 | 6.11 ± 1.32
Gear-insertion | Baseline 1 | 0.06 | 9.2 ± 3.65
 | Baseline 2 | 0.313 | 19.10 ± 5.75
 | Ours | 0.875 | 7.04 ± 1.42

The results, detailed in Table IV, indicate that our approach significantly outperformed the baselines on both tasks. Baselines 1 and 2 struggled with the tasks, primarily due to inaccuracies in pose estimation and control and their inability to handle the semi-structured environment. In addition, traditional search-based methods proved ineffective because of the mobility of the master object, often causing the robot to get stuck on the object's surface. Their performance also relies heavily on expert experience and parameter tuning: it took approximately 8 hours and multiple attempts to gather samples and fine-tune policy and controller parameters for a new task, whereas our method required only 2.58 hours and minimal human intervention. Both the global localization and local attention models can be trained within 0.58 hours, including 20 minutes for sampling and 15 minutes for training the two models. Learning a residual policy robust to a 15 mm error took 150 episodes, consuming 2 hours. These experimental results underscore the practical applicability of our cognitive manipulation framework in complex real-world environments, improving learning efficiency, reducing engineering effort, and demonstrating precise and efficient manipulation for robotic assembly.

VI Discussion

The architecture presented in this study leverages a skill graph that merges the generalization capabilities of a pre-trained object detection model with the optimization ability of reinforcement learning, facilitating efficient learning with minimal reliance on extensive human knowledge and interaction data. Separate training of the different components has proven more practical for real robots in precise assembly tasks than end-to-end training methods [29]. This method mirrors human learning, in which theoretical knowledge is acquired first and practiced in controlled settings before tackling complex real-world tasks. The skill graph not only enhances the efficiency of object detection learning by providing explicit prior knowledge for sample collection and annotation, but also enables the system to estimate the assembly target pose, similar to [28]. Motion planning guided by the skill graph enables diverse data collection from various perspectives and locations, reducing the need for manual labeling through low-cost calibration and automated labeling. The large pre-trained model benefits from this setup, generalizing effectively after fine-tuning with implicit prior knowledge. Furthermore, the learned visual model excels in providing interpretable spatial location and task correlation information, surpassing structured visual representations [13] and uncertainty-aware pose estimation [32]. This information is essential for guiding and constraining the exploration process in reinforcement learning, enabling the system to efficiently learn about contact dynamics and pose uncertainty. The residual policy, guided by the base policy and focused multimodal observation, is optimized through a multi-objective reward, enhancing the capability to tackle complex tasks and generalize across contexts without the need for fixtures. Therefore, separating the learning tasks and guiding them with prior knowledge significantly enhances learning efficiency in controlled environments.

Compared to existing combinations of model-based and learning-based approaches [32, 8, 9], our cognitive manipulation method excels in semi-structured environments. It mimics the human approach of transitioning from global to local perception and from coarse to fine manipulation. In contact-free regions, object detection resolves the positional uncertainty of the master object caused by the absence of fixtures. The skill graph enables global perception beyond the workspace to avoid occlusion and directs coarse operations with rich geometric information, facilitating flexible and safe robot movement. In contact-rich regions, object detection provides visual attention on the task and suppresses the variable background interference caused by other dynamic objects. The residual policy integrates task-focused visual and tactile information to handle pose estimation error and complex contact dynamics.

The partial models within our method are utilized to address diverse configurations in semi-structured environments. While this study primarily focuses on knowledge-driven robot learning and experiments validate the impact of such learning on efficiency and strategy robustness, this approach can be adapted to different environments by acquiring geometric information through teaching and adjusting temporal logic and transition conditions within the partial model.

VII Conclusion

This study introduces a novel cognitive manipulation framework for robotic assembly tasks in semi-structured environments. The framework employs a skill graph that integrates object detection, coarse operation planning, and fine operation execution. The training process, guided by the skill graph and coarse-operation planning in controlled environments, involves semi-supervised learning for object detection and residual reinforcement learning of a multimodal fine-operation policy. The cognitive manipulation models are subsequently transferred to a semi-structured environment, where object detection and coarse operation, enhanced by the skill graph, handle the uncertainty of the environment and guide the residual policy in addressing pose estimation and contact dynamics uncertainty. Simulation results demonstrate that our cognitive manipulation reduces manual annotation cost by 97.4% and enables learning an assembly task involving a 20 mm error and a 0.1 mm clearance within 300 episodes, showing significant progress in semi-structured environments, where existing methods struggle, with a 13% increase in success rate and a 15.4% reduction in completion steps. The practicality of the method was further confirmed in real-world experiments.

While learning efficiency and generalization in semi-structured environments have been substantially enhanced, challenges persist. Our method effectively utilizes prior knowledge to streamline the learning of contact-rich manipulation, in particular simplifying the reinforcement learning challenges associated with uncertainties in pose estimation and contact dynamics, yet there is room to improve learning efficiency further. Future work will focus on advancing learning efficiency through offline enhancement methods, including sim-to-real transfer and meta-learning, to streamline the residual reinforcement learning process. Efficient learning from prior knowledge also opens the possibility of contact-rich manipulation multitasking or meta-learning on real robots. In addition, our approach, which utilizes object detection and skill graphs, aims to mitigate uncertainties in semi-structured environments, but further generalization to diverse environments and tasks remains a goal. Future research could explore more sophisticated 3D or 6D pose estimation techniques, develop more precise quality estimation and monitoring methods based on visuo-tactile fusion, and incorporate large language models (LLMs) for common-sense reasoning. Considering complex state and exception handling, such as addressing failure and success scenarios, could reduce assumptions about semi-structured environments and enhance the quality and reliability of robot operations.

References

  • [1] A. Perzylo, M. Rickert, B. Kahl, N. Somani, C. Lehmann, A. Kuss, S. Profanter, A. B. Beck, M. Haage, M. R. Hansen et al., “Smerobotics: Smart robots for flexible manufacturing,” IEEE Robotics & Automation Magazine, vol. 26, no. 1, pp. 78–90, 2019.
  • [2]J.Hughes, K.Gilday, L.Scimeca, S.Garg, and F.Iida, “Flexible, adaptive industrial assembly: driving innovation through competition: Flexible manufacturing,” Intelligent Service Robotics, vol.13, pp. 169–178, 2020.
  • [3]K.Dharmara, R.P. Monfared, P.S. Ogun, and M.R. Jackson, “Robotic assembly of threaded fasteners in a non-structured environment,” The International Journal of Advanced Manufacturing Technology, vol.98, pp. 2093–2107, 2018.
  • [4]K.Nottensteiner, A.Sachtler, and A.Albu-Schäffer, “Towards autonomous robotic assembly: Using combined visual and tactile sensing for adaptive task execution,” Journal of Intelligent & Robotic Systems, vol. 101, no.3, p.49, 2021.
  • [5]F.Suárez-Ruiz, X.Zhou, and Q.-C. Pham, “Can robots assemble an ikea chair?” Science Robotics, vol.3, no.17, p. eaat6385, 2018.
  • [6]H.Chen, G.Zhang, H.Zhang, and T.A. Fuhlbrigge, “Integrated robotic system for high precision assembly in a semi-structured environment,” Assembly Automation, vol.27, no.3, pp. 247–252, 2007.
  • [7]J.Luo, O.Sushkov, R.Pevceviciute, W.Lian, C.Su, M.Vecerik, N.Ye, S.Schaal, and J.Scholz, “Robust multi-modal policies for industrial assembly via reinforcement learning and demonstrations: A large-scale study,” arXiv preprint arXiv:2103.11512, 2021.
  • [8]Y.Shi, Z.Chen, H.Liu, S.Riedel, C.Gao, Q.Feng, J.Deng, and J.Zhang, “Proactive action visual residual reinforcement learning for contact-rich tasks using a torque-controlled robot,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).IEEE, 2021, pp. 765–771.
  • [9]J.Zhao, Z.Wang, L.Zhao, and H.Liu, “A learning-based two-stage method for submillimeter insertion tasks with only visual inputs,” IEEE Transactions on Industrial Electronics, 2023.
  • [10]S.Stevsic, S.Christen, and O.Hilliges, “Learning to assemble: Estimating 6d poses for robotic object-object manipulation,” IEEE Robotics and Automation Letters, p. 1159–1166, Apr 2020.
  • [11]Q.Yu, C.Hao, J.Wang, W.Liu, L.Liu, Y.Mu, Y.You, H.Yan, and C.Lu, “Manipose: A comprehensive benchmark for pose-aware object manipulation in robotics,” 2024.
  • [12]M.Köhler, M.Eisenbach, and H.-M. Gross, “Few-shot object detection: A comprehensive survey,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21, 2023.
  • [13]F.Zhang, Y.Chen, H.Qiao, and Z.Liu, “Surrl: Structural unsupervised representations for robot learning,” IEEE Transactions on Cognitive and Developmental Systems, vol.15, no.2, p. 819–831, Jun 2023. [Online]. Available: http://dx.doi.org/10.1109/tcds.2022.3187186
  • [14] S. Demura, K. Sano, W. Nakajima, K. Nagahama, K. Takeshita, and K. Yamazaki, “Picking up one of the folded and stacked towels by a single arm robot,” in 2018 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2018, pp. 1551–1556.
  • [15]L.Yen-Chen, A.Zeng, S.Song, P.Isola, and T.-Y. Lin, “Learning to see before learning to act: Visual pre-training for manipulation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).IEEE, 2020, pp. 7286–7293.
  • [16]Y.-Z. Hsieh, F.-X. Xu, and S.-S. Lin, “Deep convolutional generative adversarial network for inverse kinematics of self-assembly robotic arm based on the depth sensor,” IEEE Sensors Journal, vol.23, no.1, pp. 758–765, 2022.
  • [17]L.Johannsmeier, M.Gerchow, and S.Haddadin, “A framework for robot manipulation: Skill formalism, meta learning and adaptive control,” 2019 International Conference on Robotics and Automation (ICRA), pp. 5844–5850, 2018.
  • [18]X.Li, Z.T. Serlin, G.Yang, and C.A. Belta, “A formal methods approach to interpretable reinforcement learning for robotic planning,” Science Robotics, vol.4, 2019.
  • [19]X.Liu, G.Wang, Z.Liu, Y.T. Liu, Z.Liu, and P.Huang, “Hierarchical reinforcement learning integrating with human knowledge for practical robot skill learning in complex multi-stage manipulation,” IEEE Transactions on Automation Science and Engineering, 2023.
  • [20]R.Sun, “Dual-process theories, cognitive architectures, and hybrid neural-symbolic models,” Neurosymbolic Artificial Intelligence, pp. 1–9, 03 2024.
  • [21]C.Yang, C.Chen, W.He, R.Cui, and Z.Li, “Robot learning system based on adaptive neural control and dynamic movement primitives,” IEEE Transactions on Neural Networks and Learning Systems, vol.30, pp. 777–787, 2019.
  • [22]R.Rayyes, H.Donat, J.J. Steil, and M.Spranger, “Interest-driven exploration with observational learning for developmental robots,” IEEE Transactions on Cognitive and Developmental Systems, vol.15, pp. 373–384, 2023.
  • [23]H.-C. Song, Y.-L. Kim, and J.-B. Song, “Automated guidance of peg-in-hole assembly tasks for complex-shaped parts,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems.IEEE, 2014, pp. 4517–4522.
  • [24]M.G. Krishnan, A.T. Vijayan, and A.Sankar, “Performance enhancement of two-camera robotic system using adaptive gain approach,” Industrial Robot: The International Journal of Robotics Research and Application, vol.47, no.1, pp. 45–56, 2020.
  • [25]Y.-C. Peng, D.Jivani, R.J. Radke, and J.Wen, “Comparing position-and image-based visual servoing for robotic assembly of large structures,” in 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE).IEEE, 2020, pp. 1608–1613.
  • [26]R.Haugaard, J.Langaa, C.Sloth, and A.Buch, “Fast robust peg-in-hole insertion with continuous visual servoing,” in Conference on Robot Learning.PMLR, 2021, pp. 1696–1705.
  • [27]F.Mou, H.Ren, B.Wang, and D.Wu, “Pose estimation and robotic insertion tasks based on yolo and layout features,” Engineering Applications of Artificial Intelligence, vol. 114, p. 105164, 2022.
  • [28] S. Stevsic, S. Christen, and O. Hilliges, “Learning to assemble: Estimating 6d poses for robotic object-object manipulation,” IEEE Robotics and Automation Letters, vol. 5, pp. 1159–1166, 2020.
  • [29]A.Y. Yasutomi, H.Ichiwara, H.Ito, H.Mori, and T.Ogata, “Visual spatial attention and proprioceptive data-driven reinforcement learning for robust peg-in-hole task under variable conditions,” IEEE Robotics and Automation Letters, vol.8, pp. 1834–1841, 2023.
  • [30]G.Schoettler, A.Nair, J.Luo, S.Bahl, J.A. Ojea, E.Solowjow, and S.Levine, “Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards,” 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5548–5555, 2019.
  • [31]Y.Wang, L.Zhao, Q.Zhang, R.Zhou, L.Wu, J.Ma, B.Zhang, and Y.Zhang, “Alignment method of combined perception for peg-in-hole assembly with deep reinforcement learning,” J. Sensors, vol. 2021, pp. 5 073 689:1–5 073 689:12, 2021.
  • [32]M.A. Lee, C.Florensa, J.Tremblay, N.Ratliff, A.Garg, F.Ramos, and D.Fox, “Guided uncertainty-aware policy optimization: Combining learning and model-based strategies for sample-efficient policy learning,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).IEEE, 2020, pp. 7505–7512.
  • [33]K.Ahn, M.-W. Na, and J.-B. Song, “Robotic assembly strategy via reinforcement learning based on force and visual information,” Robotics Auton. Syst., vol. 164, p. 104399, 2023.
  • [34]Z.Zhang, Y.Wang, Z.Zhang, L.Wang, H.Huang, and Q.Cao, “A residual reinforcement learning method for robotic assembly using visual and force information,” Journal of Manufacturing Systems, 2024.
  • [35]P.Chen and W.Lu, “Deep reinforcement learning based moving object grasping,” Information Sciences, vol. 565, pp. 62–76, 2021.
  • [36]P.Jin, Y.Lin, Y.Song, T.Li, and W.Yang, “Vision-force-fused curriculum learning for robotic contact-rich assembly tasks,” Frontiers in Neurorobotics, vol.17, 2023.
  • [37] H. Chen, W. Wan, M. Matsushita, T. Kotaka, and K. Harada, “Automatically prepare training data for yolo using robotic in-hand observation and synthesis,” IEEE Transactions on Automation Science and Engineering, 2023.
  • [38]J.Borja-Diaz, O.Mees, G.Kalweit, L.Hermann, J.Boedecker, and W.Burgard, “Affordance learning from play for sample-efficient policy learning,” in 2022 International Conference on Robotics and Automation (ICRA).IEEE, 2022, pp. 6372–6378.
  • [39]C.Xiong, N.Shukla, W.Xiong, and S.-C. Zhu, “Robot learning with a spatial, temporal, and causal and-or graph,” 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2144–2151, 2016.
  • [40]X.Yu, B.Li, W.He, Y.Feng, L.Cheng, and C.Silvestre, “Adaptive-constrained impedance control for human–robot co-transportation,” IEEE Transactions on Cybernetics, vol.52, no.12, pp. 13 237–13 249, 2022.
  • [41]M.Tavassoli, S.Katyara, M.Pozzi, N.Deshpande, D.G. Caldwell, and D.Prattichizzo, “Learning skills from demonstrations: A trend from motion primitives to experience abstraction,” IEEE Transactions on Cognitive and Developmental Systems, vol.16, pp. 57–74, 2022.
  • [42] J. Li, Z. Li, F. Chen, A. Bicchi, Y. Sun, and T. Fukuda, “Combined sensing, cognition, learning, and control for developing future neuro-robotics systems: A survey,” IEEE Transactions on Cognitive and Developmental Systems, vol. 11, no. 2, pp. 148–161, 2019.
  • [43]Z.Xie and Y.Jin, “An extended reinforcement learning framework to model cognitive development with enactive pattern representation,” IEEE Transactions on Cognitive and Developmental Systems, vol.10, no.3, p. 738–750, Sep 2018. [Online]. Available: http://dx.doi.org/10.1109/tcds.2018.2796940
  • [44] I. Lopez-Juarez, J. Corona-Castuera, M. Peña-Cabrera, and K. Ordaz-Hernandez, “On the design of intelligent robotic agents for assembly,” Information Sciences, pp. 377–402, May 2005. [Online]. Available: http://dx.doi.org/10.1016/j.ins.2004.09.011
  • [45]H.Chen, G.Zhang, H.B. Zhang, and T.A. Fuhlbrigge, “Integrated robotic system for high precision assembly in a semi‐structured environment,” Assembly Automation, vol.27, pp. 247–252, 2007.
  • [46]F.von Drigalski, K.Kasaura, C.C. Beltran-Hernandez, M.Hamaya, K.Tanaka, and T.Matsubara, “Uncertainty-aware manipulation planning using gravity and environment geometry,” IEEE Robotics and Automation Letters, vol.7, no.4, pp. 11 942–11 949, 2022.
  • [47]J.Eßer, N.Bach, C.Jestel, O.Urbann, and S.Kerner, “Guided reinforcement learning: A review and evaluation for efficient and effective real-world robotics [survey],” IEEE Robotics & Automation Magazine, vol.30, pp. 67–85, 2023.
  • [48]C.Wang, C.Su, B.Sun, G.Chen, and L.Xie, “Extended residual learning with one-shot imitation learning for robotic assembly in semi-structured environment,” Frontiers in Neurorobotics, vol.18, 2024.
  • [49]J.Zhang, W.Wan, N.Tanaka, M.Fujita, K.Takahashi, and K.Harada, “Integrating a pipette into a robot manipulator with uncalibrated vision and tcp for liquid handling,” IEEE Transactions on Automation Science and Engineering, 2023.
  • [50]A.Ranjbar, N.A. Vien, H.Ziesche, J.Boedecker, and G.Neumann, “Residual feedback learning for contact-rich manipulation tasks with uncertainty,” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2383–2390, 2021.
Chuang Wang received the B.S. degree from the School of Mechanical and Power Engineering, Zhengzhou University, China, in 2017. He is currently pursuing the Ph.D. degree with the Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, China. His research interests include robotic manipulation, compliance control, deep reinforcement learning, and assembly robots.
Lie Yang received the Ph.D. degree from the Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, China, in 2021. He was a lecturer in the School of Computer Science and Technology, Hainan University, in 2022. He is currently a Research Fellow with the Department of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore. His research interests mainly focus on deep learning, computer vision, pattern recognition, driver state monitoring, and brain-computer interfaces.
Ze Lin is currently pursuing the B.S. degree in Intelligent Manufacturing Engineering with the Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, China. His current research interests include robotic manipulation, machine vision, and assembly robots.
Yizhi Liao received an M.S. degree in Advanced Computer Science from the University of Sheffield, United Kingdom. He is currently pursuing a second M.S. degree in Information Technology at the University of Melbourne, Australia. His current research interests include robotic manipulation and computer vision.
Gang Chen received the bachelor’s and master’s degrees in mechanical engineering from Shanghai Jiao Tong University, Shanghai, China, in 2012 and 2015, respectively, and the Ph.D. degree in mechanical and aerospace engineering from the University of California, Davis, Davis, CA, in 2020. He was a research fellow with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, from 2020 to 2021. He is currently an associate professor at the Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, China. His research interests include machine learning, formal methods, control, signal processing, and fault diagnosis.
Longhan Xie received the B.S. and M.S. degrees in mechanical engineering in 2002 and 2005, respectively, from Zhejiang University, and the Ph.D. degree in mechanical and automation engineering in 2010 from the Chinese University of Hong Kong. From 2010 to 2016, he was an Assistant Professor and then an Associate Professor in the School of Mechanical and Automotive Engineering at the South China University of Technology. Since 2017, he has been a Professor in the Shien-Ming Wu School of Intelligent Engineering at the same university. His research interests include biomedical engineering and robotics. He is a member of ASME and IEEE.