
Random is the New Black: Real-World Position Estimation from Simulated Images

Simulation is a cheap and abundant source of data, which allows training deep neural networks on huge tailor-made datasets. Especially the ease of data labeling – in particular for costly labels such as pixel-wise segmentation masks – makes simulation an interesting tool in deep learning. However, training a network on simulated data with the intent of real-world application gives rise to the simulation-to-reality (or sim2real) gap: a gap that arises from the difference between simulated and real data and that impairs the performance of the trained network.

In this blog post, I will briefly describe my latest effort to reduce the sim2real gap on the task of vision-based position (and later pose) estimation of a target object (at first a simple cube) during robotic manipulation. I am extending an approach presented by Tobin et al., known as domain randomization, which randomizes the textures of all objects in simulation. The key idea is to train on many variations of the same scene in the hope that the real-world scene appears as just another variation.

Unlike the original approach from Tobin et al., I have found it advantageous to train two networks instead of one: one for foreground separation, and one for position/pose estimation based on background-pruned images (i.e., images containing only the foreground). This improves performance on real-world scenes with very complex and crowded backgrounds. I train the first network using semantic segmentation. The second network simply maps an image to a vector describing the position/pose of the target object. Let me quickly summarize what this blog will cover:

  • What randomizations I did during dataset generation.
  • How I trained a foreground separation network.
  • How I trained a position/pose estimation network on background-pruned images.
  • What are the current challenges and what could I do better?
  • Is training on simulated data instead of real-world data worth it?

This blog will not cover:

  • The full project code (only a few small, illustrative code sketches appear below).
  • Underlying maths.

Note, however, that the respective code is going to be uploaded to my GitHub in the near future! I will announce it here once it is uploaded.

Further, it is recommended that the reader is (at least a bit) familiar with computer vision. That said, let’s take a look at the data generation process.


Dataset Generation

I am using RLBench together with a model of the Panda robot arm as my simulation environment. The former is a recently introduced reinforcement learning benchmark targeted towards realistic robotic manipulation. The key intent during dataset generation is to randomize as much as possible. Here is a list of all randomization steps:

  1. Random camera positions: Three cameras are used to collect data; one collecting front-facing images, and two shoulder cameras. All camera feeds are used for foreground separation, while only the front-facing camera feed is used for position/pose estimation. In each step of the simulation, the position of each camera is randomly chosen within a boundary. The cameras are oriented such that the images always include the base of the Panda robot as well as the target. The resulting image is then randomly rotated.
  2. Random textures: Approximately 2000 textures are taken from the Describable Textures Dataset (DTD), which includes real and artificial textures. These textures are applied to all surfaces in the simulation, and the texture of an object is randomly changed in each time step of the simulation. Note that the target object does not get random textures. Further, 50% of the generated images show a white Panda robot.
  3. Random distractors: 77 textured 3D models are taken from the Yale-CMU-Berkeley (YCB) object and model set, which contains real-world objects specifically targeted towards robotic manipulation. At the beginning of each episode, a random number of these objects is loaded, and at each time step they are randomly placed within a boundary. Physics is turned off for the distractors: they can float in the air and do not collide with other objects or the robot, so they do not interfere with the task.
  4. Random real-world background: 50% of the time, the background of the simulated image is substituted with a randomly chosen real-world image. To this end, random images from the COCO dataset, which contains over 330,000 real-world images, are used. Distractors are kept in the new image as well (see the code sketch after this list). For fine-tuning, domain-specific background images – e.g., images of the laboratory where the physical robot resides – are used.
  5. Random lights: 3 light sources are randomly placed above the scene. Their positions change at each time step within a boundary. Similar to the cameras, they are re-oriented to face the scene after changing the position. 50% of the time, shadows are rendered.
  6. Random robot configuration and target object position: The joint configuration of the Panda robot arm is constantly changed using either a random or a trained policy. Further, the position of the target is constantly changed within a boundary.
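To make the background-substitution step a bit more concrete, here is a minimal sketch of how a simulated image, its segmentation mask, and a randomly drawn COCO image could be composited. The paths, the label id BACKGROUND_LABEL, and the function name are made up for illustration; the actual pipeline is part of my simulation code.

```python
import random
from pathlib import Path

import numpy as np
from PIL import Image

BACKGROUND_LABEL = 0                 # assumed label id of the background class
COCO_DIR = Path("coco/train2017")    # hypothetical location of the COCO images


def substitute_background(sim_image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Replace all background pixels of a simulated image with a random COCO image."""
    coco_path = random.choice(list(COCO_DIR.glob("*.jpg")))
    background = Image.open(coco_path).convert("RGB").resize(sim_image.shape[1::-1])
    background = np.asarray(background)

    # Keep the foreground (robot, target, distractors) and paste the real-world background.
    is_background = mask == BACKGROUND_LABEL
    return np.where(is_background[..., None], background, sim_image).astype(np.uint8)
```

Since the distractors are part of the foreground in the mask used here, they remain visible on the new background, matching the description above.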

During simulation, I collected around 500,000 images, their respective segmentation masks, as well as the respective poses of the target object. Here are some samples taken from the dataset:


Dataset used for foreground separation and position/pose estimation.

Let’s go through the samples column by column. The first column shows the images taken in simulation. The textures on nearly all surfaces are randomized and distractors are added. The second column shows the mask, where the table, the floor, and the wall are considered as one joint label: the background. The third column shows the image with a random real-world background, created using the mask from the second column. The fourth column shows the ground truth mask – a mask without distractors – used for training the foreground separation network. Finally, the fifth column shows the background-pruned images (created from either of the images and the ground truth mask) used for training the position/pose estimation network.

Even though I have already applied a lot of randomization steps, it is very important to apply data augmentation as well. This is essential as the images still share some properties like contrast, saturation, hue, or brightness. Moreover, I wanted to try both colored and grayscale images. Here are the data augmentation steps I have been using in my training pipeline:

| Non-spatial Data Augmentation (Colored Images) | Non-spatial Data Augmentation (Colored and Grayscale Images) | Spatial Data Augmentation (Colored and Grayscale Image + Mask) |
|---|---|---|
| Random appearance (hue value) | Random brightness parameter | Random compression in x or y axis |
| Random saturation parameter | Gaussian noise with random variance | Random shear in x or y axis |
| Random contrast parameter | Gaussian blur with random variance | Random rotation |

Data augmentation used in the training pipeline.

Note that the non-spatial augmentations were used for training both the foreground separation network and the position/pose estimation network, while the spatial augmentations were only used for training the foreground separation network. While most of these data augmentation techniques are rather standard, adding Gaussian blur is quite unusual (at least I have not seen it used before). I added it because many real-world images show motion blur (as the robot is constantly moving); that is, the borders of a moving object become blurred. Without Gaussian blur, semantic segmentation did not work very well on those images (to be honest, not at all). So take it as an insider tip!
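For readers who prefer code over tables, here is a sketch of how such an augmentation pipeline could look with torchvision. The parameter ranges are illustrative, not the exact values I used, and the spatial part would have to be applied to image and mask jointly with the same parameters (omitted here for brevity).

```python
import torch
from torchvision import transforms

# Non-spatial augmentations (applied to the image only).
non_spatial = transforms.Compose([
    transforms.RandomGrayscale(p=0.5),  # train on colored and grayscale images
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # mimics motion blur
    transforms.ToTensor(),
    # Additive Gaussian noise with a fixed, illustrative standard deviation.
    transforms.Lambda(lambda x: torch.clamp(x + 0.02 * torch.randn_like(x), 0.0, 1.0)),
])

# Spatial augmentations (compression/scale, shear, rotation on image *and* mask).
spatial = transforms.RandomAffine(degrees=15, scale=(0.8, 1.2), shear=10)
```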

Okay, now that we have our dataset, let’s take a look at the actual networks and training processes, starting with the foreground separation network.


Foreground Separation Network

As mentioned before, I am training a semantic segmentation network to separate the relevant foreground (robot and target) from the background. In semantic segmentation, each pixel of an input image is mapped to a certain class; that is, the segmentation network takes an image as input and outputs a segmentation mask. I have used semantic segmentation instead of instance segmentation to keep things simple on the first trial. However, it is straightforward to extend this approach to instance segmentation as well. If you have no clue what the difference between semantic and instance segmentation is, check out this blog. I have used the U-Net architecture, as it is a popular choice for this task. Okay, let’s take a look at the training pipeline:
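To give a feeling for the architecture family (this is not my exact configuration), here is a heavily stripped-down U-Net-style network in PyTorch with a single skip connection; the original U-Net uses several resolution levels and far more channels.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )


class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection, for illustration only."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)   # 32 upsampled + 32 skip channels
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        s1 = self.enc1(x)                # full-resolution features
        s2 = self.enc2(self.pool(s1))    # half-resolution features
        up = self.up(s2)                 # back to full resolution
        out = self.dec1(torch.cat([up, s1], dim=1))
        return self.head(out)            # per-pixel class logits
```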

Training pipeline of the foreground segmentation network.

In the upper right corner, you can see the domain-randomized images, and in the lower left corner, the respective masks with distractors. At each step, it is decided whether the image gets a random background or not. If it does, it is chosen whether the background comes from the COCO dataset or from a small set of domain-specific images taken in the real robot’s domain (here, the laboratory in which the real robot resides). Note that the domain-specific images show only the real robot’s background and not the robot itself.

After an image is chosen, data augmentation is applied as explained above. Finally, the segmentation network is trained using the image and the target mask (a mask without distractors).
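Put together, one training step of this pipeline could look roughly like the following sketch. The tiny network from above stands in for the actual U-Net, substitute_background is the sketch from the dataset section, and augment is a hypothetical joint augmentation that also converts image and mask to tensors.

```python
import torch
import torch.nn as nn

model = TinyUNet(num_classes=2)                    # stand-in for the actual U-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()                  # per-pixel classification loss


def train_step(sim_image, mask_with_distractors, target_mask):
    # 50/50: keep the simulated background or paste a random real-world one.
    if torch.rand(1).item() < 0.5:
        image = substitute_background(sim_image, mask_with_distractors)
    else:
        image = sim_image
    image, target = augment(image, target_mask)    # hypothetical joint augmentation

    logits = model(image.unsqueeze(0))             # (1, 2, H, W) class logits
    loss = criterion(logits, target.unsqueeze(0).long())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```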


Position/Pose Estimation Network

After the foreground separation network is trained, its weights are frozen and it is used in the training pipeline of the position/pose estimation network. Let’s take a look at the training pipeline:

Training pipeline of the position/pose estimation network.

We start with the domain-randomized images, which are passed to the foreground separation network to predict a mask. These masks are then used to prune the background from the source images. Then, data augmentation is applied to the background-pruned images. Finally, the resulting images are used for optimization, where the goal is to predict either the position or the pose of the object in the background-pruned image. I have used the YOLOv2 network architecture.

Note that the training pipeline shown above can be modified such that we use the ground truth masks instead of the masks predicted by the foreground separation network. As a matter of fact, I am doing exactly this at the beginning of training, while using the foreground separation network later for fine-tuning. In doing so, the position/pose estimation network learns to deal with imperfect masks.
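A corresponding training step for the position estimator could look like the sketch below, with the segmentation network frozen and a flag switching between ground-truth and predicted masks. PoseNet and augment_nonspatial are placeholders for the YOLOv2-based regressor and the non-spatial augmentations.

```python
import torch
import torch.nn as nn

seg_net.eval()                                   # trained foreground separation network
for p in seg_net.parameters():
    p.requires_grad_(False)                      # freeze its weights

pose_net = PoseNet()                             # stand-in for the YOLOv2-based regressor
optimizer = torch.optim.Adam(pose_net.parameters(), lr=1e-4)
criterion = nn.MSELoss()


def train_step(image, gt_mask, gt_position, use_predicted_mask=False):
    if use_predicted_mask:                       # later in training / fine-tuning
        with torch.no_grad():
            mask = seg_net(image.unsqueeze(0)).argmax(dim=1)[0]
    else:                                        # beginning of training: ground truth masks
        mask = gt_mask
    pruned = image * (mask > 0).float()          # zero out all background pixels
    pruned = augment_nonspatial(pruned)          # hypothetical non-spatial augmentation

    pred = pose_net(pruned.unsqueeze(0))         # predicted (x, y, z) position
    loss = criterion(pred, gt_position.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```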


Results

Okay, let’s take a look at the results. Here you can see some real-world images, their background-pruned counterparts (generated using the masks predicted by the foreground separation network), as well as the errors calculated from the ground truth positions and the positions predicted by the position estimation network:

Results of foreground separation and position estimation on simple and hard backgrounds.

I have evaluated the networks on simple as well as hard backgrounds. For the simple images, I covered the background with a black cloth (without covering the table). The hard images simply include the crowded laboratory background. As can be seen, foreground separation works well on the simple images; on the complex backgrounds, however, it performs considerably worse. This also results in bad predictions of the position/pose estimation network. On background-pruned images that show neither the robot nor the target object, position estimation is nothing more than a random guess.

It’s important to note that the results are highly influenced by the current lighting situation in the lab. As an example, I found daylight to be more challenging (the above images were taken in the evening). Also, the type of camera, or rather its color reproduction, strongly affects the results. Here are some of the challenges I am currently facing:

  • Not enough real-world validation data: I have collected around 400 real-world validation samples labeled with poses of the target object. However, labeling these images with segmentation masks as well is very costly, which is why I have not done it yet. This is why I can currently only validate the position/pose estimation network on real-world data during training.
  • Not enough hardware resources: This was supposed to be a side project, which I was hoping to use in my main robotic manipulation projects (shown on my main page). This is why I trained it on private hardware (Intel 7700k CPU & Nvidia GTX 1070). As you might already have guessed, this took ages (or at least a few weeks), as neither the U-Net nor the YOLOv2 architecture is small. It was not cheap either (I have just realized how expensive electricity can be…).
  • Necessity of ablation experiments: I have done a lot of randomization, was it too much? Or do I need to do even more? Well, I don’t know. I need to do more ablation experiments to be able to evaluate the effect of each randomization step. Obviously, this requires even more hardware resources and proper validation data…
  • High precision needed during robotic manipulation: I have reached an average error of 3.5cm at the end of training (measured on real-world validation data). I know that I can do better when training longer (Tobin et al. reached around 1.5cm), but when separating position estimation and robotic control, I need to deal with the problem of accumulating errors: my trained position estimator has a certain error, and my trained policy for manipulation has a certain error as well. Taking the cube (shown in the samples above) with an edge length of 5cm as the target object and assuming the Panda gripper’s maximum opening of 8cm, only (8cm - 5cm) / 2 = 1.5cm are left in the best case between the cube and each gripper finger. So there is not much room for error …

Conclusion

So, what’s my final verdict on this approach? Well, this is not easy to say. But, let’s start with why I did foreground separation.

Why use foreground separation? I used foreground separation because I wanted to do better on complicated real-world backgrounds. Moreover, I wanted my approach to be more interpretable. To explain this a bit better, let’s imagine for a moment that I had directly used the position estimation network on images with backgrounds. If my network then outputs bad predictions (which it did), it is hard to say what went wrong, i.e., the decision of my network is not very interpretable. If I remove the background (which usually covers the majority of the image), I can at least ensure that it is not the cause. However, training two networks instead of one introduces yet another source of error and complicates training and validation. It is worth mentioning that there also exist approaches explicitly targeted towards enhanced interpretability of deep neural networks on vision tasks, such as saliency maps. However, such approaches are beyond the scope of this blog.

Another advantage of training a foreground separation network is that it becomes straightforward to do position estimation on images with multiple target objects: we predict a mask (which includes multiple target objects), create multiple background-pruned images each containing only one target object, and pass them separately to the position estimation network as before. This frees us from the ambiguities that occur when training a position/pose estimation network directly on images with multiple target objects. However, this requires instance segmentation rather than semantic segmentation.
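As a rough illustration (assuming per-instance masks from an instance segmentation network and a single-object position estimator, both placeholders here), the per-instance evaluation could look like this:

```python
import torch


def estimate_positions(image, instance_masks, pose_net):
    """image: (3, H, W) tensor; instance_masks: list of (H, W) boolean tensors,
    each covering the robot plus exactly one target object."""
    positions = []
    for mask in instance_masks:
        pruned = image * mask.float()            # keep only this instance (and the robot)
        with torch.no_grad():
            positions.append(pose_net(pruned.unsqueeze(0))[0])
    return positions
```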

Was it worth training on simulated data instead of real-world data? Well, it was definitely more complicated, as the randomization effort is not to be underestimated. Even though more effort needs to be put into it, I still think learning on randomized simulated data is promising. In the ideal case, this could result in a position/pose estimator that is able to generalize to a huge number of real-world scenarios. Achieving the same result with real-world data requires a really huge dataset, which is costly to obtain (especially if you are a single person like me). In my eyes, training on simulated data is especially useful when you have a very specific optimization problem that calls for a tailor-made dataset and no suitable dataset is available yet.

What do you think about training on simulated data? Let me know in the comments below! If you have any suggestions for improvement (also for future blogs), please write them in the comments as well.
