Pseudo-LiDAR Projection via Depth Maps


In many navigation and layout applications, there is a need for spatial data. Historically, this meant that most hardware choices were active ranging sensors - mainly LiDAR or RADAR.

Motivation

Both LiDAR and RADAR produce a 3D point cloud, which can also be projected into a depth map. Although both are still used extensively for spatial imaging, their biggest drawbacks have been the price, size, and processing power needed to run them efficiently.

An alternative method has been to use stereoscopic cameras - camera systems with two lenses set a fixed distance apart. Similar to how our eyes work, stereoscopic cameras use the disparity between the two images, along with the known distance between the lenses, to calculate spatial data. Typically this data comes in the form of a depth map, but it can also be converted into a 3D point cloud if the intrinsics of the two lenses are known. The downsides of stereoscopic cameras are, again, hardware cost and the relatively large housing required for two lenses. The quality of the depth map can also degrade under various environmental factors such as clutter.

So what can you do if you are limited by a tight budget or strict dimension requirements? In many computer vision applications you're stuck with a less-than-ideal, single-lens camera, yet a client may ask you to build something that requires spatial data. I was ruminating on this kind of situation and what my options could be.

One potential method of deriving usable spatial data from a traditional single-lens camera is to apply a neural network trained for monocular depth estimation. Then, if you have the camera's intrinsics (mainly the focal length and the optical center coordinates), you can generate a rough 3D point cloud. Additionally, if you have a set of reliable reference points, you can also derive a metric 3D point cloud.
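As a rough sketch of that last step: assuming the network outputs relative (affine-invariant) depth and that you have a handful of pixels whose true distance you measured by hand, you can fit a scale and shift against those references. The function and variable names below are my own, and depending on the model the output may be disparity-like, in which case the fit is better done in inverse-depth space.

```python
import numpy as np

def fit_metric_depth(relative_depth, ref_pixels, ref_distances_m):
    """Fit a scale and shift that map relative depth to metric depth.

    relative_depth  : (H, W) array from a monocular depth network
    ref_pixels      : list of (row, col) pixels with known distances
    ref_distances_m : measured metric distances (meters) for those pixels
    """
    d_rel = np.array([relative_depth[r, c] for r, c in ref_pixels])
    d_met = np.asarray(ref_distances_m, dtype=np.float64)
    # Least-squares solution of d_met ≈ a * d_rel + b
    A = np.stack([d_rel, np.ones_like(d_rel)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d_met, rcond=None)
    return a * relative_depth + b
```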

In recent years, advancements in image encoding and segmentation methods have indirectly improved many monocular depth estimation networks. One of the best papers has been Depth Anything. The basic idea of the paper is to use a "data engine" with a massive amount of unlabeled data to train a model with strong generalization abilities (very similar to the approach taken by the Segment Anything Model). Along with the data engine, the authors employ training techniques related to augmentation and auxiliary supervision to develop a foundation model for monocular depth estimation.
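For reference, here is a minimal sketch of running Depth Anything through the Hugging Face transformers depth-estimation pipeline. The checkpoint name and file path below are assumptions - the authors publish small/base/large variants, so swap in whichever one you use.

```python
from PIL import Image
from transformers import pipeline

# Assumed checkpoint name; other Depth Anything variants exist on the Hub.
depth_estimator = pipeline(task="depth-estimation",
                           model="LiheYoung/depth-anything-small-hf")

image = Image.open("scene.jpg")  # placeholder path
result = depth_estimator(image)

# "depth" is a PIL image scaled for viewing; "predicted_depth" is the raw tensor.
# Note: the raw tensor may be at the model's working resolution, so resize it
# to the original image size before back-projecting.
depth_map = result["predicted_depth"].squeeze().numpy()
```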

Methodology

The general idea is to retrieve or estimate the intrinsics of the camera you are using - mainly the focal length (a single value) and the optical center. After that, process each image through Depth Anything. Finally, based on this survey paper, use the following equations to generate the 3D point cloud:

$$ z \leftarrow d_{u,v} $$
$$ x \leftarrow \frac{z}{F}\,(u - C_{u}) $$
$$ y \leftarrow \frac{z}{F}\,(v - C_{v}) $$

where

$$ d_{u,v} \rightarrow \text{the depth value at a given image coordinate } (u, v) $$
$$ C_{u}, C_{v} \rightarrow \text{the optical center coordinates along the image } x \text{ and } y \text{ axes} $$
$$ F \rightarrow \text{the focal length of the camera} $$
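As a concrete illustration, here is a minimal numpy sketch of that back-projection, assuming a depth map already resized to the image resolution (and optionally scaled to metric units). The function and variable names are my own.

```python
import numpy as np

def backproject_to_point_cloud(depth_map, focal_length, c_u, c_v):
    """Back-project a depth map into an (N, 3) point cloud.

    depth_map    : (H, W) array, depth value d_{u,v} per pixel
    focal_length : focal length F in pixels
    c_u, c_v     : optical center coordinates in pixels
    """
    h, w = depth_map.shape
    # Pixel coordinate grids: u along image x (columns), v along image y (rows)
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    z = depth_map
    x = (z / focal_length) * (u - c_u)
    y = (z / focal_length) * (v - c_v)

    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

The resulting (N, 3) array can then be dropped into any point cloud viewer (Open3D, for example) for inspection.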

The results were okay... The model performed pretty well on all the sample images, but the point cloud was not as accurate as I had hoped. This is probably because of the camera intrinsics: I was using my phone camera and couldn't get the intrinsics from online specifications or from the operating system, so I had to calibrate it manually, which could lead to incorrect values.
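For anyone in the same spot, one way to do that manual calibration is OpenCV's standard checkerboard routine. The sketch below assumes a printed board with 9x6 inner corners and a folder of calibration photos; it is not necessarily the exact procedure I followed.

```python
import glob
import cv2
import numpy as np

# Assumed setup: a checkerboard with 9x6 inner corners, photographed
# from several angles with the same phone camera.
PATTERN = (9, 6)
object_points = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
object_points[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for path in glob.glob("calibration_shots/*.jpg"):  # placeholder path
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_pts.append(object_points)
        img_pts.append(corners)

# camera_matrix holds the focal lengths on the diagonal and (C_u, C_v)
# in the last column.
_, camera_matrix, dist_coeffs, _, _ = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
print(camera_matrix)
```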

Examples:

Fig. 1 - Projection of a water bottle sitting on top of a laptop.
Fig. 2 - Projection of a printer on the ground next to a cardboard box.
Fig. 3 - Projection of a plush toy on top of a stack of pillows. I have pillowcases; I was just washing them.
Fig. 4 - Two lamps with a TV stand in between them. The projection is pretty bad with this one.
Fig. 5 - Crowded desk with a piece of foam sticking towards the camera.