Hollywood is talking about virtual reality. At the Oculus Connect conference last month, a whole panel of Hollywood alums talked about the technology and its applications in filmmaking.
Meanwhile, heavy hitters in the industry are starting to weigh in. James Cameron hates it. David Attenborough is making a documentary with it. The recent (excellent) film Interstellar had a VR experience promoting it.
Virtual reality is a new way of communicating with your viewer, and many people with a background in traditional filmmaking find the possibilities exciting. Virtual reality, rather than just providing a window to a new world, allows directors to take control of the entire world around the viewer.
What Can You do With a VR Camera?
It doesn’t take much imagination to get excited about the idea of VR cameras. Filmmakers could literally bring audiences face to face with their characters, and immerse them in spectacular, bizarre worlds. Photographers could capture whole scenes, frozen in time, to be perused by anyone, anywhere in the world.
Documentarians could take audiences to places they would otherwise never be able to visit. They could send a VR camera to the bottom of the ocean and let viewers stand in the middle of the sunken ballroom of the Titanic. Nature documentaries could manipulate time and space, putting users among ants the size of dogs, or building immersive time lapse sequences. NASA could mount a VR camera on a Mars rover and allow millions of people to stand on the red planet.
There are also, of course, more mundane applications:
One of the keys to consumer VR success will be stereoscopic panoramic cat videos.
— John Carmack (@ID_AA_Carmack) November 6, 2014
Live VR video could also be very compelling. Sports games could be attended remotely, with VR cameras giving everyone court-side seats. Even tourism could be virtual.
Users could rent a simple telepresence robot (perhaps a Segway with a VR camera sitting on the handlebars), and pilot it around, anywhere in the world. The Segway would stream its video back live, allowing tourists to virtually “teleport” themselves across the planet to explore anywhere. It seems safe to say that VR is going to change the world.
VR filmmaking has many challenges though. How can directors move the camera while keeping the viewer comfortable? How do directors cut film without disorienting the viewer? How do they make sure the viewer is looking in the right direction to catch important plot events? Do closeups even make sense?
Maybe the biggest issues, though, are the practical ones: how do you record content for virtual reality? Rendering live VR content for games is computationally intensive, but conceptually straightforward. Recording real life, in contrast, poses some serious challenges.
The simplest solution (and the only widely used one at the moment) is panoramic video capture. In this scheme, a ball of conventional cameras is used to record video in every direction, and the results are stitched together with software to create a seamless sphere of video. These are much like the panoramas you take with your phone, but recorded simultaneously in video format. The output of the process looks something like this:
This is straightforward and cheap (you can pre-order a panoramic camera for about $700), but it has limitations. The most important is the lack of depth: the panoramas are rendered onto an infinitely large sphere, so the parallax between your eyes is zero, even for parts of the image that really should have depth, like a person standing next to you.
Despite this shortcoming, the experience provided by panoramic video is still surprisingly cool, especially for content that takes place at a distance (aerial photography is a good example). About a week ago, I built an Oculus Rift app that renders a virtual cockpit inside the video above, and the results are compelling: it feels like riding in a submarine surrounded by sea turtles the size of small buildings.
Think of this sort of VR content as a personal super-IMAX theater in which you stand suspended in the middle of a vast spherical display. The sense of place provided by spherical video is already something that’s impossible with traditional filmmaking tools. Even with its limitations, this is probably what most VR video is going to look like in the immediate future. David Attenborough’s documentary (“The Conquest of the Skies”) is being shot in this format.
Stereo Panoramic Camera
Let’s say a director is unhappy with the limitation of monoscopic panoramas. One obvious extension of the technology is to bring in side-by-side 3D technology. To do this, the hardware needs two parallel cameras facing each direction, offset by about 6.3 cm. Then, the camera uses software to stitch together two panoramic images: one for the left eye, and one for the right. The difference between them creates the illusion of depth. Products that support this experience are available, although they’re expensive ($995, plus the cost of ten GoPro cameras).
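To get a feel for how much depth information that 6.3 cm offset actually carries, here is a minimal sketch that computes the angle between the two eyes' lines of sight to a point straight ahead. The function name and the sample distances are illustrative assumptions, not taken from any particular camera; the only figure from the article is the 6.3 cm separation.

```python
import math

EYE_SEPARATION_M = 0.063  # the roughly 6.3 cm camera offset mentioned above


def parallax_degrees(distance_m, separation_m=EYE_SEPARATION_M):
    """Angle between the two eyes' lines of sight to a point straight ahead."""
    return math.degrees(2 * math.atan((separation_m / 2) / distance_m))


# Sample distances are arbitrary, chosen to span "arm's length" to "far away".
for distance in (0.5, 2, 10, 100):
    print(f"{distance:>5} m -> {parallax_degrees(distance):.3f} degrees")
```

The angle is several degrees at arm's length and shrinks to a few hundredths of a degree at 100 meters, which is why monoscopic panoramas hold up well for distant scenery and fall apart for a person standing next to you.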
In an effort to make this sort of content more mainstream, Samsung recently announced “Project Beyond”, a VR stereo panoramic camera for the Oculus-Samsung Gear VR mobile headset. The current prototype has the form factor of a small puck and uses 17 HD cameras to capture a gigapixel of image data per second.
At 30 fps, that works out to panoramic frames of roughly 17 megapixels per eye. If each frame covers the full sphere, that is about 400 pixels per eye per square degree, or around 20 pixels per linear degree. Pricing information is still something of a mystery, and Samsung emphasizes that this is not a finished project. You can see their preview video below.
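A quick back-of-the-envelope check of this pixel budget is sketched below. It takes the gigapixel-per-second figure at face value and makes two assumptions that Samsung has not confirmed: that the pixels are split evenly between the two eyes, and that each eye's panorama covers the full sphere. The function and constant names are made up for illustration.

```python
import math

PIXELS_PER_SECOND = 1e9   # "a gigapixel per second", as quoted above
FRAMES_PER_SECOND = 30
EYES = 2

# A full sphere spans 4*pi steradians, which is about 41,253 square degrees.
SPHERE_SQUARE_DEGREES = 4 * math.pi * (180 / math.pi) ** 2


def pixel_budget(pixels_per_second=PIXELS_PER_SECOND,
                 fps=FRAMES_PER_SECOND,
                 eyes=EYES):
    """Return (pixels per eye per frame, pixels per square degree, pixels per linear degree)."""
    pixels_per_eye = pixels_per_second / fps / eyes
    # Assumption: every pixel is spread evenly over the whole sphere.
    per_square_degree = pixels_per_eye / SPHERE_SQUARE_DEGREES
    per_linear_degree = math.sqrt(per_square_degree)
    return pixels_per_eye, per_square_degree, per_linear_degree


per_eye, per_sq_deg, per_deg = pixel_budget()
print(f"{per_eye / 1e6:.1f} megapixels per eye per frame")
print(f"{per_sq_deg:.0f} pixels per square degree")
print(f"{per_deg:.1f} pixels per linear degree")
```

Under those assumptions the budget comes to a little under 17 megapixels per eye, on the order of 400 pixels per square degree. Overlap between neighbouring cameras and stitching losses would push the usable figure lower.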
Stereo panoramas are clearly a better experience than their monoscopic brothers – big things look big, small things look small, objects have depth and position, and it feels a lot more like being there. That said, the experience is still far from perfect. As John Carmack describes in his Oculus Connect keynote, stereo panoramas have a lot of issues.
“…stereoscopic panoramas, whether still or videos, are absolutely a hack. There is — we know what right is and this isn’t right. What they wind up doing is you’ve got slices taken from multiple cameras, so straight ahead it’s the proper stereo for a para-wise and then over here it’s proper for this. But that means that if you’re looking at what was right for the eyes over here but you are looking at out of the corner of your eye over here, it’s definitely not right. It’s not the right disparity for the eyes.
And then even worse if you turn your head like this [rolls head], it gets all kind of bad, because it’s set up just for the eyes straight ahead. So this was an interesting thing. We’ve got the stuff where we basically know in some ways this can be poisoned, this can be a really bad experience if people you spend a lot of time slouched over. […]”
These are technical problems that could, perhaps, be resolved by better hardware. However, there is a deeper issue: what happens when you move your head? The panoramas for both eyes are still rendered at infinity: physically moving your head will result in the nauseating sensation that the world is moving with you, particularly if there are objects close to you. There’s no straightforward way to figure out what a stereoscopic image would look like from a new viewpoint.
Despite these limitations, stereo panoramic experiences are still compelling. The Gear VR platform will focus on these sorts of experiences, since they can be created with today’s cameras and displayed without taxing the rendering capabilities of the headset. Stereo panoramas will probably be the gold standard for VR content production, at least for the next few years.
An alternative to capturing two side-by-side images (as with traditional 3D movies) is to capture what are known as depth images: a single image captured from a single perspective, with an additional channel that stores, for each pixel, the distance from the lens to the surface that pixel shows.
If you’ve got that, software can simulate virtual cameras viewing the image from new perspectives, making sure to always have a new, correct image for each eye. It’s possible to generate panoramic depth images that allow for natural head movement and rotation in a way that isn’t possible with stereo panoramas. There are a few technologies you can use to capture these depth images.
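The idea of simulating a virtual camera from a depth image can be sketched on a single row of pixels. This is a toy, not how any shipping VR player works: it assumes a simple pinhole camera, a purely sideways camera move, and nearest-pixel rounding, and the names `reproject_row`, `focal_px` and `shift_m` are invented for the example. Each pixel is pushed sideways by an amount inversely proportional to its depth, and when two pixels land on the same spot the nearer one wins.

```python
def reproject_row(colors, depths, focal_px, shift_m):
    """Re-render one image row as seen from a camera moved shift_m to the right.

    colors:   list of pixel values
    depths:   list of distances in meters, one per pixel
    focal_px: focal length expressed in pixels (assumed pinhole camera)
    Returns a new row; positions no source pixel lands on are None.
    """
    width = len(colors)
    out_colors = [None] * width
    out_depths = [float("inf")] * width
    for u, (color, z) in enumerate(zip(colors, depths)):
        # Moving the camera right shifts the scene left; near pixels shift more.
        new_u = round(u - focal_px * shift_m / z)
        # Depth test: keep whichever surface is closest to the new camera.
        if 0 <= new_u < width and z < out_depths[new_u]:
            out_colors[new_u] = color
            out_depths[new_u] = z
    return out_colors


# A foreground object "F" one meter away, in front of a background "b" at four meters.
row_colors = list("bbbbbFFbbb")
row_depths = [4.0] * 5 + [1.0] * 2 + [4.0] * 3

moved = reproject_row(row_colors, row_depths, focal_px=20, shift_m=0.2)
print("".join(pixel or "." for pixel in moved))  # prints bFFb..bbb.
```

The foreground slides four pixels while the background slides only one, which is exactly the parallax a stereo panorama cannot reproduce. The dots are positions where the new viewpoint sees something the original camera never recorded.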
Time Of Flight
The version of this technology that you’re probably most familiar with is the one that’s used in the Kinect. The Kinect V2 (the version bundled with the Xbox One) relies on what’s known as a time-of-flight camera.
The theory here is straightforward: time-of-flight cameras are infrared cameras which are capable of recording not only where light is striking the sensor, but when light is striking the sensor, with a precision measured in fractions of a nanosecond. This is coupled with a color video camera and an infrared strobe light. At the start of each frame, the IR strobe flashes, illuminating the scene very briefly. By timing how long it takes each pixel to observe the flash, the camera can deduce from the speed of light how far away each pixel is from the camera.
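The arithmetic behind that last step is short enough to write down. The sketch below uses the simplified "flash and time the echo" model described above; the function names are illustrative, and real sensors use more elaborate measurement schemes than a single stopwatch per pixel.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def distance_from_round_trip(seconds):
    """Distance to a surface, given how long the flash took to go out and come back."""
    # The light covers the distance twice, so halve the total path length.
    return SPEED_OF_LIGHT_M_PER_S * seconds / 2


def timing_precision_for(depth_resolution_m):
    """How finely the round trip must be timed to resolve a given depth step."""
    return 2 * depth_resolution_m / SPEED_OF_LIGHT_M_PER_S


# A 20 nanosecond round trip corresponds to a surface about 3 meters away.
print(f"{distance_from_round_trip(20e-9):.3f} m")

# Telling apart surfaces 1 cm apart needs timing on the order of tens of picoseconds.
print(f"{timing_precision_for(0.01) * 1e12:.1f} ps")
```

Because light covers about 30 cm per nanosecond, centimeter-level depth detail demands timing in the tens of picoseconds, which is what makes these sensors hard to build.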
This technology is enormously powerful. Hackers have done some incredible things with it. By using several Kinects in an overlapping configuration, it may be possible to create a panorama of a scene, with a precise depth value for each pixel, which could be rendered in virtual reality to create an immersive experience with correct depth.
To get an idea of the sorts of results this approach produces, check out this video showing output from just the depth camera of the Kinect V2.
This is a high quality depth image – lots of detail, clean edges, and not too much noise. There are, however, some limitations: the biggest caveat is that the Kinect in this example is recording an indoor scene with carefully controlled lighting conditions.
In real world scenarios (and especially outdoors), ambient IR interference from direct and indirect sunlight and incandescent light sources can degrade the accuracy. There’s also a more fundamental problem, which is that time-of-flight cameras depend on active illumination. That puts some hard limits on how far they can see. They also don’t cope well with transparent and reflective surfaces. And, because the depth resolution is limited by the accuracy of the timing, time-of-flight cameras aren’t very useful for recording small objects, which rules out playing with scale.
A different technology for capturing depth images is known as ‘light field’ photography.
Here’s how it works: in conventional photography, the camera lens focuses incoming light onto a sensor. Each element of the sensor records the quantity of light hitting it. Light field cameras use a special sensor, in which each “pixel” is actually a tiny lens with many sensors underneath it. This allows the camera to measure not just how much light is hitting each pixel, but also the angle the light is coming in at.
This is useful for a few reasons. The simplest application is that, by changing how this large ‘light field’ is sampled, end users can refocus a photograph after it’s been taken. The application that’s interesting for VR is that light field cameras are also, incidentally, depth cameras! The angle of the incoming light from an object is a function of how far away the object is from the lens, relative to the size of the aperture. Light from far away objects reaches the whole aperture at nearly the same angle, close to perpendicular to the lens. Light from very close objects reaches the edges of the aperture at much steeper angles. From this, it’s possible to (very accurately) determine the depth map of an image.
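A rough sketch of that relationship follows. It is a deliberately simplified geometry, not a model of any real light field camera: it ignores the lens optics entirely, considers only a point on the optical axis, and looks at the ray that reaches the edge of the aperture. All names and sample numbers are assumptions chosen for illustration.

```python
import math


def ray_angle(distance_m, aperture_radius_m):
    """Angle (radians, from the optical axis) of the ray reaching the aperture edge."""
    return math.atan(aperture_radius_m / distance_m)


def depth_from_angle(angle_rad, aperture_radius_m):
    """Invert ray_angle: recover distance from the measured angle."""
    return aperture_radius_m / math.tan(angle_rad)


def depth_uncertainty(distance_m, aperture_radius_m, angle_error_rad):
    """Range of depths consistent with an angle measured to within angle_error_rad."""
    angle = ray_angle(distance_m, aperture_radius_m)
    nearest = depth_from_angle(angle + angle_error_rad, aperture_radius_m)
    if angle <= angle_error_rad:
        # The ray is too close to the axis to tell apart from a point at infinity.
        farthest = math.inf
    else:
        farthest = depth_from_angle(angle - angle_error_rad, aperture_radius_m)
    return nearest, farthest


# Hypothetical camera: 1 cm aperture radius, angles measured to 0.1 milliradian.
for distance in (0.5, 5, 200):
    near, far = depth_uncertainty(distance, 0.01, 1e-4)
    print(f"true depth {distance:>5} m -> between {near:.2f} m and {far:.2f} m")
```

In this toy model the depth estimate is tight for nearby objects, loosens with distance, and eventually cannot be told apart from infinity. A wider aperture or finer angular sampling pushes that limit outward.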
You can see some results from an early light field video camera below, and what the image looks like reprojected from a different angle.
Because it’s a passive process, the range limit and spatial accuracy are defined by the resolution and the size of the aperture, and nothing else. That means that by using magnifying lenses, it’s possible to take light-field depth images of pretty much any object at any scale under any conditions. To get an example of what’s possible with larger, more accurate light fields, watch this video, which uses several frames from a handheld light field camera to simulate a much larger light field. It generates some fairly compelling 3D geometry from it.
Light field cameras are a much less mature technology than time of flight cameras (there’s only one light field camera in the consumer market right now, and it doesn’t support video capture). That said, with more development time, light field cameras should offer a much more robust depth video experience in the long run.
Dealing with Disocclusion
There is one major problem with depth videos that’s worth mentioning: head motion. Yes, it is possible to reproject depth videos to new perspectives, and all the pixels wind up where they should be. Depth video will not make you sick. Unfortunately, it does introduce a new problem: disocclusion.
When you move your head in such a way that you’re looking at a part of the world not visible in the original image or panorama, you get a nasty visual artifact: a shadow. To get an idea of what I’m talking about, watch this video:
In that video, a programmer hacked the Kinect to render a depth video of what it’s seeing in 3D space. By moving the virtual camera, he reprojects the scene from a number of perspectives.
It’s a first generation Kinect, so the video feed is a little glitchy, but the results are pretty impressive. The biggest downside, which becomes obvious as he begins to turn the camera, is the shadows in the scene. The portion of the wall behind his body has an enormous, person-shaped hole cut out of it: the part that the camera can’t see and has no data for. These black shadows are going to appear in depth panoramas as soon as your head starts to move. So, how do VR cameras deal with these holes? Well, there are a few approaches to this problem:
The simplest solution is to actually just record the stuff around corners and behind occluding surfaces. To do this, you add more cameras — a lot more. In order to allow people to move their heads up to, say, one meter in any direction, the camera needs to be expanded to create a 2-meter-wide sphere studded with high FOV depth cameras, so that software can synthesize any viewpoint within the sphere.
This is the most robust approach, but also the least practical. A two-meter sphere of cameras isn’t a nice, portable Steadicam rig; it’s an installation, and an expensive one. This might be practical for some high-end Hollywood productions, but certainly not for most real-world applications. You can see a prototype of this idea below, implemented in the form of a live 3D teleconferencing application:
Another approach, if the video creator is primarily recording a few dynamic objects against a static backdrop, is to use a depth camera to map the environment before they start filming, and use this data to fill in holes in the recorded images. This can be done automatically using a technique called SLAM (Simultaneous Localization and Mapping), which merges many depth images to create a complete 3D map of a scene. The results look something like this:
This works pretty well, but isn’t appropriate for all situations. It’s not hard to imagine trying to film a scene in a busy public place, where much of the background consists of people who move around and occlude each other. Capturing a single static version of that scene to fill in the holes simply isn’t possible. Furthermore, for documentary, live video, or news purposes, it won’t be practical to exhaustively map the environment beforehand.
Just Making Things Up
The last approach to the problem is to resort to the usual answer in cases where you don’t have enough data: outright lies.
The insight here is that, in real life, the viewer isn’t going to be getting up and trying to walk around the scene. They’ll be sitting down, and what the software really needs to correct for is small variations in pose, caused by the viewer leaning and shifting in their seat. The disocclusions simply won’t be that big. That means that the data used to fill in the holes doesn’t actually have to be accurate; it just has to look plausible. Those of you who have played with Photoshop’s Content-Aware Fill (or its competitors) know where this is going.
As it turns out, researchers have come up with some pretty good algorithms for filling in holes in live video streams in real time. You can check out some examples below:
Imagine decomposing a depth image into layers, subtracting them out one at a time to see where shadows could possibly occur, and then using these sorts of in-painting algorithms to generate plausible images to fill in the holes.
This is a little bit harder than simple 2D in-painting, since the algorithm also needs to make up reasonable depth values for the holes, but many of the same techniques can be used. These approaches won’t work perfectly in all situations, but so long as those artifacts are less intrusive than big black holes in the world, that still counts as a win.
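To make the "make up color and depth" idea concrete, here is a deliberately crude sketch that works on one row of pixels. It is not one of the research in-painting algorithms mentioned above; it simply stretches the background into each hole, on the assumption that whatever was hidden is more likely to belong to the farther of the two surfaces bordering the gap. The function name and data layout are invented for the example.

```python
def fill_holes(colors, depths):
    """Fill None gaps in a row by extending the farther neighbouring surface.

    colors and depths are equal-length lists; holes are None in both.
    Returns new (colors, depths) lists with invented values in the gaps.
    """
    filled_colors = list(colors)
    filled_depths = list(depths)
    width = len(colors)
    u = 0
    while u < width:
        if colors[u] is not None:
            u += 1
            continue
        # Find the extent of this run of missing pixels: [start, end).
        start = u
        while u < width and colors[u] is None:
            u += 1
        end = u
        # Candidate sources are the valid pixels just outside the gap.
        candidates = []
        if start > 0:
            candidates.append(start - 1)
        if end < width:
            candidates.append(end)
        if not candidates:
            continue  # the whole row is empty; nothing to copy from
        # Prefer the farther surface: disoccluded areas are usually background.
        source = max(candidates, key=lambda i: depths[i])
        for i in range(start, end):
            filled_colors[i] = colors[source]
            filled_depths[i] = depths[source]
    return filled_colors, filled_depths


# Foreground "F" at 1 m, background "b" at 4 m, with a two-pixel gap between them.
row_colors = list("bFF") + [None, None] + list("bb")
row_depths = [4.0, 1.0, 1.0, None, None, 4.0, 4.0]

patched_colors, patched_depths = fill_holes(row_colors, row_depths)
print("".join(patched_colors))  # prints bFFbbbb
```

A flat smear of background like this would look wrong on a textured wall, which is where proper in-painting earns its keep, but even this much removes the black hole and gives the gap a sensible depth.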
How Long Until They’re Done?
With VR cameras, even more than other things, perfect is the enemy of good.
Even with the best technology money can buy and footage carefully planned to minimize occlusion errors, the results would still be imperfect. Specular highlights, for example, are spots of brightness that appear on shiny surfaces, which vary in position depending on the position of your head, since they rely on light being reflected at a very specific angle.
Specular highlights recorded in even the best VR video will appear as baked-on white spots on the surface, and won’t look right on nearby objects during head motion. That’s a limitation that’s going to be hard to get around. Furthermore, filling in occlusion errors in complicated scenes with lots of moving objects is hard — doing it perfectly is impossible and will be for a long time.
It’s going to be years and maybe even decades before VR cameras can provide a perfect experience in the same way that traditional 2D film can. That’s the sacrifice you make in order to experience a fundamentally more powerful medium.
With all that said, some really cool stuff is coming down the pipe in the near future. Every single option mentioned in this article can create genuinely valuable experiences. Samsung’s announcement of “Project Beyond” is a promising sign of things to come.
The Oculus Rift is scheduled to launch sometime in 2015, and sales figures in the millions of units don’t seem like a stretch. If virtual reality takes off the way it seems like it might, an enormous amount of technological progress is going to happen, fast.
Demand for content will drive VR cameras to get better, smaller, and cheaper. It probably won’t be many years before a device that costs less than a new phone and fits in the palm of your hand can provide a compelling, comfortable VR recording of anything – and that is going to be very, very cool.
What would you do with your own VR camera? What kind of content are you most excited for? Let us know in the comments!
Image Credits: Glasses concept Via Shutterstock