Five Questions About Microsoft's "Project HoloLens"

Wednesday morning, Microsoft showed off a project they've been working on for seven years, an augmented reality headset called Project HoloLens.

The vision is ambitious: they want to fundamentally change the way people interact with computers, by building a pair of glasses that can fluidly mix virtual and real content together in the physical space of the user. This is like virtual reality technology, but fundamentally more powerful. Furthermore, they want to do all the processing locally on the glasses -- no computer, no phone, no cables. They're even launching a special version of Windows just for the new hardware. This is the next stage in technological evolution for all those AR games you installed on your phone that one time and haven't touched since.

Their time frame is even more ambitious than their goals: they want to ship developer kits this spring, and the consumer product "during the Windows 10 timeframe". Here's the pitch.

All of this sounds great, but I admit to a fairly high degree of skepticism.

The technologies Microsoft is using have serious, fundamental challenges, and so far Microsoft has been very tight-lipped about how (or if) they've solved them. If they haven't solved them, then their goal of shipping within a year is very concerning. The last thing VR and AR need is a big company shipping another half-baked product like the Kinect. Remember the Project Natal demo from 2009?

Without further ado, here are the five most important things I'd like to know about the HoloLens.

Is This a Light Field Display?

In order to understand this one, we have to look a little deeper into 3D, and how it works. In order to get the sensation of a real, tangible 3D world, our brains integrate a lot of different kinds of information. We get depth cues about the world in three primary ways:

Stereo depth -- the disparity between what both of our eyes see. Faking this is how 3D movies work
Motion parallax -- subtle motions of our head and torso give us additional depth cues for objects that are farther away
Optical focus -- when we focus on something, the lenses of our eyes physically deform until it comes into focus; near-field objects require more lens distortion, which provides depth information about what we're looking at

Optical focus is easy to check out for yourself: close one eye and hold your thumb up in front of a wall across the room. Then, shift your focus from your thumbnail to the surface behind it. When looking past your thumb, your thumb will shift out of focus because the lens of your eye is now less deformed and can't correctly collect the light coming from it.

VR headsets like the Oculus Rift provide the first two cues extremely accurately, but not the last, which works out surprisingly well: our eyes default to relaxing completely, since the optics focus the images as through the light were coming from infinitely far away. The lack of the optical focus cue is unrealistic, but it usually isn't distracting. You can still have very cool gaming experiences without it.

In augmented reality, the problem is different, because you have to mix light from real and virtual objects. The light from the real world will naturally be focused at a variety of depths. The virtual content, however, will be all be focused at a fixed, artificial distance dictated by the optics -- probably on infinity. Virtual objects won't look like they're really part of the scene. They'll be out of focus when you look at real things at the same depth and vice versa. It won't be possible to move your eye fluidly across the scene while keeping it in focus, as you do normally. The conflicting depth cues will be confusing at best, and sickening at worst.

In order to fix this, you need something called a light field display. Light field displays are displays that use an array of tiny lenses to display light focused at many depths simultaneously. This allows the user to focus naturally on the display, and (for augmented reality) solves the problem described above.

There is, however, a problem: light field displays essentially map a single 2D screen onto a three-dimensional light field, which means that each "depth pixel" that the user perceives (and exists at a particular focal depth in the scene) is actually made up of light from many pixels on the original display. The finer-grained the depth you want to portray, the more resolution you have to give up.

Generally, light fields have about an eight-fold resolution decrease in order to give adequate depth precision. The best microdisplays available have a resolution of about 1080p. Assuming one high-end microdisplay driving each eye, that would make the actual resolution of Microsoft's headset only about 500 x 500 pixels per eye, less even than the Oculus Rift DK1. If the display has a high field of view, virtual objects will be incomprehensible blobs of pixels. If it doesn't, immersion will suffer proportionately. We never actually get to see through the lens (just computer re-creations of what the user is seeing), so we have no idea what the user experience is really like.

It's possible that Microsoft has come up with some novel solution to this problem, to allow the use of a light field display without the resolution tradeoff. However, Microsoft's been extremely cagey about their display technology, which makes me suspect that they haven't. Here's the best explanation we've got so far (from the WIRED demo).

To create Project HoloLens’ images, light particles bounce around millions of times in the so-called light engine of the device. Then the photons enter the goggles’ two lenses, where they ricochet between layers of blue, green and red glass before they reach the back of your eye.

This sort of description of the technology could mean practically anything (though, in fairness to Microsoft, the hardware did impress WIRED, though the article was light on details).

We won't know more for sure until Microsoft starts to release technical specs, probably months from now. On a further note of nit picking, is it really necessary to drown the project in this much marketing-speak? The dedicated processor they're using for head tracking is called a "holographic processor" and the images are called "holograms," for no particular reason. The product is fundamentally cool enough that it really isn't necessary to gild it like this.

Is the Tracking Good Enough?

The Project HoloLens headset has a high FOV depth camera mounted on it (like the Kinect), which it uses to figure out where the headset is in space (by trying to line up the depth image it's seeing with its model of the world, composited from past depth images). Here's their live demo of the headset in action.

The tracking is impressive considering that it uses no markers or other cheats, but even in that video (under heavily controlled conditions), you can see a certain amount of wobble: the tracking is not completely stable. That's to be expected: this sort of inside-out tracking is extremely hard.

However, the big lesson from the various Oculus Rift prototypes is that accuracy of tracking matters a lot. Jittery tracking is merely annoying when it's a few objects in a largely stable real world, but in scenes like the Mars demo they showed in their concept video, where almost everything you're seeing is virtual, imprecise tracking could lead to a lack of "presence" in the virtual scene, or even simulator sickness. Can Microsoft get the tracking up to the standard set by Oculus (sub-millimeter tracking accuracy and less than 20 ms total latency) by their shipping date at the end of this year?

Here's Michael Abrash, a VR researcher who has worked for both Valve and Oculus, talking about the problem

[Because there’s always a delay in generating virtual images, [...] it’s very difficult to get virtual and real images to register closely enough so the eye doesn’t notice. For example, suppose you have a real Coke can that you want to turn into an AR Pepsi can by drawing a Pepsi logo over the Coke logo. If it takes dozens of milliseconds to redraw the Pepsi logo, every time you rotate your head the effect will be that the Pepsi logo will appear to shift a few degrees relative to the can, and part of the Coke logo will become visible; then the Pepsi logo will snap back to the right place when you stop moving. This is clearly not good enough for hard AR

Can the Display Draw Black?

Another issue alongside focal depth and tracking has to do with drawing dark colors. Adding more light to a scene is relatively simple, using beam splitters. Taking light out is a lot harder. How do you selectively darken parts of the real world? Putting up a selectively transparent LCD screen won't cut it, since it can't always be at the correct focus to block what you're looking at. The optical tools to solve this problem, unless Microsoft has invented them secretly, simply don't exist.

This matters, because for a lot of the applications Microsoft is showing off (like watching Netflix on your wall), the headset really needs the ability to remove the light coming from the wall, or else your movie will always have a visible stucco pattern overlaid with it: it'll be impossible for imagery to block out real objects in the scene, making the use of the headset heavily dependent on the ambient lighting conditions. Back to Michael Abrash:

[S]o far nothing of the sort has surfaced in the AR industry or literature, and unless and until it does, hard AR, in the SF sense that we all know and love, can’t happen, except in near-darkness.

That doesn’t mean AR is off the table, just that for a while yet it’ll be soft AR, based on additive blending [...] Again, think translucent like “Ghostbusters.” High-intensity virtual images with no dark areas will also work, especially with the help of regional or global darkening – they just won’t look like part of the real world.

What About Occlusion?

"Occlusion" is the term for what happens when one object passes in front of another and stops you from seeing what's behind it. In order for virtual scenery to feel like a tangible part of the world, it's important for real objects to occlude virtual objects: if you hold your hand up in front of a piece of virtual imagery, you shouldn't be able to see it through your hand. Because of the use of a depth camera on the headset, this is actually possible. But, watch the live demo again:

By and large, they carefully control the camera angles to avoid real objects passing in front of virtual ones. However, when the demonstrator interacts with the Windows menu, you can see that her hand doesn't occlude it at all. If this is beyond the reach of their technology, that's a very bad sign for the viability of their consumer product.

And speaking of that UI...

Is This Really the Final UI?

The UI shown off by Microsoft in their demo videos seems to work by using some combination of gaze and hand tracking to control a cursor in the virtual scene, while using voice controls for selecting between different options. This has two major drawbacks: it makes you look like the little kid in the Shining who talks to his finger, but more importantly, it also represents a fundamentally flawed design paradigm.

Historically, the best user interfaces have been ones that bring physical intuitions about the world into the virtual world. The mouse brought clicking, dragging, and windows. Touch interface brought swipe to scroll and pinch to zoom. Both of these were critical in making computers more accessible and useful to the general population -- because they were fundamentally more intuitive than what came before.

VR and AR give you a lot more freedom as a designer: you can place UI elements anywhere on a 3D space, and have the users interact with them naturally, as though they were physical objects. A huge number of obvious metaphors suggest themselves. Touch a virtual UI element to select it. Pinch to it pick it up and move it. Slide it out of the way to store it temporarily. Crush it to delete it. You can imagine building a user interface that's so utterly intuitive that it requires no explanation. Something that your grandmother can instantly pick up, because it's built on a foundation of basic physical intuitions that everyone builds up over a lifetime of interacting with the world. Take a minute, and listen to this smart person describe what immersive interfaces could be.

In other words, it seems obvious (to me) that an immersive user interface should be at least as intuitive as the touch interfaces pioneered by the iPhone for 2D multitouch screens. Building an interface around manipulating a VR "mouse" is a step backward, and exposes either deep technological shortcomings in their hand tracking technology or a fundamental misunderstanding of what's interesting about this new medium. Either way, it's a very bad sign for this product being more than a colossal, Kinect-scale flop.

Hopefully, Microsoft has time to get feedback on this and do a better job. As an example, here's an interface designed by one hobbyist for the Oculus Rift DK2 and the Leap Motion. An immersive UI designed by a large company should be at least this good.

A Sign of Things to Come

On the whole, I'm extremely skeptical of the HoloLens Project as a whole. I'm very glad that a company with Microsoft's resources is investigating this issue, but I'm concerned that they're trying to rush a product out without solving some critical underlying technical issues, or nailing down a good UI paradigm. The HoloLens is a sign of things to come, but that doesn't mean that the product itself is going to provide a good experience to consumers.

Image Credit: courtesy of Microsoft