AI is back.
For the first time since the 1980s, artificial intelligence researchers are making tangible progress on hard problems, and people are starting to talk seriously about strong AI again. In the meantime, our increasingly data-driven world has kicked off an arms race between companies seeking to monetize the new intelligence, particularly in the mobile space.
The two titans leading the pack are Google and Microsoft. The first battle? A new domain in artificial intelligence called “Deep Learning.”
So who’s winning?
The Google Brain
Google’s research efforts have been centered around a project called ‘Google Brain.’ Google Brain is the product of Google’s famously secretive ‘Google X’ research lab, which is responsible for moon-shot projects with low odds of success but very high potential. Other products of Google X include Project Loon, the balloon Internet initiative, and the Google self-driving car project.
Google Brain is an enormous machine learning initiative that is primarily aimed at image processing, but with much wider ambitions. The project was started by Stanford Professor Andrew Ng, a machine learning expert who has since left the project to work for Baidu, China’s largest search engine.
Google has a long history of involvement with AI research. Matthew Zeiler, the CEO of a machine vision startup and a former intern on the Google Brain project, puts it like this:
“Google is not really a search company. It’s a machine-learning company […] Everything in the company is really driven by machine learning.”
The goal of the project is to find ways to improve deep learning algorithms to construct neural networks that can find deeper and more meaningful patterns in data using less processing power. To this end, Google has been aggressively buying up talent in deep learning, making acquisitions which include the $500 million purchase of AI startup DeepMind.
DeepMind was worried enough about the applications of their technology that they forced Google to create an ethics board designed to prevent their software from destroying the world. DeepMind had yet to release its first product, but the company did employ a significant fraction of all deep learning experts in the world. To date, their only public demo of their technology has been a toy AI that’s really, really good at Atari.
Because deep learning is a relatively new field, it hasn’t had time to produce a large generation of experts. As a result, there’s a very small number of people with expertise in the area, and that means it’s possible to gain significant advantage in the field by hiring everyone involved.
Google Brain has been applied, so far, to Android’s voice recognition feature and to automatically cataloguing Street View images, identifying important features like addresses. An early test was the famous cat experiment, in which a Google deep learning network automatically learned to identify cats in YouTube videos with a higher rate of accuracy than the previous state of the art. In their paper on the subject, Google put it like this:
“Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not […] The network is sensitive to high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained it to obtain 15.8 percent accuracy in recognizing 20,000 object categories, a leap of 70 percent relative improvement over the previous state-of-the-art [networks].”
Eventually, Google would like its deep learning algorithms to do… well, pretty much everything, actually. Powerful AI platforms like IBM’s Watson rely on these sorts of low-level machine learning algorithms, and improvements on this front make the overall field of AI that much more powerful.
A future version of Google Now, powered by Google Brain, could identify both speech and images, and provide intelligent insights about that data to help users make smarter decisions. Google Brain could improve everything from search results to Google Translate.
Microsoft’s approach to the deep learning war has been a little different. Rather than trying to buy up deep learning experts to refine their algorithms, Microsoft has been focusing on improving the implementation, and finding better ways to parallelize the algorithms used to train deep learning networks.
This project is called “Microsoft Adam.” Their techniques reduce redundant computation, doubling the quality of results while using fewer processors to obtain them. This has led to impressive technical achievements, including a network that can recognize individual breeds of dogs from photographs with high accuracy.
Microsoft describes the project like this:
“The goal of Project Adam is to enable software to visually recognize any object. It’s a tall order, given the immense neural network in human brains that makes those kinds of associations possible through trillions of connections. […] Using 30 times fewer machines than other systems, [internet image data] was used to train a neural network made up of more than two billion connections. This scalable infrastructure is twice more accurate in its object recognition and 50 times faster than other systems.”
The obvious application for this technology is in Cortana, Microsoft’s new virtual assistant, inspired by the AI character in Halo. Cortana, which is meant to compete with Siri, can do a number of clever things using sophisticated speech recognition techniques.
The design goal is to build an assistant that interacts more naturally and can perform a wider array of useful tasks for the user, something that deep learning would help with enormously.
Microsoft’s improvements to the back end of deep learning are impressive, and have led to applications not previously possible.
How Deep Learning Works
In order to understand the issue a little better, let’s take a minute to understand this new technology. Deep learning is a technique for building intelligent software, often applied to neural networks. It builds large, useful networks by layering simpler neural networks together, each finding patterns in the output of its predecessor. To understand why this is useful, it’s important to look at what came before deep learning.
Backpropagating Neural Networks
The underlying structure of a neural network is actually pretty simple. Each ‘neuron’ is a tiny node that takes an input, and uses internal rules to decide when to “fire” (produce output). The inputs feeding into each neuron have “weights”: multipliers that control whether the signal is positive or negative, and how strong it is.
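To make that concrete, here is a minimal sketch of a single neuron in Python with NumPy. The firing rule (fire when the weighted sum is above zero), the `bias` term, and the example numbers are all illustrative assumptions, not details from any particular system.

```python
import numpy as np

def neuron(inputs, weights, bias=0.0):
    """Return 1 ("fire") when the weighted sum of the inputs is above zero."""
    # Each input is multiplied by its weight: the sign of a weight decides
    # whether that input pushes toward firing or against it, and the size
    # decides how strongly.
    total = np.dot(inputs, weights) + bias
    return 1 if total > 0 else 0

weights = np.array([0.5, -1.0, 0.25])  # hypothetical, hand-picked weights

print(neuron(np.array([1.0, 0.0, 1.0]), weights, bias=-0.5))
print(neuron(np.array([1.0, 1.0, 1.0]), weights, bias=-0.5))
```

The second input carries a negative weight, so switching it on is enough to stop this particular neuron from firing.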
By connecting these neurons together, you can build a network that approximates almost any function, given enough neurons. You feed your input into the input neurons as numeric values (often just binary ones and zeros), and measure the firing value of the output neurons to get the output. As such, the trick to neural networks of any type is to take a network and find the set of weights that best approximates the function you’re interested in.
Backpropagation, the algorithm used to train the network based on data, is very simple: you start your network off with random weights, and then try to classify data with known answers. When the network is wrong, you check why it’s wrong (producing a smaller or larger output than the target), and use that information to nudge the weights in a more helpful direction.
By doing this over and over again, for many data points, the network learns to classify all of your data points correctly, and, hopefully, to generalize to new data points. The key insight of the backpropagation algorithm is that you can move error data backwards through the network, changing each layer based on the changes you made to the last layer, thus allowing you to build networks several layers deep, which can understand more complicated patterns.
Backprop was first described by Paul Werbos in 1974 and popularized in 1986 by David Rumelhart, Geoffrey Hinton, and Ronald Williams. It had the remarkable effect of making neural networks useful for broad applications for the first time in history. Trivial neural networks have existed since the 1950s, and were originally implemented with mechanical, motor-driven neurons.
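The training loop described above can be sketched in a few lines of Python with NumPy. This is a toy under stated assumptions: a two-layer network of sigmoid neurons, the XOR problem as training data, and a hand-picked learning rate and layer size. None of these choices come from Google’s or Microsoft’s systems.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Four data points with known answers (XOR, chosen for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Start the network off with random weights.
W1 = rng.normal(size=(2, 4))
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))
b2 = np.zeros(1)
learning_rate = 0.5  # assumed value

losses = []
for step in range(5000):
    # Forward pass: try to classify the data.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    error = output - y  # positive means too large, negative means too small
    losses.append(float(np.mean(error ** 2)))

    # Backward pass: move the error back one layer at a time.
    d_output = error * output * (1 - output)
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)

    # Nudge every weight in the direction that reduces the error.
    W2 -= learning_rate * hidden.T @ d_output
    b2 -= learning_rate * d_output.sum(axis=0)
    W1 -= learning_rate * X.T @ d_hidden
    b1 -= learning_rate * d_hidden.sum(axis=0)

print("error at start:", losses[0])
print("error at end:  ", losses[-1])
```

The recorded error should fall as training proceeds. How close it gets to zero depends on the random starting weights.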
Another way to think about the backprop algorithm is as an explorer on a landscape of possible solutions. Each neuron weight is another direction in which it can explore, and for most neural networks, there are thousands of these. The network can use its error information to see which direction it needs to move in and how far, in order to reduce error.
It starts at a random point, and by continually consulting its error compass, moves ‘downhill’ in the direction of fewer errors, eventually settling at the bottom of the nearest valley: the best solution it can reach from where it started.
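Here is a small sketch of that downhill walk in Python. The one-dimensional “landscape” below is a made-up function with two valleys, and the step size and starting points are assumptions chosen for illustration. A real network explores thousands of directions at once rather than one.

```python
def landscape(x):
    """A hypothetical error landscape with two valleys of different depth."""
    return x**4 - 3 * x**2 + x

def slope(x):
    """The 'error compass': the derivative of the landscape at x."""
    return 4 * x**3 - 6 * x + 1

def descend(x, step_size=0.01, steps=2000):
    """Repeatedly take a small step downhill from the starting point x."""
    for _ in range(steps):
        x -= step_size * slope(x)
    return x

right = descend(2.0)   # starts on the right-hand slope
left = descend(-2.0)   # starts on the left-hand slope

print("from  2.0: settles at", right, "with error", landscape(right))
print("from -2.0: settles at", left, "with error", landscape(left))
```

Both walks stop where the slope is flat, but they stop in different valleys, and only the left-hand one is the deepest point on this landscape.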
So why don’t we use backpropagation for everything? Well, backprop has several problems.
The most serious problem is called the ‘vanishing gradient problem.’ Basically, as you move error data back through the network, it becomes less meaningful each time you go back a layer. Trying to build very deep neural networks with backpropagation doesn’t work, because the error information won’t be able to penetrate deeply enough into the network to train the lower levels in a useful way.
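A rough way to see the effect is to chain sigmoid neurons one after another and track how much of the error signal survives the trip back. The sketch below is a simplification under stated assumptions: one neuron per layer, a weight of 1.0 on every connection, and a sigmoid activation. With those choices each layer multiplies the signal by at most 0.25.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_through_chain(depth, x=0.5, weight=1.0):
    """Return d(output)/d(input) for `depth` sigmoid neurons in a row."""
    gradient = 1.0
    activation = x
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: each layer scales the signal by
        # weight * sigmoid'(z), and sigmoid'(z) = a * (1 - a) <= 0.25.
        gradient *= weight * activation * (1.0 - activation)
    return gradient

for depth in (1, 2, 5, 10, 20):
    print(depth, "layers:", gradient_through_chain(depth))
```

Larger weights can offset some of the shrinkage, so this shows the trend rather than a law that holds for every network.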
A second, less serious problem is that neural networks converge only to local optima: often they get caught in a small valley and miss deeper, better solutions that aren’t near their random starting point. So, how do we solve these problems?
Deep Belief Networks
Deep belief networks are a solution to both of these problems, and they rely on the idea of building networks that already have insight into the structure of the problem, and then refining those networks with backpropagation. This is a form of deep learning, and the one in common use by both Google and Microsoft.
The technique is simple, and is based on a kind of network called a “Restricted Boltzmann Machine” or “RBM”, which relies on what’s known as unsupervised learning.
Restricted Boltzmann Machines, in a nutshell, are networks that simply try to compress the data they’re given, rather than trying to explicitly classify it according to training information. RBMs take a collection of data points, and are trained according to their ability to reproduce those data points from memory.
By making the RBM smaller than the sum of all the data you’re asking it to encode, you force the RBM to learn structural regularities about the data in order to store it all in less space. This learning of deep structure allows the network to generalize: If you train an RBM to reproduce a thousand images of cats, you can then feed a new image into it – and by looking at how energetic the network becomes as a result, you can figure out whether or not the new image contained a cat.
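As a rough illustration, here is a tiny RBM in Python with NumPy, trained with one step of contrastive divergence (a common training rule for RBMs, assumed here since the article does not name one). The class name `TinyRBM`, the six-pixel “images”, the two hidden units, and the learning rate are all hypothetical choices. The sketch also uses reconstruction error as a simple stand-in for the energy measure mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyRBM:
    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
        self.b_visible = np.zeros(n_visible)
        self.b_hidden = np.zeros(n_hidden)

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hidden)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_visible)

    def reconstruct(self, v):
        """Squeeze the data through the hidden layer and back out."""
        return self.visible_probs(self.hidden_probs(v))

    def train_step(self, v0, learning_rate=0.1):
        # Positive phase: what the hidden units do when shown real data.
        h0 = self.hidden_probs(v0)
        h_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: what the network reproduces "from memory".
        v1 = self.visible_probs(h_sample)
        h1 = self.hidden_probs(v1)
        # Move the weights toward the data and away from the reproduction.
        self.W += learning_rate * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_visible += learning_rate * (v0 - v1).mean(axis=0)
        self.b_hidden += learning_rate * (h0 - h1).mean(axis=0)

rng = np.random.default_rng(0)
# Two made-up six-pixel patterns, squeezed through only two hidden units.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)

rbm = TinyRBM(n_visible=6, n_hidden=2, rng=rng)
error_before = float(np.mean((data - rbm.reconstruct(data)) ** 2))
for _ in range(3000):
    rbm.train_step(data)
error_after = float(np.mean((data - rbm.reconstruct(data)) ** 2))

print("reconstruction error before training:", error_before)
print("reconstruction error after training: ", error_after)
```

No labels appear anywhere in the training loop. The network only ever sees the patterns themselves, which is what makes this unsupervised.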
The learning rules for RBMs resemble the function of real neurons inside the brain in important ways that other algorithms (like backpropagation) do not. As a result, they may have things to teach researchers about how the human mind works.
Another neat feature of RBMs is that they’re “generative”, which means that they can also run in reverse, working backwards from a high level feature to create imaginary inputs containing that feature. This process is called “dreaming.”
So why is this useful for deep learning? Well, Boltzmann Machines have serious scaling problems: the deeper you try to make them, the longer it takes to train the network.
The key insight of deep belief networks is that you can stack two-layer RBMs together, each trained to find structure in the output of its predecessor. This is fast, and leads to a network that can understand complicated, abstract features of the data.
In an image recognition task, the first layer might learn to see lines and corners, and the second layer might learn to see the combinations of those lines that make up features like eyes and noses. The third layer might combine those features and learn to recognize a face. By turning this network over to backpropagation, you can home in on only those features which relate to the categories you’re interested in.
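The stacking idea can be sketched as a loop: train one RBM, pass the data through it, and train the next RBM on the result. The Python below is a minimal sketch under stated assumptions: random binary data standing in for images, layer sizes of 4 and 2, and one-step contrastive divergence as the training rule. The helper name `train_rbm` is hypothetical, and the final backpropagation pass is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, rng, steps=500, learning_rate=0.1):
    """Train one two-layer RBM and return its weights and hidden biases."""
    n_visible = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    b_visible = np.zeros(n_visible)
    b_hidden = np.zeros(n_hidden)
    for _ in range(steps):
        h0 = sigmoid(data @ W + b_hidden)
        h_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h_sample @ W.T + b_visible)
        h1 = sigmoid(v1 @ W + b_hidden)
        W += learning_rate * (data.T @ h0 - v1.T @ h1) / len(data)
        b_visible += learning_rate * (data - v1).mean(axis=0)
        b_hidden += learning_rate * (h0 - h1).mean(axis=0)
    return W, b_hidden

rng = np.random.default_rng(0)
# Placeholder data: 20 random 8-pixel binary "images".
images = (rng.random((20, 8)) < 0.5).astype(float)

stack = []
features = images
for n_hidden in (4, 2):
    # Each RBM is trained only on the output of the layer below it.
    W, b_hidden = train_rbm(features, n_hidden, rng)
    stack.append((W, b_hidden))
    features = sigmoid(features @ W + b_hidden)

print("layer shapes:", [W.shape for W, _ in stack])
print("top-level features per image:", features.shape[1])
```

In a full deep belief network, the weights in `stack` would then serve as the starting point for backpropagation instead of random values.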
In a lot of ways, this is a simple fix to backpropagation: it lets backprop “cheat” by starting it off with a bunch of information about the problem it’s trying to solve. This helps the network reach better minima, and it ensures that the lowest levels of the network are trained and doing something useful. That’s it.
On the other hand, deep learning methods have produced dramatic improvements in machine learning speed and accuracy, and are almost single-handedly responsible for the rapid improvement of speech-to-text software in recent years.
Race for Canny Computers
You can see why all of this is useful. The deeper you can build networks, the bigger and more abstract the concepts that the network can learn.
Want to know whether or not an email is spam? For the clever spammers, that’s tough. You have to actually read the email, and understand some of the intent behind it – try to see if there’s a relationship between the sender and receiver, and deduce the sender’s intentions. You have to do all that based on colorless strings of letters, most of which are describing concepts and events that the computer knows nothing about.
That’s a lot to ask of anyone.
If you were asked to learn to identify spam in a language you didn’t already speak, provided only some positive and negative examples, you’d do very poorly — and you have a human brain. For a computer, the problem has been almost impossible, until very recently. Those are the sorts of insights that deep learning can have, and it’s going to be incredibly powerful.
Right now, Microsoft’s winning this race by a hair. In the long run? It’s anyone’s guess.