YouTube Will Use Neural Networks to Actually Understand Videos
Searching YouTube can be a frustrating experience. If you know what a video is about, or you remember the contents but not the name, you could be searching for a very long time. That’s because YouTube doesn’t actually see the videos the way that a person does. It just sees the metadata: title, description, and tags. And that’s assuming the uploader bothered to include the information.
All of that could change in the near future. Google recently filed a patent that indicates YouTube might actually start to understand the videos that it plays.
Relevance-Based Image Selection
Google’s patent application is for “relevance-based image selection,” a fancy way of saying “finding the things that someone searched for based on what’s in a video.” In the system described in the application, an algorithm is trained to extract specific features from the frames of each video and assign keywords to them. It can then return a video, or just the relevant part of one, when a user searches for those keywords.
The application gives an interesting example:
“[I]f the user enters the search query “car race,” the video search engine . . . can find and return a car racing scene from a movie, even though the scene may only be a short portion of the movie that is not described in the textual metadata.”
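To make the car-race example concrete, here is a minimal Python sketch of the lookup side only, assuming a frame-level keyword index already exists. The index contents, the video IDs, and the find_scenes helper are all hypothetical; the patent application does not publish code or data structures.

```python
from collections import defaultdict

# Hypothetical frame-level index: keyword -> list of (video_id, second) hits.
# The idea is that these labels come from a learned model, not uploader metadata.
FRAME_INDEX = {
    "car race": [("movie_42", 3601), ("movie_42", 3602), ("movie_42", 3603),
                 ("movie_42", 5000), ("clip_7", 12)],
    "cat": [("clip_7", 3), ("clip_7", 4)],
}


def find_scenes(query, index, max_gap=1):
    """Group matching frames into (video_id, start, end) scenes.

    Hits in the same video that are at most `max_gap` seconds apart
    are merged into a single scene.
    """
    by_video = defaultdict(list)
    for video_id, second in index.get(query, []):
        by_video[video_id].append(second)

    scenes = []
    for video_id, seconds in sorted(by_video.items()):
        seconds.sort()
        start = prev = seconds[0]
        for s in seconds[1:]:
            if s - prev > max_gap:
                # Gap too large: close the current scene and start a new one.
                scenes.append((video_id, start, prev))
                start = s
            prev = s
        scenes.append((video_id, start, prev))
    return scenes
```

Calling find_scenes("car race", FRAME_INDEX) returns the short racing stretch inside movie_42 as its own result, which is the behavior the quote describes: a scene can be found even when the movie's title and description never mention it.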
Obviously, this will drastically change how effective a YouTube search is. Videos that have been previously unfindable because of bad metadata will be found. Videos that contain useful clips in the middle, surrounded by less interesting things at the beginning and end, will be much more valuable. TED talk videos will be findable based on single lines spoken in them. You’ll be able to find cat videos even if “cat” isn’t in the title.
Combining this technology with Google’s already impressive ability to find things that are related to your search terms likely means that finding videos will become an entirely different experience. You’ll see related videos that don’t include your search term, but include a term that’s related (maybe even visually related). The visual equivalent of keyword placement might start affecting where a video shows up in the rankings. Who knows how advanced this could be?
How Does It Work?
Google is understandably keeping their cards close to their chest on this one. However, the following paragraph in their patent application sheds some light on how they’ll get YouTube to “see” videos:
“In one aspect, a computer system generates the searchable video index using a machine-learned model of the relationships between features of video frames, and keywords descriptive of video content. The video hosting system receives a labeled training dataset that includes a set of media items (e.g., images or audio clips) together with one or more keywords descriptive of the content of the media items. The video hosting system extracts features characterizing the content of the media items. A machine-learned model is trained to learn correlations between particular features and the keywords descriptive of the content. The video index is then generated that maps frames of videos in a video database to keywords based on features of the videos and the machine-learned model.”
That’s a lot of really dense stuff, but here’s what it comes down to. Google builds a machine-learning model and, to help it learn, shows it a large set of labeled media items (the application mentions images and audio clips) along with keywords that say what each one contains. The model learns to associate specific features of that content with specific keywords. The more labeled examples it sees, the better it should get at the process. Once trained, it is run over the frames of videos in the database to build a searchable index of frames and keywords.
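The three steps in the quoted paragraph (labeled examples, a learned feature-to-keyword model, a frame index) can be sketched in a few lines of Python. This is a toy illustration, not Google’s method: it swaps in a simple nearest-centroid model, the feature vectors are made-up numbers rather than anything extracted from real media, and the names train, label, and build_index are hypothetical.

```python
import math

# Hypothetical labeled training set: (feature vector, keywords).
# Real features would be extracted from images or audio; these are invented.
TRAINING = [
    ([0.9, 0.1], ["car"]),
    ([0.8, 0.2], ["car"]),
    ([0.1, 0.9], ["cat"]),
    ([0.2, 0.8], ["cat"]),
]


def train(examples):
    """Learn one centroid (mean feature vector) per keyword."""
    sums, counts = {}, {}
    for features, keywords in examples:
        for kw in keywords:
            acc = sums.setdefault(kw, [0.0] * len(features))
            for i, x in enumerate(features):
                acc[i] += x
            counts[kw] = counts.get(kw, 0) + 1
    return {kw: [x / counts[kw] for x in acc] for kw, acc in sums.items()}


def label(features, model, threshold=0.3):
    """Return keywords whose centroid lies within `threshold` of the features."""
    return sorted(kw for kw, centroid in model.items()
                  if math.dist(features, centroid) <= threshold)


def build_index(frames, model):
    """Map each keyword to the (video_id, frame_no) pairs it was assigned to."""
    index = {}
    for video_id, frame_no, features in frames:
        for kw in label(features, model):
            index.setdefault(kw, []).append((video_id, frame_no))
    return index
```

Given frames such as ("v1", 0, [0.9, 0.1]) and ("v1", 2, [0.15, 0.85]), build_index files the first under “car” and the second under “cat”, while an ambiguous frame like [0.5, 0.5] falls outside the threshold and gets no keyword. A production system would presumably use far richer features and a stronger model, but the shape of the pipeline is the same as in the application’s description.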
Eventually, the algorithm will presumably be introduced into the YouTube search engine, where it could continue learning and getting better at picking out relevant keywords from audio and video content. While the patent application doesn’t specifically mention neural networks, it’s very likely that this particular type of machine learning will be used, as it’s well suited to learning features from raw images and audio.
By simulating the human brain (or at least one theoretical model of how it learns), large neural networks can become very effective at learning on their own, without supervision, and YouTube would provide an absolutely gigantic playground in which they could learn and receive feedback. Other types of machine learning could be used, but from what we know at the moment, neural networks look the most likely.
Google researcher (and “father of deep learning”) Geoffrey Hinton hinted at something to this effect in his Reddit AMA earlier this year:
“I think that the most exciting areas over the next five years will be really understanding videos and text. I will be disappointed if in five years time we do not have something that can watch a YouTube video and tell a story about what happened.”
Will It Gain Sentience and Kill Us All?
This is always the question that comes up when a new announcement about machine learning hits the news. And the answer is, as always, yes. YouTube will team up with Watson and Wolfram Alpha to trick us into subservience using YouTube videos, after which they will likely turn us into computer food. (Haven’t you seen Colossus?)
I jest, of course. But the potential implications of training computers to recognize things that they “see” and “hear” in videos are very impressive. DARPA has already started looking at the security implications of this technology, but it’s not hard to imagine it being used in law, home security, education . . . pretty much anywhere.
Whether Google’s relevance-based image selection will be as effective as we imagine remains to be seen, but this could be a groundbreaking change in video search. And from there, who knows? If Google can use truth as a ranking factor, there’s no reason to believe this technology won’t be amazingly powerful. It could change just how much the Internet really understands itself. If that thought doesn’t tie your mind in knots, I don’t know what will.
What do you think about Google’s patent application? What other uses can you imagine this technology having? Share your thoughts below!