Think of your favorite open databases.
I’m sure Wikipedia and IMDb instantly spring to mind, but you might not need all the knowledge ever recorded, or a comprehensive database of all things entertainment. Sometimes you need a bit of VLDB (Very Large Database) flavor. Something to spice up your data analysis. Something to put the “big” in your big data. Whelp, good person, you’re in the right place.
The 2003 completion of the Human Genome Project (HGP) was just the beginning. Since then, advances in sequencing technology have vastly reduced the per-person cost, allowing the HGP to expand from its initial research base of twenty university labs into a sprawling, globalized network of interconnected genome mapping facilities.
You can download part of the 1000 Genomes Project, containing sequencing information for over 2,600 people from 26 populations around the world. The full dataset weighs in at around 200TB, so be prepared. We would suggest using it in conjunction with a powerful cloud computing platform.
See also: The Animal Genome Size Database for genome data relating to 5,635 species.
A planespotter’s heaven. A massive image database featuring 2,532,457 photos of all manner of aircraft, from the smallest individual craft to hulking great flying fortresses.
Airliners also features an extensive aircraft data and history section, kept up to date in cooperation with Aerospace Publications to ensure factual accuracy. This has made it one of the most detailed aircraft databases on the Internet.
The site formerly known as The Internet Archive has gone through a massive redesign. Its look hadn’t changed much since around 2002, even though the Internet itself has changed a lot since then, and the archive has grown considerably since its early days.
Archiving everything on the Internet, the site gives you free access to digital media including books, music, games, videos, and much more. The collection is currently estimated at around 10 petabytes, and as its web crawlers keep crawling, it will continue to grow.
Freebase is “a community-curated database of well-known people, places, and things,” stored in a data structure called a graph. A graph is composed of nodes connected by edges, which allowed Freebase to rapidly expand its content without disrupting existing records.
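To see why a graph can grow without disturbing what is already there, here is a minimal sketch in Python. It is not Freebase's actual implementation or API: the `Graph` class, the node ids, and the relation names are all made up for illustration. The point is only that a new fact arrives as a new node or a new edge, and the existing records are left as they were.

```python
from collections import defaultdict


class Graph:
    """Toy labeled graph (hypothetical, not Freebase's real data model)."""

    def __init__(self):
        self.nodes = {}                 # node id -> dict of properties
        self.edges = defaultdict(list)  # node id -> list of (relation, target id)

    def add_node(self, node_id, **properties):
        # Create the node if needed, then merge in any new properties.
        self.nodes.setdefault(node_id, {}).update(properties)

    def add_edge(self, source, relation, target):
        # Make sure both endpoints exist, then record the labeled link.
        for node_id in (source, target):
            self.nodes.setdefault(node_id, {})
        self.edges[source].append((relation, target))

    def neighbors(self, node_id, relation=None):
        # Targets reachable from node_id, optionally filtered by relation label.
        return [
            target
            for rel, target in self.edges.get(node_id, [])
            if relation is None or rel == relation
        ]


graph = Graph()
graph.add_node("douglas_adams", name="Douglas Adams", type="person")
graph.add_node("hitchhikers_guide", name="The Hitchhiker's Guide to the Galaxy", type="book")
graph.add_edge("douglas_adams", "wrote", "hitchhikers_guide")

# Expanding the graph: a new node and a new edge, with no change to old records.
record_before = dict(graph.nodes["douglas_adams"])
graph.add_node("cambridge", name="Cambridge", type="place")
graph.add_edge("douglas_adams", "born_in", "cambridge")
assert graph.nodes["douglas_adams"] == record_before

print(graph.neighbors("douglas_adams"))
```

A table-based design would typically need a new column or a new table to hold a new kind of fact. Here, a new kind of fact is just another relation label on an edge.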
Unfortunately, Freebase, owned by Google, switched to read-only mode early this year, ahead of the standalone database being transferred to the Wikimedia Foundation for integration into the Wikidata project (end of June 2015). Developers can currently still access Freebase using existing APIs, but once the switch is made, they will have to use Wikimedia APIs to access the data.
From the home base of an Internet knowledge dream-team of Google and Wikimedia, we move to the morbid. Find a Grave is a massive database of 121 million burial records from around the globe.
The most comprehensive records come from the US, but some smaller countries are well represented too. It comes complete with photos, interesting monuments, and a number of memorable epitaphs, in case you need inspiration.
A database maintained by the ever-present reviewing team at GameSpot. GameRankings gives a well-rounded portrayal of a game’s popularity by covering online and offline gaming reviews from reputable sources.
In a similar vein to the massive IMDb, The Big Cartoon Database focuses exclusively on all things animated: cartoons, films, television shows, adverts, and more. If it is an animation, you’ll find it here – and if not, sign up as a contributor to this ever-growing database.
The Big Cartoon Database has a sister site in The Big Comic Database, home to a further 100,000 or more comic book records, spanning some 5,000 series, with over 35,000 cover scans. It also contains a comprehensive search function, including a comic book price guide detailing current resale values at the various grading levels.
See also: The Grand Comics Database, a non-commercial enterprise database of comics worldwide.
An invaluable tool for students and academics alike, CiteSeerX is a public search engine and digital library of academic and scientific papers. Often considered the first automated citation indexing system, it was the inspiration for Google Scholar and Microsoft Academic Search, though the latter has since been integrated into the Bing search engine.
CiteSeerX focuses on indexing public scholarly documents. If your research paper is openly distributed, it has a higher chance of appearing within the search engine. CiteSeerX is an excellent example of the power of shared knowledge made available to a much wider audience.
See also: Google Scholar for a different range of books and citations.
Unfortunately, this is not a database of every cat picture on the Internet. Now that would be something! WorldCat is much more useful than that. The reference site documents the collections of over 72,000 libraries around the world, covering 170 countries and territories. This is useful if you’re researching in a foreign country, or just have a desire to read rare books in person.
The only downside is the update method. WorldCat uses a batch processing model rather than allowing users to access the data in real time. As a result, WorldCat does not indicate the loan status of catalogued books, whether a library owns multiple copies of one book, or whether the book in question is directly accessible to those wishing to visit. It is still a very useful tool, especially when used in conjunction with CiteSeerX.
“The Internet’s clearinghouse of Simpsons guides, news, and information.” I couldn’t have put it better myself. The long-standing fan favorite began way back in 1994, and is still going strong even without any interactive multimedia, if only to escape the watchful eye of Fox’s legal department.
You will find one of the largest databases of customization tools for Windows here, spanning from XP up to Windows 8.1. I’m sure it won’t take long for Windows 10 to begin making the rounds. Its vast popularity stems from a combination of forces. Owner Stardock subsidizes the site, meaning there are few to no advertisements. It also benefits from the number of individuals funneled to the site from Stardock.
Ah, a trip down nostalgia lane to a database reminding me I was never to be Roger Waters. In fact, I can still barely play, but that’s another story.
The Ultimate Guitar Archive, or just Ultimate-Guitar (UG), has over 1,500,000 registered members around the world, overseeing a ridiculously large amount of community content. It is almost mind-boggling how much guitar-related information flows out from a single source. The community doesn’t just maintain a massive database; its members also frequently collaborate with one another to create sprawling music projects.
Plants for a Future documents ecologically sustainable horticulture. It has a big hand in spreading knowledge on species diversity and the importance of permaculture. What started as a small project in the depths of Cornwall has slowly grown into a worldwide database.
Growth is somewhat slow, and the database largely focuses on permaculture in the UK and EU, but many of the records can be adapted to specific locations in the US once you have the species details.
Power up with this Excel add-in to process and analyze data. The main Quandl site acts as a database search, locating databases from around the world that match your search terms. Try it if you need some extra data in a hurry, or just like playing with large datasets (honestly, who doesn’t?!).
See also: The Enigma database search engine.
The Tiny Images dataset acts as a visual dictionary. Click anywhere within the image and a search term pops up with extra information. You can also use specific terms to sift through 80 million images.
The database is part of a wider machine learning project focused on teaching computers to “see” and “read” semantic fields within images.
Bonus Source: /r/datasets
The “front page of the Internet” is a solid home for data mining enthusiasts around the globe. There are subreddits dedicated to machine learning, data mining, text to data, and datasets. If you need something specific, make a request. New datasets appear every week.
Watch out for interesting datasets such as the Immunization Levels in Child Care and Schools for California.
Do You Use The Wealth?
The Internet has created the single clearest opportunity for individuals to come together and concentrate their knowledge into a single database. We are valiantly trying to document everything about anything. Some of these databases are for perusing, others are for learning, but we hope you enjoy them all.
What are your favorite databases? Are there any massive open reference sources I should have included in this list?