When it comes to online databases and information that can be found inside what is commonly known as the “invisible web“, I’m not your typical user. Sure, I spend a little bit too much of my time sifting through online databases at places like the National Archives and the CIA FOIA reading room, but I have to say nothing makes me more excited than when I find an HTML based table filled with volumes of seemingly complex and unconnected data.
The fact is, data tables are a goldmine of important truths. Data often gets collected by armies of data-collection grunts with boots on the ground. You’ve got people from the U.S. Census traveling the entire country for household and family information. You’ve got non-profit environmental groups collecting all sorts of interesting information about the environment, pollution, global warming and more. And if you’re into the paranormal or Ufology, there are also constantly updated tables of information about sightings of strange objects in the sky above us.
Ironically, you would think that any government in the world would be interested to know what sort of foreign craft are being spotted in the skies over any country, but apparently not – at least not in the U.S. anyway. In America, the collection of unusual sightings of crafts have been relegated to teams of amateur hobbyists who flock to new UFO sightings like moths to a flame. My interest in these sightings actually stems not from a fascination with aliens or crafts from other planets, but from a scientific fascination with patterns – where and why more people are seeing things in the sky, and whether those sightings could reflect something very real and much more down-to-Earth actually going on.
To explore the volumes of data collected by teams of UFO hobbyists, I’ve actually developed a way to import large HTML tables of data into a Google Spreadsheet, and then manipulate and analyze that data to extract and discover meaningful and important information. In this article, I intend to show you how to do the same.
Important HTML Data Into Google Spreadsheet
In this example, I’m going to show you how to import any data that might be stored in a table on any website on the Internet, into your Google Spreadsheet. Think about the enormous volume of data that’s available on the Internet today in the form of HTML tables. Wikipedia alone has data in tables for topics like global warming, the U.S. Census Bureau has tons of population datasets, and a little bit of Googling will land you a whole lot more beyond that.
In my example, I’m starting out with a database on the National UFO Reporting Center that actually looks like it might be a query-style deep-web database, but if you observe the URL structuring, it’s actually a semi-complex web-based reporting system consisting of static web pages and static HTML tables – exactly what we want when looking for data to import.
NUForc.org is one of those organizations that serves as one of the biggest reporting centers for UFO sightings. It isn’t the only one, but it’s big enough to find new datasets with current sightings for every month. You choose to view the data sorted by criteria like State or Date, and each of those is provided in the form of a static page. If you sort by date and then click on the most recent date, you’ll see that the table listed there is a static web page named according to the date format.
So, we now have a pattern to regularly extract the latest sightings information from this HTML-based database. All you have to do is import the first table, use the most recent entry (the top one) to identify the latest update, and then use the date of that posting to build the URL link where the latest HTML data table exists. Doing this will simply require a couple of instances of the ImportHTML function, and then a few creative uses of text manipulation functions. When you’re done, you’ll have one of the coolest, self-updating reporting spreadsheets of your very own. Let’s get started.
Importing Tables and Manipulating Data
The first step, of course, is to create the new spreadsheet.
So, how do you import HTML tables? All you need is the URL where the table is stored, and the number of the table on the page – usually the one first listed is 1, the second is 2, and so on. Since I know the URL of that first table listing dates and counts of sightings listed, it’s possible to import by typing the following function into cell A1.
H2 holds the function “=hour(now())“, so the table will update every hour. This is probably extreme for data that updates this infrequently, so I could probably get away with doing it daily. Anyway, the above ImportHTML function brings in the table as shown below.
You’ll need to do a bit of data manipulation on this page before you can piece together the URL for the second table with all of the UFO sightings. But go ahead and create the second sheet on the workbook.
Before trying to build that second sheet, it’s time to extract the post date from this first table, in order to build the link to the second table. The problem is that the date is brought in as a date format, not a string. So, first you need to use the TEXT function to convert the report post date into a string:
In the next cell to the right, you need to use the SPLIT function with the “/” delimiter to break the date up into month, day and year.
Looking good! However, each number needs to be forced to two digits. You do this in the cells right below them using the TEXT command again.
A format of “00” (those are zeroes) forces two digits, or a “0” as a placeholder.
Now you’re ready to rebuild the entire URL to the latest HTML table of new sightings. You can do this by using the CONCATENATE function, and piecing together all of the bits of information you just extracted from the first table.
Now, on the new sheet you created above (the blank sheet), you’re going to do a new “importhtml” function, but this time for the first URL link parameter, so you’re going to navigate back to the first spreadsheet and click on the cell with the URL link you just created.
The second parameter is “table” and the last is “1” (because the sightings table is the first and only on the page). Hit enter, and now you’ve just imported the entire volume of sightings that were posted on that particular date.
So, you’re probably thinking this is a nice novelty act and everything – I mean, after all, what you’ve done is extracted existing information from a table on the Internet and migrated it to another table, albeit a private one in your Google Docs account. Yes, that’s true. However, now that it’s in your own private Google Docs account, you have at your fingertips the tools and functions to better analyze that data, and start discovering amazing connections.
Using Pivot Reports to Analyze Imported Data
Just recently, I wrote an article about using Pivot Reports in Google Spreadsheet to perform all sorts of cool data analysis feats. Well, you can do the same amazing data analysis acrobatics on the data that you’ve imported from the Internet – giving you the ability to uncover interesting connections that possibly no one else has uncovered before you.
For example, from the final sightings table, I might decide to use a pivot report to take a look at the number of different unique shapes reported in each state, compared to the overall number of sightings in that particular state. Finally, I also filter out anything mentioning “aliens” in the comments section, to hopefully weed out some of the more wingnut entries.
This actually reveals some pretty interesting things right off the bat, such as the fact that California clearly has the highest number of reported sightings of any other State, along with the distinction of reporting the highest number of craft shapes in the country. It also shows that Massachusetts, Florida and Illinois are big hitters in the UFO sightings department as well (at least in the most recent data).
Another cool thing about Google Spreadsheet is the wide array of charts available to you, including a Geo-Map that lets you lay out “hot spots” of data in a graphical format that really stands out and makes those connections within the data quite obvious.
If you think about it, this is really only the tip of the iceberg. If you can now import data from data tables on any page on the Internet, just think of the possibilities. Get the latest stock numbers, or the most recent top 10 books and authors on the New York Times bestseller list, or the biggest selling cars in the world. There are HTML tables out there on almost any topic you can imagine, and in many cases those tables are frequently updated.
ImportHtml gives you the ability to plug your Google Spreadsheet into the Internet, and feed off the data that exists out there. It can become your own personal hub of information that you can use to manipulate and massage into a format that you can actually work with. It’s just one more very cool thing to love about Google Spreadsheet.
Have you ever imported data into your spreadsheets? What kind of interesting things did you discover in that data? How did you use the data? Share your experiences and ideas in the comments section below!
Image Credits: Business Graph