Doing Data Science in the Cloud With ScraperWiki
If you’ve got the mental chops, a flair for programming and storytelling, and an eye for design, you could do worse than getting into data science. It’s the new big thing in technology: trendy and well paid, with data scientists sought after by some of the largest companies in the world.
ScraperWiki is a company that has long been associated with the data science field. For the past few years, this Liverpool-based startup has offered a platform for coders to write tools that get data, clean it and analyze it in the cloud.
With a recent refresh and the ever-increasing demand for data scientists in the enterprise, it is worth taking a good look at ScraperWiki.
Full disclosure: I was an intern at ScraperWiki last summer.
What Does ScraperWiki Do?
ScraperWiki markets itself as a place to get, clean and analyze data, and it delivers on each of those counts. In its simplest form, it gives you, the user, a place to write code that retrieves data from a source, tools to convert that data into a format that is easy to analyze, and storage to keep it for later visualization, which you can also handle with ScraperWiki.
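As a rough illustration of the "clean" step described above, here is a minimal Python sketch that turns messy CSV text into tidy records. It uses only the standard library and is not ScraperWiki's own tooling; the sample data, the town names and the clean_rows function are all made up for the example.

```python
import csv
import io

# Hypothetical raw input standing in for whatever a scraper retrieved.
# The towns and figures are invented for illustration.
RAW = """town, population
 Alphaville ,"1,200"
Betatown,"3,450"
,
"""


def clean_rows(raw_text):
    """Parse CSV text and return a list of tidy dicts, skipping blank records."""
    reader = csv.reader(io.StringIO(raw_text))
    # Normalize the header: strip stray whitespace and lowercase it.
    header = [h.strip().lower() for h in next(reader)]
    cleaned = []
    for row in reader:
        values = [v.strip() for v in row]
        if not any(values):
            continue  # drop records where every field is empty
        record = dict(zip(header, values))
        # Convert "1,200" style strings into real integers.
        record["population"] = int(record["population"].replace(",", ""))
        cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    for record in clean_rows(RAW):
        print(record)
```

Real-world sources will need their own rules, but the pattern of normalizing headers, trimming whitespace, dropping empty records and converting types tends to carry over.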
It also comes with a number of pre-built tools that automate repetitive tasks, including getting data from PDFs, which are notoriously difficult to decode. This is in addition to Twitter searching and scraping utilities. You don’t need any software development experience to use these.
ScraperWiki embraces the freemium pricing model and offers a service with multiple tiers. Those just getting started with data science, or with limited needs, can make use of the free service. This gives you three datasets, which are where you store your data and code.
Those planning to write multiple scrapers or wanting to do mountains of data analysis can fork out some cash for a premium account. These start at $9 per month and offer 10 datasets. If that’s still not enough, you can always upgrade to their highest tier which comes with 100 datasets and costs $29 per month.
Programmers are often quite particular when it comes to how they code. Some prefer scripting languages over compiled languages. Some prefer the pared-back experience of a text editor over that of an integrated development environment (IDE). ScraperWiki recognizes that, and gives you a huge amount of choice when it comes to how you write your code.
If you’re so inclined, you can write your code in the browser. As you’d expect from any professional-grade, web-based development tool, this comes with features that any programmer would consider to be essential, such as syntax highlighting.
There are a number of languages on offer. These include Python, a popular scripting language used by the likes of Google and NASA; Ruby, which powers a number of popular websites such as LivingSocial; and the popular statistical analysis language, R.
In addition, you can also write code from the command line by using SSH, Git and whatever text editor you enjoy using. Yes, you read that right: SSH. Each box you use is its own Linux account, and you are able to connect to it as you would a VPS or any other shell account. There are a number of text editors available, including Vim, which can be extended with plugins and by editing its configuration. Those intimidated by Vim can use Nano, which is a lightweight command line text editor.
The libraries installed should be sufficient for writing tools to retrieve data and to process it. If you need something a bit more obscure, you can always create a virtualenv from the command line. As you can see, there’s a huge amount of flexibility afforded to developers.
So, you’ve got your data. You’ve normalized it. You’ve cleaned it. You’ve analyzed it. Now it’s time to do some visualization and show the world what you’ve learned.
There are a number of pre-made visualizations available, including ones which plot your data on a map and find trends within your findings. To use these, you need to ensure your data is stored as a SQLite file with the filename ‘scraperwiki.sqlite’. Then you simply add the visualization you’re interested in. Simple, right?
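To make the storage requirement concrete, here is a minimal sketch of writing rows to a SQLite file called scraperwiki.sqlite using Python's built-in sqlite3 module. The table name (swdata), the two-column layout and the save_rows and load_rows helpers are assumptions made for this example, not something the visualizations are confirmed to require.

```python
import sqlite3


def save_rows(rows, db_path="scraperwiki.sqlite", table="swdata"):
    """Append (name, value) pairs to a table in a SQLite file.

    The table name and columns are illustrative assumptions.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                f'CREATE TABLE IF NOT EXISTS "{table}" (name TEXT, value REAL)'
            )
            conn.executemany(
                f'INSERT INTO "{table}" (name, value) VALUES (?, ?)', rows
            )
    finally:
        conn.close()


def load_rows(db_path="scraperwiki.sqlite", table="swdata"):
    """Read every row back in insertion order."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            f'SELECT name, value FROM "{table}" ORDER BY rowid'
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Made-up sample values, purely for demonstration.
    save_rows([("Alphaville", 1200.0), ("Betatown", 3450.0)])
    print(load_rows())
```

Because the result is an ordinary SQLite file, you can inspect it with any SQLite client before pointing a visualization at it.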
ScraperWiki offers a lot to developers who want to do some data analysis without their development environment getting in their way, whilst having the flexibility to please even the most demanding of users. But what do you think? Let me know in the comments below.
Photo Credit: Rocket Science (Dan Brown)