Doing Data Science in the Cloud With ScraperWiki

Matthew Hughes 21-01-2014

If you’ve got the mental chops, a flair for programming and storytelling, and an eye for design, you could do worse than getting into data science. It’s the new big thing in technology: highly trendy and highly paid, with data scientists being sought by some of the largest companies in the world.


ScraperWiki is a company that has long been associated with the data science field. For the past few years, this Liverpool-based startup has offered coders a platform for writing tools that get data, clean it, and analyze it in the cloud.

With a recent refresh and the ever-increasing demand for data scientists in the enterprise, it is worth taking a good look at ScraperWiki.

Full disclosure: I was an intern at ScraperWiki last summer.

What Does ScraperWiki Do?

ScraperWiki markets itself as a place to get, clean and analyze data, and it delivers on each of those counts. In its simplest form, it gives you – the user – a place to write code that retrieves data from a source, tools to convert that data into a format that is easy to analyze, and storage to keep it for later visualization – which you can also handle with ScraperWiki.
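To make the get–clean–store flow concrete, here is a minimal sketch using only Python's standard library. The HTML, the table name, and the column names are all illustrative; on ScraperWiki itself you would fetch a live page and would typically lean on the platform's pre-installed scraping libraries instead.

```python
# A minimal sketch of the get -> clean -> store flow, using only the
# standard library. Everything here (the HTML, table, and columns) is
# illustrative, not a real ScraperWiki dataset.
import sqlite3
from html.parser import HTMLParser

# "Get": in a real scraper this HTML would come from an HTTP request.
RAW_HTML = """
<ul>
  <li class="price">Widget A - $9</li>
  <li class="price">Widget B - $29</li>
</ul>
"""

class PriceParser(HTMLParser):
    """Clean: pull a (name, dollars) pair out of each <li class="price">."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            name, _, price = data.partition(" - $")
            self.rows.append((name.strip(), int(price)))
            self.in_price = False

parser = PriceParser()
parser.feed(RAW_HTML)

# "Store": keep the cleaned rows in SQLite for later analysis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (name TEXT, dollars INTEGER)")
conn.executemany("INSERT INTO prices VALUES (?, ?)", parser.rows)
total = conn.execute("SELECT SUM(dollars) FROM prices").fetchone()[0]
print(total)  # 38
```

The same three steps – fetch, normalize, persist – are what a real scraper on the platform does, just against live sources and a dataset that sticks around between runs.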



It also comes with a number of pre-built tools that automate repetitive tasks, including getting data from PDFs, which are notoriously difficult to decode. This is in addition to Twitter searching and scraping utilities. You don’t need any software development experience to use these.


ScraperWiki embraces the freemium pricing model and offers a service with multiple tiers. Those just getting started with data science, or with limited needs, can make use of the free service. This gives you three datasets – the containers where you store your data and code.

Those planning to write multiple scrapers or wanting to do mountains of data analysis can fork out some cash for a premium account. These start at $9 per month and offer 10 datasets. If that’s still not enough, you can always upgrade to their highest tier which comes with 100 datasets and costs $29 per month.


Programmers are often quite particular when it comes to how they code. Some prefer scripting languages over compiled languages. Some prefer the pared-back experience of a text editor over that of an integrated development environment (IDE). ScraperWiki recognizes that, and gives you a huge amount of choice in how you write your code.



If you’re so inclined, you can write your code in the browser. As you’d expect from any professional-grade, web-based development tool, this comes with features that any programmer would consider essential, such as syntax highlighting.


There are a number of languages on offer. These include Python, a popular scripting language used by the likes of Google and NASA; Ruby, which powers a number of popular websites such as LivingSocial; and the popular statistical analysis language, R.



In addition, you can write code from the command line using SSH, Git, and whatever text editor you enjoy using. Yes, you read that right: SSH. Each box you use is its own Linux account, and you can connect to it as you would a VPS or any other shell account. A number of text editors are available, including Vim, which can be extended with plugins and by editing its configuration. Those intimidated by Vim can use Nano, a lightweight command-line text editor.


The libraries installed should be sufficient for writing tools to retrieve data and to process it. If you need something a bit more obscure, you can always create a virtualenv from the command line. As you can see, there’s a huge amount of flexibility afforded to developers.
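Setting up an isolated environment for extra libraries might look something like the following. This is a sketch under assumptions: the paths are illustrative, and depending on the box you may have `virtualenv` rather than the `venv` module shown here.

```shell
# Sketch: creating an isolated Python environment on your box so you can
# install libraries beyond the pre-installed set (paths are illustrative;
# run these after connecting over SSH).
python3 -m venv "$HOME/env"       # or `virtualenv $HOME/env` on older setups
. "$HOME/env/bin/activate"
# Inside a venv, sys.prefix diverges from sys.base_prefix:
python -c 'import sys; print(sys.prefix != sys.base_prefix)'  # True
# From here, `pip install <package>` puts libraries inside ~/env only.
```

Because the environment lives in your home directory, anything you install stays with your dataset rather than polluting the system-wide Python.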


Data Visualization

So, you’ve got your data. You’ve normalized it. You’ve cleaned it. You’ve analyzed it. Now it’s time to do some visualization and show the world what you’ve learned.

ScraperWiki allows developers to display their data using web pages constructed from the familiar trifecta of HTML, CSS and JavaScript. In addition, Bootstrap components are supported out of the box.


There are a number of pre-made visualizations available, including ones that plot your data on a map or find trends in your findings. To use these, you need to store your data as an SQLite file named ‘scraperwiki.sqlite’. Then you simply add the visualization you’re interested in. Simple, right?
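Producing that file is ordinary SQLite work. The sketch below writes a couple of rows into ‘scraperwiki.sqlite’ using Python’s standard library; the table name, columns, and coordinates are illustrative, so treat this as the shape of the step rather than the exact schema the map tool expects.

```python
# Sketch: writing results into the 'scraperwiki.sqlite' file that the
# pre-made visualization tools read. The table ('swdata') and columns
# are illustrative, not a documented schema.
import os
import sqlite3

conn = sqlite3.connect("scraperwiki.sqlite")
conn.execute("""CREATE TABLE IF NOT EXISTS swdata (
    place TEXT, lat REAL, lng REAL)""")
conn.execute("DELETE FROM swdata")  # keep the sketch re-runnable
conn.executemany(
    "INSERT INTO swdata VALUES (?, ?, ?)",
    [("Liverpool", 53.4084, -2.9916), ("London", 51.5072, -0.1276)],
)
conn.commit()
conn.close()

print(os.path.exists("scraperwiki.sqlite"))  # True
```

Once the file exists in your dataset, attaching a visualization is a matter of picking it from the tools list.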


ScraperWiki offers a lot to developers who want to do some data analysis without their development environment getting in the way, while having the flexibility to please even the most demanding of users. But what do you think? Let me know in the comments below.
Photo Credit: Rocket Science (Dan Brown)
