Big data sets – where to find and how to harvest them

Finding big data sets to work with isn’t easy. Anyone who’d like to share data needs to be wary of its sensitivity (social media data, for example, is very personal), so a lot of interesting data sets are off limits.

In this blog I’ve detailed some repositories for data that you can collect yourself. Sites like Kaggle offer static data sets, some of a decent size too (terabytes and up), but testing your models against new data is much more interesting and a vital part of machine learning.

I should note that the examples below are mostly “pull” data sets, where you only get the data when you request it. “Push” data sets would be interesting for streaming systems like Apache Kafka, but these are rather rare. Transport for London offers one, but the approval process is more involved than just signing up. The link to the TfL stream is here, called “Live bus and river bus arrivals API (stream)”, but you’ll need to apply via email and describe your use case to them.

Transport API

Transport API website

This is a transport data repository similar to the TfL API, which provides data about the London transport network, but this one covers the whole of the UK. It goes a bit further by including extras like tweets for sentiment analysis, comparisons of timetables against actual journey times, and performance indicators. It’s mostly used to build transport applications for the UK.

Data is returned in JSON format, and collection is similar to TfL’s, via URL manipulation. See some examples here in their documentation.

To harvest the data, I’d do something similar to my TfL collection: a bash script in a cron job that pulls the desired routes every n minutes and pushes the results to a large store like an S3 bucket or HDFS.
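As a rough illustration of that polling approach, here is a minimal Python sketch. The endpoint path and the app_id/app_key credentials are placeholders, not confirmed details of the Transport API — check their documentation for the real URLs and parameters.

```python
# Hypothetical poller for a Transport API-style service. The endpoint
# path, app_id and app_key below are placeholders for illustration only.
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

BASE_URL = "https://transportapi.com/v3/uk/bus/stop/{atcocode}/live.json"

def build_request_url(atcocode, app_id, app_key):
    """Build the request URL for live arrivals at one bus stop."""
    query = urllib.parse.urlencode({"app_id": app_id, "app_key": app_key})
    return BASE_URL.format(atcocode=atcocode) + "?" + query

def harvest_once(atcocode, app_id, app_key, out_dir="."):
    """Fetch one snapshot and write it to a timestamped JSON file."""
    url = build_request_url(atcocode, app_id, app_key)
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = f"{out_dir}/arrivals_{atcocode}_{stamp}.json"
    with open(path, "w") as f:
        json.dump(payload, f)
    # A cron job could then push the file on to S3 or HDFS, e.g. with
    # "aws s3 cp <path> s3://my-bucket/" or "hdfs dfs -put <path> /data/".
    return path
```

Scheduling harvest_once from cron every n minutes gives you the same rolling collection as the bash approach, with each snapshot landing as its own timestamped file.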

London Air Quality

London Air website

This is also part of TfL’s open data and is provided by King’s College London. It’s a feed of pollution and air quality data in XML or JSON format, including data and maps at different granularities (hourly, daily, annual, etc.).

To harvest this, I’d do something similar to the Transport API above. It would certainly be interesting to combine the two feeds to look for correlations, for example between highly polluted areas and the number of late buses.

AWS Open data registry

Open data registry website

This is a repository of very varied data sets held in S3 buckets with periodic updates. Having them in S3 buckets relieves you of harvesting them yourself, and accessing them is very simple via something like the Kognitio S3 connector. I wrote about connecting to JSON in S3 in an earlier blog here, and a quick reference guide to our S3 connector can be found here.
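Because registry buckets are public, you can also fetch objects directly over HTTPS without any AWS credentials, using S3’s virtual-hosted-style URLs. The bucket name and key below are placeholders — each registry entry documents its own bucket.

```python
# Public S3 buckets can be read over plain HTTPS with no credentials.
# "example-open-data" and the key are placeholders for illustration.
import urllib.parse

def public_s3_url(bucket, key, region="us-east-1"):
    """Virtual-hosted-style URL for an object in a public S3 bucket."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{urllib.parse.quote(key)}"

url = public_s3_url("example-open-data", "weather/2024/01/scan.json")
# urllib.request.urlopen(url) would then stream the object directly.
```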

Keep an eye on the update frequency: the more frequent it is, the more likely the data set is to be large. For example, this weather radar data is updated in real time.

For a nice simple example, this financial data set is updated every minute during trading hours and has some sample SQL calls on GitHub.

GDELT project

GDELT project website

This is a collection of news data that monitors every medium, including broadcast, print, and web news globally. In 2015 alone 2.5TB of data was collected, and it shows no signs of slowing down, especially with the growth of social media. It doesn’t provide the data itself but links to it instead, which makes the size even more impressive.

It comes in CSV format so reading it into a traditional row-based database is relatively simple.
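A minimal sketch of that load step, using SQLite as a stand-in for the row-based database: the three column names are placeholders (real GDELT export files carry dozens of fields, and some are tab-delimited despite the .CSV extension, so check the codebook first).

```python
# Load delimited event rows into an in-memory SQLite table. The schema
# is a made-up three-column subset purely for illustration.
import csv
import sqlite3

def load_events(lines, delimiter="\t"):
    """Parse delimited rows and insert them into an events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (event_id TEXT, event_date TEXT, source_url TEXT)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        csv.reader(lines, delimiter=delimiter),
    )
    conn.commit()
    return conn

# Toy two-row sample; a real run would pass open("events.csv") instead.
sample = ["1\t20150101\thttp://example.com/a",
          "2\t20150102\thttp://example.com/b"]
conn = load_events(sample)
n = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Since csv.reader accepts any iterable of lines, the same function works for a file handle streaming a multi-gigabyte export as for the toy list above.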

To harvest this, you can access it via Google BigQuery if you just want to sample some data sets, or use Google’s implementation of it. For other uses there’s also an S3 bucket, via the AWS Open Data Registry, that’s updated daily.

Other big data sets

This gives you a good selection of data sets to get started with if you want to experiment with big data analytics. Platforms like Kognitio perform at their best with massive data sets, so we’re always on the lookout for data we can use to demonstrate Kognitio’s impressive performance. If you have any examples of freely available big data sets, I’d love to hear about them, so add your thoughts in the comments below.
