Big Data: Sources for Real Test Data

February 26th, 2013

One of the major impediments to working in Data Science is finding good data on which to run experiments. Based on prior research we’ve conducted, we suspect that infrastructure for Big Data needs to be adaptive, because the characteristics of the data to be stored can require very different strategies. How, then, can we design new strategies and infrastructures without further understanding the types of data being stored?

We hypothesize that reliability, performability, and availability are highly sensitive to the type of data being stored. As such, synthetic data is likely unacceptable for our purposes, nor is it enough to study just a single data set to draw general conclusions. Given these problems, I’ve begun compiling a list of sources of data for testing. Any data we use should have certain characteristics:

  • The data should be freely available, and open.
  • The data should be real, and represent something of interest to scientists, engineers, industry, or the government.
  • The data should be available in a large volume.
  • Ideally the whole collection should be representative of data from many different domains. A broad range of data from many different applications and use cases will better allow us to perform experiments with broad

I’m working to slowly add the sources we’re already working with, and will post updates when we add important new sources. Hopefully this list will help out other researchers doing similar work.

