The statistics about data boggle the mind. The world’s data is doubling every 1.2 years. 90% of it has been created in the last two years. In 2012 there were two trillion gigabytes; by 2020 there will be 35 trillion. We are not just consumers of this stuff. We have become active data agents, spewing out over 2.5 quintillion bytes every day from consumer transactions, communication devices, online behaviour and streaming services. This is our digital footprint. Of the 7 billion people on our planet, over five billion own a mobile phone. Every day we make five billion Google searches, watch 2.8 billion YouTube videos and send over 11 billion texts. To be honest, I think the names – terabyte, petabyte, exabyte, zettabyte, yottabyte – are like a foreign language. I would also like to know who compiles all these factoids that abound in TED talks and YouTube videos. Be that as it may, there is no doubt that we have access to ever-increasing amounts of data.
Welcome to the world of Big Data, the next big thing in the world of tech. The term is said to have been coined by John Mashey, a computer scientist working for Silicon Graphics in the mid-1990s. We are not the only source of this deluge of information. Data is becoming more understandable to computers. We now have the capacity to analyse unstructured data – stuff like words, images, videos and streams of sensor data – that was inaccessible to traditional databases. Here are some of the major sources:
Scientific research data CERN alone produces 40 TB every second.
Retailer databases As a result of e-commerce and loyalty card schemes, retailers have been able to build up vast databases of recorded customer activity.
Vision recognition As vision recognition improves, it is starting to become possible for computers to glean meaningful information and data relationships from photographs and videos.
Internet of things As more smart objects go online, Big Data is also being generated by an expanding Internet of Things. One example is the sensors used to gather climate information.
Big Data generates value from the storage and processing of humongous quantities of digital information. Rather than just putting data into storage silos for relatively little return, we can now analyse these enormous datasets. It’s a new kind of asset – like a vital new mineral. Big Data is best understood in terms of the three Vs – volume, velocity and variety: large quantities of data of all kinds generated in real time. Crunching big numbers can help us learn a lot about ourselves and our world; it is “humanity’s dashboard”. This data can’t be analysed using traditional computing techniques. It requires new systems, software and computers. And then you have those incredible machine-learning algorithms – the more data, the more they learn.
Big Data has the potential to improve analytical insight. It really is an extraordinary time to be a researcher, with so much internet data available. It is being mined in areas as diverse as astrophysics, biology, economics and linguistics.
Google, Amazon and Facebook have already shown how it is possible to deliver personalised search results, advertising, and product recommendations using the vast amounts of data they handle. One third of Amazon’s sales are said to come from its recommendation engine. In a previous post I talked about the company Epagogix, whose algorithm uses big data analysis to evaluate the potential profitability of movies and TV shows before they get made. It is not just something that can be exploited by corporations. Big Data has the potential to be an intelligent tool that will enable us to:
Improve traffic management in cities and permit the smarter operation of electricity generation infrastructure.
Help farmers to accurately forecast bad weather and crop failures.
Predict and plan for criminal activity or pandemics.
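Engines like Amazon’s are far more sophisticated, of course, but the core idea behind a recommendation engine – suggest items that co-occur in other customers’ baskets – can be sketched in a few lines. (The data and function names here are invented purely for illustration.)

```python
from collections import Counter

# Toy purchase histories (hypothetical data, for illustration only).
baskets = [
    {"book", "lamp"},
    {"book", "lamp", "pen"},
    {"book", "pen"},
    {"lamp", "mug"},
]

def recommend(item, history, k=2):
    """Suggest the items most often bought alongside `item` (simple co-occurrence)."""
    counts = Counter()
    for basket in history:
        if item in basket:
            counts.update(basket - {item})  # count everything bought with `item`
    return [other for other, _ in counts.most_common(k)]

print(recommend("book", baskets))  # lamp and pen co-occur most with book
```

Real systems add weighting, matrix factorisation and personalisation on top, but the principle is the same: the more purchase data, the better the co-occurrence estimates.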
One exciting application is medicine. In the U.S. there is now an ambitious project to collect data on the care of hundreds of thousands of cancer patients and use it to help guide the treatment of other patients across the healthcare system. Cancer specialists would be able to consult the database to see how similar patients had fared on a particular regimen. The rationale is that information gleaned from huge clinical databases will give us a wealth of information about the benefits and harms of treatments. Ultimately it should lead to better-quality healthcare and the development of new drugs.
Chris Anderson, an early fan of Big Data, foresees the end of theory and the demise of the expert:
There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
So are we about to enter a data utopia? I fear not.
There are important privacy issues, but those will have to wait for another post. Here I am going to concentrate on the methodological objections. What is the value of having this amount of data? Nassim Taleb has branded it a nasty phenomenon: cherry-picking on an industrial scale. It may mean more information, but it also means more false information. Trevor Hastie, a statistics professor at Stanford, has warned about the danger of looking for a meaningful needle in massive haystacks of data: many bits of straw look like needles. We must be wary of lies, damned lies and Big Data. I am particularly nervous about its application in finance.
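Hastie’s warning is easy to demonstrate: search through enough pure noise and you will always find a “needle”. Here is a minimal sketch – random data only, no real-world series – showing how a blind search across a thousand meaningless variables still turns up an impressive-looking correlation:

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# One target series and 1,000 "predictors" -- every one of them pure random noise.
n_points, n_predictors = 30, 1000
target = [random.gauss(0, 1) for _ in range(n_points)]
noise = [[random.gauss(0, 1) for _ in range(n_points)]
         for _ in range(n_predictors)]

# Search the haystack: the strongest correlation found is spurious by construction.
best = max(abs(pearson(series, target)) for series in noise)
print(f"Strongest 'signal' among {n_predictors} noise series: r = {best:.2f}")
```

By construction there is nothing to find, yet the best match will look like a genuine relationship – which is exactly why trawling huge datasets without a hypothesis demands so much statistical care.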
Nevertheless, I am a fan of Big Data. As an English teacher, I love the access to so much grammar and lexis in the wild. Of course we need to be wary of generating spurious correlations. But we had spurious correlations long before the invention of Big Data. The classic case comes from the late 1940s in the USA, when it was thought that there was a relationship between polio and the consumption of ice cream and soft drinks. We always need a healthy dose of scepticism when it comes to statistics. But I feel that knowing more about the world is a good thing. In the past, inventions like the microscope and the telescope opened our eyes to worlds we could never have imagined. Some people feel uneasy about human activity being quantified in this way. In his book about the calculation of risk, Against the Gods, Peter Bernstein describes how the Catholic Church opposed statistics because it believed they were incompatible with the notion of free will.
I don’t believe in panaceas, but I think that Big Data presents us with some important opportunities. Time, that incorruptible judge, will tell us how much was hype.
*In this article I have used data with a singular verb. Though strictly speaking it should take a plural verb, as far as I am concerned data wants to be singular. This is like agenda – no one ever uses agendum.