Characteristics of Big Data – Part One
Not sure whether your project requires a Big Data solution? This article describes the characteristics of Big Data to help you answer that question.
The Big Data movement is based upon one common motivation: Need More Input!
Yes, Johnny-5 is alive and well in today's world, where demand keeps growing for more data to make effective and timely business decisions. In the movie Short Circuit, Johnny-5 is a military robot with artificial intelligence that escapes into the civilized world after being struck by lightning (hence the title) and begins a quest to acquire knowledge, driven by an insatiable appetite for more input. Johnny-5 acquires input from everything it sees, touches or hears.
The Big Data movement over the last fourteen years has been similar: businesses have been capturing data from every source of input imaginable, which requires an approach to manage the data capture process and then analyze the data to make actionable decisions in the insatiable quest for profitability. In 2001, META Group (now Gartner) analyst Doug Laney wrote a research report that defined data growth challenges and opportunities along three dimensions, known as the 3 ‘V’ Dimensions of Big Data. Today, fourteen years later, the Big Data challenges and opportunities have grown to eight dimensions, or the 8 ‘V’ Dimensions of Big Data.
Because of the different interpretations of what Big Data promises to deliver, I am reminded of the SQL relational battle that started in the mid-80s, with IBM, Oracle and Microsoft setting the stage for the American National Standards Institute (ANSI) to create the SQL standard in 1986. Every SQL vendor provided its own interpretation of SQL, which was then viewed as a set of extensions to the ANSI SQL standard. In my opinion, the same thing is happening in the Big Data movement: in reality, nothing is different between the Big Data standards movement and the SQL standards movement of thirty years ago.
The processes for managing data are still the same; what has changed are the methods used to access the data. The problems are also still the same: companies want to maintain a competitive edge in the market and remain profitable while keeping the cost of managing, maintaining and analyzing the data reasonable. This raises the question: is Big Data the final frontier of the information age in the search for making data more manageable and more meaningful? The issues surrounding the management, maintenance and analysis of Big Data have become more complex, especially when very large volumes of data from multiple sources need to be linked, connected and correlated to produce information that enables business personnel to make better-informed decisions more rapidly.
This brings us to the purpose of the characteristics of Big Data: helping to identify whether a problem requires a Big Data solution. Opinions on the characteristics differ. Some state that only the first three ‘V’ dimensions are needed to identify a project as ‘Big Data’, and some would even say that a combination of two of the first three is enough. I therefore decided to split the characteristics of Big Data into two parts so that readers can start evaluating whether their project is Big Data driven.
Part One addresses the first three ‘V’ dimensions, which started the Big Data movement in 2001. The additional five ‘V’ Dimension Characteristics of Big Data will be covered in Part Two. The Big Data characteristics were compiled from several sources, including IBM, Paxata, Datafloq, Data Science Central and the National Institute of Standards and Technology (NIST). NIST is a non-regulatory agency of the U.S. Department of Commerce; it was founded in 1901 as the National Bureau of Standards and took its current name in 1988.
The eight (8) ‘V’ Dimension Characteristics of Big Data:
Part One: Volume, Velocity, Variety
Part Two: Variability, Veracity, Virality, Visualization and Value.
The original three ‘V’ Dimension Characteristics of Big Data identified in 2001 are:
1) Volume (amount of data, or the size of the data set)
Volume refers to the vast amounts of data generated every second. We are not talking Terabytes but Zettabytes or Brontobytes. If we take all the data generated in the world between the beginning of time and 2008, the same amount of data will soon be generated every minute. This makes most data sets too large to store and analyze using traditional database technology. New big data tools use distributed systems so that we can store and analyze data across databases located anywhere in the world.
90% of all data ever created was created in the past 2 years. From now on, the amount of data in the world will double every two years. By 2020, we will have 50 times the amount of data that we had in 2011. The sheer volume of the data is enormous, and a very large contributor to the ever-expanding digital universe is the Internet of Things, with sensors all over the world in all devices creating data every second. The era of a trillion sensors is upon us.
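The two growth figures above are worth comparing. Here is a minimal sketch, assuming simple exponential growth and assuming the two-year doubling rate is applied across the whole 2011 to 2020 period (the function names are my own, for illustration only):

```python
import math

def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative growth after `years`, given a fixed doubling time."""
    return 2 ** (years / doubling_time_years)

def implied_doubling_time(years: float, factor: float) -> float:
    """Doubling time (in years) implied by growing `factor`-fold over `years`."""
    return years * math.log(2) / math.log(factor)

years = 2020 - 2011  # nine years

# Growth if data doubles every two years over the nine years
two_year_doubling = growth_factor(years, 2.0)

# Doubling time that a 50-fold increase over nine years would require
fifty_fold_doubling_time = implied_doubling_time(years, 50.0)

print(f"{two_year_doubling:.1f}x")               # about 22.6x
print(f"{fifty_fold_doubling_time:.2f} years")   # about 1.59 years
```

Under these assumptions, a 50-fold increase implies data doubling somewhat faster than every two years, so the two figures are best read as rough estimates from different sources rather than one exact forecast.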
If we look at airplanes, they generate approximately 2.5 billion Terabytes of data each year from the sensors installed in the engines. Self-driving cars will generate 2 Petabytes of data every year. The agricultural industry also generates massive amounts of data with sensors installed in tractors. Shell uses super-sensitive sensors to find additional oil in wells, and if they install these sensors at all 10,000 wells they will collect approximately 10 Exabytes of data annually. That again is absolutely nothing if we compare it to the Square Kilometer Array Telescope, which will generate 1 Exabyte of data per day.
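To put those figures on a common scale, here is a small unit-conversion sketch. It assumes decimal (SI) units, where each step up is a factor of 1,000; the variable names are illustrative:

```python
# Decimal (SI) byte units: an assumption, since the figures above do not say
TB, PB, EB, ZB = 10**12, 10**15, 10**18, 10**21

airplanes_per_year = 2_500_000_000 * TB   # 2.5 billion Terabytes
shell_per_year = 10 * EB                  # 10 Exabytes across all wells
shell_per_well = shell_per_year // 10_000 # spread evenly over 10,000 wells
ska_per_year = 1 * EB * 365               # 1 Exabyte per day for a year

print(airplanes_per_year / ZB, "ZB per year from airplane engines")  # 2.5
print(shell_per_well / PB, "PB per well per year at Shell")          # 1.0
print(ska_per_year / EB, "EB per year from the telescope")           # 365.0
```

Seen this way, 2.5 billion Terabytes is 2.5 Zettabytes, each Shell well would account for about 1 Petabyte a year, and the telescope would produce in ten days what all 10,000 wells produce in a year.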
In the past, the creation of so much data would have caused serious problems. Nowadays, with decreasing storage costs, better storage solutions like Hadoop, and the algorithms to create meaning from all that data, this is no longer the obstacle it once was.
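To give a feel for how distributed tools split up the work, here is a toy map/reduce word count. It is a single-process sketch in plain Python under my own assumptions, not Hadoop's API: the partitions stand in for separate machines, and all function names and sample records are hypothetical.

```python
from collections import Counter
from functools import reduce

def split_into_partitions(records, num_partitions):
    """Deal records round-robin into partitions, as if spread over nodes."""
    return [records[i::num_partitions] for i in range(num_partitions)]

def map_partition(partition):
    """Map step: each 'node' counts words in its own partition only."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right

# Hypothetical sample records
records = ["big data volume", "data velocity", "data variety", "big volume"]

partitions = split_into_partitions(records, num_partitions=2)
partials = [map_partition(p) for p in partitions]   # could run in parallel
totals = reduce(reduce_counts, partials, Counter()) # combine partial counts

print(totals)
```

The point of the design is that no single node ever needs to hold the whole data set: each one summarizes its own share, and only the small summaries are combined.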
2) Velocity (speed of data in and out, or data in motion)
Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of social media messages going viral in seconds. Technology now allows us to analyze the data while it is being generated (sometimes referred to as in-memory analytics), without ever putting it into databases.
Velocity is the speed at which the data is created, stored, analyzed and visualized. In the past, when batch processing was common practice, it was normal to receive an update from the database every night or even every week. Computers and servers required substantial time to process the data and update the databases. In the big data era, data is created in real-time or near real-time. With the availability of Internet-connected devices, wireless or wired, machines and devices can pass on their data the moment it is created.
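The difference between the nightly batch and real-time processing can be sketched as follows. This is a minimal illustration under my own assumptions (the class name, function name and sensor readings are hypothetical): the batch version needs all the data before it can answer, while the streaming version keeps an up-to-date answer after every reading.

```python
class RunningAverage:
    """Streaming style: update the result as each value arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: no need to keep the earlier values around
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

def batch_average(values):
    """Batch style: wait until all values are collected, then compute."""
    return sum(values) / len(values)

# Hypothetical sensor readings arriving one at a time
readings = [20.0, 22.0, 21.0, 25.0]

stream = RunningAverage()
for reading in readings:
    latest = stream.update(reading)
    print(f"after {stream.count} readings: {latest:.2f}")

print(f"batch result at end of day: {batch_average(readings):.2f}")
```

Both approaches end at the same number; the streaming version simply has a usable answer at every step instead of only after the nightly run.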
The speed at which data is currently created is almost unimaginable: every minute we upload 100 hours of video to YouTube. In addition, every minute over 200 million emails are sent, around 20 million photos are viewed and 30,000 are uploaded on Flickr, almost 300,000 tweets are sent and almost 2.5 million queries are performed on Google.
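Per-minute figures are easier to grasp when converted to per-second and per-day rates. The sketch below treats the numbers above as round values ("over" and "almost" are ignored), and the dictionary keys are my own labels:

```python
# Approximate per-minute figures quoted above, treated as round numbers
per_minute = {
    "emails_sent": 200_000_000,
    "tweets_sent": 300_000,
    "google_queries": 2_500_000,
    "flickr_photos_uploaded": 30_000,
}

SECONDS_PER_MINUTE = 60
MINUTES_PER_DAY = 60 * 24

per_second = {name: count / SECONDS_PER_MINUTE for name, count in per_minute.items()}
per_day = {name: count * MINUTES_PER_DAY for name, count in per_minute.items()}

# 100 hours of video per minute, expressed as minutes of video per minute
video_minutes_per_minute = 100 * 60

for name in per_minute:
    print(f"{name}: {per_second[name]:,.0f} per second, {per_day[name]:,} per day")
print(f"video arrives {video_minutes_per_minute:,} times faster than it can be watched")
```

At these rates, tweets alone arrive at roughly 5,000 per second, which is why a nightly batch update is no longer enough for many use cases.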
The challenge for organizations is to cope with the enormous speed at which data is created and to use it in real-time.
3) Variety (range of data types, domains and sources)
Variety refers to the different types of data we can now use. In the past we only focused on structured data that neatly fitted into tables or relational databases, such as financial data. In fact, 80% of the world’s data is unstructured (text, images, video, voice, etc.). With big data technology we can now analyze and bring together data of different types such as messages, social media conversations, photos, sensor data, video or voice recordings.
In the past, all data that was created was structured data that fitted neatly into columns and rows, but those days are over. Nowadays, 90% of the data that is generated by an organization is unstructured data. Data today comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data. The wide variety of data requires a different approach as well as different techniques to store all raw data.
There are many different types of data, and each of those types requires different types of analysis or different tools. Social media content such as Facebook posts or Tweets can give different insights, such as sentiment analysis on your brand, while sensor data will give you information about how a product is used and where it fails.
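As a small illustration of what variety means in practice, the sketch below brings one structured, one semi-structured and one unstructured input into a common record shape. Everything here is an assumption for illustration: the field names, the sample inputs and the helper functions are hypothetical, and only Python's standard library is used.

```python
import csv
import io
import json

def from_csv(text):
    """Structured input: every row has the same named columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"source": "csv", "customer": row["customer"], "text": row["comment"]}
        for row in reader
    ]

def from_json(text):
    """Semi-structured input: fields are nested and may be missing."""
    doc = json.loads(text)
    return {
        "source": "json",
        "customer": doc.get("user", {}).get("name"),
        "text": doc.get("message", ""),
    }

def from_free_text(text):
    """Unstructured input: no fields at all, only the raw text."""
    return {"source": "text", "customer": None, "text": text.strip()}

# Hypothetical sample inputs, one per data type
csv_input = "customer,comment\nalice,Arrived on time\n"
json_input = '{"user": {"name": "bob"}, "message": "Love the new model"}'
text_input = "  The battery drains too fast.  "

records = from_csv(csv_input) + [from_json(json_input), from_free_text(text_input)]
for record in records:
    print(record)
```

Each type needs its own handling before the data can be analyzed together, and the unstructured input carries the least ready-made context, which is where most of the extra analysis effort goes.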
In summary, better storage solutions, techniques for coping with rapid data creation, and diverse tools to store and analyze data provide practical approaches to performing analytics for informed business decisions. I trust that Part One will help you evaluate whether a project requires a Big Data solution. If you are not certain, come back for the Part Two blog on the remaining five ‘V’ Dimension Characteristics of Big Data.
Have a great day and best of success in your first or next Big Data engagement.