“Big Data”: What is it? Am I ready for it?

April 2, 2012

As data and storage needs increase in complexity and sheer magnitude, companies are challenged to maintain, store, and, most importantly, analyze their data effectively.

The growth in data has been further accelerated by the explosive rise of social networking and cloud computing over the last few years. Every day, 2.5 quintillion (2.5 × 10^18) bytes of data are created. To put that number in clearer perspective, that’s 2.5 million terabytes of data created each day (or about 2,273,736.75 tebibytes, if a terabyte is counted as 2^40 bytes).
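The conversion is easy to check. The short Python sketch below starts from the 2.5 quintillion figure and shows that the result depends on whether a terabyte is taken as 10^12 bytes (decimal) or 2^40 bytes (binary, strictly a tebibyte). The constant and function names are illustrative only.

```python
# Convert a daily data volume in bytes to decimal and binary storage units.
BYTES_PER_DAY = 25 * 10**17   # 2.5 quintillion bytes, the figure quoted above

TERABYTE = 10**12   # decimal terabyte (TB)
TEBIBYTE = 2**40    # binary tebibyte (TiB)


def to_unit(num_bytes, unit_size):
    """Return num_bytes expressed in units of unit_size bytes."""
    return num_bytes / unit_size


decimal_tb = to_unit(BYTES_PER_DAY, TERABYTE)
binary_tib = to_unit(BYTES_PER_DAY, TEBIBYTE)

print(f"{decimal_tb:,.2f} TB per day")
print(f"{binary_tib:,.2f} TiB per day")
```

The binary conversion yields a figure of about 2.27 million, while the decimal one is exactly 2.5 million.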

Previous-generation data storage technology, in some cases over 30 years old, simply cannot handle this demand, which is forcing billions of dollars of direct investment in system overhauls. International Data Corporation (IDC) forecasts that the Big Data market will grow from $3.2 billion in 2010 to $16.9 billion in 2015, a compound annual growth rate (CAGR) of about 40%, which is roughly seven times the growth rate of the overall information and communication technology market.
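For readers who want to verify the growth figure, here is a minimal sketch of the standard CAGR formula applied to the two market sizes quoted above. The function name is our own; only the dollar figures and years come from the forecast.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.40 means 40%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size in billions of dollars, 2010 to 2015.
years = 2015 - 2010
rate = cagr(3.2, 16.9, years)
print(f"CAGR: {rate:.1%}")

# Sanity check: compounding the start value at this rate recovers the end value.
projected = 3.2 * (1 + rate) ** years
print(f"Projected 2015 market: ${projected:.1f} billion")
```

The result comes out just under 40%, consistent with the rounded figure in the forecast.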

Wikipedia defines it this way: “Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.” Forrester Research describes big data in terms of “4 Vs”: Velocity, Variety, Volume and Variability. This framework indicates that it is not only volume, typically on the order of terabytes to exabytes, but also other factors that contribute to defining “big data.”

Data can be both big and multi-structured.

An enterprise may want to run an analysis that ties together data from various sources, such as machine-generated logs, social media streams from services such as Twitter and Facebook, and the transactional data stored in its enterprise data warehouses. For example, it may want to understand the root cause of customer churn by looking at the outages seen in its data center and how those outages affected customer sentiment on social media.
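To make the idea concrete, here is a toy sketch of tying three differently shaped sources together by calendar day. All records, field names, and the `daily_view` helper are made up for illustration. A real analysis would read from log files, a social media API, and a data warehouse, at far larger scale.

```python
from collections import defaultdict

# Hypothetical, made-up records standing in for three separate sources.
outage_log = [  # machine-generated data center log
    {"date": "2012-03-01", "minutes_down": 45},
    {"date": "2012-03-03", "minutes_down": 5},
]
social_posts = [  # social media stream, sentiment score in [-1, 1]
    {"date": "2012-03-01", "sentiment": -0.8},
    {"date": "2012-03-01", "sentiment": -0.4},
    {"date": "2012-03-02", "sentiment": 0.3},
    {"date": "2012-03-03", "sentiment": -0.1},
]
cancellations = [  # transactional data from the warehouse
    {"date": "2012-03-01", "customer_id": 17},
    {"date": "2012-03-01", "customer_id": 42},
    {"date": "2012-03-03", "customer_id": 8},
]


def daily_view(outages, posts, cancels):
    """Merge the three sources into one summary record per calendar day."""
    days = defaultdict(lambda: {"minutes_down": 0, "sentiments": [], "churned": 0})
    for row in outages:
        days[row["date"]]["minutes_down"] += row["minutes_down"]
    for row in posts:
        days[row["date"]]["sentiments"].append(row["sentiment"])
    for row in cancels:
        days[row["date"]]["churned"] += 1

    view = {}
    for date, summary in sorted(days.items()):
        scores = summary.pop("sentiments")
        summary["avg_sentiment"] = sum(scores) / len(scores) if scores else None
        view[date] = summary
    return view


view = daily_view(outage_log, social_posts, cancellations)
for date, summary in view.items():
    print(date, summary)
```

Lining the sources up on a shared key (here, the date) is what lets an analyst ask whether long outages coincide with negative sentiment and cancellations.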

Data can be both big and uniformly structured.

Companies with large amounts of data, such as eBay, fit well into this bucket. eBay has one of the world’s largest enterprise data warehouse deployments, spanning over 10 PB of stored data. All of the data is stored in an orderly fashion in large relational data warehouse clusters, and eBay performs the majority of its complex analyses using this paradigm. However, this deployment model is not the norm: the cost can be prohibitive for typical businesses, because data may require complex and expensive pre-processing before it is uniformly structured.

Data can also be not-so-big and multi-structured.

Consider, for example, the financial tick data streams used by Wall Street traders, or the XBRL-formatted financial statement data that public companies file in their 10-K and 10-Q documents. Such data is typically not as large as the other examples (unless it is stored over a long period of time). It is, however, multi-structured: different parts are formatted differently. For example, trade data may be formatted differently depending on the security. In XBRL-formatted financial statements, common asset or liability values on the balance sheet are stored in a defined structure governed by an XBRL taxonomy schema, whereas the free-form footnotes and the MD&A (Management’s Discussion and Analysis) section of the same SEC filing are stored in a completely different structure.

Data can be not-so-big and uniformly structured.

Consider, for example, most traditional relational database management systems (RDBMS). An RDBMS stores data in tables of rows, with pre-defined relationships between tables. This form of organization suits transactional data, where strong ACID (atomicity, consistency, isolation, durability) compliance is needed. Data is stored in a pre-defined schema and adheres to an established set of processes. In short, the data must adapt to the storage mechanism (an RDBMS) before it can be stored.
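As a small illustration of why ACID compliance matters for transactional data, the sketch below uses Python’s built-in `sqlite3` module as a stand-in for an RDBMS. The `accounts` table and the `transfer` function are hypothetical. The point is only that the schema is declared up front, and that a multi-step change either fully commits or fully rolls back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is fixed before any data is stored; the CHECK rule is enforced on write.
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts (id, balance) VALUES (?, ?)",
                 [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)        # valid transfer
failed = transfer(conn, 1, 2, 500)   # would overdraw account 1, so it is rolled back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(ok, failed, balances)
```

In the second call the credit to account 2 is applied first, then the debit violates the `CHECK` constraint, and the rollback undoes the credit as well. That all-or-nothing behavior is the atomicity in ACID.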

Technologies such as Hadoop and NoSQL allow data to be kept in its original format, with no schema or a flexible schema. This removes the need to pre-process data before storing it, and offers great flexibility to analyze and experiment with the data before defining a fixed data model. The tradeoff is that these technologies generally do not provide the ACID compliance offered by traditional relational databases, which makes them a poor fit for transactional data such as point-of-sale or sales order data.
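The “store first, model later” idea can be sketched in a few lines of plain Python. This is not Hadoop or any particular NoSQL API. The raw events and the `read_with_schema` helper are hypothetical, and the structure is imposed only when the data is read.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
raw_lines = [
    '{"type": "sale", "amount": 19.99, "store": "A"}',
    '{"type": "tweet", "text": "checkout was slow today", "user": "sam"}',
    '{"type": "sale", "amount": 5.00, "store": "B", "coupon": "SPRING"}',
]


def read_with_schema(lines, wanted_type, fields):
    """Apply a schema at read time: pick one record type and project its fields."""
    for line in lines:
        record = json.loads(line)
        if record.get("type") == wanted_type:
            # Fields a record lacks come back as None instead of failing the load.
            yield {name: record.get(name) for name in fields}


sales = list(read_with_schema(raw_lines, "sale", ["amount", "store", "coupon"]))
total = sum(row["amount"] for row in sales)

print(sales)
print(f"total sales: {total:.2f}")
```

Nothing had to be decided about the `coupon` field, or about tweets, when the data was written. A different analysis could read the same lines with a different set of fields.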

The rapid growth of data in both the enterprise and the public sector has created tremendous new opportunities for businesses to innovate. However, the varied formats, volume and velocity of the new data being generated put tremendous strain on the traditional RDBMS model. Bridging this gap would allow enterprises to make informed decisions and remain competitive in the new information age.

Big Data in the Cloud: Get our Perspective at VMware vCloud CloudTalk Tuesday, April 10

Join us Tuesday, April 10th at 2pm EST to hear our perspective on how we see big data evolving in the cloud this year.  Get details here.