The Big Data Happening
Big Data?
The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods.
Let’s break it down piece by piece and see what it really is.
Big data is everywhere in the 21st century. More people are spending time online than ever before, and the number keeps increasing steadily.
The 21st century is a digital book: almost anything we do on the internet is saved and stored by multiple organizations for a variety of reasons. For example, every day we unlock our smartphone, search something on Google, open Instagram, check our profile and upload a story. When we do this, Google stores our search history, Instagram retrieves our profile, and when we upload our story it has to save that as well; not to forget, our ISP also keeps track of this activity. Seems like a pretty lengthy process, right? Yet our needs are fulfilled at extremely low latency. The reason is the efficient storage systems of these big tech giants, which have to maintain the data of millions and billions of users in the most efficient way possible.
This is just one example of how humongous data can really get in the 21st century.
So who has the most data?
The answer is fairly obvious, and you might have guessed it already. Google, Amazon and Facebook handle the most data. According to a study, Google currently processes around 20 petabytes of data every day, and in total it has to manage somewhere between 10 and 15 exabytes of storage.
If you are having a little difficulty making sense of these storage units, have a look at the chart below:
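- 1 gigabyte (GB) = 1,000 megabytes (MB)
- 1 terabyte (TB) = 1,000 gigabytes
- 1 petabyte (PB) = 1,000 terabytes
- 1 exabyte (EB) = 1,000 petabytes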
We are mostly familiar with megabytes and gigabytes, but these units are of little significance when it comes to big data. Just to get a rough idea, consider a two-hour HD movie we watch on Netflix. Its size is about 2 GB, which translates to roughly 500 movies in 1 terabyte, and we are talking about 20 petabytes of data every day, so you can get a sense of how big these units of storage can be.
Apart from Google, other storage-hungry platforms include Facebook, Netflix, Instagram and so on.
So where do they store the data?
All the big tech giants that handle such huge amounts of data usually own their data centers. You can think of these data centers as big chunks of hard disks, but the story isn’t that simple.
The two main issues with Big Data that I would like to throw light on are:
- Volume of the data
- Velocity of the data
The former seems pretty self-explanatory; we already have an idea of how much data we are dealing with. The second issue is velocity.
While volume is about storing the data, velocity is about processing it fast enough. When we do a simple Google search, we see the results pop up almost instantly; if this issue were not solved, a simple search could take days to execute, which would make the whole internet almost useless. So handling and managing this data efficiently is as important as storing it efficiently.
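To make the velocity idea a bit more concrete, here is a minimal sketch in plain Python (with made-up data and function names, and nothing like how Google actually does it): instead of one machine scanning every record, the records are split into chunks and scanned in parallel, so the wall-clock time shrinks roughly with the number of workers.

```python
# Toy illustration of "velocity": scan chunks of records in parallel
# instead of one record at a time on a single machine.
from concurrent.futures import ProcessPoolExecutor

def scan_chunk(chunk, term):
    """Return the records in this chunk that contain the search term."""
    return [record for record in chunk if term in record]

def parallel_search(records, term, workers=4):
    # Split the records into one chunk per worker.
    chunk_size = max(1, len(records) // workers)
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    # Each worker scans its own chunk at the same time.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(scan_chunk, chunks, [term] * len(chunks))
    return [match for partial in results for match in partial]

if __name__ == "__main__":
    # Entirely fake "web pages", just to have something to search.
    fake_web_pages = [f"page {i} about big data" if i % 100 == 0 else f"page {i}"
                      for i in range(10_000)]
    print(len(parallel_search(fake_web_pages, "big data")))  # -> 100
```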
Why do we even store this data?
Some people wonder why we even need this data. Why not just dump it and forget about it?
Even if we set aside the fact that big data also includes our login details and the memorable images and videos we share on Instagram, there is another reason why we cannot just dump it.
All these huge, unstructured chunks of data carry the information these tech giants can use to make their products more usable and friendly. This big data holds patterns: the type of movies we like, the products we are likely to add to our cart on Amazon, or the friend suggestions on Facebook and Instagram.
And these are just a few examples. This big data is gold waiting to be harnessed, and that is exactly the job of Big Data Analytics.
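As a tiny, made-up illustration of what that analytics step can look like, here is a Python sketch that turns a log of which genres users watched into a simple "you might also like" suggestion. The log, the user names and the helper functions are all invented for the example; real recommendation pipelines are vastly more sophisticated.

```python
# Toy pattern mining on viewing history: count genres per user and suggest
# a popular genre the user has watched the least. Data is made up.
from collections import Counter

watch_log = [
    ("alice", "thriller"), ("alice", "thriller"), ("alice", "comedy"),
    ("bob", "comedy"), ("bob", "documentary"), ("bob", "comedy"),
]

def genres_per_user(log):
    """Return a Counter of watched genres for each user."""
    per_user = {}
    for user, genre in log:
        per_user.setdefault(user, Counter())[genre] += 1
    return per_user

def recommend(user, log):
    """Suggest the overall-popular genre this user has watched the least."""
    seen = genres_per_user(log).get(user, Counter())
    global_counts = Counter(genre for _, genre in log)
    # Prefer genres the user has seen least, break ties by global popularity.
    return min(global_counts, key=lambda g: (seen[g], -global_counts[g]))

print(recommend("alice", watch_log))  # -> "documentary"
```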
So that pretty much explains why dumping all this data might not be the best idea.
So what is the solution?
The solution to this Big Data problem lies in distributed storage systems. Most tech giants, like Facebook, harness the power of distributed storage to store data and serve it at near-zero latency.
In layman’s terms, a distributed storage cluster consists of a single system that acts as a master and utilizes the resources of the numerous nodes attached to it. The number of nodes depends on the use case and the budget.
They can range from two virtual machines on a single humble laptop to thousands of nodes in a Google cluster. This architecture can also lean on cloud services to provision the storage for the nodes.
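The snippet below is a very simplified sketch of that master/node idea, written in plain Python with class and method names invented purely for illustration: the master splits a file into fixed-size blocks, spreads them across nodes, and keeps extra copies so the data survives a node failure. Real systems such as HDFS are of course far more involved, with failure detection, re-replication and consistency checks.

```python
# Toy model of a distributed storage cluster: one "master" splits a file
# into blocks and records which nodes hold each block (with replicas).
# All names here are invented for illustration.
import itertools

BLOCK_SIZE = 4      # bytes per block (tiny, just for the demo)
REPLICATION = 2     # how many copies of each block to keep

class Node:
    def __init__(self, name):
        self.name = name
        self.blocks = {}            # block_id -> bytes

class Master:
    def __init__(self, nodes):
        self.nodes = nodes
        self.block_map = {}          # filename -> list of (block_id, [node names])
        self._round_robin = itertools.cycle(nodes)

    def put(self, filename, data: bytes):
        """Split data into blocks and place each block on REPLICATION nodes."""
        placements = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = f"{filename}-{i // BLOCK_SIZE}"
            block = data[i:i + BLOCK_SIZE]
            targets = [next(self._round_robin) for _ in range(REPLICATION)]
            for node in targets:
                node.blocks[block_id] = block
            placements.append((block_id, [n.name for n in targets]))
        self.block_map[filename] = placements

    def get(self, filename) -> bytes:
        """Reassemble the file by reading each block from any replica."""
        data = b""
        for block_id, node_names in self.block_map[filename]:
            holder = next(n for n in self.nodes if n.name in node_names)
            data += holder.blocks[block_id]
        return data

cluster = Master([Node("node-1"), Node("node-2"), Node("node-3")])
cluster.put("story.jpg", b"pretend this is an uploaded photo")
print(cluster.get("story.jpg"))
```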
I did not cover much of the technical detail of what goes into a distributed storage cluster. This was just to give an idea of how Big Data arises and why we even need to deal with it in the first place.
Thanks for reading :)