Introduction to Big Data

Abhiroop Bas
3 min read · Sep 16, 2020


Big data | Hadoop | Spark

What is Big Data?

If you have ever wondered how popular social media sites like Facebook manage the tons of data you post every day and show you search results in less than a second, here is the answer. Over 500 terabytes of data and 2.5 billion pieces of content are uploaded each day! This calls for high-capacity storage devices to store the data safely and securely, and for the means to access it right at your fingertips.

This is where Big Data comes into play. Big Data involves analyzing and extracting information from datasets so huge that they are cumbersome and fairly inefficient for traditional data processing software to handle.

Analyzing such huge chunks of data often helps identify trends, uncover new business opportunities, predict diseases and combat crime.

Terminologies

Volume:

It refers to the size of the stored or generated data and is often used to determine storage capacities. Earlier it was very difficult to store data at this scale, but with data lakes and Hadoop, cheaper storage options are now available.

Velocity:

It refers to the speed at which the data is accessible to the user. This is well illustrated by the way the Google search engine returns results in a fraction of a second, even from a huge pool of data.

Variability:

It refers to the changing nature of data. This is particularly important in a dynamic world and for applications like social media, where the data is always changing.

Variety:

It refers to the type, nature and representation of data. This is particularly important with the evolution of structured, unstructured and semi-structured data. Data can range from text to images to audio and video files.

Veracity:

It refers to the quality and value associated with data. Since data comes from many different sources, it is very important to connect and correlate it, and to arrange it into multiple hierarchies with proper linkage.

Evolution

In 2004, Google published a paper on a distributed, parallel processing model called MapReduce. Using MapReduce, huge chunks of data are split and distributed across parallel nodes, processed in parallel, and the results are then gathered and delivered. This framework was later adopted by an Apache open-source project named Hadoop. Later, in 2012, Apache Spark was developed to address the limitations of the MapReduce algorithm.
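The idea can be illustrated with a classic word count. The sketch below is a minimal, single-machine simulation in Python of the map, shuffle and reduce phases; it is not Hadoop's actual API, and in a real cluster each phase would run on many nodes in parallel over data chunks stored on different machines.

```python
from collections import defaultdict

# Toy input: in a real cluster these chunks would live on different nodes.
chunks = [
    "big data needs big storage",
    "hadoop and spark process big data",
]

# Map phase: each chunk independently emits (word, 1) pairs.
def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]

# Shuffle phase: group all emitted values by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key into the final result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(word_counts)  # e.g. {'big': 3, 'data': 2, ...}
```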

Studies in 2012 showed that a multiple-layer architecture is one option for addressing the issues that big data presents. A distributed parallel architecture spreads data across multiple servers and implements the MapReduce and Hadoop frameworks.

Representation

Tools

The data warehouse is a popular tool that has become synonymous with extract, transform and load (ETL) and is used to store structured data. It requires a predefined schema to exist, and data must be strictly fitted into it. Because of this rigid and regimented nature, data warehouses often implement only partial ETL. Organizations populate data warehouses periodically in regular cycles, for example at 2:00 every day.
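As a rough sketch of this warehouse-style, schema-on-write approach, the example below uses Python's built-in sqlite3 as a stand-in for a real warehouse; the table and field names are purely illustrative. The schema must be defined up front, and the ETL step has to reshape every record to fit it before loading.

```python
import sqlite3

# Predefined schema: the table must exist before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        amount_usd REAL NOT NULL,
        order_date TEXT NOT NULL
    )
""")

# Extract: raw records as they might arrive from a source system.
raw_records = [
    {"id": 1, "cust": "Alice", "amount": "19.99", "date": "2020-09-16"},
    {"id": 2, "cust": "Bob",   "amount": "5.50",  "date": "2020-09-16"},
]

# Transform + Load: each record is coerced into the rigid schema.
for r in raw_records:
    conn.execute(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        (r["id"], r["cust"], float(r["amount"]), r["date"]),
    )

print(conn.execute("SELECT customer, amount_usd FROM sales").fetchall())
```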

Data lakes, by contrast, let users determine the data types, sources, volume and time frame only when the data is actually read. Such flexibility is fairly inefficient to achieve on data warehouses and can take a lot of time, so data lakes provide an attractive option for these tasks. Moreover, no single schema fits every business model, and forcing one can render the data useless.
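A minimal sketch of this schema-on-read idea, again in plain Python with illustrative record and field names standing in for a real lake engine: heterogeneous records are stored as-is, and each reader imposes only the view it needs at query time.

```python
import json

# Raw, heterogeneous records dumped into the "lake" as-is (no schema enforced on write).
raw_lake = [
    '{"user": "alice", "action": "post", "text": "hello", "ts": 1600204800}',
    '{"user": "bob", "action": "like", "target": "post:42", "ts": 1600204900}',
    '{"user": "alice", "action": "upload", "file": "cat.png", "size_kb": 512}',
]

records = [json.loads(line) for line in raw_lake]

# Schema-on-read, view 1: an engagement report only needs user and action.
engagement = [(r["user"], r["action"]) for r in records]

# Schema-on-read, view 2: a storage report only needs uploads and their sizes.
uploads = [(r["user"], r["size_kb"]) for r in records if r["action"] == "upload"]

print(engagement)
print(uploads)
```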
