An Introduction to Big Data

akhil anand
Sep 14, 2022

Overview

When someone says big data, the first thing that comes to mind is a large volume of data. But does big data only mean a large volume of data? Let’s break the myth. Any data that exhibits the 5 Vs characteristics is known as big data. So what do these 5 Vs stand for?

Is a 500 GB Excel file on our local system considered big data?

We cannot call data big data based on its size or variety alone. For example, a 500 GB Excel sheet on our local system, which can be downloaded and read, cannot simply be considered big data. It might satisfy volume and variety, but velocity cannot be defined for this kind of data, so we won’t consider it big data.

1. Volume (The First V)

A data engineer’s task is not only to process a large set of data; time is also a main constraint. For example, if the analytics team of your company needs streaming data (every second’s data), then as a big data developer the major challenge is not only to fetch the data but to fetch it within a second, so that it is useful for real-time analytics. These large volumes of data can run to gigabytes, terabytes, petabytes, and beyond.

2. Variety (The Second V)

The data can come in different varieties, for example:

  • Structured/Relational data: Data that can be fetched with the help of SQL.
  • Semi-structured data: CSV files, Excel sheets, etc.
  • Unstructured data: Text files. For example, data containing text reviews of products is unstructured.

3. Velocity (The Third V)

Velocity describes the nature of the data, i.e. how it arrives:

  • Batch Processing: When we receive the data chunk by chunk (for example, some files arrive at 10 am, some at 1 pm, and so on), the data is in batch mode. The latency of batch-mode data is in minutes or hours.
  • Near Time: Suppose you work at an e-commerce company, and whenever a buyer places a new order the data is stored in the database. Such data can be updated within seconds, minutes, or hours and is known as near-time data.
  • Streaming Data: Data that flows continuously is known as streaming data, such as data coming from sensors or financial trading platforms. The latency of streaming data is on the order of milliseconds or microseconds.
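The difference between batch mode and streaming mode can be sketched in a few lines of Python. This is only an illustration, not a real pipeline: the function names `process_in_batches` and `process_as_stream`, the batch size, and the sample readings are all hypothetical.

```python
from typing import Iterable, Iterator, List


def process_in_batches(records: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Batch mode: collect records into chunks and hand over one chunk at a time."""
    batch: List[int] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover records that did not fill a whole batch
        yield batch


def process_as_stream(records: Iterable[int]) -> Iterator[int]:
    """Streaming mode: handle every record as soon as it arrives."""
    for record in records:
        yield record * 2  # placeholder for real per-record work


# Hypothetical sensor readings, invented for this example.
readings = [1, 2, 3, 4, 5]
batches = list(process_in_batches(readings, batch_size=2))
streamed = list(process_as_stream(readings))
print(batches)
print(streamed)
```

In batch mode nothing is handed over until a chunk is complete, which is why its latency is higher. In streaming mode each record is processed on arrival.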

4. Veracity (The Fourth V)

Veracity means the accuracy of the data. The data engineer’s task is not only to process the data while handling the challenges of the above 3 Vs; he or she must also make sure that the processed data is trustworthy in terms of accuracy. Data engineers cannot simply process a large set of junk data for a specific use case.

5. Value (The Fifth & Last V)

Suppose some users are interested in web development and, by mistake, click on a data science ad. Recommending a data science course to these users won’t bring any value to an organization that provides a data science curriculum. Hence these users are removed from the dataset and are not considered valuable data.

How can someone differentiate between value and veracity, since both seem to mean the same thing?

Suppose you are running an ed-tech startup that provides premium courses to 5th–10th standard students. You are getting data as shown in the figure below.

Figure 1

If we look at the above figure, we can observe that the 4th and 7th rows contain unrealistic data. We can’t trust these records, so we need to remove both rows; that is a veracity problem. City, Parent’s Monthly Income, School Name, and Age look like the relevant columns for recommending the premium course, so we can say that only these columns generate value for the use case.
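To make the distinction concrete, here is a minimal Python sketch: veracity filters out rows we cannot trust, while value keeps only the columns that matter for the use case. The records are invented for illustration and are not the data from Figure 1. The age range (9 to 17), the positive-income check, the `Favourite Colour` column, and the names `is_trustworthy` and `clean` are all assumptions.

```python
# Columns the article identifies as relevant for recommending the course.
VALUE_COLUMNS = ["City", "Parent's Monthly Income", "School Name", "Age"]

# Hypothetical records; the values are made up for this example.
students = [
    {"Age": 12, "City": "Pune", "Parent's Monthly Income": 60000,
     "School Name": "School X", "Favourite Colour": "Blue"},
    {"Age": 150, "City": "Delhi", "Parent's Monthly Income": 45000,
     "School Name": "School Y", "Favourite Colour": "Red"},
    {"Age": 14, "City": "Patna", "Parent's Monthly Income": -5000,
     "School Name": "School Z", "Favourite Colour": "Green"},
]


def is_trustworthy(row: dict) -> bool:
    """Veracity check: reject rows with unrealistic values (assumed rules)."""
    return 9 <= row["Age"] <= 17 and row["Parent's Monthly Income"] > 0


def clean(rows: list) -> list:
    """Drop untrustworthy rows (veracity), then keep useful columns (value)."""
    trusted = [row for row in rows if is_trustworthy(row)]
    return [{col: row[col] for col in VALUE_COLUMNS} for row in trusted]


cleaned = clean(students)
print(cleaned)
```

Veracity works on rows (can this record be trusted?), whereas value works on relevance (does this field help the use case?).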

Note: Any dataset that has the above 5 Vs characteristics is said to be big data.

References:

  • Full credit goes to Sudhanshu Kumar (ineuron)
