+91 9830991821/9804866596 admissions.kolkata@isbm.ac.in International School of Business & Media ISB&M, Kolkata

What is Big Data?

Big data is commonly characterized by three Vs: Volume, Velocity and Variety. Volume refers to the huge amount of consumer data collected from numerous market and other sources, which can serve as an indicator of customer response to the products and services available in the market. Velocity is the speed at which data arrives from online and offline sources. Data is generated at high speed by thousands of retail websites, mobile and social media sites, electronic POS terminals at retail stores, banking transactions, air/rail/bus ticket booking counters, audio/video downloads and so on. Millions of events occur every second across the globe, and the records they generate must be captured and stored promptly for further processing. Variety refers to the different types of data that are generated and stored in big data systems. These include numbers, text messages, image files, audio/video clips, animations, 3D and HD content, log files, financial data, social media posts and other unstructured data. Each type has its own format and must be stored so that it can be quickly reconstructed when required. Traditional database systems, which are designed for structured, low-volume and low-velocity data, cannot handle big data, so specialized technology is required to store and analyze it.

To store huge amounts of unstructured data arriving at a fast rate from a large variety of sources, NoSQL ("not only SQL") databases are used. Rather than fixed tables, they store data as objects (documents), key/value pairs, wide columns or graphs. This approach suits distributed storage and offers horizontal scalability, fast operations and a flexible schema, often in exchange for weaker consistency guarantees than a relational database provides. Another important big data technology is Apache Hadoop, a Java-based open-source framework for distributed storage and processing of large data sets across clusters of machines. Its processing model, MapReduce, was introduced by Google: a job is split into many small map and reduce tasks that run in parallel on the nodes of a distributed file system such as the Hadoop Distributed File System (HDFS). Allied technologies include Apache Hive (a data warehouse built on Hadoop) and Apache HBase (a distributed database that runs on top of HDFS).
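The key/value storage and the map-and-reduce idea described above can be illustrated with a minimal, single-machine sketch. This is not Hadoop code: the `records` dictionary stands in for a key/value store, and the names `map_phase`, `shuffle`, `reduce_phase` and `run_job`, as well as the sample events, are hypothetical and chosen only for illustration. In a real cluster the map and reduce steps would run in parallel on many nodes.

```python
from collections import defaultdict

# Hypothetical events held as key/value pairs. Note that the values do not
# share a fixed set of fields, unlike rows in a relational table.
records = {
    "evt:1": {"source": "web", "product": "phone", "amount": 300},
    "evt:2": {"source": "pos", "product": "laptop", "amount": 900, "store": "Kolkata"},
    "evt:3": {"source": "web", "product": "phone", "amount": 250, "coupon": "NEW10"},
}


def map_phase(record):
    """Map step: emit one (product, amount) pair for a single record."""
    yield record["product"], record["amount"]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(store):
    """Run map, shuffle and reduce in sequence over every stored record."""
    pairs = [pair for record in store.values() for pair in map_phase(record)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


totals = run_job(records)
print(totals)
```

Because each call to `map_phase` looks at only one record and each call to `reduce_phase` looks at only one key, the work can be divided among machines without the tasks needing to communicate, which is what makes the model suitable for distributed processing.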

Big data can be of two types: online big data and offline big data. Online big data is generated by numerous online events and is collected and stored on cloud servers. Users can subscribe to the cloud service and download and analyze the data whenever required. Offline big data is collected from various offline batch processes, stored in a distributed Hadoop cluster and processed with MapReduce. Examples of offline big data systems are data warehouses, Extract, Transform and Load (ETL) systems and business intelligence tools. Major vendors of big data products include IBM, SAP, Oracle, Microsoft, Teradata and Amazon Web Services (AWS).
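As a rough illustration of the offline batch processing an ETL system performs, the sketch below reads raw records, cleans them and loads them into a table for analysis. It makes several assumptions: the sample CSV data is invented, an in-memory SQLite database stands in for a real data warehouse, and the function names `extract`, `transform` and `load` are hypothetical. A production pipeline would run on far larger volumes and on distributed storage.

```python
import csv
import io
import sqlite3

# Invented sample batch; the last row has an unusable amount on purpose.
RAW_CSV = """order_id,product,amount
1,phone, 300
2,laptop,900
3,phone,not_available
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, convert types, skip rows with bad amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # drop records whose amount is not a number
        clean.append((int(row["order_id"]), row["product"].strip(), amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # assumption: SQLite in place of a warehouse
load(transform(extract(RAW_CSV)), conn)

# A simple business-intelligence style query over the loaded data.
summary = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(summary)
```

Keeping the three stages as separate functions means each can be replaced independently, for example by swapping the CSV reader for a log-file parser without touching the load step.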