BIG DATA
Big Data is a phrase used to mean a volume of both structured and unstructured data so
massive that it is difficult to process using traditional database and software techniques. In
most enterprise scenarios the volume of data is too big, it moves too fast, or it exceeds
current processing capacity. Big data is a field that treats ways to analyze, systematically
extract information from, or otherwise deal with data sets that are too large or complex to be
handled by traditional data-processing application software. Data with many cases (rows)
offer greater statistical power, while data with higher complexity (more attributes or columns)
may lead to a higher false discovery rate. Big data challenges include capturing data, data
storage, data analysis, search, sharing, transfer, visualization, querying, updating, information
privacy, and data sources. Big data was originally associated with three key concepts: volume,
variety, and velocity. Other concepts later attributed to big data are veracity (i.e., how
much noise is in the data) and value.
Current usage of the term big data tends to refer to the use of predictive analytics, user
behavior analytics, or certain other advanced data analytics methods that extract value from
data, and seldom to a particular size of data set. "There is little doubt that the quantities of
data now available are indeed large, but that's not the most relevant characteristic of this new
data ecosystem." Analysis of data sets can find new correlations to "spot business trends,
prevent diseases, combat crime and so on". Scientists, business executives, practitioners of
medicine, advertising, and governments alike regularly meet difficulties with large data sets in
areas including Internet search, fintech, urban informatics, and business informatics.
Scientists encounter limitations in e-Science work, including meteorology, genomics,
complex physics simulations, biology, and environmental research.
EXAMPLES OF BIG DATA
The education industry is flooded with a huge amount of data related to students, faculties,
courses, results, and more. It was not long before we realized that proper study and
analysis of this data can provide insights that can be used to improve the operational
effectiveness and working of educational institutes.
The following are some of the fields in the education industry that have been transformed by
big-data-motivated changes.
Customized programs:
Customized programs and schemes for each individual can be created using data collected
on the basis of a student's learning history, to the benefit of all students. This improves
overall student results.
Course material:
Course material can be reframed according to data collected on what a student learns, and to
what extent, through real-time monitoring of which components of a course are easier to
understand.
Grading Systems:
New advancements in grading systems have been introduced as a result of proper analysis of
student data.
Career prediction:
Proper analysis and study of every student’s records will help in understanding the student’s
progress, strengths, weaknesses, interests and more. It will help in determining which career
would be most appropriate for the student in the future.
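The kind of record analysis described above can be sketched with a few lines of standard Python. The subjects and scores below are entirely invented for illustration; a real system would draw on actual student records:

```python
# Hypothetical sketch: inferring a student's strengths from subject scores.
# Subjects and marks are invented data, for illustration only.
from statistics import mean

scores = {
    "mathematics": [82, 88, 91],
    "literature": [64, 70, 67],
    "biology": [75, 79, 81],
}

# Average each subject's marks, then rank subjects from strongest to weakest.
averages = {subject: mean(marks) for subject, marks in scores.items()}
ranked = sorted(averages, key=averages.get, reverse=True)

print(ranked[0])  # the student's strongest subject
```

A real career-prediction pipeline would of course use far richer features (interests, progress over time, attendance), but the principle is the same: summarize the record, then rank.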
The applications of big data have provided a solution to one of the biggest pitfalls in the
education system, that is, the one-size-fits-all fashion of the academic setup, by contributing
to e-learning solutions.
Example:
The University of Alabama has more than 38,000 students and an ocean of data. In the past,
when there were no real solutions to analyse that much data, some of it seemed useless. Now
administrators are able to use analytics and data visualizations on this data to draw out
patterns, revolutionizing the university's operations, recruitment, and retention efforts.
TYPES OF BIG DATA
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed
'structured' data. Over time, computer science has achieved great success in
developing techniques for working with such data (where the format is
well known in advance) and in deriving value from it. However, nowadays we are
facing issues when the size of such data grows to a huge extent, with typical sizes in
the range of multiple zettabytes.
Looking at these figures one can easily understand why the name Big Data is given and
imagine the challenges involved in its storage and processing.
Data stored in a relational database management system is one example of 'structured' data.
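A minimal sketch of what 'structured' means in practice, using Python's built-in SQLite module; the table and column names here are invented for illustration:

```python
# Structured data: rows conforming to a fixed, known-in-advance schema,
# stored in a relational table and queried with SQL.
# Table and column names are illustrative, not from any real system.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Seema R.", "Finance"), (2, "Satish Mane", "Admin")],
)

# Because the format is known in advance, querying is straightforward.
rows = conn.execute(
    "SELECT name FROM employee WHERE dept = 'Finance'"
).fetchall()
print(rows)  # [('Seema R.',)]
```

The fixed schema is precisely what makes this data easy to store, index, and query, and what the unstructured data described next lacks.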
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to
its huge size, unstructured data poses multiple challenges in terms of processing it to
derive value. A typical example of unstructured data is a heterogeneous data
source containing a combination of simple text files, images, videos, etc. Nowadays,
organizations have a wealth of data available to them but, unfortunately, they don't know how
to derive value from it, since this data is in its raw or unstructured form.
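A first, very rough step in processing such a heterogeneous source is simply identifying what each item is. The sketch below does this with the standard-library `mimetypes` module; the file names are invented for illustration:

```python
# Unstructured data often arrives as a mixed bag of files. Before any value
# can be derived, each item must at least be identified by type.
# File names below are illustrative only.
import mimetypes

sources = ["report.txt", "scan.png", "lecture.mp4", "notes"]

kinds = {}
for name in sources:
    guessed, _ = mimetypes.guess_type(name)
    # Items with no recognisable extension stay 'unknown' -- typical of raw data.
    kinds[name] = guessed or "unknown"

print(kinds["scan.png"])  # image/png
```

Real pipelines inspect file contents rather than trusting names, but the point stands: with unstructured data, even "what is this?" is a processing step.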
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is not actually defined with, e.g., a table definition in a relational
DBMS. A typical example of semi-structured data is personal data stored in XML records:
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
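The `<rec>` records above are tagged and self-describing, yet bound to no relational schema; that is what makes them semi-structured. Python's standard XML parser can still extract fields from them, as this sketch shows:

```python
# Parsing the semi-structured <rec> records with the standard library.
# A wrapping <recs> root element is added so the fragment is valid XML.
import xml.etree.ElementTree as ET

xml_data = """<recs>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</recs>"""

root = ET.fromstring(xml_data)
names = [rec.findtext("name") for rec in root.findall("rec")]
print(names)  # ['Seema R.', 'Satish Mane']
```

Note that the tags let us find each field by name even though no table definition exists anywhere, which is exactly the middle ground between structured and unstructured data.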
Please note that web application data, which is unstructured, consists of log files, transaction
history files, etc. OLTP systems, by contrast, are built to work with structured data, wherein
data is stored in relations (tables).
CHARACTERISTICS OF BIG DATA
Volume
The name Big Data itself is related to an enormous size. The size of data plays a very
crucial role in determining its value. Whether particular data can actually be
considered Big Data or not also depends upon its volume.
Variety
Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources of
data considered by most applications.
Velocity
The term 'velocity' refers to the speed of generation of data. How fast data is
generated and processed to meet demand determines the real potential in the data.
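The notion of velocity can be sketched as a sliding-window event counter, the usual first step in measuring how fast data is arriving. The timestamps below are synthetic, for illustration only:

```python
# Velocity sketch: count events that arrived within the last WINDOW seconds.
# Arrival times are synthetic; a real system would use wall-clock timestamps.
from collections import deque

WINDOW = 10        # look-back window in seconds
events = deque()   # timestamps of events still inside the window

for ts in [1, 2, 3, 12, 13]:
    events.append(ts)
    # Evict timestamps that have fallen out of the window.
    while events[0] <= ts - WINDOW:
        events.popleft()

rate = len(events)  # events seen in the last WINDOW seconds
print(rate)
```

When the measured rate outpaces what downstream processing can absorb, the data's "real potential" goes unrealized, which is exactly the velocity challenge described above.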
Variability
This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.