UNIT-2
Data science
- It extracts knowledge and insights from structured, semi-structured, and unstructured data.
- Data science is much more than simply analyzing data.
- It offers a range of roles and requires a range of skills.
Data
- Representation of facts or concepts in a form suitable for communication,
interpretation, or processing by humans or electronic machines.
- unprocessed facts and figures.
- Represented with the help of characters such as alphabets, digits, or special characters.
Information
- Processed data on which decisions and actions are based.
- Data that has been processed into a form that is meaningful to the recipient
- It is Interpreted data; created from organized, structured, and processed data
Data processing cycle
- It is the re-structuring or re-ordering of data by people or machines to increase its
usefulness.
- It consists of the following basic steps.
Input: The input data is recorded on a hard disk, CD, flash disk, and so on.
Processing: The input data is changed into a more useful form.
Output: The result of the processing step is collected.
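The three steps above can be sketched in a few lines of Python; the scores and variable names are made-up illustrations, not part of the course material:

```python
# A minimal sketch of the data processing cycle: input -> processing -> output.

raw_scores = ["72", "85", "90"]          # Input: unprocessed facts and figures

numbers = [int(s) for s in raw_scores]   # Processing: convert to a more useful form
average = sum(numbers) / len(numbers)    # Processing: derive information from data

print(f"Average score: {average:.2f}")   # Output: result of processing is collected
```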
Data types and their representation
1. Data types from a computer programming perspective
Common data types include:
• Integers (int) - store whole numbers
• Booleans (bool) - store true or false
• Characters (char) - store a single character
• Floating-point numbers (float) - store real numbers
• Alphanumeric strings - store a combination of characters and numbers
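These data types can be illustrated with simple Python values (the variable names and values are made up for illustration; note that Python has no separate char type, so a one-character string stands in for it):

```python
# Illustrative values for the common programming data types listed above.
count = 42                 # integer (int): whole number
is_valid = True            # boolean (bool): true or false
grade = "A"                # character: a single-character string in Python
price = 19.99              # floating-point number (float): real number
user_id = "user123"        # alphanumeric string: characters and numbers

for value in (count, is_valid, grade, price, user_id):
    print(type(value).__name__, value)
```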
2. Data types from a data analytics perspective:
- There are three common data types or structures:
I. Structured data: conforms to a tabular format with relationships between the different rows
and columns.
Eg: Excel files or SQL tables.
II. Semi-structured data: does not conform to the formal structure of data models,
- but nonetheless contains tags or other markers to separate semantic elements.
Eg: JSON and XML
III. Unstructured data
- Either does not have a predefined data model or is not organized in a pre-defined manner.
- Typically text-heavy, but may contain data such as dates, numbers, and facts, which results in
irregularities and ambiguities.
Eg: audio files, video files, or NoSQL databases
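A short sketch of handling semi-structured data: JSON has no fixed table schema, but its keys act as the tags that separate semantic elements. The record and field names below are invented for illustration:

```python
import json

# A semi-structured record: no fixed schema, but keys mark the semantic parts.
record = '{"name": "Abebe", "age": 30, "skills": ["Python", "SQL"]}'

data = json.loads(record)      # parse the JSON text into a Python dict

print(data["name"])            # access elements by their semantic tag
print(len(data["skills"]))     # nested structure is allowed, unlike a flat table
```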
Metadata - data about data
- It provides additional information about a specific set of data.
- For example, it provides fields for dates and locations which, by themselves, can be considered
structured data.
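A small sketch of the idea: for a photo, the image bytes are the data, while the descriptive fields (dates, locations, size) are metadata, and those fields are themselves structured. All values below are hypothetical:

```python
# Metadata sketch: fields describing a (hypothetical) photo file.
photo_metadata = {
    "filename": "vacation.jpg",
    "date_taken": "2023-06-15",   # a date field: structured on its own
    "location": "Addis Ababa",    # a location field: structured on its own
    "size_bytes": 2_048_576,
}

for field, value in photo_metadata.items():
    print(f"{field}: {value}")
```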
Data value chain
- Describes the information flow within a big data system as a series of steps needed to
generate value and useful insights from data.
1. Data Acquisition
- The process of gathering, filtering, and cleaning data before it is put in a data warehouse.
- Challenges include the infrastructure requirements for high transaction volumes.
2. Data Analysis
- making the raw data useful for decision making.
- Involves exploring, transforming, and modeling data with the goal of highlighting relevant data.
3. Data Curation
- managing data throughout its lifecycle to ensure quality and usability.
- Activities include content creation, selection, classification, validation.
4. Data Storage
- Managing data in a scalable way.
- RDBMSs may not handle big data efficiently: their ACID (Atomicity, Consistency, Isolation, and
Durability) properties, which guarantee database transactions, lack flexibility with regard to
schema changes.
- NoSQL technologies have been designed with scalability as a goal.
5. Data Usage: applying data analysis to business activities to improve performance, reduce costs,
and enhance value.
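The acquisition step of the value chain (gather, filter, clean) can be sketched as follows; the sensor records and field names are made up for illustration:

```python
# Sketch of data acquisition: filter and clean raw records before storage.
raw_records = [
    {"id": 1, "temp": "21.5"},
    {"id": 2, "temp": ""},        # missing reading: filtered out below
    {"id": 3, "temp": "19.0"},
]

cleaned = [
    {"id": r["id"], "temp": float(r["temp"])}   # clean: cast text to a number
    for r in raw_records
    if r["temp"]                                # filter: drop empty readings
]

print(cleaned)
```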
Basic concepts of big data
- Big data is characterized by the 3Vs (and more):
• Volume: large amounts of data (massive datasets, measured in zettabytes)
• Velocity: Data is live streaming or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it?
Clustered computing
- Combines multiple computers to handle large data volumes and computational tasks.
Benefits:
- Resource Pooling: combines available storage space, CPU, etc.
- High Availability: fault tolerance and availability
- Easy Scalability: easy expansion as resource requirements grow
- A good example of clustering software is Hadoop's YARN.
Hadoop and its ecosystem
- four core components: data management, access, processing, and storage
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
- Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage
- In this stage, the data is stored and processed.
- Data is stored in HDFS and in the NoSQL distributed database, HBase.
- Spark and MapReduce perform data processing.
3. Computing and analyzing data
- the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
- Pig converts the data using map and reduce operations and then analyzes it.
4. Visualizing the results
- Performed by tools such as Hue and Cloudera Search.
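The map-and-reduce pattern that MapReduce, Pig, and Hive build on can be sketched in plain Python as a word count; the input lines are invented for illustration, and a real cluster would run the map and reduce phases in parallel across many machines:

```python
from collections import defaultdict

lines = ["big data big insights", "data value chain"]

# Map: emit a (word, 1) pair for every word in every input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}

print(word_counts)
```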