Data Science Fundamentals and Concepts

UNIT-2

Data science

- It extracts knowledge and insights from structured, semi-structured, and unstructured data.
- Data science is much more than simply analyzing data.
- It offers a range of roles and requires a range of skills.

Data

- Representation of facts and concepts in a form suitable for communication, interpretation, or processing by humans or electronic machines.
- Unprocessed facts and figures.
- Represented with the help of characters such as alphabets, digits, or special characters.

Information

- Processed data on which decisions and actions are based.
- Data that has been processed into a form that is meaningful to the recipient.
- Interpreted data; created from organized, structured, and processed data.

Data processing cycle

- It is the restructuring or reordering of data by people or machines to increase its usefulness.
- It consists of the following basic steps.

Input: the input data is recorded on a hard disk, CD, flash disk, and so on.

Processing: the input data is changed into a more useful form.

Output: the result of the processing step is collected.

Data types and their representation

1. Data types from a computer programming perspective

Common data types include:


• Integers (int): store whole numbers
• Booleans (bool): store true or false values
• Characters (char): store a single character
• Floating-point numbers (float): store real numbers
• Alphanumeric strings: store a combination of characters and numbers
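The types above can be illustrated in Python (a sketch; Python has no separate char type, so a one-character string stands in for it, and all the example values are invented):

```python
age = 42               # integer (int): whole number
is_valid = True        # Boolean (bool): true or false
grade = "A"            # character (char): a one-character string in Python
price = 19.99          # floating-point number (float): real number
user_id = "user_123"   # alphanumeric string: letters and digits combined

# Inspect the type each value gets at runtime.
for value in (age, is_valid, grade, price, user_id):
    print(type(value).__name__, "->", value)
```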

2. Data types from a data analytics perspective:

- There are three common types of data or data structures:
I. Structured data: conforms to a tabular format with a relationship between the different rows and columns.
Eg: Excel files or SQL tables.

II. Semi-structured data: does not conform to the formal structure of data models,
- but nonetheless contains tags or other markers to separate semantic elements.
Eg: JSON and XML.

III. Unstructured data

- Either does not have a predefined data model or is not organized in a predefined manner.
- It is typically text-heavy but may contain data such as dates, numbers, and facts; this results in irregularities and ambiguities.
Eg: audio files, video files, or NoSQL data.
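As a small illustration of the three categories, the Python sketch below uses made-up records (all names and values are invented for the example):

```python
import json

# Semi-structured: no fixed schema, but tags/markers (here, JSON keys)
# separate the semantic elements.
record = '{"name": "Alice", "age": 30, "tags": ["ml", "stats"]}'
parsed = json.loads(record)
print(parsed["name"], parsed["tags"])

# Structured: tabular; every row has the same columns,
# like a row in an Excel sheet or a SQL table.
rows = [("Alice", 30), ("Bob", 25)]

# Unstructured: free text with no predefined model; it still contains
# dates, numbers, and facts, but with no regular structure.
note = "Met Alice on 2024-03-01; discussed the ML project."
```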

Metadata (data about data)

- It provides additional information about a specific set of data.
- Provides fields such as dates and locations which, by themselves, can be considered structured data.
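A minimal sketch of metadata in Python, using a hypothetical photo file (all field names and values are invented): the image pixels themselves are unstructured, but the metadata fields about them are structured.

```python
# Metadata for a hypothetical photo file: data about the data.
photo_metadata = {
    "filename": "beach.jpg",          # invented example values
    "taken_on": "2024-07-15",         # a date field: structured by itself
    "location": "40.7128,-74.0060",   # a location field: structured by itself
    "size_bytes": 2_048_576,
}

for field, value in photo_metadata.items():
    print(f"{field}: {value}")
```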

Data value chain

- Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.

1. Data Acquisition
- The process of gathering, filtering, and cleaning data before it is put into a data warehouse.
- Challenges include the infrastructure requirements for handling high transaction volumes.

2. Data Analysis
- making the raw data useful for decision making.
- Involves exploring, transforming, and modeling data with the goal of highlighting relevant data.

3. Data Curation
- managing data throughout its lifecycle to ensure quality and usability.
- Activities include content creation, selection, classification, validation.

4. Data Storage
- Managing data in a scalable way.
- RDBMSs may not handle big data efficiently: the ACID (Atomicity, Consistency, Isolation, and Durability) properties that guarantee database transactions lack flexibility with regard to schema changes.
- NoSQL technologies have been designed with scalability as a goal.
5. Data Usage: applying data analysis to business activities to improve performance, reduce costs, and enhance value.
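The five steps can be sketched as a toy Python pipeline over an in-memory dataset (all function names and data are illustrative, not a real framework):

```python
def acquire():
    # 1. Acquisition: gather raw records, filter out empty ones.
    raw = [" 42 ", "oops", "17", ""]
    return [r.strip() for r in raw if r.strip()]

def analyze(records):
    # 2. Analysis: transform raw strings into usable numeric values.
    return [int(r) for r in records if r.isdigit()]

def curate(values):
    # 3. Curation: validate for quality (keep values in a sane range).
    return [v for v in values if 0 <= v <= 100]

# 4. Storage: persist the curated data (here, just an in-memory list).
store = []
store.extend(curate(analyze(acquire())))

# 5. Usage: feed a decision, e.g. report the average.
print("usable values:", store)
print("average:", sum(store) / len(store))
```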

Basic concepts of big data


- Big data is characterized by the 3Vs and more:
• Volume: large amounts of data (zettabytes / massive datasets)
• Velocity: data is live-streaming or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it?

Clustered computing

- Combines multiple computers to handle large data volumes and computational tasks.

Benefits:

- Resource Pooling: combine available storage space, CPU, …
- High Availability: fault tolerance and availability
- Easy Scalability: easy expansion when resource requirements grow
- A good example of clustering software is Hadoop's YARN.

Hadoop and its ecosystem

- Hadoop has four core components: data management, access, processing, and storage.

Big Data Life Cycle with Hadoop

1. Ingesting data into the system


- Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.

2. Processing the data in storage


- In this stage, the data is stored and processed.
- The data is stored in HDFS or in HBase, the NoSQL distributed database.
- Spark and MapReduce perform the data processing.
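To illustrate the map-and-reduce pattern that Spark and MapReduce apply at cluster scale, here is a minimal single-machine word count in plain Python (the input lines are made up):

```python
from collections import defaultdict

lines = ["big data big insights", "data in motion"]

# Map: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)
```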

3. Computing and analyzing data


- The data is analyzed by processing frameworks such as Pig, Hive, and Impala.
- Pig converts the data using map and reduce operations and then analyzes it.

4. Visualizing the results


- Performed by tools such as Hue and Cloudera Search.
