
Mekelle University
MIT
Department of Computer Science and Engineering

Course Title: Introduction to Emerging Technologies

Data Science
Outline
• An Overview of Data Science
• Data Types and their Representation
• Data Value Chain
• Basic Concepts of Big Data
An Overview of Data Science
• Data science is a multi-disciplinary field.
• It uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.
• Data science is a rapidly evolving field that combines statistical analysis, machine learning, and domain expertise to extract valuable insights from data.
An Overview of Data Science…
A typical data science workflow has three stages (a minimal code sketch follows this list):
1. Data Gathering: collect data from various sources, including structured databases, unstructured text, and real-time streams.
2. Data Preprocessing: clean, transform, and prepare the data for analysis, ensuring it is high-quality and ready for modeling.
3. Data Analysis: apply statistical and machine learning techniques to uncover patterns, trends, and insights within the data.
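As a hedged illustration of these three stages, here is a minimal Python sketch, assuming pandas is installed and a hypothetical sales.csv file with "region" and "amount" columns:

```python
import pandas as pd

# 1. Data Gathering: read from a structured source (here, a CSV file).
df = pd.read_csv("sales.csv")

# 2. Data Preprocessing: clean and transform so the data is ready for analysis.
df = df.dropna(subset=["amount"])          # drop rows with missing amounts
df["amount"] = df["amount"].astype(float)  # normalize the numeric type

# 3. Data Analysis: uncover a simple pattern -- total sales per region.
print(df.groupby("region")["amount"].sum())
```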
What are data and information?
• Data is a representation of facts, concepts, or instructions in a formalized manner.
• It should be suitable for communication, interpretation, or processing by humans or electronic machines.
• It can be described as unprocessed facts and figures.
• It is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).

• Information is processed data on which decisions and actions are based.
• It is created from organized, structured, and processed data in a particular context.
Data Processing Cycle
• Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• It consists of three basic steps: input, processing, and output.
• These three steps constitute the data processing cycle.
Data Processing Cycle...
1. Input: the input data is prepared in some convenient form for processing, depending on the processing machine.
• For example, for electronic computers, input data can be recorded on any of several types of storage media, such as a hard disk, CD, or flash disk.
2. Processing: the input data is changed to produce data in a more useful form.
• For example, interest can be calculated on a deposit to a bank, or a summary of sales for the month can be calculated from the sales orders.
3. Output: the result of the processing step is collected (a minimal code sketch follows this list).
• For example, output data may be payroll for employees.
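A minimal sketch of the cycle in Python, using the slide's bank-interest example (the deposit amount and 5% rate are hypothetical):

```python
# Input: data prepared in a convenient form for processing.
deposit = 1000.0       # hypothetical deposit amount
annual_rate = 0.05     # assumed 5% annual interest rate

# Processing: transform the input into a more useful form.
interest = deposit * annual_rate

# Output: collect and present the result of the processing step.
print(f"Interest earned: {interest:.2f}")
```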
Data Types and their Representation
• Data types can be described from diverse perspectives.
• For instance, in computer programming, a data type is simply an attribute of data that tells the compiler how the programmer intends to use the data.

Data Types from a Computer Programming Perspective
• A data type defines the operations that can be performed on the data, though different languages may use different terminology.
• Common data types include (see the sketch after this list):
• Integers (int): used to store whole numbers
• Floating-point numbers (float): used to store real numbers
• Characters (char): used to store a single character
• Booleans (bool): used to store one of two values, true or false
• Alphanumeric strings (string): used to store a combination of characters and numbers
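As a minimal sketch, here is how these types look in Python (the list above is language-agnostic; Python has no separate char type, so a one-character string stands in):

```python
# The common data types from the list above, with Python type hints.
age: int = 25            # integer: whole numbers
price: float = 19.99     # floating-point: real numbers
grade: str = "A"         # character (Python uses a one-character string)
is_valid: bool = True    # boolean: true or false
user_id: str = "user42"  # alphanumeric string: characters and numbers

print(type(age), type(price), type(grade), type(is_valid), type(user_id))
```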
Data Types and their Representation…
Data Types from a Data Analytics Perspective
• From a data analytics point of view, there are three common data types or structures:
1. Structured
2. Semi-structured
3. Unstructured
[Figure: the three data types and metadata]

Data Types and their Representation…
Data Types from a Data Analytics Perspective
1. Structured Data: data that adheres to a pre-defined data model and is therefore straightforward to analyze.
• It conforms to a tabular format with relationships between the different rows and columns.
• Common examples of structured data are Excel files or SQL databases; each of these has structured rows and columns that can be sorted.

2. Semi-structured Data: a form of structured data that
• Does not conform to the formal structure of data models associated with relational databases.
• Contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields.
• It is also known as a self-describing structure.
• Example: JSON and XML (see the sketch below)
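A minimal sketch of such a self-describing structure, using Python's standard json module and a hypothetical customer record; the tags ("name", "orders", "total") separate semantic elements and define a hierarchy of records and fields without a fixed relational schema:

```python
import json

# A hypothetical semi-structured record: nested fields, no fixed table schema.
record = json.loads("""
{
  "name": "Alem",
  "email": "alem@example.com",
  "orders": [
    {"id": 1, "total": 250.0},
    {"id": 2, "total": 99.5}
  ]
}
""")

print(record["orders"][0]["total"])  # navigate the self-describing hierarchy
```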
Data Types and their Representation…
Data Types from a Data Analytics Perspective…
3. Unstructured Data: information that either
• Does not have a predefined data model or is not organized in a pre-defined manner.
• It is typically text-heavy but may contain data such as dates, numbers, and facts as well.
• This results in irregularities and ambiguities.
• For example, audio, video files, or NoSQL databases.
4. Metadata (Data about Data): provides additional information about a specific set of data.
• It is frequently used by Big Data solutions for initial analysis.
• For example, in a set of photographs, metadata could describe when and where the photos were taken (see the sketch below).
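A minimal sketch of metadata as data about data, using a hypothetical photograph record in Python:

```python
# The photo file itself is the data; these descriptive fields are the
# metadata a Big Data solution might use for initial analysis.
photo_metadata = {
    "filename": "beach.jpg",         # hypothetical image file
    "taken_at": "2023-07-15 14:32",  # when the photo was taken
    "location": "Mekelle, Ethiopia", # where the photo was taken
    "camera": "Pixel 6",
}

print(photo_metadata["taken_at"], photo_metadata["location"])
```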
Data Value Chain
• The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Data Value Chain…
Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
• It is one of the major big data challenges in terms of infrastructure requirements.
• The infrastructure required to support the acquisition of big data must:
• Deliver low, predictable latency both in capturing data and in executing queries.
• Handle very high transaction volumes, often in a distributed environment.
• Support flexible and dynamic data structures.
Data Value Chain…
Data Analysis
• It is concerned with making the raw data acquired amenable to use in decision-making as well as domain-specific usage.
• Data analysis involves:
• Exploring, transforming, and modeling data with the goal of highlighting relevant data.
• Synthesizing and extracting useful hidden information with high potential from a business point of view.
• Related areas include (a minimal sketch follows this list):
• Data mining
• Business intelligence
• Machine learning
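As a rough illustration of exploring and modeling data, here is a minimal NumPy sketch that fits a linear trend to hypothetical monthly sales figures:

```python
import numpy as np

# Hypothetical data: sales for six consecutive months.
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([100, 120, 135, 160, 170, 195])

# Model the data: fit a simple linear trend (a basic statistical technique).
slope, intercept = np.polyfit(months, sales, 1)

print(f"Sales grow by about {slope:.1f} units per month")
```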
Data Value Chain…
Data Curation
• It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities such as:
• Content creation
• Selection and classification
• Transformation, validation, and preservation (a minimal sketch of validation follows)
• It is performed by expert curators who are responsible for improving the accessibility and quality of data.
• Data curators hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for purpose.
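A minimal sketch of the validation activity in Python, assuming a hypothetical list of records that a curator must screen for missing values and duplicates:

```python
# Hypothetical records with quality problems a curator must catch.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},             # missing value: fails validation
    {"id": 1, "email": "a@example.com"},  # duplicate id: fails validation
]

seen_ids = set()
valid = []
for rec in records:
    if rec["email"] is None or rec["id"] in seen_ids:
        continue  # reject records that are incomplete or duplicated
    seen_ids.add(rec["id"])
    valid.append(rec)

print(f"{len(valid)} of {len(records)} records passed validation")
```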
Data Value Chain…
Data Storage
• It is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data.
• Relational DBMSs have been the main storage solution for nearly 40 years.
• NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
Data Value Chain…
Data Usage
• It covers the data-driven business activities that need access to data and its analysis, and the tools needed to integrate the data analysis within the business activity.
• Data usage in business decision-making can enhance competitiveness through:
• The reduction of costs
• Increased added value
• Any other parameter that can be measured against existing performance criteria.
Basic Concepts of Big Data
• Big data is a term for the non-traditional strategies and technologies needed to gather, organize, process, and gain insights from large datasets.

What Is Big Data?
• Big data is the term for a collection of large and complex data sets.
• Such collections become difficult to process using on-hand database management tools or traditional data processing applications.
• Big data is characterized by 3Vs and more:
• Volume: large amounts of data (zettabytes/massive datasets)
• Velocity: data is live-streamed or in motion, arriving at speed
• Variety: data comes in many different forms from diverse sources
• Veracity: trustworthiness, accuracy, and quality (can we trust the data? How accurate is it?)
Basic Concepts of Big Data…
[Figure: the characteristics (Vs) of big data]
The Role of Big Data in Data Science
• Enhanced Predictive Modeling
 Big data enables more accurate and sophisticated predictive models, leading to better decision-making.
• Improved Personalization
 The large volume and variety of data allow for more personalized experiences and targeted solutions.
• Real-Time Insights
 The high velocity of big data enables real-time analysis and instant decision-making in dynamic environments.
• Increased Efficiency
 Big data can help optimize business processes, reduce costs, and improve overall operational efficiency.
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Because of the quantities of big data, individual computers are often inadequate for handling the data at most stages.
• To better address the high storage and computational needs of big data, computer clusters are a better fit.
• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
• Resource Pooling: combining the available storage space to hold data is a clear benefit.
 But CPU and memory pooling are also extremely important.
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem
Clustered Computing…
• High Availability: clusters can provide varying levels of fault tolerance and availability guarantees
 To prevent hardware or software failures from affecting access to data and processing.
 This becomes increasingly important as we continue to emphasize real-time analytics.
• Easy Scalability: clusters make it easy to scale horizontally by adding additional machines to the group.
 This means the system can react to changes in resource requirements without expanding the physical resources on a machine.
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem
Clustered Computing…
• Using clusters requires a solution for managing:
 Cluster membership
 Coordinating resource sharing
 Scheduling actual work on individual nodes
• Cluster membership and resource allocation can be handled by software like Hadoop's YARN.
• The assembled computing cluster often acts as a foundation that other software interfaces with to process the data.
• The machines involved in the computing cluster are also typically involved in the management of a distributed storage system.
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.
• It is a framework that allows for the distributed processing of large datasets across clusters of computers.
• The four key characteristics of Hadoop are:
• Economical: highly economical, as ordinary computers can be used for data processing.
• Reliable: it stores copies of the data on different machines and is resistant to hardware failure.
• Scalable: it is easily scalable both horizontally and vertically.
• Flexible: you can store as much structured and unstructured data as you need.
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem…
• Hadoop has an ecosystem that has evolved from its four core components:
• Data management
• Access
• Processing
• Storage
• It is continuously growing to meet the needs of Big Data.
• It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing (see the sketch after this list)
• Spark: in-memory data processing
• HBase: NoSQL database
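To illustrate the MapReduce model named above, here is a minimal single-machine simulation in plain Python; a real Hadoop job distributes the map, shuffle, and reduce phases across the cluster:

```python
from collections import defaultdict

# Hypothetical input: each string stands in for one input record.
documents = ["big data is big", "data science uses big data"]

# Map phase: emit (word, 1) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the pairs by key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 3, 'data': 3, 'is': 1, ...}
```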
Basic Concepts of Big Data…
[Figure: Hadoop Ecosystem]
Note (Study@Home): Hadoop Ecosystem (https://www.geeksforgeeks.org/hadoop-ecosystem/)
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Big Data Life Cycle with Hadoop
1. Ingesting data into the system: in this first stage, data is ingested or transferred to Hadoop from various sources such as relational databases or local files.
2. Processing the data in storage: the data is stored in the distributed file system (HDFS), and NoSQL databases perform data processing.
3. Computing and analyzing data: data is analyzed by processing frameworks such as Pig, Hive, and Impala.
4. Visualizing the results: this Access stage is performed by tools such as Hue and Cloudera Search.
• The analyzed data can then be accessed by users (a minimal sketch follows).
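As a hedged sketch of this life cycle, here is what the ingest, process/analyze, and access steps might look like with Spark (listed in the ecosystem above), assuming PySpark is installed and a hypothetical events.csv file with an event_type column; in a real deployment the data would live in HDFS and the computation would run across the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigDataLifeCycle").getOrCreate()

# 1. Ingest: transfer data into the system from a source file.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# 2-3. Process and analyze: aggregate with a distributed computation.
summary = events.groupBy("event_type").count()

# 4. Access: make the analyzed results available to users.
summary.show()
```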
