
KCA034: Data Analytics

Syllabus
(Unit-1)
• Introduction to Data Analytics: Sources and nature of data,
classification of data (structured, semi-structured, unstructured),
characteristics of data, introduction to Big Data platform, need of
data analytics, evolution of analytic scalability, analytic process and
tools, analysis vs reporting, modern data analytic tools, applications
of data analytics.
• Data Analytics Lifecycle: Need, key roles for successful analytic
projects, various phases of data analytics lifecycle – discovery, data
preparation, model planning, model building, communicating results,
operationalization
Sources of Data in Data Analytics
• Data can be collected from various sources, broadly categorized into
primary and secondary sources.
• Data collection is the process of acquiring, extracting, and storing large volumes of
data, which may be structured or unstructured (text, video, audio, XML files, records,
or image files), for use in later stages of data analysis.
• Data collection is the first step before analyzing data for patterns or useful
information; the data to be analyzed must be collected from valid sources.
Sources of Data in Data Analytics
• The data initially gathered is known as raw data. Raw data is not directly useful;
once the impurities are cleaned out and the data is processed for further analysis it
becomes information, and interpreted information in turn becomes knowledge.
• The main goal of data collection is to collect information-rich data.
Primary Data Sources (First-Hand Data)
Data collected directly from original sources for specific analysis.
• Surveys & Questionnaires – Customer feedback, opinion polls.
• Interviews & Focus Groups – Direct conversations with stakeholders.
• Observations – Behavioral tracking, CCTV footage.
• Sensor & IoT Data – Smart devices, environmental sensors.
• Transactional Data – Online purchases, financial transactions.
• Web Scraping & APIs – Extracting data from web pages and services.
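To make the API route concrete, here is a minimal Python sketch of pulling primary data from a web service; the endpoint URL and field names are hypothetical placeholders, not a real service.

```python
import requests  # third-party HTTP client (pip install requests)

# Hypothetical REST endpoint returning JSON records of sensor readings.
API_URL = "https://api.example.com/v1/readings"

def fetch_readings(limit=100):
    """Collect primary data from a (hypothetical) web API."""
    response = requests.get(API_URL, params={"limit": limit}, timeout=10)
    response.raise_for_status()   # fail loudly on HTTP errors
    return response.json()        # parsed Python objects from the JSON body

if __name__ == "__main__":
    readings = fetch_readings(limit=10)
    print(f"Collected {len(readings)} records")
```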
Secondary Data Sources (Pre-Existing Data)
• Data obtained from existing databases or external sources.
• Public Datasets – Government portals (e.g., WHO, World Bank).
• Enterprise Databases – CRM, ERP, and other internal systems.
• Social Media – Twitter, Facebook, LinkedIn analytics.
• Research Papers & Reports – Academic journals, market research
firms.
• Cloud Data Repositories – Google BigQuery, AWS Data Exchange.
Nature of Data in Data Analytics
Data can be categorized based on its structure, type, and form.
Based on Structure:
• Structured Data – Organized, stored in relational databases.
• Examples: Sales records, customer information, financial transactions.
• Unstructured Data – Raw, non-tabular, lacks a predefined format.
• Examples: Emails, images, videos, social media posts.
• Semi-Structured Data – Hybrid, contains some structure but not in
relational format
Structured data
• Structured data must always comply with a strict format, known as a
predefined data model or schema.
• Structured data is data that fits neatly into data tables and includes
discrete data types such as numbers, short text, and dates.
• Examples of structured data stores include relational databases and OLAP
cubes.
Structured data examples
• Excel files.
• SQL databases.
• Inventory control.
• Reservation systems.
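To illustrate how structured data fits a predefined schema, here is a small sketch using Python's built-in sqlite3 module; the table and sample rows are invented for illustration.

```python
import sqlite3

# Structured data: a fixed schema (predefined columns and types).
conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("""
    CREATE TABLE sales (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT,
        amount     REAL,
        order_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO sales (customer, amount, order_date) VALUES (?, ?, ?)",
    [("Asha", 250.0, "2024-01-05"), ("Ravi", 120.5, "2024-01-06")],
)

# Because the schema is known in advance, SQL queries are straightforward.
for row in conn.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer"):
    print(row)
conn.close()
```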
Unstructured Data
• Unstructured data is information, in many different forms, that
doesn't follow conventional data models, making it difficult to store
and manage in a mainstream relational database.
• Unstructured data has an internal structure but does not contain a
predetermined data model or schema. It can be textual or non-
textual. It can be human-generated or machine-generated.
Semi-structured Data
• Semi-structured data is a type of data that is not purely structured, but also
not completely unstructured.
• It contains some level of organization or structure, but does not conform
to a rigid schema or data model, and may contain elements that are not
easily categorized or classified.
• Semi-structured data is typically characterized by the use of metadata or
tags that provide additional information about the data elements.
• Similar entities are grouped together and organized in a hierarchy.
• Due to the lack of a well-defined structure, such data cannot easily be processed by
computer programs.

Semi-structured Data
Sources of semi-structured data:
• E-mails
• XML and other markup languages
• Binary executables
• TCP/IP packets
• Zipped files
• Integration of data from different sources
• Web pages
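A short Python sketch of reading semi-structured data with the standard-library json and xml modules; the sample records are made up and only show how tags and keys carry the partial structure.

```python
import json
import xml.etree.ElementTree as ET

# JSON: keys describe each value, but records need not share one rigid schema.
record = json.loads('{"id": 7, "name": "Asha", "tags": ["vip"], "address": {"city": "Pune"}}')
print(record["address"]["city"])     # navigate the nested structure

# XML: markup elements act as the metadata that gives the data its structure.
doc = ET.fromstring("<email><to>team@example.com</to><subject>Report</subject></email>")
print(doc.find("subject").text)
```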
Introduction to Big Data
• Big data refers to extremely large and diverse collections of
structured, unstructured, and semi-structured data that continues to
grow exponentially over time.
• These datasets are so huge and complex in volume, velocity, and
variety that traditional data management systems cannot store,
process, or analyze them.
• Big data describes large and diverse datasets that are huge in volume
and also rapidly grow in size over time.
• Big data is used in machine learning, predictive modeling, and other
advanced analytics to solve business problems and make informed
decisions.
Introduction to Big Data
Typical big data use cases include:
• Tracking consumer behavior and shopping habits.
• Monitoring payment patterns and analyzing them.
• Combining data and information from every stage of an order’s shipment
• Using AI-powered technologies like natural language processing to analyze
unstructured medical data (such as research reports, clinical notes, and lab
results) to gain new insights for improved treatment development and
enhanced patient care
• Using image data from cameras and sensors, as well as GPS data
• Analyzing public datasets of satellite imagery and geospatial datasets to
visualize, monitor, measure, and predict the social and environmental
impacts
Characteristics of Big Data (The 5 Vs)
• Volume: The amount of data is the most defining characteristic of Big
Data. Enterprises collect data from various sources including business
transactions, smart (IoT) devices, industrial equipment, videos, social
media, and more. Dealing with potential petabytes of data requires
specialized storage, management, and analysis technologies.
• Velocity: Data is being generated at unprecedented speeds and must
be dealt with in a timely manner. Velocity refers to the rate at which
data flows from various sources like business processes, machines,
networks, social media feeds, mobile devices, etc.
• Variety: Data comes in various formats – structured data, semi-
structured data, and unstructured data. Handling this variety involves
extracting data and transforming it into a cleaner format for analysis.
• Veracity: The quality of collected data can vary greatly, affecting
accurate analysis. Veracity refers to the uncertainty of data, which can
be due to inconsistency and incompleteness, ambiguities, latency,
deception, and model approximations. Ensuring the veracity of data is
critical as it affects the decision-making process in businesses.

• Value: The final V stands for value. It’s critical to assess whether the
data that is being gathered is actually valuable in decision-making
processes. The main goal of businesses investing in big data
technologies is to extract meaningful insights from collected data that
lead to better decisions and strategic business moves.
How does big data work?
Data Collection
• This first phase involves gathering data from multiple sources, including
databases, social media, and sensors. To have this data at hand, data
engineers will employ web scraping, data feeds, APIs, and data extraction
tools.
Data Storage
• After collection, the data must be stored efficiently for later retrieval and processing.
Widely used options include the Hadoop Distributed File System (HDFS), Google
Cloud Storage, and Amazon S3.
Data Processing
• This is a critical phase in the data lifecycle, where raw data is transformed
into actionable insights. This stage encompasses a series of sophisticated
operations designed to refine and structure the data for analysis. These
operations consist of data cleaning, transformation, and aggregation, each
serving a unique purpose in preparing data for insightful analysis.
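A minimal pandas sketch of the cleaning, transformation, and aggregation operations described above, assuming pandas is installed; the column names and values are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({
    "region": ["North", "North", "South", None],
    "sales":  ["100", "150", "200", "50"],
})

# Cleaning: drop rows with missing values and remove duplicates.
clean = raw.dropna().drop_duplicates().copy()

# Transformation: convert text fields to proper numeric types.
clean["sales"] = clean["sales"].astype(float)

# Aggregation: summarize sales per region for later analysis.
summary = clean.groupby("region")["sales"].sum()
print(summary)
```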
Data Analysis
• This step is a pivotal phase in the big data processing pipeline, where the primary
goal is to distill vast volumes of complex data into actionable insights and
discernible patterns.
• This phase leverages sophisticated methodologies and technologies, including
machine learning algorithms, data mining techniques, and advanced visualization
tools, to unearth valuable information hidden within the data.
Data Quality Assurance
• Data quality assurance (DQA) comes next to ensure the reliability and
effectiveness of data used across various business operations and decision-
making processes. It encompasses a comprehensive approach to maintaining high
data governance standards, accuracy, consistency, integrity, relevance, and
security.
Data Management
• Data management covers a comprehensive set of disciplines and practices
dedicated to the proper handling, maintenance, and utilization of data. Effective
data management ensures that data is not only accurate and accessible but also
secure and compliant with relevant regulations.
What Is Data Analytics?
• Data analytics is defined as a set of processes, tools, and technologies
that help manage qualitative and quantitative data to enable
discovery, simplify organization, support governance, and generate
insights for a business.
• Data analytics is a discipline that involves analyzing data sets to get
information that would help solve problems in different sectors. It
employs several disciplines like computer programming, statistics, and
mathematics, to give accurate data analysis.
• The goal of data analytics can be to describe, predict, or improve
organizational performance. Analysts achieve this using advanced data
management techniques such as data modeling, data mining, and data
transformation to describe, predict, and solve present and future problems.
How does data analytics work?
Data analytics involves a series of steps to give an accurate analysis.
1. Data collection
The first step is to identify the data you need for the analyses and
assemble it for use. If the data are from different source systems, the
data analyst would have to combine the different data using data
integration routines.
2. Adjusting data quality
The next step is finding and correcting data quality problems in the
collected data. Data quality problems include inconsistencies, errors,
and duplicate entries. They are resolved by running data profiling and
data cleansing tasks.
3. Building an analytical model
• Moving forward, the data analyst works with data scientists to build
analytical models that would run accurate analyses. These models are
built using analytical software, like predictive modeling tools, and
programming languages like Python, Scala, R, and Structured Query
Language (SQL).
4. Presentation
• The final step in data analytics is presenting the models’ results to the
end-users and business executives. It is best practice to use tools like
charts and infographics for presentations.
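As a rough sketch of steps 3 and 4 above (building an analytical model and presenting the results), assuming scikit-learn and matplotlib are available and using invented ad-spend and sales figures:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Step 3: build a simple predictive model (ad spend -> sales, made-up numbers).
ad_spend = np.array([[10], [20], [30], [40], [50]])
sales = np.array([25, 45, 62, 85, 101])
model = LinearRegression().fit(ad_spend, sales)

# Step 4: present the result to end-users as a chart rather than raw numbers.
plt.scatter(ad_spend, sales, label="observed")
plt.plot(ad_spend, model.predict(ad_spend), label="fitted trend")
plt.xlabel("Ad spend")
plt.ylabel("Sales")
plt.legend()
plt.savefig("sales_model.png")   # figure to share with stakeholders
```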
Importance (Need) of Data Analytics
1. Reduce the cost of operation
Data analytics helps reduce the cost of operations in several ways by
improving efficiency, minimizing waste, and optimizing resource allocation.
2. Predict future trends
Organizations can predict future trends and innovations with data analytics.
Using predictive analysis tools, organizations can develop future-focused
products and services and stay at the top of their market.
3. Monitor product performance
Data analytics is used in tracking customers’ behavior towards products or
services. We can use it to identify why sales are low, what products people
buy, why they are buying them, how much they are spending on these
products, how we can sell our products better, and many other queries.
4. Strengthen security
Businesses use data analytics to examine past security breaches and
diagnose the vulnerabilities that led to these breaches.
Evolution of Analytic Scalability
The evolution of analytic scalability can be traced through several stages, driven by
advancements in technology, data storage, and processing capabilities.
Manual Data Analysis (Pre-Computer Era):
• Before computers, data analysis was a manual and labor-intensive process.
• Data was limited to what could be processed by hand, making it difficult to
analyze large datasets.
• Analytical capabilities were constrained by human limitations, and insights were
limited to the data that could be readily examined.
Relational Databases (1970s-1980s):
• The introduction of relational databases revolutionized data management and analysis.
• These databases provided a structured way to store and retrieve data using SQL
queries.
• While they improved data organization, they had limitations in handling massive
datasets due to hardware constraints.
Data Warehousing (1990s):
• Data warehousing emerged as a solution to address the limitations of relational databases in
handling large volumes of data.
• Data warehouses aggregated and stored data from multiple sources in a structured format.

Massively Parallel Processing (MPP) Databases (2000s):
• MPP databases were designed to process and analyze large datasets by distributing the workload
across multiple nodes in a cluster.
• This approach improved query performance and allowed for better scalability to handle big data.
Hadoop and MapReduce (Mid-2000s):
• Hadoop and MapReduce introduced the concept of distributed computing for big data analytics.
• Hadoop's distributed file system (HDFS) allowed data to be stored and processed across a cluster
of commodity hardware.
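To illustrate the MapReduce idea in miniature, without an actual Hadoop cluster, here is a plain-Python sketch of the map, shuffle, and reduce steps for a word count:

```python
from collections import defaultdict

documents = ["big data needs big storage", "data drives decisions"]

# Map: emit (key, value) pairs from each input record independently.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values that share the same key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each key's values into a single result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, ...}
```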
NoSQL Databases (2010s):
• NoSQL databases, like MongoDB and Cassandra, were developed to handle unstructured
and semi-structured data at scale.
• They offered horizontal scalability and flexible data models to accommodate the variety
and volume of big data.
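A minimal sketch of storing flexible documents in a NoSQL database using the pymongo driver for MongoDB; it assumes a MongoDB server is running locally, and the database and collection names are invented.

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
events = client["analytics_demo"]["events"]   # hypothetical database/collection

# Documents in the same collection need not share a rigid schema.
events.insert_one({"user": "u42", "action": "click", "tags": ["promo"]})
events.insert_one({"user": "u43", "action": "purchase", "amount": 19.99})

# Query by field; horizontal scaling (sharding) is handled by the database.
for doc in events.find({"action": "purchase"}):
    print(doc)
```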
Cloud Computing (Present):
• Cloud computing platforms, such as Amazon Web Services (AWS), Google Cloud Platform
(GCP), and Microsoft Azure, have democratized big data analytics.
Advanced Analytical Tools and AI (Present):
• Modern analytics has been complemented by advanced analytical tools and artificial
intelligence (AI).
• Machine learning and AI algorithms can analyze vast datasets to derive insights and
make predictions.
Analytic Process
• An "analytic process" refers to a structured series of steps used to
gather, clean, analyze, and interpret data to gain insights and make
informed decisions.
• An "analytic tool" is a software application that helps facilitate this
process by providing functionalities to manipulate, visualize, and
analyze data effectively; essentially, the tool is the software used to
execute the analytic process.

Key components of an analytic process
• Define the business objective: Clearly state the question or problem you
want to answer with the analysis.
• Data collection: Gather relevant data from various sources.
• Data cleaning and preparation: Organize, format, and handle missing or
inconsistent data.
• Exploratory data analysis (EDA): Visualize and summarize key trends and
patterns in the data.
• Model building and selection: Choose appropriate statistical or machine
learning techniques to analyze the data.
• Interpretation and insights: Explain the results and draw meaningful
conclusions from the analysis.
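As an illustration of the exploratory data analysis and interpretation steps in the process above, here is a brief pandas sketch over a small, made-up churn dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "tenure_months": [3, 24, 12, 36, 1, 48],
    "monthly_spend": [20.0, 55.0, 35.0, 60.0, 15.0, 70.0],
    "churned":       [1, 0, 0, 0, 1, 0],
})

# EDA: summary statistics and simple relationships reveal trends and patterns.
print(df.describe())                                   # central tendency and spread
print(df.corr())                                       # correlations between variables
print(df.groupby("churned")["monthly_spend"].mean())   # compare churned vs. retained
```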
Examples of Analytic tools

• General data analysis tools: Microsoft Excel, Google Sheets, Tableau, Power BI
• Statistical analysis tools: R, SPSS
• Machine learning platforms: Python with libraries, TensorFlow
• Data visualization tools: Plotly, D3.js
Reporting Vs Analysis
• Reporting summarizes and presents what has already happened, typically as dashboards,
scheduled reports, and charts built from past data.
• Analysis goes further: it explains why something happened and what to do next, using
exploration, statistics, and modeling to support decisions.
Modern Data Analytics Tools
• A modern data analytics platform leverages the benefits of natural
language search, artificial intelligence, and advanced machine
learning models.
• This helps users interact with data quickly and effortlessly to extract
insights, without any dependencies or delays.
• A modern data analytics platform comprises capabilities to extract,
store, process, analyze, and present data.
• These cloud-based SaaS solutions are easy to set up and offer the
benefits of scalability, flexibility, and agility.
Key Components of Modern Data Analytics
Modern data analytics comprises the following components:
• Data Pipeline that connects and ingests data from different sources
• Data Warehouse that stores all collected data
• Data Transformation tool that makes data user-friendly and easy to
query
• Data Visualization and BI to present extracted insights in an easily
understandable way
• Data Governance policies to manage data security and access
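The pipeline, transformation, and warehouse components above can be illustrated with a toy extract-transform-load script in Python, using pandas for transformation and an in-memory SQLite database standing in for the warehouse; the table and column names are hypothetical.

```python
import sqlite3
import pandas as pd

# Extract: in practice read from a source system (e.g., pd.read_csv);
# inline data here keeps the example self-contained.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount":   ["100", "250", "75"],
    "country":  ["IN", "in", "US"],
})

# Transform: make the data consistent and easy to query.
orders["amount"] = orders["amount"].astype(float)
orders["country"] = orders["country"].str.upper()

# Load: store it in the "warehouse" so BI tools can query it.
warehouse = sqlite3.connect(":memory:")
orders.to_sql("fact_orders", warehouse, index=False)
print(pd.read_sql(
    "SELECT country, SUM(amount) AS revenue FROM fact_orders GROUP BY country",
    warehouse,
))
```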
Criteria for Choosing Data Analysis Tools
Selecting the right data analysis tools involves considering several critical factors:
• Type of Data and Complexity: The nature of the collected data, which may be structured,
unstructured, or very large, is a key factor in choosing a tool. Software such as Apache
Hadoop best serves large datasets, while Excel is reasonable for modest tasks.
• Types of Data Analysis Required: Different tools suit different kinds of analysis. For
example, Python and R are well suited to exploratory data analysis and machine learning,
whereas Tableau is geared toward visualization and reporting.
• User Expertise and Technical Skills: Assess the overall technical expertise of the team.
Microsoft Excel and Tableau are often preferred because they are easy to use, while
Python and R require some knowledge of computer programming.
• Scalability and Performance: Determine whether the tool can handle your current and
future data volumes and whether it performs efficiently on large amounts of data.
Top Data Analysis Tools
Microsoft Excel
• Microsoft Excel is a very common spreadsheet with advanced calculating
and graphical tools.
• It includes tools such as pivot tables and conditional formatting, as well as
built-in formulas for statistical, financial, and logical calculations.
• Microsoft Excel allows importing content from other sources and integrates with
other Office programs.
• Typical simple analytics include sorting, filtering, and aggregating data.
• Fast data analysis and data visualization with the help of charts and graphs.
• Possesses a user-friendly GUI, is well-documented, and is available with a
large number of users.
• Limitations: inability to handle very large datasets, weak support for advanced
statistics and machine learning, and excessive memory use when handling very large
models.
Apache Hadoop
• Apache Hadoop is an open-source Big Data platform, which is responsible for
distributed storage and processing using the MapReduce programming model.
• It consists of four main modules: Hadoop Common, Hadoop Distributed File
System (HDFS), Hadoop YARN, and Hadoop MapReduce.
• Hadoop also has tools like Apache Hive and Apache HBase, which make it more
powerful for processing and analyzing data.
• Processing and storing massive amounts of structured and unstructured data
across distributed machines.
• It runs in distributed environments where work is partitioned into sub-tasks
that execute in parallel.
• Analytical intelligence in industries such as banking, medical services, and retail.
• Scalability to manage data increases, cost-effectiveness in data storage, fault
tolerance of data over the network, and flexibility in data storage management.
• Limitations: To install and maintain the solution the organization needs highly
qualified engineers who can manage it.
IBM SPSS (Statistical Package for the Social Sciences)
• SPSS (Statistical Package for the Social Sciences) is an advanced analytical
software solution for statistical analysis in business. It offers statistical
analysis, data editing, and documentation capabilities.
• SPSS is one of the most preferred statistical analytic software in academia
for social science research because of the large number of statistical tests
required in such studies and the software’s user-friendliness.
• SPSS is suitable for processing and analyzing survey data, which is a major
component of work in market research, health research, and education
research fields.
• As a statistical data analysis tool, SPSS can use past data to identify trends
and predict future occurrences.
• Strengths: a friendly GUI, a rich set of statistical functions, robustness on large
datasets, and strong support for survey data analysis.
Google Data Studio
• Google Data Studio is a business intelligence application that provides
free report and dashboard creation and sharing.
• Users can independently build charts, use templates, and drag and drop
elements to simplify their work.
• Well suited to developing visually appealing and interactive reports and
dashboards.
• Works with and displays data from different sources, especially those from
the Google ecosystem.
• Using descriptive statistics procedures to analyze data for trends and
patterns.
• Easy to use, supports live sharing, is free of charge, and is well supported
across Google applications.
Applications of Data Analytics
Data analytics has a wide range of applications across industries. Here
are some key areas where it is used:
Business & Marketing
• Customer segmentation – Identifying different customer groups based on
purchasing behavior.
• Sales forecasting – Predicting future sales trends.
Healthcare
• Predictive analytics – Forecasting disease outbreaks
• Medical imaging analysis – Using AI to detect anomalies in X-rays and MRIs.
• Operational efficiency – Optimizing hospital resource allocation.
Applications of Data Analytics
Finance & Banking
• Fraud detection – Identifying suspicious transactions.
• Credit risk analysis – Evaluating loan applicants based on historical data.
• Customer sentiment analysis – Understanding public opinions on financial
services.
Sports & Entertainment
• Performance analytics – Analyzing player performance for strategy
improvements.
• Ticket pricing optimization – Adjusting ticket prices based on demand.
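As a sketch of one application listed above, customer segmentation, assuming scikit-learn is available and using made-up per-customer purchase features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features: [orders per year, average order value].
customers = np.array([
    [2, 20], [3, 25], [2, 22],      # occasional, low-spend shoppers
    [20, 30], [25, 28],             # frequent, mid-spend shoppers
    [5, 200], [6, 220],             # rare but high-value shoppers
])

# Group customers into 3 behavioral segments based on purchasing behavior.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print(segments)   # cluster label per customer (label numbering is arbitrary)
```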
Data Analytics Life Cycle
Phase 1: Discovery -
• The team is trained and researches the issue.
• Create context and gain understanding.
• Learn about the data sources that are needed and accessible related to the
problem.
• The team comes up with an initial hypothesis, which can be later confirmed
with evidence.
Phase 2: Data Preparation -
• The team explores options for pre-processing, analysing, and preparing the data
before analysis and modelling.
• Data preparation tasks may be repeated and do not follow a fixed sequence.
• Tools commonly used in this phase include Hadoop and similar platforms.
Phase 3: Model Planning -
• The team studies data to discover the connections between variables.
Later, it selects the most significant variables as well as the most
effective models.
• In this phase, the data science team creates datasets that can be used for
training, testing, and production purposes.
Phase 4: Model Building -
• The team creates datasets for training, testing as well as production
use.
• The team builds the model and checks whether its current tools are sufficient to
run it, or whether a more robust environment is required.
• Free or open-source tools such as R are commonly used; commercial tools such as
MATLAB are also available.
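A minimal scikit-learn sketch of Phases 3-4, creating training and test datasets and checking a model's performance; the feature matrix here is random stand-in data, not a real project dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: 200 samples, 4 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Create the training and testing datasets described in Phases 3-4.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Build the model on the training set and evaluate it on the held-out test set.
model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```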
Phase 5: Communication Results -
• Following the execution of the model, team members will need to evaluate
the outcomes of the model to establish criteria for the success or failure of
the model.
• The team considers how best to present findings and outcomes to the various
stakeholders. It should identify the most important findings, quantify their value to
the business, and create a narrative to present and summarize them for all
stakeholders.
Phase 6: Operationalize -
• The team sets up a pilot project that deploys the work in a controlled manner
before expanding it to the full user base across the enterprise.
• This technique allows the team to gain insight into the performance and
constraints related to the model within a production setting at a small scale
and then make necessary adjustments before full deployment.
