Applied Data Science

Strictly as per the New Syllabus (REV-2019 'C' Scheme) of Mumbai University w.e.f. academic year 2022-2023
Semester 8 : Computer Engineering (Course Code : CSDC8013) (Department Optional Course - 5)

Prof. R. M. Baphana
Adjunct Faculty, Government College of Engineering, Pune (C.O.E.P.) (Teaching M.Tech & Ph.D. Students)

Prof. K. S. Londhe
Formerly Assistant Professor, Department of Computer Engineering, JSPM's Imperial College of Engineering and Research, Wagholi, Pune

Tech-Neo Publications - A Sachin Shah Venture

Preface

Dear students,

We are extremely happy to come out with this edition of "Applied Data Science" for you. We have divided the subject into small chapters so that the topics can be arranged and understood properly. The topics within the chapters have been arranged in a proper sequence to ensure a smooth flow of the subject.

We are thankful to Shri. Sachin Shah for the encouragement and support that he has extended. We are also thankful to the staff members of Tech-Neo Publications and others for their efforts to make this book as good as it is. We have made every possible effort to eliminate all the errors in this book. However, if you find any, please let us know, because that will help us to improve further. We are also thankful to family members and friends for their patience and encouragement.

Syllabus : University of Mumbai - Applied Data Science (Course Code : CSDC8013) (Department Optional Course - 5)

Prerequisite : Machine Learning, Data Structures & Algorithms

Course Objectives
1. To introduce students to the basic concepts of data science.
2. To acquire an in-depth understanding of data exploration and data visualization.
3. To be familiar with various anomaly detection techniques.
4. To understand the data science techniques for different applications.

Course Outcomes
After successful completion of the course, students will be able to :
1. Gain fundamental knowledge of the data science process.
2. Apply data exploration and visualization techniques.
3. Apply anomaly detection techniques.
4. Gain an in-depth understanding of time-series forecasting.
5. Apply different methodologies and evaluation strategies.
6. Apply data science techniques to real-world applications.

Module 1 : Introduction to Data Science
Introduction to Data Science, Data Science Process. Motivation to use Data Science Techniques: Volume, Dimensions and Complexity. Data Science Tasks and Examples. Overview of Data Preparation, Modeling. Difference between data science and data analytics. (Refer Chapter 1)

Module 2 : Data Exploration
Types of data, Properties of data. Descriptive Statistics - Univariate Exploration: Measure of Central Tendency, Measure of Spread, Symmetry, Skewness (Karl Pearson Coefficient of Skewness, Bowley's Coefficient), Kurtosis. Multivariate Exploration: Central Data Point, Correlation, Different forms of correlation, Karl Pearson Correlation Coefficient for bivariate distribution. Inferential Statistics - Overview of various forms of distributions: Normal, Poisson; Test Hypothesis, Central Limit Theorem, Confidence Interval, Z-test, t-test, Type-I and Type-II Errors, ANOVA.
(Refer Chapter 2)

Module 3 : Methodology and Data Visualization
Methodology: Overview of model building, Cross Validation, K-fold cross validation, leave-1 out, Bootstrapping. Data Visualization - Univariate Visualization: Histogram, Quartile, Distribution Chart. Multivariate Visualization: Scatter Plot, Scatter Matrix, Bubble Chart, Density Chart. Roadmap for Data Exploration.
Self-Learning Topics : Visualizing high dimensional data: Parallel chart, Deviation chart, Andrews Curves. (Refer Chapter 3)

Module 4 : Anomaly Detection
4.1 Outliers, Causes of Outliers, Anomaly detection techniques, Outlier Detection using Statistics.
4.2 Outlier Detection using distance-based methods, Outlier detection using density-based methods, SMOTE. (Refer Chapter 4)

Module 5 : Time Series Forecasting
Taxonomy of Time Series Forecasting methods, Time Series Decomposition. Smoothening Methods: Average method, Moving Average smoothing, Time series analysis using linear regression, ARIMA Model. Performance Evaluation: Mean Absolute Error, Root Mean Square Error, Mean Absolute Percentage Error, Mean Absolute Scaled Error.
Self-Learning Topics : Evaluation parameters for classification, regression and clustering. (Refer Chapter 5)

Module 6 : Applications of Data Science
Predictive Modeling: House price prediction, Fraud Detection. Clustering: Customer Segmentation. Time series forecasting: Weather Forecasting. Recommendation engines: Product recommendation. (Refer Chapter 6)

Assessment

Internal Assessment : Assessment consists of two class tests of 20 marks each. The first class test is to be conducted when approx. 40% of the syllabus is completed, and the second class test when an additional 40% of the syllabus is completed. Duration of each test shall be one hour.

End Semester Theory Examination : The question paper will comprise a total of six questions. All questions carry equal marks. Questions will be mixed in nature (for example, if Q.2 has part (a) from module 3, then part (b) will be from any module other than module 3). Only four questions need to be solved.

Contents
Chapter 1 : Introduction to Data Science ... 1-1 to 1-25
Chapter 2 : Data Exploration ... 2-1 to 2-112
Chapter 3 : Methodology and Data Visualization ... 3-1 to 3-55
Chapter 4 : Anomaly Detection ... 4-1 to 4-21
Chapter 5 : Time Series Forecasting ... 5-1 to 5-25

APPLIED DATA SCIENCE LAB : Please download LAB PRACTICALS from the Tech-Neo website, www.techneobooks.in

CHAPTER 1 : Introduction to Data Science

Introduction to Data Science, Data Science Process. Motivation to use Data Science Techniques: Volume, Dimensions and Complexity. Data Science Tasks and Examples. Overview of Data Preparation, Modeling. Difference between data science and data analytics.

1.1 Introduction to Data Science and Big Data
    1.1.1 Introduction to Data Science
          GQ. Explain data science in brief. (4 Marks)
    1.1.2 Introduction to Big Data
          GQ. Write a short note on Big Data. (4 Marks)
1.2 Defining Data Science and Big Data
    GQ. Define the term data science. (2 Marks)
    GQ. Define Big Data. (2 Marks)
1.3 The Requisite Skill Set in Data Science
    GQ. Explain requisite skill set in data science. (4 Marks)
1.4 5 V's of Big Data
1.5 Data Science Life Cycle
1.6 Data : Data Types, Data Collection
    1.6.1 Methods of Collecting Primary Data
1.7 Data Analytic Life Cycle : Overview
    1.7.1 Phase 1 - Discovery Phase
    1.7.2 Phase 2 - Data Preparation
    1.7.3 Phase 3 - Model Planning
    1.7.4 Phase 4 - Model Building
    1.7.5 Phase 5 - Communicate Results
    1.7.6 Phase 6 - Operationalize
1.8 Modeling
    1.8.1 Purpose of Data Modeling
    1.8.2 Different Types of Data Models
1.9 Difference between Data Science and Data Analytics
1.10 Case Study - GINA : Global Innovation Network and Analysis
    GQ. Write a short note on the case of GINA. (8 Marks)
    1.10.1 Phase 1 - Discovery
    1.10.2 Phase 2 - Data Preparation
    1.10.3 Phase 3 - Model Planning
Chapter Ends

1.1 INTRODUCTION TO DATA SCIENCE AND BIG DATA

1.1.1 Introduction to Data Science

• Data science is a process of examining where information can be taken from, what it signifies and how it can be converted into a useful resource in the creation of business and IT strategies.
• Mining huge quantities of structured and unstructured data to recognize patterns can help an organization reduce costs, raise efficiency, identify new market opportunities and enhance its competitive advantage.
• The data science field draws on the disciplines of mathematics, statistics and computer science, and includes methods like machine learning, cluster analysis, data mining and visualization.

[Fig. 1.1.1 : Data science as the intersection of domain expertise, math and statistics, a hacker mindset, the scientific/engineering method, and advanced computing and visualization]

• As the amount of data generated by typical modern businesses keeps increasing, the importance of data scientists increases with it.
• The task of data scientists is to convert the organization's raw data into useful information.
• Data extraction is a method of retrieving particular data from unstructured or badly structured data sources for further processing and analysis, as illustrated in the sketch below.
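As a tiny illustration of data extraction, the snippet below pulls email addresses out of badly structured text using Python's standard re module. The text and the pattern here are hypothetical examples for illustration, not a general-purpose extractor.

import re

# Hypothetical badly structured log text
raw = "ticket 4812 raised by asha@example.com; escalate to ravi@example.com ASAP"

# Extract just the email addresses for further processing
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", raw)
print(emails)   # ['asha@example.com', 'ravi@example.com']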
Example : Examples of Machine learning applications in the data science field are image recognition and speech recognition, self-driving vehicles etc. %_1.1.2 Introduction to Big Data ‘RQ. What is Big data ? Explain characteristics of big data. 1.6Q, Write short note on Bi Data? Now a day the amount of data created by various advanced technologies like Social networking sites, E-commerce etc. is very large. It is really difficult to store such huge data by using the traditional data storage facilities. Until 2003, the size of data produced was 5 billion gigabytes. If this data is stored in the form of disks it may fill an entire football field. In 2011, the same amount of data was created in every two days and in 2013 it was created in every ten minutes. This is really tremendous rate. In this topic, we will discuss about big data on a fundamental level and define common concepts related to big data. We will also see in deep about some of the processes and technologies currently being used in this field. (MU - New Syllabus w.e. academic year 22-23)(M8-79) [al rech-neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Science)...Page No. (1-5) «= Big Data 1 Definition : Big data means huge amount of data, itis a collection of large datasets that cannot be Processed using traditional computing techniques. Big Data is complex and difficult to store, maintain or access in regular file system. Big Data becomes a complete subject, which involves different techniques, tools, and frameworks 5 Sources of big data ‘There are various sources of big data. Now a days in number of fields such huge data get created, Following are the some of fields : Sources of big data 1. Stock Exchange | 2. Social Media Data 3. Video sharing portals il 4, Search Engine Data 5. Transport Data 6. Banking Data Fig. 1.1.2 : Sourees of big data > (@) Stock Exchange : The data in the share market regarding information about prices and status details of shares of thousands of companies is very huge. > (2) Social Media Data : The data of social networking sites contains information about all the account holders, their posts, chat history, advertisements etc. On topmost sites like facebook and whatsapp, there are literally billions of users. > (3) Video sharing portals : Video sharing portals like youtube, Vimeo etc. contains millions of Videos each of which requires lots of memory to store. > (Search Engine Data : The search engines like Google and Yahoo holds lot much of metadata regarding various sites. > (8) Transport Data : Transport data contains information about model, capacity, distance and availabilty of various vehicles. » (6) Banking Data : The big giants in banking domain like SBI or ICICI hold large amount of data regarding huge transactions of account holders. (MU - New Syllabus w.ef academic year 22-23)(M8-79) & Tech-Neo Publications Page No. (1-6) Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Scienc =F Categories of Data The data can be categorized in three types : Categories of data 1. Structured data 2. Semi-structred data 3, Unstructured data Fig. 1.1.3 : Categories of data > (4) Structured Data This type of data is stored in relations (tables) in Relational Database Management System. > (2) Semi-structured Data This type of data is neither raw data nor typed data in a conventional database system. A lot of data found on the web can be described as semi-structured data. This type of data does not have any standard formal model. 
(3) Unstructured Data : This data does not have any pre-defined data model. Video, audio, image, text, web logs, system logs etc. come under this category.

Important issues regarding data in traditional file storage

In general there are some important issues regarding data in a traditional file storage system (Fig. 1.1.4) :

(1) Volume : Nowadays the volume of data in different fields is high and potentially increasing day by day. Organizations collect data from a variety of sources, including business transactions, social media, sensor information etc.

(2) Velocity : A system configured with a single processor, limited RAM and limited storage capacity cannot store and manage such a high volume of data arriving at high speed.

(3) Variety : The form of data from different sources is different.

(4) Variability : The flow of data coming from sources like social media is inconsistent because of daily emerging new trends. It can show a sudden increase in the size of data, which is difficult to manage.

(5) Complexity : As the data comes from various sources, it is difficult to link, match and transform such data across systems. It is necessary to connect and correlate relationships, hierarchies and multiple data linkages of the data.

All these issues are addressed by the new advanced Big Data technology.

1.2 DEFINING DATA SCIENCE AND BIG DATA

GQ. Define the term data science.

Defining Data Science

Definition : Data science is a field of Big Data which seeks to provide meaningful information from huge amounts of complex data. Data science is a system used for retrieving information in different forms, either structured or unstructured. Data science unites different fields of work in statistics and computation in order to understand the data for the purpose of decision making.

Defining Big Data

Definition : Big Data is described as volumes of data available at varying levels of complexity, produced at different velocities and with varying levels of ambiguity, that cannot be processed using conventional technologies, processing methods, algorithms, or any commercial off-the-shelf solutions.

• Data that can be defined as Big Data comes from a variety of fields, such as machine-generated data from sensor networks, nuclear plants and airplane engines, and consumer-driven data from social media.
• The producers of the Big Data that resides within organizations include legal, sales, marketing, procurement, finance, and human resources departments.

1.3 THE REQUISITE SKILL SET IN DATA SCIENCE

Data science is a combination of skills consisting of three most important areas (Fig. 1.3.1) : mathematics expertise, technology and hacking, and strong business acumen. They are explained in brief below.

(1) Mathematics Expertise

• The most important thing required while constructing data products and mining data insights is the capability to view the data in a quantitative way. There are texture, measurement, and relationships in data that can be expressed mathematically.
• Solutions to numerous business problems involve building analytic models grounded in hard math, where being able to recognize the underlying mechanics of those models is key to success in building them.
• Also, it is a misunderstanding that data science is all about statistics. While statistics is important, it is not the only type of math utilized in data science.
• There are two branches of statistics, as given in Fig. 1.3.2.
‘© Solutions to numerous business problems occupies building analytic models grounded in the hard math, where being able to recognize the underlying mechanics of those models is key to success in building them. © Also, a misunderstanding is that data science contains all about statistics. While statistics is important, itis not the only type of math utilized in the data science. © There are two branches of statistics as given in Fig. 1.3.2. (MU - New Syllabus w.ef academic year 22-23)(MB-79) eb reci-neo Publications Fig. 13.2 : Branches of statistics * Having knowledge of both classical and Bayesian statistics is helpful but when the majority of peoples refer to stats they are normally preferring to classical statistics. (2) Technology and Hacking * Here the term “hacking” is related to the innovation not to the tempering any confidential data. * We are going to refer hacking as a programmer's creativity and the cleverness to solve the problems that arise while building the things. * The hacking is important because data scientists make use of technology in order to handle huge data sets and work with composite algorithms, and it needs tools far more difficult than Excel, * Data scientists have to know the fundamentals of programming language to find out the quick solutions for complex data as well as to integrate that data. ¢ But having only fundamental knowledge is not sufficient for data sciemtists because data science hackers are very creative and they can find a way with the help of technical challenges to work their code in desired manner. * Algorithmic thinking of data science hacker is very high, so that it can have the ability to break down confused problems and recompose them in ways that are solvable. (3) Strong Business Acumen * For the data scientists, it is necessary to behave like a tactical business consultant. As the data scientists working are very close to the data so they can works like no one can do it. * This will make a responsibility to transform observations to shared knowledge, and contribute to strategy on how to solve core business problems. * This means a core ability of data science is using data to clearly inform a story. No data~ puking — rather, present a unified description of problem and solution, with the help of data insights as supporting pillars, that lead to guidance. Dy 1.4 5 V’S OF BIG DATA — sabe noha aie © Big datais a collection of data from many different sources and is often describe by five characteristics: volume, value, variety, velocity, and veracity. (MU - New Syllabus we.f academic year 22-23)(M8-79) Tabreci-neo Publications Applied Data Science (MU-Sem VOLUME Huge amount VERACITY Inconsistencies and uncertainty in data VARIETY various sources VELOCITY High speed of VALUE accumulation Extract useful ofdata data Fig. 1.4.1 Volume : The size and amounts of big data that companies manage and analyze. The name Big, Data itself is related to a size which is enormous. Size of data plays a very crucial role in determining value out of data. Also, whether a particular data can actually be considered as a Big Data or not, is dependent upon the volume of data. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big Data solutions. Variety : The diversity and range of different data types, including unstructured data, semi- structured data and raw data. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. 
Earlier days, most of the application was using database as a spreadsheet (Structured data). Now a day's data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc.(unstructured data) are also being considered in the analysis applications. ‘Value : The most important “V" from the perspective of the business, the value of big data usually comes from insight discovery and pattern recognition that lead to more effective operations, stronger customer relationships and other clear and quantifiable business benefits. This refers to the value that big data can provide, and it relates directly to what organizations can do with that collected data. Being able to pull value from big data is a requirement, as the value of big data increases significantly depending on the insights that can be gained from them. Velocity : The speed at which companies receive, store and manage data ~ eg., the specific ‘umber of social media posts or search queries received within a day, hour or other unit of time Veracity : The “wut” or accuracy of data and information assets, which often determines executive-level confidence. (MU - New Syllabus wef academic year 22-23)(MB-79) [ed rech-neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Science)...Page No. (1-11) cee ee bbl 1.5 DATA SCIENCE LIFE CYCLE Fig. 1.5.1 : Data Science Lifecycle > (4) Business Understanding * The complete cycle revolves around the enterprise goal. What will you resolve if you do no longer have a specific problem? It is extraordinarily essential to apprehend the commercial enterprise goal sincerely due to the fact that will be your ultimate aim of the analysis. * After desirable perception only we can set the precise aim of evaluation that is in sync with the enterprise objective. You need to understand if the customer desires to minimize savings loss, or if they prefer to predict the rate of a commodity, etc. > (2) Data Understanding * After enterprise understanding, the subsequent step is data understanding. This includes a series of all the reachable data. Here you need to intently work with the commercial enterprise group as they are certainly conscious of what information is present, what facts should be used for this commercial enterprise problem, and different information. * This step includes describing the data, their structure, their relevance, their records type. Explore the information using graphical plots. Basically, extracting any data that you can get about the information through simply exploring the data. > (3) Preparation of Data * Next comes the data preparation stage. This consists of steps like choosing the applicable data, integrating the data by means of merging the data sets, cleaning it, treating the lacking values through either eliminating them or imputing them, treating inaccurate data through eliminating them, additionally test for outliers the use of box plots and cope with them. * Constructing new data, derive new elements from present ones. Format the data into the preferred Structure, eliminate undesirable columns and features, (MU - New Syllabus w.e.f academic year 22-23)(M8-79) Tbrech-neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Science)...Page No. (1-12) ae ee ‘© Data preparation is the most time-consuming but arguably the most essential step in the complete existence cycle. 
Your model will be as accurate as your data, > (4) Exploratory Data Analysis ¢ This step includes getting some concept about the answer and elements affecting it, earlier than constructing the real model. * Distribution of data inside distinctive variables of a character is explored graphically the usage of bar-graphs, Relations between distinct aspects are captured via graphical representations like scatter plots and warmth maps. Many data visualization strategies are considerably used to discover each and every characteristic individualiy and by means of combining them with different features. > (5) Data Modeling * Data modeling is the coronary heart of data analysis. A model takes the organized data as input and gives the preferred output. * This step consists of selecting the suitable kind of model, whether the problem is a classification problem, or a regression problem or a clustering problem. After deciding on the model family, amongst the number of algorithms amongst that family, we need to cautiously pick out the algorithms to put into effect and enforce them. © We need to tune the hyper parameters of every model to obtain the preferred performance. We additionally need to make positive there is the right stability between overall performance and generalizability. We do no longer desire the model to study the data and operate poorly on new data. > © Model Evaluation © Here the model is evaluated for checking if it is geared up to be deployed. The model is examined on an unseen data, evaluated on a cautiously thought out set of assessment metrics. We additionally ‘need to make positive that the model conforms to reality. © If we do not acquire a quality end result in the evalu: ; n ‘modelling procedure until the preferred stage of metrics is achieved. Any data science solution, @ machine learning model, simply like a human, must evolve, must be capable to enhance itself with new data, adapt to a new evaluation metric. © We can construct more than one model for a certain phenomenon, bi additionally be imperfect. The model assessment helps us select and construct an ideal model. ¥ (1) Model Deployment © The model after a rigorous assessment is at the end deployed in the preferred structure and channel. This isthe last step in the data science life cycle, tation, we have to re-iterate the complete jowever, a lot of them may (MU - New Syllabus w.ef academic year 22-23)(MB-79) [el rech-neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Science)...Page No. (1-13) SSS TO Each step in the data science life cycle defined above must be laboured upon carefully. If any step is performed improperly, and hence, have an effect on the subsequent step and the complete effort B0es to waste. For example, if data is no longer accumulated properly, you'll lose records and you will no longer be constructing an ideal model. If information is not cleaned properly, the model will no longer work. If the model is not evaluated Properly, it will fail in the actual world. Right from Business perception to model deployment, every step has to be given appropriate attention, time, and effort. > 1.6 DATA: DATA TYPES, DATA COLLECTION Data collection is the process of acquiring, collecting, extracting, and storing the voluminous amount of data which may be in the structured or unstructured form like text, video, audio, XML files, records, or other image files used in later stages of data analysis. 
In the process of big data analysis, “Data collection” is the initial step before starting to analyze the patterns or useful infotmation in data. The data which is to be analyzed must be collected from different valid sources. Fig. 16.1 The data which is collected is known as raw data which is not useful now but on cleaning the impure and utilizing that data for further analysis forms information, the information obtained is known as “knowledge”. Knowledge has many meanings like business knowledge or sales of enterprise products, disease treatment, etc. The main goal of data collection is to collect information-rich data. Data collection starts with asking some questions such as what type of data is to be collected and What is the source of collection, ; Most of the data collected are of two types known as “qualitative data which is a group of non- numerical data such as words, sentences mostly focus on behavior and actions of the group and another one is “quantitative data” which is in numerical forms and can be calculated using different scientific tools and sampling data. (MU - New Syllabus w.ef academic year 22-23)(M8-79) Tal rech-neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Science)...Page No. (1-14) The actual data is then further divided mainly into two types known as: ean Fig. 1.6.2 > (A) Primary data * The data which is Raw, original, and extracted diretly from the official sources is known as Primary data. This type of data is collected directly by performing techniques such as questionnaires, interviews, and surveys. ¢ The data collected must be according to the demand and requirements of the target audience on which analysis is performed otherwise it would be a burden in the data processing. 2 1.6.1 Methods of Collecting Primary Data (1) Interview method © The data collected during this process is through interviewing the target audience by a person called interviewer and the person who answers the interview is known as the interviewee. «Some basic business or product related questions are asked and noted down in the form of notes, audio, or video and this data is stored for processing. These can be both structured and vinstructured like personal interviews or formal interviews through telephone, face to face, email, etc. (2) Survey method © The survey method is the process of research where a list of relevant que answers are noted down in the form of text, audio, or video. «The survey method can be obtained in both online and offline mode like through website forms and email.’Then that survey answers are stored for analyzing data. Examples are online surveys or surveys through social media polls. (3) Observation method The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using some data collecting (ool and stores the observed data in the form of text, audio, video, or any raw formats. stions are asked and (MU - New Syllabus w.ef academic year 22-23)(MB-79) Tech-Neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Si 1ce)...Page No. (1-15) * In this method, the data is collected directly by posting a few questions on the participants. For example, observing a group of customers and their behavior towards the products. The data obtained will be sent for processing. (4) Experimental method » The experimental method is the process of collecting data through performing experiments, research, and investigation. 
The most frequently used experiment methods are CRD, RBD, LSD, FD. @ CRD : Completely Randomized design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing the experiments. (ii) RBD : Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD was originated from the agriculture sector. (iii) LSD : Latin Square Design is an experimental design that is similar to CRD and RBD blocks but contains rows and columns. It is an arrangement of NxN squares with an equal amount of. rows and columns which contain letters that occurs only once in a row. Hence the differences can be easily found with fewer errors in the experiment. Sudoku puzzle is an example of a Latin square design, (iv) FD : Factorial design is an experimental design where each experiment has two factors each with possible values and on performing trail other combinational factors are derived. (B) Secondary data Secondary data is the data which has already been collected and reused again for some valid Purpose. This type of data is previously recorded from primary data and it has two types of sources ‘named internal source and external source. a (2) Internal source * These types of data can easily be found within the organization such as market record, a sales Tecord, transactions, customer data, accounting resources, etc, The cost and time consumption is less in obtaining internal sources. External source © The data which can’t be found at internal organizations and can be gained through external third party resources is external source data. © ~The cost and time consumption is more because this contains a huge amount of data. Examples of external sources are Government publications, news publications, Registrar General of India, planning commission, international labor bureau, syndicate services, and other non-governmental publications. (MU - New Syllabus w.e academic year 22-23)(M8-79) Ta rech-neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Science)...Page No. (1-16) eee (3) Other sources (Sensors data : With the advancement of 1oT devices, the sensors of these devices collect data which can be used for sensor data analytics to track the performance and usage of products, (ii) Satellites data : Satellites collect a lot of images and data in terabytes on daily b: surveillance cameras which can be used to collect useful information through (iii) Web traffic : Due to fast and cheap internet facilities many formats of data which is uploaded by users on different platforms can be predicted and collected with their permission for data analysis. The search engines also provide their data through keywords and queries searched mostly hv bb 1.7 DATA ANALYTIC LIFE CYCLE : OVERVIEW {RQ Explain different phases of data analytics life cycle. [ Ref: - 0:'1(b), Aug. 18,.6.Marks [i {RQ Explain Data Analytic Life cycle, COO | ' RQ. Draw Data Analytics Lifecycle & give brief description about all phases. ; ' Ref. -. 1(b), May 19, 5 Marks © At this level we need to know more deep knowledge of specific roles and responsibilities of the data scientist. © The data scientist lifecycle is illustrated in Fig. 1.7.1 which gives the high-level overview of the data scientist discovery and analysis process. 
© It depicts the iterative behaviour of work performed by the data scientist's with several stages being repetitive in order to make sure that the data scientist is utilizing the “right” analytic model to locate the “right” insights. Fig. 1.7.1; Data Scientist Lifecycle (MU - New Syllabus w.ef academic year 22-23)(MB-79) al Tech-Neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data lence) %® 1.7.1 Phase 1 - Discovery Phase The following activities of data scientists can be focused by the Discovery : Acquisition of a complete understanding of the business process and the business domain. This consists of recognizing the key metrics and KPIs against which the business users will measure success. Recognizing the most vital business questions and business decisions that the business users are attempting to answer in support of the targeted business process. This also should contain the ‘occurrence and optimal timeliness of those answers and decisions. Evaluating available resources and going through the process of framing the business problem as an analytic hypothesis. At this stage data scientist constructs the initial analytics development plan that will be used to direct and document the resulting analytic models and insights. It should be noticed that understanding into which production or operational environments the analytic insights requires to.be published is somewhat that should be recognized in the analytics development plan Such information will be essential as the data scientist recognizes in the plan where to “operationalize” the analytic insights and models. This is a best opportunity for tight association with the BI analyst who likely has already defined the metrics and processes required to support the business proposal. Requirements and the decision making environment of the business users can be well understand by the BI analyst to starts the data scientist's analytics development plan. 1.7.2 Phase 2 - Data Preparation The following activities of data scientists can be focused by the data preparation : Provisioning an analytic workspace, or an analytic sandbox, where the data scientist can work free of the constraints of a production data warehouse environment. Preferably, the analytic environment is set up such that the data scientist can self-provision as much data space and analytic horsepower as required and can fine-tune those requirements throughout the analysis process. Obtaining, cleaning, aligning, and examining the data. This contains use of data visualization techniques and tools to get an understanding of the data, recognizing outliers in the data and calculating the gaps in the data to decide the overall data quality; determine if the data is “good enough.” Transforming and enhancing the data, The data scientist will look to use analytic techniques, such as logarithmic and wavelet transformations, to sort out the potential skewing in the data. The data scientist will also look to use data enhancement techniques to create new composite metrics such as frequency, recentness, and order, The data scientist will make use of standard tools like SQL and Java, as well as both commercial and open source extract, transform, load (ETL) tools to transform the data, (MU = New Syllabus w.es academic year 22-23)(M8-79) & Tech-Neo Publications (Introduction to Data Sclence)...Page No. 
• After this stage is completed, the data scientist wants to feel comfortable enough with the quality and richness of the data to move ahead to the next stage of the analytics development process.

1.7.3 Phase 3 - Model Planning

The following activities of data scientists are the focus of model planning :

• Determining the various analytical models, methods, techniques and workflows to explore as part of the analytic model development. The data scientist may believe in advance that certain analytic models and methods are suitable, but it is good to plan to test at least one more, to make sure that the opportunity to build a more predictive model is not missed.
• Determining correlation and collinearity between variables, in order to select the key variables to be used in the model development. The data scientist wants to identify the cause-and-effect variables as early as possible. Keep in mind that correlation does not guarantee causation, so care must be taken in choosing variables that can be acted upon going forward.

1.7.4 Phase 4 - Model Building

The following activities of data scientists are the focus of model building :

• Preparing the data sets for testing, training, and production. Whatever new transformation techniques are developed can be tested to observe whether the quality, reliability, and predictive capabilities of the data can be enhanced or not.
• Evaluating the feasibility and reliability of the data to be used in the predictive models. Decisions depend on the quality and reliability of the data; check whether the data is "good enough" to be used in developing the analytic models.
• In the end, developing, testing, and refining the analytic models is done. Testing is carried out to see which variables and analytic models deliver the highest quality, most predictive and most actionable analytic insights.
• The model building stage is a highly iterative step, where preparing the data, evaluating the reliability of the data, and determining the quality and predictive power of the analytic model will be revisited a number of times.
• In this stage the data scientist may be unsuccessful many times in testing different variables and modeling techniques before settling on the "right" one.

1.7.5 Phase 5 - Communicate Results

The following activities of data scientists are the focus of communicating results :

• Determining the quality and reliability of the analytic model and its statistical significance, and measuring the actionability of the resulting analytic insights. The data scientist wants to make sure that the analytic process and model were successful at accomplishing the required analytic goals of the project.
%& 1.7.6 Phase 6 - Operationalize The following activities of data scientists can be focused by the operationalize : Providing the final suggestions, reports, meetings, code, and technical documents. Optionally, running a pilot or analytic lab to validate the business case, and the financial return on investment (ROI) and the analytic lift. Carrying out the analytic models in the production and operational environments. This engross working with the application and production teams to decide how best to surface the analytic results and insights. Combining the analytic scores into management dashboards and operational reporting systems, like sales systems, procurement systems, and financial systems ete. The operationalization stage is another area where association between the data scientist and the BI analysts should be very useful. Numerous BI analysts have the experience of combining reports and dashboards into the operational systems, as well as establishing centers of excellence to spread analytic learning and skills across the organization. bi 1.8 MODELING Data modeling is the process of creating a simplified diagram of a software system, and the data elements it contains. It uses text and symbols to represent the data and how it flows. Data models provide a blueprint for designing a new database. Thus, data modeling helps an organisation to use its data effectively to meet business needs for information. Actually, a data ~ model is a flowchart that illustrates data entities and the relationships between entities, Tt enables data management to document data requirements for applications in development plans. It also helps to identify errors before any code is written. (MU - New Syllabus w.e academic year 22-23)(MB8-79) Tech-Neo Publications 2A 1.8.1 Purpose of Data Modeling © Data modeling is a cove data management discipline. It provides visual representation of data sets and their business content.\ ‘© Ithelps to locate information that is needed for different business processes. © It mentions the characteristics of the data elements and then these elements are included in the datasets and then are processed. © Data modeling plays an important role in data architecture processes. ‘© It maps how data moves through IT systems and create a conceptual data management framework. * Earlier, data models were built by data architects and other data management professionals. © They used to take input from business analysts, executives and users. © But, nowadays data-modeling is an important skill for data scientists and analysts. © They develop “business intelligence applications’ and more complex ‘data science and advanced analytics’ Y\ 1.8.2 Different types of Data Models © Data models use three types of models to separately represent business concepts and workflows and technical structures for managing the data. The models are created in progression since organisations plan new applications and databases. The different types of data models are as follows > (1) Conceptual data model «This is a high-level visualisation of the analytics processes that a system will suppor. ¢ _Itgives the kinds of data that are required. © Itshows how different business entities interrelate. © These conceptual data-models helps to them to see how meets business needs. © These conceptual models ¥ (2) Logical data-model Logical data-models show how data entities are related and describe the data from a technical perspective. 
© They define data structures and then they provide detail other important characteristics, © The technical side of an organisation uses logical models to help understand required application and database designs. Again they are not related to a particular technology platform. a system will work and ensure that it are not connected to specific database or application technologies. Is on attributes, data types, keys and (MU - New Syllabus w.ef academic year 22-23)(M8-79) la rech.neo Publications oo Applied Data Science (MU-Sem 8-Comp) (introduction to Data Science)...Page No. (1-21) > (3) Physical data - model * A logical model acts as the basis for the creation of physical model. Physical models are specific to the application software that will be implemented ¢ — They define the structures that the database or a file system will use to store and manage the data. It includes tables, columns fields, indexes, constraints, triggers and other DBM’s elements, © Database designers use physical data models to create designs and generate schema for databases. >| 1.9 DIFFERENCE BETWEEN DATA SCIENCE AND DATA ANALYTICS = Data science © Data science deals with extracting meaningful information and insights by applying various algorithms, processes, scientific methods from structured and unstructured data. * This field of data science is related to big data and is one of the most required skills at present. * Data science consists of mathematics, computations, statistics, programming etc. to gain important and relevant insights from the large amount of data, that is provided in various formats. Data Analytics * Data analytics gets conclusions by processing the raw data. + Ithelps the company to make decisions based upon the conclusions from the data. It converts a large number of figures in the form of data into simple English, and these conclusions are further helpful in making the required decisions. * We mention below the table between Data Science and Data Analytics. ‘Table 1.9.1 Feature Data science Data analytics _ Coding Python is the commonly used language for | The knowledge of python and R language data science along with the use of other | Language is essential for data languages such as C++, Java, etc. analytics. Programming | In depth knowledge of programming is | Basic programming skills is necessary skills required for data science. for data analytics. Use of machine | Data science makes use of machine | Data analytics does not make use of learning earning algorithms to get insights. machine learning. Other skills Data science makes use of data mining | Hadoop based analysis is used for activities for getting meaningful insights | getting conclusions from raw data Scope ‘The scope of data science is very large ‘The scope of data analytics is very small, i.e, micro, L Goals Data science deals with explorations and | Data analytics makes use of existing new innovations. resources. (MU - New Syllabus w.e.f academic year 22-23)(M8-79) Dabrech-neo Publications Applied Data Science (MU-Sem 8-Comp) (introduction to Data Science)...Page No. (1-22) py 1.10 CASE STUDY - GINA : GLOBAL INNOVATION NETWORK AND ANALYSIS i ‘re. Write a case study on Global Innovation Network & Analysis (GINA). ' GONE ' ' 6. _Writea short note on Case of GINA (6 Marks) + : * EMC’ GINA (Global Innovation Network and Analytics) team is a group of senior technologists placed in centers of excellence (COES) all over the world. 
* The main goal of team is to connect employees all over the world to drive innovation, research as well as university partnerships. * The basic consideration of GINA team was that its approach would offer an interface to share ideas globally and enhance sharing of knowledge between GINA members who are not at one place geographically. * A data repository has been created to store both structured and unstructured data to achieve three important goals : (1) Store formal as well as informal data. (2). Keep track of research from technologists all over the world. (3) To enhance the operations and strategy, extract data for patterns and insights. The case study of GINA illustrates an example of the way by which a team applied the Data Analytics Lifecycle for the purpose of analyzing innovation data at EMC. Innovation is generally considered as a hard concept to measure, and this team is going to use advanced analytical methods so as to identify key innovators within the company. YA 1.10.1 Phase 1 - Discovery In this phase, identification of data sources is started by the team. Even though GINA has technologists which are skilled in several different aspects of engineering, it had few data and ideas regarding what it needs to explore but do not have a formal team which could perform these analytics. + They consults with various experts and decided to outsource the work to the volunteers within EMC. list of roles is as follows on the working team which were fulfilled : (User of Business, Sponsor of Project, Manager of Project : Vice President (W) Business Intelligence Analyst + Representatives from IT Field (it) DBA (Data Engineer and Database Administrator) : Representatives from IT (iv) Data Sclentist : Distinguished Engineer who are able to develop social graphs. (MU = New Syllabus w.ef academic year 22-23)(MB-79) Tech-Neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Science)...Page No. (1-23) * The approach of project sponsor is to influence social media and blogging for the purpose of accelerating the set of innovation as well as research data across the world and to inspire teams of data scientists who can work as “volunteer” globally. «© — The data scienti hould show passion about data, and the project sponsor should have ability to tap into this passion of greatly talented people to achieve challenging work in a creative way. ‘The data regarding the project is divided into two important categories. The first category regards with the idea submissions of near about five years from EMC's internal innovation contests, called as the Innovation Roadmap or Innovation Showcase. The Innovation Roadmap is nothing but an organic innovation process in which ideas are submitted by employees globally which are then judged. + For further incubation, rest out of these ideas are selected. Consequently the data is combination of structured data, like idea counts, submission dates, inventor names, and unstructured content, like the textual descriptions regarding the ideas themselves. The second category of data consists of encompassed minutes as well as notes which represents innovation and research activity globally Additionally it represents combination of structured and unstructured data. The structured data consists of attributes like dates, names as well as geographic locations. In the unstructured documents data is regarding “who, what, when, and where” which represents rich data regarding knowledge growth and transfer inside the company. 
There are 10 important IHs which are developed by GINA team : (1) THI : It is possible to map innovation activity in dissimilar geographic locations to corporate strategic directions. (2) IH2: The delivery time of ideas minimizes by the transfer of global knowledge as part of the idea delivery process. (3) IH3 : Innovators participating in global knowledge are able to deliver ideas fast as compared to those who do not. (4) IBG4 : It is possible to analyze and evaluate an idea submission for the likelihood of receiving funding. (5) IHS : Knowledge invention and increase for a specific topic can be measured as well as compared across geographic locations. (6) IHG : Research-specific boundary can be identified by the knowledge transfer activity spanners in different regions. (1) THT = Itis possible to map strategic corporate themes to geographic locations. (8) IH8 : Continuous knowledge growth and transfer events minimize the time required to create corporate asset from an idea. (MU - New Syllabus we.f academic year 22-23)(M8-79) fe Tech-Neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Science)...Page No. (1-24) (9) TH9 : Lineage maps get revealed when corporate asset is not generated by the knowledge expansion and transfer. (10) TH10 : It is possible to classify and map emerging research topics to particular ideators, innovators, boundary spanners, and assets. %& 1.10.2 Phase 2 - Data Preparation ‘* Anew analytics sandbox is set up by the team with its IT department for the purpose of storing and experimenting on the data. * Inthe process of data exploration exercise, the data scientists and data engineers come to know that specific data require conditioning and normalization. ‘* Also they come to know that various missing datasets were difficult to testing some of the analytic hypotheses. ‘* As data is explored by the team, it promptly realized that without good quality data, it would not be able to carry out the subsequent steps in the lifecycle process. Consequently it was essential to conclude for project what level of data quality and cleanliness was necessary. «In the case of the GINA, the team realizes that several of the names of the researchers and people who are communicating with the universities were misspelled or had spaces at leading and trailing side in the data-store. © Such little problems must be addressed in this phase to enable better analysis as well as data aggregation in subsequent phases. % 1.10.3 Phase 3 - Model Planning © In the GINA project, for large amount of dataset, it looks viable to use social network analysis techniques to observe the networks regarding innovators. © Inother cases, it was hard to provide appropriate methods to test hypotheses because of the lack of data. * Inone case (1H9), a decision is made by the team to begin a longitudinal study to start tracking data points over time about people who are developing new intellectual property. ‘© This data collection support the team to test the next two ideas later : (8) THB : Continuous knowledge growth and transfer events minimize the time required to create ‘a corporate asset from an idea? (il) TH9 : Lineage maps get revealed when corporate asset is not generated by the knowledge ‘expansion and transfer. » For the longitudinal study being proposed, there is need to team to establish goal criteria for the Purpose of study. (MU - New Syllabus w.e. academic year 22-23)(MB-79) a Tech-Neo Publications Applied Data Science (MU-Sem 8-Comp) (Introduction to Data Science)...Page No. 
(1-25) ee EE Particularly, it required to decide the end goal of a successful idea which had traversed the entire journey. The parameters regarding the scope of the study consist of the following considerations: (i) Identify the correct milestones for the purpose of accomplishing this goal. (ii) Trace the way by which people shift ideas from each and every milestone towards the goal. (iii) After this, trace ideas which unable to reach the goals, and trace others which are able to reach the goal. Compare the journeys of both types of ideas. ‘Make comparison regarding the times and the outcomes with the help of a few different methods based on the way by which data is collected and assembled. Chapter Ends... goa Sure Marks Notes and Paper Solutions Elevating Excellence ewe la Guide & University Paper Solutions Written, Edited by most experienced faculty. Chapterwise & Topicwise Paper Solutions. Most Likely question also included. Answers exactly as per the weightage of marks given in exam. All Latest Q. Papers included. MODULE.2 24 22 23 24 25 Data Exploration CHAPTER 2 Types of data, Properties of data Descriptive Statistics : Univariate Exploration: Measure of Central Tendency, Measure of Spread, Symmetry, Skewness: Karl Pearson Coefficient of skewness, Bowley's Coefficient, Kurtosis Multivariate Exploration: Central Data Point, Correlation, Different forms of correlation, Karl Pearson Correlation Coetticient for bivariate distribution. Inferential Statistics : Overview of Various forms of distributions: Normal, Poisson, Test Hypothesis, Central limit theorem, Confidence Interval, Z-test, t-test, Type-I, Type-t Errors, ANOVA. Introduction to Statistics... Measures of Central Tendency. Review of Basic Results in the Theory of Statistics. 2.3.1 Range and Mid-ranage : 2.3.2 Variance and Standard Deviation... 2.3.3 Arithmetic Mean. 2.3.4 Moments about Mean... 2.3.5 Relation between Moments about Mean (j,) and Moments about Origin ( 2.3.6 Kari Pearson's Coefficients of Kurtosis........ The Expected Value of x : (Mean Value of x). Testing of Hypothesis... 2.5.1 Statistical Hypothesis... 2.5.2 Test of Hypothesis... 2.5.3 Tests of Significance. 2.5.4 Null Hypothesis. Applied Data Science (MU-Sem 8-Comp) (Data Exploration)...Page No. (2-2) 255 256 257 2.5.7(A) 258 2.5.9 2.5.10 25.11 2.6 Chi-square test of goodness of fit 261 2.6.2 2.7 Chi-Square Test 2.7.1 Probability Density Function (p.d.t.) of Chi-square Distribution, 27.2 Remark 2.7.3 Applications of x°- Distribution 2.7.4 — Chi-Square Test of Goodness of Fit. 27.5 Steps to Compute x° and Drawing Concl 2.7.6 Conditions for the Validity of Chi-Square Test 2.7.7 Examples 2.7.8 Levels of Significance 2.7.9 Method of Solving the Problem 2.7.10 Student's Distribution 2.7.11 Properties of t-distribution 2.7.12 Applications of t-distribution. ‘f — test for Significance of Sample Correlation Coefficient 281 Examples t-Test for Difference of Mears 2.9.1 Assumptions for Difference of Means Test . 2.92 Examples Z-Test.. 2.101 Use of Zest 2.10.2 Hypothesis Testing 2.10.3 Steps of Performing Z-test 2104 Type of Zest... 2.105 Solved Example 2.10.6 Two - Sampled Z - Tos 211 Comelation... 211.1 Types of Correlation. 211.2 — Scatter Diagram. 2.42 Central Data Point 2.12.1 Data Point, Altemate Hypothesis... Types of Errors. Type | Error and Type Il Error Comparison between Type | and Type Il Errors . Power of Test. Level of Significance Critical Region Examples... Contingency Table... 
  2.6.2 Degrees of Freedom
2.7 Chi-Square Test
  2.7.1 Probability Density Function (p.d.f.) of Chi-square Distribution
  2.7.2 Remark
  2.7.3 Applications of χ²-Distribution
  2.7.4 Chi-Square Test of Goodness of Fit
  2.7.5 Steps to Compute χ² and Drawing Conclusions
  2.7.6 Conditions for the Validity of Chi-Square Test
  2.7.7 Examples
  2.7.8 Levels of Significance
  2.7.9 Method of Solving the Problem
  2.7.10 Student's t-Distribution
  2.7.11 Properties of t-distribution
  2.7.12 Applications of t-distribution
2.8 t-test for Significance of Sample Correlation Coefficient
  2.8.1 Examples
2.9 t-Test for Difference of Means
  2.9.1 Assumptions for Difference of Means Test
  2.9.2 Examples
2.10 Z-Test
  2.10.1 Use of Z-test
  2.10.2 Hypothesis Testing
  2.10.3 Steps of Performing Z-test
  2.10.4 Types of Z-test
  2.10.5 Solved Example
  2.10.6 Two-Sample Z-Test
2.11 Correlation
  2.11.1 Types of Correlation
  2.11.2 Scatter Diagram
2.12 Central Data Point
  2.12.1 Data Point
  2.12.2 Requirements for a Good Data Point
  2.12.3 Use of Data Points
  2.12.4 Collection of Data Points
  2.12.5 Examples of Data Point Collection Methods
  2.12.6 Analysis of Data Points
  2.12.7 Examples of Data Points
  2.12.8 Karl Pearson's Coefficient of Correlation
  2.12.9 Properties of Coefficient of Correlation
  2.12.10 Examples on Correlation Coefficient
2.13 Rank Correlation
  2.13.1 Spearman's Rank Correlation Coefficient
  2.13.2 Tied Ranks
2.14 Bowley's Coefficient
  2.14.1 Bowley Skewness = (Q3 + Q1 - 2Q2) / (Q3 - Q1)
  2.14.2 Why Bowley Skewness Works
  2.14.3 Limitations of Bowley Skewness
2.15 Poisson Distribution
  2.15.1 Derivation of the Poisson Distribution from the Binomial Distribution
  2.15.2 Moments of the Poisson Distribution
  2.15.3 Moment Generating Function
  2.15.4 Additive Property of Poisson Distribution
2.16 Examples on Poisson Distribution
2.17 Normal Distribution (or Gaussian Distribution)
  2.17.1 Characteristics of Normal Distribution
  2.17.2 Properties of Normal Distribution (N.D.)
2.18 Importance of Normal Distribution
2.19 Solved Examples on Normal Distribution
2.20 Analysis of Variance (ANOVA)
  2.20.1 Definition of ANOVA
  2.20.2 Assumptions for ANOVA Test
2.21 Hypothesis Testing for More than Two Means (ANOVA)
  2.21.1 Alternative for Computation of Various Sums of Squares
2.22 Solved Examples
2.23 Central Limit Theorem
  2.23.1 Examples on Central Limit Theorem
2.24 Confidence Intervals

Chapter Ends

2.1 INTRODUCTION TO STATISTICS

Definition (1) : A variate is any quantity or attribute whose value varies from one unit of investigation to another.
Definition (2) : An observation is the value taken by a variate for a particular unit of investigation.
Variates differ in nature, and the methods of analysis of a variate depend on its nature. We can distinguish between quantitative variates (such as the birth-weight of a baby) and qualitative variates (such as the sex of the baby).
Definition (3) : A quantitative variate is a variate whose values are numerical.
Definition (4) : A qualitative variate or attribute is a variate whose values are not numerical.
Quantitative variates can be further divided into two types :
(i) They may be continuous, if they can take any value in some range, or
(ii) Discrete, if their values change by steps or jumps.
Definition (5) : A continuous variate is a variate which may take all values within a given range.
Definition (6) : A discrete variate is a variate whose values change by steps.
The choice of which variates to record is important in any investigation. Once the choice is made, the information can be summarized by the frequency distribution of the possible values.
Definition (7) : The frequency distribution of a (discrete) variate is the set of possible values of the variate, together with the associated frequencies.
Definition (8) : The frequency distribution of a (continuous) variate is the set of class-intervals for the variate, together with the associated class-frequencies.
If we classify the whole population according to birth-weights, then instead of looking at the frequency of each value, we first group the values into intervals, each interval being a sub-division of the total range of possible values of the variate. In this example, the variate may be classified as 1-500, 501-1000, ..., 4501-5000, 5001-5500 grams.
Definition (9) : A class-interval is a sub-division of the total range of values which a (continuous) variate may take.
Definition (10) : The class-frequency is the number of observations of the variate which fall in a given interval.
Definition (11) : The cumulative frequency is the sum of all observations which are less than the upper boundary of a given class interval; i.e. this number is the sum of the frequencies up to and including the class to which the upper class boundary corresponds.

* For example, consider the heights of 50 students. We prepare Tables 2.1.1 and 2.1.2.

Table 2.1.1 : Cumulative frequency (less than) table

| Class interval (cm) | Frequency | Cumulative frequency (less than) |
| 145-146 | 2  | 2  |
| 147-148 | 5  | 7  |
| 149-150 | 8  | 15 |
| 151-152 | 15 | 30 |
| 153-154 | 9  | 39 |
| 155-156 | 6  | 45 |
| 157-158 | 4  | 49 |
| 159-160 | 1  | 50 |
| Total   | 50 |    |

Table 2.1.2 : Cumulative frequency (more than) table

| Class interval (cm) | Frequency | Cumulative frequency (more than) |
| 145-146 | 2  | 50 |
| 147-148 | 5  | 48 |
| 149-150 | 8  | 43 |
| 151-152 | 15 | 35 |
| 153-154 | 9  | 20 |
| 155-156 | 6  | 11 |
| 157-158 | 4  | 5  |
| 159-160 | 1  | 1  |
| Total   | 50 |    |

Definition (12) : Points to note while constructing the tables :
(1) Make the table self-explanatory : provide a title and a brief description of the source of the data, state in what units the figures are expressed, and label rows and columns where appropriate.
(2) Keep the table as simple as possible.
(3) Distinguish between zero values and missing observations.
(4) Make alterations clearly.
(5) Arrange the calculations in a logical pattern on the sheet.

2.2 MEASURES OF CENTRAL TENDENCY

* One of the most important aspects of describing a distribution is the central value around which the observations are distributed.
* Any arithmetical measure which gives the centre or central value of a set of observations is known as a measure of central tendency or measure of location.
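As a quick illustration, the three most common measures of central tendency can be computed with Python's built-in statistics module (the dataset is an assumed toy example):

```python
import statistics

data = [12, 15, 12, 18, 20, 12, 15]

print(statistics.mean(data))    # arithmetic mean ~ 14.86
print(statistics.median(data))  # middle value of the sorted data = 15
print(statistics.mode(data))    # most frequent value = 12
```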
2.3 REVIEW OF BASIC RESULTS IN THE THEORY OF STATISTICS

2.3.1 Range and Mid-range

One way to measure the variability in a sample is simply to look at the highest and the lowest of the observations in a set, and calculate the difference between them.

Definition 1 : The range of a set of observations is the difference in values between the largest and smallest observations in the set.
Definition 2 : The mid-range is the average of the largest and smallest values in the data set.
e.g., for X = {1, 3, 5, 7, 9, 11, 13},
$$\text{midrange} = \frac{1 + 13}{2} = 7$$

2.3.2 Variance and Standard Deviation

Pursuing the idea of measuring how closely a set of observations clusters round its mean, we square each deviation $(x_i - \bar{x})$ instead of taking its absolute value. The next measure of variability is the variance : it is the mean of the squared deviations.

Definition (1) : Variance
(I) The variance of a set of observations $x_1, x_2, \dots, x_n$ is the average of the squared deviations from their mean and equals
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
On simplification it is equal to
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \bar{x}^2$$
(II) For grouped data,
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{n} f_i\left(x_i - \bar{x}\right)^2, \qquad N = \sum_i f_i$$

Definition : Standard deviation
(I) The standard deviation is the positive square root of the variance; it is denoted by $\sigma$ :
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
(II) For grouped data,
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{n} f_i\left(x_i - \bar{x}\right)^2}$$

2.3.3 Arithmetic Mean

If $f_1, f_2, \dots, f_n$ are the frequencies of the variates $x_1, x_2, \dots, x_n$, then
$$M = \text{Arithmetic Mean} = \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$

Short-cut method of finding the mean :
Let $u = \dfrac{x - x_0}{h}$, where $x_0$ is the assumed mean and $h$ is the length of the class interval. Then
$$M = x_0 + hA, \qquad \text{where } A = \frac{\sum f_i u_i}{\sum f_i}$$

2.3.4 Moments about Mean

Let $(x_i, f_i)$ be the given frequency distribution. Then the $r^{th}$ moment about the mean M is given by
$$\mu_r = \frac{\sum_i f_i (x_i - M)^r}{\sum_i f_i}$$
For $r = 1$,
$$\mu_1 = \frac{\sum f_i (x_i - M)}{\sum f_i} = \frac{\sum f_i x_i}{\sum f_i} - M = M - M = 0$$
and for $r = 2$,
$$\mu_2 = \frac{\sum f_i (x_i - M)^2}{\sum f_i} = \text{square of the standard deviation}$$

Definition : The $r^{th}$ moments about the origin are given by
$$\mu_r' = \frac{\sum f_i x_i^r}{\sum f_i}$$
In particular,
$$\mu_1' = \frac{\sum f_i x_i}{\sum f_i},\quad \mu_2' = \frac{\sum f_i x_i^2}{\sum f_i},\quad \mu_3' = \frac{\sum f_i x_i^3}{\sum f_i},\quad \mu_4' = \frac{\sum f_i x_i^4}{\sum f_i}$$

2.3.5 Relation between Moments about Mean ($\mu_r$) and Moments about Origin ($\mu_r'$)

(i) $\mu_2 = \mu_2' - (\mu_1')^2$
(ii) $\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3$
(iii) $\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4$

Also note that if $u = \dfrac{x - x_0}{h}$ (moments taken about the assumed mean $x_0$), then
$$\mu_r' = h^r\,\frac{\sum f_i u_i^r}{\sum f_i}$$

2.3.6 Karl Pearson's Coefficients of Kurtosis

(i) $\beta_1 = \dfrac{\mu_3^2}{\mu_2^3}$ = measure of skewness.
(ii) $\beta_2 = \dfrac{\mu_4}{\mu_2^2}$ = measure of flatness (peakedness) of a single-humped distribution.

Note : For the normal distribution, $\beta_2 = 3$. If $\beta_2 > 3$, the distribution is more sharply peaked than the normal curve and is known as lepto-kurtic. If $\beta_2 < 3$, the distribution is flat compared to the normal curve and is known as platy-kurtic.

[Fig. 2.3.1 : Lepto-kurtic, normal (meso-kurtic) and platy-kurtic curves]
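These coefficients are easy to compute directly from the central moments; a minimal NumPy sketch (the frequency distribution is an assumed toy example):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
f = np.array([3, 8, 12, 8, 3])            # assumed toy frequencies

mean = (f * x).sum() / f.sum()

def mu(r):
    """r-th central moment of the frequency distribution."""
    return (f * (x - mean) ** r).sum() / f.sum()

beta1 = mu(3) ** 2 / mu(2) ** 3   # 0 here, since the distribution is symmetric
beta2 = mu(4) / mu(2) ** 2        # < 3, so this toy distribution is platy-kurtic
print(beta1, beta2)
```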
2.3.7 The Expected Value of x : (Mean Value of x)

If X is a random variable, then the expected value of X is denoted by E(X) and means the value, on average, that X takes.

Definition : If $X = x_i$, $i = 1$ to $n$, is a discrete random variable with frequencies $f_i$, $i = 1$ to $n$, then
$$E(X) = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$

Note : The expected value of X is also called the mean value of X and is also denoted by M, i.e. M = E(X).

Properties
(i) The $r^{th}$ moment about the origin can also be written as $\mu_r' = E(X^r)$. Clearly,
$$E(X) = \mu_1',\quad E(X^2) = \mu_2',\quad E(X^3) = \mu_3',\quad E(X^4) = \mu_4', \text{ and so on.}$$
(ii) Moments about the mean $\bar{x}$ are defined as
$$\mu_r = E\left[(X - \bar{x})^r\right]$$
and $\mu_r$ is called the $r^{th}$ moment about the mean $\bar{x}$.
Clearly,
$$\mu_1 = 0$$
$$\mu_2 = E\left[(X - \bar{x})^2\right] = \mu_2' - (\mu_1')^2$$
$$\mu_3 = E\left[(X - \bar{x})^3\right] = \mu_3' - 3\mu_1'\mu_2' + 2(\mu_1')^3$$
$$\mu_4 = E\left[(X - \bar{x})^4\right] = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4$$

2.3.8 Covariance

* In probability theory and statistics, covariance is a measure of the joint variability of two random variables.
* If the greater values of one variable correspond with the greater values of the other variable, and the same holds for the lesser values (that is, the variables tend to show similar behaviour), the covariance is positive.
* In the opposite case, when the greater values of one variable correspond to the lesser values of the other (that is, the variables tend to show opposite behaviour), the covariance is negative.
* The sign of the covariance shows the tendency in the linear relationship between the variables.

[Fig. 2.3.2 : Scatter diagrams illustrating cov(x, y) < 0, cov(x, y) = 0 and cov(x, y) > 0]

Formulae of covariance

If X and Y are two random variables, then the covariance between them is defined as :
$$\operatorname{cov}(X, Y) = E\{[X - E(X)]\,[Y - E(Y)]\}$$
$$= E\{XY - X\,E(Y) - Y\,E(X) + E(X)E(Y)\}$$
$$= E(XY) - E(X)E(Y) - E(Y)E(X) + E(X)E(Y)$$
$$\operatorname{cov}(X, Y) = E(XY) - E(X)\,E(Y) \qquad \dots(i)$$
If X and Y are independent, then $E(XY) = E(X)E(Y)$, and hence in this case
$$\operatorname{cov}(X, Y) = E(X)E(Y) - E(X)E(Y) = 0$$

Remarks
(i) cov(aX, bY) = E{[aX − E(aX)] [bY − E(bY)]} = E{a[X − E(X)] · b[Y − E(Y)]} = ab cov(X, Y)
(ii) cov(X + a, Y + b) = cov(X, Y)
(iii) cov(aX + b, cY + d) = ac cov(X, Y)
(iv) cov(X + Y, Z) = cov(X, Z) + cov(Y, Z)
(v) If X and Y are independent, then cov(X, Y) = 0, but the converse is not true.

2.3.9 Examples

Ex. 2.3.1 : For the following distribution, find : (i) the arithmetic mean, (ii) the standard deviation, (iii) the first 4 moments about the mean, and (iv) β1 and β2.

| x | 2 | 2.5 | 3 | 3.5 | 4 | 4.5 | 5 |
| f | 5 | 38  | 65 | 92 | 70 | 40 | 10 |

Soln. : Take the assumed mean x0 = 3.5 and h = 0.5, so that u = (x − 3.5)/0.5.

| x   | f   | u  | fu  | fu² | fu³  | fu⁴  |
| 2   | 5   | −3 | −15 | 45  | −135 | 405  |
| 2.5 | 38  | −2 | −76 | 152 | −304 | 608  |
| 3   | 65  | −1 | −65 | 65  | −65  | 65   |
| 3.5 | 92  | 0  | 0   | 0   | 0    | 0    |
| 4   | 70  | 1  | 70  | 70  | 70   | 70   |
| 4.5 | 40  | 2  | 80  | 160 | 320  | 640  |
| 5   | 10  | 3  | 30  | 90  | 270  | 810  |
| Total | 320 |  | 24  | 582 | 156  | 2598 |

(i) Arithmetic mean : A = Σfu / N = 24/320 = 0.075, and
arithmetic mean = x0 + hA = 3.5 + (0.5)(0.075) = 3.538.

(ii) Standard deviation :
$$\sigma^2 = h^2\left[\frac{\sum f u^2}{N} - \left(\frac{\sum f u}{N}\right)^2\right] = (0.25)(1.8188 - 0.0056) = 0.4533, \qquad \sigma = 0.673$$

(iii) Moments about the assumed mean x0 = 3.5 :
μ1' = h Σfu / N = (0.5)(0.075) = 0.0375
μ2' = h² Σfu² / N = (0.25)(582/320) = 0.4547
μ3' = h³ Σfu³ / N = (0.125)(156/320) = 0.0609
μ4' = h⁴ Σfu⁴ / N = (0.0625)(2598/320) = 0.5074
Moments about the mean :
μ1 = 0
μ2 = μ2' − (μ1')² = 0.4547 − (0.0375)² = 0.4533
μ3 = μ3' − 3μ2'μ1' + 2(μ1')³ = 0.0609 − 3(0.4547)(0.0375) + 2(0.0375)³ = 0.0099
μ4 = μ4' − 4μ3'μ1' + 6μ2'(μ1')² − 3(μ1')⁴ = 0.5074 − 4(0.0609)(0.0375) + 6(0.4547)(0.0375)² − 3(0.0375)⁴ = 0.5021

(iv) By the definitions of β1 and β2,
β1 = μ3² / μ2³ = (0.0099)² / (0.4533)³ ≈ 0.0011
β2 = μ4 / μ2² = 0.5021 / (0.4533)² ≈ 2.44
Since β2 < 3, the distribution is platy-kurtic, i.e. it is flatter than the normal distribution.
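The hand computation above can be cross-checked in a few lines of NumPy (a verification sketch, expanding the frequency table into raw observations):

```python
import numpy as np

# Expand the frequency table of Ex. 2.3.1 into raw observations
x = np.repeat([2, 2.5, 3, 3.5, 4, 4.5, 5], [5, 38, 65, 92, 70, 40, 10])

mean, sigma = x.mean(), x.std()            # ~ 3.538 and ~ 0.673
mu2, mu3, mu4 = [np.mean((x - mean) ** r) for r in (2, 3, 4)]
print(mu3 ** 2 / mu2 ** 3)                 # beta1 ~ 0.001
print(mu4 / mu2 ** 2)                      # beta2 ~ 2.44
```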
UEx. 2.3.2 : From the following frequency distribution, compute the standard deviation of the masses of 100 students.

Table P. 2.3.2

| Mass in kg | No. of students |
| 60-62 | 5  |
| 63-65 | 18 |
| 66-68 | 42 |
| 69-71 | 27 |
| 72-74 | 8  |

Soln. : We construct the table. Let x0 = 67 be the assumed mean and h = 3 the class width, and let u = (x − 67)/3.

Table P. 2.3.2(a)

| Class of masses | Midpoint of class x | No. of students f | u = (x − 67)/3 | fu | fu² |
| 60-62 | 61 | 5  | −2 | −10 | 20 |
| 63-65 | 64 | 18 | −1 | −18 | 18 |
| 66-68 | 67 | 42 | 0  | 0   | 0  |
| 69-71 | 70 | 27 | 1  | 27  | 27 |
| 72-74 | 73 | 8  | 2  | 16  | 32 |
| Total |    | 100 |   | 15  | 97 |

We have Σfu = 15, Σfu² = 97, h = 3 and N = Σf = 100.
By definition,
$$\sigma = h\sqrt{\frac{\sum f u^2}{N} - \left(\frac{\sum f u}{N}\right)^2} = 3\sqrt{0.97 - (0.15)^2} = 3\sqrt{0.9475} = 2.92$$

2.4 SAMPLING DISTRIBUTIONS

* A group of pupils in a school plans to investigate how long it takes to travel between home and school. There are 2000 pupils in their school, and they realise that they do not have the time to collect and analyse such a large amount of data.
* They argue that information from some of the pupils should give them what they want, provided these pupils are chosen properly. So they decide to collect data from only a part of the complete school population. We call this part a sample of the school population.

Definition : A sample is any subset of a population. An investigation of this type is said to be a survey of a population.
Definition : In the above example, the information is collected by sampling; such an investigation is called a sample survey.
Definition : If a survey plans to collect information from every member of a population, it is called a census of that population.

* The sample chosen should be a reflection of the whole population; it should reproduce the characteristics of the population. In our problem, the mean journey time is the characteristic of interest for the population of school children.

2.4.1 Random Sampling

There are many sampling schemes that may be called random. We shall only define a simple random sample, which is very straightforward. Other, more complex random sampling schemes are of particular use in certain special types of problem.

Definition of a simple random sample
Definition : A (simple) random sample is a sample which is chosen so that every member of the population is equally likely to be a member of the sample, independently of which other members of the population are chosen.

Some useful terms
For practical reasons the investigator often has to settle for obtaining information about a population which has similar properties to the target population. It is convenient to distinguish these two populations by giving them separate names.
(i) Definition : The target population is the population about which we want information.
(ii) Definition : The study population is the population about which we can obtain information.
(iii) Definition : A sample unit is a potential member of the sample.
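Drawing a simple random sample is a one-liner in most languages; a sketch with Python's random module (the population size and sample size are illustrative assumptions):

```python
import random

population = list(range(1, 2001))          # e.g. IDs of the 2000 pupils
sample = random.sample(population, k=50)   # 50 pupils, all equally likely
print(sample[:10])
```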
2.5 TESTING OF HYPOTHESIS

UQ. Explain hypothesis testing with example. (Q. 3(b), Aug. 18, 4 Marks)
UQ. Explain hypothesis testing in detail with example. (Q. 3(b), Oct. 19, 5 Marks)

* Inference based on deciding about the characteristics of the population on the basis of a sample study is called inductive inference. Such decisions involve an element of risk, the risk of taking wrong decisions. For example, a pharmaceutical concern may have to decide whether a new drug is really effective for a particular ailment, say, in reducing fever.
* The modern theory of probability plays a very vital role in such decision making, and the branch of statistics which helps us in arriving at the criterion for such decisions is known as testing of hypothesis.
* The theory of testing of hypothesis employs statistical techniques to arrive at decisions in situations where there is an element of uncertainty, on the basis of a sample whose size is fixed in advance.

2.5.1 Statistical Hypothesis

* A statistical hypothesis is some assumption or statement, which may or may not be true, about a population, or about the probability distribution which characterises the given population.
* We are supposed to test it on the basis of the evidence from a random sample.
* If the hypothesis completely specifies the population, then it is known as a simple hypothesis; otherwise it is known as a composite hypothesis.

2.5.2 Test of Hypothesis

* A test of a statistical hypothesis is a two-action decision problem after observing a random sample from the given population, the two actions being the acceptance or rejection of the hypothesis under consideration.
* The truth or falsity of a statistical hypothesis is judged on the information contained in the sample. The sample may be consistent or inconsistent with the hypothesis, and accordingly the hypothesis may be accepted or rejected.
* The acceptance of a statistical hypothesis is due to insufficient evidence provided by the sample to reject it, and does not necessarily imply that it is true.

2.5.3 Tests of Significance

From the knowledge of the sampling distribution of a statistic, it is possible to find the probability that a sample statistic would differ from a given hypothetical value of the parameter, or from another sample value, by more than a certain amount, and then to answer the question of the significance of the difference between two independent statistics. This is known as a test of significance.
Thus we can say that :
(i) the difference between a statistic and the corresponding population parameter, or
(ii) the difference between two independent statistics,
is not significant if it can be attributed to fluctuations of sampling; otherwise it is said to be significant.

2.5.4 Null Hypothesis

UQ. Explain Null Hypothesis.

* For applying any test of significance, we set up a hypothesis : a definite statement about the population parameter(s).
* In the words of Prof. R. A. Fisher : "A null hypothesis is the hypothesis which is tested for possible rejection under the assumption that it is true."

Setting up a null hypothesis
As the name suggests, it is always taken as a hypothesis of no difference.

To set the null hypothesis :
(i) Express the claim or hypothesis to be tested in symbolic form.
(ii) Identify the null hypothesis and the alternative hypothesis as follows :
* Take the expression involving the equality sign as the null hypothesis (H0) and the other as the alternative hypothesis (H1).
* Thus, depending on the wording of the original claim, the original claim is sometimes regarded as H0 (if it contains the equality sign) and sometimes as H1 (if it does not contain the equality sign).

2.5.5 Alternate Hypothesis

* Any hypothesis which is complementary to the null hypothesis is called an alternative hypothesis. It is usually denoted by H1.
* An alternative hypothesis H1 is stated in respect of every null hypothesis H0, because the acceptance or rejection of H0 is meaningful only if it is being tested against a rival hypothesis.
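For instance (an illustrative example, not from the text) : suppose a manufacturer claims that the mean life of its batteries is more than 40 hours. The claim μ > 40 carries no equality sign, so it becomes the alternative hypothesis, and we test

H0 : μ = 40 against H1 : μ > 40.

Rejecting H0 on the evidence of a sample then supports the manufacturer's claim.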
2.5.6 Types of Errors

UQ. Explain the following : (i) Type 1 and Type 2 errors. (Q. 4(a), Aug. 18; Q. 3(b), May 19, 4 Marks)

* The decision to accept or reject the null hypothesis H0 is made on the basis of the information supplied by the observed sample observations.
* The four possible situations that arise in any test procedure are as follows :

| True state | Decision from sample : Reject H0 | Decision from sample : Accept H0 |
| H0 true | Wrong (Type I error) | Correct |
| H0 false (H1 true) | Correct | Wrong (Type II error) |

From the above table, it is clear that we may commit two types of errors.

2.5.7 Type I Error and Type II Error

(i) Type I error : A Type I error occurs when we reject the null hypothesis even though the null hypothesis is true. The probability of this error is denoted by α.
(ii) Type II error : A Type II error occurs when we fail to reject the null hypothesis even though the null hypothesis is false. The probability of this error is denoted by β.

| | Null hypothesis is true | Null hypothesis is false |
| Reject null hypothesis | Type I error (false positive) | Correct decision |
| Fail to reject null hypothesis | Correct decision | Type II error (false negative) |

Difference of means : Suppose we have not just a single sample but two samples from different populations, and we wish to compare the separate means, assuming that the variances of the two populations are equal but unknown (the most common situation). The test for this situation is taken up later under the t-test for difference of means (Section 2.9).

2.5.7(A) Comparison between Type I and Type II Errors

UQ. Compare Type-I and Type-II errors. (Q. 4(a), Oct. 19, 5 Marks)

* We make a Type I error by rejecting a true null hypothesis, and
* We make a Type II error by accepting a wrong null hypothesis.
Hence, the usual practice in testing hypothesis isto fix 0, the size of type I error and then try to obtain a criterion which minimizes B, the size of the type II error or maximises (1 ~ f), the power of the test. (MU - New Syllabus wef academic year 22-23)(M8-79) Dib! tech-Neo Publications Applied Data Science (MU-Sem 8-Comp) (Data Exploration} 2A 2.5.9 Level of Significance * The maximum size of type I error, which we prepare to risk is known as level of significance. It is denoted by a and is P [rejecting Ho when Ho is true] = * Commonly used levels of significance in practice are 5% (0.05) and 1% (0.01). + If we adopt 5% level of significance, it implies that we are 95% confident that our decision to reject Hp is correct. ¢ Level of significance is always fixed in advance before collecting the sample information. WS 2.5.10 Critical Region ‘* Suppose we take several samples of the same size from a given population and compute some statistic t, (say X, p etc.), for each of these samples. © — Letty, ta, ..., be the values of the statistic for these samples. Each of these values may be used to test some null hypothesis Ho. © These sample statistics t, t, .... ty (comprising the sample space), may be divided into two mutually disjoint groups, one leading to rejection of Ho and other leading to acceptance of Ho. © The statistics which lead to the rejection of Hy give us a region called critical region (C) or Rejection Region (R), while those which lead to the acceptance of Ho give us a region called Acceptance Region (A). Thus, if the statistic t € C, Hy is rejected and if t € A, Hp is accepted. ‘The sizes of type (1) and type (II) errors in terms of the critical region are defined as a = P [Rejecting Hy when Hp is true] = P [Rejecting Ho/ Ho] = P[te C/Hy) B= P [Accepting Hp when Hg is wrong] = P [accepting Hp when Hy is true] = P [Accepting Ho / Hy] = Pfte A/H) ection) region, A is acceptance region and CW A= CUA=S * Where C is the critical (rej (sample space) (MU - New Syllabus wef academic year 22-23)(M8-79) &l Tech-Neo Publications Applied Data Science (MU-Sem 8-Comp) (Data Exploration)...Page No. (2-21) —SSSSSSSsS9S93939393935 2.5.11 Examples Ex. 2.5.1 : In order to test whether a coin is perfect, it is tossed 5 times. The null hypothesis of perfection is rejected if and only if more than 4 heads are obtained. Obtain the (Critical region, (ii) Probability of type-I error, and Gi) Probability of type Il error, when the corresponding probability of getting a head is 0.2. Soin. : Let X be the number of heads obtained in 5 tosses of a coin. Ho : The coin is perfect ie. unbiased, Let Ho: p=4 We use binomial distribution. under Hy: X~B(n .P=4) P(X=x1Hp)= "Cyp'q?* Sones P(X=x1H)="¢,(4) =4%q,; x=0,1,23,45 @ Critical region or region of rejection Reject Hy if more than 4 heads are obtained Critical region = (x>4} = (x=5} Gi) Probability of type I error («) is @ = P [Reject Hp! Ho] =P [X=51Ho] 5 Hx°e5=F= 003125 iii) The probability of type I error ) is B = P [Accept Ho! Hy] = 1-P [Reject Hy Hy ] = 1-P[X=SIP=0.2) = 1-Ueyp'g?* )x=5,P=02 = 1-°C5(0.2)°- 1 =0.99968 (MU - New Syllabus w.e.f academic year 22-23)(M8-79) ab rech-neo Publications ‘Applied Data Science (MU-Sem 8-Comp) (Data Exploration)...Page N 22) Ex, 2.6.2 : In order to test whether a coin is perfect, it is fussed 5 times, The null hypothesis of perfectness of the coin is accepted if an only if atmost 3 heads are obtained. Then the power of the test corresponding to the alternative hypothesis that probability of head is 0.4 is 272. 
Ex. 2.5.2 : In order to test whether a coin is perfect, it is tossed 5 times. The null hypothesis of perfectness of the coin is accepted if and only if at most 3 heads are obtained. Then the power of the test corresponding to the alternative hypothesis that the probability of a head is 0.4 is :
(i) 272/3125 (ii) 2853/3125 (iii) 56/3125 (iv) none of these.

Soln. : Let X be the number of heads in n = 5 tosses of the coin, and let p be the probability of a head in a random toss of the coin.
Null hypothesis H0 : p = 1/2. Alternative hypothesis H1 : p = 0.4.
Critical region : X > 3.
The power of the test for testing H0 against H1 is given by :
1 − β = P [reject H0 when H1 is true] = P [reject H0 | H1] = P (X > 3 | p = 0.4)
$$= \sum_{x=4}^{5} \binom{5}{x} (0.4)^x (0.6)^{5-x} \qquad [\text{since } X \sim B(n = 5,\, p = 0.4) \text{ under } H_1]$$
$$= \binom{5}{4}(0.4)^4(0.6) + \binom{5}{5}(0.4)^5 = \frac{240}{3125} + \frac{32}{3125} = \frac{272}{3125} = 0.08704$$
Hence option (i) is the correct answer.

2.6 CHI-SQUARE TEST OF GOODNESS OF FIT

Before going into the details of the chi-square test, we study some terms used in this connection.

2.6.1 Contingency Table

Let A and B be the attributes of the given data. Let the data be classified into s classes A1, A2, ..., As according to attribute A, and into t classes B1, B2, ..., Bt according to attribute B. Let Oij be the observed frequency of the cell belonging to the classes Ai (i = 1, 2, ..., s) and Bj (j = 1, 2, ..., t). The data can be set into an s × t contingency table of s rows and t columns as follows :

| Classes | B1  | B2  | ... | Bt  | Total |
| A1      | O11 | O12 | ... | O1t | (A1)  |
| A2      | O21 | O22 | ... | O2t | (A2)  |
| ...     | ... | ... | ... | ... | ...   |
| As      | Os1 | Os2 | ... | Ost | (As)  |
| Total   | (B1)| (B2)| ... | (Bt)| N     |

2.6.2 Degrees of Freedom

The term degrees of freedom refers to the number of "independent constraints" in a set of data. We explain this concept with a few examples :

(1) If the data are given in a contingency table, then the degrees of freedom are calculated by the formula
γ = (c − 1)(r − 1),
where γ stands for the degrees of freedom, c for the number of columns and r for the number of rows. Thus in a 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1, and so on.

(2) If the data are not given in the form of a contingency table but as a series of individual observations, then the degrees of freedom are calculated in a different way. Consider the following distribution :

[Table : number of heads 0, 1, ..., 10 in repeated tosses, with their observed frequencies totalling 1024]

Here, if we write down the expected frequencies, we have the freedom to write any ten figures we choose, but the eleventh figure must be equal to 1024 minus the total of the ten figures we have written, because the total of the expected frequencies must be equal to the total of the actual frequencies. Thus there are ten degrees of freedom in the above question. In such cases the degrees of freedom are equal to (n − 1), where n is the number of frequencies.
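Both counting rules are easy to encode; a trivial sketch for reference:

```python
def dof_contingency(rows: int, cols: int) -> int:
    """(r - 1)(c - 1) cells remain free once the margins are fixed."""
    return (rows - 1) * (cols - 1)

def dof_goodness_of_fit(n_frequencies: int) -> int:
    """n - 1: the last expected frequency is forced by the fixed total."""
    return n_frequencies - 1

print(dof_contingency(2, 2))     # 1
print(dof_goodness_of_fit(11))   # 10, as in the heads example above
```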
2.7 CHI-SQUARE TEST

The square of a standard normal variable is called a chi-square (pronounced as "sky" without the "s") variate with 1 degree of freedom (d.f.). Thus if X is a random variable following the normal distribution with mean μ and standard deviation σ, then
$$Z = \frac{X - \mu}{\sigma}$$ is a standard normal variate, and
$$Z^2 = \left(\frac{X - \mu}{\sigma}\right)^2$$ is a chi-square (χ²) variate with 1 d.f.

If X1, X2, ..., Xr are r independent random variables following normal distributions with means μ1, μ2, ..., μr and standard deviations σ1, σ2, ..., σr respectively, then the variate
$$\chi^2 = \sum_{i=1}^{r} \left(\frac{X_i - \mu_i}{\sigma_i}\right)^2,$$
which is the sum of the squares of r independent standard normal variates, follows the chi-square distribution with r d.f.

2.7.1 Probability Density Function (p.d.f.) of Chi-square Distribution

If χ² is a random variable following the chi-square distribution with γ d.f., then its probability density function is given by
$$P(\chi^2) = \frac{1}{2^{\gamma/2}\,\Gamma(\gamma/2)}\; e^{-\chi^2/2}\,(\chi^2)^{(\gamma/2)-1}, \qquad 0 \le \chi^2 < \infty,$$
where Γ(γ/2) is the Gamma function.

2.7.2 Remark

(1) The probability function P(χ²) depends on the degrees of freedom. As γ changes, P(χ²) changes.
(2) Constants of the χ² distribution with γ d.f. :
Mean = γ ; Mode = γ − 2 ; Variance = 2γ ;
μ2 = 2γ, μ3 = 8γ, μ4 = 48γ + 12γ².
(3) Pearson's coefficient of skewness :
$$S_k = \frac{\text{Mean} - \text{Mode}}{\text{s.d.}} = \frac{\gamma - (\gamma - 2)}{\sqrt{2\gamma}} = \sqrt{\frac{2}{\gamma}}$$
(i) Since the coefficient of skewness > 0, the χ²-distribution is positively skewed.
(ii) Since the skewness is inversely proportional to the square root of the d.f., the distribution tends to symmetry as the d.f. increase. Thus for large d.f., the χ²-distribution tends to the normal distribution.
(4) For large γ, the standardised variate
$$Z = \frac{\chi^2 - E(\chi^2)}{\sqrt{\operatorname{Var}(\chi^2)}} = \frac{\chi^2 - \gamma}{\sqrt{2\gamma}}$$
is approximately a standard normal variate.
(5) Additive property : If χ1², χ2², ..., χk² are independent χ² variates with n1, n2, ..., nk d.f. respectively, then the sum
$$\chi^2 = \sum_{i=1}^{k} \chi_i^2$$
is a χ² variate with (n1 + n2 + ... + nk) d.f.

2.7.3 Applications of χ²-Distribution

Some of the applications of the χ²-distribution are :
(i) the chi-square test of goodness of fit,
(ii) the χ² test for independence of attributes.

2.7.4 Chi-Square Test of Goodness of Fit

The χ² test of goodness of fit is used to test whether the deviation between observation (experiment) and theory may be attributed to chance, or whether it is really due to the inadequacy of the theory to fit the observed data.
Karl Pearson proved that the statistic
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
follows the χ²-distribution with γ = n − 1 d.f., where O1, O2, ..., On are the observed frequencies and E1, E2, ..., En are the corresponding expected frequencies under some theory or hypothesis.

2.7.5 Steps to Compute χ² and Drawing Conclusions

(i) Compute the expected frequencies E1, E2, ..., En corresponding to the observed frequencies O1, O2, ..., On under the given theory.
(ii) Compute the deviations (Oi − Ei) and square them to obtain (Oi − Ei)².
(iii) Divide each (Oi − Ei)² by Ei and add the values to compute χ² = Σ (Oi − Ei)²/Ei.
(iv) Under the hypothesis H0 that the theory fits the data well, the above statistic follows the χ²-distribution with γ = n − 1 d.f.
(v) Look up the tabulated (critical) value of χ² at the 5% or 1% level of significance and draw the conclusion.

2.7.6 Conditions for the Validity of Chi-Square Test

The χ²-test can be used only if the following conditions are satisfied :
(i) The total frequency, N, should be large, say greater than 50.
(ii) The sample observations should be independent; i.e. no individual item should be included twice or more in the sample.
(iii) The constraints should not involve squares or higher powers of the frequencies.
(iv) No theoretical frequency should be small; each should preferably be larger than 10, and in no case less than 5.
(v) The data should be given in original units.
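Step (v) traditionally uses printed chi-square tables; the same critical values can be obtained programmatically, as in this sketch with scipy.stats:

```python
from scipy.stats import chi2

# Upper 5% critical value of chi-square with 9 degrees of freedom
print(chi2.ppf(0.95, df=9))   # ~ 16.919, the tabulated value used below
```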
2.7.7 Examples

Ex. 2.7.1 : The number of scooter accidents per month in a certain town were as follows :
12, 8, 20, 2, 14, 10, 15, 6, 9, 4
Are these frequencies in agreement with the belief that accident conditions were the same during this 10-month period ?

Soln. : Null hypothesis H0 : the given frequencies (i.e. the number of accidents per month) are consistent with the belief that the accident conditions were the same during the 10-month period.
Now, total number of accidents = 12 + 8 + 20 + 2 + 14 + 10 + 15 + 6 + 9 + 4 = 100.
Under the null hypothesis, the expected number of accidents for each of the 10 months = 100/10 = 10 (the accidents being uniformly distributed over the months).
Now, d.f. = 10 − 1 = 9.
∴ Tabulated χ²(0.05) for 9 d.f. = 16.919 ...(i)
We prepare the table for the computation of χ².

| Month | Observed no. of accidents (O) | Expected no. of accidents (E) | O − E | (O − E)² | (O − E)²/E |
| 1 | 12 | 10 | 2 | 4 | 0.4 |
| 2 | 8 | 10 | −2 | 4 | 0.4 |
| 3 | 20 | 10 | 10 | 100 | 10.0 |
| 4 | 2 | 10 | −8 | 64 | 6.4 |
| 5 | 14 | 10 | 4 | 16 | 1.6 |
| 6 | 10 | 10 | 0 | 0 | 0 |
| 7 | 15 | 10 | 5 | 25 | 2.5 |
| 8 | 6 | 10 | −4 | 16 | 1.6 |
| 9 | 9 | 10 | −1 | 1 | 0.1 |
| 10 | 4 | 10 | −6 | 36 | 3.6 |
| Total | 100 | 100 | 0 | — | 26.6 |

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 26.6 \qquad \dots(ii)$$
Since the calculated value χ² = 26.6 is greater than the tabulated value 16.919 from (i), it is significant, and hence the null hypothesis is rejected at the 5% level of significance. We conclude that the accident conditions were certainly not uniform over the 10-month period.

Ex. 2.7.2 : The theory predicts that the proportion of beans in the four groups A, B, C and D should be 9 : 3 : 3 : 1. In an experiment among 1,600 beans, the numbers in the four groups were 882, 313, 287 and 118. Does the experimental result support the theory ? (The table value of χ² for 3 d.f. at the 5% level of significance is 7.81.)

Soln. : Null hypothesis H0 : there is no significant difference between the experimental values and the theory; i.e. the theory supports the experiment.
The proportion of beans in the four groups A, B, C and D should be 9 : 3 : 3 : 1. Hence the theoretical (expected) frequencies are as shown :

| Category | Expected frequency (E) |
| A | (9/16) × 1600 = 900 |
| B | (3/16) × 1600 = 300 |
| C | (3/16) × 1600 = 300 |
| D | (1/16) × 1600 = 100 |

Computation of χ² :

| Category | Observed frequency (O) | Expected frequency (E) | O − E | (O − E)² | (O − E)²/E |
| A | 882 | 900 | −18 | 324 | 0.360 |
| B | 313 | 300 | 13 | 169 | 0.563 |
| C | 287 | 300 | −13 | 169 | 0.563 |
| D | 118 | 100 | 18 | 324 | 3.240 |
| Total | 1600 | 1600 | 0 | 986 | 4.726 |

So χ² = 4.726. Now, d.f. = 4 − 1 = 3, and the tabulated χ² for 3 d.f. at the 5% level is 7.81.
Conclusion : Since the calculated value of χ² is less than the tabulated value, it is not significant. Hence we accept the null hypothesis at the 5% level of significance. Thus, the experimental results support the theory.

Ex. 2.7.3 : A die is rolled 100 times with the following distribution :

| Face | 1 | 2 | 3 | 4 | 5 | 6 |
| Frequency | 17 | 14 | 20 | 17 | 17 | 15 |

At the 0.01 level of significance, determine whether the die is true (or uniform).

Soln. : We have number of categories = 6, and
N = total frequency = 17 + 14 + 20 + 17 + 17 + 15 = 100.
Null hypothesis H0 : the die is true (uniform).
Under H0, the probability of obtaining each of the six faces 1, 2, ..., 6 is the same, i.e. p = 1/6.
∴ Expected frequency for each face = N·p = 100 × (1/6) = 16.67.

Computation of χ² :

| Face | Observed frequency (O) | Expected frequency (E) | O − E | (O − E)² | (O − E)²/E |
| 1 | 17 | 16.67 | 0.33 | 0.1089 | 0.0065 |
| 2 | 14 | 16.67 | −2.67 | 7.1289 | 0.4276 |
| 3 | 20 | 16.67 | 3.33 | 11.0889 | 0.6652 |
| 4 | 17 | 16.67 | 0.33 | 0.1089 | 0.0065 |
| 5 | 17 | 16.67 | 0.33 | 0.1089 | 0.0065 |
| 6 | 15 | 16.67 | −1.67 | 2.7889 | 0.1673 |
| Total | 100 | — | 0 | — | 1.2796 |

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 1.2796 \qquad \dots(i)$$
The degrees of freedom (d.f.) = 6 − 1 = 5. The critical (tabulated) value of chi-square for γ = 5 at the 1% level of significance is χ²(0.01) = 15.086 ...(ii)
Since the calculated value of χ² is less than the critical value, it is not significant. Hence H0 may be accepted at the 1% level of significance; i.e. the die may be regarded as true or uniform.
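Both goodness-of-fit computations above can be verified in one call each; a sketch with scipy.stats.chisquare (by default it assumes equal expected frequencies, which matches both examples):

```python
from scipy.stats import chisquare

# Ex. 2.7.1 : monthly accident counts, uniform expectation of 10 per month
stat, p = chisquare([12, 8, 20, 2, 14, 10, 15, 6, 9, 4])
print(stat, p)   # statistic ~ 26.6, p < 0.05 -> reject H0

# Ex. 2.7.3 : die rolled 100 times, uniform expectation of 100/6 per face
stat, p = chisquare([17, 14, 20, 17, 17, 15])
print(stat, p)   # statistic ~ 1.28, large p -> accept H0
```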
Ex. 2.7.4 : Records taken of the number of male and female births in 800 families having four children each are given in Table P. 2.7.4.

Table P. 2.7.4

| No. of male births | No. of female births | Frequency |
| 0 | 4 | 32 |
| 1 | 3 | 178 |
| 2 | 2 | 290 |
| 3 | 1 | 236 |
| 4 | 0 | 64 |

Test whether the data are consistent with the hypothesis that the binomial law holds and the chance of a male birth is equal to that of a female birth.
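Under the stated hypothesis, the number of male births in a family of four follows B(4, 1/2), so each expected frequency is 800 · ⁴Cₓ (1/2)⁴. A minimal sketch of that computation:

```python
from math import comb

# Expected frequencies under the binomial law B(4, 1/2) for 800 families
expected = [800 * comb(4, x) * 0.5 ** 4 for x in range(5)]
print(expected)   # [50.0, 200.0, 300.0, 200.0, 50.0]
```

These expected values can then be compared with the observed frequencies in Table P. 2.7.4 using the χ² statistic exactly as in the preceding examples.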
