
Data Analytics PYQ (Previous Year Questions with Solutions)

This document compiles previous-year question papers, with model answers, for the T.Y.B.Sc. Computer Science course CS-364: Data Analytics. It includes instructions to candidates and assessment questions covering key concepts such as data analytics, machine learning, and related analytical techniques, including confusion matrices, support and confidence in association rule mining, the data analytics life cycle, applications of machine learning, and challenges in social media analytics.

T.Y.B.Sc. COMPUTER SCIENCE
CS-364 : Data Analytics (CBCS 2019 Pattern) (Semester - VI)
Time : 2 Hours]                [Max. Marks : 35
Instructions to the candidates:
1) Figures to the right indicate full marks.
2) All questions are necessary.
3) Neat diagrams must be drawn wherever necessary.

Q1) Attempt any EIGHT of the following: [8x1=8]

a) Define Data Analytics.
=> Data Analytics: The process of examining, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.

b) Define Tokenization.
=> Tokenization: The process of splitting text into smaller units, like words, phrases, or sentences, to be processed in natural language processing (NLP).

c) Define Machine Learning.
=> Machine Learning: A subset of artificial intelligence (AI) that enables systems to learn and make predictions from data without explicit programming.

d) What is clustering?
=> Clustering: A machine learning technique where data points are grouped into clusters based on similar characteristics or patterns, typically used in unsupervised learning.

e) What is a Frequent Itemset?
=> Frequent Itemset: A set of items that frequently appear together in a transaction database, often used in association rule mining.

f) What is data characterization?
=> Data Characterization: The process of summarizing the general characteristics or properties of a dataset to understand its main features.

g) What is an outlier?
=> Outlier: A data point that significantly deviates from the other points in the dataset, often indicating an anomaly or error.

h) What is Bag of Words?
=> Bag of Words: A text representation method where text is broken down into a collection of words, ignoring grammar and word order, used in natural language processing tasks.

i) What is Text Analytics?
=> Text Analytics: The process of analyzing and extracting meaningful information from text data, including techniques like text mining, sentiment analysis, and NLP.

j) Define Trend Analytics.
=> Trend Analytics: The analysis of data to identify patterns or trends over time, helping to predict future behaviors or events based on historical data.

Q2) Attempt any FOUR of the following: [4x2=8]

a) What is a confusion matrix?
=> Confusion Matrix: A confusion matrix is a performance measurement tool used in classification problems. It is a table that compares the predicted labels with the actual labels. It consists of four values: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), which help in calculating other performance metrics like accuracy, precision, recall, and F1-score.

b) Define support and confidence in association rule mining.
=>
- Support: The support of an itemset is the proportion of transactions in a dataset that contain the itemset. It measures how frequently an itemset appears in the dataset.
- Confidence: Confidence is the probability that a rule's consequent (the outcome) occurs given that the antecedent (the premise) has occurred. It is a measure of the reliability of the rule.

c) Explain any two Machine Learning (ML) applications.
=>
- Email Spam Filtering: ML is used to classify emails as spam or not spam based on features like subject line, sender, and content. The model is trained on labeled data to identify spam patterns.
- Image Recognition: ML techniques, especially deep learning, are applied in recognizing objects, faces, or scenes in images. For example, ML is used in autonomous vehicles to detect pedestrians and road signs.
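For instance, the spam-filtering application described above can be prototyped in a few lines. The following is a minimal sketch only; it assumes scikit-learn is available, and the tiny hand-made training set is purely illustrative:

# Minimal sketch of ML-based spam filtering (illustrative data; scikit-learn assumed available).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting scheduled for monday",
          "claim your free reward", "project report attached"]
labels = [1, 0, 1, 0]                     # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()            # bag-of-words features
X = vectorizer.fit_transform(emails)      # document-term matrix

model = MultinomialNB()                   # a common choice for text classification
model.fit(X, labels)                      # learn spam patterns from the labeled data

test = vectorizer.transform(["free prize waiting for you"])
print(model.predict(test))                # expected output: [1] (classified as spam)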
d) Write a short note on stop words.
=> Stop Words: Stop words are common words such as "and," "the," "is," or "in" that are filtered out during text processing in natural language processing (NLP). These words are usually removed because they do not carry significant meaning and can add noise to the analysis.

e) Define supervised learning and unsupervised learning.
=>
- Supervised Learning: It is a type of machine learning where the model is trained on labeled data. The algorithm learns the mapping from input to output, and it is used for tasks like classification and regression.
- Unsupervised Learning: In unsupervised learning, the model is trained on unlabeled data, and the goal is to find hidden patterns or groupings in the data. Examples include clustering and dimensionality reduction.

Q3) Attempt any TWO of the following: [2x4=8]

a) What is prediction? Explain any one regression model in detail.
=> Prediction: Prediction refers to the process of making an educated guess or estimate about future events or unseen data based on historical patterns and existing data. In machine learning, it is the process where a model is trained on known data (training set) and used to predict the outcome for new or unseen data.

Regression Model: One of the most common regression models is Linear Regression. It predicts a continuous target variable based on one or more predictor variables. It assumes a linear relationship between the dependent variable and the independent variable(s).

Linear Regression:
Equation: The basic form of the linear regression equation is:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
Where:
- Y is the dependent variable (what you're predicting).
- X1, X2, ..., Xn are independent variables (predictors).
- β0 is the intercept (the value of Y when all predictors are 0).
- β1, β2, ..., βn are coefficients representing the impact of each predictor on the dependent variable.
- ε is the error term (the difference between actual and predicted values).

Training the Model: The model is trained by finding the best-fitting line (or hyperplane in multiple dimensions) that minimizes the sum of squared differences between the actual data points and the predicted values (using a method like Ordinary Least Squares).

Interpretation: The coefficients (β) represent the strength and direction of the relationship between each predictor and the target variable. For example, if β1 = 3, it means that for every unit increase in X1, the target Y increases by 3 units, holding other variables constant.

b) Differentiate between Stemming and Lemmatization.
=>
Stemming:
Definition: Stemming is a process in natural language processing (NLP) that reduces words to their root form by stripping off prefixes or suffixes. The root form may not necessarily be a valid word.
Example: "running" → "run"; "happiness" → "happi"
Advantages: It is faster and simpler than lemmatization.
Disadvantages: It can produce non-dictionary words that might not be meaningful in context.

Lemmatization:
Definition: Lemmatization is the process of reducing a word to its base or dictionary form (lemma), ensuring that the resulting word is a valid word in the language.
Example: "running" → "run"; "better" → "good"
Advantages: It results in actual words that carry more meaning, thus preserving the semantic context.
Disadvantages: It is computationally more expensive and requires more linguistic resources (like a dictionary).
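The difference can be seen directly in code. A minimal sketch, assuming NLTK is installed and its WordNet data has been downloaded:

# Stemming vs. lemmatization with NLTK (assumes nltk and its 'wordnet' corpus are available).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "happiness", "better"]
print([stemmer.stem(w) for w in words])                   # ['run', 'happi', 'better']
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # as verbs: ['run', 'happiness', 'better']
print(lemmatizer.lemmatize("better", pos="a"))            # as adjective: 'good'

Note how the stemmer produces the non-word "happi", while the lemmatizer maps "better" to the dictionary form "good" when told its part of speech.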
c) Describe types of Data Analytics.
=>
Descriptive Analytics:
Purpose: Descriptive analytics aims to summarize and describe the main features of a dataset by organizing and presenting the data in a digestible form, such as reports, charts, and graphs.
Example: Analyzing sales data to see how sales have fluctuated over time.
Use Case: Business intelligence, reporting, dashboard visualizations.

Diagnostic Analytics:
Purpose: Diagnostic analytics helps identify the causes or reasons behind certain outcomes. It analyzes historical data to understand why something happened.
Example: Investigating why sales dropped in a particular quarter (e.g., seasonality, marketing issues).
Use Case: Root cause analysis, troubleshooting, and performance analysis.

Predictive Analytics:
Purpose: Predictive analytics uses historical data and machine learning algorithms to make predictions about future outcomes or trends.
Example: Predicting future sales or customer churn.
Use Case: Forecasting, risk assessment, and predictive maintenance.

Prescriptive Analytics:
Purpose: Prescriptive analytics provides recommendations for actions that can help optimize outcomes. It uses optimization techniques and simulations to suggest the best course of action.
Example: Recommending the best marketing strategy to improve sales or customer satisfaction.
Use Case: Decision-making, resource allocation, and strategy formulation.

Q4) Attempt any TWO of the following: [2x4=8]

a) Consider the following transactional database and find out frequent itemsets using the Apriori algorithm with minimum support count = 2.

TID   List of Item IDs
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

=> To find the frequent itemsets using the Apriori algorithm with minimum support count = 2, we follow these steps:

Step 1: List all transactions
From the table:
T1: I1, I2, I5
T2: I2, I4
T3: I2, I3
T4: I1, I2, I4
T5: I1, I3
T6: I2, I3
T7: I1, I3
T8: I1, I2, I3, I5
T9: I1, I2, I3

Step 2: Count support for individual items (1-itemsets)
Item   Support Count
I1     6 (T1, T4, T5, T7, T8, T9)
I2     7 (T1, T2, T3, T4, T6, T8, T9)
I3     6 (T3, T5, T6, T7, T8, T9)
I4     2 (T2, T4)
I5     2 (T1, T8)
All have support >= 2, so all are frequent.
Frequent 1-itemsets (L1): {I1}, {I2}, {I3}, {I4}, {I5}

Step 3: Generate candidate 2-itemsets and count support
Itemset    Support Count
{I1, I2}   4 (T1, T4, T8, T9)
{I1, I3}   4 (T5, T7, T8, T9)
{I1, I4}   1 (T4)
{I1, I5}   2 (T1, T8)
{I2, I3}   4 (T3, T6, T8, T9)
{I2, I4}   2 (T2, T4)
{I2, I5}   2 (T1, T8)
{I3, I4}   0
{I3, I5}   1 (T8)
{I4, I5}   0
Frequent 2-itemsets (L2): {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}

Step 4: Generate candidate 3-itemsets from L2
Only combine those 2-itemsets with enough overlap:
Itemset        Support Count
{I1, I2, I3}   2 (T8, T9)
{I1, I2, I5}   2 (T1, T8)
{I2, I3, I5}   1 (T8)
{I1, I3, I5}   1 (T8)
Frequent 3-itemsets (L3): {I1, I2, I3}, {I1, I2, I5}

Final Answer: All Frequent Itemsets
- 1-itemsets: {I1}, {I2}, {I3}, {I4}, {I5}
- 2-itemsets: {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}
- 3-itemsets: {I1, I2, I3}, {I1, I2, I5}
These are the frequent itemsets with support >= 2 using the Apriori algorithm.
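The hand computation above can be cross-checked with a short brute-force count. This is only a sketch in plain Python (no external libraries); it simply counts every candidate itemset rather than doing Apriori-style pruning:

# Brute-force frequent-itemset counting to verify the worked example (min support count = 2).
from itertools import combinations

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
min_count = 2
items = sorted(set().union(*transactions))

for k in range(1, 4):                                   # 1-, 2- and 3-itemsets
    for candidate in combinations(items, k):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count >= min_count:
            print(set(candidate), "support count =", count)

Running this reproduces the same L1, L2 and L3 sets listed in the final answer.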
b) Which are the challenges in social media analytics?
=>
Data Overload: Social media generates a massive amount of data daily, making it challenging to analyze and extract meaningful insights from the overwhelming volume of posts, comments, images, and videos.
Sentiment Analysis: Accurately determining the sentiment behind social media content (positive, negative, or neutral) is difficult due to variations in language, sarcasm, and context.
Privacy and Ethics: Analyzing personal data while respecting user privacy and adhering to regulations like GDPR presents ethical and legal challenges for organizations.
Dynamic Nature of Content: Social media trends and user behavior change rapidly. Keeping up with evolving topics, slang, and trends requires continuous adjustment to analytical models.

c) Explain Reinforcement Learning.
=> Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent performs actions and receives feedback in the form of rewards or penalties. The goal is to maximize the cumulative reward over time by exploring different actions and learning from past experiences. RL is often used in applications such as robotics, gaming, and autonomous systems, where the agent improves its decision-making strategy based on trial and error.

Q5) Attempt any ONE of the following: [1x3=3]

a) Write a short note on support vector machine.
=> Support Vector Machine (SVM) is a supervised machine learning algorithm primarily used for classification and regression tasks. The goal of SVM is to find a hyperplane that best divides a dataset into two classes, maximizing the margin between them. The algorithm works by selecting support vectors (the data points that lie closest to the hyperplane) to define the decision boundary. SVM can be used in both linear and non-linear classification by employing kernel functions, such as polynomial or radial basis function (RBF), to transform data into a higher-dimensional space for better separation.

b) Explain the life cycle of Data Analytics.
=> The life cycle of data analytics typically involves the following stages:
- Data Collection: Gathering raw data from various sources like databases, sensors, or external APIs.
- Data Cleaning: Processing the collected data to handle missing values, remove duplicates, and correct errors.
- Data Exploration: Analyzing the data to understand patterns, relationships, and distributions using statistical and visualization tools.
- Data Analysis: Applying statistical methods or machine learning models to draw insights or make predictions from the data.
- Model Deployment: Implementing the analysis or model in real-world environments for decision-making or automation.
- Monitoring and Maintenance: Continuously tracking model performance, updating it, and making improvements as needed to ensure accuracy over time.

***

T.Y.B.Sc. (Semester - VI) COMPUTER SCIENCE
CS-364 : Data Analytics (2019 Pattern) (CBCS)
Time : 2 Hours]                [Max. Marks : 35
Instructions to the candidates:
1) All questions are compulsory.
2) Figures to the right indicate full marks.

Q1) Attempt any eight of the following (out of 10): [8x1=8]

a) Define Data Analytics.
=> Data Analytics is the process of examining raw data with the purpose of drawing conclusions about that information. It involves various techniques to clean, transform, and analyze data to uncover useful insights for decision-making.

b) What is AUC & ROC curve?
=>
AUC (Area Under the Curve): It is the area under the ROC curve and summarizes, in a single number, how well a classification model separates the two classes; the closer it is to 1, the better the model.
ROC curve (Receiver Operating Characteristic curve): It is a graphical representation used to assess the performance of a classification model. It plots the true positive rate (sensitivity) against the false positive rate (1 - specificity).
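As an illustration of how these two quantities are usually obtained in practice, here is a minimal sketch; it assumes scikit-learn is available and uses tiny made-up labels and scores:

# Computing an ROC curve and its AUC with scikit-learn (toy data for illustration).
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1]                 # actual class labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]    # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # one (FPR, TPR) point per threshold
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, y_scores))       # area under the ROC curve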
c) Write any two applications of Supervised Machine Learning.
=>
- Spam email detection
- Medical diagnosis (e.g., detecting diseases based on symptoms)

d) Give the formula for support & confidence.
=>
- Support: Support(A) = (Number of transactions containing item A) / (Total number of transactions)
- Confidence: Confidence(A → B) = (Number of transactions containing both A and B) / (Number of transactions containing A)

e) What is an outlier?
=> An outlier is an observation in a dataset that deviates significantly from the other observations, often indicating a variability in measurement or an error.

f) State applications of NLP.
=>
- Sentiment analysis (e.g., determining the sentiment of text data such as reviews or social media posts)
- Chatbots and virtual assistants (e.g., Siri, Alexa)

g) What is web scraping?
=> Web scraping is the process of extracting data from websites by parsing the HTML of web pages. It is used for collecting large amounts of data from the internet for analysis.

h) What is the purpose of n-grams?
=> N-grams are sequences of 'n' items (words or characters) from a given text. They are used in NLP for tasks like text classification, machine translation, and speech recognition to capture patterns in the language.

i) Define classification.
=> Classification is a supervised learning task where the objective is to assign a label or category to an input based on training data (e.g., classifying emails as spam or not spam).

j) Define Recall.
=> Recall, also known as Sensitivity or True Positive Rate, is a performance metric for classification models. It measures the proportion of actual positive instances that are correctly identified by the model.
Formula: Recall = True Positives / (True Positives + False Negatives)

Q2) Attempt any four of the following (out of five): [4x2=8]

a) Explain the concept of underfitting & overfitting.
=>
Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It typically results in poor performance on both training and test data.
Overfitting occurs when a model is too complex and fits the noise or random fluctuations in the training data, leading to poor generalization on new, unseen data.

b) What is Linear Regression? What type of machine learning applications can be solved with linear regression?
=> Linear regression is a supervised machine learning algorithm used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between them.
Applications:
- Predicting house prices based on features like size, location, and number of rooms.
- Forecasting sales based on advertising spend.

c) What is Social Media Analytics?
=> Social Media Analytics is the process of collecting, analyzing, and interpreting data from social media platforms (like Facebook, Twitter, Instagram) to gain insights into user behavior, brand sentiment, and engagement.

d) What are the advantages of the FP-growth algorithm?
=>
- Efficiency: FP-growth is faster than the Apriori algorithm as it eliminates the need to generate candidate itemsets.
- Memory efficiency: It uses a compact data structure (FP-tree) to store frequent itemsets, reducing memory usage.

e) What are dependent & independent variables?
=>
- Dependent variable: The variable that you are trying to predict or explain (also known as the outcome or target variable).
- Independent variable: The variable(s) that are used to predict or explain the dependent variable (also known as predictors or features).
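Tying Q2(b) and Q2(e) together: in the house-price application, price is the dependent variable and size is an independent variable. A minimal sketch, assuming scikit-learn and entirely made-up numbers:

# Fitting a simple linear regression (toy data; scikit-learn assumed available).
from sklearn.linear_model import LinearRegression

sizes  = [[50], [75], [100], [125]]   # independent variable: house size in square metres
prices = [30, 45, 60, 75]             # dependent variable: price (illustrative units)

model = LinearRegression()
model.fit(sizes, prices)              # learns the intercept (β0) and slope (β1)

print(model.intercept_, model.coef_)  # fitted β0 and β1
print(model.predict([[90]]))          # predicted price for a 90 sq. m house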
Q3) Attempt any two of the following (out of three): [2x4=8]

a) What are frequent itemsets & association rules? Describe with example.
=>
Frequent Itemsets: In data mining, a frequent itemset is a set of items that appear together in a transaction database with frequency greater than or equal to a predefined threshold (minimum support). These itemsets are important in market basket analysis and help to discover which products are frequently bought together. For example, if you have a retail store dataset where transactions contain items like 'milk', 'bread', and 'butter', a frequent itemset could be {'milk', 'bread'} if they appear together in a significant number of transactions.

Association Rules: Association rules are a key component of the Apriori algorithm and are used to discover relationships between items in a transaction database. These rules suggest that if a customer buys item A, they are likely to buy item B as well. The association rule is written as {A} → {B}, where A and B are items, and the rule implies that if A is bought, B will also likely be bought. For example, if 70% of customers who buy bread also buy butter, the association rule could be {bread} → {butter} with 70% confidence.

b) What is stemming & lemmatization?
=>
- Stemming: Stemming is the process of reducing a word to its root form by removing suffixes or prefixes. It is often used in text mining and natural language processing (NLP). The goal is to get a common base form of a word, though this may not be a valid word in the dictionary. For example, the stem of "running" is "run," and "happily" becomes "happi."
- Lemmatization: Lemmatization is similar to stemming but is more advanced as it reduces a word to its base or dictionary form, called a lemma. It takes into account the word's meaning and applies correct grammar rules, often producing valid words. For example, "better" becomes "good" and "running" becomes "run" using lemmatization. It usually gives more meaningful and contextually appropriate words than stemming.

c) Explain various types of Data Analytics.
=>
Descriptive Analytics: This type of analytics focuses on understanding and summarizing past data to identify trends and patterns. It answers the question, "What happened?" For example, a company might analyze past sales data to see how their performance was last quarter.
Diagnostic Analytics: Diagnostic analytics is used to understand the cause of past outcomes or events. It answers the question, "Why did it happen?" For example, analyzing customer churn could help determine the reasons customers left a service.
Predictive Analytics: Predictive analytics uses historical data, statistical algorithms, and machine learning techniques to predict future outcomes. It answers the question, "What could happen?" For example, it can predict future sales or customer behavior.
Prescriptive Analytics: This type of analytics provides recommendations for actions to achieve desired outcomes. It answers the question, "What should we do?" For example, it may suggest actions to increase sales or improve customer satisfaction.
These types of data analytics allow businesses and organizations to make better data-driven decisions and improve strategies based on insights gathered from data.
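Returning to the association-rule example in Q3(a), the support and confidence of {bread} → {butter} can be computed directly from a transaction list. A minimal sketch in plain Python with a made-up basket dataset:

# Support and confidence of the rule {bread} -> {butter} on a toy transaction list.
transactions = [
    {"bread", "butter", "milk"}, {"bread", "butter"}, {"bread", "jam"},
    {"milk", "butter"}, {"bread", "butter", "jam"},
]

n = len(transactions)
bread        = sum(1 for t in transactions if "bread" in t)
bread_butter = sum(1 for t in transactions if {"bread", "butter"} <= t)

support    = bread_butter / n        # fraction of all transactions containing both items
confidence = bread_butter / bread    # fraction of bread-buyers who also buy butter
print(f"support = {support:.2f}, confidence = {confidence:.2f}")   # 0.60 and 0.75 here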
Q4) Attempt any two of the following (out of three): [2x4=8]

a) What is Bag of Words & POS tagging in NLP?
=>
Bag of Words (BoW): Bag of Words is a simple and widely used technique in Natural Language Processing (NLP) for text representation. In this approach, a text (such as a sentence or document) is represented as a collection of words, disregarding grammar and word order but keeping track of the frequency of words. Each word in the document becomes a feature, and the count of each word is used as its feature value. The main advantage of BoW is its simplicity, but it has limitations as it ignores word order and context.
Example: Consider two sentences:
"I love NLP"
"NLP is fun"
The vocabulary would be: ["I", "love", "NLP", "is", "fun"]
The corresponding BoW representations would be:
"I love NLP" → [1, 1, 1, 0, 0]
"NLP is fun" → [0, 0, 1, 1, 1]

POS Tagging (Part of Speech Tagging): POS tagging is the process of assigning a part of speech (such as noun, verb, adjective, etc.) to each word in a sentence. It helps understand the structure and meaning of sentences by identifying the role of each word.
Example: In the sentence "The cat sleeps on the mat," POS tagging would assign:
"The" → Determiner (DT)
"cat" → Noun (NN)
"sleeps" → Verb (VBZ)
"on" → Preposition (IN)
"the" → Determiner (DT)
"mat" → Noun (NN)

b) What is Logistic Regression? Explain it with an example.
=> Logistic Regression is a statistical model used for binary classification problems (i.e., problems with two possible outcomes). It predicts the probability that a given input belongs to a certain class. The output is a value between 0 and 1, representing the probability of the positive class. The logistic function (sigmoid function) is used to map the linear output of the model to a probability.

Mathematical Representation: Logistic regression models the relationship between the input features (X) and the probability (P) of the positive class (Y = 1) using the logistic function:
P(Y = 1 | X) = 1 / (1 + e^-(β0 + β1X1 + β2X2 + ... + βnXn))
Where:
- P(Y = 1 | X) is the probability of class 1 given input X.
- β0, β1, ..., βn are the coefficients of the model.
- X1, X2, ..., Xn are the input features.

Example: Suppose you're building a model to predict whether an email is spam (1) or not spam (0) based on the number of specific words appearing in the email.
Feature: Number of times the word "buy" appears (X).
The logistic regression model would learn a function like:
P(spam = 1 | X) = 1 / (1 + e^-(β0 + β1X))
After training, if X (the number of times "buy" appears) is 3, and the model gives a probability of 0.8, it means there is an 80% chance that the email is spam.
In summary, logistic regression outputs a probability, and depending on a threshold (commonly 0.5), you classify the outcome as either 0 or 1 (spam or not spam).
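The sigmoid mapping in this example is easy to reproduce. A minimal sketch in plain Python, where the coefficient values are made up purely for illustration:

# Turning a linear score into a spam probability with the logistic (sigmoid) function.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

beta0, beta1 = -2.0, 1.2          # illustrative coefficients, as if learned from data
for x in range(0, 6):             # x = number of times "buy" appears in the email
    p = sigmoid(beta0 + beta1 * x)
    label = "spam" if p >= 0.5 else "not spam"
    print(f"x = {x}: P(spam) = {p:.2f} -> {label}")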
c) Consider the following database & find out the frequent itemsets using the Apriori algorithm with minimum support threshold = 3.

T.id   Items purchased
1      M, T, B
2      E, T, C
3      M, E, T, C
4      E, C
5      J

=> To find the frequent itemsets using the Apriori algorithm with a minimum support threshold of 3, let's follow these steps:

Step 1: List all transactions
From the table:
T1: M, T, B
T2: E, T, C
T3: M, E, T, C
T4: E, C
T5: J

Step 2: Count support for individual items (1-itemsets)
We count how many transactions each item appears in:
M: 2 times (T1, T3)
T: 3 times (T1, T2, T3)
B: 1 time (T1)
E: 3 times (T2, T3, T4)
C: 3 times (T2, T3, T4)
J: 1 time (T5)
Frequent 1-itemsets (support >= 3): {T}, {E}, {C}

Step 3: Generate candidate 2-itemsets from frequent 1-itemsets
Candidates from {T, E, C}: {T, E}, {T, C}, {E, C}
Count support:
{T, E}: 2 times (T2, T3)
{T, C}: 2 times (T2, T3)
{E, C}: 3 times (T2, T3, T4)
Frequent 2-itemsets (support >= 3): {E, C}

Step 4: Generate candidate 3-itemsets
Only one possible 3-itemset: {T, E, C}, but it needs all three items to be frequent.
Check support for {T, E, C}: it appears in T3 only, so support = 1.
No frequent 3-itemsets.

Final Result: Frequent Itemsets
Frequent 1-itemsets: {T}, {E}, {C}
Frequent 2-itemsets: {E, C}

Q5) Attempt any one of the following (out of 2): [1x3=3]

a) Define the terms i) Confusion Matrix ii) Accuracy iii) Precision
=>
Confusion Matrix: A confusion matrix is a performance measurement tool for classification models. It is a table that shows the actual versus predicted classifications, which allows you to see how well the model is performing. The matrix includes values such as true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
Accuracy: Accuracy is a metric used to evaluate the performance of a classification model. It is the ratio of the number of correct predictions (both true positives and true negatives) to the total number of predictions made. The formula is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision: Precision is a metric that measures how many of the predicted positive instances are actually positive. It is the ratio of true positives to the sum of true positives and false positives. The formula is:
Precision = TP / (TP + FP)
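These formulas translate directly into code. A minimal sketch in plain Python, using made-up counts from a hypothetical confusion matrix:

# Accuracy and precision (plus recall) from confusion-matrix counts (illustrative numbers).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # correct predictions / all predictions
precision = TP / (TP + FP)                    # how many predicted positives are truly positive
recall    = TP / (TP + FN)                    # how many actual positives were found

print(f"accuracy = {accuracy:.2f}, precision = {precision:.2f}, recall = {recall:.2f}")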
b) What is Machine Learning? Explain its types.
=> Machine Learning (ML) is a subset of artificial intelligence (AI) that involves the use of algorithms and statistical models to allow a system to improve its performance on a task through experience, without explicit programming. In other words, ML allows machines to learn from data, identify patterns, and make decisions with minimal human intervention.
Types of Machine Learning:
- Supervised Learning: In supervised learning, the model is trained using labeled data, where the input comes with corresponding output. The goal is for the model to learn the mapping from inputs to outputs so that it can make predictions on unseen data. Examples: Linear regression, Decision trees.
- Unsupervised Learning: In unsupervised learning, the model is trained using data that has no labels. The goal is to find hidden patterns or groupings in the data. It is used for tasks like clustering and association. Examples: K-means clustering, Hierarchical clustering.
- Reinforcement Learning: Reinforcement learning involves training an agent to make a sequence of decisions by rewarding or punishing it based on the outcomes of its actions. The agent learns to maximize the cumulative reward over time. Examples: Q-learning, Deep Q Networks (DQNs).

***

T.Y.B.Sc. COMPUTER SCIENCE
CS-364 : Data Analytics (CBCS Rev 2019 Pattern) (Semester - VI)
Time : 2 Hours]                [Max. Marks : 35
Instructions to the candidates:
1) All questions are compulsory.
2) Neat diagrams must be drawn wherever necessary.
3) Figures to the right indicate full marks.

Q1) Attempt any Eight of the following: [8x1=8]

a) State Occam's razor principle.
=> Occam's Razor Principle: Occam's Razor is a problem-solving principle that suggests that among competing hypotheses, the one with the fewest assumptions should be selected.

b) Define Data Analytics.
=> Data Analytics: Data Analytics is the process of examining datasets to draw conclusions about the information they contain, often with the help of specialized systems and software.

c) What is supervised learning?
=> Supervised Learning: Supervised learning is a type of machine learning where a model is trained on labeled data, meaning that the input data is paired with the correct output, and the model learns to map the inputs to the outputs.

d) What is TF-IDF?
=> TF-IDF: TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used to reflect the importance of a word in a document relative to a collection of documents (corpus). It helps in identifying the relevance of words in a document.

e) What is a frequent itemset?
=> Frequent Itemset: A frequent itemset is a set of items that appear together in a dataset with a frequency that exceeds a specified threshold, commonly used in association rule mining.

f) Define stemming.
=> Stemming: Stemming is the process of reducing a word to its root form, often by stripping suffixes, to help in text processing and information retrieval.

g) What is link prediction?
=> Link Prediction: Link prediction is a type of machine learning task that involves predicting the likelihood of a connection or relationship between two entities in a network, based on the existing patterns.

h) State applications of AI.
=> Applications of AI: Some applications of AI include natural language processing, image recognition, autonomous vehicles, and healthcare diagnostics.

i) State types of logistic regression.
=> Types of Logistic Regression: The two main types of logistic regression are:
- Binary Logistic Regression: Used when the dependent variable has two possible outcomes.
- Multinomial Logistic Regression: Used when the dependent variable has more than two categories.

j) Define precision.
=> Precision: Precision is a metric used in classification tasks to measure the accuracy of positive predictions. It is defined as the ratio of true positive predictions to the total predicted positives (True Positives / (True Positives + False Positives)).

Q2) Attempt any four of the following: [4x2=8]

a) State types of Machine Learning. Explain any one in detail.
=> Types of Machine Learning:
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Reinforcement Learning
Explanation of Supervised Learning: Supervised learning is a type of machine learning where the model is trained on a labeled dataset. This means that each input data point is paired with the correct output label. The algorithm learns by adjusting its weights and parameters to minimize the error between the predicted output and the actual output during training. The main goal of supervised learning is to make predictions based on input-output pairs.
Example: A simple example is email spam detection. The system is trained using a dataset of emails labeled as "spam" or "not spam." The model learns patterns from the features (like words used in the email) to classify new emails correctly.
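Going back to Q1(d), TF-IDF weights are usually obtained from a library rather than computed by hand. A minimal sketch, assuming a recent version of scikit-learn and two toy documents:

# TF-IDF representation of a tiny corpus (recent scikit-learn assumed available).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["data analytics extracts insights from data",
        "machine learning learns patterns from data"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # rows = documents, columns = terms

print(vectorizer.get_feature_names_out())     # the learned vocabulary
print(tfidf.toarray().round(2))               # TF-IDF weight of each term in each document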
b) How is the Receiver Operating Characteristic (ROC) curve created?
=> An ROC curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. Here's the process:
True Positive Rate (TPR) is plotted on the y-axis; it represents the ratio of correctly identified positive samples to the total actual positive samples:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is plotted on the x-axis; it represents the ratio of incorrectly identified negative samples to the total actual negative samples:
FPR = FP / (FP + TN)
Different threshold values are chosen to classify the outputs (e.g., adjusting the threshold for classifying a sample as positive). For each threshold, the TPR and FPR are calculated, and these values are plotted as a point on the graph. The final ROC curve is created by connecting these points. A perfect classifier would have a curve that goes to the top left corner (high TPR and low FPR).

c) What is an association rule? Give one example.
=> An Association Rule is a rule-based machine learning method used to identify relationships or patterns among a set of items in a dataset. It is commonly used in market basket analysis to find associations between products purchased together.
An association rule has two parts:
- Antecedent (LHS): The condition or set of items.
- Consequent (RHS): The result or outcome based on the antecedent.
Example:
Rule: If a customer buys bread, they are likely to buy butter.
Antecedent: {bread}
Consequent: {butter}
This can be written as: bread → butter

d) What is Influence Maximization?
=> Influence Maximization refers to the process of identifying a small set of nodes (individuals, users, or entities) in a network that can maximally spread information, influence, or behavior across the entire network. This concept is often used in social networks, viral marketing, and the spread of opinions or trends. The goal is to select the most influential nodes (seed nodes) in the network so that when they are activated (i.e., exposed to information), they will trigger the spread of influence to the rest of the network in an optimal way.

e) Explain the Knowledge Discovery in Databases (KDD) process.
=> Knowledge Discovery in Databases (KDD) is the process of discovering useful patterns and knowledge from large datasets. It is a multi-step process that involves the following stages:
- Data Selection: The process of selecting the relevant data from the available dataset to be used for further analysis.
- Data Preprocessing: Cleaning the data by handling missing values, removing noise, and correcting inconsistencies.
- Data Transformation: Transforming data into an appropriate format or structure for the analysis (e.g., normalization, aggregation).
- Data Mining: The core step where algorithms are applied to identify patterns, relationships, or trends in the data.
- Pattern Evaluation: Identifying the most interesting and valuable patterns based on predefined criteria or metrics.
- Knowledge Representation: Presenting the discovered knowledge in a human-readable form, often through visualization or reports.
The ultimate goal of KDD is to extract actionable insights and knowledge from the raw data to support decision-making and problem-solving.

Q3) Attempt any two of the following: [2x4=8]

a) Write a short note on community detection.
=> Community Detection: Community detection refers to the process of identifying groups or clusters within a network where nodes (individuals or entities) are more densely connected to each other than to the rest of the network. These communities often represent different subsets or groups with similar characteristics, such as social groups or organizational structures.
Various algorithms, such as modularity optimization and spectral clustering, are used to detect these communities. Community detection helps in understanding the underlying structure of networks, which is useful in fields like social network analysis, biology, and information systems.

b) Explain the Apriori algorithm.
=> Apriori Algorithm: The Apriori algorithm is a classical data mining algorithm used to find frequent itemsets in transactional databases and derive association rules. It works by identifying frequent individual items and then extending them to larger itemsets as long as those itemsets appear sufficiently frequently in the database. The algorithm uses a "bottom-up" approach, generating candidate itemsets and pruning those that are infrequent. It is widely used for market basket analysis, where the goal is to find patterns such as items that are often bought together.

c) Short note on challenges in Social Media Analytics (SMA).
=> Challenges in Social Media Analytics (SMA): Social Media Analytics (SMA) involves analyzing user data from platforms like Twitter, Facebook, and Instagram to extract valuable insights. Some of the key challenges include:
- Data Volume and Variety: Social media generates vast amounts of unstructured data, making it difficult to process and analyze efficiently.
- Data Privacy: Protecting users' privacy and ensuring ethical data collection is a major concern in social media analytics.
- Sentiment Analysis: Understanding the nuances of human language, including sarcasm, slang, and regional differences, is challenging in sentiment analysis.
- Data Noise and Spam: Social media data often contains irrelevant or misleading information, which can impact the accuracy of analyses.
- Dynamic Nature: Social media trends and conversations evolve rapidly, requiring continuous updates and real-time processing.

Q4) Attempt any two of the following: [2x4=8]

a) Explain phases in Natural Language Processing (NLP).
=> Natural Language Processing (NLP) is a subfield of AI that focuses on enabling computers to understand, interpret, and produce human language. The main phases of NLP include:
Text Preprocessing:
- Tokenization: Breaking down text into smaller chunks like words or sentences.
- Normalization: Converting text to a uniform format, such as lowercasing or removing punctuation.
- Stop Word Removal: Eliminating common words like "and", "the", etc., that don't contribute much to meaning.
- Stemming/Lemmatization: Reducing words to their base forms (e.g., "running" becomes "run").
Syntactic Analysis: This phase involves understanding the grammatical structure of sentences through techniques like:
- Part-of-Speech (POS) Tagging: Identifying the grammatical role of each word.
- Parsing: Constructing a syntactic tree to represent sentence structure.
Semantic Analysis: This phase focuses on extracting meaning from the text, which involves:
- Named Entity Recognition (NER): Identifying and categorizing entities like names, places, dates.
- Word Sense Disambiguation (WSD): Determining the meaning of a word based on context.
Pragmatics and Discourse: Understanding the context in which language is used, including:
- Coreference Resolution: Determining which words refer to the same entity.
- Sentiment Analysis: Identifying the sentiment (positive, negative, neutral) in a sentence or document.
Text Generation and Machine Translation: In this phase, NLP is used to generate human-like text or translate between languages, often involving deep learning techniques.
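The preprocessing and syntactic-analysis phases described above map directly onto standard NLTK calls. A minimal sketch, assuming NLTK is installed and its tokenizer, stopword, and tagger resources have been downloaded:

# Tokenization, stop-word removal, and POS tagging with NLTK (required corpora assumed downloaded).
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag

sentence = "The cat sleeps on the mat"
tokens = word_tokenize(sentence)                              # tokenization
print(tokens)

stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t.lower() not in stop_words]  # stop-word removal
print(content)                                                # e.g. ['cat', 'sleeps', 'mat']

print(pos_tag(tokens))   # POS tagging, e.g. [('The', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ'), ...]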
b) Explain exploratory data analytics.
=> Exploratory Data Analysis (EDA) is the initial step in analyzing data sets to summarize their main characteristics, often with visual methods. It helps understand the structure, patterns, and potential anomalies in the data before applying more complex statistical modeling. Key steps in EDA include:
- Data Collection and Cleaning: Gathering relevant data and handling issues such as missing values, duplicates, and inconsistencies.
- Descriptive Statistics: Calculating summary statistics such as mean, median, mode, variance, and standard deviation to understand the central tendency and spread of the data.
- Data Visualization: Visualizing data using histograms, scatter plots, box plots, and pair plots to identify relationships, distributions, and outliers.
- Correlation Analysis: Assessing the relationships between variables, often using correlation matrices, to detect potential dependencies between variables.
- Handling Outliers: Identifying and deciding how to treat outliers, either by removing them or adjusting them.
- Feature Engineering: Creating new features or transforming existing ones to improve model performance in later stages of data analysis.
EDA is a crucial step to inform decisions about model choice, data transformations, and further hypothesis testing. A small code illustration is given after the next answer.

c) Explain the life cycle of Social Media Analytics.
=> The life cycle of social media analytics refers to the steps involved in collecting, analyzing, and interpreting social media data to gain insights into user behavior, brand performance, or audience engagement. The key phases are:
- Data Collection: Gathering data from various social media platforms (e.g., Twitter, Facebook, Instagram) using APIs, scraping tools, or social media monitoring platforms.
- Data Cleaning and Preprocessing: Handling missing data, duplicates, and irrelevant information; standardizing data formats and ensuring consistency in timestamps, usernames, and post content.
- Sentiment and Content Analysis: Sentiment analysis determines the mood (positive, negative, neutral) of posts or comments; content analysis categorizes posts, identifies trending topics, and analyzes hashtags or keywords.
- Engagement Analysis: Measuring user engagement (likes, shares, comments) to assess how well content resonates with the audience.
- Trend and Pattern Detection: Identifying emerging trends, patterns, or themes in the data. This can be done by analyzing hashtags, mentions, or keywords over time.
- Reporting and Visualization: Presenting findings in an easily digestible format using dashboards, charts, and graphs that highlight key metrics (e.g., engagement rate, reach, sentiment scores).
- Actionable Insights and Strategy: Drawing conclusions from the analysis and using those insights to adjust marketing strategies, improve customer interactions, or manage brand reputation.
- Continuous Monitoring: Social media analytics is an ongoing process. Regular monitoring is essential to track changes in user behavior, the effectiveness of campaigns, and shifting trends.
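To make the EDA steps in Q4(b) concrete, here is a minimal sketch using pandas (assumed installed) on a small made-up dataset:

# A few typical exploratory-data-analysis steps with pandas (toy data for illustration).
import pandas as pd

df = pd.DataFrame({
    "size":  [50, 75, 100, 125, None],   # one missing value on purpose
    "price": [30, 45, 60, 75, 90],
})

print(df.describe())          # descriptive statistics (mean, std, quartiles, ...)
print(df.isnull().sum())      # how many missing values per column
df = df.dropna()              # one simple way to handle the missing value
print(df.corr())              # correlation matrix between the numeric columns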
Q5) Attempt any one of the following: [1x3=3]

a) Consider the following transactional database and find out frequent itemsets using the Apriori algorithm with minimum support = 50%.

TID   Items
T1    I1, I2, I3
T2    I2, I3, I4
T3    I4, I5
T4    I1, I2, I4
T5    I1, I2, I3, I5
T6    I1, I2, I3, I4

=> To find the frequent itemsets using the Apriori algorithm with minimum support = 50%, follow these steps:

Step 1: Transaction List
From the table:
TID   Items Purchased
T1    I1, I2, I3
T2    I2, I3, I4
T3    I4, I5
T4    I1, I2, I4
T5    I1, I2, I3, I5
T6    I1, I2, I3, I4
Total Transactions = 6
Minimum Support Count = 50% of 6 = 3 transactions

Step 2: Find Frequent 1-Itemsets
Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2
Frequent 1-itemsets (support >= 3): {I1}, {I2}, {I3}, {I4}

Step 3: Generate Candidate 2-Itemsets from Frequent 1-Itemsets
Candidate 2-itemsets: {I1, I2}, {I1, I3}, {I1, I4}, {I2, I3}, {I2, I4}, {I3, I4}
Count them:
Itemset    Count
{I1, I2}   4
{I1, I3}   3
{I1, I4}   2
{I2, I3}   4
{I2, I4}   3
{I3, I4}   2
Frequent 2-itemsets: {I1, I2}, {I1, I3}, {I2, I3}, {I2, I4}

Step 4: Generate Candidate 3-Itemsets from Frequent 2-Itemsets
Valid combinations: {I1, I2, I3}, {I1, I2, I4}, {I2, I3, I4}
Count them:
Itemset        Count
{I1, I2, I3}   3
{I1, I2, I4}   2
{I2, I3, I4}   2
Frequent 3-itemsets: {I1, I2, I3}

Final Frequent Itemsets (with support >= 50%)
1-itemsets: {I1}, {I2}, {I3}, {I4}
2-itemsets: {I1, I2}, {I1, I3}, {I2, I3}, {I2, I4}
3-itemsets: {I1, I2, I3}

b) Write a short note on Text Analytics.
=> Text Analytics: Text analytics, also known as text mining, is the process of extracting useful information and insights from unstructured text data. It involves techniques from natural language processing (NLP), machine learning, and statistics to analyze text, identify patterns, detect sentiment, and uncover trends. Common applications of text analytics include sentiment analysis, topic modeling, keyword extraction, and spam detection. It is widely used in industries such as marketing, customer service, healthcare, and finance to gain actionable insights from sources like social media, emails, reviews, and reports.

***
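Worked examples like Q5(a) above can also be cross-checked with a library. The following is only a sketch and assumes the third-party mlxtend package (plus pandas) is installed:

# Cross-checking the frequent itemsets of Q5(a) with mlxtend (assumed installed).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [
    ["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
    ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"],
]

encoder = TransactionEncoder()
onehot = encoder.fit(transactions).transform(transactions)   # boolean item matrix
df = pd.DataFrame(onehot, columns=encoder.columns_)

print(apriori(df, min_support=0.5, use_colnames=True))       # itemsets with support >= 50%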
