
Data Modeling March 16

The document discusses key concepts in data modeling including defining modeling, exploring foundational modeling knowledge, Drew Conway's data science Venn diagram, statistical thinking in the age of big data, framing problems, understanding data through exploratory data analysis, extracting features, discussing different types of data and modeling approaches, population and sampling, and probability distributions. The overall topics covered are introducing data modeling, foundational modeling concepts, and statistical inference approaches for modeling data.


Data Modeling

MY NOTES

CSE4/587 B. Ramamurthy 10/27/2021


Learning Objectives

What is modeling?
Let's explore some foundational knowledge for modeling.



Drew Conway’s Venn Diagram on Data Science

[Figure: three overlapping circles labeled Math & Statistics Knowledge, Hacking Skills, and Substantive Expertise; the pairwise overlaps are Traditional Research, Machine Learning, and the Danger Zone, and the three-way center is Data Science (DS).]

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram



Chapters 1 and 2, Data Science

Statistical thinking in the age of big data

You build models to understand the data and to extract meaning and information from it: statistical inference



Let's discuss the road map (Project 1)

1. Frame the problem: understand the use case
2. Extract features: what are the dependent and independent variables (columns and rows in tabular data, for example)?
3. Understand the data: exploratory data analysis
4. Design, code, and experiment: use tools to clean, extract, plot, and view
5. Model the data and analyze: big data, small data, historical, streaming, real-time, etc.
6. Evaluate the goodness of fit! Error coefficients?
7. Prediction (of course) is another outcome of the modeling phase
8. Present and test results: two types of clients, humans and systems
9. Go back to any of the steps based on the insights!



Frame the Problem

Have a standard use case format (what, why, how, stakeholders, data in, info out, challenges, limitations, scope, etc.)
Refer to your software engineering course
Statement of work (SOW): clearly state what you will accomplish, in a real-world context



Understand Data

Data represents the traces of real-world processes.
 Which traces we collect depends on the sampling methods
 You build models to understand the data and extract meaning and information from the data: statistical inference
Two sources of randomness and uncertainty:
 The process that generates the data is random
 The sampling process itself is random
Your mindset should be "statistical thinking in the age of big data"
 Combine the statistical approach with big data



Here are some questions to ask

How big is the data?
Any outliers?
Missing data? (cleaning, preprocessing)
Sparse or dense?
Collisions of identifiers across different sets of data?
Dimensionality reduction
Normalization
Validating and choosing parameters for modeling (for example, K in K-means clustering)
Guidance for model selection

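Several of these questions can be answered mechanically before any modeling starts. A minimal stdlib sketch on a made-up column (the readings, the 1.5×IQR outlier rule, and all values are illustrative assumptions, not part of the course material):

```python
import statistics

# Hypothetical column of sensor readings (all values made up); None marks
# a missing entry and 250.0 is an obvious outlier.
readings = [12.1, 11.8, None, 12.4, 250.0, 11.9, 12.0, None, 12.3]

# How big is the data? Any missing values?
present = [x for x in readings if x is not None]
n_missing = len(readings) - len(present)

# Any outliers? A common rule of thumb: flag points more than 1.5*IQR
# outside the quartiles.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
outliers = [x for x in present if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(f"{len(readings)} rows, {n_missing} missing, outliers: {outliers}")
```

In practice the same checks are one-liners in pandas (`isna().sum()`, quantile filters); the point is that these questions translate directly into code.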


Exploratory Data Analysis (EDA)

 You achieve two things to get you started:
 Get an intuitive feel for the data
 Get a list of hypotheses
 Traditionally: histograms
 EDA is the prototype phase for ML and other sophisticated approaches
 The basic tools of EDA are plots, graphs, and summary stats.
 It is a method for "systematically" going through the data: plotting distributions, plotting time series, looking at pairwise relationships using scatter plots, generating summary stats (e.g., mean, min, max, upper and lower quartiles), and identifying outliers.
 Gain intuition and understand the data.
 EDA is done to understand the data before applying expensive big data methodology or data modeling.

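The summary-stat side of EDA needs nothing beyond the standard library. A sketch on a hypothetical column of response times (in real work, pandas `describe()` and a matplotlib histogram would be the usual tools; the data and bin widths here are made up):

```python
import statistics

# Toy sample (hypothetical response times in ms) standing in for a real column.
data = [101, 98, 103, 97, 250, 99, 102, 100, 96, 104]

# Summary stats: the basic numeric tools of EDA.
q1, median, q3 = statistics.quantiles(data, n=4)
summary = {
    "n": len(data),
    "min": min(data),
    "max": max(data),
    "mean": statistics.mean(data),
    "median": median,
    "q1": q1,
    "q3": q3,
}
for key, value in summary.items():
    print(f"{key:>6}: {value}")

# A crude text histogram gives the "intuitive feel" a plot would:
# the lone 250 immediately stands out against the cluster near 100.
for lo in range(90, 260, 40):
    count = sum(lo <= x < lo + 40 for x in data)
    print(f"{lo:3d}-{lo + 39:3d} | {'#' * count}")
```

Even this tiny example yields a hypothesis (one extreme value inflates the mean well above the median), which is exactly the kind of lead EDA is meant to produce.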


New Kinds of Data (towards big data)

Traditional: numerical, categorical, or binary
Text: emails, tweets, NY Times articles
Records: user-level data, time-stamped event data, JSON-formatted log files
Geo-based location data
Network data (How do you sample and preserve network structure?)
Sensor data
Images



Uncertainty and Randomness

A mathematical model for uncertainty and randomness is offered by probability theory.
A world/process is defined by one or more variables. The model of the world is defined by a function:
Model == f(w) or f(x, y, z) (a multivariate function)
The function is unknown; the model is unclear, at least initially.
Typically our task is to come up with the model, given the data.
Uncertainty is due to lack of knowledge: this week's weather prediction (e.g., 90% confident)
Randomness is due to lack of predictability: which face (1-6) comes up when rolling a die
Both can be expressed by probability theory

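The die example can be simulated in a few lines: each roll is unpredictable, yet the long-run frequencies settle at the distribution's value of 1/6 per face (the seed and roll count are arbitrary choices for reproducibility):

```python
import random
from collections import Counter

# Randomness: the face of a fair die is unpredictable per roll, but the
# long-run frequencies follow a probability distribution (uniform, p = 1/6).
random.seed(587)  # fixed seed so the sketch is reproducible
rolls = [random.randint(1, 6) for _ in range(60_000)]

counts = Counter(rolls)
for face in range(1, 7):
    freq = counts[face] / len(rolls)
    print(f"face {face}: empirical p = {freq:.3f} (theory: {1/6:.3f})")
```

This is the law of large numbers in miniature: individual outcomes stay random, but the aggregate is describable, which is what makes probability the right language for both randomness and uncertainty.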


Statistical Inference

World  collect data  capture the understanding/meaning of the data through models or functions  statistical estimators for predicting things about the world
The development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes



Population and Sample

Population is the complete set of traces/data points
 US population is 314 million, world population is 7 billion, for example
 All voters, all things
Sample is a subset of the complete set (or population): how we select the sample introduces biases into the data
See an example at http://www.sca.isr.umich.edu/
Here, out of the 314 million US population, 250,000 households form the sample (monthly)
Population  mathematical model  sample
(My) big-data approach for the world population: a k-nary tree (MR) of 1 billion (of the order of 7 billion): I basically forced the big-data solution / did not sample. This is possible in the age of big-data infrastructures



Population and Sample (contd.)

Example: emails sent by people in the CSE dept. in a year.
Method 1: 1/10 of all emails over the year, randomly chosen
Method 2: 1/10 of the people randomly chosen; all their email over the year
Both are reasonable sample selection methods for analysis.
However, the estimated pdfs (probability distribution functions) of the emails sent by a person will differ between the two samples.
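The difference between the two methods shows up in a small simulation (the department size, email counts, and seed are all made-up assumptions). Method 1 samples emails, so prolific senders are over-represented among the sampled senders; Method 2 samples people uniformly:

```python
import random

random.seed(42)

# Hypothetical department: 200 people, each sending a made-up number of
# emails per year (a skewed mix: mostly light senders, a few very heavy).
emails_per_person = [random.choice([5, 10, 20, 400]) for _ in range(200)]
all_emails = [p for p, n in enumerate(emails_per_person) for _ in range(n)]

# Method 1: sample 1/10 of all *emails*; note who sent each sampled email.
sampled_emails = random.sample(all_emails, len(all_emails) // 10)
# Method 2: sample 1/10 of the *people*; keep all of their email.
sampled_people = random.sample(range(200), 20)

# Per-sender yearly volume as each method sees it. Method 1 is size-biased:
# heavy senders contribute many emails, so they dominate the sample.
avg_method1 = sum(emails_per_person[p] for p in sampled_emails) / len(sampled_emails)
avg_method2 = sum(emails_per_person[p] for p in sampled_people) / len(sampled_people)
print(f"Method 1 sees avg volume {avg_method1:.0f}; Method 2 sees {avg_method2:.0f}")
```

Both samples are "reasonable," yet they estimate visibly different per-person distributions, which is exactly the slide's point about sampling introducing bias.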
Big Data vs. statistical inference

Sample size N
For statistical inference, N < All
For big data, N == All
For some atypical big data analyses, N == 1
 World model through the eyes of a prolific Twitter user
 Followers of Ashton Kutcher: if you analyze the Twitter data you may get a world view from his point of view
 I heard he moved away from Twitter in 2020 to text messaging!
 Tom Brady has also started this trend!
 Anyway, our goal in big data is to consider N == All.



Big-data context

For inference purposes you don't need all the data.
At Google (the originator of big data algorithms) people sample all the time.
However, if you want to render, you cannot sample.
For some DNA-based searches you cannot sample.
Say we draw some conclusions from samples of Twitter data: we cannot extend them beyond the population that uses Twitter.
And this is what is happening now... be aware of biases.
Another example: the tweets pre- and post-Hurricane Sandy.
The Yelp example...



Extract Features (Project 2 kind)

Data is cleaned up: data wrangling
Ex: remove tags from HTML data
Filter out only the important fields or features, say from a JSON file
Often defined by the problem analysis and the use case
Example: location and temperature are the only important data in a tweet for a particular analysis

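The tweet example above can be sketched with the standard `json` module. The record and its field names (`location`, `temperature_f`) are hypothetical, not real Twitter API fields; the point is simply filtering a raw record down to the features the use case needs:

```python
import json

# A made-up tweet-like JSON record; the field names are illustrative only.
raw = '''
{"id": 123, "text": "Chilly morning in Buffalo",
 "user": {"name": "someone", "followers": 42},
 "location": "Buffalo, NY", "temperature_f": 28.4}
'''

record = json.loads(raw)

# Keep only the features this analysis needs: location and temperature.
# .get() returns None instead of raising if a record lacks the field.
features = {
    "location": record.get("location"),
    "temperature_f": record.get("temperature_f"),
}
print(features)
```

Run over a whole log file, this kind of filter is the "extract features" step: every downstream model sees only the two columns, not the raw records.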


Modeling

 Abstraction of a real-world process
Let's say we have a data set with two columns x and y, and y is dependent on x; we could write it as:
y = β1 + β2∗x (linear relationship)
How do you build a model?
Probability distribution functions (pdfs) are the building blocks of statistical models.
There are many possible distributions



Probability Distributions

Normal, uniform, Cauchy, t-, F-, chi-square, exponential, Weibull, lognormal, ...
These are known as continuous density functions
Any random variable x or y can be assumed to have a probability distribution p(x) if it maps x to a positive real number.
For a probability density function, if we integrate the function to find the area under the curve, it is 1, allowing it to be interpreted as probability.
Further: joint distributions, conditional distributions, ...

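The area-under-the-curve claim can be checked numerically. A sketch for the standard normal pdf using a plain Riemann sum (the integration range and step are arbitrary choices; in practice `scipy.stats` would do this for you):

```python
import math

# Standard normal pdf, written out from its formula:
# p(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Numerically integrate over [-8, 8] with a simple Riemann sum; the tails
# beyond that range contribute a negligible amount.
dx = 0.001
area = sum(normal_pdf(-8 + i * dx) * dx for i in range(int(16 / dx)))
print(f"area under the N(0,1) pdf is approximately {area:.6f}")
```

The sum comes out at 1 to several decimal places, which is what licenses reading the density's area over an interval as a probability.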


Fitting a Model

Fitting a model means estimating the parameters of the model: what distribution, and what are the values of min, max, mean, stddev, etc.
Don't worry: Python libraries have built-in optimization algorithms that readily offer all these functionalities
It involves algorithms such as maximum likelihood estimation (MLE) and optimization methods...
Example: y = β1 + β2∗x  y = 7.2 + 4.5∗x

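For a straight line, "fitting" has a closed form, so the slide's example can be reproduced end to end: generate noisy data from y = 7.2 + 4.5∗x and recover the parameters by ordinary least squares (the sample size, noise level, and seed are made-up choices; numpy's `polyfit` or scikit-learn would do the same job):

```python
import random

# Synthetic data from the slide's example model y = 7.2 + 4.5*x, plus noise.
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [7.2 + 4.5 * x + random.gauss(0, 0.5) for x in xs]

# Ordinary least squares for a line has a closed form:
#   beta2 = cov(x, y) / var(x),   beta1 = mean(y) - beta2 * mean(x)
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
beta2 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
beta1 = my - beta2 * mx

print(f"estimated model: y = {beta1:.2f} + {beta2:.2f}*x")
```

With 100 points and modest noise, the estimates land close to the true 7.2 and 4.5; this closed form is what the library optimizers compute for you in the linear case (MLE and the least-squares solution coincide under Gaussian noise).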


Look at Lin and Dyer's exploration

Next graph: performance of word co-occurrence for the different approaches.
An example of a linear relationship.
Goodness of fit: R²
R² is simply the square of the sample correlation coefficient (i.e., r) between the observed outcomes and the observed predictor values.
We want this as close to 1 as possible (values from 0 to 0.99 are typical).

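The "square of the sample correlation coefficient" definition translates directly into code. A sketch on made-up (x, y) pairs (in practice `numpy.corrcoef` or `scipy.stats.pearsonr` gives you r directly):

```python
import math

# Observed predictor values and observed outcomes (made-up numbers that
# lie close to, but not exactly on, a straight line).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [7.1, 11.2, 13.9, 18.4, 21.0]

# Sample correlation coefficient r, then R^2 = r**2.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
r = sxy / math.sqrt(sxx * syy)
r_squared = r * r
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```

Because the points are nearly collinear, R² comes out just under 1: almost all the variation in y is explained by the linear relationship with x.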


Run it on AWS and evaluate the two approaches: run times for pairs and stripes!

R² is the goodness of fit.

APW corpus: Associated Press Worldstream, 5.7 GB of data
https://data.world/associatedpress
Design, code, deploy

Design first before you code: an important principle
Code using best practices and "software engineering" principles
Choose the right language (Java or Python) and development environment
Document within the code and outside it
Clearly state the steps for deploying the code
Provide troubleshooting tips



Present the Results (Project 2: 20 points)

Good annotated graphs and visuals are important for explaining the results
 Annotate using text, markup, and markdown
Extras: provide the ability to interact with plots and assess what-if conditions
Explore:
d3.js: https://d3js.org/
Tableau: https://www.tableau.com/academic
And a lot of creativity. Do not underestimate this: how do you present your results effectively?
The results should need no explanation!



Iterate

Iterate through any of the steps as warranted by the feedback and the results
The data science process is an iterative process
Before you develop a tool or automation based on the results, test the code thoroughly.
Read Chapter 2



Example 1: Data Collection in Automobiles

Large volumes of data are being collected by the increasing number of sensors that are being added to modern automobiles.
Traditionally this data is used for diagnostic purposes.
How else can you use this data?
How about predictive analytics? For example, predict the failure of a part based on the historical data and the on-board data collected.
On-board diagnostics (OBD) is a big thing in the auto domain.



Example 2: Oil Price Prediction


Modeling methods

Linear regression: pandas, numpy, matplotlib, scikit-learn
https://scikit-learn.org/stable/
Other models: regression (linear and logistic), clustering, classification
More next class.



Next lecture: To do

Look at the project 2 description and all the requirements.
There is no project 3, so do well in project 2.
Project 2: focus on processing NoSQL data using big data methods.
Prepare for the quiz: Chapters 2-4 of Lin and Dyer, and the class notes.

