
Data Modeling March 16

The document discusses key concepts in data modeling including defining modeling, exploring foundational modeling knowledge, Drew Conway's data science Venn diagram, statistical thinking in the age of big data, framing problems, understanding data through exploratory data analysis, extracting features, discussing different types of data and modeling approaches, population and sampling, and probability distributions. The overall topics covered are introducing data modeling, foundational modeling concepts, and statistical inference approaches for modeling data.


Data Modeling

MY NOTES

CSE4/587 B. Ramamurthy 10/27/2021


Learning Objectives

What is modeling?
Let's explore some foundational knowledge for modeling.



Drew Conway’s Venn Diagram on Data Science

[Figure: three overlapping circles labeled Math & Statistics Knowledge, Hacking Skills, and Substantive Expertise; the pairwise overlaps are Traditional Research, Machine Learning, and the Danger Zone, and the three-way center is Data Science (DS).]

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram



Chapters 1 and 2, Data Science

Statistical thinking in the age of big data

You build models to understand the data and to extract meaning and information from it: statistical inference



Let's discuss the road map (Project 1)

1. Frame the problem: understand the use case
2. Extract features: what are the dependent and independent variables (columns and rows in tabular data, for example)?
3. Understand the data: exploratory data analysis
4. Design, code, and experiment: use tools to clean, extract, plot, and view
5. Model the data and analyze: big data, small data, historical, streaming, real-time, etc.
6. Evaluate the goodness of fit! Error coefficients?
7. Prediction (of course) is another outcome of the modeling phase
8. Present and test results: two types of clients, humans and systems
9. Go back to any of the steps based on the insights!



Frame the Problem

Have a standard use case format (what, why, how, stakeholders, data in, info out, challenges, limitations, scope, etc.)
Refer to your software engineering course
Statement of work (SOW): clearly state what you will accomplish, in a real-world context



Understand Data

Data represents the traces of real-world processes.
 Which traces we collect depends on the sampling methods
 You build models to understand the data and extract meaning and information from the data: statistical inference
Two sources of randomness and uncertainty:
 The process that generates the data is random
 The sampling process itself is random
Your mindset should be "statistical thinking in the age of big data"
 Combine the statistical approach with big data



Here are some questions to ask

How big is the data?
Any outliers?
Missing data? (cleaning, preprocessing)
Sparse or dense?
Collisions of identifiers across different sets of data?
Dimensionality reduction
Normalization
Validating and choosing parameters for modeling (for example, K in K-means clustering)
Guidance for model selection

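Several of these questions can be answered mechanically before any modeling starts. A minimal stdlib sketch on a made-up column (the readings, the 1.5×IQR outlier rule, and all values are illustrative assumptions, not part of the course material):

```python
import statistics

# Hypothetical column of sensor readings (all values made up); None marks
# a missing entry and 250.0 is an obvious outlier.
readings = [12.1, 11.8, None, 12.4, 250.0, 11.9, 12.0, None, 12.3]

# How big is the data? Any missing values?
present = [x for x in readings if x is not None]
n_missing = len(readings) - len(present)

# Any outliers? A common rule of thumb: flag points more than 1.5*IQR
# outside the quartiles.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
outliers = [x for x in present if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(f"{len(readings)} rows, {n_missing} missing, outliers: {outliers}")
```

In practice the same checks are one-liners in pandas (`isna().sum()`, quantile filters); the point is that these questions translate directly into code.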


Exploratory Data Analysis (EDA)

 You achieve two things to get you started:
 Get an intuitive feel for the data
 Get a list of hypotheses
 Traditionally: histograms
 EDA is the prototype phase for ML and other sophisticated approaches
 The basic tools of EDA are plots, graphs, and summary stats.
 It is a method for "systematically" going through the data: plotting distributions, plotting time series, looking at pairwise relationships using scatter plots, generating summary stats (e.g., mean, min, max, upper and lower quartiles), and identifying outliers.
 Gain intuition and understand the data.
 EDA is done to understand the data before applying expensive big data methodology or data modeling.

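The summary-stat side of EDA needs nothing beyond the standard library. A sketch on a hypothetical column of response times (in real work, pandas `describe()` and a matplotlib histogram would be the usual tools; the data and bin widths here are made up):

```python
import statistics

# Toy sample (hypothetical response times in ms) standing in for a real column.
data = [101, 98, 103, 97, 250, 99, 102, 100, 96, 104]

# Summary stats: the basic numeric tools of EDA.
q1, median, q3 = statistics.quantiles(data, n=4)
summary = {
    "n": len(data),
    "min": min(data),
    "max": max(data),
    "mean": statistics.mean(data),
    "median": median,
    "q1": q1,
    "q3": q3,
}
for key, value in summary.items():
    print(f"{key:>6}: {value}")

# A crude text histogram gives the "intuitive feel" a plot would:
# the lone 250 immediately stands out against the cluster near 100.
for lo in range(90, 260, 40):
    count = sum(lo <= x < lo + 40 for x in data)
    print(f"{lo:3d}-{lo + 39:3d} | {'#' * count}")
```

Even this tiny example yields a hypothesis (one extreme value inflates the mean well above the median), which is exactly the kind of lead EDA is meant to produce.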


New Kinds of Data (towards big data)

Traditional: numerical, categorical, or binary
Text: emails, tweets, NY Times articles
Records: user-level data, time-stamped event data, JSON-formatted log files
Geo-based location data
Network data (How do you sample and preserve network structure?)
Sensor data
Images



Uncertainty and Randomness

A mathematical model for uncertainty and randomness is offered by probability theory.
A world/process is defined by one or more variables. The model of the world is defined by a function:
Model == f(w) or f(x, y, z) (a multivariate function)
The function is unknown; the model is unclear, at least initially.
Typically our task is to come up with the model, given the data.
Uncertainty is due to lack of knowledge: this week's weather prediction (e.g., 90% confident)
Randomness is due to lack of predictability: which face (1-6) comes up when rolling a die
Both can be expressed by probability theory

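The die example can be simulated in a few lines: each roll is unpredictable, yet the long-run frequencies settle at the distribution's value of 1/6 per face (the seed and roll count are arbitrary choices for reproducibility):

```python
import random
from collections import Counter

# Randomness: the face of a fair die is unpredictable per roll, but the
# long-run frequencies follow a probability distribution (uniform, p = 1/6).
random.seed(587)  # fixed seed so the sketch is reproducible
rolls = [random.randint(1, 6) for _ in range(60_000)]

counts = Counter(rolls)
for face in range(1, 7):
    freq = counts[face] / len(rolls)
    print(f"face {face}: empirical p = {freq:.3f} (theory: {1/6:.3f})")
```

This is the law of large numbers in miniature: individual outcomes stay random, but the aggregate is describable, which is what makes probability the right language for both randomness and uncertainty.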


Statistical Inference

World  collect data  capture the understanding/meaning of the data through models or functions  statistical estimators for predicting things about the world
The development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes



Population and Sample

Population is the complete set of traces/data points
 US population is 314 million, world population is 7 billion, for example
 All voters, all things
Sample is a subset of the complete set (or population): how we select the sample introduces biases into the data
See an example at http://www.sca.isr.umich.edu/
Here, out of the 314 million US population, 250,000 households form the sample (monthly)
Population  mathematical model  sample
(My) big-data approach for the world population: a k-nary tree (MR) of 1 billion (of the order of 7 billion): I basically forced the big-data solution / did not sample. This is possible in the age of big-data infrastructures



Population and Sample (contd.)

Example: emails sent by people in the CSE dept. in a year.
Method 1: 1/10 of all emails over the year, randomly chosen
Method 2: 1/10 of the people randomly chosen; all their email over the year
Both are reasonable sample selection methods for analysis.
However, the estimated pdfs (probability distribution functions) of the emails sent by a person will differ between the two samples.
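The difference between the two methods shows up in a small simulation (the department size, email counts, and seed are all made-up assumptions). Method 1 samples emails, so prolific senders are over-represented among the sampled senders; Method 2 samples people uniformly:

```python
import random

random.seed(42)

# Hypothetical department: 200 people, each sending a made-up number of
# emails per year (a skewed mix: mostly light senders, a few very heavy).
emails_per_person = [random.choice([5, 10, 20, 400]) for _ in range(200)]
all_emails = [p for p, n in enumerate(emails_per_person) for _ in range(n)]

# Method 1: sample 1/10 of all *emails*; note who sent each sampled email.
sampled_emails = random.sample(all_emails, len(all_emails) // 10)
# Method 2: sample 1/10 of the *people*; keep all of their email.
sampled_people = random.sample(range(200), 20)

# Per-sender yearly volume as each method sees it. Method 1 is size-biased:
# heavy senders contribute many emails, so they dominate the sample.
avg_method1 = sum(emails_per_person[p] for p in sampled_emails) / len(sampled_emails)
avg_method2 = sum(emails_per_person[p] for p in sampled_people) / len(sampled_people)
print(f"Method 1 sees avg volume {avg_method1:.0f}; Method 2 sees {avg_method2:.0f}")
```

Both samples are "reasonable," yet they estimate visibly different per-person distributions, which is exactly the slide's point about sampling introducing bias.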
Big Data vs. statistical inference

Sample size N
For statistical inference, N < All
For big data, N == All
For some atypical big data analyses, N == 1
 World model through the eyes of a prolific Twitter user
 Followers of Ashton Kutcher: if you analyze the Twitter data you may get a world view from his point of view
 I heard he moved away from Twitter in 2020 to text messaging!
 Tom Brady has also started this trend!
 Anyway, our goal in big data is to consider N == All.



Big-data context

For inference purposes you don't need all the data.
At Google (the originator of big data algorithms) people sample all the time.
However, if you want to render, you cannot sample.
For some DNA-based searches you cannot sample.
Say we draw some conclusions from samples of Twitter data: we cannot extend them beyond the population that uses Twitter.
And this is what is happening now... be aware of biases.
Another example: the tweets pre- and post-Hurricane Sandy.
The Yelp example...



Extract Features (Project 2 kind)

Data is cleaned up: data wrangling
Ex: remove tags from HTML data
Filter out only the important fields or features, say from a JSON file
Often defined by the problem analysis and the use case
Example: location and temperature are the only important data in a tweet for a particular analysis

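The tweet example above can be sketched with the standard `json` module. The record and its field names (`location`, `temperature_f`) are hypothetical, not real Twitter API fields; the point is simply filtering a raw record down to the features the use case needs:

```python
import json

# A made-up tweet-like JSON record; the field names are illustrative only.
raw = '''
{"id": 123, "text": "Chilly morning in Buffalo",
 "user": {"name": "someone", "followers": 42},
 "location": "Buffalo, NY", "temperature_f": 28.4}
'''

record = json.loads(raw)

# Keep only the features this analysis needs: location and temperature.
# .get() returns None instead of raising if a record lacks the field.
features = {
    "location": record.get("location"),
    "temperature_f": record.get("temperature_f"),
}
print(features)
```

Run over a whole log file, this kind of filter is the "extract features" step: every downstream model sees only the two columns, not the raw records.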


Modeling

 Abstraction of a real-world process
Let's say we have a data set with two columns x and y, and y is dependent on x; we could write it as:
y = β1 + β2∗x (linear relationship)
How do you build a model?
Probability distribution functions (pdfs) are the building blocks of statistical models.
There are many possible distributions



Probability Distributions

Normal, uniform, Cauchy, t-, F-, chi-square, exponential, Weibull, lognormal, ...
These are known as continuous density functions
Any random variable x or y can be assumed to have a probability distribution p(x) if it maps x to a positive real number.
For a probability density function, if we integrate the function to find the area under the curve, it is 1, allowing it to be interpreted as probability.
Further: joint distributions, conditional distributions, ...

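The area-under-the-curve claim can be checked numerically. A sketch for the standard normal pdf using a plain Riemann sum (the integration range and step are arbitrary choices; in practice `scipy.stats` would do this for you):

```python
import math

# Standard normal pdf, written out from its formula:
# p(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Numerically integrate over [-8, 8] with a simple Riemann sum; the tails
# beyond that range contribute a negligible amount.
dx = 0.001
area = sum(normal_pdf(-8 + i * dx) * dx for i in range(int(16 / dx)))
print(f"area under the N(0,1) pdf is approximately {area:.6f}")
```

The sum comes out at 1 to several decimal places, which is what licenses reading the density's area over an interval as a probability.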


Fitting a Model

Fitting a model means estimating the parameters of the model: what distribution, and what are the values of min, max, mean, stddev, etc.
Don't worry: Python libraries have built-in optimization algorithms that readily offer all these functionalities
It involves algorithms such as maximum likelihood estimation (MLE) and optimization methods...
Example: y = β1 + β2∗x  y = 7.2 + 4.5∗x

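For a straight line, "fitting" has a closed form, so the slide's example can be reproduced end to end: generate noisy data from y = 7.2 + 4.5∗x and recover the parameters by ordinary least squares (the sample size, noise level, and seed are made-up choices; numpy's `polyfit` or scikit-learn would do the same job):

```python
import random

# Synthetic data from the slide's example model y = 7.2 + 4.5*x, plus noise.
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [7.2 + 4.5 * x + random.gauss(0, 0.5) for x in xs]

# Ordinary least squares for a line has a closed form:
#   beta2 = cov(x, y) / var(x),   beta1 = mean(y) - beta2 * mean(x)
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
beta2 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
beta1 = my - beta2 * mx

print(f"estimated model: y = {beta1:.2f} + {beta2:.2f}*x")
```

With 100 points and modest noise, the estimates land close to the true 7.2 and 4.5; this closed form is what the library optimizers compute for you in the linear case (MLE and the least-squares solution coincide under Gaussian noise).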


Look at Lin and Dyer's exploration

Next graph: performance of word co-occurrence for the different approaches.
An example of a linear relationship.
Goodness of fit: R²
R² is simply the square of the sample correlation coefficient (i.e., r) between the observed outcomes and the observed predictor values.
We want this as close to 1 as possible (values from 0 to 0.99 are typical).

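The "square of the sample correlation coefficient" definition translates directly into code. A sketch on made-up (x, y) pairs (in practice `numpy.corrcoef` or `scipy.stats.pearsonr` gives you r directly):

```python
import math

# Observed predictor values and observed outcomes (made-up numbers that
# lie close to, but not exactly on, a straight line).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [7.1, 11.2, 13.9, 18.4, 21.0]

# Sample correlation coefficient r, then R^2 = r**2.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
r = sxy / math.sqrt(sxx * syy)
r_squared = r * r
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```

Because the points are nearly collinear, R² comes out just under 1: almost all the variation in y is explained by the linear relationship with x.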


Run it on AWS and evaluate the two approaches: run times for pairs and stripes!

R² is the goodness of fit.

APW corpus: Associated Press Worldstream, 5.7 GB of data
https://data.world/associatedpress
Design, code, deploy

Design first before you code: an important principle
Code using best practices and "software engineering" principles
Choose the right language (Java or Python) and development environment
Document within the code and outside it
Clearly state the steps for deploying the code
Provide troubleshooting tips



Present the Results (Project 2: 20 points)

Good annotated graphs and visuals are important for explaining the results
 Annotate using text, markup, and markdown
Extras: provide the ability to interact with plots and assess what-if conditions
Explore:
d3.js: https://d3js.org/
Tableau: https://www.tableau.com/academic
And a lot of creativity. Do not underestimate this: how do you present your results effectively?
The results should need no explanation!



Iterate

Iterate through any of the steps as warranted by the feedback and the results
The data science process is an iterative process
Before you develop a tool or automation based on the results, test the code thoroughly.
Read Chapter 2



Example 1: Data Collection in Automobiles

Large volumes of data are being collected by the increasing number of sensors that are being added to modern automobiles.
Traditionally this data is used for diagnostic purposes.
How else can you use this data?
How about predictive analytics? For example, predict the failure of a part based on the historical data and the on-board data collected.
On-board diagnostics (OBD) is a big thing in the auto domain.



Example 2: Oil Price Prediction


Modeling methods

Linear regression: pandas, numpy, matplotlib, scikit-learn
https://scikit-learn.org/stable/
Other models: regression (linear and logistic), clustering, classification
More next class.



Next lecture: To do

Look at the project 2 description and all the requirements.
There is no project 3, so do well in project 2.
Project 2: focus on processing NoSQL data using big data methods.
Prepare for the quiz: Chapters 2-4 of Lin and Dyer, and the class notes.

