Unit 1 Report
1. Structured data: Data with a defined data type, format, and structure. It is designed to be easily searchable, and it is quantitative and highly organized.
2. Semi-structured data: Data that does not have a rigid structure of its own but is organized by tags or a hierarchy (for example, by subject or topic); it combines structured elements with free-form text.
Ex: 1. Email - Email messages contain structured fields such as sender name, email address, recipient, date, and time, and they are organized into folders, like Inbox, Sent, and Trash; the body text within each email is unstructured.
2. XML - Widely used to store and exchange semi-structured data. It allows users to define tags and attributes to store the data in hierarchical form (see the sketch after this list).
3. Quasi-structured data: Textual data with erratic data formats that can be
formatted with effort, tools, and time (for instance, web clickstream data that
may contain inconsistencies in data values and formats).
4. Unstructured data: Usually open text, images, videos, etc., that have no predetermined organization or design. Examples include text documents (chats, PDFs, and presentations) and social media data such as posts, tweets, and comments.
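To make the semi-structured case concrete, here is a minimal sketch (tags, field values, and contents are hypothetical) using Python's standard xml.etree.ElementTree: the user-defined tag hierarchy is the structured part, while the text inside the body tag remains free-form.

```python
# Minimal sketch: semi-structured data as user-defined XML tags.
# The tag hierarchy (email -> sender/recipient/sent/body) is the structured part;
# the text inside <body> has no internal structure. All values are hypothetical.
import xml.etree.ElementTree as ET

raw = """
<email>
  <sender>alice@example.com</sender>
  <recipient>bob@example.com</recipient>
  <sent>2024-01-15T09:30:00</sent>
  <body>Hi Bob, attaching the quarterly numbers we discussed last week...</body>
</email>
"""

root = ET.fromstring(raw)
# Structured fields can be queried by tag name.
print(root.find("sender").text, "->", root.find("recipient").text)
# The body is free text with no further structure.
print(root.find("body").text)
```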
1.1.2 Analyst Perspective on Data Repositories:
Spreadsheets:
Advantages: 1. Structured data; 2. Easy to share; 3. End-user control
Disadvantages: 1. Lack of security
Data warehouse:
Advantages: 1. Faster and more efficient data analysis; 2. Better decision-making; 3. Increased data accessibility
Disadvantages: 1. Limited flexibility; 2. Data latency
Analytic Sandbox:
Advantages: 1. Robust analytics; 2. Flexible; 3. High-performance computing; 4. Handles a variety of data, such as raw data, textual data, and other kinds of unstructured data
• There are several things to consider with Big Data Analytics projects to
ensure the approach fits with the desired goals
1.2 State of the Practice in Analytics
1.2.1 BI Versus Data Science
What is BI?
• Business Intelligence is a process of collecting, integrating, analyzing, and presenting data. With Business Intelligence, executives and managers gain a clearer picture of the business to support decision-making.
What is Data science?
• Data Science applies advanced analytical techniques to large, varied, and often unstructured datasets to extract insights, drawing on new sources of data such as:
• Video surveillance
• Mobile devices, which provide geospatial location data of the users, as well as metadata about text messages, phone calls, and application usage on smartphones
• Smart devices, which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures
• Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS navigation systems, and seismic processing
1.2.4 Emerging Big Data Ecosystem and a New Approach to Analytics:
• Organizations and data collectors are realizing that the data they can gather
from individuals contains intrinsic value and, as a result, a new economy is
emerging.
• As this new digital economy continues to evolve, the market sees the introduction of data vendors and data cleaners.
• There are four main groups of players within this interconnected web:
– 1. Data devices
– 2. Data collectors
– 3. Data aggregators
– 4. Data users and buyers
1.3 Key Roles for the New Big Data Ecosystem:
The Big Data ecosystem demands three categories of roles:
1. Deep Analytical Talent:
• combination of skills to handle raw, unstructured data and to apply complex
analytical techniques at massive scales.
• This group has advanced training in quantitative disciplines, such as
mathematics, statistics, and machine learning
2. Data Savvy Professionals:
• tend to have a base knowledge of working with data and an understanding of some of the work performed by data scientists and others with deep analytical talent
3. Technology and Data Enablers:
• provide technical expertise to support analytical projects, such as provisioning and administering analytic sandboxes and managing large-scale data architectures
2.2 Phase 1: Discovery
• The team needs to assess the resources available to support the project, including technology, tools, systems, data, and people.
– Consider the long-term goals of this kind of project, without being constrained by the current data.
– The team will need to determine whether it must collect additional data or purchase it from outside sources
– evaluate how much time is needed
– identify what needs to be achieved in business terms, and identify what needs to be
done to meet the needs. Additionally, consider the objectives and the success criteria
– equally important is to establish failure criteria
– failure criteria will guide the team in understanding when it is best to
stop trying or settle for the results that have been gleaned from the data
2.2.4 Identifying Key Stakeholders
– Another important step is to identify the key stakeholders and their
interests in the project.
– During these discussions, the team can identify the success criteria, key
risks, and stakeholders, which should include anyone who will benefit
from the project or will be significantly impacted by the project
• This evaluation gets the team thinking about which technologies may be good for
the project and how to start getting access to these tools.
5. Scope the sort of data infrastructure needed for this type of problem:
• In addition to the tools needed, the data influences the kind of infrastructure that’s
required, such as disk storage and network capacity
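As a rough illustration of how data volume drives the infrastructure scoping mentioned above, here is a back-of-the-envelope sketch; every number in it (daily volume, retention, replication factor, working-copy overhead) is a hypothetical placeholder to be replaced with real project estimates.

```python
# Back-of-the-envelope storage estimate (all inputs are hypothetical).
daily_raw_gb = 50          # assumed raw clickstream volume per day
retention_days = 365       # assumed retention requirement
replication_factor = 3     # e.g., typical HDFS-style replication
working_copy_overhead = 2  # raw + transformed copies kept in the sandbox

raw_tb = daily_raw_gb * retention_days / 1024
total_tb = raw_tb * replication_factor * working_copy_overhead
print(f"Raw data:    {raw_tb:.1f} TB")
print(f"Provisioned: {total_tb:.1f} TB (with replication and working copies)")
```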
Phase 2: Data Preparation
• The second phase of the Data Analytics Lifecycle involves data preparation,
which includes the steps to explore, preprocess, and condition data prior to
modeling and analysis
• This is typically done by preparing an analytics sandbox.
• To get the data into the sandbox, the team needs to perform ETLT: a combination of extracting, transforming, and loading data into the sandbox.
• Once the data is in the sandbox, the team needs to learn about the data and
become familiar with it.
• The team also must decide how to condition and transform data to get it into
a format to facilitate subsequent analysis.
• The team may perform data visualizations to help team members understand
the data
2.3.1 Preparing the Analytic Sandbox
• The first subphase of data preparation requires the team to obtain an analytic sandbox (also commonly referred to as a workspace), in which the team can explore the data without interfering with live production databases.
• When developing the analytic sandbox, it is a best practice to collect all
kinds of data there, as team members need access to high volumes and
varieties of data for a Big Data analytics project.
• This can include everything from summary-level aggregated data and structured data to raw data feeds and unstructured text data from call logs or web logs, depending on the kind of analysis the team plans to undertake.
2.3.2 Performing ETLT
• In contrast to the traditional ETL sequence (extract, transform, load), the ELT approach advocates extract, load, and then transform.
• In this case, the data is extracted in its raw form and loaded into the data
store, where analysts can choose to transform the data into a new state or
leave it in its original, raw condition.
• The reason for this approach is that there is significant value in preserving
the raw data and including it in the sandbox before any transformations take
place.
• The team may want clean data and aggregated data and may need to keep a
copy of the original data to compare against or look for hidden patterns that
may have existed in the data before the cleaning stage.
• This process can be summarized as ETLT to reflect the fact that a team may
choose to perform ETL in one case and ELT in another.
• Depending on the size and number of the data sources, the team may need to
consider how to parallelize the movement of the datasets into the sandbox.
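A minimal sketch of the ELT half of ETLT, assuming a hypothetical clickstream.csv extract and a local SQLite file standing in for the analytic sandbox: the raw data is loaded untouched, and a transformed copy is derived alongside it rather than in place of it.

```python
# ELT sketch: extract -> load raw -> transform into a separate table,
# preserving the original data in the sandbox for later comparison.
# File name, table names, and column names are hypothetical.
import sqlite3
import pandas as pd

sandbox = sqlite3.connect("analytics_sandbox.db")

# Extract + Load: raw data goes into the sandbox as-is.
raw = pd.read_csv("clickstream.csv")
raw.to_sql("clickstream_raw", sandbox, if_exists="replace", index=False)

# Transform: build a cleaned, aggregated table without touching the raw one.
clean = raw.dropna(subset=["user_id"]).copy()
clean["event_time"] = pd.to_datetime(clean["event_time"], errors="coerce")
daily_counts = (
    clean.groupby(clean["event_time"].dt.strftime("%Y-%m-%d"))
         .size()
         .reset_index(name="events")
)
daily_counts.to_sql("clickstream_daily", sandbox, if_exists="replace", index=False)
sandbox.close()
```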
2.3.3 Learning About the Data
• A critical step is to become familiar with the data itself; learning about the datasets provides context to understand what constitutes a reasonable value and expected output versus what is a surprising finding.
• It is important to identify additional data sources that the team can leverage.
• Some of the activities in this step may overlap with the initial investigation of the datasets that occurs in the discovery phase
2.3.4 Data Conditioning
• Data conditioning refers to the process of cleaning data, normalizing datasets, and
performing transformations on the data.
• Decisions must be made about which data to keep and which data to transform or discard.
• Data conditioning is often viewed as a preprocessing step for the data
analysis.
• Additional questions and considerations for the data conditioning step include the following:
1. What are the data sources? What are the target fields?
2. How clean is the data?
3. How consistent are the contents and files?
4. Assess the consistency of the data types
5. Look for any evidence of systematic error
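The checklist above translates into a few quick checks; the sketch below is one possible way to run them with pandas, assuming the hypothetical clickstream.csv extract and illustrative column names (country_code, event_time).

```python
# Quick data-conditioning checks (file and column names are hypothetical).
import pandas as pd

df = pd.read_csv("clickstream.csv")

# 2. How clean is the data? -> share of missing values per field
print(df.isna().mean().sort_values(ascending=False))

# 3./4. Consistency of contents and data types
print(df.dtypes)
print(df["country_code"].str.len().value_counts())  # should all be 2 if consistent

# 5. Evidence of systematic error, e.g., a timestamp frozen at a single value
print(df["event_time"].value_counts().head())
```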
2.3.5 Survey and Visualize
• After the team has collected and obtained at least some of the datasets needed for
the subsequent analysis, a useful step is to leverage data visualization tools to gain
an overview of the data.
• Shneiderman’s mantra for visual data analysis, “overview first, zoom and filter, then details-on-demand,” describes how data should be presented on screen so that it is most effective for users.
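A minimal pandas/matplotlib sketch of that mantra, again using the hypothetical clickstream extract and illustrative column names (event_time, page_views):

```python
# Overview first, zoom and filter, then details-on-demand (hypothetical data).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clickstream.csv", parse_dates=["event_time"])

# Overview: distribution of every numeric column at a glance.
df.hist(figsize=(10, 6))
plt.tight_layout()
plt.show()

# Zoom and filter: narrow to one segment of interest.
recent = df[df["event_time"] >= "2024-01-01"]
recent["page_views"].plot(kind="box")
plt.show()

# Details on demand: inspect the individual records behind an outlier.
print(recent.nlargest(5, "page_views"))
```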
2.3.6 Common Tools for the Data Preparation Phase
1. Hadoop - can perform massively parallel ingest and custom analysis for web traffic parsing, GPS location analytics, genomic analysis, and combining of massive unstructured data feeds from multiple sources.
2. SQL Analysis services - can perform in-database analytics of common data mining functions, including aggregations and basic predictive models.
3. SAS/ACCESS (Statistical Analysis Software) - provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB. SAS itself is generally used on file extracts, but with SAS/ACCESS, users can connect to relational databases (such as Oracle or Teradata), data warehouse appliances (such as Greenplum or Aster), files, and enterprise applications.
2.5 Phase 4: Model Building
• In Phase 4, the data science team needs to develop datasets for training,
testing, and production purposes.
• These datasets enable the data scientist to develop the analytical model and
train it (“training data”), while holding aside some of the data (“hold-out
data” or “test data”) for testing the model.
• In the model building phase, an analytical model is developed and fit on the
training data and evaluated (scored) against the test data
• The phases of model planning and model building can overlap quite a bit
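A minimal scikit-learn sketch of the training/hold-out split and fit-then-score flow described above; the dataset name, feature columns, and the choice of logistic regression are illustrative assumptions, not part of the lifecycle itself.

```python
# Train/test split, model fitting, and scoring (illustrative model and features).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("customers.csv")                     # hypothetical dataset
X, y = df[["tenure_months", "monthly_spend"]], df["churned"]

# Hold aside test data so the model is evaluated on records it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)    # fit on training data
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```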
Common Tools for the Model Building Phase:
2. Alpine Miner - embeds statistical algorithms in the database to leverage the innate capabilities of parallel-processing databases.
2.6 Phase 5: Communicate Results
• After executing the model, the team needs to compare the outcomes of the
modeling to the criteria established for success and failure
• the key is to remember that the team must be rigorous enough with the data
to determine whether it will prove or disprove the hypotheses outlined in
Phase 1 (discovery).
• When conducting this assessment, determine whether the results are statistically significant and valid. If they are, identify the aspects of the results that stand out (a minimal significance-check sketch appears at the end of this section).
• If the results are not valid, think about adjustments that can be made to
refine and iterate on the model to make it valid
• By this time, the team should have determined which model or models
address the analytical challenge in the most appropriate way.
• Make recommendations for future work or improvements to existing processes, and consider what each of the team members and stakeholders needs in order to fulfill their responsibilities.
• As a result, the deliverable of this phase will be the most visible portion of the process to the outside stakeholders and sponsors, so take care to clearly articulate the results and the methodology.
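As a sketch of the "statistically significant and valid" check referenced above, a simple two-sample comparison is shown below; the metric (daily conversion rate), the numbers, and the choice of Welch's t-test are assumptions made purely for illustration.

```python
# Illustrative significance check: does the treatment group's outcome differ
# from control by more than chance would explain? (all numbers are hypothetical)
from scipy import stats

control   = [0.12, 0.10, 0.11, 0.13, 0.12, 0.09, 0.11]   # e.g., daily conversion rates
treatment = [0.14, 0.15, 0.13, 0.16, 0.14, 0.15, 0.13]

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
alpha = 0.05
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
print("Statistically significant" if p_value < alpha else "Not significant at alpha = 0.05")
```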
2.7 Phase 6: Operationalize
• In the final phase, the team communicates the benefits of the project more
broadly and sets up a pilot project to deploy the work in a controlled way
before broadening the work to a full enterprise or ecosystem of users
• This approach enables the team to learn about the performance and related
constraints of the model in a production environment on a small scale and
make adjustments before a full deployment
• Part of the operationalizing phase includes creating a mechanism for performing ongoing monitoring of model accuracy and, if accuracy degrades, finding ways to retrain the model.
• If feasible, design alerts for when the model is operating “out-of-bounds” (see the sketch at the end of this section).
• This includes situations when the inputs are beyond the range that the model
was trained on, which may cause the outputs of the model to be inaccurate
or invalid.
• If this begins to happen regularly, the model needs to be retrained on new
data
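One simple way to implement the "out-of-bounds" alert described above is to record the range of each input feature seen at training time and flag new inputs that fall outside it; the sketch below is a minimal version of that idea, with the dataset and feature names (customers.csv, tenure_months, monthly_spend) as hypothetical placeholders.

```python
# Minimal out-of-bounds check for model inputs (names are hypothetical).
import pandas as pd

train = pd.read_csv("customers.csv")                 # data the model was trained on
bounds = {col: (train[col].min(), train[col].max())
          for col in ["tenure_months", "monthly_spend"]}

def out_of_bounds(record: dict) -> list:
    """Return the features in a new record that fall outside the training range."""
    return [col for col, (lo, hi) in bounds.items()
            if not (lo <= record[col] <= hi)]

new_record = {"tenure_months": 4, "monthly_spend": 9999.0}
flagged = out_of_bounds(new_record)
if flagged:
    print("ALERT: inputs outside training range:", flagged)  # consider retraining
```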
The key outputs for each of the main stakeholders of an analytics project reflect what they usually expect at the conclusion of the project:
• Business User typically tries to determine the benefits and implications of
the findings to the business.
• Project Sponsor typically asks questions related to the business impact of the project, the risks, and the return on investment (ROI).
• Project Manager needs to determine if the project was completed on time
and within budget and how well the goals were met.
• Business Intelligence Analyst needs to know if the reports and dashboards
he manages will be impacted and need to change.
• Data Engineer and Database Administrator (DBA) typically need to
share their code from the analytics project and create a technical document
on how to implement it.
• Data Scientist needs to share the code and explain the model to her peers,
managers, and other stakeholders.