Dzone Refcard160 Datawarehousing Updated
Data Warehousing:
CONTENTS
- DATA
- DATA MODELING
- NORMALIZED DATA
- FACTS
- DIMENSIONS
decision-support data for some or all of an enterprise. Data warehousing is a broad subject that is described point-by-point in

The data warehouse's technical architecture includes data sources, data integration, BI/analytics data stores, and data access.
DZONE.COM/REFCARDZ
                                                                                                                                                                   DATA WAREHOUSING
Metadata Repository: A software tool that contains data that describes other data. The two kinds of metadata are business metadata and technical metadata.

Data Modeling Tool: A software tool that enables the design of data and databases through graphical means. This tool provides a detailed design capability that includes the design of tables, columns, relationships, rules, and business definitions.

Reporting and Query Tools: present it as reports and/or graphical displays. The business user or analyst will be able to explore the data. These tools also help produce the reports and outputs that are needed to understand the data.

Data Mining Tools: Software tools that find patterns in stores of data or databases. These tools are useful for predictive analytics and optimization analytics.
that other types of software cannot.

Data architecture is a blueprint for the management of data in an enterprise. The data architect builds a picture of how multiple sub-
DATA MODELING
Three levels of data modeling are developed in sequence:

ENTITIES
An entity is a core part of any conceptual and logical data model. An entity is an object of interest to an enterprise: it can be a person, organization, place, thing, activity, event, abstraction, or idea. Entities are represented as rectangles in the data model. Think of entities as singular nouns.
Minimum cardinality is expressed by the symbol farther away from the entity. A circle indicates that an entity is optional, while a bar indicates that an entity is mandatory (at least one is required).

HEADER AND DETAIL ENTITIES
The atomic data warehouse (ADW) is organized into non-changing data with logical keys and changeable data that supports tracking of changes and rapid load/insert. Use an integer as the primary surrogate key. Then, add the effective date to track changes.
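
As a minimal sketch of the header and detail split, the following uses Python's built-in sqlite3 module. The table names, column names, and sample customer are hypothetical and chosen only for illustration; the refcard does not prescribe them. The header holds the non-changing logical key, and the detail receives one inserted row per change, keyed by the surrogate key plus the effective date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Header: non-changing data, integer surrogate key plus the logical key.
    CREATE TABLE customer_header (
        dw_customer_id  INTEGER PRIMARY KEY,
        customer_number TEXT NOT NULL UNIQUE
    );
    -- Detail: changeable data, one row per change, keyed by effective date.
    CREATE TABLE customer_detail (
        dw_customer_id INTEGER NOT NULL
            REFERENCES customer_header (dw_customer_id),
        effective_date TEXT NOT NULL,
        customer_name  TEXT NOT NULL,
        PRIMARY KEY (dw_customer_id, effective_date)
    );
""")
conn.execute("INSERT INTO customer_header VALUES (1, 'C-1001')")
# A change is recorded as a new detail row; existing rows are never updated.
conn.executemany(
    "INSERT INTO customer_detail VALUES (?, ?, ?)",
    [(1, "2018-01-01", "Acme Ltd"), (1, "2018-06-01", "Acme Corp")],
)


def name_as_of(customer_number, as_of):
    """Return the customer name in effect on the given ISO date, or None."""
    row = conn.execute(
        """
        SELECT d.customer_name
        FROM customer_header h
        JOIN customer_detail d ON d.dw_customer_id = h.dw_customer_id
        WHERE h.customer_number = ? AND d.effective_date <= ?
        ORDER BY d.effective_date DESC
        LIMIT 1
        """,
        (customer_number, as_of),
    ).fetchone()
    return row[0] if row else None
```

Because changes only ever insert detail rows, loads stay fast and the full history remains queryable by date.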

ASSOCIATIVE ENTITIES
Track the history of relationships between entities using an associative entity with effective dates and expiration dates.
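
One way to picture an associative entity with effective and expiration dates is the sketch below. The employee-to-department relationship, the table and column names, and the '9999-12-31' open-ended date are all assumptions made for illustration, not something the refcard specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Associative entity: one row per period in which the relationship held.
    CREATE TABLE employee_department (
        employee_id     INTEGER NOT NULL,
        department_id   INTEGER NOT NULL,
        effective_date  TEXT NOT NULL,
        expiration_date TEXT NOT NULL,  -- '9999-12-31' marks an open-ended row
        PRIMARY KEY (employee_id, department_id, effective_date)
    );
""")
conn.executemany(
    "INSERT INTO employee_department VALUES (?, ?, ?, ?)",
    [
        (7, 10, "2017-01-01", "2018-03-31"),
        (7, 20, "2018-04-01", "9999-12-31"),
    ],
)


def department_on(employee_id, on_date):
    """Return the department the employee belonged to on an ISO date, or None."""
    row = conn.execute(
        "SELECT department_id FROM employee_department "
        "WHERE employee_id = ? AND ? BETWEEN effective_date AND expiration_date",
        (employee_id, on_date),
    ).fetchone()
    return row[0] if row else None
```

ISO-formatted date strings compare correctly as text, which is why a plain BETWEEN is enough here.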
NORMALIZED DATA
Normalization is a data modeling technique that organizes data by breaking it down to its lowest level, i.e., its "atomic" components, to avoid duplication. This method is used to design the atomic data warehouse part of the data warehousing system.
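
To make "breaking data down to avoid duplication" concrete, here is a small, hedged sketch in plain Python. The order records and field names are invented for illustration. It splits a record that carries a repeating group of items and a duplicated customer name into three atomic row sets.

```python
# A denormalized order feed: each record repeats the customer name and
# carries a repeating group of (product, quantity) items.
orders = [
    {"order_id": 1, "customer_id": 7, "customer_name": "Acme",
     "items": [("widget", 2), ("gadget", 1)]},
    {"order_id": 2, "customer_id": 7, "customer_name": "Acme",
     "items": [("widget", 5)]},
]


def normalize(records):
    """Split denormalized orders into customer, order, and order-item rows."""
    customers = {}    # customer_id -> name, stored once
    order_rows = []   # (order_id, customer_id)
    item_rows = []    # (order_id, product, quantity), one row per item
    for rec in records:
        customers[rec["customer_id"]] = rec["customer_name"]
        order_rows.append((rec["order_id"], rec["customer_id"]))
        for product, quantity in rec["items"]:
            item_rows.append((rec["order_id"], product, quantity))
    return customers, order_rows, item_rows


customers, order_rows, item_rows = normalize(orders)
```

After the split, the customer name is stored once and every item is its own row, so no entity contains a repeating group.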
First Normal Form (1NF): Entities contain no repeating groups of attributes.
Second Normal Form (2NF): The entity is in the first normal form, and attributes that depend on only part of a composite key are separated into new entities.
The entity is in the second normal form

ATOMIC DW SPECIALIZED ATTRIBUTES
Use specialized attributes to improve ADW efficiency and effectiveness. Identify these attributes using a prefix of ADW_.

ATTRIBUTE NAME    DESCRIPTION
dw_xxx_id         Data warehouse assigned surrogate key. Replace ‘xxx’ with a reference to the table name, such as ‘dw_customer_
SUPPORTING TABLES
Supporting data is required to enable the data warehouse to operate smoothly. Here is some supporting data:
  •   Code management and translation.
  •   Data source tracking.
  •   Error logging.

CODE TRANSLATION
Data warehousing requires that codes, such as gender code and units of measure, be translated to standard values aided by code-translation tables like these:

DIMENSIONAL DATABASE
A dimensional database is a database that is optimized for query and analysis and is not normalized like the atomic data warehouse. It consists of fact and dimension tables, where each fact is connected to one or more dimensions.

The sales order fact includes the measures order quantity and currency amount. Dimensions of Calendar Date, Product, Customer, Geo Location, and Sales Organization put the sales order fact into context. This star schema supports looking at orders in a cubical
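
The sales order star schema described above can be sketched with Python's built-in sqlite3 module. For brevity this sketch keeps only two of the five dimensions (Product and Calendar Date); the table names, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT
    );
    CREATE TABLE dim_calendar_date (
        date_key INTEGER PRIMARY KEY,
        year     INTEGER,
        month    INTEGER
    );
    -- Fact table: foreign keys to the dimensions plus the two measures.
    CREATE TABLE fact_sales_order (
        date_key        INTEGER REFERENCES dim_calendar_date (date_key),
        product_key     INTEGER REFERENCES dim_product (product_key),
        order_quantity  INTEGER,
        currency_amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_calendar_date VALUES (20180115, 2018, 1), (20180220, 2018, 2);
    INSERT INTO fact_sales_order VALUES
        (20180115, 1, 10, 100.0),
        (20180115, 2,  1,  50.0),
        (20180220, 1,  5,  50.0);
""")

# Dimensions supply the labels and grouping; the fact supplies the measures.
rows = conn.execute("""
    SELECT p.product_name, d.month,
           SUM(f.order_quantity), SUM(f.currency_amount)
    FROM fact_sales_order f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_calendar_date d ON d.date_key = f.date_key
    GROUP BY p.product_name, d.month
    ORDER BY p.product_name, d.month
""").fetchall()
```

Adding the Customer, Geo Location, and Sales Organization dimensions would follow the same pattern: one more key column on the fact and one more join in the query.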
AGGREGATED FACT
Aggregated facts provide summary information, such as general ledger totals during a period of time or complaints per product per store per month.

DIMENSIONS
A dimension is a database table that contains properties that identify and categorize. The attributes serve as labels for reports and as data points for summarization. In the dimensional model, dimensions surround and qualify facts.

DEGENERATE DIMENSION
A degenerate dimension has a dimension key without a dimension table. Examples include transaction numbers, shipment numbers, and order numbers.
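
As a rough illustration of both an aggregated fact and a degenerate dimension, the sketch below rolls detail-level complaint facts up to "complaints per product per store per month". The field names and sample rows are hypothetical. The complaint number travels on each detail row as a degenerate dimension, with no dimension table behind it.

```python
from collections import defaultdict

# Detail-level facts. complaint_number is a degenerate dimension: it is a key
# carried on the fact row itself and has no dimension table of its own.
complaint_facts = [
    {"complaint_number": "CMP-001", "product": "Widget", "store": "S1", "month": "2018-01"},
    {"complaint_number": "CMP-002", "product": "Widget", "store": "S1", "month": "2018-01"},
    {"complaint_number": "CMP-003", "product": "Gadget", "store": "S1", "month": "2018-01"},
    {"complaint_number": "CMP-004", "product": "Widget", "store": "S2", "month": "2018-02"},
]


def aggregate_complaints(facts):
    """Roll detail facts up to a count per (product, store, month)."""
    counts = defaultdict(int)
    for fact in facts:
        counts[(fact["product"], fact["store"], fact["month"])] += 1
    return dict(counts)


aggregated = aggregate_complaints(complaint_facts)
```

The aggregated result no longer carries the complaint number, since the summary grain is coarser than a single complaint.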
SCD Type 2: Adds a new row. Each change adds a new row in which all values are the same except for the changed fields. This means that one or more new fields are added to mark the rows and state which one is effective.

CHANGE DATA CAPTURE (CDC)
The CDC pattern of data integration is strong in event processing. Database logs that contain a record of database changes are replicated in near real time at staging. This information is then transformed and loaded to the data warehouse.
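
The two ideas above can be tied together in a small sketch: change records, as they might arrive from a replicated database log, are applied to a dimension as SCD Type 2 rows. Everything named here (the DimRow class, the apply_change function, the '9999-12-31' high date, and the is_current flag) is an assumption for illustration, not a prescribed design.

```python
from dataclasses import dataclass
from typing import List, Optional

HIGH_DATE = "9999-12-31"  # assumed convention for "still effective"


@dataclass
class DimRow:
    customer_id: int
    city: str
    effective_date: str
    expiration_date: str = HIGH_DATE
    is_current: bool = True


def apply_change(dim: List[DimRow], customer_id: int, city: str,
                 change_date: str) -> None:
    """Apply one captured change as SCD Type 2: expire the current row, add a new one."""
    current: Optional[DimRow] = next(
        (r for r in dim if r.customer_id == customer_id and r.is_current), None
    )
    if current is not None:
        if current.city == city:
            return  # nothing changed, so keep the current row as it is
        current.expiration_date = change_date
        current.is_current = False
    dim.append(DimRow(customer_id, city, change_date))


# Change records in log order: (customer_id, city, change_date).
change_log = [
    (1, "Cary", "2018-01-01"),
    (1, "Raleigh", "2018-05-01"),
    (1, "Raleigh", "2018-06-01"),  # repeats the same value, so no new row
]
dim: List[DimRow] = []
for record in change_log:
    apply_change(dim, *record)
```

The expiration date and current flag are the "new fields" that mark which row is effective; the old row is kept, so history is preserved.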
for batch processing of bulk data.

modeling, where an adaptive schema changes in real time along with the data, and changes are seamless. You only need to upload the data sources; everything else is automated, including the following tasks:
  •   Data types are automatically discovered, and a schema is generated based on the initial data structure.
  •   Likely relationships between tables are automatically detected and used to model a relational schema.
  •   Aggregations are automatically generated.
  •   Table history, which stores data uploaded from API data
  •   Re-indexing happens automatically whenever the algorithm detects changes in query patterns.
  •   Redistributing the data across nodes to improve data locality and join performance is done automatically.

SOLVING CONCURRENCY ISSUES
To remedy concurrency issues, new cloud data warehousing technologies can separate storage from compute and increase the compute nodes based on the number of connections. Consequently, the number of available clusters scales with the number of users and the intensity of the workload, supporting hundreds of parallel queries that are load-balanced between clusters.

uninterrupted. When the scaling is complete, the old and new clusters are swapped instantly. Data warehouse maintenance itself has been greatly improved as well, by automating the cleaning and compressing of tables to boost database performance.
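
As a toy illustration of the first automated task listed above (discovering data types and generating a schema from the initial data), the sketch below infers a column type from string values. It is not how any particular product works; the function names and the three-type scheme (INTEGER, REAL, TEXT) are assumptions made for this example.

```python
def infer_type(values):
    """Pick the narrowest of INTEGER, REAL, TEXT that fits every non-empty value."""
    def fits(cast):
        for value in values:
            if value == "":
                continue  # treat empty strings as missing values
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if fits(int):
        return "INTEGER"
    if fits(float):
        return "REAL"
    return "TEXT"


def infer_schema(rows):
    """Map each column name to an inferred type, given a list of string-valued dicts."""
    columns = rows[0].keys()
    return {col: infer_type([row[col] for row in rows]) for col in columns}


sample = [
    {"order_id": "1", "amount": "19.99", "customer": "Acme"},
    {"order_id": "2", "amount": "5", "customer": "Globex"},
]
schema = infer_schema(sample)
```

A real system would sample far more rows and handle dates, nulls, and later type changes; this only shows the basic idea of deriving a schema from the data itself.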
DZone communities deliver over 6 million pages each month to more than 3.3 million software developers, architects and decision makers. DZone offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code and more. "DZone is a developer’s dream," says PC Magazine.

DZone, Inc.
150 Preston Executive Dr. Cary, NC 27513
888.678.0399 919.678.0300

Copyright © 2018 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.