Dimensional data model overview - update in 2024.
About Jack Zheng
Faculty of IT at Kennesaw.edu
dimension model
data model
star schema
multidimensional model
dimension
Model
Data
table
http www
Dimensional Model
Data Model
Fact table
dimension table
Dimensional data model overview - update in 2024.
About Jack Zheng
Faculty of IT at Kennesaw.edu
dimension model
data model
star schema
multidimensional model
dimension
Model
Data
table
http www
Dimensional Model
Data Model
Fact table
dimension table
Dimensional Data Model Basics Jack G. Zheng Spring 2024 http://idi.kennesaw.edu/it4713 http://idi.kennesaw.edu/it7123 https://www.edocr.com/v/bmvojelj/jgzheng/dimensional-model IT 4713 BI System IT 7123 BI Overview 1. Dimensional model basic concepts 2. Dimensional model as relational schemas – Star schema, snowflake schema 3. Basic dimension modeling techniques and process 2 Dimensional model is the basis of multidimensional analysis. It is also the foundation of traditional BI - data warehouse design and OLAP reporting. The concepts are still used in many modern BI tools including Power BI. This lecture notes introduce the basics the dimensional data model and its design at the conceptual and logical level. Data Model Review • A data model conceptualizes data elements and standardizes how the data elements relate to one another • Why data model (in BI and analytics)? – Facilitates the understanding, query, reporting, analysis, and other use of data – Creating business meaning & context – Understand source and target data systems – Optimize data structures to align queries and reports • Three levels: conceptual vs. logical vs. physical data models. • Typical data models – Entity relationship model (a conceptual model) – Relational data model (a logical model) – Dimensional data model – Graph (network) data model – Hierarchical data model 3 Image from https://www.slideshare.net/Dataversity/ldm-webinar- data-modeling-business-intelligence and also view the webinar https://www.dataversity.net/ldm-webinar- data-modeling-business-intelligence/ The complexity increases from conceptual to logical to physical. For complete info, refer to the IT 3703 data model lecture notes. Dimensional Data Model • Dimensional data model is a data model that specifically constructed around the elements of facts and dimensions – It is directly based on the most common business questions and queries. – It is the platform for (traditional) BI. • Dimensional model is a conceptual model and can be implemented in a logical model (like the relational model) or in a logical multidimensional database model like an OLAP server • Where is Dimensional Model used? 1. A major data warehouse design method by Ralph Kimball (http://en.wikipedia.org/wiki/Ralph_Kimball) • Dimensional modeling has been broadly accepted as the dominant technique for DW/BI presentation. • “Based on our experience and the overwhelming feedback from numerous practitioners from companies like your own, we believe that dimensional modeling is absolutely critical to a successful DW/BI initiative.” (Kimball) 2. The basis of OLAP (multidimensional database) 3. Commonly referenced (but may not be enforced) in many self-service BI tools. 4. The foundation of multidimensional analysis; most widely used in many business (descriptive) analysis 4 IT 4713 covers all areas; IT 7123 particularly focuses on #3 and #4. Multidimensional Analysis • Multi-dimensional queries – A dimension is a particular way (or an attribute) of describing and categorizing data – Such queries are usually arithmetic aggregation operations (sum, average, etc.) on records grouped by multiple dimensions (attributes) at different aggregation levels. – A pivot table or crosstab is usually used for OLAP result view (aggregated data) • Example analysis – More about multidimensional analysis (we will cover more about multidimensional analysis in a later module) https://www.youtube.com/watch?v=IhFkNmVmwn4 5 "What is the total sales amount grouped by product line (dimension 1), location (dimension 2), time (dimension 3) and … (other dimensions)?" "Which segment of business provides the most revenue growth?" Multidimensional Analysis using the Matrix Table 6 A pivot table (or matrix table, crosstab) is usually used for multidimensional analysis result view (data aggregated by dimensions). This is the transactional data report with line by line data in a flat table. Dimension (columns) Dimension (rows) Aggregated data by dimension members Measure Dimension Model Key Concepts • The following are the key concepts and elements in dimensional model – Fact/measure – Dimension – Attribute – Hierarchy – Aggregate 7 Fact/Measure • A measure (or fact) contains measurement data (often numeric values) that measures an aspect of the business. • A measure has two components – Numerical value: e.g. sales price – Aggregation formula: e.g. sum, used for aggregating/combining a set of values number of measure values into one measure value determined by dimension value combination • Different types of businesses have different facts. For example: – Retail: gross sales dollars, total cost, profit, etc. – University: student enrollment, number of degree programs, number of graduates, etc. – Basketball player: score, rebound, assist, etc. – Web traffic: page views, number of visitors, etc. 8 Three Types of Measures Additive • Can be added over all dimensions • Example: sales amount Semi-additive • Cannot be added (but may be averaged) over some dimensions - typically time • Example: inventory, enrollment, class size, account balance, etc. Non-additive • Cannot be added over any dimension. Can use other functions like count or average. • Example: product unit price, download speed, network quality, signal strength, ratios, etc. 9 Dimension • A dimension is a structure by which facts can be calculated/aggregated – Consists of non-overlapping data elements which are used to categorize or group facts and provided aggregated calculations – For example: date, geo region, product, etc. • Dimension attribute: is a piece of information describing a particular component of a dimension. – For example, the date dimension has attributes of month (month number, full month name, abbreviate month name, etc.), day, year, etc. – Dimension attributes are usually qualitative type of data (ordinal or nominal, see https://dwbi.org/pages/31/classifying-data-for-successful-modeling) • Dimension attribute serves the primary source of query constraints and grouping levels. In a typical query and report, usually identified by the “by” keyword. • Example – Sales amount by state (location), by year (date), by gender (customer), product type (product), etc. 10 Example/Exercise • In this following statements – What is the measure? – What are the dimensions? Or what kind of information is used to group data? • In this following analysis/report https://twitter.com/ZachLowe_NB A/status/547036034660573184 – How did the author group the data? – Which part is used as dimension? 11 “Currently, 65 percent of republican women view him favorably, up from 55 percent in March -- and 49 percent in September.” “Trilliant Health's analysis found that during the pandemic, telehealth was primarily used for behavioral health, especially by commercially insured women between the ages of 20 and 49.” Banded/Ranged Dimension • Can numeric values be a dimension? See the following example about the “income” data • Income as a fact/measure – “According to the US Census Bureau persons with doctorates in the United States had an average income of roughly $81,400. The average for an advanced degree was $72,824 with men averaging $90,761 and women averaging $50,756 annually.” • Income (range) as a dimension – ranged dimension – http://en.wikipedia.org/wiki/Household_income_in_the_United_State s#Household_income 12 This is called a ranged/banded dimension; as they are used to group the data in this case, not to measure it. Hierarchies • Hierarchy is a structured way to group attributes in multiple levels within a dimension – The lowest level is called a leaf level. • Examples of hierarchy in a dimension – Location dimension: country, state, city – Organization dimension: section, division, branch, region – Date dimension: year, quarter, month, date • A dimension can have multiple hierarchies – https://www.informationweek.com/software/information- management/kimball-university-maintaining-dimension-hierarchies/d/d- id/1073326 • Dimensions contain hierarchies. The hierarchies contain levels. And the levels contain members. • Dimension member – Members are the instances (values) in a level. For example, month has members of Jan, Feb, etc. 1 3 Aggregate • Aggregate is a value formed by combining facts from a given dimension member of one or more dimensions to create a single value. • Aggregation can be based on any formula, with the common ones to be: summation, average, count, etc. • Examples – Total sales amount in December (the month member in the date dimension) – Total sales amount in Georgia in December (the state member of the location dimension + the month member in the date dimension) – Average sales amount in each state in December 1 4 Modeling Dimensions • The identification of dimensions and dimension members is the key to multidimensional analysis • It directly reflects modeler’s understanding of the domain and problems being analyzed • Exercise – Identify dimensions with possible hierarchies for the following piece of data example, based on your understanding of the KSU class schedule and class information. 15 https://owlexpress.kennesaw.edu/prodban/bwckctlg.p_disp_listcrse? term_in=201808&subj_in=IT&crse_in=6713&schd_in=A Implementation of Dimensional Models • Dimensional model can be implemented as relational model (as schemas) in – data warehouse/mart design (strictly following the relational rules) – analytics app (Power BI) data modeling (loosely following the relational logic) • Schema describes how data are organized in a (data) system • Dimensional models can be designed in two types of schemas under the relational data model (logical model) – Star schema – Snowflake schema 16 Star Schema 17 Image from https://www.slideshare.net/Dataversity/ldm-webinar-data-modeling-business-intelligence Star is a metaphor here to visualize how dimensional tables are linked to the fact table. Star Schema in Relational Model • Dimension tables are linked to the fact table through primary key and foreign key pairs. 18 Foreign keys in the fact table. Star Schema Design Example • This is an example conceptual dimensional data model for a typical retail business. It has one fact and four dimensions. 19 Date Sales Data Customer Product Store A fact table holds measure information. Each measure can be described by referring to linked dimensional tables. Data can be aggregated by summing relevant rows. Dimension tables hold dimensions, hierarchies, attributes. Each dimension uses at least one table. Star Schema in the Relational Model • This is the relational model designed based on the conceptual model in the last slide. It has one fact table (in the center) and four dimension tables linked to it. 20 Date DateKey DayofMonth DayofWeek Month Quarter Year Product ProductKey Product Brand Category Subcategory Store StoreKey StoreType City State Region Country Customer CustomerKey AgeGroup Gender Sales DateKey CustomerKey ProductKey StoreKey SalesAmount Tax The fact table represents facts with foreign Keys pointing to each dimension Facts attributes (measures) Each dimension table has a primary key and links to the fact table though foreign keys. Star Schema in SQL Server Database 21 Image from https://learn.microsoft.com/en-us/power-bi/guidance/star-schema Fact Table • The fact table stores all measures. A fact table consists of two types of columns. The foreign keys column allows to join with dimension tables and the measure columns contain the data that is being analyzed. • Each row in the fact table represents the atomic grain value for measurement units (an intersection of all dimensions) – that cannot break down anymore. It is the most basic unit for further aggregation. – For example (refer to the prior schema): Customer A purchased product M in Store 01 on Date X, and spent 99 dollars with 10 dollar tax. • Type of fact tables (based on type of facts) – Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. A typical example is the Transaction fact table: each row represents a single event/transaction at the lowest granularity. – Snapshot: This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts, such as account balance. – http://www.1keydata.com/datawarehousing/fact-table-types.html – Factless fact tables: no measures contained (with an implicit measure of 1, count of rows) http://www.1keydata.com/datawarehousing/factless-fact-table.html 22 Dimension Table • Dimension tables are used to describe dimensions. They contain dimension keys, attributes and value. • Each dimension table has a single column PK. It is best to use the surrogate key as the PK, not natural key – A surrogate key has a unique value assigned to each row in the dimension. It becomes primary key of the dimension table and is used to join the dimension to the associated foreign key field in the fact table • Benefits of surrogate keys: – Protect the DW/BI system from changes in the source system – Allow the DW/BI system to integrate data from multiple source system – Enable developers to add rows to dimensions that do not exist in the source system – Provide the means for tacking changes in dimension – Are efficient in the relational database and analysis services – http://www.disoln.org/2013/11/Surrogate-Key-in-Data-Warehouse-What-When- Why-and-Why-Not.html • The date dimension is exempt from the surrogate key rule; this highly predictable and stable dimension can use a more meaningful primary key. https://www.kimballgroup.com/data-warehouse-business-intelligence- resources/kimball-techniques/dimensional-modeling-techniques/dimension- surrogate-key/ • Usually de-normalized, with multiple hierarchies in it (in star schema). 23 Grain/Granularity • Granularity is defined as the (lowest) level of detail that will be maintained. • The greater the level of detail, the finer the level of granularity. • The level of granularity for each dimension determines the grain for the atomic level of the facts. 24 Date Dimension Table • Date data is often stored as a separate date dimension table (instead of a date column in a fact table) for two reasons: • With a date dimension table, the rules to group dates are specifically defined and can be customized. This is much more flexible than using SQL date functions. – In addition, there are several date attributes that the SQL date functions do not support, like fiscal periods, holidays, seasons, weekdays, weekends, and national events. • It is much easier to select columns from a date table instead of using complex functions to dynamically generate the attribute for reports. 25 Here are two ways to model the date dimension. Examine its attributes. Snowflake Schema • The snowflake schema is a variation of the star schema with – normalization of dimension tables – defined hierarchy • In star schema, one dimension is modeled in only one table, with potential hierarchy embedded. • In snowflake schema, hierarchies are specifically modelled as a set of one-to-many tables. – See next slide 26 Product ProductKey ProductName Category Subcategory Category subcategory ProductName is an embedded hierarchy Snowflaking Product Dimension 27 AdventureWorks Data Warehouse Schema Foreign keys pointing to dimensional tables Role-playing dimensions (ShipDate, OrderDate, DueDate) Defined product hierarchy in a chain of normalized 1 to many tables (snowflaking). Top level hierarchy at the far end. Star vs. Snowflake Snowflake Schema Star Schema Ease of maintenance/change No redundancy and hence easier to maintain and change Has redundant data and hence less easy to maintain/change Ease of Use More complex queries and hence less easy to understand Less complex queries and easy to understand Query Performance More foreign keys-and hence more query execution time Less no. of foreign keys and hence lesser query execution time Type of Data Warehouse Good to use for data warehouse core to simplify complex relationships (many:many) Good for data marts with simple relationships (1:1 or 1:many) Joins: Higher number of Joins Fewer Joins Dimension table It may have more than one dimension table for each dimension Contains only single dimension table for each dimension When to use: When dimension table is relatively big in size, snowflaking is better as it reduces space. When dimension table contains less number of rows, we can go for Star schema. Normalization/ De- Normalization: Dimension Tables are in Normalized form but Fact Table is still in De-Normalized form Both Dimension and Fact Tables are in De-Normalized form Data model Bottom up approach Top down approach 28 http://www.diffen.com/difference/Snowflake_Schema_vs_Star_Schema http://searchbusinessintelligence.techtarget.in/answer/Star-schema-vs-snowflake-schema-Which-is-better http://oracle-online-help.blogspot.com/2006/11/star-vs-snowflake-schema.html Star Schema in Various Products • Star and snowflake schema is not only in databases. • As a general concept and model, it can be implemented in self-service analytics tools like Power BI or Tableau. 29 Table join in Tableau, following a star schema style. Table join in Power BI, with a snowflake schema style. Dimensional vs. Relational Modeling 30 Extended reading from Kimball: https://www.kimballgroup.com/1997/08/a-dimensional-modeling-manifesto/ This comparison table is from: https://learndatamodeling.com/blog/comparison-of- relational-and-dimensional-data-modeling/ Dimensional Modeling Process Identifying the business rules and processes • Business requirements document should contain a list of analysis cases, with the business measures and a detailed list of all dimensions, down to the lowest level of detail for each dimension. Creating fact tables with measures • Translate business measures into fact tables • Analyze source system information for additional measures Identifying the dimensions and hierarchies • Identifying the granularity levels • Create dimension tables (normalized or not) with all columns • Link fact tables to the dimension tables 31 An alternative approach: see Kimball’s four step at: http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball- techniques/dimensional-modeling-techniques/four-4-step-design-process/ Star Schema Design Steps 1. First create all dimension tables – Each dimension table needs to have an identifying column (primary key) 2. Then create a fact table with all measures stored in this table 3. Link dimension tables and the fact table – Starting with one dimension table, make a key column (foreign key) by adding a new one or transforming an existing one, that matches the primary key in a dimension table – Link the dimension table to the fact table through the key pairs. – Repeat this for all dimension tables 32 Modeling and Designing Tools • https://erdplus.com – A simple online modeling tool for star Schemas – This tool meets our needs, so we are using it for some schema design practice. • Others – Visio https://www.youtube.com/watch?v=Z3uTWpuv54U – Visual Studio Architect – http://www.symcorp.com/tech_expertise_design.html – IBM InfoSphere Data Architect http://www- 03.ibm.com/software/products/en/ibminfodataarch – MySQL Workbench https://www.youtube.com/watch?v=7Pwj7nV-oRM 33 Why do we need dimensional model or star schema? • Self-service tools like Excel and Power BI might allow you to get away with building queries and reports out of a big flat table, or even more unstructured. • A dimensional model (at least the concept of facts and dimensions) seeks to divide your data into two big groups, and thus two different types of attributes in your model: facts and dimensions. – facts contain the quantifiable data we’re interested in, – dimensions contain the contextual information that we wish to analyze • The end result for users is a logical and intuitive grouping of values • Having dimensional model in mind, even the model is not specifically built in a tool, will help greatly for the analysis and reporting of data. 34 Star Schema Exercise • Create a star schema for the following example data, for the potential analysis of student enrollment in classes (enrollment pattern, class size, history, trend, etc.) 35 https://owlexpress.kennesaw.edu/prodban/bwckctlg.p_disp_listcrse? term_in=201808&subj_in=IT&crse_in=6713&schd_in=A Essential Resources • The 101 Guide to Dimensional Data Modeling: https://dwbi.org/pages/3/the-101-guide-to-dimensional-data-modeling • Star schema and snowflake schema are two common relational database schema types that implements the dimension model: https://dwbi.org/pages/19/dimensional-modeling-schema • “What are Dimensions and Measures” a very quick, simple, and high- level introduction video by Intricity: http://www.youtube.com/watch?v=qkJOace9FZg • A more technical introduction of dimensional modeling by LeapFrogBI (the following three are the basic ones required in this module) – Facts and dimensions: https://www.youtube.com/watch?v=6k3nwXXpnMY – Types of Keys (some review of keys): https://www.youtube.com/watch?v=_aN-8kszIdA – Snowflake: https://www.youtube.com/watch?v=ZVV56JBPDgA 36 Good Learning Resources • Webinar - Data Modeling & Business Intelligence (and data analytics): Webinar recording https://www.dataversity.net/ldm-webinar-data- modeling-business-intelligence/ and slides here https://www.slideshare.net/Dataversity/ldm-webinar-data-modeling-business-intelligence • Dimensional model and schema – https://dwbi.org/categories/17/dimensional-model – Dataversity data modeling: http://www.dataversity.net/category/data-topics/modeling/ – DWBI.org dimensional model tutorials http://dwbi.org/data-modelling/dimensional-model – https://learndatamodeling.com/blog/data-modeling-topics-index/ – 1KeyData: https://www.1keydata.com/datawarehousing/dimensional.html – http://en.wikipedia.org/wiki/Star_schema and http://en.wikipedia.org/wiki/Snowflake_schema • https://senturus.com/resources/data-modeling-comparison-tableau-cognos-and-power-bi/ • Compare relational model and dimensional model – https://www.kimballgroup.com/1997/08/a-dimensional-modeling-manifesto/ – https://pdfs.semanticscholar.org/f583/bd7fcc5cb027e22c6a94ba37bfda57ba9bb6.pdf • LeapFrog Dimensional Modeling training serials playlist https://www.youtube.com/playlist?list=PLrbIyvYCdg0iAUQoxG5vI_yKqzZ2AcgGe • Kimball DW/BI Resources http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/ – A Dimensional Modeling Manifesto by Ralph Kimball: http://www.kimballgroup.com/1997/08/a-dimensional-modeling-manifesto/ – Kimball Dimensional Modeling Techniques collection: http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball- techniques/dimensional-modeling-techniques/ or http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf – Kimball Design Tips: http://www.kimballgroup.com/category/design-tips/ – Articles: http://www.kimballgroup.com/category/business-intelligence-and-data-warehouse-articles/ – https://www.kimballgroup.com/2009/05/the-10-essential-rules-of-dimensional-modeling/ • Vendor resources – IBM redbook, Dimensional Modeling: In a BI Environment http://www.redbooks.ibm.com/abstracts/sg247138.html – https://www.ibm.com/support/knowledgecenter/SS9UM9_9.1.2/com.ibm.datatools.dimensional.ui.doc/topics/c_ida_dm_container.html – Oracle Data Warehousing Logical Design https://docs.oracle.com/en/database/oracle/oracle-database/23/dwhsg/data-warehouse-logical-design.html 37