Big Data

Dimensional modeling in Amazon Redshift


Amazon Redshift is a fully managed and petabyte-scale cloud data warehouse that is used by tens of thousands of customers to process exabytes of data every day to power their analytics workload. You can structure your data, measure business processes, and get valuable insights quickly can be done by using a dimensional model. Amazon Redshift provides built-in features to accelerate the process of modeling, orchestrating, and reporting from a dimensional model.

In this post, we discuss how to implement a dimensional model, specifically the Kimball methodology. We discuss implementing dimensions and facts within Amazon Redshift. We show how to perform extract, transform, and load (ELT), an integration process focused on getting the raw data from a data lake into a staging layer to perform the modeling. Overall, the post will give you a clear understanding of how to use dimensional modeling in Amazon Redshift.

Solution overview

The following diagram illustrates the solution architecture.

In the following sections, we first discuss and demonstrate the key aspects of the dimensional model. After that, we create a data mart using Amazon Redshift with a dimensional data model including dimension and fact tables. Data is loaded and staged using the COPY command, the data in the dimensions is loaded using the MERGE statement, and facts will be joined to the dimensions where insights are derived from. We schedule the loading of the dimensions and facts using the Amazon Redshift Query Editor V2. Lastly, we use Amazon QuickSight to gain insights on the modeled data in the form of a QuickSight dashboard.

For this solution, we use a sample dataset (normalized) provided by Amazon Redshift for event ticket sales. For this post, we have narrowed down the dataset for simplicity and demonstration purposes. The following tables show examples of the data for ticket sales and venues.

According to the Kimball dimensional modeling methodology, there are four key steps in designing a dimensional model:

  1. Identify the business process.
  2. Declare the grain of your data.
  3. Identify and implement the dimensions.
  4. Identify and implement the facts.

Additionally, we add a fifth step for demonstration purposes, which is to report and analyze business events.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Identify the business process

In simple terms, identifying the business process is identifying a measurable event that generates data within an organization. Usually, companies have some sort of operational source system that generates their data in its raw format. This is a good starting point to identify various sources for a business process.

The business process is then persisted as a data mart in the form of dimensions and facts. Looking at our sample dataset mentioned earlier, we can clearly see the business process is the sales made for a given event.

A common mistake made is using departments of a company as the business process. The data (business process) needs to be integrated across various departments, in this case, marketing can access the sales data. Identifying the correct business process is critical—getting this step wrong can impact the entire data mart (it can cause the grain to be duplicated and incorrect metrics on the final reports).

Declare the grain of your data

Declaring the grain is the act of uniquely identifying a record in your data source. The grain is used in the fact table to accurately measure the data and enable you to roll up further. In our example, this could be a line item in the sales business process.

In our use case, a sale can be uniquely identified by looking at the transaction time when the sale took place; this will be the most atomic level.

Identify and implement the dimensions

Your dimension table describes your fact table and its attributes. When identifying the descriptive context of your business process, you store the text in a separate table, keeping the fact table grain in mind. When joining the dimensions table to the fact table, there should only be a single row associated to the fact table. In our example, we use the following table to be separated into a dimensions table; these fields describe the facts that we will measure.

When designing the structure of the dimensional model (the schema), you can either create a star or snowflake schema. The structure should closely align with the business process; therefore, a star schema is best fit for our example. The following figure shows our Entity Relationship Diagram (ERD).

In the following sections, we detail the steps to implement the dimensions.

Stage the source data

Before we can create and load the dimensions table, we need source data. Therefore, we stage the source data into a staging or temporary table. This is often referred to as the staging layer, which is the raw copy of the source data. To do this in Amazon Redshift, we use the COPY command to load the data from the dimensional-modeling-in-amazon-redshift public S3 bucket located on the us-east-1 Region. Note that the COPY command uses an AWS Identity and Access Management (IAM) role with access to Amazon S3. The role needs to be associated with the cluster. Complete the following steps to stage the source data:

  1. Create the venue source table:
CREATE TABLE public.venue (
    venueid bigint,
    venuename character varying(100),
    venuecity character varying(30),
    venuestate character(2),
    venueseats bigint
) DISTSTYLE AUTO
        SORTKEY
    (venueid);

  1. Load the venue data:
COPY public.venue
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/venue.csv'
IAM_ROLE '<Your IAM role arn>'
DELIMITER ','
REGION 'us-east-1'
IGNOREHEADER 1

  1. Create the sales source table:
CREATE TABLE public.sales (
    salesid integer,
    venueid character varying(256),
    saletime timestamp without time zone,
    qtysold BIGINT,
    commission numeric(18,2),
    pricepaid numeric(18,2)
) DISTSTYLE AUTO;

  1. Load the sales source data:
COPY public.sales
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/sales.csv'
IAM_ROLE '<Your IAM role arn>'
DELIMITER ','
REGION 'us-east-1'
IGNOREHEADER 1

  1. Create the calendar table:
CREATE TABLE public.DimCalendar(
    dateid smallint,
        caldate date,
        day varchar(20),
        week smallint,
        month varchar(20),
        qtr varchar(20),
        year smallint,
        holiday boolean
) DISTSTYLE AUTO
SORTKEY
    (dateid);

  1. Load the calendar data:
COPY public.DimCalendar
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/date.csv'
IAM_ROLE '<Your IAM role arn>'
DELIMITER ',' 
REGION 'us-east-1'
IGNOREHEADER 1

Create the dimensions table

Designing the dimensions table can depend on your business requirement—for example, do you need to track changes to the data over time? There are seven different dimension types. For our example, we use type 1 because we don’t need to track historical changes. For more about type 2, refer to Simplify data loading into Type 2 slowly changing dimensions in Amazon Redshift. The dimensions table will be denormalized with a primary key, surrogate key, and a few added fields to indicate changes to the table. See the following code:

create schema SalesMart;

CREATE TABLE SalesMart.DimVenue( 
    "VenueSkey" int IDENTITY(1,1) primary key
    ,"VenueId" VARCHAR NOT NULL
    ,"VenueName" VARCHAR NULL
    ,"VenueCity" VARCHAR NULL
    ,"VenueState" VARCHAR NULL
    ,"VenueSeats" INT NULL
    ,"InsertedDate" DATETIME NOT NULL
    ,"UpdatedDate" DATETIME NOT NULL
) 
diststyle AUTO;

A few notes on creating the dimensions table creation:

  • The field names are transformed into business-friendly names
  • Our primary key is VenueID, which we use to uniquely identify a venue at which the sale took place
  • Two additional rows will be added, indicating when a record was inserted and updated (to track changes)
  • We are using an AUTO distribution style to give Amazon Redshift the responsibility to choose and adjust the distribution style

Another important factor to consider in dimensional modelling is the usage of surrogate keys. Surrogate keys are artificial keys that are used in dimensional modelling to uniquely identify each record in a dimension table. They are typically generated as a sequential integer, and they don’t have any meaning in the business domain. They offer several benefits, such as ensuring uniqueness and improving performance in joins, because they’re typically smaller than natural keys and as surrogate keys they don’t change over time. This allows us to be consistent and join facts and dimensions more easily.

In Amazon Redshift, surrogate keys are typically created using the IDENTITY keyword. For example, the preceding CREATE statement creates a dimension table with a VenueSkey surrogate key. The VenueSkey column is automatically populated with unique values as new rows are added to the table. This column can then be used to join the venue table to the FactSaleTransactions table.

A few tips for designing surrogate keys:

  • Use a small, fixed-width data type for the surrogate key. This will improve performance and reduce storage space.
  • Use the IDENTITY keyword, or generate the surrogate key using a sequential or GUID value. This will ensure that the surrogate key is unique and can’t be changed.

Load the dim table using MERGE

There are numerous ways to load your dim table. Certain factors need to be considered—for example, performance, data volume, and perhaps SLA loading times. With the MERGE statement, we perform an upsert without needing to specify multiple insert and update commands. You can set up the MERGE statement in a stored procedure to populate the data. You then schedule the stored procedure to run programmatically via the query editor, which we demonstrate later in the post. The following code creates a stored procedure called SalesMart.DimVenueLoad:

CREATE OR REPLACE PROCEDURE SalesMart.DimVenueLoad()
AS $$
BEGIN
MERGE INTO SalesMart.DimVenue USING public.venue as MergeSource
ON SalesMart.DimVenue.VenueId = MergeSource.VenueId
WHEN MATCHED
THEN
UPDATE
SET VenueName = ISNULL(MergeSource.VenueName, 'Unknown')
, VenueCity = ISNULL(MergeSource.VenueCity, 'Unknown')
, VenueState = ISNULL(MergeSource.VenueState, 'Unknown')
, VenueSeats = ISNULL(MergeSource.VenueSeats, -1)
, UpdatedDate = GETDATE()
WHEN NOT MATCHED
THEN
INSERT (
VenueId
, VenueName
, VenueCity
, VenueState
, VenueSeats
, UpdatedDate
, InsertedDate
)
VALUES (
ISNULL(MergeSource.VenueId, -1)
, ISNULL(MergeSource.VenueName, 'Unknown')
, ISNULL(MergeSource.VenueCity, 'Unknown')
, ISNULL(MergeSource.VenueState, 'Unknown')
, ISNULL(MergeSource.VenueSeats, -1)
, ISNULL(GETDATE() , '1900-01-01')
, ISNULL(GETDATE() , '1900-01-01')
);
END;
$$
LANGUAGE plpgsql;

A few notes on the dimension loading:

  • When a record in inserted for the first time, the inserted date and updated date will be populated. When any values change, the data is updated and the updated date reflects the date when it was changed. The inserted date remains.
  • Because the data will be used by business users, we need to replace NULL values, if any, with more business-appropriate values.

Identify and implement the facts

Now that we have declared our grain to be the event of a sale that took place at a specific time, our fact table will store the numeric facts for our business process.

We have identified the following numerical facts to measure:

  • Quantity of tickets sold per sale
  • Commission for the sale

Implementing the Fact

There are three types of fact tables (transaction fact table, periodic snapshot fact table, and accumulating snapshot fact table). Each serves a different view of the business process. For our example, we use a transaction fact table. Complete the following steps:

  1. Create the fact table
CREATE TABLE SalesMart.FactSaleTransactions( 
    CalendarDate date NOT NULL
    ,SaleTransactionTime DATETIME NOT NULL
    ,VenueSkey INT NOT NULL
    ,QuantitySold BIGINT NOT NULL
    ,SaleComission NUMERIC NOT NULL
    ,InsertedDate DATETIME DEFAULT GETDATE()
) diststyle AUTO;

An inserted date with a default value is added, indicating if and when a record was loaded. You can use this when reloading the fact table to remove the already loaded data to avoid duplicates.

Loading the fact table consists of a simple insert statement joining your associated dimensions. We join from the DimVenue table that was created, which describes our facts. It’s best practice but optional to have calendar date dimensions, which allow the end-user to navigate the fact table. Data can either be loaded when there is a new sale, or daily; this is where the inserted date or load date comes in handy.

We load the fact table using a stored procedure and use a date parameter.

  1. Create the stored procedure with the following code. To keep the same data integrity that we applied in the dimension load, we replace NULL values, if any, with more business appropriate values:
create or replace procedure SalesMart.FactSaleTransactionsLoad(loadate datetime)
language plpgsql
as
    $$
begin
--------------------------------------------------------------------
/*** Delete records loaded for the day, should there be any ***/
--------------------------------------------------------------------
Delete from SalesMart.FactSaleTransactions
where cast(InsertedDate as date) = CAST(loadate as date);
RAISE INFO 'Deleted rows for load date: %', loadate;
--------------------------------------------------------------------
/*** Insert records ***/
--------------------------------------------------------------------
INSERT INTO SalesMart.FactSaleTransactions (
CalendarDate    
,SaleTransactionTime    
,VenueSkey  
,QuantitySold  
,Salecomission
)
SELECT DISTINCT
    ISNULL(c.caldate, '1900-01-01') as CalendarDate
    ,ISNULL(a.saletime, '1900-01-01') as SaleTransactionTime
    ,ISNULL(b.VenueSkey, -1) as VenueSkey
    ,ISNULL(a.qtysold, 0) as QuantitySold
    ,ISNULL(a.commission, 0) as SaleComission
FROM
    public.sales as a
 
LEFT JOIN SalesMart.DimVenue as b
on a.venueid = b.venueid
 
LEFT JOIN public.DimCalendar as c
on to_char(a.saletime,'YYYYMMDD') = to_char(c.caldate,'YYYYMMDD');
--Optional filter, should you want to load only the latest data from source
--where cast(a.saletime as date) = cast(loadate as date);
  
end;
$$;

  1. Load the data by calling the procedure with the following command:
call SalesMart.FactSaleTransactionsLoad(getdate())

Schedule the data load

We can now automate the modeling process by scheduling the stored procedures in Amazon Redshift Query Editor V2. Complete the following steps:

  1. We first call the dimension load and after the dimension load runs successfully, the fact load begins:
BEGIN;
----Insert Dim Loads
call SalesMart.DimVenueLoad();

----Insert Fact Loads. They will only run if the DimLoad is successful
call SalesMart.FactSaleTransactionsLoad(getdate());
END;

If the dimension load fails, the fact load will not run. This ensures consistency in the data because we don’t want to load the fact table with outdated dimensions.

  1. To schedule the load, choose Schedule in Query Editor V2.

  1. We schedule the query to run every day at 5:00 AM.
  2. Optionally, you can add failure notifications by enabling Amazon Simple Notification Service (Amazon SNS) notifications.

Report and analysis the data in Amazon Quicksight

QuickSight is a business intelligence service that makes it easy to deliver insights. As a fully managed service, QuickSight lets you easily create and publish interactive dashboards that can then be accessed from any device and embedded into your applications, portals, and websites.

We use our data mart to visually present the facts in the form of a dashboard. To get started and set up QuickSight, refer to Creating a dataset using a database that’s not autodiscovered.

After you create your data source in QuickSight, we join the modeled data (data mart) together based on our surrogate key skey. We use this dataset to visualize the data mart.

Our end dashboard will contain the insights of the data mart and answer critical business questions, such as total commission per venue and dates with the highest sales. The following screenshot shows the final product of the data mart.

Clean up

To avoid incurring future charges, delete any resources you created as part of this post.

Conclusion

We have now successfully implemented a data mart using our DimVenue, DimCalendar, and FactSaleTransactions tables. Our warehouse is not complete; as we can expand the data mart with more facts and implement more marts, and as the business process and requirements grow over time, so will the data warehouse. In this post, we gave an end-to-end view on understanding and implementing dimensional modeling in Amazon Redshift.

Get started with your Amazon Redshift dimensional model today.


About the Authors

Bernard Verster is an experienced cloud engineer with years of exposure in creating scalable and efficient data models, defining data integration strategies, and ensuring data governance and security. He is passionate about using data to drive insights, while aligning with business requirements and objectives.

Abhishek Pan is a WWSO Specialist SA-Analytics working with AWS India Public sector customers. He engages with customers to define data-driven strategy, provide deep dive sessions on analytics use cases, and design scalable and performant analytical applications. He has 12 years of experience and is passionate about databases, analytics, and AI/ML. He is an avid traveler and tries to capture the world through his camera lens.