Optimizing for Star Schemas on Amazon Redshift




Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift offers you fast query performance when analyzing virtually any size dataset using the same business intelligence applications you use today. Amazon Redshift uses many techniques to achieve fast query performance at scale, including multi-node parallel operations, hardware optimization, network optimization, and data compression. At the same time, Amazon Redshift minimizes operational overhead by freeing you from the hassle associated with provisioning, patching, backing up, restoring, monitoring, securing, and scaling a data warehouse cluster. For a detailed architecture overview of the Amazon Redshift service and optimization techniques, see the Amazon Redshift system overview.


In a star schema, a central fact table is joined to a set of denormalized dimension tables, while a snowflake schema further normalizes the dimensions into related tables. For example, a product dimension may have the brand in a separate table. For more information, see star schema and snowflake schema.
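As a rough sketch (the table and column names here are hypothetical), the following DDL contrasts a denormalized star-schema product dimension with a snowflaked version that moves the brand into its own table:

    -- Star schema: the product dimension is denormalized and stores the brand name directly.
    CREATE TABLE product_dim_star (
        product_key   INTEGER NOT NULL,
        product_name  VARCHAR(100),
        brand_name    VARCHAR(100)
    );

    -- Snowflake schema: the brand is normalized into its own table and referenced by key.
    CREATE TABLE brand_dim (
        brand_key   INTEGER NOT NULL,
        brand_name  VARCHAR(100)
    );

    CREATE TABLE product_dim_snowflake (
        product_key   INTEGER NOT NULL,
        product_name  VARCHAR(100),
        brand_key     INTEGER NOT NULL
    );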



Optimizations for Star Schemas

Amazon Redshift automatically detects star schema data structures and has built-in optimizations for efficiently querying this data. You also have a number of optimization options under your control that affect query performance whether you are using a star schema or another data model. The following sections explain how to apply these optimizations in the context of a star schema.


Distribution Styles


A good distribution key distributes data relatively evenly across nodes while also collocating joined data to improve cluster performance. If you are using a star schema, follow these distribution key guidelines to optimize Amazon Redshift performance across nodes:

  • Define a distribution key for your fact table as the foreign key to one of the larger dimension tables.
  • Identify frequently joined dimension tables with slowly changing data that are not joined with the fact table on a common distribution key. These are good candidates for the distribution style of ALL.
  • Choose the primary (or surrogate) key as the distribution key for remaining dimension tables.
For example, suppose the SalesOrderLineItemFact table has a distribution key of customer_key and the Customer dimension table has a distribution key of customer_id. Rows where customer_key equals customer_id are then stored on the same slice of a node, so joins between the two tables require no data redistribution.
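A minimal DDL sketch of that setup follows; the column lists are hypothetical and abbreviated, and only the distribution keys reflect the guidance above:

    -- The large Customer dimension is distributed on its primary key.
    CREATE TABLE Customer (
        customer_id    INTEGER NOT NULL,
        customer_name  VARCHAR(100),
        city           VARCHAR(100)
    )
    DISTKEY (customer_id);

    -- The fact table is distributed on the foreign key to that dimension, so
    -- matching rows land on the same slice and the join avoids redistribution.
    CREATE TABLE SalesOrderLineItemFact (
        order_key     INTEGER NOT NULL,
        customer_key  INTEGER NOT NULL,
        date_key      INTEGER NOT NULL,
        quantity      INTEGER,
        amount        DECIMAL(18,2)
    )
    DISTKEY (customer_key);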




Typically, a star schema has a large fact table and numerous, comparatively smaller dimension tables. You can follow these steps to choose distribution styles that optimize table joins:

  1. Identify large dimension tables used in joins.
  2. Choose the large dimension that is most frequently joined as indicated by your query patterns.
  3. Follow the guidelines in the preceding section to make sure that the foreign key to the dimension in the fact table gives relatively even distribution across your facts and that the primary/surrogate key for the dimension table is also relatively evenly distributed. If distribution is not relatively even, choose a different dimension.
  4. Use the foreign key to the identified dimension as your fact table distribution key, and use the primary key in the dimension table as your dimension table distribution key.
  5. Identify dimension tables that are suitable for a distribution style of ALL, as described earlier in this article (a short example follows this list).
  6. For the remainder of your dimension tables, define the primary key as the distribution key. The primary key does not typically exhibit significant data skew. If it does, then don't define a distribution key.
  7. Test your common queries and use the EXPLAIN command for any queries that need further tuning. For information about the EXPLAIN command, see the EXPLAIN command documentation.
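Step 5 refers to the ALL distribution style. As a hedged example (the date_dim table and its columns are hypothetical), a small, slowly changing dimension that is frequently joined but not on the common distribution key can be declared like this:

    -- DISTSTYLE ALL places a full copy of the table on every node, so joins
    -- against it never require data redistribution. Reserve it for small,
    -- slowly changing dimensions, because every copy consumes storage and
    -- every load must write to all nodes.
    CREATE TABLE date_dim (
        date_key       INTEGER NOT NULL,
        calendar_date  DATE,
        fiscal_year    INTEGER
    )
    DISTSTYLE ALL;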

Uneven Distribution

Unevenly distributed data resulting from a poor distribution key choice is a common cause of poor query performance. Double-check the distribution of the data for your large tables.
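One way to double-check is to query the SVV_TABLE_INFO system view, as in the sketch below. The skew_rows column reports the ratio of rows on the fullest slice to rows on the emptiest slice; values close to 1 indicate even distribution.

    -- Review the most skewed tables first.
    SELECT "table", diststyle, tbl_rows, skew_rows
    FROM svv_table_info
    ORDER BY skew_rows DESC;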

EXPLAIN Command

The EXPLAIN command helps you determine whether data movement during a query is an issue. It shows each step in the query plan along with its relative cost. The cost has no meaning in terms of actual elapsed time; it only indicates how expensive a step is relative to the other steps in the plan. If data movement is an issue, the plan will show a join attribute with a high cost, such as DS_BCAST_INNER or DS_DIST_BOTH.
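As a sketch against the hypothetical tables used earlier, you might run something like the following and inspect the join step in the resulting plan:

    -- EXPLAIN shows the query plan without running the query. In the join step,
    -- DS_DIST_NONE or DS_DIST_ALL_NONE indicates no redistribution is needed,
    -- while DS_BCAST_INNER or DS_DIST_BOTH with a high cost signals heavy data
    -- movement and a candidate for revisiting distribution keys.
    EXPLAIN
    SELECT c.customer_name, SUM(f.amount) AS total_amount
    FROM SalesOrderLineItemFact f
    JOIN Customer c ON f.customer_key = c.customer_id
    GROUP BY c.customer_name;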




