Optimized for Star Schemas on Amazon Redshift
Optimizations for Star Schemas
Amazon Redshift automatically detects star schema data structures and has built-in optimizations for efficiently querying this data. You also have a number of optimization options under your control that affect query performance whether you are using a star schema or another data model. The following sections explain how to apply these optimizations in the context of a star schema.
Distribution Styles
A good distribution key distributes data relatively evenly across nodes while also collocating joined data to improve cluster performance. If you are using a star schema, follow these distribution key guidelines to optimize Amazon Redshift performance across nodes:
- Define a distribution key for your fact table as the foreign key to one of the larger dimension tables.
- Identify frequently joined dimension tables with slowly changing data that are not joined with the fact table on a common distribution key. These are good candidates for the distribution style of ALL.
- Choose the primary (or surrogate) key as the distribution key for remaining dimension tables.
Typically a star schema has a large fact table and numerous, comparatively smaller dimension tables. You can follow these steps to optimize the distribution choices for optimized table joins:
- Identify large dimension tables used in joins.
- Choose the large dimension that is most frequently joined as indicated by your query patterns.
- Follow the guidelines in the preceding section to make sure that the foreign key to the dimension in the fact table gives relatively even distribution across your facts and that the primary/surrogate key for the dimension table is also relatively evenly distributed. If distribution is not relatively even, choose a different dimension.
- Use the foreign key to the identified dimension as your fact table distribution key, and use the primary key in the dimension table as your dimension table distribution key.
- Identify dimension tables that are suitable for a distribution style of ALL as described earlier in this article.
- For the remainder of your dimension tables, define the primary key as the distribution key. The primary key does not typically exhibit significant data skew. If it does, then don't define a distribution key.
- Test your common queries and use the EXPLAIN command for any queries that need further tuning. For information about the EXPLAIN command, see the EXPLAIN command documentation.
Uneven Distribution
Unevenly distributed data resulting from a poor distribution key is a common problem with query performance. Double-check the distribution of the data for your large tables.
EXPLAIN Command
The EXPLAIN command will help you determine if data movement during the query is an issue. The EXPLAIN command shows steps and the relative costs of each step in the query plan. The cost has no meaning in terms of actual query time taken, but it shows the relative execution time of a particular step in a query plan compared to other steps in the plan. If data movement is an issue, the EXPLAIN plan will expose a join attribute, such as DS_BCAST_INNER or DS_DIST_BOTH, with a high cost.
Comments
Post a Comment