Amazon Redshift also comes with a few known limitations that stem from the performance factors discussed above. One of them is its handling of constraints. Although you can define primary key, unique, and foreign key constraints on columns, Redshift does not enforce them. These constraints serve only as hints to the query planner; they do not guarantee uniqueness or referential integrity.
Preventing duplicates in analytical tables can be quite tricky when constraints are not enforced. In this article, we will guide you through ways to avoid duplicates in your Redshift tables.
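To see the problem in action, here is a minimal sketch (the events table and its columns are hypothetical, not part of the examples below) showing that a declared primary key does not stop duplicate rows from being inserted:

CREATE TABLE events (
    event_id bigint PRIMARY KEY, -- informational only; Redshift does not enforce it
    payload varchar(256)
);

-- Both inserts succeed, leaving two rows with the same "unique" key
INSERT INTO events VALUES (1, 'first load');
INSERT INTO events VALUES (1, 'accidental reload');

SELECT event_id, COUNT(*) FROM events GROUP BY event_id; -- event_id 1 appears twice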
Mimicking Upsert in Amazon Redshift
AWS is aware of this shortcoming and documents a workaround: a merge operation built on temporary (staging) tables. There are two ways to mimic an UPSERT operation in Redshift:
- Merge By Rows - Delete the matching records in the target table and push all the records from staging.
- Merge By Columns - Delete the matching records in the staging table after updating them in the target table. Push the remaining new records to the target table.
Strategy 1 - Merge By Rows
- Create a staging table containing the incoming records.
- Delete the rows in the target table that match rows in the staging table; this requires only a single scan of the target table.
- Insert the entire contents of the staging table into the target table.
- Drop the staging table to mark completion.
Pitfalls
- Not well suited to large tables, since this method deletes and re-inserts every column of the matching rows, even when only a few columns have changed.
- The merge involves a large delete operation, so a VACUUM must be run on the target table upon completion.
Strategy 2 - Merge By Columns
- Create a staging table containing the incoming records.
- Update the columns of the matching records in the target table.
- Delete the updated records from the staging table.
- Insert the remaining records in the staging table (the genuinely new rows) into the target table.
- Drop the staging table to mark completion.
Pitfalls
- Takes longer to complete than the previous strategy, since it performs an update, a delete, and an insert rather than a single delete and insert.
Examples
Based on the strategies discussed above, let's work through a concrete example. First, create a target table called users and seed it with a few rows:
CREATE TABLE users (
    user_id bigint,
    first_name varchar(100),
    last_name varchar(100),
    address varchar(150),
    city varchar(100),
    created_at timestamp
)
DISTSTYLE AUTO
SORTKEY AUTO;

INSERT INTO users (user_id, first_name, last_name, address, city, created_at)
VALUES
    (1, 'John', 'Doe', '4725 Whiteman Street', 'Manahawkin', '2023-01-01 03:25:15'),
    (2, 'Joanna', 'Nadja', '4942 Douglas Dairy Road', 'Bassett', '2023-01-01 07:26:30'),
    (3, 'Tabatha', 'Deanna', '2792 Hickory Heights Drive', 'Hanover', '2023-01-02 19:11:38'),
    (4, 'Therese', 'Fay', '434 Oak Lane', 'Elmer', '2023-01-02 22:09:25'),
    (5, 'Araminta', 'Hilary', '3781 Zappia Drive', 'Winchester', '2023-01-02 04:05:53'),
    (6, 'Camille', 'Noelene', '992 Rollins Road', 'Republican City', '2023-01-03 11:25:36');

SELECT * FROM users;
Merge By Rows
As the first step, a temporary table (users_staging) is created with the same structure as the target table (users), and the incoming records are inserted into it.
CREATE TEMP TABLE users_staging (LIKE users);

INSERT INTO users_staging (user_id, first_name, last_name, address, city, created_at)
VALUES
    (5, 'Araminta', 'Hilary', '705 Williams Lane', 'Wichita', '2023-01-02 04:05:53'),
    (7, 'George', 'Guzman', '2093 Hillside Street', 'Cerulean City', '2023-01-06 15:15:27');

SELECT * FROM users_staging;
Next, a single transaction deletes the rows in the users table that match the users_staging table and inserts all the records from the users_staging table into the users table.

BEGIN TRANSACTION;

DELETE FROM users
USING users_staging
WHERE users.user_id = users_staging.user_id;

INSERT INTO users
SELECT * FROM users_staging;

END TRANSACTION;

SELECT * FROM users;
Finally, the users_staging table can be dropped, and the users table should be vacuumed to re-sort rows and reclaim space.

DROP TABLE users_staging;
VACUUM users;
Merge By Columns
To validate the merge-by-columns strategy, we once again create a temporary table called users_staging. This time, however, it contains only the column used for matching (user_id) and the columns that have been modified.
CREATE TEMP TABLE users_staging (
    user_id bigint,
    address varchar(150),
    city varchar(100),
    created_at timestamp
);

INSERT INTO users_staging (user_id, address, city, created_at)
VALUES
    (5, '705 Williams Lane', 'Wichita', '2023-01-02 04:05:53'),
    (7, '2093 Hillside Street', 'Chandler', '2023-01-06 15:15:27');

SELECT * FROM users_staging;
Here, user_id 5 has a set of columns that require updating, while user_id 7 is a completely new record. We will run a single transaction with the following steps:
- Apply updates to the users table based on the changes in the users_staging table.
- Delete the updated records from the users_staging table.
- Insert the new records from the users_staging table into the users table.
- Complete the transaction.
BEGIN TRANSACTION;

UPDATE users
SET address = users_staging.address,
    city = users_staging.city,
    created_at = users_staging.created_at
FROM users_staging
WHERE users.user_id = users_staging.user_id;

DELETE FROM users_staging
USING users
WHERE users_staging.user_id = users.user_id;

INSERT INTO users (user_id, address, city, created_at)
SELECT user_id, address, city, created_at
FROM users_staging;

END TRANSACTION;

SELECT * FROM users;
Finally, the users_staging table is dropped, and the users table is vacuumed to reclaim the space.

DROP TABLE users_staging;
VACUUM users;
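Because Redshift will not enforce uniqueness on your behalf, it is worth verifying the result after either merge. Below is a minimal check against the users table from the examples above; any user_id returned by this query ended up with more than one row:

-- List user_ids that appear more than once after the merge
SELECT user_id, COUNT(*) AS row_count
FROM users
GROUP BY user_id
HAVING COUNT(*) > 1;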