My database career took off when I was sent on the two SQL Server 6.5 training courses. As a developer and later as a DBA I stayed with SQL Server until 2014.
Since then I have been exposed to a variety of database platforms and technologies. I have reached the conclusion that, for me, it is hardest to retain new knowledge when the technology is familiar. It is not the differences that trip me up, it is the similarities. I should like to show you what I mean by giving you some details of a project I worked on where the chosen database was an AWS RDS instance of PostgreSQL.
Background for the project
The project was to present a website visitor with a view of products based on a data science model of their propensity to buy. Those products that the model showed they were likely to buy were emphasised in some way. That could affect their position on the screen or the size of the tile on which they were presented. The goal was to make sure the website visitor didn't miss the product they were looking for. Like I miss things in the house that turn out to be right where I was told they were.
From a technical perspective, what we knew was as follows:
- The data science model represented approximately 100 million records and rising.
- The model had a high degree of change, so a refresh could represent up to 60% change. These would be either inserts or updates.
- The model was produced by a regular batch process.
- The model was represented as a simple tabular structure.
- The personalisation system, like the website, would have to be available 24/7/365.
Quite early on we had worked out that we had to control the size of the model. The cost of producing, ingesting and processing a data set of 100+ million records, and growing, was going to be prohibitive. We also found that we would need to split the model file into a set of smaller files to prevent a constipated ingestion pipe. We did look to see whether Amazon's DynamoDB would fit the bill (pun intended), but the characteristics described above ruled it out quite quickly.
Solution design
The design that emerged from these requirements is shown in the diagram below.
As model files became available, their fully qualified pathname would be stored in the ModelFileHopper table. Everything else would be handled by a batch process.
At the start of the batch process the first check would be to see whether there were any files still remaining in the ModelWIP table. We abort the process if this is the case; this allows for the overall process running longer than its repeating schedule would allow. If the ModelWIP table is empty then every model file name is moved from ModelFileHopper to the ModelWIP table.
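A minimal sketch of that hand-over in PostgreSQL, assuming a single model_file_path column (the real tables carried more metadata than this):

-- Move every pending file name from the hopper to the work-in-progress table
-- in a single atomic statement.
WITH moved AS (
    DELETE FROM Propensity.ModelFileHopper
    RETURNING model_file_path
)
INSERT INTO Propensity.ModelWIP (model_file_path)
SELECT model_file_path
FROM   moved;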
The ModelWIP table contains the list of files we wish to bulk insert into the Model_Load table. Regardless of whether there are 1 or 100 file names listed in ModelWIP, we bulk load them all into Model_Load.
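As the database was an AWS RDS PostgreSQL instance, one way to perform the bulk load is the aws_s3 extension; the bucket name, object key, region and CSV options below are placeholders rather than the project's real values:

-- Bulk load one model file from S3 into the load table.
-- The empty column list ('') means "all columns, in table order".
SELECT aws_s3.table_import_from_s3(
    'Propensity.Model_Load',
    '',
    '(FORMAT csv, HEADER true)',
    aws_commons.create_s3_uri('example-model-bucket', 'model/part_0001.csv', 'eu-west-1')
);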
Having populated Model_Load, where its records represent an update to records already in Model_Stage we perform that update. We do this before inserting new records because we do not want to evaluate records we know to be new as part of the update. Once both the update and the insert have completed, Model_Stage represents what we wish Model_Live to be. We do the equivalent of sp_rename to swap Model_Stage with Model_Live.
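In PostgreSQL the equivalent of that sp_rename swap is a pair of ALTER TABLE ... RENAME statements; wrapping them in a single transaction means readers only ever see a complete table. A sketch, with Model_Old as an illustrative holding name:

BEGIN;
ALTER TABLE Propensity.Model_Live  RENAME TO Model_Old;
ALTER TABLE Propensity.Model_Stage RENAME TO Model_Live;
ALTER TABLE Propensity.Model_Old   RENAME TO Model_Stage;
COMMIT;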
Model_Live is now the most up-to-date model and Model_Stage is stale so we run the Update/Insert again to make sure they are the same. Only after this has completed successfully do we truncate ModelWIP so the process is ready to receive new files.
Security considerations
The service that reads Model_Live does not need access to any of the preceding tables. Only the batch process needs the rights to insert, update, truncate and rename tables.
Permissions should be granted to a role and not an individual account. If an account is ever compromised we can replace it and add the new account to the role without worrying about migrating permissions.
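A sketch of how the grants might look in PostgreSQL; the role and account names are illustrative, not the project's real ones:

-- Read-only role for the service that serves Model_Live to the website.
CREATE ROLE model_reader;
GRANT USAGE  ON SCHEMA Propensity     TO model_reader;
GRANT SELECT ON Propensity.Model_Live TO model_reader;

-- Role for the batch process that loads and reshapes the model tables.
-- Note that renaming a table requires ownership in PostgreSQL, so the model
-- tables should be owned by this role (or a role it is a member of).
CREATE ROLE model_loader;
GRANT USAGE ON SCHEMA Propensity TO model_loader;
GRANT SELECT, INSERT, UPDATE, DELETE, TRUNCATE
    ON Propensity.ModelFileHopper, Propensity.ModelWIP,
       Propensity.Model_Load, Propensity.Model_Stage
    TO model_loader;

-- Individual login accounts are simply added to, or removed from, the roles.
GRANT model_reader TO website_service_account;
GRANT model_loader TO batch_service_account;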
Load considerations
Experience with SQL Server suggests that we probably don't want to run a massive update and a massive insert to bring Model_Stage up to date. We would want to break these actions up into smaller chunks. Through experimentation we found that the best batch size was 250,000 records.
Ease of development considerations
The process fits an ELT pattern. With that in mind the development team asked whether the various steps could be represented as stored procedures, ideally marshalled by a single stored procedure calling all the others. From their perspective they wanted to put the names of the files to be processed into the ModelFileHopper table and let a scheduled batch process take it from there.
Where the similarities between SQL Server and PostgreSQL tripped us up
These are a few areas that were similar between the SQL Server and PostgreSQL platforms.
Database roles
Both SQL Server and PostgreSQL have the concept of user defined roles. You grant permissions on objects to these roles in much the same way in both products. However, we missed a crucial difference between the way the two treat database roles.
We created the equivalent of a db_datareader role in PostgreSQL for a data science database application. We discovered we were blocked from tearing down and rebuilding the database because our role was active in another database (see the sketch after the list below).
- In SQL Server a database role exists within the database
- In PostgreSQL a database role is at the PostgreSQL instance level
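A quick way to see this in practice; the role name and the target role for the reassignment are illustrative:

-- A role is visible from every database in the cluster, not just the one
-- in which you happened to create it.
SELECT rolname FROM pg_roles WHERE rolname = 'data_reader';

-- Dropping it fails while it still owns objects or holds privileges elsewhere:
--   ERROR:  role "data_reader" cannot be dropped because some objects depend on it
-- The clean-up has to be run in each database that references the role.
REASSIGN OWNED BY data_reader TO postgres;
DROP OWNED BY data_reader;
DROP ROLE data_reader;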
Stored Procedure permissions
Both SQL Server and PostgreSQL have stored procedures. These allow us to encapsulate complex set-based operations as a securable unit of code.
In SQL Server you don't need to grant permissions on the underlying tables or views when access is through a stored procedure, thanks to ownership chaining. We can demonstrate this by setting up an experiment with a low privilege user.
USE master
GO
-- Create a login for our experiment
CREATE LOGIN Binky WITH
    PASSWORD = 'Binky',
    DEFAULT_DATABASE = Master,
    CHECK_POLICY = OFF,
    CHECK_EXPIRATION = OFF;

/* Switch to a copy of Adventureworks 2014 */
USE AW2014
GO
CREATE ROLE proc_execute AUTHORIZATION dbo;

-- Create our low privilege user, Binky, and make them a member of our proc_execute role
CREATE USER Binky FROM LOGIN Binky;
EXECUTE sp_addrolemember 'proc_execute', 'Binky';
GO
-- Create a simple procedure that we want Binky to execute
CREATE PROCEDURE Person.ListContactType
AS
SET NOCOUNT ON
SELECT ContactTypeId, Name
FROM Person.ContactType
GO
-- As Binky is a member of our proc_execute role they will gain the ability
-- to execute our Person.ListContactType procedure
GRANT EXECUTE ON Person.ListContactType TO proc_execute;
Binky can run Person.ListContactType but if they try to perform the SELECT directly then they get an error message:
Msg 229, Level 14, State 5, Line 24
The SELECT permission was denied on the object 'ContactType', database 'AW2014', schema 'Person'.
In PostgreSQL this is not the case. By default a stored procedure runs with the permissions of the person calling it, so the caller needs rights on the underlying objects. You have two options to resolve this.
- Grant permissions to the underlying tables/views. This negates a major security advantage of stored procedures
- Explicitly create the stored procedure with the SECURITY DEFINER clause. The stored procedure executes with the permissions of the creator/owner of the procedure.
It isn't much work to specify SECURITY DEFINER stored procedures (and functions); however, you do need to follow certain practices to make sure you don't accidentally create a security hole. The article Abusing SECURITY DEFINER functions explains the problem and the steps you can take to protect yourself.
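For completeness, a sketch of what a PostgreSQL version of the earlier ListContactType example might look like, assuming an equivalent Person.ContactType table exists. It has to be a function rather than a procedure because it returns rows, and pinning search_path in the definition is one of the protections the article above recommends:

CREATE OR REPLACE FUNCTION Person.ListContactType()
RETURNS TABLE (ContactTypeId integer, Name text)
LANGUAGE sql
SECURITY DEFINER
-- Pin the search path so a caller cannot substitute objects of their own
SET search_path = Person, pg_temp
AS $$
    SELECT c.ContactTypeId, c.Name
    FROM   Person.ContactType AS c;
$$;

-- Only EXECUTE is granted; the caller never needs SELECT on the table itself.
REVOKE ALL ON FUNCTION Person.ListContactType() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION Person.ListContactType() TO proc_execute;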
Default schemas and objects
We qualify objects with their schema in SQL Server. You should continue to do this in PostgreSQL, too.
The Abusing SECURITY DEFINER functions article describes the SET search_path command. You can tell PostgreSQL the precedence in which database schemas should be searched for a given object when no schema is specified, and you can list multiple schemas in the command. You can specify this at four levels, as sketched after the list below.
- Database level
- Role level
- User level
- In stored procedures and functions
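A sketch of the four levels; the database, role and schema names here are illustrative:

-- Database level: applies to every session connecting to this database
ALTER DATABASE propensity_db SET search_path = propensity, public;

-- Role level: applies wherever the role connects
ALTER ROLE batch_loader SET search_path = propensity, public;

-- User level: a user is simply a role that can log in, and the setting
-- can also be scoped to a single database
ALTER ROLE binky IN DATABASE propensity_db SET search_path = person, public;

-- Stored procedure / function level
CREATE OR REPLACE PROCEDURE propensity.some_step()
LANGUAGE plpgsql
SET search_path = propensity, pg_temp
AS $$
BEGIN
    -- unqualified object names in here resolve against propensity first
    RAISE NOTICE 'search_path is %', current_setting('search_path');
END;
$$;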
You cannot alter SQL Server's precedence. When SQL Server receives a call to an unqualified object it will follow this precedence:
- Try and use the object in the user's default schema if it exists
- Try and use the object in the dbo schema if it exists
This means we can get an unpleasant surprise in the results if objects in different schemas share a name. We can demonstrate this with the script below.
USE AW2014
GO
CREATE PROC dbo.SimplePrint
AS
SET NOCOUNT ON
SELECT 'This is the dbo schema'
GO
CREATE PROC Person.SimplePrint
AS
SET NOCOUNT ON
SELECT 'This is the Person schema'
GO
GRANT EXECUTE ON dbo.SimplePrint TO proc_execute;
GRANT EXECUTE ON Person.SimplePrint TO proc_execute;

DROP USER Binky;
CREATE USER Binky WITH DEFAULT_SCHEMA = Person;
EXEC sp_addrolemember 'proc_execute', 'Binky';

EXEC SimplePrint
EXEC Person.SimplePrint
When we execute the stored procedures as the database owner, the calls give the following:
This is the dbo schema
This is the Person schema
When Binky executes the same stored procedures they get the following:
This is the Person schema
This is the Person schema
If we were to drop Person.SimplePrint then, when Binky ran EXEC SimplePrint, they would get the 'This is the dbo schema' version.
Stored procedures and transactions
In our solution design we update and insert the Model_Stage table from Model_Load.
- Update first because there are fewer records to update
- Insert last
The number of inserts and updates is quite large, so we want to carry out those operations in batches. In SQL Server our stored procedure for inserting records would be akin to the following:
CREATE PROC Propensity.Populate_Model_Stage
    @BatchSize INT = 50000
AS
SET NOCOUNT ON
DECLARE @InsertedRecords INT = 1,
        @TotalRecords INT = 0

WHILE @InsertedRecords > 0
    BEGIN
        INSERT INTO Propensity.Model_Stage({Column names})
        SELECT TOP (@BatchSize)
            SRC.{Column names}
        FROM Propensity.Model_Load AS SRC
            LEFT JOIN Propensity.Model_Stage AS DEST
                ON SRC.model_id = DEST.model_id
        WHERE DEST.model_id IS NULL

        SET @InsertedRecords = @@ROWCOUNT
        SET @TotalRecords = @TotalRecords + @InsertedRecords

        RAISERROR('Total records inserted = %i', 10, 1, @TotalRecords) WITH NOWAIT
    END
GO
By looking at the SQL Server transaction log we can see that there is a separate transaction for each batch-sized chunk of records.
CHECKPOINT
EXEC Propensity.Populate_Model_Stage 50000

SELECT  [Current LSN],
        [Previous Page LSN],
        [Operation],
        [Transaction ID],
        AllocUnitName
FROM sys.fn_dblog(NULL, NULL)
The first time we ran the equivalent stored procedure in PostgreSQL it took far longer to run and produced no output until the entire procedure had finished. A PostgreSQL stored procedure automatically opens a transaction when it is called. Without an explicit COMMIT after each loop iteration, everything in the stored procedure is wrapped up in a single transaction.
The equivalent PostgreSQL stored procedure is as shown below.
CREATE OR REPLACE PROCEDURE Propensity.Populate_Model_Stage(batch_size integer DEFAULT 50000)
LANGUAGE plpgsql
SECURITY DEFINER
AS $$
DECLARE
    inserted_records INTEGER := 1;
BEGIN
    SET search_path = 'Propensity';
    LOOP
        IF inserted_records > 0 THEN
            INSERT INTO Propensity.Model_Stage({Column Names})
            SELECT SRC.{Column Names}
            FROM Propensity.Model_Load AS SRC
                LEFT JOIN Propensity.Model_Stage AS DEST
                    ON SRC.model_id = DEST.model_id
            WHERE DEST.model_id IS NULL
            LIMIT batch_size;

            GET DIAGNOSTICS inserted_records := ROW_COUNT;

            RAISE WARNING '% BATCH INSERT: Propensity.Model_Stage: Records inserted % on batch size %',
                clock_timestamp(), inserted_records, batch_size;

            COMMIT;

            IF inserted_records = 0 THEN
                EXIT;
            END IF;
        END IF;
    END LOOP;
END;
$$;
As a result of this stored procedure transaction behaviour our team found it was not possible to have a main calling procedure call all of the others. To overcome this they wrote a small scheduled task to call the stored procedures in the desired order, as sketched below.
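The task itself amounts to a sequence of CALL statements run as top-level commands (for example via psql from the scheduler), so each procedure is free to commit its own batches. Apart from Populate_Model_Stage, the procedure names here are illustrative:

CALL Propensity.Stage_Model_Files();            -- ModelFileHopper -> ModelWIP -> Model_Load
CALL Propensity.Update_Model_Stage(250000);     -- apply updates in batches
CALL Propensity.Populate_Model_Stage(250000);   -- insert new records in batches
CALL Propensity.Swap_Stage_And_Live();          -- rename swap of Model_Stage and Model_Live
CALL Propensity.Resync_Model_Stage(250000);     -- bring the new Model_Stage back in line
CALL Propensity.Truncate_ModelWIP();            -- ready for the next set of files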
Stored procedure language
In SQL Server, SQL is SQL. In PostgreSQL an equivalent stored procedure can be written in SQL or plpgsql (see the sketch after this list).
- SQL for simple DML
- plpgsql for procedural processes
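For example, a step with no variables or control flow, such as emptying the hopper table, could be a plain SQL procedure; the procedure name here is illustrative. The batch insert shown earlier, with its loop and GET DIAGNOSTICS, has to be plpgsql.

-- Simple DML: no variables, no branching, so LANGUAGE sql is enough
CREATE OR REPLACE PROCEDURE Propensity.Clear_ModelFileHopper()
LANGUAGE sql
AS $$
    DELETE FROM Propensity.ModelFileHopper;
$$;

CALL Propensity.Clear_ModelFileHopper();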
Closing thoughts
The examples given show where a database concept is implemented differently in two different database platforms. I just happened to pick SQL Server and PostgreSQL. Similar challenges would be faced if context switching from Google BigQuery to SnowflakeDB, PostgreSQL to MySQL or most other pairs of databases.
I found features in each database platform that I wish were available in the other. That is the nature of products in a competitive marketplace.