I recently worked on a fraud detection project. The project needed to relate 67 million accounts to one another "Kevin Bacon" style, meaning each unique account may tie to another account on SSN, email, home phone, business phone, or any other PII-type attribute.
The goal is to catch accounts that may be fraudulent by relating the data together in levels of separation from a point of entry, such as a single email address.
The existing process took more than 3 days to execute, relating each account to one another, but it didn't provide levels of separation and was somewhat inaccurate. To top it off, it also only included a small subset of accounts (4.7 million) from a single region.
The new process needed to relate all 67 million accounts, be dynamic, and include the levels of separation.
I was able to get the initial matching process down to 5 hours for all regions, down from still unfinished after 3 days for a single region. After quite a bit of tuning, the dynamic matching was taking 15 seconds, which still didn't feel good enough. It needed to be faster.
How we go from days to seconds
I thought I’d do an experiment with Hekaton (Memory Optimized Tables) to see if I could speed up this process. Large groups with many levels of separation were taking minutes to match with the largest taking more than 30 minutes. Could Hekaton make this process faster?
It's just for OLTP, right? It's not useful for reporting. Or is it?
Experimenting paid off in a big way! The memory optimized table was able to gobble up the code and spit out a result in sub-second times!! My report query that took 30 minutes now only took 3 seconds.
The Demo setup
I’ve recreated my own demo scenario so the client’s code is not revealed. In this demo I created two tables: Customer & CustomerMO.
CREATE TABLE Customer
( accountnum varchar(17) PRIMARY KEY CLUSTERED,
companyname varchar(50),
Fname varchar(20),
lname varchar(20),
SSN char(9),
Homephone varchar(15),
Businessphone varchar(15),
Mobilephone varchar(15),
EmailAddress varchar(100)
)
Creating a memory optimized table is quite different from creating a traditional table, and there are quite a few limitations.
First, you need a filegroup that supports memory optimized data:
--Add MEMORY_OPTIMIZED_DATA filegroup to the database.
ALTER DATABASE EpicFail
ADD FILEGROUP fgMO CONTAINS MEMORY_OPTIMIZED_DATA

--Add file to the MEMORY_OPTIMIZED_DATA filegroup.
ALTER DATABASE EpicFail
ADD FILE
( NAME = moFile,
FILENAME = N'C:\Data\moFile.mdf')
TO FILEGROUP fgMO
Next we create the table
--Create memory optimized table and indexes on the memory optimized table.
CREATE TABLE CustomerMO
( accountnum varchar(17) PRIMARY KEY NONCLUSTERED,
companyname varchar(50),
Fname varchar(20),
lname varchar(20),
SSN char(9),
Homephone varchar(15),
Businessphone varchar(15),
Mobilephone varchar(15),
EmailAddress varchar(100),
INDEX ix_moh_CustomerMo HASH (accountnum) WITH (BUCKET_COUNT = 5000000)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
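As a side note, once data is loaded you can sanity check a BUCKET_COUNT choice with the sys.dm_db_xtp_hash_index_stats DMV, which reports bucket usage and chain lengths. Something like this (my own quick check, not part of the original process):
--Sanity check the hash index: plenty of empty buckets and short chains means the BUCKET_COUNT is comfortable.
SELECT OBJECT_NAME(hs.object_id) AS table_name,
       i.name AS index_name,
       hs.total_bucket_count,
       hs.empty_bucket_count,
       hs.avg_chain_length,
       hs.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats AS hs
JOIN sys.indexes AS i
    ON i.object_id = hs.object_id AND i.index_id = hs.index_id
WHERE hs.object_id = OBJECT_ID('dbo.CustomerMO')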
Next I used dbForge Data Generator to load 1,000,000 realistic rows of data. I attempted 5 million at first, but my Surface, with only 8 GB of RAM, couldn't handle a memory optimized table that large.
The data was loaded into Customer and then directly copied into CustomerMO to ensure the data was the same for fairness.
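The copy itself was nothing fancy; a plain INSERT ... SELECT along these lines does the job:
--Copy the generated rows so both tables hold identical data.
INSERT INTO CustomerMO (accountnum, companyname, Fname, lname, SSN, Homephone, Businessphone, Mobilephone, EmailAddress)
SELECT accountnum, companyname, Fname, lname, SSN, Homephone, Businessphone, Mobilephone, EmailAddress
FROM Customer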
The comparison
A quick comparison shows some interesting results.
- The emailaddress non-clustered index was created on "emailaddress" only. Note that "select *" results in a key lookup for the traditional table but not for the memory optimized one.
- The memory optimized table is blazingly fast by comparison.
Next we add accountnum to the query and the performance gap is even wider.
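For reference, my comparison looked roughly like the following. The index name and the accountnum literal are placeholders of mine, and I'm assuming CustomerMO was declared with a corresponding nonclustered index on EmailAddress inline at creation time (memory optimized index entries point straight at the full in-memory row, which is why there's no key lookup).
--Non-clustered index on the traditional table, keyed on emailaddress only.
CREATE NONCLUSTERED INDEX ix_Customer_EmailAddress ON Customer (EmailAddress)

--"select *" needs a key lookup on the disk-based table but not on the memory optimized one.
SELECT * FROM Customer WHERE EmailAddress = 'Soares848@example.com'
SELECT * FROM CustomerMO WHERE EmailAddress = 'Soares848@example.com'

--Adding accountnum (my reading: to the predicate) widens the gap further; the value is a placeholder.
SELECT * FROM Customer WHERE EmailAddress = 'Soares848@example.com' AND accountnum = '12345678901234567'
SELECT * FROM CustomerMO WHERE EmailAddress = 'Soares848@example.com' AND accountnum = '12345678901234567'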
For the matching process itself, each entry point must be matched with rows, and those rows matched with more rows, until nothing new is matched. Again, matching is based on PII data; so, for this example, we start by loading all records with an email address of 'Soares848@example.com' into #accts.
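In my recreation, that entry-point load is just a seed insert; the #accts shape (accountnum, matchlevel, matchedon) is my assumption for the demo, not the client's actual structure.
--Working table that accumulates matched accounts, their level of separation, and what they matched on.
CREATE TABLE #accts
( accountnum varchar(17) PRIMARY KEY,
matchlevel int,
matchedon varchar(50)
)

--Seed it with the entry point: every account using this email address sits at level 0.
INSERT INTO #accts (accountnum, matchlevel, matchedon)
SELECT accountnum, 0, 'entry point'
FROM CustomerMO
WHERE EmailAddress = 'Soares848@example.com'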
Next we loop over queries that join the accounts based on homephone = homephone, homephone = businessphone, businessphone = homephone, email address = email address, and SSN = SSN. Looping is an expensive process, but here it is a must: windowing functions are too limited to show what was matched, on what, and at which level; remember that we're matching on many attributes at each level.
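Below is a minimal sketch of that loop against the assumed #accts shape above. To keep it short it only shows two of the match rules; the real process adds homephone = businessphone, businessphone = homephone, and email address = email address in the same pattern.
--Level-by-level matching loop (demo sketch). Each pass matches the accounts found at the current level
--against CustomerMO and records the new accounts one level deeper, tagged with the rule that matched them.
--The loop stops when a pass adds nothing new.
DECLARE @level int = 0
DECLARE @added int = 1

WHILE @added > 0
BEGIN
    SET @added = 0

    --Rule: homephone = homephone
    INSERT INTO #accts (accountnum, matchlevel, matchedon)
    SELECT DISTINCT c.accountnum, @level + 1, 'homephone = homephone'
    FROM #accts a
    JOIN CustomerMO s ON s.accountnum = a.accountnum
    JOIN CustomerMO c ON c.Homephone = s.Homephone
    WHERE a.matchlevel = @level
      AND NOT EXISTS (SELECT 1 FROM #accts x WHERE x.accountnum = c.accountnum)
    SET @added = @added + @@ROWCOUNT

    --Rule: SSN = SSN
    INSERT INTO #accts (accountnum, matchlevel, matchedon)
    SELECT DISTINCT c.accountnum, @level + 1, 'SSN = SSN'
    FROM #accts a
    JOIN CustomerMO s ON s.accountnum = a.accountnum
    JOIN CustomerMO c ON c.SSN = s.SSN
    WHERE a.matchlevel = @level
      AND NOT EXISTS (SELECT 1 FROM #accts x WHERE x.accountnum = c.accountnum)
    SET @added = @added + @@ROWCOUNT

    SET @level = @level + 1
END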
As you can see in the example image, the cost for the memory optimized query is much lower, and as data is added to #accts for each new level of separation discovered, the traditional (top) query becomes slower and slower.
I wouldn't have thought that Hekaton could take my report query down from 30+ minutes to 3 seconds, but in the end it did. *Note that the source data is static and repopulated just twice a week. With that said, I didn't look into any impact that "report style" queries may have on OLTP operations. I'll leave that to you.
What are some key concerns/limitations of memory optimized tables?
- Cross database transactions are not allowed.
- Some complex queries seem to confuse Memory Optimized tables. Consider simplifying your queries for the best performance.
- Cannot truncate Memory Optimized tables (see the example after this list).
- Identity columns are limited.
- Foreign keys are limited.
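For example, because TRUNCATE TABLE isn't supported on memory optimized tables, clearing one out means falling back to a plain DELETE:
--TRUNCATE TABLE CustomerMO fails on a memory optimized table; a plain DELETE works instead.
DELETE FROM CustomerMO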
For a full list of limitations check out the following MSDN resource: https://msdn.microsoft.com/en-us/library/dn246937.aspx
I am an independent consultant. If you’d like to work with me or need help tuning your environment, please reach out on LinkedIn, Twitter, or daniel@austindatapros.com.