What ate all my DB resources?

  • Environment: Azure SQL Managed Instance with 6 vCores

    Normal usage: 2-3% at the time of the event

    Metrics: in the attachments

    The story:

    I am trying to understand what happened on my database a few days ago. We normalized the Users table and created a new UsersExternalGroups table out of it; doing so added about 9 million rows to the new table. Then, 2 minutes later, we swapped the web nodes over to use the new table. We ran the data migration in batches of 10,000 rows, and every 10 minutes we stopped it for a couple of minutes and then resumed it. Two strange things happened in this time:

    1. The CPU percentage stayed at 90% during the 2-minute gaps we had between each ~2 million rows that we inserted.
    2. After the data migration finished, the CPU maximum dropped to 10%, but it hit 100% right after we started using the new table. Our users then couldn't use the application until we doubled the resources, which made IO go up for 1 minute, after which the CPU came back to the usual 2-3%!

    The UserExternalGroups table structure is as below (there were some design limitations because of the framework we had to use, so it is not ideal). Before, the list of external groups was part of the JSON data in the Users table.

    CREATE TABLE [dbo].[UserExternalGroups] (
        [TenantId]               VARCHAR (25)   NOT NULL,
        [Id]                     BIGINT         NOT NULL,
        [Json]                   NVARCHAR (MAX) NOT NULL,
        [UserId]                 AS             (CONVERT([BIGINT], JSON_VALUE([Json], '$.userId'))),
        [ExternalGroupId]        AS             (CONVERT([BIGINT], JSON_VALUE([Json], '$.externalGroupId'))),
        CONSTRAINT [PK_UserExternalGroups] PRIMARY KEY CLUSTERED ([TenantId] ASC, [Id] ASC),
        CONSTRAINT [AK_UserExternalGroups] UNIQUE([TenantId], [UserId], [ExternalGroupId])
        );

    GO
    CREATE NONCLUSTERED INDEX [IX_UserExternalGroups_User]
        ON [dbo].[UserExternalGroups]([TenantId] ASC, [UserId] ASC);
    GO
    CREATE NONCLUSTERED INDEX [IX_UserExternalGroups_ExternalGroup]
        ON [dbo].[UserExternalGroups]([TenantId] ASC, [ExternalGroupId] ASC);

    So basically, I have 3 problems:

    • First, why did the CPU remain at 90% when there was no activity? (Data IO was at 0, but neither the CPU nor the log IO was.) I also checked the index fragmentation right afterwards and it was at 99%, so I wouldn't expect any rebuilding to have happened that I didn't notice. (A minimal sketch of that fragmentation check is after this list.)
    • Second, what could have caused the CPU to reach 100%? Normal usage was minimal (2-3%) before the migration and after the incident, so what used the resources for over an hour? I was thinking maybe there was a deadlock or something, but then doubling the number of vCores shouldn't have fixed the problem.
    • Third, where do you look to figure this out?
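    For reference, this is roughly the fragmentation check I mean, using sys.dm_db_index_physical_stats against the table from the DDL below (LIMITED mode keeps the scan cheap):

    -- Fragmentation and page counts for every index on the table.
    SELECT  i.name AS index_name,
            ips.avg_fragmentation_in_percent,
            ips.page_count
    FROM    sys.dm_db_index_physical_stats(
                DB_ID(), OBJECT_ID(N'dbo.UserExternalGroups'),
                NULL, NULL, 'LIMITED') AS ips
    JOIN    sys.indexes AS i
            ON  i.object_id = ips.object_id
            AND i.index_id  = ips.index_id;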

    --

    Looking at the data, the query that eats the most resources by far is this:

    (@__TenantId_0 varchar(25), @__userId_1 bigint)
    SELECT [u].[TenantId], [u].[Id], [u].[CreatedAt], [u].[ExternalGroupId], [u].[Json], [u].[ModifiedAt], [u].[UserId]
    FROM [UserExternalGroups] AS [u]
    WHERE ([u].[TenantId] = @__TenantId_0) AND ([u].[UserId] = @__userId_1)

    After a few days, we got a recommendation from Azure to create a new index based on our usage:

    CREATE NONCLUSTERED INDEX [nci_wi_UserExternalGroups_9D402E5B1DA0CFF7343EE7F91E08FCD5]
        ON [dbo].[UserExternalGroups] ([TenantId], [UserId])
        INCLUDE ([ExternalGroupId], [Json])
        WITH (ONLINE = ON);

    Best Regards,
    Ashkan

  • You indicated you split a table into two tables, but showed the DDL for only one table. Is this the before or after?  If after, what's the second table?

    What are the queries like that use the table? Are queries, perhaps, having to scan the whole table rather than seeking because indexes don't support the queries with the new schema? Have you looked at execution plans/performance following the change? How do they compare to before?

    On top of what ratbak indicated - when you say "Azure SQL", do you mean an Azure SQL Instance, an Azure SQL Managed Instance, or an Azure Windows VM running SQL Server? The reason I ask is that if it is a VM, the CPU could have been in use by the OS.

    Now, assuming this is an Azure SQL Instance (no OS for you to manage), my guess about the high CPU in your first question is that it was updating statistics as the data was inserted, since your indexes still show high fragmentation. That being said, I would expect high CPU from a statistics update to be short-lived.
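    If you want to check that theory, here is a minimal sketch (table name taken from your DDL) showing when each statistics object on the table was last updated and how many modifications it has accumulated since:

    SELECT  s.name AS stats_name,
            sp.last_updated,
            sp.rows,
            sp.rows_sampled,
            sp.modification_counter   -- changes since the last update
    FROM    sys.stats AS s
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
    WHERE   s.object_id = OBJECT_ID(N'dbo.UserExternalGroups');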

    As for why the web application went goofy - that I would expect is due to the web application. Likely it needed to do some stuff on its end which resulted in grabbing a lot of data and potentially doing a lot of new calculations on that data since, to the application, this is all "new" data. If you have something like a search tool (Elasticsearch, for example), it may need to reindex the data from the database into its own store to help with searching. Or you may see higher CPU because you are now pulling data from 2 tables rather than 1 and you need to do lookups between the two tables for the joins. And depending on the structure of your new tables, such as the PK/FK setup and indexes, you may be doing table scans across 2 tables instead of the seeks you might have had on the old setup.

    As for how to investigate, there are a lot of resources related to this. I would recommend looking at what is running on the SQL instance when the CPU is high. What queries are running? Do your indexes support those queries? What does the web application do at startup that could cause problems on the SQL side? You would need to investigate both the web application AND the SQL instance. If you are the DBA, your role would be investigating the instance. The web app admin should investigate the web app to determine what it does at startup and what can be optimized there.
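    As one minimal sketch of "what is running right now while the CPU is high" (standard DMVs, nothing Azure-specific assumed):

    SELECT  r.session_id,
            r.status,
            r.cpu_time,              -- ms of CPU used by the request so far
            r.wait_type,
            r.blocking_session_id,
            t.text AS running_sql
    FROM    sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE   r.session_id <> @@SPID   -- exclude this monitoring query itself
    ORDER BY r.cpu_time DESC;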

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Without the queries that caused the CPU to stick to the ceiling, there's really no way to help here.

    I will tell you that if you didn't do your inserts in the same order as the PK, you can have massive page splits and fragmentation of the Clustered Index unless you prepared for and correctly handled that.

    Even if you did do things in the correct order or avoided all fragmentation, the Nickle'n'Dime inserts you did might have left a whole lot of data with no statistics on it, especially the later stuff.  Before you do anything else, I recommend you rebuild the statistics on all the indexes.
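    A minimal sketch of that rebuild (FULLSCAN is my assumption here; a sampled update may be good enough at this row count):

    -- Update every statistics object on the table, reading all rows.
    UPDATE STATISTICS dbo.UserExternalGroups WITH FULLSCAN;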

    If you didn't actually force the JSON to go "out of row", you may be wasting a huge amount of memory because of "Trapped Short Rows".  I don't know if any of the Azure products actually have the out-of-row option, though.  Be advised that it would also require an empty string as a default on the column if the column is "ExpAnsively" updated (which causes the size of the row(s) to grow) or you'll end up with massive fragmentation of the Clustered Index.
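    For reference, a minimal sketch of forcing the LOB data out of row with sp_tableoption, which I believe is available in the Azure flavors as well - note it only affects rows written after the option is set, so existing rows only move the next time their LOB column is updated:

    -- Store NVARCHAR(MAX) values off the in-row pages in LOB storage.
    EXEC sys.sp_tableoption
         @TableNamePattern = N'dbo.UserExternalGroups',
         @OptionName       = 'large value types out of row',
         @OptionValue      = 1;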

    Neither of your computed columns is persisted.  Worse than that, they use scalar functions, which will typically cause all queries against the table to run single-threaded even if you don't reference the columns in a given query.  They should be maintained with a trigger instead of being calculated, even as persisted computed columns.
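    A minimal sketch of the trigger approach, as an assumption of how it could look here rather than a drop-in script (the AK constraint and both nonclustered indexes reference those columns, so they would have to be dropped first and recreated afterwards):

    -- 1) Replace the computed columns with plain columns.
    --    (Drop AK_UserExternalGroups and the two nonclustered
    --     indexes first; recreate them after the change.)
    ALTER TABLE dbo.UserExternalGroups DROP COLUMN UserId, ExternalGroupId;
    ALTER TABLE dbo.UserExternalGroups ADD
        UserId          BIGINT NULL,
        ExternalGroupId BIGINT NULL;
    GO
    -- 2) Keep the columns in sync with the JSON.
    CREATE OR ALTER TRIGGER dbo.tr_UserExternalGroups_SyncJson
    ON dbo.UserExternalGroups
    AFTER INSERT, UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Skip updates that didn't touch the JSON (always true for inserts).
        IF NOT UPDATE([Json]) RETURN;
        UPDATE ueg
           SET UserId          = CONVERT(BIGINT, JSON_VALUE(i.[Json], '$.userId')),
               ExternalGroupId = CONVERT(BIGINT, JSON_VALUE(i.[Json], '$.externalGroupId'))
          FROM dbo.UserExternalGroups AS ueg
          JOIN inserted AS i
            ON  i.TenantId = ueg.TenantId
            AND i.Id       = ueg.Id;
    END;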

    You asked where to find out what the problem is... The answer is in the "Actual Execution Plan" but, again, I don't even know if that functionality is available in Azure.  I just haven't used it.

    For how to post what we need to help you with performance issues of this nature, please study the article at the 2nd link in my signature line below with the understanding that it was written before a lot of people even knew what a "cloud" was and may not be the same for Azure.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The problem is very likely in the queries, as has been stated. You can capture execution plans from Azure SQL Database. You can do it pretty much all the same ways as regular SQL Server except for Trace. You can run a query from SQL Server Management Studio and enable "Actual Execution Plan" to see the execution plan plus runtime metrics. You can get the execution plan through "Estimated Plan." You can also query the DMVs to get the plan. For queries and such, they're the exact same DMVs as in the boxed product. Query Store is on by default, so you can get the plans from there. You can also use Extended Events to capture the plans as they get compiled, recompiled, or after execution so you can get runtime metrics. I'd suggest that.
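    As a minimal sketch of the DMV route (the TOP 10 by worker time is just one reasonable ordering):

    SELECT TOP (10)
           qs.total_worker_time,   -- total CPU, in microseconds
           qs.execution_count,
           st.text AS query_text,
           qp.query_plan
    FROM   sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
    ORDER BY qs.total_worker_time DESC;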

    If you're unsure which query may be causing the pain, you can use Query Performance Insight there in the Azure portal. It'll give you a view of the top resource-consuming queries. You could look to the Query Store to see aggregated behavior. Or you can get detailed metrics by creating an Extended Events session.
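    For the Query Store route, a minimal sketch against its catalog views (aggregating CPU per query text; the exact columns to pull are up to you):

    SELECT TOP (10)
           qt.query_sql_text,
           SUM(rs.count_executions)                   AS executions,
           SUM(rs.avg_cpu_time * rs.count_executions) AS total_cpu_time_us
    FROM   sys.query_store_query_text AS qt
    JOIN   sys.query_store_query AS q
             ON q.query_text_id = qt.query_text_id
    JOIN   sys.query_store_plan AS p
             ON p.query_id = q.query_id
    JOIN   sys.query_store_runtime_stats AS rs
             ON rs.plan_id = p.plan_id
    GROUP BY qt.query_sql_text
    ORDER BY total_cpu_time_us DESC;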

    However, as several people have said, it's probably the queries, so focus there. Get the metrics. Identify the poor performers or the most frequently executed. Get the execution plan. Understand what's going wrong. Fix it.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks a lot everyone for the awesome guidance. I have added the missing information to the question. I have asked for permission to look into the production database's Query Store, so hopefully I can give an update on that soon.

    Regarding the queries, we don't have Elasticsearch or any other tools reading from the data. It is a pretty new platform that we are building, and even the user data is not big (only about 200k users). The queries running on this data are the simplest you could expect. Before the changes, we were only querying from one user's point of view, which is why the de-normalized data made sense. We were doing something like:

    SELECT [TenantId], [Id], [Name], [Email], [Json]
    FROM Users
    WHERE [TenantId] = @tId AND [Id] = @Id

    I'd say that's pretty efficient and on the clustered index. With the changes, we are supposed to support queries from the groups' side as well, for both SELECT and UPDATE. With the new structure we have the middle table, which is supposed to serve these 2 queries:

    SELECT [TenantId], [Id], [UserId], [ExternalGroupId], [Json]
    FROM UsersExternalGroups
    WHERE [TenantId] = @tId AND [UserId] = @uId

    As well as

    SELECT [TenantId], [Id], [UserId], [ExternalGroupId], [Json]
    FROM UsersExternalGroups
    WHERE [TenantId] = @tId AND [ExternalGroupId] = @egId

    The primary key in the new table (UsersExternalGroups) is the composite key (TenantId, Id), and the Id is generated from time ticks in the web app. When doing the migration, we started from a lower point in time and went with an incremental approach (lastIndex + 1).

    This way I don't expect much fragmentation (if any) in the clustered index, but I am concerned that it isn't used by anything after all: looking at the 2 queries above, we only ever use the non-clustered indexes.

    I am also not that concerned about the size of the JSON column. We don't have that much data there, so I am sure there are no rows over 1 KB.
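    For what it's worth, a minimal sketch to verify that row-size assumption (the SAMPLED and DETAILED modes of this DMV report record sizes, unlike LIMITED, at the cost of a heavier scan):

    SELECT  index_id,
            avg_record_size_in_bytes,
            max_record_size_in_bytes,
            page_count
    FROM    sys.dm_db_index_physical_stats(
                DB_ID(), OBJECT_ID(N'dbo.UserExternalGroups'),
                NULL, NULL, 'SAMPLED');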

    Regarding computed columns vs. persisted trigger-maintained data, I don't have enough knowledge, so I would like to read more. I was expecting the indexes to come into play already, since they are persisted. Isn't that the case here?

    Best Regards,
    Ashkan

  • ashkan sirous wrote:

    Regarding computed columns vs. persisted trigger-maintained data, I don't have enough knowledge, so I would like to read more. I was expecting the indexes to come into play already, since they are persisted. Isn't that the case here?

    It doesn't matter whether it's persisted or not, indexed or not: if you use a UDF in a computed column, queries against the table won't go parallel even if the column is never referenced in the queries.  See the following article and, yes... I did follow-up testing in more ways than one, and Mr. Ozar is spot on.

    Here's the link to his article that explains it.

    https://www.brentozar.com/archive/2020/10/using-triggers-to-replace-scalar-udfs-on-computed-columns/

    Here are a couple of other things in the same vein...

    https://www.brentozar.com/archive/2016/01/another-reason-why-scalar-functions-in-computed-columns-is-a-bad-idea/

    https://www.brentozar.com/archive/2016/01/still-serial-after-all-these-years/

    https://www.brentozar.com/archive/2016/04/another-hidden-parallelism-killer-scalar-udfs-check-constraints/

    --Jeff Moden


