Availability Group - Data Import Problems

  • We have a two-node WSFC with a SQL Server Availability Group, and we are having a performance problem: data imports are very slow.

    Specification

    -Server OS: Windows Server 2019 Standard

    -SQL Version: SQL Server 2019 Enterprise

    -Servers are in different data centers

    Currently the database is on a Windows 2016 / SQL 2016 standalone server, and we want to migrate to a Server 2019 / SQL 2019 Availability Group.

    The problem we are having is that the program imports about 120K records on a daily basis, and performance while using the AG is very bad.

    More specifically, while the application is on the standalone 2016 server with SQL 2016, the imports run in a few hours. When we migrated the database to the Server 2019 / SQL 2019 AG, the import times tanked, taking up to 12 hours to perform the same process.

    We found that on the 2019 system, if we take the database out of the AG, the import is three times faster - an acceptable rate for the application. We also saw a small improvement when switching from synchronous commit to asynchronous commit, but it was still very slow.

    I ran as many performance checks as I could, but the main problem really only appears when the database is in the AG.

    The two nodes are in geographically separated data centers, but network throughput should be fine. I am thinking of adding another node in the same data center and making it a three-node cluster.

    Has anybody had the same problem before? For an application that imports a truckload of data, is an AG not a good solution?

    Do you have any suggestions on what I can do about this issue? Any feedback is appreciated.

    Jeff

  • In synchronous mode - data needs to be hardened on the secondary before it can be committed on the primary. That can have a significant impact on performance - and since the servers are in separate data centers, that latency would definitely have an impact.
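
    If you want to see how far behind the secondary actually is during the import, the AG DMVs will show it. A minimal sketch, run on the primary:

        -- Log send/redo queues and lag per replica, per database
        SELECT ar.replica_server_name,
               DB_NAME(drs.database_id) AS database_name,
               drs.log_send_queue_size,     -- KB not yet sent to the secondary
               drs.redo_queue_size,         -- KB not yet redone on the secondary
               drs.secondary_lag_seconds
        FROM sys.dm_hadr_database_replica_states AS drs
        JOIN sys.availability_replicas AS ar
          ON ar.replica_id = drs.replica_id;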

    In asynchronous mode - the data can be committed on the primary, but then you could be running into the transaction log needing to grow to accommodate the import process. If you are shrinking the log file before/after the import process and/or shrinking the data files - that is going to have an impact on performance as well.
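
    A quick way to check whether the log is pre-sized sensibly or growing in tiny increments during the load - a sketch, run in the database being imported into:

        -- Log file size and autogrowth settings; size/growth are in 8 KB pages
        SELECT name,
               size * 8 / 1024 AS size_mb,
               CASE WHEN is_percent_growth = 1
                    THEN CAST(growth AS varchar(10)) + '%'
                    ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
               END AS growth_setting
        FROM sys.database_files
        WHERE type_desc = 'LOG';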

    Now - with all that said, loading 120K rows shouldn't take more than a couple of minutes at most. Given that it already takes several hours, I would be focusing on why it takes that long. Once you understand why it is taking that amount of time, you can focus on fixing that process. My guess is the new systems are misconfigured somewhere - don't have the same resources available - or are missing statistics - or something else.
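
    The fastest way to find out is to look at what the import session is waiting on while it runs. A minimal sketch; if the AG really is the bottleneck, you would expect HADR_SYNC_COMMIT waits to dominate:

        -- What are active sessions waiting on right now?
        SELECT r.session_id,
               r.status,
               r.command,
               r.wait_type,
               r.wait_time
        FROM sys.dm_exec_requests AS r
        WHERE r.session_id > 50;   -- skip most system sessions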

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Jeffrey Williams wrote:

    In synchronous mode - data needs to be hardened on the secondary before it can be committed on the primary. That can have a significant impact on performance - and since the servers are in separate data centers, that latency would definitely have an impact.

    In asynchronous mode - the data can be committed on the primary, but then you could be running into the transaction log needing to grow to accommodate the import process. If you are shrinking the log file before/after the import process and/or shrinking the data files - that is going to have an impact on performance as well.

    Now - with all that said, loading 120K rows shouldn't take more than a couple of minutes at most. Given that it already takes several hours, I would be focusing on why it takes that long. Once you understand why it is taking that amount of time, you can focus on fixing that process. My guess is the new systems are misconfigured somewhere - don't have the same resources available - or are missing statistics - or something else.

    I'm with Jeffrey. The fact that it originally took 4 hours is what I would focus on.

    I have an identical setup; the remote data center replica was set to synchronous mode. That is fine for normal OLTP operations, but it became an issue when batch operations needed to occur.

    I changed it to async, and the execution times went back to normal.
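
    For reference, the change is two statements on the primary. A sketch - the AG and replica names here are placeholders, not your actual names:

        -- Switch the remote replica to asynchronous commit
        ALTER AVAILABILITY GROUP [YourAG]
        MODIFY REPLICA ON 'REMOTENODE'
        WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

        -- Asynchronous replicas only support manual failover
        ALTER AVAILABILITY GROUP [YourAG]
        MODIFY REPLICA ON 'REMOTENODE'
        WITH (FAILOVER_MODE = MANUAL);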

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Thank you very much for the feedback.

    I will first test the database on a cluster with two nodes in the same data center, trying both synchronous and asynchronous commit modes.

    I will also try several other performance improvements and let you know the outcome.

    Jeff

  • Thank you very much for the reply on this. I am still working on this project.

    I have built a new server and added it to the cluster in the primary data center and to the Availability Group. I am making as many performance upgrades as I can, including:

    • Formatting all the disks with 64K allocation units
    • Adding .ndf files and putting them on different LUNs - I did read the article from Steve saying that doing this does not help, but I am following the developer's recommendations
    • Adding more tempdb .ndf files, again on separate LUNs
    • Changing the compatibility level to 2016 per the developer's request (see the sketch after this list)
    • Setting up an index rebuild job to run every two days, per the developer's request
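
    For the compatibility level item, this is what I plan to run - a one-liner, with the database name as a placeholder (SQL Server 2016 corresponds to level 130):

        ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = 130;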

    The server has 8 sockets with 8 cores and 64 GB of memory, so along with all of these changes I am hoping the testing goes well. I will let you know what happens.

    Jeff

  • Jeffrey Williams wrote:

    Now - with all that said, loading 120K rows shouldn't take more than a couple of minutes at most. Given that it already takes several hours, I would be focusing on why it takes that long.

    Amen to that. I was importing 1.2 million rows of 40 columns, doing the data validation and the merges, on 32-bit machines 15 years ago, and it took 56 seconds. 120K rows should complete in seconds on today's machines. There's something wrong even with the 2016 machine, and to add an AG on top of that and have it take hours is a bit crazy to me UNLESS there are huge LOBs also being imported.
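
    As a sanity check on the load itself: a batch-committed bulk load of that size should fly. A sketch with made-up table and file names - and remember that a database in an AG must use the FULL recovery model, so you won't get minimally logged inserts no matter which hints you use:

        -- Hypothetical example; path, table, and delimiters are placeholders
        BULK INSERT dbo.ImportStaging
        FROM 'C:\Imports\daily_feed.csv'
        WITH (
            FIELDTERMINATOR = ',',
            ROWTERMINATOR   = '\n',
            BATCHSIZE       = 10000,  -- commit in chunks to limit log pressure
            TABLOCK                   -- table-level lock; helps throughput on a staging table
        );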

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • jayoub wrote:

    Thank you very much for the reply on this. I am still working on this project.

    I have built a new server and added it to the cluster in the primary data center and to the Availability Group. I am making as many performance upgrades as I can, including:

    • Formatting all the disks with 64K allocation units
    • Adding .ndf files and putting them on different LUNs - I did read the article from Steve saying that doing this does not help, but I am following the developer's recommendations
    • Adding more tempdb .ndf files, again on separate LUNs
    • Changing the compatibility level to 2016 per the developer's request
    • Setting up an index rebuild job to run every two days, per the developer's request

    The server has 8 sockets with 8 cores and 64 GB of memory, so along with all of these changes I am hoping the testing goes well. I will let you know what happens.

    What facts did the developer provide that would make you think that adding LUNs and .ndf files would help? Are these physical machines or virtual machines? What kind of SAN do you have? iSCSI? Fiber? Is it configured properly for a database workload? Have you measured latency? If the "Steve" you are referring to is "Steve Jones", I think I know where my money would be.

    How many tempdb files did you create, and have they been configured properly to use the proportional fill algorithm?

    There is no difference in the query optimizer between 2016 and 2019 that would indicate changing the compatibility level would help performance. Again, what facts indicated that this should be done?  What steps did you take after you migrated?  Were statistics updated?  When you built the server, did you do any load testing?  If so, what were the results?

    What facts would lead you to reindex every 2 days? Are you limiting your rebuild to indexes that actually have fragmentation? Are you reorganizing any indexes?  If you are doing a large set of imports after reindexing, you are likely causing a significant amount of page splits which may have a much larger performance impact than a fragmented index.

    It really sounds like the developer is making recommendations based upon something he heard someplace, without facts to back up any of these recommendations.

    And, more importantly, do YOU understand the reasons to make these changes?

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Thank you very much; I appreciate your feedback.

    I was pulled off the project for a bit; I will resume on Monday and will be able to address your comments.

    Jeff

  • Thank you for the feedback on this topic.

    The developer tested after making these changes, and performance has improved. The job that imports the 120K records is only taking 2 hours now, running at about 1K records per minute; before, it was about 30 records per minute. Since we placed two servers in one data center, we are seeing the increased performance and are able to configure the Availability Group with synchronous commit and automatic failover. The server in the remote data center is configured with asynchronous commit and manual failover. The original configuration was not done using best practices, and that caused the performance problems.

    While dealing with this problem I realized that using Availability Groups for this type of database processing is not the best solution, but we were able to make it work with the SQL changes as well as changes the developer is making on the application side.

    Below are answers to your questions

    What facts did the developer provide that would make you think that adding LUNS and .Ndf's would help?

    We were both under the impression that adding more files would improve performance, but after reading the article from Steve Jones it may not do anything for performance. I also pre-sized the files with plenty of space, hoping that would help as well.

    Are these physical machines or virtual machines? Virtual 

    What kind of SAN do you have?  iSCSI? Fiber? I do not know

    Is it configured properly for a database workload? I do not know

    Have you measured latency? I used the Glenn Berry scripts to measure I/O latency, and everything was under 20ms while doing the data loads. I also used those scripts to check memory and CPU pressure, and both were low while processing the loads.
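
    For anyone following along, the latency numbers come from the file-stats DMV; a minimal sketch along the same lines as those scripts (not the exact script):

        -- Average read/write latency per database file since instance start
        SELECT DB_NAME(vfs.database_id) AS database_name,
               mf.name AS file_name,
               vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
               vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
        FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        JOIN sys.master_files AS mf
          ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id;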

    If the "Steve" you are referring to is "Steve Jones", I think I know where my money would be.

    Yup

    How many tempdb files did you create, and have they been configured properly to use the proportional fill algorithm?

    So, I created eight files because I have eight CPUs, and from what I read (and I could be wrong) proportional fill is used by default. I made no changes to the tempdb files after creating them. If you have any suggestions, I am open.
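
    From what I read, proportional fill only balances the files if they are all the same size with the same growth settings, so I verified mine with a quick check - a sketch:

        -- tempdb data files should all show the same size and growth settings
        SELECT name,
               size * 8 / 1024 AS size_mb,
               growth,
               is_percent_growth
        FROM tempdb.sys.database_files
        WHERE type_desc = 'ROWS';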

    There is no difference in the query optimizer between 2016 and 2019 that would indicate changing the compatibility level would help performance.

    I understand. In this case, the vendor documentation is what the developer used when making this suggestion.

    Again, what facts indicated that this should be done? Vendor docs and developer recommendations.

    What steps did you take after you migrated?

    • Added a node in the first data center and set it up with synchronous commit and automatic failover. There is one node in a different data center set to asynchronous commit and manual failover.

    • Formatted all the disks with 64K allocation units

    • Added .ndf files and put them on different LUNs

    • Added more tempdb .ndf files, again on separate LUNs

    • Changed the compatibility level to 2016 per the developer's request

    • Set up an index rebuild job to run every two days, per the developer's request

    Were statistics updated? Yes, but only as part of the index rebuild, not in a job of its own.
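
    As I understand it, a rebuild refreshes statistics only on the rebuilt indexes, so column statistics are not touched; I may add a separate step for those. A minimal sketch:

        -- Update out-of-date statistics across the database (default sampling)
        EXEC sp_updatestats;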

    When you built the server, did you do any load testing? The original servers were not built using best practice recommendations, and there was no specific load testing. We were able to try the application on the new servers before they went live, and that is when we discovered the problems, so we decided to make the improvements.

    If so, what were the results? N/A

    What facts would lead you to reindex every 2 days? We looked at index fragmentation after a load and it was high, so we decided to rebuild more frequently. 

    Are you limiting your rebuild to indexes that actually have fragmentation? No; since the database is very small, I do everything. The job only takes a few minutes.

    Are you reorganizing any indexes? No, and I likely will not set up a reorganize job.

    If you are doing a large set of imports after reindexing, you are likely causing a significant amount of page splits which may have a much larger performance impact than a fragmented index.

    Understood. I will work on this with the developer during the remaining testing, and we will see whether we want to alter the index rebuild schedule or change the fill factor.
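
    Something like this is what I have in mind - the index and table names are placeholders, and 90 is just a starting guess for the fill factor, not a recommendation:

        -- Find indexes that are actually fragmented before rebuilding
        SELECT OBJECT_NAME(ips.object_id) AS table_name,
               i.name AS index_name,
               ips.avg_fragmentation_in_percent
        FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
        JOIN sys.indexes AS i
          ON i.object_id = ips.object_id AND i.index_id = ips.index_id
        WHERE ips.avg_fragmentation_in_percent > 30;

        -- Rebuild a specific index, leaving free space for incoming rows
        ALTER INDEX [IX_SomeIndex] ON dbo.SomeTable
        REBUILD WITH (FILLFACTOR = 90, ONLINE = ON);  -- ONLINE requires Enterprise Edition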

    It really sounds like the developer is making recommendations based upon something he heard someplace, without facts to back up any of these recommendations.

    Yes, that could be true.  For me it has been a few years since I had to deal with performance issues and these types of database settings, so I am a little rusty. 

    And, more importantly, do YOU understand the reasons to make these changes? Yes, it's all coming back to me with your help and SQLServerCentral as a resource.

    Jeff
