Best way to create report databases from highly transactional production databases

  • Hello,

    This has been a week of moving forward and then taking steps backward.

    I have three production databases, two of which are highly transactional.  Every few seconds changes are inserted into the databases between 7:00 a.m. and 3:30 a.m. (except during breaks and the lunch hour).

    I don't want to impact the production database performance.  So I thought I would create a report server with reporting databases.  Cool.

    I didn't want to create another clustered server, so availability groups were out.
    I tried log shipping, but the log restores kick people out of the system.
    I tried to set up database mirroring, but I couldn't get it to work, and the mirror database isn't available for reports unless I create database snapshots, which are static and limited in how many you can keep (roughly what I was doing for the snapshots is sketched just below).
    Transactional Replication sounds like overkill, but if I need it, I need it.
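    For reference, this is roughly the snapshot statement I was running against the mirror.  The database, logical file, and path names here are just placeholders for my real ones.

        -- Create a point-in-time, read-only snapshot of the mirrored database.
        -- The logical file name must match the source database's data file, and
        -- a new snapshot has to be created each time fresher data is needed.
        CREATE DATABASE WarehouseDB_Snapshot_0800
        ON ( NAME = WarehouseDB_Data,
             FILENAME = 'D:\Snapshots\WarehouseDB_Snapshot_0800.ss' )
        AS SNAPSHOT OF WarehouseDB;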

    I was hoping to knock this project out this weekend, but it is not working.

    What is the best way to create report databases on a separate server?

    Thanks.

    Things will work out.  Get back up, change some parameters and recode.

  • At a high level, a good portion of reporting databases are loaded once a day unless there is a strong need for real-time publishing or for something more frequent than once a day, for example once an hour, where you end up with 23 hourly exports each day.

    I guess the question I would have is this: I assume the production databases are accessible for reads. What's the major problem with finding a good, low-impact way to read the data into a format that can be imported into a reporting database? For example, can you create a series of exports covering the last 24 hours, run during a specific time period that has the lowest impact on production, and then load those exports into a separate database?

    In one example I dealt with, we took 3 major transactional tables and exported them as 3 separate files rather than one big file, exporting them in small hourly increments over the day and slowly building up those files. The data was summarized in a way that let the source system's indexes be used, so the hourly extracts stayed small and didn't take a great deal of time. The end result was a series of files that were all loaded once a day into a reporting system.
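    As a rough sketch of that kind of hourly slice (the table and column names here are made up, and I'm assuming the source table has an indexed column recording when each row last changed), the extract itself can be nothing more than a filtered SELECT that bcp writes out as a delimited file:

        -- Hypothetical hourly extract from one large transactional table.
        -- @WindowStart / @WindowEnd bound the hour being exported, so each run
        -- only touches rows changed in that window and can seek on ModifiedDate.
        DECLARE @WindowStart datetime2 = '2024-01-15 08:00';
        DECLARE @WindowEnd   datetime2 = '2024-01-15 09:00';

        SELECT OrderID, CustomerID, OrderStatus, Quantity, ModifiedDate
        FROM   dbo.WarehouseOrders
        WHERE  ModifiedDate >= @WindowStart
          AND  ModifiedDate <  @WindowEnd;

        -- The same query can be written straight to a CSV file with bcp, e.g.:
        -- bcp "SELECT ... FROM WarehouseDB.dbo.WarehouseOrders WHERE ..." queryout orders_0800.csv -c -t, -S ProdServer -T

    Running one of those per table per hour gives you the small, slowly accumulating files described above.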

    There are a number of ways to skin a cat. What are some of the business requirements that prevent you from just reading the data? Can you really not treat the reporting database like any other user that needs to read the transactional data? It may be hard to find a solution here if you have a database from which you can't read the data you are storing.

  • This is one of those questions where the answer is "it depends". Do you need to report on the full content of all the databases? Can you pre-aggregate the extraction? What is the batch requests/sec count? How many concurrent users are there? Are the databases designed to allow for delta calculation and extraction?
    😎

    Updates every few seconds sound like a relatively quiet system; one would think that any replication or auditing measures shouldn't bog down those databases.

  • Good questions and I appreciate them.

    I am building a data warehouse that will have aggregated data.  That, I would think, can be uploaded once a day.
    Then there are the warehouse OLTP reports.  The IT Director says the reports should show up-to-the-moment data: the latest sales, the latest production line metrics, and where product is within the warehouse.

    I can easily create my dimension and fact tables for aggregated data which is uploaded once a day.

    However, my challenge is reporting on "what is happening right now in the warehouse" and "this is where we are with sales and customer orders".  The three warehouse databases are highly transactional.  Creating a secondary database that reflects up-to-date information is where I am stuck.  The users want to use reports and Power BI to see where we are "right now" or within the last hour.  The customer orders and sales tables are not as highly transactional as marketing would like, but they still get a lot of inserts throughout the day.

    Copies of the databases that reflect what is going on in production?  As I mentioned, it didn't seem like database mirroring with database snapshots or log shipping would work.

    Any thoughts?

    Things will work out.  Get back up, change some parameters and recode.

  • Sounds like you may have to get clarity from them on the required frequency. You will really need to speak to why you cannot take any number of other approaches you have not mentioned, from a full extract of each table at point X during the day, to incremental extracts between points X and Y of the day, and so forth.

    As I mentioned before, outside of using triggers on the source tables, the easiest way is to do an incremental extract for the last X amount of time every Y period of the day. For example, once every hour, pull the last hour of inserted and updated records using the source's change-tracking fields, and slowly populate the data warehouse used for reporting throughout the 24-hour period until it is complete. Such an extract can be a simple SELECT -> CSV file, where your reporting data warehouse picks up that data via an ETL process and loads it in.
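    On the pickup side, a minimal sketch (the file path, schemas, and column names are invented for illustration) is just a BULK INSERT of the hourly CSV into a staging table followed by a MERGE into the reporting copy:

        -- Hypothetical load step on the reporting server.
        TRUNCATE TABLE staging.WarehouseOrders;

        -- Pull in the hourly CSV produced by the extract.
        BULK INSERT staging.WarehouseOrders
        FROM 'D:\Extracts\orders_0800.csv'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

        -- Apply inserts and updates to the reporting table.
        MERGE reporting.WarehouseOrders AS tgt
        USING staging.WarehouseOrders   AS src
            ON tgt.OrderID = src.OrderID
        WHEN MATCHED THEN
            UPDATE SET tgt.OrderStatus  = src.OrderStatus,
                       tgt.Quantity     = src.Quantity,
                       tgt.ModifiedDate = src.ModifiedDate
        WHEN NOT MATCHED THEN
            INSERT (OrderID, CustomerID, OrderStatus, Quantity, ModifiedDate)
            VALUES (src.OrderID, src.CustomerID, src.OrderStatus, src.Quantity, src.ModifiedDate);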

    Some reasons that may not work: you might not have a way to track changes within the last hour (or at all), or the data might average 100 million records an hour and not be extractable in a reasonable amount of time (i.e., it may take an additional hour to process each hour), and so forth. Knowing that, you may alter your approach based on the performance implications: instead of extracting the entire hour at once, break the hour into 15-minute blocks and incrementally build the hourly extract until it is complete. If that is still too much of an issue, you might run 2 processes in parallel every 30 minutes, each targeting two 15-minute blocks of changes; instead of dumping those two blocks in one big dump, you use a while loop and iterate 1 million records at a time until each 15-minute block is full, forming a 30-minute block. Rinse and repeat once more to complete the 4 files that represent a 1-hour block.
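    A rough sketch of that kind of chunked pull, again with invented table and column names and assuming the source table has an ever-increasing key to page on, might look like this:

        -- Hypothetical batched extract: walk one 15-minute block of changes in
        -- 1-million-row chunks so no single read runs for very long.
        -- (Assumes staging.WarehouseOrders_Block starts empty for this block.)
        DECLARE @BlockStart datetime2 = '2024-01-15 08:00';
        DECLARE @BlockEnd   datetime2 = '2024-01-15 08:15';
        DECLARE @BatchSize  int       = 1000000;
        DECLARE @LastID     bigint    = 0;
        DECLARE @Rows       int       = 1;

        WHILE @Rows > 0
        BEGIN
            INSERT INTO staging.WarehouseOrders_Block
                   (OrderID, CustomerID, OrderStatus, Quantity, ModifiedDate)
            SELECT TOP (@BatchSize)
                   OrderID, CustomerID, OrderStatus, Quantity, ModifiedDate
            FROM   dbo.WarehouseOrders
            WHERE  ModifiedDate >= @BlockStart
              AND  ModifiedDate <  @BlockEnd
              AND  OrderID > @LastID
            ORDER BY OrderID;

            SET @Rows = @@ROWCOUNT;

            -- Move the key watermark forward so the next pass resumes where this one stopped.
            IF @Rows > 0
                SELECT @LastID = MAX(OrderID) FROM staging.WarehouseOrders_Block;
        END

    The completed block then gets written out as one of the 4 files for the hour.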

    So, why would approaches like that be ruled out, and what are the reasons behind them?

  • This discussion has been more helpful than me creating some quick solution over the weekend.

    1.  The warehouse supervisors / managers running a report throughout the day will not impact performance as much as the transactional inserts that happen every few seconds.  I can create a view for each of the 3 reports used by management, which can then be brought into Power BI (a rough sketch is after this list).  Problem solved.  After thinking through the various points, I really don't need to reinvent the wheel.

    2.  The other reports can be viewed and updated every hour.  I can do a CSV extract and load it into the secondary database once every 30 minutes and they will be happy.  Either that, or log shipping will do.

    3.  The data warehouse aggregation can happen each night and then be brought into the SSAS cubes.
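    For point 1, this is roughly the shape of the views I have in mind (the view, table, and column names are placeholders for my real ones):

        -- Placeholder view behind one of the three management reports.
        -- Power BI reads the view directly, so the report logic lives in one
        -- place and only touches the rows and columns it actually needs.
        CREATE VIEW dbo.vw_OpenOrdersRightNow
        AS
        SELECT o.OrderID,
               o.CustomerID,
               o.OrderStatus,
               o.Quantity,
               o.ModifiedDate
        FROM   dbo.WarehouseOrders AS o
        WHERE  o.OrderStatus <> 'Shipped'
          AND  o.ModifiedDate >= DATEADD(HOUR, -1, SYSDATETIME());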

    I don't think transactional replication is needed.  I have a clear plan now.  Just needed another DBA's perspective.

    Things will work out.  Get back up, change some parameters and recode.
