December 18, 2017 at 11:01 am
I thank you for the test, Jeff. However, I still don't believe that someone will be updating 10 million rows at a time. The most common scenario would be to update one row at a time (no RBAR, just different operations).
When updating one row, the reads, writes and times are very similar and only vary by less than 5ms (which I feel is insignificant).
/*****************************************************************************/
/********** DANGER!!! DO NOT RUN THIS CODE ON A PRODUCTION SYSTEM!!! *********/
/*****************************************************************************/
--=============================================================================
-- Test 1: Update the NarrowTable
--=============================================================================
--===== Identify the test in the Messages tab.
SET STATISTICS TIME,IO OFF;
PRINT REPLICATE('=',80);
PRINT SPACE(8)+'Test 1: Update NarrowTable';
PRINT REPLICATE('=',80);
--===== Clear all data and cache from the system.
-- DANGER!!! DON'T EVEN THINK OF RUNNING THIS ON A PRODUCTION BOX!!!!
SET STATISTICS TIME,IO OFF;
CHECKPOINT;
DBCC FREESYSTEMCACHE ('ALL');
DBCC FREESESSIONCACHE;
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
WAITFOR DELAY '0:00:10' -- Wait for everything to "calm down".
SET STATISTICS TIME,IO ON;
GO
PRINT REPLICATE('-',80)
DECLARE @Now DATETIME, @ID BIGINT = ABS(CHECKSUM(NEWID()))%1000000;
SELECT @Now = GETDATE();
UPDATE dbo.NarrowTable
SET SomeDT = @Now
WHERE SomeID = @ID;
CHECKPOINT;
GO 3
/*****************************************************************************/
/********** DANGER!!! DO NOT RUN THIS CODE ON A PRODUCTION SYSTEM!!! *********/
/*****************************************************************************/
--=============================================================================
-- Test 2: Update the WideTable
--=============================================================================
--===== Identify the test in the Messages tab.
SET STATISTICS TIME,IO OFF;
PRINT REPLICATE('=',80);
PRINT SPACE(8)+'Test 2: Update WideTable';
PRINT REPLICATE('=',80);
--===== Clear all data and cache from the system.
-- DANGER!!! DON'T EVEN THINK OF RUNNING THIS ON A PRODUCTION BOX!!!!
SET STATISTICS TIME,IO OFF;
CHECKPOINT;
DBCC FREESYSTEMCACHE ('ALL');
DBCC FREESESSIONCACHE;
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
WAITFOR DELAY '0:00:10' -- Wait for everything to "calm down".
SET STATISTICS TIME,IO ON;
GO
PRINT REPLICATE('-',80)
DECLARE @Now DATETIME, @ID BIGINT = ABS(CHECKSUM(NEWID()))%1000000;
SELECT @Now = GETDATE();
UPDATE dbo.WideTable
SET SomeDT = @Now
WHERE SomeID = @ID;
CHECKPOINT;
GO 3
--===== End of test
SET STATISTICS TIME,IO OFF;
December 18, 2017 at 11:08 am
Grant Fritchey - Friday, December 15, 2017 2:46 AM
Pure speculation and more than a couple of assumptions... The chance for a page split is higher when updating the wider table, so over time, accumulated, the update cost is going to be higher for the wider table than for the narrow one. However, 100 column table? Are we talking fact table or a "normalized" system? If the latter, U R DOIN IT RONG.
We are doing it wrong, but I have no control over what was done before I came here, nor after, as no one asks for my opinion. I just have to deal with what is. Hey, at least I have a job.
December 18, 2017 at 12:36 pm
koustav_1982 - Monday, December 18, 2017 10:52 AM
...the question was related to a fact table design...
...we can update the column that has changed...
...To update a flag column, we have to search for an existing record in the fact table , and then update the flag / expiration date for that column. - This is what we are doing now and which is where the question comes.
Yikes! This raises some red flags with me as someone who's followed the Kimball methodology for a number of years. Fact tables really shouldn't have 100 columns in them. Fact tables are typically highly normalized with a few measure columns and foreign keys to the dimension tables. Dimension tables are more de-normalized and hold all the descriptive information of the data so can have large numbers of columns.
Another issue I see is that you typically want to avoid modifying fact records if possible. If there are changes in the OLTP system, you would typically add another transaction record to the fact table that represented the differences from the original values to the new values, not change or invalidate the original fact record.
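For example, here's a minimal sketch of that pattern (hypothetical dbo.OrderFact table and columns, not anyone's actual schema): the quantity revision the original poster described gets recorded as a delta row rather than an update.
--===== Hypothetical delta-style fact table; names are illustrative only.
CREATE TABLE dbo.OrderFact
(
    OrderKey  int NOT NULL,
    DateKey   int NOT NULL,  -- surrogate date key, e.g. 20171218
    Quantity  int NOT NULL
);
--===== Original fact row: order 42 booked with a quantity of 10.
INSERT INTO dbo.OrderFact (OrderKey, DateKey, Quantity) VALUES (42, 20171201, 10);
--===== The source later revises the quantity to 20. Rather than updating the
--      original row, insert an adjustment row carrying only the difference (+10).
INSERT INTO dbo.OrderFact (OrderKey, DateKey, Quantity) VALUES (42, 20171218, 10);
--===== The current quantity is a simple aggregate; no flags or updates required.
SELECT OrderKey, SUM(Quantity) AS CurrentQuantity
FROM dbo.OrderFact
GROUP BY OrderKey;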
If you're stuck using this methodology now, then the thing to consider as others have asked is how many records will be updated and how frequently?
December 18, 2017 at 12:55 pm
Luis Cazares - Monday, December 18, 2017 11:01 AM
I thank you for the test, Jeff. However, I still don't believe that someone will be updating 10 million rows at a time. The most common scenario would be to update one row at a time (no RBAR, just different operations).
When updating one row, the reads, writes and times are very similar and only vary by less than 5ms (which I feel is insignificant).
Understood. The whole-table update was meant to demonstrate that table width does matter. As for there being "less than 5ms" difference for a singleton, that's fine, but only if you're not updating thousands of rows per hour. The other problem is that the table might not be used only for that one "thing".
While "It Depends" is certainly true, I've found that (most of the time) super-wide tables cause more trouble than they're worth. Another example of a problem I've seen (and had to repair) is that someone thought they were doing the right thing by making a CLR for "field"-level audits that could be copied to any table. It would automatically figure things out based on the content of the INSERTED and DELETED logical tables. It worked absolutely great... until they applied it to an insane 137-column-wide table. The people that installed it didn't understand that the two logical trigger tables had to be fully materialized to work in the CLR because they were out of scope for the CLR. It was taking more than 4 minutes to update just 4 columns on just 10,000 rows.
I'm also not trying to convince anyone to totally avoid super-wide tables (we fixed the trigger problem rather than "sister" the table because of other requirements)... Rather, I'm trying to make folks understand that they can be quite the problem and that they need to test in situ to make sure the problems that can be present, aren't. For example, a lot of folks never test for the inevitable flush to disk. Although some would consider it to be some form of "death by pre-optimization", this type of thing is a really important consideration at design time because it's either a huge pain or an impossibility to change further down the road.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2017 at 10:30 pm
Chris Harshman - Monday, December 18, 2017 12:36 PM
koustav_1982 - Monday, December 18, 2017 10:52 AM
...the question was related to a fact table design...
...we can update the column that has changed...
...To update a flag column, we have to search for an existing record in the fact table , and then update the flag / expiration date for that column. - This is what we are doing now and which is where the question comes.
Yikes! This raises some red flags with me as someone who's followed the Kimball methodology for a number of years. Fact tables really shouldn't have 100 columns in them. Fact tables are typically highly normalized with a few measure columns and foreign keys to the dimension tables. Dimension tables are more de-normalized and hold all the descriptive information of the data so can have large numbers of columns.
Another issue I see is that you typically want to avoid modifying fact records if possible. If there are changes in the OLTP system, you would typically add another transaction record to the fact table that represented the differences from the original values to the new values, not change or invalidate the original fact record.
If you're stuck using this methodology now, then the thing to consider as others have asked is how many records will be updated and how frequently?
Well, that depends on who you ask. Many promote Kimball as denormalization for facts when it comes to reads and normalization for facts when it comes to updates, inserts and deletes. In the same breath, many also promote the idea that if you're building denormalized facts, you avoid transactions as much as possible in the data warehouse and instead find ways of doing them outside the warehouse, such as doing them on disk for speed and low overhead. Normalization is something many regard as being for proper OLTP systems supporting an application, not necessarily for something supporting large reads.
For example, I have a large 40-field table with billions of records. In a normal row-based system where you have row-based indexing, constraints, and maybe triggers or whatever, operations on this table are going to be harsh. However, in the case of a highly compressed column store where data is hashed across multiple nodes within a very low-overhead data model with no constraints/triggers and where updates never happen -- only inserts -- having a large billion-record table with a lot of columns works fine.
That's not to say normalization is bad for data warehousing. There are plenty of normalized data warehouses with denormalized views. But adding new data sources, creating workable datasets from them and so forth becomes a pain with the complexity versus something flat and easy. Even then it just depends... especially when considering everything that has to happen before data even lands in said table to begin with.
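As a rough sketch only (a hypothetical dbo.FactEvent table, and the SMP clustered columnstore syntax available since SQL Server 2014 rather than the MPP setup I'm describing), the insert-only, column-compressed storage looks something like this:
--===== Hypothetical wide, insert-only fact table; only a few of its many columns shown.
CREATE TABLE dbo.FactEvent
(
    EventDateKey  int            NOT NULL,
    CustomerKey   int            NOT NULL,
    ProductKey    int            NOT NULL,
    Quantity      int            NOT NULL,
    Amount        decimal(18,2)  NOT NULL
    -- ... dozens more measure/degenerate columns in the real table
);
--===== Store the whole table column-wise and compressed; a scan that touches only
--      a handful of columns reads only those column segments.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactEvent ON dbo.FactEvent;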
December 18, 2017 at 11:09 pm
xsevensinzx - Monday, December 18, 2017 10:30 PM
Chris Harshman - Monday, December 18, 2017 12:36 PM
koustav_1982 - Monday, December 18, 2017 10:52 AM
...the question was related to a fact table design...
...we can update the column that has changed...
...To update a flag column, we have to search for an existing record in the fact table , and then update the flag / expiration date for that column. - This is what we are doing now and which is where the question comes.
Yikes! This raises some red flags with me as someone who's followed the Kimball methodology for a number of years. Fact tables really shouldn't have 100 columns in them. Fact tables are typically highly normalized with a few measure columns and foreign keys to the dimension tables. Dimension tables are more de-normalized and hold all the descriptive information of the data so can have large numbers of columns.
Another issue I see is that you typically want to avoid modifying fact records if possible. If there are changes in the OLTP system, you would typically add another transaction record to the fact table that represented the differences from the original values to the new values, not change or invalidate the original fact record.
If you're stuck using this methodology now, then the thing to consider as others have asked is how many records will be updated and how frequently?
Well, that depends on who you ask. Many promote Kimball as denormalization for facts when it comes to reads and normalization for facts when it comes to updates, inserts and deletes. In the same breath, many also promote the idea that if you're building denormalized facts, you avoid transactions as much as possible in the data warehouse and instead find ways of doing them outside the warehouse, such as doing them on disk for speed and low overhead. Normalization is something many regard as being for proper OLTP systems supporting an application, not necessarily for something supporting large reads.
For example, I have a large 40-field table with billions of records. In a normal row-based system where you have row-based indexing, constraints, and maybe triggers or whatever, operations on this table are going to be harsh. However, in the case of a highly compressed column store where data is hashed across multiple nodes within a very low-overhead data model with no constraints/triggers and where updates never happen -- only inserts -- having a large billion-record table with a lot of columns works fine.
That's not to say normalization is bad for data warehousing. There are plenty of normalized data warehouses with denormalized views. But adding new data sources, creating workable datasets from them and so forth becomes a pain with the complexity versus something flat and easy. Even then it just depends... especially when considering everything that has to happen before data even lands in said table to begin with.
Not a challenge... just a question. Column store, more correctly, properly working column store, is a fairly recent innovation in SQL Server. Did you have to work the large table with row store prior to column store? I'd also genuinely be interested in how you managed data integrity without any constraints.
Shifting gears a bit and extending my curiosity about that large table, what is the byte size of the table and what kind of index maintenance did you find necessary? Also, was it ever a partitioned table of any kind?
--Jeff Moden
Change is inevitable... Change for the better is not.
December 18, 2017 at 11:16 pm
koustav_1982 - Monday, December 18, 2017 10:52 AM
Hi All, thanks for your replies, and thanks Jeff Moden for the extensive research.
Yes, the question was related to a fact table design.
The background:
In data warehousing, in the fact table, revisions of the same record keep getting posted, with minor changes in a few columns. Say an order fact record has a quantity of 10 initially and then with the next revision it was updated to 20.
We have 2 options here: 1. update the same record, or 2. insert a new record, which saves the update cost but adds an additional cost to update the older revision with a flag (marking the record as no longer valid) or to put an expiration date on it.
The first approach:
To update, we can update the column that has changed - this adds the cost of comparing all the columns in the fact table with the incoming record to find out what changed - or update all the columns using the incoming record. Both are costly operations.
The second approach:
To update a flag column, we have to search for an existing record in the fact table, and then update the flag / expiration date for that column. This is what we are doing now and which is where the question comes.
If we separate the PK and the flag into a separate table, we still have to go through the searching step to see if a record already exists, but then we come back and update a smaller table with the flag value.
The downside is that queries on the fact table also have to join this outrigger table at run time.
At what rate will you be "modifying" rows with updated data? I say "modifying" in quotes to include the possibility of INSERTing new rows rather than updating existing rows, neither of which are particularly arduous to the system when done correctly.
p.s. Never update a flag to indicate a row is no longer current. Instead, update an "EndDate" and you can establish a Type 6 SCD that will allow you to do incredibly simple "state of the union" queries for ANY given point in time.
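As a minimal sketch of that idea (a hypothetical dbo.OrderFactVersioned table with StartDate/EndDate columns, not a full Type 6 implementation), the point-in-time query then becomes a simple range predicate:
--===== Hypothetical versioned rows: each revision carries StartDate and EndDate,
--      and the current revision gets a far-future EndDate instead of an "active" flag.
CREATE TABLE dbo.OrderFactVersioned
(
    OrderKey   int      NOT NULL,
    Quantity   int      NOT NULL,
    StartDate  datetime NOT NULL,
    EndDate    datetime NOT NULL DEFAULT '99991231'
);
--===== "State of the union" as of any given point in time:
DECLARE @AsOf datetime = '20170630';
SELECT OrderKey, Quantity, StartDate, EndDate
FROM dbo.OrderFactVersioned
WHERE StartDate <= @AsOf
  AND EndDate   >  @AsOf;  -- the revision that was in effect at @AsOf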
--Jeff Moden
Change is inevitable... Change for the better is not.
December 19, 2017 at 8:30 am
Jeff Moden - Monday, December 18, 2017 11:09 PM
Not a challenge... just a question. Column store, more correctly, properly working column store, is a fairly recent innovation in SQL Server. Did you have to work the large table with row store prior to column store? I'd also genuinely be interested in how you managed data integrity without any constraints.
Shifting gears a bit and extending my curiosity about that large table, what is the byte size of the table and what kind of index maintenance did you find necessary? Also, was it ever a partitioned table of any kind?
Well, I should clarify. I am using Azure Data Warehouse. This is so I can get proper column stores that are distributed across 60 databases and across X nodes depending on the current scale behind the scenes. I wasn't digging the SMP nature of column store indexes and, due to the size of the data, went this route.
That being said, prior to that I was in SQL Server with the same table, just with millions, not billions. It still worked out pretty decently as the clustered index was on Date/Time and data loads were only increments of the last 24 hours. Little to no updates on the table itself (outside of rare issues with fixing bad data, but that involved removing the specific date and all dates after, so all data could be republished without fragmentation).
As most queries used date/time filters, with other non-clustered indexes also being used a great deal, it worked great for most queries/pushes even as a row-based wide table. The issues cropped up when trying to aggregate extremely large sets of the table, essentially doing full scans for certain fields, not all. MPP with columnstores seemed like the way to go.
In terms of data integrity, it's all ETL controlled, as no one outside the data warehouse can insert records. Since it's only incremental records, as long as the loads are deduplicated and we're not inserting records that already exist, all is well. When it comes to foreign keys, those are all conformed by the ETL itself, in the sense that keys are made from the data and pushed with the fact.
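Roughly, the DDL for that kind of table looks like this (hypothetical table and column names; Azure SQL Data Warehouse syntax, where DISTRIBUTION = HASH spreads the rows across the 60 distributions and the clustered columnstore handles the compression):
--===== Hypothetical hash-distributed, columnstore fact table (Azure SQL Data Warehouse).
CREATE TABLE dbo.FactBigEvent
(
    EventDateKey  int            NOT NULL,
    CustomerKey   int            NOT NULL,
    ProductKey    int            NOT NULL,
    Quantity      bigint         NOT NULL,
    Amount        decimal(18,4)  NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),  -- rows hashed across the 60 distributions
    CLUSTERED COLUMNSTORE INDEX        -- compressed, column-wise storage
);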
December 19, 2017 at 9:35 am
Jeff Moden - Monday, December 18, 2017 11:16 PM
p.s. Never update a flag to indicate a row is no longer current. Instead, update an "EndDate" and you can establish a Type 6 SCD that will allow you to do incredibly simple "state of the union" queries for ANY given point in time.
If we were talking about a dimension table, I'd agree; however, I don't think I've ever seen a successful slowly changing fact table. I could be misunderstanding things, but at one point on page 2 of this thread the original poster stated:
In data warehousing, in the fact table, revisions of the same record keep getting posted, with minor changes in a few columns. Say an order fact record has a quantity of 10 initially and then with the next revision it was updated to 20.
December 19, 2017 at 10:57 am
Posting some more updates on one of the fact tables:
Some of the properties of the fact table are:
- wrong - columns were added over time without thinking much
- unavoidable - based on the requirements from the products and BI teams
- have to live with it
- The fact table is 324 columns wide with 126 dimension columns [Yes, it might be a bad design or the only design based on the requirements, but that's what we have]
- It has 60 million rows
- The ETL [Informatica] runs 3 times a day, bringing in 300,000 records every run
- Not all of the 300,000 records are new records; it's probably just 100,000. The rest are updates to existing records - like I mentioned, changes in quantity, location, or other attributes
- We follow an insert-only model; the incoming 300,000 records are inserted
- Older versions of the records are inactivated by a flag update
- Inactive records are purged by a different weekly maintenance job to keep the table crisp [Don't scream, I know it's bad, but again we have to live with that due to numerous other constraints]
- The table has 33 indexes, maintained by DBA jobs
- The load time for the table, including the active flag update, is around 20 minutes
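A heavily simplified sketch of that load pattern (hypothetical staging/fact tables and made-up keys; the real process runs through Informatica and the real fact has 324 columns) would be something like:
--===== Hypothetical, stripped-down version of the insert-then-inactivate load.
CREATE TABLE dbo.StageOrderRevision (OrderKey int NOT NULL, RevisionNo int NOT NULL, Quantity int NOT NULL);
CREATE TABLE dbo.FactOrderRevision  (OrderKey int NOT NULL, RevisionNo int NOT NULL, Quantity int NOT NULL, ActiveFlag bit NOT NULL);
--===== 1) Insert every incoming row as a new, active revision.
INSERT INTO dbo.FactOrderRevision (OrderKey, RevisionNo, Quantity, ActiveFlag)
SELECT s.OrderKey, s.RevisionNo, s.Quantity, 1
FROM dbo.StageOrderRevision AS s;
--===== 2) Inactivate any older revision of an order that just received a newer one.
UPDATE f
SET    f.ActiveFlag = 0
FROM   dbo.FactOrderRevision AS f
JOIN   dbo.StageOrderRevision AS s
  ON   s.OrderKey   = f.OrderKey
 AND   s.RevisionNo > f.RevisionNo
WHERE  f.ActiveFlag = 1;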
December 19, 2017 at 1:46 pm
xsevensinzx - Tuesday, December 19, 2017 8:30 AM
Jeff Moden - Monday, December 18, 2017 11:09 PM
Not a challenge... just a question. Column store, more correctly, properly working column store, is a fairly recent innovation in SQL Server. Did you have to work the large table with row store prior to column store? I'd also genuinely be interested in how you managed data integrity without any constraints.
Shifting gears a bit and extending my curiosity about that large table, what is the byte size of the table and what kind of index maintenance did you find necessary? Also, was it ever a partitioned table of any kind?
Well, I should clarify. I am using Azure Data Warehouse. This is so I can get proper column stores that are distributed across 60 databases and across X nodes depending on the current scale behind the scenes. I wasn't digging the SMP nature of column store indexes and, due to the size of the data, went this route.
That being said, prior to that I was in SQL Server with the same table, just with millions, not billions. It still worked out pretty decently as the clustered index was on Date/Time and data loads were only increments of the last 24 hours. Little to no updates on the table itself (outside of rare issues with fixing bad data, but that involved removing the specific date and all dates after, so all data could be republished without fragmentation).
As most queries used date/time filters, with other non-clustered indexes also being used a great deal, it worked great for most queries/pushes even as a row-based wide table. The issues cropped up when trying to aggregate extremely large sets of the table, essentially doing full scans for certain fields, not all. MPP with columnstores seemed like the way to go.
In terms of data integrity, it's all ETL controlled, as no one outside the data warehouse can insert records. Since it's only incremental records, as long as the loads are deduplicated and we're not inserting records that already exist, all is well. When it comes to foreign keys, those are all conformed by the ETL itself, in the sense that keys are made from the data and pushed with the fact.
Cool. Thanks for the feedback.
--Jeff Moden
Change is inevitable... Change for the better is not.
December 19, 2017 at 4:14 pm
koustav_1982 - Tuesday, December 19, 2017 10:57 AM
Posting some more updates on one of the fact tables:
Some of the properties of the fact table are:
- wrong - columns were added over time without thinking much
- unavoidable - based on the requirements from the products and BI teams
- have to live with it
- The fact table is 324 columns wide with 126 dimension columns [Yes, it might be a bad design or the only design based on the requirements, but that's what we have]
- It has 60 million rows
- The ETL [Informatica] runs 3 times a day, bringing in 300,000 records every run
- Not all of the 300,000 records are new records; it's probably just 100,000. The rest are updates to existing records - like I mentioned, changes in quantity, location, or other attributes
- We follow an insert-only model; the incoming 300,000 records are inserted
- Older versions of the records are inactivated by a flag update
- Inactive records are purged by a different weekly maintenance job to keep the table crisp [Don't scream, I know it's bad, but again we have to live with that due to numerous other constraints]
- The table has 33 indexes, maintained by DBA jobs
- The load time for the table, including the active flag update, is around 20 minutes
Good lord, that's a lot........
How old are the new records versus the existing records? For example, are both the new and existing records only from the past 24 hours? Or are the new records from the past 24 hours while the existing records could be from the past 6 months?
December 21, 2017 at 2:18 pm
xsevensinzx - Tuesday, December 19, 2017 4:14 PM
koustav_1982 - Tuesday, December 19, 2017 10:57 AM
Posting some more updates on one of the fact tables:
Some of the properties of the fact table are:
- wrong - columns were added over time without thinking much
- unavoidable - based on the requirements from the products and BI teams
- have to live with it
- The fact table is 324 columns wide with 126 dimension columns [Yes, it might be a bad design or the only design based on the requirements, but that's what we have]
- It has 60 million rows
- The ETL [Informatica] runs 3 times a day, bringing in 300,000 records every run
- Not all of the 300,000 records are new records; it's probably just 100,000. The rest are updates to existing records - like I mentioned, changes in quantity, location, or other attributes
- We follow an insert-only model; the incoming 300,000 records are inserted
- Older versions of the records are inactivated by a flag update
- Inactive records are purged by a different weekly maintenance job to keep the table crisp [Don't scream, I know it's bad, but again we have to live with that due to numerous other constraints]
- The table has 33 indexes, maintained by DBA jobs
- The load time for the table, including the active flag update, is around 20 minutes
Good lord, that's a lot........
How old are the new records versus the existing records? For example, are both the new and existing records only from the past 24 hours? Or are the new records from the past 24 hours while the existing records could be from the past 6 months?
The revisions can be from any point in time. It could be from the last run, from a version a few days back, or even months older.