March 10, 2014 at 12:33 pm
I have an ETL that imports readings from a flat file. The ReadingDate on the data is always good, it's the ReadingHours that are suspect. How do I flag and remove errant readings?
Here's the code to build a sample table.
USE [tempdb];
SET NOCOUNT ON;
SET XACT_ABORT ON;
GO
BEGIN TRANSACTION;
USE [tempdb]
GO
CREATE TABLE [dbo].[Readings](
[mrid] [bigint] NULL,
[AssetId] [int] NOT NULL,
[ReadingHours] [decimal](18, 2) NULL,
[ReadingDate] [date] NULL,
[ReadingOrderByReadingDate] [bigint] NULL,
[ReadingOrderByReadingHours] [bigint] NULL
);
INSERT INTO [dbo].[Readings]([mrid], [AssetId], [ReadingHours], [ReadingDate], [ReadingOrderByReadingDate], [ReadingOrderByReadingHours])
SELECT 1, 84019, 1227.00, '20100518 00:00:00.000', 1, 1 UNION ALL
SELECT 2, 84019, 1256.00, '20100519 00:00:00.000', 2, 2 UNION ALL
SELECT 3, 84019, 1395.00, '20100915 00:00:00.000', 3, 4 UNION ALL
SELECT 4, 84019, 1440.00, '20100930 00:00:00.000', 4, 5 UNION ALL
SELECT 5, 84019, 1490.00, '20101119 00:00:00.000', 5, 6 UNION ALL
SELECT 6, 84019, 1302.00, '20110228 00:00:00.000', 6, 3 UNION ALL
SELECT 7, 84019, 1602.00, '20110309 00:00:00.000', 7, 7 UNION ALL
SELECT 8, 84019, 1687.00, '20110630 00:00:00.000', 8, 8 UNION ALL
SELECT 9, 84019, 8935.00, '20110914 00:00:00.000', 9, 27 UNION ALL
SELECT 10, 84019, 1795.00, '20111014 00:00:00.000', 10, 9 UNION ALL
SELECT 11, 84019, 1894.00, '20120321 00:00:00.000', 11, 10 UNION ALL
SELECT 12, 84019, 1901.00, '20120330 00:00:00.000', 12, 11 UNION ALL
SELECT 13, 84019, 1919.00, '20120425 00:00:00.000', 13, 12 UNION ALL
SELECT 14, 84019, 1942.00, '20120606 00:00:00.000', 14, 13 UNION ALL
SELECT 15, 84019, 2615.00, '20120921 00:00:00.000', 15, 26 UNION ALL
SELECT 16, 84019, 2047.00, '20121023 00:00:00.000', 16, 14 UNION ALL
SELECT 17, 84019, 2057.00, '20121031 00:00:00.000', 17, 16 UNION ALL
SELECT 18, 84019, 2054.00, '20121108 00:00:00.000', 18, 15 UNION ALL
SELECT 19, 84019, 2070.00, '20121126 00:00:00.000', 19, 17 UNION ALL
SELECT 20, 84019, 2102.00, '20130211 00:00:00.000', 20, 18 UNION ALL
SELECT 21, 84019, 2106.00, '20130214 00:00:00.000', 21, 19 UNION ALL
SELECT 22, 84019, 2108.00, '20130219 00:00:00.000', 22, 20 UNION ALL
SELECT 23, 84019, 2120.00, '20130315 00:00:00.000', 23, 21 UNION ALL
SELECT 24, 84019, 2156.00, '20130520 00:00:00.000', 24, 22 UNION ALL
SELECT 25, 84019, 2162.00, '20130530 00:00:00.000', 25, 23 UNION ALL
SELECT 26, 84019, 2172.00, '20130610 00:00:00.000', 26, 24 UNION ALL
SELECT 27, 84019, 2213.00, '20131121 00:00:00.000', 27, 25
COMMIT;
RAISERROR (N'[dbo].[Readings]: Insert Batch: 1.....Done!', 10, 1) WITH NOWAIT;
GO
In this instance, readings on lines: 6, 9, & 15 should be removed. Line six is bad, period. Line 9 is bad based on the fact that lines 10 - 14 are ever increasing and probably good and would represent better data. If I imported line 9, the rest of the data would be skipped. Line 15 would be considered bad because 16 - 27 are ever increasing and would be considered good readings (and the same problem with line 9, If i import line 15, lines 16 through 27 would be skipped) They are bad values, but the only way to know they are bad are to look at all of the values in the group.
I've found a million different ways that don't work, but I'm to the point where I'm rehashing bad ideas.
Can I get some fresh ones?
March 10, 2014 at 2:10 pm
what happens for line 18?
SELECT 17, 84019, 2057.00, '20121031 00:00:00.000', 17, 16 UNION ALL
SELECT 18, 84019, 2054.00, '20121108 00:00:00.000', 18, 15 UNION ALL
SELECT 19, 84019, 2070.00, '20121126 00:00:00.000', 19, 17 UNION ALL
is this an errant row?
________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day
March 10, 2014 at 2:27 pm
This is an option using 2 deletes. I'm assuming that you don't have gaps on mrid, if you have gaps, you'll need to create an artificial key.
DELETE r
FROM Readings r
JOIN Readings r1
ON r1.mrid = r.mrid - 1
AND r1.ReadingHours > r.ReadingHours
JOIN Readings r2
ON r2.mrid = r.mrid + 1
AND r1.ReadingHours < r2.ReadingHours
DELETE r
FROM Readings r
JOIN Readings r1
ON r1.mrid = r.mrid + 1
AND r1.ReadingHours < r.ReadingHours
I'm not sure if you can avoid importing the rows instead of importing all rows and cleaning afterwards.
March 10, 2014 at 2:31 pm
Yep, line 18 was supposed to be bad as well. I missed it.
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply