March 8, 2015 at 2:38 pm
I have recently taken over a database, and I have been tasked with cleaning up the data. There are several tables that have been appended to daily for the last 3 years. The data in the relevant table is only about 0.7% unique. I have been trying to clean up the data in the tables by adding a starting and ending date to the records for reporting purposes. Here is an example of the table as it currently exists in the live database:
Table A
ID bigint identity
Col1 bigint
Col2 bigint
Col3 bigint
Statement_Date date
FileDate Date
The issue is that the files loaded over the last several years are data dumps containing historical records (though not all of them every time) along with the latest date. My desired outcome is a table like this:
Table A
Col1 bigint
Col2 bigint
Col3 bigint
Statement_Date date
FirstFileDate Date
EndFileDate Date
Where the first file date is the first time that the record combination came into the table, and the end file date is when that record value stopped being current. My first attempt seemed to work but does not take into account whether the record was removed from the source file. Below is my script.
SELECT IDENTITY(int, 1, 1) AS [id]
      ,[ACCOUNT_NUMBER]
      ,[ACCOUNT_CHARGE_TKN]
      ,[BILLING_CHARGE_TKN]
      ,[CURRENT_STMT_DATE]
      ,MIN([File_Date]) AS First_File_Date
INTO #billing_charge_dates_New
FROM [dbo].[billing_charge_dates_NEW]
GROUP BY [ACCOUNT_NUMBER]
        ,[ACCOUNT_CHARGE_TKN]
        ,[BILLING_CHARGE_TKN]
        ,[CURRENT_STMT_DATE]
ORDER BY [ACCOUNT_NUMBER]
        ,[ACCOUNT_CHARGE_TKN]
        ,[BILLING_CHARGE_TKN]
        ,[CURRENT_STMT_DATE]
GO
SELECT [ACCOUNT_NUMBER]
      ,[ACCOUNT_CHARGE_TKN]
      ,[BILLING_CHARGE_TKN]
      ,[CURRENT_STMT_DATE]
      ,First_File_Date AS StartFileDate
      ,CAST(LEAD(First_File_Date) OVER ( PARTITION BY [ACCOUNT_NUMBER]
                                                      ,[ACCOUNT_CHARGE_TKN]
                                                      ,[BILLING_CHARGE_TKN]
                                          ORDER BY [ACCOUNT_NUMBER], ID ) AS date) AS EndFileDate
INTO BillingCharge_Dates_New_2a
FROM #billing_charge_dates_New
This took my initial table from 3.2 billion rows down to 14 million. As I said, my problem is that I was not taking into account whether a record was removed from the data set. I need to set the end date to the last date that the record is valid in the table.
Any hints or advice will be greatly appreciated.
March 9, 2015 at 5:07 am
Why can't you just use MAX([File_Date]) as End_File_Date?
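Something along these lines, reusing the column names from your script (just a sketch, not tested against your data):
SELECT [ACCOUNT_NUMBER]
      ,[ACCOUNT_CHARGE_TKN]
      ,[BILLING_CHARGE_TKN]
      ,[CURRENT_STMT_DATE]
      ,MIN([File_Date]) AS First_File_Date
      ,MAX([File_Date]) AS End_File_Date   -- last file the combination appeared in
FROM [dbo].[billing_charge_dates_NEW]
GROUP BY [ACCOUNT_NUMBER]
        ,[ACCOUNT_CHARGE_TKN]
        ,[BILLING_CHARGE_TKN]
        ,[CURRENT_STMT_DATE];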
March 9, 2015 at 5:25 am
Are you looking to replace the existing table with a cleaned-up version, or do you intend to leave the existing table as-is and create a separate table or view containing the summarised data?
The SQL to summarise with first and last dates is straightforward:
SELECT col1, col2, col3, MIN(date), MAX(date) FROM TableA GROUP BY col1, col2, col3
Can you explain what you mean when you say a record was removed from the source file?
If you could provide sample data for a single account number showing the raw data and the desired output, it would help us figure out the context of your query. Ideally keep it simple with col1, col2, col3 rather than all the columns. If you invest the time to provide a script that generates the table structure and the sample data, you will get a quicker response.
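For instance, something along these lines, with made-up values just to show the shape of the data:
CREATE TABLE #SampleData
(
    Col1 bigint,
    Col2 bigint,
    Col3 bigint,
    Statement_Date date,
    FileDate date
);

INSERT INTO #SampleData (Col1, Col2, Col3, Statement_Date, FileDate)
VALUES (1, 10, 100, '20150201', '20150301'),  -- combination present in the 3/1 file
       (1, 10, 100, '20150201', '20150302'),  -- same combination repeated in the 3/2 file
       (2, 20, 200, '20150201', '20150301');  -- combination missing from the 3/2 file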
March 11, 2015 at 6:09 am
The problem is that the source data loaded into the table for the last 4 years is a CSV that contains all of the records each day. What I mean is that the file loaded on 3/2/2015 has data from 3/1/2015 back to 6/7/2011, and then the file loaded on 3/3/2015 has all the data from 3/2/2015 back to 6/7/2011. Whoever set this up is not only loading the new records, they are appending the entire file to the end of the table. But I am also finding that there are records that drop out in the middle, so they were not provided in all of the source files for some reason.
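So I am thinking the last File_Date a combination appears in is effectively its end date, and if that date is before the most recent load then the record dropped out somewhere along the way. Something like this, maybe (not tested yet):
WITH Summarised AS
(
    SELECT [ACCOUNT_NUMBER]
          ,[ACCOUNT_CHARGE_TKN]
          ,[BILLING_CHARGE_TKN]
          ,[CURRENT_STMT_DATE]
          ,MIN([File_Date]) AS FirstFileDate
          ,MAX([File_Date]) AS EndFileDate
    FROM [dbo].[billing_charge_dates_NEW]
    GROUP BY [ACCOUNT_NUMBER]
            ,[ACCOUNT_CHARGE_TKN]
            ,[BILLING_CHARGE_TKN]
            ,[CURRENT_STMT_DATE]
)
SELECT s.*
      ,CASE WHEN s.EndFileDate < (SELECT MAX([File_Date]) FROM [dbo].[billing_charge_dates_NEW])
            THEN 'Dropped out'    -- combination stopped appearing before the latest load
            ELSE 'Still current'
       END AS RecordStatus
FROM Summarised AS s;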
March 16, 2015 at 3:00 pm
CELKO (3/12/2015)
Since BIGINT is larger than the number of atoms in the universe, we do not use it very often.
No, not even close. There are at least 10^78 atoms, whereas bigint only covers just under 10^20.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
March 17, 2015 at 2:02 pm
CELKO (3/17/2015)
No, not even close. There are at least 10^78 atoms, whereas bigint only covers just under 10^20.
My bad :rolleyes:
A quick Google: it is believed that between 120 and 300 sextillion (that's 1.2 x 10²³ to 3.0 x 10²³) stars exist within our observable universe.
So if your company is doing business with the observable universe, then you need BIGINT for a key. And probably a stardate instead of a Common Era calendar 😉
According to the UK Government, BIGINT is needed to cover the number of foreigners that want to migrate to the UK, guess they are coming from all these places in the observable universe
😎
March 17, 2015 at 2:15 pm
ScottPletcher (3/16/2015)
CELKO (3/12/2015)
Since BIGINT is larger than the number of atoms in the universe, we do not use it very often.
No, not even close. There are at least 10^78 atoms, whereas bigint only covers just under 10^20.
That would be closer to the age of the universe (in seconds), one can easily confuse those numbers;-)
😎
March 18, 2015 at 5:03 pm
Eirikur Eiriksson (3/17/2015)
ScottPletcher (3/16/2015)
CELKO (3/12/2015)
Since BIGINT is larger than the number of atoms in the universe, we do not use it very often.
No, not even close. There are at least 10^78 atoms, whereas bigint only covers just under 10^20.
That would be closer to the age of the universe (in seconds), one can easily confuse those numbers;-)
😎
I love it when a forum topic drifts completely off-topic 😛
March 19, 2015 at 6:34 am
CELKO (3/12/2015)
Since BIGINT is larger than the number of atoms in the universe, we do not use it very often. What math requires this size?
The reason for using BIGINT probably has something to do with the maximum size that INT can handle.
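For what it's worth, INT tops out at about 2.1 billion, and the table in the first post was already at 3.2 billion rows before the clean-up, so a BIGINT identity column is not unreasonable here. A quick check of the two ranges:
SELECT CAST(2147483647 AS int)             AS int_max     -- 2^31 - 1, about 2.1 billion
      ,CAST(9223372036854775807 AS bigint) AS bigint_max; -- 2^63 - 1, about 9.2 * 10^18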