April 22, 2010 at 12:20 am
eseosaoregie (4/22/2010)
How did you replace my indexes with a clustered index? Were they added to the apllication and user table?
DROP INDEX [IX_USER] ON [dbo].[dimUser]
GO
DROP INDEX [IX_Application] ON [dbo].[dimApplication]
GO
DROP INDEX [IX_FirstFour] ON stg_LogFiles
--
CREATE CLUSTERED INDEX [IX_USER] ON [dbo].[dimUser]([UserName]) ON [PRIMARY]
GO
CREATE CLUSTERED INDEX [IX_Application] ON [dbo].[dimApplication]([Application]) ON [PRIMARY]
GO
CREATE CLUSTERED INDEX [IX_FirstFour] ON stg_LogFiles ([Firstfour]) ON [PRIMARY]
--
-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)
April 22, 2010 at 12:26 am
Why does the clustered index improve performance. More specifically, what is the clustered index doing that is different? I will try this once I get into work.
Also Jeff Moden earlier in the post suggested using a split function. I am just interested in how that might improve perfromance. Does it prevent the optimizer from perfoming a table scan?
April 22, 2010 at 12:32 am
eseosaoregie (4/22/2010)
Why does the clustered index improve performance. More specifically, what is the clustered index doing that is different? I will try this once I get into work.
Create clus index
then
SET STATISTICS TIME ON
your query
SET STATISTICS TIME OFF
see if you can find some improvement.
-------Bhuvnesh----------
I work only to learn Sql Server...though my company pays me for getting their stuff done;-)
April 22, 2010 at 6:30 am
eseosaoregie (4/21/2010)
I have now added sample data along with all the code for tables and indexes etc. Hopefully this will make things clearer
---tables
CREATE TABLE [stg_LogFiles] (
[LineRead] [varchar] (100) NULL ,
[Importeddate] [datetime] DEFAULT (getdate()),
[Firstfour] AS (substring([LineRead],2,4))
) ON [PRIMARY]
GO
CREATE TABLE [fct_Logs] (
[LogDate] [datetime] NULL ,
[Application] [int] NULL ,
[Database] [int] NULL ,
[User] [int] NULL ,
[Retrieval] [int] NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[dimApplication] (
[Application] [varchar] (5) NULL ,
[ApplicationID] [int] IDENTITY (1, 1) NOT NULL ,
[Database] [varchar] (6) NULL ,
[DatabaseID] [int] NULL
) ON [PRIMARY]
--user table
CREATE TABLE [dbo].[dimUser] (
[UserID] [int] IDENTITY (1, 1) NOT NULL ,
[UserName] [nvarchar] (10) NULL ,
[Country] [nvarchar] (20) NULL ,
[CountryID] [int] NULL
) ON [PRIMARY]
GO
--index for user table
CREATE INDEX [IX_USER] ON [dbo].[dimUser]([UserName]) ON [PRIMARY]
GO
--index for application table
CREATE INDEX [IX_Application] ON [dbo].[dimApplication]([Application]) ON [PRIMARY]
GO
--index for computed column on stg_LogFiles table
CREATE INDEX [IX_FirstFour] ON stg_LogFiles ([Firstfour])
ON [PRIMARY]
--data for application table
INSERT INTO [dimApplication]
SELECT 'GMR08','1','OLTCHK','1' UNION ALL
SELECT 'GMR09','2','OLTCHK','1' UNION ALL
SELECT 'GMR10','3','OLTCHK','1'
--data for user table
INSERT INTO [dimUser]
SELECT '1','KRRHEJ','Venezuela','43' UNION ALL
SELECT '2','SGTAYL','Venezuela','43' UNION ALL
SELECT '3','PHGANI','Venezuela','43' UNION ALL
SELECT '4','ROBAIR','Venezuela','43' UNION ALL
SELECT '5','silven','Venezuela','43' UNION ALL
SELECT '6','COAGUJ','Venezuela','43' UNION ALL
SELECT '7','USFITT','Venezuela','43' UNION ALL
SELECT '8','admin','Netherlands','21'
--data for stg_LogFiles table
INSERT INTO [dbo].[stg_LogFiles] (LineRead)
SELECT '[Mon Jan 05 13:51:05 2009]Local/GMR09///Info(1013205)' UNION ALL
SELECT '[Mon Jan 05 13:51:18 2009]Local/GMR09///Info(1013210)' UNION ALL
SELECT '[Mon Jan 05 13:51:30 2009]Local/GMR09/OTLCHK/admin/Info(1021000)' UNION ALL
SELECT '[Mon Jan 05 13:52:59 2009]Local/GMR09/OTLCHK/admin/Info(1021000)' UNION ALL
SELECT '[Mon Jan 05 13:53:19 2009]Local/GMR09/OTLCHK/admin/Info(1021000)' UNION ALL
SELECT '[Mon Jan 05 13:53:28 2009]Local/GMR09///Info(1013210)' UNION ALL
SELECT '[Mon Jan 05 13:53:40 2009]Local/GMR09///Info(1013210)'
---the query
INSERT INTO [ESSBASE_TEST_DW].[dbo].[fct_Logs]([LogDate], [Application], [Database], [User], [Retrieval])
(SELECT
CONVERT(datetime,SUBSTRING(B.[Date],5,7)+ RIGHT(B.[Date],4) + SUBSTRING(B.[Date],11,9),108),
A.ApplicationID,
A.DatabaseID,
U.UserID,
1
FROM
(SELECT
SUBSTRING(LineRead,2,24) AS [Date]
,SUBSTRING(
LineRead,
CHARINDEX('/',LineRead)+1,
CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)-1 - CHARINDEX('/',LineRead)) AS [Application]
,SUBSTRING(
LineRead,
CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)+1,
CHARINDEX('/',LineRead,CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)+1)-1
- CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)) AS [Database]
,SUBSTRING(
LineRead,
CHARINDEX('/',LineRead,CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)+1) +1,
CHARINDEX('/',LineRead,CHARINDEX('/',LineRead,CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)+1)+1)- 1 -
CHARINDEX('/',LineRead,CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1) +1)) AS [User]
FROM dbo.stg_LogFiles
WHERE
--CHARINDEX('[',LineRead) = 1
--AND
SUBSTRING(LineRead,
CHARINDEX('/',LineRead,CHARINDEX('/',LineRead)+ 1)
,2) <> '//' -- ONLY SELECT records where user id is present
AND
(Firstfour = 'mon ' or Firstfour = 'tue ' or Firstfour = 'wed ' or Firstfour = 'thu ' or Firstfour = 'fri ' or Firstfour = 'sun ' or Firstfour = 'sat ')
)B
LEFT JOIN dbo.dimApplication A ON A.Application = B.Application
LEFT JOIN dbo.dimUser U ON U.UserName = B.[User]
)
A fair number of clock cycles have been expended in code trying to determine if characters 2 through 4 are a 3 letter character representation of a week day so let me ask this... are there any rows in the stg_LogFiles table where those characters DON'T actually contain the 3 letter character representation of a week day at characters 2 through 4??? If so, would you include those in the test data, as well, please? And, yes... it's very important so I can set up a large amount of test data to find the most efficient method for the split you're trying to do.
I also notice that all of the "rows of interest" that need to be split seem to have precisely 4 slashes in them. Will that ALWAYS be the case for these "rows of interest"?
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2010 at 6:40 am
bc_ (4/21/2010)
try:
comp_column IN ('mon ','tue ','wed ','thu ','fri ','sat ','sun ')
Why? comp_column only contains 3 characters according to the OP's posted code.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2010 at 6:42 am
Ah... almost forgot... I also need to know how many rows are in the staging table that you've been testing with and how long the split runs have been.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2010 at 7:14 am
There are rows which do not contain the 3 letter character representation of a week day at characters 2 through 4. They are:
'[Spreadsheet Extractor Big Block Allocs -- Dyn.Calc.Cache : [15] non-Dyn.Calc.Cache : [0]]'
'[Clear Active on User [NLMAJS] Instance [1]'
'[The sheet contains an unknown member: IS: charged by R-share - Manufacturing.]'
'[The sheet contains an unknown member: A7057120.]'
I have been testing with rought 17million rows. The query is in a stored procedure which takes bout 7 1/2mins
April 22, 2010 at 7:16 am
Jeff Moden (4/22/2010)
bc_ (4/21/2010)
try:
comp_column IN ('mon ','tue ','wed ','thu ','fri ','sat ','sun ')
Why? comp_column only contains 3 characters according to the OP's posted code.
eseosaoregie (4/21/2010)
I created the computed column and the index on that column. The computed column is as follows
--Add Computed Column
ALTER TABLE MyTable
ADD comp_column AS substring(LineRead,2,4) varchar
--Create Column
CREATE INDEX IX_COMPColumn ON MyTable(comp_column)
However when I run the query it takes the same time. How can I alter the where clause syntax such that it can utilize the index on the computed column?
I take that to be 4 characters in length.
The original post had spaces after each day and he labeled the column as "Firstfour" later on as well. Not sure what other data is present in the column, but possibly to prevent picking up other strings like 'sunny','money','thus', etc.
[font="Arial Narrow"]bc[/font]
April 22, 2010 at 9:48 am
eseosaoregie (4/22/2010)
There are rows which do not contain the 3 letter character representation of a week day at characters 2 through 4. They are:'[Spreadsheet Extractor Big Block Allocs -- Dyn.Calc.Cache : [15] non-Dyn.Calc.Cache : [0]]'
'[Clear Active on User [NLMAJS] Instance [1]'
'[The sheet contains an unknown member: IS: charged by R-share - Manufacturing.]'
'[The sheet contains an unknown member: A7057120.]'
I have been testing with rought 17million rows. The query is in a stored procedure which takes bout 7 1/2mins
Thanks... I'm at work so I can't get to this until tonight (ie: after work). I could do it on a break but I need to write code to gen a couple of million rows and I don't believe they'd appreciate me hammering even on the test box. 😛
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2010 at 10:10 am
Thanks. I will wait till later.
April 22, 2010 at 8:07 pm
{edit} Just noticed a boo-boo... had to take the code I posted down. I'll be back soon.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 22, 2010 at 8:35 pm
Let's try this again...
Here's the test harness I used... make sure you read the comments because it builds 17 million test rows in something less than 10 minutes...
-- WARNING... USE THIS LINE AT YOUR OWN RISK, drop table [stg_LogFiles],[fct_Logs],[dimApplication],[dimUser]
---tables (Time to build test table on my machine... 00:09:35)
-- Note that this test setup build 17 million rows.
go
--===== I changed the calculated column with a little "Modecan Tweekin" here.
-- As a side bar, it also calculates the day of the week in case you need it.
-- second part of the query makes the DoW negative if any information is skipped
-- according to having two slashes together. And, it's NASTY FAST.
CREATE TABLE dbo.stg_LogFiles
(
LineRead varchar (100) NULL ,
Importeddate datetime DEFAULT (getdate()),
DoW AS ((CHARINDEX(LEFT(LineRead,5),'[Mon [Tue [Wed [Thu [Fri [Sat [Sun ')-1)/5+1)
* SIGN(CHARINDEX('/',LineRead)-CHARINDEX('//',LineRead)) PERSISTED
)
GO
CREATE TABLE [fct_Logs] (
[LogDate] [datetime] NULL ,
[Application] [int] NULL ,
[Database] [int] NULL ,
[User] [int] NULL ,
[Retrieval] [int] NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[dimApplication] (
[Application] [varchar] (5) NULL ,
[ApplicationID] [int] NOT NULL PRIMARY KEY CLUSTERED, --I added this PK, took IDENTITY off because of the inserts
[Database] [varchar] (6) NULL ,
[DatabaseID] [int] NULL
) ON [PRIMARY]
--user table
CREATE TABLE [dbo].[dimUser] (
[UserID] [int] NOT NULL PRIMARY KEY CLUSTERED, --I added this PK, took IDENTITY off because of the inserts
[UserName] [nvarchar] (10) NULL ,
[Country] [nvarchar] (20) NULL ,
[CountryID] [int] NULL
) ON [PRIMARY]
GO
--index for user table
CREATE INDEX [IX_USER] ON [dbo].[dimUser]([UserName]) ON [PRIMARY]
GO
--index for application table
CREATE INDEX [IX_Application] ON [dbo].[dimApplication]([Application]) ON [PRIMARY]
GO
--index for computed column on stg_LogFiles table
--===== Note that we don't need this any more.
--CREATE INDEX [IX_FirstFour] ON stg_LogFiles ([Firstfour])
--ON [PRIMARY]
--data for application table
INSERT INTO [dimApplication]
SELECT 'GMR08','1','OLTCHK','1' UNION ALL
SELECT 'GMR09','2','OLTCHK','1' UNION ALL
SELECT 'GMR10','3','OLTCHK','1'
--data for user table
INSERT INTO [dimUser]
SELECT '1','KRRHEJ','Venezuela','43' UNION ALL
SELECT '2','SGTAYL','Venezuela','43' UNION ALL
SELECT '3','PHGANI','Venezuela','43' UNION ALL
SELECT '4','ROBAIR','Venezuela','43' UNION ALL
SELECT '5','silven','Venezuela','43' UNION ALL
SELECT '6','COAGUJ','Venezuela','43' UNION ALL
SELECT '7','USFITT','Venezuela','43' UNION ALL
SELECT '8','admin','Netherlands','21'
--data for stg_LogFiles table
--===== I modified this a bit so I could build a 17 million rows in under 10 minutes.
-- I also included the "odd" lines that have no DoW at characters 2-4.
INSERT INTO dbo.stg_LogFiles (LineRead)
SELECT TOP (17000000) d.LineRead
FROM (
SELECT '[Mon Jan 05 13:51:05 2009]Local/GMR09///Info(1013205)' UNION ALL
SELECT '[Mon Jan 05 13:51:18 2009]Local/GMR09///Info(1013210)' UNION ALL
SELECT '[Spreadsheet Extractor Big Block Allocs -- Dyn.Calc.Cache : [15] non-Dyn.Calc.Cache : [0]]' UNION ALL
SELECT '[Mon Jan 05 13:51:30 2009]Local/GMR09/OTLCHK/admin/Info(1021000)' UNION ALL
SELECT '[Clear Active on User [NLMAJS] Instance [1]' UNION ALL
SELECT '[Tue Jan 06 13:52:59 2009]Local/GMR09/OTLCHK/admin/Info(1021000)' UNION ALL
SELECT '[Wed Jan 07 13:53:19 2009]Local/GMR09/OTLCHK/admin/Info(1021000)' UNION ALL
SELECT '[Clear Active on User [NLMAJS] Instance [1]' UNION ALL
SELECT '[Thu Jan 08 13:53:28 2009]Local/GMR09///Info(1013210)' UNION ALL
SELECT '[Sun Jan 11 13:53:40 2009]Local/GMR09///Info(1013210)' UNION ALL
SELECT '[The sheet contains an unknown member: IS: charged by R-share - Manufacturing.]' UNION ALL
SELECT '[The sheet contains an unknown member: A7057120.]'
) d (LineRead),
Master.sys.All_Columns ac1,
Master.sys.All_Columns ac2
GO
... and here's the code to solve the problem. It runs in about 4 minutes on my 8 year old single 1.8Ghz 1GB RAM desktop. It should scream on your server. Again, read the comments... they're important for similar problems in the future... hint... "Divide'n'Conquer" makes code shorter, more readable, more maintainable, and faster... a lot faster. 😉
--the query (This runs in just under 4 minutes on my machine which is an 8 year old desktop with a single 1.8GHz
-- CPU and 1GB of RAM. This should scream on your server. Notice how the "Divide'n'Conquer" method
-- has made this code a lot easier to read and maintain. It's also made it quite a bit faster.
-- Notice that I didn't have to do a bunch of juggling on the date conversion.
-- You've just gotta try everything instead of assuming that something doesn't work.
WITH --===== This is where I made ALL the changes except for the calculated column on the staging table
cteFirstSplit AS
( --=== This does the proper split upto the first slash and carries the rest forward
SELECT CAST(SUBSTRING(LineRead,6,20) AS DATETIME) AS [LogDate],
SUBSTRING(LineRead, CHARINDEX('/',LineRead)+1, 8000) AS [TheRest]
FROM dbo.stg_LogFiles
WHERE DoW > 0 --Also filters out NULLs which can't be compared this way
)
,
cteSecondSplit AS
( --=== This splits out Application and carries the rest forward
SELECT [LogDate],
SUBSTRING(TheRest, 1, CHARINDEX('/',TheRest)-1) AS [Application],
SUBSTRING(TheRest, CHARINDEX('/',TheRest)+1, 8000) AS [TheRest]
FROM cteFirstSplit
)
,
cteThirdSplit AS
( --=== This splits out DataBase and carries the rest forward
SELECT [LogDate], [Application],
SUBSTRING(TheRest, 1, CHARINDEX('/',TheRest)-1) AS [DataBase],
SUBSTRING(TheRest, CHARINDEX('/',TheRest)+1, 8000) AS [TheRest]
FROM cteSecondSplit
)
,
cteFourthSplit AS
( --==== This splits out the User and forgets "TheRest"
SELECT [LogDate], [Application], [DataBase],
SUBSTRING(TheRest, 1, CHARINDEX('/',TheRest)-1) AS [User]
FROM cteThirdSplit
)
INSERT INTO [dbo].[fct_Logs]
([LogDate], [Application], [Database], [User], [Retrieval])
SELECT B.LogDate,
A.ApplicationID,
A.DatabaseID,
U.UserID,
1 AS Retreival
FROM cteFourthSplit AS B
LEFT JOIN dbo.dimApplication A ON A.Application = B.Application --Original Join
LEFT JOIN dbo.dimUser U ON U.UserName = B.[User] --Original Join
As a side bar, notice that I used NO indexes on the staging table. Even a clustered index didn't add much value and it took several minutes to build.
--Jeff Moden
Change is inevitable... Change for the better is not.
April 23, 2010 at 11:05 am
Thanks jeff. Will try it this weekend and let you know how it runs. Much appreciated
April 23, 2010 at 11:47 am
Hi Jeff,
Just ran this at work on some of the data and it the first CTE failed because it could convert the string type to datetime.
After running some queries to find the row I found a row containing
'[AAB Locations / Hyperion Enterprise Structure ] is an invalid member name in database [OTLCHK]'
I hadn't come across text this before and I am sure there are probably more amongst the millions of rows. How can I amend the computed column to cater for rows like this?
April 23, 2010 at 12:55 pm
eseosaoregie (4/23/2010)
Hi Jeff,Just ran this at work on some of the data and it the first CTE failed because it could convert the string type to datetime.
After running some queries to find the row I found a row containing
'[AAB Locations / Hyperion Enterprise Structure ] is an invalid member name in database [OTLCHK]'
I hadn't come across text this before and I am sure there are probably more amongst the millions of rows. How can I amend the computed column to cater for rows like this?
The problem is that I tried to get too fancy. We have a couple of changes to make in anticipation of other data "anomolies"...
First, change the calculated column in the staging table to the following (instead of Dow)...
LineTest AS CHARINDEX(LEFT(LineRead,5),'[Mon [Tue [Wed [Thu [Fri [Sat [Sun ')
* SIGN(CHARINDEX('/',LineRead)-CHARINDEX('//',LineRead)) PERSISTED
Second, change the final code to the following (again, a change away from DoW and using LineTest, instead)...
--the query (This runs in just under 4 minutes on my machine which is an 8 year old desktop with a single 1.8GHz
-- CPU and 1GB of RAM. This should scream on your server. Notice how the "Divide'n'Conquer" method
-- has made this code a lot easier to read and maintain. It's also made it quite a bit faster.
-- Notice that I didn't have to do a bunch of juggling on the date conversion.
-- You've just gotta try everything instead of assuming that something doesn't work.
WITH --===== This is where I made ALL the changes except for the calculated column on the staging table
cteFirstSplit AS
( --=== This does the proper split upto the first slash and carries the rest forward
SELECT CAST(SUBSTRING(LineRead,6,20) AS DATETIME) AS [LogDate],
SUBSTRING(LineRead, CHARINDEX('/',LineRead)+1, 8000) AS [TheRest]
FROM dbo.stg_LogFiles
WHERE LineTest > 0 --Also filters out NULLs which can't be compared this way
)
,
cteSecondSplit AS
( --=== This splits out Application and carries the rest forward
SELECT [LogDate],
SUBSTRING(TheRest, 1, CHARINDEX('/',TheRest)-1) AS [Application],
SUBSTRING(TheRest, CHARINDEX('/',TheRest)+1, 8000) AS [TheRest]
FROM cteFirstSplit
)
,
cteThirdSplit AS
( --=== This splits out DataBase and carries the rest forward
SELECT [LogDate], [Application],
SUBSTRING(TheRest, 1, CHARINDEX('/',TheRest)-1) AS [DataBase],
SUBSTRING(TheRest, CHARINDEX('/',TheRest)+1, 8000) AS [TheRest]
FROM cteSecondSplit
)
,
cteFourthSplit AS
( --==== This splits out the User and forgets "TheRest"
SELECT [LogDate], [Application], [DataBase],
SUBSTRING(TheRest, 1, CHARINDEX('/',TheRest)-1) AS [User]
FROM cteThirdSplit
)
INSERT INTO [dbo].[fct_Logs]
([LogDate], [Application], [Database], [User], [Retrieval])
SELECT B.LogDate,
A.ApplicationID,
A.DatabaseID,
U.UserID,
1 AS Retreival
FROM cteFourthSplit AS B
LEFT JOIN dbo.dimApplication A ON A.Application = B.Application --Original Join
LEFT JOIN dbo.dimUser U ON U.UserName = B.[User] --Original Join
Let me know if you have any other problems.
--Jeff Moden
Change is inevitable... Change for the better is not.
Viewing 15 posts - 16 through 30 (of 33 total)
You must be logged in to reply to this topic. Login to reply