June 4, 2021 at 5:39 am
Not yet... It's 1:35 AM and I just got done with a work-related crunch. I have to be awake, "lively and aware", for scrum at 9, so I'm going to bed. 😀 Apologies for tonight.
My suggestions would be to look into the 3-parameter use of the undocumented xp_DirTree procedure in T-SQL, how to sequester errors in a separate file without halting a BULK INSERT load, and how to create a BCP format file to make loading your double-quoted data easier. There might be something else I'm forgetting to suggest but I just drooled on myself so it's time for me to quit for the day.
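Just to give you the flavor of those first two suggestions (the table, file, and path names here are placeholders, not anything from your system)...
EXEC master.sys.xp_DirTree 'D:\Temp\SomeFolder', 1, 1; --3-parameter form: path, recursion depth, 1 = list files too.

BULK INSERT dbo.SomeStagingTable
  FROM 'D:\Temp\SomeFolder\SomeFile.csv'
  WITH (FIRSTROW  = 2
       ,ERRORFILE = 'D:\Temp\SomeFolder\SomeFile.err' --Bad rows are sequestered here instead of halting the load.
       ,MAXERRORS = 10);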
--Jeff Moden
Change is inevitable... Change for the better is not.
June 5, 2021 at 3:44 am
Ok... I created a file that has one row for every second of the year 2020 (31,622,400 rows plus the header row; it weighs in at 956MB, almost a gigabyte). I created a "staging" table to load the file into and set it up with a clustered PK so that every row is checked for uniqueness and for the correct datatypes. I also created a BCP format file to direct the shredding so we don't need to fart around with the double quotes once the data is loaded into the 2-column staging table. It will also tolerate up to 10 data faults by default (that's the BULK INSERT MAXERRORS default). We can make it do more if you need to.
The import to the staging table with all that data checking takes just a little over 34 seconds on my laptop.
Is something like that worth your time to continue developing or not? I have no idea how long it would take PowerShell to do the same thing. I also have no clue how long it would take SSIS (although, done correctly, it can operate at roughly the same speed for the import part because it'll also use the "bulk loader").
Lemme know and I'll hammer out the rest... it's not difficult; I just don't want to waste any time if the performance of my test isn't good enough.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 6, 2021 at 12:45 pm
That sounds like something to keep developing; it sounds like a good solution.
Many Thanks!!!
June 6, 2021 at 1:15 pm
Ok... We need some details...
--Jeff Moden
Change is inevitable... Change for the better is not.
June 7, 2021 at 12:54 am
CREATE TABLE [dbo].[LineSpeed](
    [Date_Recorded] [datetime] NULL,
    [Speed] [int] NULL,
    [Oven_Line] [nvarchar](35) NULL
) ON [PRIMARY]
GO
5. SQL2012 STD.
6. No.
7. Sure, will use it.
Thanks.
June 7, 2021 at 12:55 am
I was thinking about the table, and I would like to make separate columns for Oven and Line.
June 7, 2021 at 3:29 pm
I was thinking about the table, and I would like to make separate columns for Oven and Line.
Funny thing there... I figured that early on (part of "column normalization") and was already set up to do exactly that and make that same recommendation to you. Thanks for the feedback. I'll try to finish it tonight.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 8, 2021 at 5:50 am
Hang in there. I made some good progress (fully testing along the way) but I had another late work night and it's time for bed (01:49 AM). I'll work on this more tomorrow night. It'll be worth the wait.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 11, 2021 at 1:42 am
Still on the radar?
Thanks.
June 14, 2021 at 1:15 am
Did you have any free time to complete it?
Thanks,
June 14, 2021 at 3:32 am
Sorry. I've not had much time to work on this. Working on it now. Having some fairly good success. I'm also making sure that it's fully documented so that you can figure out what it does.
--Jeff Moden
Change is inevitable... Change for the better is not.
Ok. Mostly done. We still need to add some "pretty" error handling and the move-files-after-import thing, but this is working quite nicely. It tolerates errors and does a little self-reporting on the error(s).
First, you need a BCP format file. Save it where SQL Server can get at it. I normally recommend saving it in the same folder as the files it describes. A format file isn't strictly needed but it makes things nasty fast with BULK INSERT. Here's the contents of the file...
9.0
3
1   SQLCHAR   0   1     ""          0   LeadDQuote   ""
2   SQLCHAR   0   100   "\",\""     1   stamp_time   ""
3   SQLCHAR   0   100   "\"\r\n"    2   wirespeed    ""
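For reference, the "9.0" on the first line is the format-file version, the "3" is the number of fields per row, and the eight columns in each field line are: host field order, host file datatype, prefix length, host data length, terminator, server column order (0 means the field is discarded), server column name (documentation only), and collation. So field 1 is a one-character, unterminated field that swallows each row's leading double quote, field 2 ends at the quote-comma-quote between the two values, and field 3 ends at the closing quote plus the end-of-line characters.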
Whatever you name it and wherever you save it, you need to change the related string in the code below. It should be obvious but here's the line that you need to change...
,@FileSourceBCPFMT VARCHAR(500) = 'D:\Temp\OvenLine_Staging\OvenLine.BCPFMT'
You also need to make a change in the code to point to the full path of the source files (which may be a UNC but, either way, SQL Server needs READ/WRITE privs to the path). Here's the line in the code you need to change...
DECLARE @FileSourceDirCmd VARCHAR(500) = 'DIR "D:\Temp\OvenLine_Staging\????-line*.csv" /s /a-d /b'
Without much effort, we could parameterize those two things as inputs to a stored procedure if you want to convert the code below to a stored procedure.
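If we do convert it, the shell would look something like this (the proc name is just a suggestion), with the two DECLAREs near the top of the script below becoming the parameters...
CREATE PROCEDURE dbo.Import_OvenLineFiles
         @FileSourceDirCmd VARCHAR(500)
        ,@FileSourceBCPFMT VARCHAR(500)
AS
  BEGIN
        SET NOCOUNT ON;
        --===== The body of the script below goes here, minus the two DECLAREs
             -- for @FileSourceDirCmd and @FileSourceBCPFMT.
        RAISERROR('@FileSourceDirCmd = %s',0,0,@FileSourceDirCmd);
        RAISERROR('@FileSourceBCPFMT = %s',0,0,@FileSourceBCPFMT);
  END;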
This is the "LineSpeed" table I created as the ultimate target table. The code imports to a staging table first, just to be on the safe side and to allow for any additional checks you might want to make on the imported data before inserting it into the final table.
-- DROP TABLE IF EXISTS dbo.LineSpeed;
CREATE TABLE dbo.LineSpeed
        (
         Date_Recorded  DATETIME     NOT NULL
        ,Speed          INT          NOT NULL
        ,Oven           NVARCHAR(35) NOT NULL
        ,Line           INT          NOT NULL
        ,CONSTRAINT PK_LineSpeed
         PRIMARY KEY CLUSTERED (Date_Recorded,Oven,Line)
         --WITH (FILLFACTOR = 71) --Optional FillFactor to prevent page splits as the table gets bigger.
        )
;
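As a side note, the staging step also gives you a hook for extra sanity checks before the final insert. Here's the flavor of one (the valid range is made up... substitute whatever your ovens actually allow; #Staging is the staging table created in the code below)...
SELECT st.stamp_time, st.wirespeed
  FROM #Staging st
 WHERE st.wirespeed NOT BETWEEN 0 AND 2000 --Hypothetical valid range. Anything returned here is suspect.
;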
Speaking of "the code", here it is below. You'll need sysadmin privs to run it because of the use of xp_CmdShell. If you decide to use this, I can show you how to set up the stored procedure so that folks with lesser privs can execute the code without needing such high-deity privs and without actually giving them any privs to use xp_CmdShell directly. It's a very safe method provided that we make any parameterization of paths "injection proof", which is also not difficult to do. Note that if you absolutely cannot use xp_CmdShell because of folks that don't actually know how to use it safely, we can do some workarounds. We just won't be able to do file moves (when it's time) through the stored procedure.
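For example (and this is just a sketch of one common pattern, not necessarily what I'd set up for you... certificate signing is the other usual route and is generally safer), the idea is that the proc runs as its owner, so the callers themselves never get any xp_CmdShell privs:
--===== Sketch only. For EXECUTE AS OWNER to reach xp_CmdShell, the database
     -- must be TRUSTWORTHY and owned by a sysadmin login. Proc and role names
     -- are hypothetical.
CREATE PROCEDURE dbo.GetOvenLineFileList
WITH EXECUTE AS OWNER
AS
  BEGIN
        SET NOCOUNT ON;
        EXEC master.sys.xp_CmdShell 'DIR "D:\Temp\OvenLine_Staging" /b'; --Fixed command = no injection surface.
  END;
GO
GRANT EXECUTE ON dbo.GetOvenLineFileList TO [ETL_Users]; --Hypothetical role.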
And yeah, it does actually work. I'll show you the run results right after this...
--=====================================================================================================================
-- Presets
--=====================================================================================================================
--===== Environment
SET NOCOUNT ON
;
--===== Configuration.
     -- Set the string constants to the correct paths, which may be UNCs.
     -- Note that these could be reworked a tiny bit to become parameters for a stored procedure.
DECLARE  @FileSourceDirCmd VARCHAR(500) = 'DIR "D:\Temp\OvenLine_Staging\????-line*.csv" /s /a-d /b'
        ,@FileSourceBCPFMT VARCHAR(500) = 'D:\Temp\OvenLine_Staging\OvenLine.BCPFMT'
;
--===== Constants
DECLARE  @DBar       CHAR(119) = REPLICATE('=',119)
        ,@DisplaySQL TINYINT   = 0 --0 means don't display, anything else means display.
        ,@DLine      CHAR(119) = REPLICATE('-',119)
;
--===== Variables
DECLARE  @Dummy          INT
        ,@AdvOptStatus   TINYINT
        ,@BulkCmd        VARCHAR(1000)
        ,@CmdShellStatus TINYINT
        ,@Duration       CHAR(12)
        ,@FileCount      INT
        ,@FileCounter    INT = 1
        ,@FullFileName   VARCHAR(500)
        ,@FullFilePath   VARCHAR(500)
        ,@Line           INT
        ,@Oven           VARCHAR(20)
        ,@RowCount       INT
        ,@StartDT        CHAR(24) = CONVERT(CHAR(24),GETDATE(),113)
;
--=====================================================================================================================
-- Mark the run start
--=====================================================================================================================
RAISERROR('%s',0,0,@DBar);
RAISERROR('Run Start Time = %s',0,0,@StartDT);
RAISERROR('@FileSourceDirCmd = %s',0,0,@FileSourceDirCmd);
RAISERROR('@FileSourceBCPFMT = %s',0,0,@FileSourceBCPFMT);
RAISERROR('%s',0,0,@DBar);
--=====================================================================================================================
-- Temp Tables
--=====================================================================================================================
--===== Drop any Temp Tables that exist to make reruns in SSMS easier.
IF OBJECT_ID('tempdb..#FilesToLoad') IS NOT NULL DROP TABLE #FilesToLoad;
IF OBJECT_ID('tempdb..#Staging' ) IS NOT NULL DROP TABLE #Staging;
--===== Create any Temp Tables that need to be created first.
-- This table is used to control the imports
CREATE TABLE #FilesToLoad
        (
         FileNum        INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
        ,FullFilePath   VARCHAR(500)
        ,FullFileName   AS SUBSTRING(FullFilePath
                                    ,LEN(FullFilePath)-CHARINDEX('\',REVERSE(FullFilePath))+2
                                    ,500) PERSISTED
        )
;
-- This table is what we initially load a file into for a bit of "self error checking".
-- If there's a datatype violation when we insert, it'll cause an error without us having to check manually.
-- It's very fast and it also makes sure that we don't have any duplicate per-second time stamps.
CREATE TABLE #Staging
        (
         stamp_time DATETIME2(0) NOT NULL PRIMARY KEY CLUSTERED
        ,wirespeed  SMALLINT     NOT NULL
        )
;
--=====================================================================================================================
-- Setup advanced options and xp_cmdshell
--=====================================================================================================================
--===== Remember the current status of 'Show Advanced Options'
SELECT @AdvOptStatus = CONVERT(INT,ISNULL(value,value_in_use))
FROM sys.configurations
WHERE name = 'show advanced options'
;
--===== Remember the current status of xp_CmdShell
SELECT @CmdShellStatus = CONVERT(INT,ISNULL(value,value_in_use))
FROM sys.configurations
WHERE name = 'xp_cmdshell'
;
--===== If xp_CmdShell is disabled, temporarily enable it.
IF @CmdShellStatus = 0
  BEGIN
        EXEC sp_configure 'show advanced options',1; RECONFIGURE;
        EXEC sp_configure 'xp_cmdshell'           ,1; RECONFIGURE;
  END
;
--===== Print a display separator
RAISERROR('%s',0,0,@DLine);
--=====================================================================================================================
-- Get the filepath information we need (ie. Load the file names we want to import)
--=====================================================================================================================
--===== Get the full path information for the files we want to load.
INSERT INTO #FilesToLoad
(FullFilePath)
EXEC xp_CmdShell @FileSourceDirCmd
;
--===== Remember how many files we need to import
SELECT @FileCount = MAX(FileNum)
FROM #FilesToLoad
WHERE FullFilePath > '' --Eliminates nulls at the end and any blanks (don't expect any blanks, though).
;
--===== Display the file information that we'll be loading from
RAISERROR('Loading the following files (see the grid for the list)...',0,0);
SELECT *
FROM #FilesToLoad
WHERE FileNum <= @FileCount
;
--===== Print a display separator
RAISERROR('%s',0,0,@DLine)
;
--=====================================================================================================================
-- Load each file that appears in the #FilesToLoad table
--=====================================================================================================================
WHILE @FileCounter <= @FileCount
  BEGIN
        --===== Start the timer
        SELECT @StartDT = CONVERT(CHAR(24),GETDATE(),113) --dd mon yyyy hh:mi:ss:mmm (24h)
        ;
        --===== Print the start date time for this file load.
        RAISERROR('StartDT: %s',0,0,@StartDT)
        ;
        --===== Make sure the staging table is empty.
        TRUNCATE TABLE #Staging
        ;
        --===== Create the BULK INSERT command for the current file.
             -- Note that the dynamic SQL is NOT indented; that's just to make the output to the screen pretty.
             -- Also note that an unexpected end-of-file error may still cause an abort.
             -- We need to work on that but other errors provide an alert and the file still loads
             -- except for the bad line and, possibly, the line after the bad line.
        SELECT  @FullFilePath = FullFilePath
               ,@FullFileName = FullFileName
               ,@BulkCmd = REPLACE(REPLACE(REPLACE(REPLACE('
BULK INSERT #Staging
FROM <<qFullFilePath>>
WITH (
CODEPAGE = "RAW"
,FIRSTROW = 2
,FORMATFILE = <<q@FileSourceBCPFMT>>
,MAXERRORS = 2000000000 --Basically, keep loading no matter how many errors we get.
,TABLOCK
);'
        -- Other end of REPLACEs
        ,'<<FullFilePath>>'      ,FullFilePath)
        ,'<<qFullFilePath>>'     ,QUOTENAME(FullFilePath,''''))
        ,'<<q@FileSourceBCPFMT>>',QUOTENAME(@FileSourceBCPFMT,''''))
        ,'"'                     ,'''') --Replaces double quotes with single quotes.
          FROM #FilesToLoad
         WHERE FileNum = @FileCounter
        ;
        --===== Do the import (load) to the staging table.
             -- Note that any errors in this area are displayed but they don't stop the import of the file.
        RAISERROR('Working on file %u of %u files: %s...',0,0,@FileCounter,@FileCount,@FullFilePath) WITH NOWAIT;
        IF @DisplaySQL <> 0 RAISERROR('%s',0,0,@BulkCmd) WITH NOWAIT;
        EXEC(@BulkCmd)
        ;
        --===== Display how many rows we loaded into the staging table.
             -- Don't put anything between this SELECT and the EXEC(@BulkCmd) above.
        SELECT @RowCount = @@ROWCOUNT;
        RAISERROR('%u Rows loaded from %s into #Staging.',0,0,@RowCount,@FullFilePath)
        ;
        --===== Get the Oven and Line info from the file name.
             -- The REPLACE turns a name like "f300-line1.csv" into "f300.1.csv" so that
             -- PARSENAME can pick the Oven ("f300") and Line ("1") out of the dot-delimited parts.
        SELECT  @Oven = PARSENAME(ol.OvenLine,3)
               ,@Line = PARSENAME(ol.OvenLine,2)
          FROM (SELECT REPLACE(@FullFileName,'-Line','.'))ol(OvenLine)
        ;
        --===== Copy the cleaned staging table data into the final LineSpeed table
             -- including the Oven and Line notation.
        INSERT INTO dbo.LineSpeed WITH(TABLOCK) --TABLOCK is to take advantage of any "Minimal Logging" available.
               (Date_Recorded,Speed,Oven,Line)
        SELECT  Date_Recorded = stamp_time
               ,Speed         = wirespeed
               ,Oven          = @Oven
               ,Line          = @Line
          FROM #Staging
         ORDER BY Date_Recorded, Oven, Line --To take advantage of any "Minimal Logging" available.
        OPTION (RECOMPILE, MAXDOP 1)        --To take advantage of any "Minimal Logging" available.
        ;
        --===== Display how many rows we loaded from the staging table to the LineSpeed table.
             -- Don't put anything between this SELECT and the INSERT/SELECT above.
             -- We'll also display the duration in hh:mi:ss:mmm (24h).
        SELECT  @RowCount = @@ROWCOUNT
               ,@Duration = CONVERT(CHAR(12),GETDATE()-@StartDT,114) --hh:mi:ss:mmm (24h)
        ;
        RAISERROR('%u Rows loaded from #Staging into dbo.LineSpeed.',0,0,@RowCount);
        RAISERROR('Duration: %s',0,0,@Duration);
        RAISERROR('%s',0,0,@DLine)
        ;
        --===== Bump the loop counter.
        SELECT @FileCounter += 1
        ;
  END
;
--=====================================================================================================================
-- Change the advanced options and xp_CmdShell back to their original state.
--=====================================================================================================================
--===== Set both options back to their original condition just to keep everyone happy.
EXEC sp_configure 'xp_cmdshell' ,@CmdShellStatus; RECONFIGURE;
EXEC sp_configure 'show advanced options',@AdvOptStatus ; RECONFIGURE;
--===== Display the "Run Complete" message
RAISERROR('%s',0,0,@DBAR);
RAISERROR('RUN COMPLETE.',0,0);
GO
And here are the run results. I made a much larger first file just to demonstrate how fast all this is (more than 31 million rows womb-to-tomb in just over a minute). The second file has two intentional errors in it just to show what happens and that the load continues despite the errors. The third file is a normal good file.
The second and third files each have 50,000 rows and each loads in less than 200 milliseconds womb-to-tomb unless there's an error in the file. Errors will, of course, take a little longer but it's still nasty fast. You should have no problems even if you have hundreds of such files unless you have hundreds of errors in the files.
=======================================================================================================================
Run Start Time = 14 Jun 2021 02:00:55:277
@FileSourceDirCmd = DIR "D:\Temp\OvenLine_Staging\????-line*.csv" /s /a-d /b
@FileSourceBCPFMT = D:\Temp\OvenLine_Staging\OvenLine.BCPFMT
=======================================================================================================================
-----------------------------------------------------------------------------------------------------------------------
Loading the following files (see the grid for the list)...
-----------------------------------------------------------------------------------------------------------------------
StartDT: 14 Jun 2021 02:00:55:317
Working on file 1 of 3 files: D:\Temp\OvenLine_Staging\f300-line1.csv...
31622400 Rows loaded from D:\Temp\OvenLine_Staging\f300-line1.csv into #Staging.
31622400 Rows loaded from #Staging into dbo.LineSpeed.
Duration: 00:01:07:787
-----------------------------------------------------------------------------------------------------------------------
StartDT: 14 Jun 2021 02:02:03:103
Working on file 2 of 3 files: D:\Temp\OvenLine_Staging\f300-line2.csv...
Msg 4864, Level 16, State 1, Line 16
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 15, column 2 (stamp_time).
Msg 4864, Level 16, State 1, Line 16
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 25, column 2 (stamp_time).
49997 Rows loaded from D:\Temp\OvenLine_Staging\f300-line2.csv into #Staging.
49997 Rows loaded from #Staging into dbo.LineSpeed.
Duration: 00:00:00:597
-----------------------------------------------------------------------------------------------------------------------
StartDT: 14 Jun 2021 02:02:03:700
Working on file 3 of 3 files: D:\Temp\OvenLine_Staging\f300-line3.csv...
50000 Rows loaded from D:\Temp\OvenLine_Staging\f300-line3.csv into #Staging.
50000 Rows loaded from #Staging into dbo.LineSpeed.
Duration: 00:00:00:190
-----------------------------------------------------------------------------------------------------------------------
Configuration option 'xp_cmdshell' changed from 1 to 1. Run the RECONFIGURE statement to install.
Configuration option 'show advanced options' changed from 1 to 1. Run the RECONFIGURE statement to install.
=======================================================================================================================
RUN COMPLETE.
Notice that the code remembers whether advanced options and xp_CmdShell were enabled, enables them if needed, uses them, and changes them back to their original condition at the end of the run.
If you have any questions, please read the comments in the code first. If you can't science out an answer for your question after that, then by all means, fire a flare on this thread and we'll see what we can do.
And, like I said, we still have to do the file moves but I figured this would be a really good start and you can let me know if you want to continue in this direction. Including creating the test data and writing all of the code and testing the bejeezus out of it, and writing it all up for good understanding, this only took about 5 hours to do. Changes will be fairly simple now that the code has a keel.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 14, 2021 at 3:17 pm
Awesome!!!!...
I'll get some time today to run it through its paces ...
Many Thanks
June 15, 2021 at 2:12 am
That was a super-fast load process... Just wanted to say thanks again... very impressive and well-documented SP.
Just a side question: what process do you use to load .xls files into SQL?
Thanks
June 15, 2021 at 2:37 am
That was a super-fast load process... Just wanted to say thanks again... very impressive and well-documented SP.
Just a side question: what process do you use to load .xls files into SQL? Thanks
It would be quite similar except I use the ACE drivers with OpenRowSet. Some folks use a linked server for each file but I've found that to be a fair bit slower.
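Just to give you the flavor of it (the provider version, path, and sheet name below are placeholders; you need the ACE provider installed and "Ad Hoc Distributed Queries" enabled for this to work)...
SELECT xl.*
  FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0'
                 ,'Excel 12.0;Database=D:\Temp\SomeWorkbook.xlsx;HDR=YES'
                 ,'SELECT * FROM [Sheet1$]') xl
;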
There's also the subject of what form the source spreadsheet is in. I have some code that auto-magically "figures that out" and unpivots the typical "horizontal spreadsheet" into EAV tables without having to know the names of the columns, etc. I wouldn't post anything on that, though. Instead, I'd likely give you the PowerPoint presentation I made on the subject of "Excel Hell" and the code and say "have at it".
--Jeff Moden
Change is inevitable... Change for the better is not.