June 29, 2017 at 10:52 am
Lynn Pettis - Thursday, June 29, 2017 9:31 AM
Chris Harshman - Wednesday, June 28, 2017 2:53 PM
unfortunately had to write one this week... look through rows of a document table for the last 48 hours and compare the FileName column to the results of xp_fileexist to try to find missing documents (without waiting for the users to say "hey... the file doesn't open!")
Just thinking that there may be a better way but would need more details.
I considered getting a list of all files at once, then comparing it to the document table, but the problem is there are 5 years or more of documents out on the file share, with several thousand generated in an average day. I suppose I could hack together something in .NET code, but it would still be loop-oriented instead of set-based.
June 29, 2017 at 11:14 am
Chris Harshman - Thursday, June 29, 2017 10:52 AM
I considered getting a list of all files at once, then comparing it to the document table, but the problem is there are 5 years or more of documents out on the file share, with several thousand generated in an average day. I suppose I could hack together something in .NET code, but it would still be loop-oriented instead of set-based.
That's exactly what came to mind when you described the original problem, but thousands of files per day * 5 years = a whole lot of files. Does the folder even behave moderately well in Windows Explorer, or is it unusable yet?
June 29, 2017 at 12:10 pm
Ed Wagner - Thursday, June 29, 2017 11:14 AM
That's exactly what came to mind when you described the original problem, but thousands of files per day * 5 years = a whole lot of files. Does the folder even behave moderately well in Windows Explorer, or is it unusable yet?
There are several sub-folders within the main document folder on the file share, but some of those are a bit painful to open; you get the scrolling green bar for a couple of minutes on some. :crazy: I'm just glad the person who designed this originally did split out the documents by department and process, and then kept them on a separate file server away from the database!
June 29, 2017 at 12:24 pm
Chris Harshman - Thursday, June 29, 2017 10:52 AM
I considered getting a list of all files at once, then comparing it to the document table, but the problem is there are 5 years or more of documents out on the file share, with several thousand generated in an average day. I suppose I could hack together something in .NET code, but it would still be loop-oriented instead of set-based.
Quick question: do you need to compare only the filename, or the entire file path and name?
June 29, 2017 at 12:30 pm
Lynn Pettis - Thursday, June 29, 2017 12:24 PM
Quick question: do you need to compare only the filename, or the entire file path and name?
Since there are subfolders, and there is a chance (slim but could happen) that the same filename may exist in more than one subfolder, I should verify the whole path and name.
June 29, 2017 at 1:18 pm
Chris Harshman - Thursday, June 29, 2017 12:30 PM
Since there are subfolders, and there is a chance (slim but could happen) that the same filename may exist in more than one subfolder, I should verify the whole path and name.
I assume that if a file won't open it is because the file exists in the table but not on disk, correct?
June 29, 2017 at 1:29 pm
Chris Harshman - Thursday, June 29, 2017 12:30 PM
Since there are subfolders, and there is a chance (slim but could happen) that the same filename may exist in more than one subfolder, I should verify the whole path and name.
Just a thought about how this could be done:
CREATE TABLE dbo.Test(
    Id INT IDENTITY(1,1)
  , DocumentFileName NVARCHAR(4000)
  , FileHash AS CAST(HASHBYTES('MD5', DocumentFileName) AS VARBINARY(16)) PERSISTED -- MD5 is a 16-byte value
);

INSERT INTO dbo.Test (DocumentFileName)
EXEC xp_cmdshell 'dir /b /s c:\*.*'; -- not sure if this will work with a UNC path, but give it a shot

DELETE FROM dbo.Test WHERE DocumentFileName IS NULL; -- xp_cmdshell output includes NULL rows

-- index the compact 16-byte hash; the NVARCHAR(4000) column itself is too wide to use as an index key
CREATE NONCLUSTERED INDEX idx1 ON dbo.Test (FileHash);

SELECT * FROM dbo.Test AS t;

SELECT *
FROM dbo.Media AS m -- or whatever your table is named
WHERE m.DateCreated >= DATEADD(HOUR, -48, GETDATE()) -- or whatever column you use to reduce the data set
  AND NOT EXISTS (SELECT 1
                  FROM dbo.Test AS t
                  WHERE t.FileHash = m.FileHash -- assumes you have or can put a hash on your table
                 );

DROP TABLE dbo.Test;
June 29, 2017 at 2:04 pm
Lynn Pettis - Thursday, June 29, 2017 1:18 PM
I assume that if a file won't open it is because the file exists in the table but not on disk, correct?
Yes, that is what we are testing for. FYI, the DIR /B /S command does work with a UNC name, but I cancelled it after it took longer just to run that DIR command than to run the entire cursor approach against only the rows from the document table from the past 48 hours. Thanks for the effort though; I do appreciate you and Brandie actually considering this.
June 29, 2017 at 2:29 pm
Chris Harshman - Thursday, June 29, 2017 2:04 PM
Yes, that is what we are testing for. FYI, the DIR /B /S command does work with a UNC name, but I cancelled it after it took longer just to run that DIR command than to run the entire cursor approach against only the rows from the document table from the past 48 hours. Thanks for the effort though; I do appreciate you and Brandie actually considering this.
Lynn, great idea with the hash.
Chris, I've run into a similar situation. I wrote a process to audit (size, date, etc.) directories of files that sound a lot like yours - zillions of files in thousands of directories. I do a dir command and then import the text file for parsing and organizing. I find that the dir command takes a very long time because of the sheer volume of files.
I don't know if it'll help or not, but you might want to have a look at the forfiles command. The /D switch allows filtering by date, either an absolute date or a number of days. You could shell out to that in Lynn's example above and then run the check against the smaller list.
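Something like this might work as a starting point (untested, and the share path and date are just placeholders - you'd build the MM/DD/YYYY value from GETDATE(), and I'm not sure forfiles likes a UNC path any better than DIR does):

-- hypothetical sketch: list only files modified on or after a cutoff date
EXEC xp_cmdshell 'forfiles /P "\\FileServer\Docs" /S /D +06/27/2017 /C "cmd /c echo @path" > C:\recentfiles.txt';

You could then BULK INSERT the much smaller recentfiles.txt into the staging table instead of the full DIR output.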
June 29, 2017 at 2:33 pm
Ed Wagner - Thursday, June 29, 2017 2:29 PM
Lynn, great idea with the hash.
Chris, I've run into a similar situation. I wrote a process to audit (size, date, etc.) directories of files that sound a lot like yours - zillions of files in thousands of directories. I do a dir command and then import the text file for parsing and organizing. I find that the dir command takes a very long time because of the sheer volume of files.
I don't know if it'll help or not, but you might want to have a look at the forfiles command. The /D switch allows filtering by date, either an absolute date or a number of days. You could shell out to that in Lynn's example above and then run the check against the smaller list.
Another idea would be to see if PowerShell could be used to compare hash values between the table in SQL and the files in the file system. I don't know enough about PowerShell to give it a try. I really need to learn PowerShell.
June 29, 2017 at 3:14 pm
Lynn Pettis - Thursday, June 29, 2017 2:33 PM
Another idea would be to see if PowerShell could be used to compare hash values between the table in SQL and the files in the file system. I don't know enough about PowerShell to give it a try. I really need to learn PowerShell.
CREATE TABLE dbo.Test(
    Id INT IDENTITY(1,1)
  , DocumentFileName NVARCHAR(4000)
  , FileHash AS CAST(HASHBYTES('MD5', DocumentFileName) AS VARBINARY(16)) PERSISTED -- MD5 is a 16-byte value
);

CREATE TABLE dbo.Test2 (DocumentFileName NVARCHAR(4000));

EXEC xp_cmdshell 'dir /b /s c:\*.* > C:\files.txt', no_output; -- not sure if this will work with a UNC path, but give it a shot

BULK INSERT dbo.Test2
FROM 'c:\files.txt';

INSERT INTO dbo.Test (DocumentFileName)
SELECT DocumentFileName FROM dbo.Test2 AS t; -- the persisted FileHash is computed on the way in

CREATE NONCLUSTERED INDEX idx1 ON dbo.Test (FileHash);

SELECT * FROM dbo.Test AS t;

SELECT *
FROM dbo.Media AS m -- or whatever your table is named
WHERE m.DateCreated >= DATEADD(HOUR, -48, GETDATE()) -- or whatever column you use to reduce the data set
  AND NOT EXISTS (SELECT 1
                  FROM dbo.Test AS t
                  WHERE t.FileHash = m.FileHash -- assumes you have or can put a hash on your table
                 );

DROP TABLE dbo.Test;
DROP TABLE dbo.Test2;
Edit: Copy paste issues
June 29, 2017 at 4:14 pm
Lynn Pettis - Thursday, June 29, 2017 2:33 PM
Another idea would be to see if PowerShell could be used to compare hash values between the table in SQL and the files in the file system. I don't know enough about PowerShell to give it a try. I really need to learn PowerShell.
I just do a BCP of the file into a staging table and do an EXCEPT or two between the staging table and the target table. As for PowerShell, I've not found much utility in it that I can't do with SQL Server and xp_CmdShell, which isn't the security risk that so many people think it is.
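Roughly like this, as an untested sketch (I'm borrowing Chris's dbo.DocumentPrint table and column names from earlier in the thread; the file path is just an example):

-- stage the file list, then use EXCEPT to find table rows with no file on disk
CREATE TABLE #FileList (DocumentFileName varchar(250));

BULK INSERT #FileList FROM 'C:\files.txt'; -- or bcp the file in from the command line

SELECT d.DocumentFileName
FROM dbo.DocumentPrint d
WHERE d.PrintDateTime >= DATEADD(hour, -48, GETDATE())
EXCEPT
SELECT f.DocumentFileName
FROM #FileList f;

DROP TABLE #FileList;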
Now, if you want to learn a really cool tool, learn more about WMI/WMIC. It does a whole lot of what people use PowerShell for, but you usually don't need to write any loops.
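For example, something like this (only a sketch, and mind the gotchas: WQL wants doubled backslashes in the path, the drive and path values here are made up, and CIM_DataFile is per-machine, so for a remote share you'd probably have to point WMIC at the file server with /node):

-- hypothetical sketch: pull file names and dates through WMIC instead of DIR
EXEC xp_cmdshell 'wmic datafile where "drive=''d:'' and path=''\\Docs\\''" get Name,LastModified';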
--Jeff Moden
Change is inevitable... Change for the better is not.
June 30, 2017 at 6:53 am
Brandie Tarvin - Thursday, June 29, 2017 4:14 AM
Chris Harshman - Wednesday, June 28, 2017 2:53 PM
Jeff Moden - Wednesday, June 28, 2017 6:55 AM
Heh... what's the last cursor you wrote and what did it do?
unfortunately had to write one this week... look through rows of a document table for the last 48 hours and compare the FileName column to the results of xp_fileexist to try to find missing documents (without waiting for the users to say "hey... the file doesn't open!")
Okay, I'm going to pretend to be a little obtuse, but couldn't you run the results of xp_fileexists into a temp table, pull the document table into another temp table with a "FileExists" bit column set to 0 (or 1 if you prefer your falses to be 1), then update that column in a set-based update for all file names that match?
So, guys, I'm curious why this post (one of the first responses to this file thing) is being completely ignored. Does no one have a comment on it?
It just seems a little weird to me that people are starting code based on this exact idea but no one has noticed that I posted about it before the entire discussion really kicked off.
June 30, 2017 at 8:28 am
Brandie Tarvin - Friday, June 30, 2017 6:53 AM
Brandie Tarvin - Thursday, June 29, 2017 4:14 AM
Okay, I'm going to pretend to be a little obtuse, but couldn't you run the results of xp_fileexists into a temp table, pull the document table into another temp table with a "FileExists" bit column set to 0 (or 1 if you prefer your falses to be 1), then update that column in a set-based update for all file names that match?
So, guys, I'm curious why this post (one of the first responses to this file thing) is being completely ignored. Does no one have a comment on it?
It just seems a little weird to me that people are starting code based on this exact idea but no one has noticed that I posted about it before the entire discussion really kicked off.
No offense; I guess I just didn't understand your approach. I thought xp_FileExist only produces output as an INT parameter, or as a three-column, one-row resultset that doesn't include the filename. I didn't think there was a set-based approach with it.
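For reference, both output shapes look something like this (it's an undocumented procedure, so this is just how it behaves on the versions I've used; the path is only an example):

-- resultset form: one row with three columns
-- (File Exists, File is a Directory, Parent Directory Exists)
EXEC master.dbo.xp_fileexist 'C:\Windows\notepad.exe';

-- output-parameter form: 1 if the file exists, 0 if not
DECLARE @Exists int;
EXEC master.dbo.xp_fileexist 'C:\Windows\notepad.exe', @Exists OUTPUT;
SELECT @Exists AS FileExists;

Either way it checks a single file per call, which is what pushes you toward a row-by-row loop.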
June 30, 2017 at 8:51 am
OK, I even tried using XCOPY /L /S /D to limit the filenames returned by date, so I only get a few thousand instead of the millions and millions of files, but this command still seems to take 3 or 4 minutes just to get the list. Thanks again to everyone who tried to look at this, but the cursor-based approach runs entirely in under a minute for me, so I'm sticking with this:
SET NOCOUNT ON;

DECLARE
    @StartDate datetime = DATEADD(hour, -48, GETDATE()), --DATEADD(day, DATEDIFF(day, 7, GETDATE()), 0),
    @DocumentPrintID int, @AppName varchar(80), @DocumentFileName varchar(250), @Exists int;

CREATE TABLE #MissingDocs (DocumentPrintID int, AppName varchar(80), DocumentFileName varchar(250));

DECLARE cur_Docs CURSOR FAST_FORWARD FOR
    SELECT DocumentPrintID, OriginalInsertAppName, DocumentFileName
    FROM dbo.DocumentPrint
    WHERE PrintDateTime >= @StartDate
      AND DocumentFileName IS NOT NULL;

OPEN cur_Docs;
FETCH NEXT FROM cur_Docs INTO @DocumentPrintID, @AppName, @DocumentFileName;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @Exists = 0;
    EXEC master.dbo.xp_fileexist @DocumentFileName, @Exists OUTPUT; -- 1 if the file exists on disk, 0 if not
    IF @Exists = 0
        INSERT INTO #MissingDocs (DocumentPrintID, AppName, DocumentFileName)
        VALUES (@DocumentPrintID, @AppName, @DocumentFileName);
    FETCH NEXT FROM cur_Docs INTO @DocumentPrintID, @AppName, @DocumentFileName;
END

CLOSE cur_Docs;
DEALLOCATE cur_Docs;

SELECT * FROM #MissingDocs;

DROP TABLE #MissingDocs;