remove partial duplicate records

  • I have put the results of dir *.* /s into a file and need to database the file, for a discrepancy report between external stored data and the database.

    Due to the large size of the file, I need o remove the partial duplicate records pro-grammatically so the records are transformed from raw format shown in **1 below to **2

    **1

    ------

    z:\images\12346

    z:\images\12346\09132010_1151

    09/13/2010 12:46 PM 14,934 12346_09132010_1151_5_i.bmp

    09/13/2010 12:46 PM 14,934 12346_09132010_1151_6_i.bmp

    z:\images\505722

    z:\images\505722\10252011_0849

    10/25/2011 11:36 AM 56,200 505722_10252011_0849_35_0.d

    10/25/2011 11:36 AM 2,032,930 505722_10252011_0849_21_0.d

    10/25/2011 11:36 AM 14,934 505722_10252011_0849_12_i.b

    z:\images\8921

    z:\images\8921\04192011_1022

    04/19/2011 01:36 PM 14,934 8921_04192011_1022_92_i.bmp

    **2

    --------

    z:\images\12346\09132010_1151

    09/13/2010 12:46 PM 14,934 12346_09132010_1151_5_i.bmp

    09/13/2010 12:46 PM 14,934 12346_09132010_1151_6_i.bmp

    z:\images\505722\10252011_0849

    10/25/2011 11:36 AM 56,200 505722_10252011_0849_35_0.d

    10/25/2011 11:36 AM 2,032,930 505722_10252011_0849_21_0.d

    10/25/2011 11:36 AM 14,934 505722_10252011_0849_12_i.b

    z:\images\8921\04192011_1022

    04/19/2011 01:36 PM 14,934 8921_04192011_1022_92_i.bmp

    As the final step, I would have to transform the file from **2 to format below, so it can be imported into database

    z:\images\12346\09132010_1151 09/13/2010 12:46 PM 14,934 12346_09132010_1151_5_i.bmp

    z:\images\12346\09132010_1151 09/13/2010 12:46 PM 14,934 12346_09132010_1151_6_i.bmp

    z:\images\505722\10252011_0849 10/25/2011 11:36 AM 56,200 505722_10252011_0849_35_0.d

    z:\images\505722\10252011_0849 10/25/2011 11:36 AM 2,032,930 505722_10252011_0849_21_0.d

    z:\images\505722\10252011_0849 10/25/2011 11:36 AM 14,934 505722_10252011_0849_12_i.b

    z:\images\8921\04192011_1022 04/19/2011 01:36 PM 14,934 8921_04192011_1022_92_i.bmp

    I tried some excel Macros that reformats the data, and works well as long the number of records are in thousands , however when trying the macro on my raw data due to the file size, after six hours or so, the execution of macros was never completed and I gave up on the idea of using Excel . I am now thinking that it may be possible to reformat the files using SQL tools , however wanted to discuss the idea in this forum and see if there are any suggestions on the approach.

  • Try DIR *.* /s /n /b

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks for your helpful suggestion, using the Dir *.* /s /b /n , solves part of my problem, and redefines the problem as " how to remove partially duplicate records, using tsql?"

    The result of the command looks something like this, any recommendations on how to remove the dups?

    z:\images ==> need to remove (duplicate)

    z:\images\12346 ==> need to remove (duplicate)

    z:\images\12346\09132010_1151 09/13/2010 12:46 PM 14,934 12346_09132010_1151_5_i.bmp

    z:\images\12346\09132010_1151 09/13/2010 12:46 PM 14,934 12346_09132010_1151_6_i.bmp

    z:\images\505722 ==> need to remove (duplicate)

    z:\images\505722\10252011_0849 10/25/2011 11:36 AM 56,200 505722_10252011_0849_35_0.d

    z:\images\505722\10252011_0849 10/25/2011 11:36 AM 2,032,930 505722_10252011_0849_21_0.d

    z:\images\505722\10252011_0849 10/25/2011 11:36 AM 14,934 505722_10252011_0849_12_i.b

    z:\images\8921 ==> need to remove (duplicate)

    z:\images\8921\04192011_1022 04/19/2011 01:36 PM 14,934 8921_04192011_1022_92_i.bmp

  • The problem is that you are thinking of them as "partial duplicates," which itself is impossible (IMHO). Why not look for the '.' in the file name, since that is what I presume is a defining characteristic of what you would like to keep in your set.

    Jared
    CE - Microsoft

  • Have a look at the output from this, it will point you in the direction. Come back if you have any trouble with it:

    DROP TABLE #Filelist

    CREATE TABLE #Filelist (DIRoutput VARCHAR(254))

    INSERT INTO #Filelist

    SELECT 'z:\images' UNION ALL

    SELECT 'z:\images\12346' UNION ALL

    SELECT 'z:\images\12346\09132010_1151 09/13/2010 12:46 PM 14,934 12346_09132010_1151_5_i.bmp' UNION ALL

    SELECT 'z:\images\12346\09132010_1151 09/13/2010 12:46 PM 14,934 12346_09132010_1151_6_i.bmp' UNION ALL

    SELECT 'z:\images\505722' UNION ALL

    SELECT 'z:\images\505722\10252011_0849 10/25/2011 11:36 AM 56,200 505722_10252011_0849_35_0.d' UNION ALL

    SELECT 'z:\images\505722\10252011_0849 10/25/2011 11:36 AM 2,032,930 505722_10252011_0849_21_0.d' UNION ALL

    SELECT 'z:\images\505722\10252011_0849 10/25/2011 11:36 AM 14,934 505722_10252011_0849_12_i.b' UNION ALL

    SELECT 'z:\images\8921' UNION ALL

    SELECT 'z:\images\8921\04192011_1022 04/19/2011 01:36 PM 14,934 8921_04192011_1022_92_i.bmp'

    SELECT a.*, b.*

    FROM #Filelist a

    LEFT JOIN #Filelist b

    ON a.DIRoutput LIKE b.DIRoutput+'\%'

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • SQLKnowItAll (7/3/2012)


    The problem is that you are thinking of them as "partial duplicates," which itself is impossible (IMHO). Why not look for the '.' in the file name, since that is what I presume is a defining characteristic of what you would like to keep in your set.

    Risky, even if the search is confined to the tail end of the string.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Jared, thanks for your suggestion. The issue is that, there are no set patterns with the name of the data files. Some files have extensions and some don't. For example the file names below:

    i:\PSDATA\Images\Nov2010\11_01_10

    i:\PSDATA\Images\Nov2010\11_01_10\bake1

    Z:\PSDATA\Images\Nov2010\11_01_10\bake1 11/01/2010 08:54 AM 20,449,708 file1 ==>file name w no ext.

    Z:\PSDATA\Images\Nov2010\11_01_10\bake1 11/01/2010 08:54 AM 20,974,558 file10 ==>file name w no ext.

    Z:\PSDATA\Images\Nov2010\11_01_10\bake1 11/01/2010 09:17 AM 20,712,502 file11 ==>file name w no ext.

    And there is a different variation of file names that go like:

    Z:\PSDATA\Images\Sep2005\09_29_05\jone1\thumbnails 05/12/2008 09:08 AM 4,039 file1.jpg

    Z:\PSDATA\Images\Sep2005\09_29_05\jone1\thumbnails 05/12/2008 09:08 AM 4,175 file10.jpg

    Here is the macro I used in excel that does the work but I run into a brick wall, when running it for big files, (do I need to write a program with a cursor to do the same?)

    Option Explicit

    Sub DeletePartialDups()

    Dim rSrc As Range

    Dim c As Range, r As Range

    Dim vRes() As Range

    Dim sFirstAdr As String

    Dim i As Long

    Set rSrc = Range("A1", Cells(Rows.Count, "A").End(xlUp))

    ReDim vRes(0)

    'Get list of cells to delete

    For Each c In rSrc

    sFirstAdr = c.Address

    Set r = rSrc.Find(what:=c, after:=c, LookIn:=xlValues, _

    lookat:=xlPart, MatchCase:=False)

    If r.Address <> sFirstAdr Then

    Set vRes(UBound(vRes)) = c

    ReDim Preserve vRes(0 To UBound(vRes) + 1)

    End If

    Next c

    Application.ScreenUpdating = False

    For i = UBound(vRes) - 1 To 0 Step -1

    vRes(i).Delete xlShiftUp

    Next i

    Application.ScreenUpdating = True

    End Sub

  • jaytaba (7/3/2012)


    Hi Jared, thanks for your suggestion. The issue is that, there are no set patterns with the name of the data files. Some files have extensions and some don't. For example the file names below:

    i:\PSDATA\Images\Nov2010\11_01_10

    i:\PSDATA\Images\Nov2010\11_01_10\bake1

    Z:\PSDATA\Images\Nov2010\11_01_10\bake1 11/01/2010 08:54 AM 20,449,708 file1 ==>file name w no ext.

    Z:\PSDATA\Images\Nov2010\11_01_10\bake1 11/01/2010 08:54 AM 20,974,558 file10 ==>file name w no ext.

    Z:\PSDATA\Images\Nov2010\11_01_10\bake1 11/01/2010 09:17 AM 20,712,502 file11 ==>file name w no ext.

    And there is a different variation of file names that go like:

    Z:\PSDATA\Images\Sep2005\09_29_05\jone1\thumbnails 05/12/2008 09:08 AM 4,039 file1.jpg

    Z:\PSDATA\Images\Sep2005\09_29_05\jone1\thumbnails 05/12/2008 09:08 AM 4,175 file10.jpg

    Here is the macro I used in excel that does the work but I run into a brick wall, when running it for big files, (do I need to write a program with a cursor to do the same?)

    Option Explicit

    Sub DeletePartialDups()

    Dim rSrc As Range

    Dim c As Range, r As Range

    Dim vRes() As Range

    Dim sFirstAdr As String

    Dim i As Long

    Set rSrc = Range("A1", Cells(Rows.Count, "A").End(xlUp))

    ReDim vRes(0)

    'Get list of cells to delete

    For Each c In rSrc

    sFirstAdr = c.Address

    Set r = rSrc.Find(what:=c, after:=c, LookIn:=xlValues, _

    lookat:=xlPart, MatchCase:=False)

    If r.Address <> sFirstAdr Then

    Set vRes(UBound(vRes)) = c

    ReDim Preserve vRes(0 To UBound(vRes) + 1)

    End If

    Next c

    Application.ScreenUpdating = False

    For i = UBound(vRes) - 1 To 0 Step -1

    vRes(i).Delete xlShiftUp

    Next i

    Application.ScreenUpdating = True

    End Sub

    Fair enough. Though now I see you have more problems than just importing and sorting this data :hehe:

    Jared
    CE - Microsoft

  • Try the following to get the list of files:

    dir *.* /a: -D /b /s /n

    NOTE: No space between the a: and the -D. I had to put the space there or you got this a:-D

  • Lynn Pettis (7/3/2012)


    Try the following to get the list of files:

    dir *.* /a: -D /b /s /n

    NOTE: No space between the a: and the -D. I had to put the space there or you got this a:-D

    +1 get the source data right! Nice Lynn!

    Jared
    CE - Microsoft

  • Lynn Pettis (7/3/2012)


    Try the following to get the list of files:

    dir *.* /a: -D /b /s /n

    NOTE: No space between the a: and the -D. I had to put the space there or you got this a:-D

    Spot on, nice job Lynn 😎

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Lynn, nice just what I wanted.

    Just curious, is there a switch that allows capturing of the file attributes (i.e. size, date)

    Thanks again for saving me a lot of trouble.

    Regards

    Jay

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply