Data Shuffle

  • Howdy

    I need to shuffle data within our production database so we can provide it as a test database to a customer purchasing our software.

    I was hoping a statement such as below would help, but I just end up with the every row containing the exact same data.

    update customers

    set

    firstname = (select top 1 firstname from customers order by newid()),

    middlename = (select top 1 middlename from customers order by newid()),

    lastname = (select top 1 lastname from customers order by newid()),

    birthdate = (select top 1 birthdate from customers order by newid())

    Any help would be greatly appreciated. I can solve it using cursors but we have far too much data for this to be a feasible solution.

    Cheers,

    Craig

  • What is the primary key of the table?

    Also, since this is your first post, you might want to take a look at the link in my signature below... folks will be more likey to help if you help them a bit.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ok... I borrowed some data and assume that you don't actually have a primary key... this should give you and idea...

    --drop table #shuffle

    DECLARE @NameCountPlus1 INT

    SELECT @NameCountPlus1 = COUNT(*)+1 FROM NorthWind.dbo.Employees

    SELECT IDENTITY(INT,1,1) AS OriginalOrder,

    LastName, FirstName, BirthDate,

    LastName AS NewLastName,

    FirstName AS NewFirstName,

    BirthDate AS NewBirthDate,

    ABS(CHECKSUM(NEWID()))%@NameCountPlus1 AS Order1,

    ABS(CHECKSUM(NEWID()))%@NameCountPlus1 AS Order2,

    ABS(CHECKSUM(NEWID()))%@NameCountPlus1 AS Order3

    INTO #Shuffle

    FROM NorthWind.dbo.Employees

    UPDATE s2

    SET NewLastName = s1.LastName

    FROM #Shuffle s1, #Shuffle s2

    WHERE s1.OriginalOrder = s2.Order1

    UPDATE s2

    SET NewFirstName = s1.FirstName

    FROM #Shuffle s1, #Shuffle s2

    WHERE s1.OriginalOrder = s2.Order2

    UPDATE s2

    SET NewBirthDate = s1.BirthDate

    FROM #Shuffle s1, #Shuffle s2

    WHERE s1.OriginalOrder = s2.Order3

    select * from #shuffle

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff

    I've created a test script below.

    IF OBJECT_ID('TempDB..#customers', 'U') IS NOT NULL

    DROP TABLE #customers

    CREATE TABLE #customers

    (

    CustomerID INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,

    Title VARCHAR(10) NOT NULL,

    FirstName VARCHAR(30) NOT NULL,

    MiddleName VARCHAR(30) NULL,

    LastName VARCHAR(30) NOT NULL,

    BirthDate SMALLDATETIME NULL

    )

    INSERT INTO #customers VALUES ('MR.', 'CHRISTOPHER', NULL, 'PETERSON', '1967-01-25')

    INSERT INTO #customers VALUES ('MS.', 'RENAE', NULL, 'DAMIANI', '1984-09-21')

    INSERT INTO #customers VALUES ('MR.', 'GARY', NULL, 'MAY', '1974-05-12')

    INSERT INTO #customers VALUES ('MISS', 'ELIZABETH', 'JANE', 'FLAHERTY', '1956-12-02')

    INSERT INTO #customers VALUES ('MR.', 'SAMUEL', 'NATHAN', 'HILL', '1968-07-14')

    INSERT INTO #customers VALUES ('MR.', 'MATTHEW', 'J', 'HORNE', '1971-03-30')

    INSERT INTO #customers VALUES ('MR.', 'ADAM', 'MAX', 'REID', '1979-09-16')

    INSERT INTO #customers VALUES ('MR.', 'JOSHUA', 'DAVID', 'ROUTLEY', '1982-02-02')

    INSERT INTO #customers VALUES ('MR.', 'TUTONE', NULL, 'KALONIHEA', '1958-11-15')

    INSERT INTO #customers VALUES ('MRS.', 'MUNA', NULL, 'NAJJAR', '1963-08-18')

  • Heh... I was hoping you had an IDENTITY PK... I won't have to make many changes to the script I just wrote above...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks a million Jeff

    Using your script as a basis, you turned a 26+ hour process into 32 second process.

    This is the code I'm now using

    -- Add Column for Sequential Order

    ALTER TABLE Customers

    ADD tmpOrder1INT

    GO

    -- Add Sequential Order

    UPDATE c1

    SET tmpOrder1 = RowNum

    FROM Customers c1

    INNER JOIN

    (

    SELECT CustomerID, ROW_NUMBER() OVER (ORDER BY CustomerID) 'RowNum' FROM Customers

    ) c2 ON c2.CustomerID = c1.CustomerID

    GO

    -- Add Column for Random Order

    ALTER TABLE Customers

    ADD tmpOrder2INT

    GO

    -- Add Random Order

    UPDATE Customers

    SET tmpOrder2 = ABS(CHECKSUM(NEWID())) % (SELECT COUNT(*) FROM Customers)

    GO

    -- Update FirstName

    UPDATE c1

    SET FirstName = c2.FirstName

    FROM Customers c1

    INNER JOIN Customers c2 ON c2.tmpOrder1 = c1.tmpOrder2

    GO

    -- Re-randomise

    UPDATE Customers

    SET tmpOrder2 = ABS(CHECKSUM(NEWID()))%(SELECT COUNT(*) FROM Customers)

    GO

    -- Update MiddleName

    UPDATE c1

    SET MiddleName = c2.MiddleName

    FROM Customers c1

    INNER JOIN Customers c2 ON c2.tmpOrder1 = c1.tmpOrder2

    GO

    -- Re-randomise

    UPDATE Customers

    SET tmpOrder2 = ABS(CHECKSUM(NEWID()))%(SELECT COUNT(*) FROM Customers)

    GO

    -- Update LastName

    UPDATE c1

    SET LastName = c2.LastName

    FROM Customers c1

    INNER JOIN Customers c2 ON c2.tmpOrder1 = c1.tmpOrder2

    GO

    -- Re-randomise

    UPDATE Customers

    SET tmpOrder2 = ABS(CHECKSUM(NEWID()))%(SELECT COUNT(*) FROM Customers)

    GO

    -- Update BirthDate

    UPDATE c1

    SET BirthDate = c2.BirthDate

    FROM Customers c1

    INNER JOIN Customers c2 ON c2.tmpOrder1 = c1.tmpOrder2

    GO

    ALTER TABLE Customers

    DROP COLUMN tmpOrder1

    GO

    ALTER TABLE Customers

    DROP COLUMN tmpOrder2

    GO

  • craig (10/9/2008)


    Thanks a million Jeff

    Using your script as a basis, you turned a 26+ hour process into 32 second process.

    This is the code I'm now using

    -- Add Column for Sequential Order

    ALTER TABLE Customers

    ADD tmpOrder1INT

    GO

    -- Add Sequential Order

    UPDATE c1

    SET tmpOrder1 = RowNum

    FROM Customers c1

    INNER JOIN

    (

    SELECT CustomerID, ROW_NUMBER() OVER (ORDER BY CustomerID) 'RowNum' FROM Customers

    ) c2 ON c2.CustomerID = c1.CustomerID

    GO

    -- Add Column for Random Order

    ALTER TABLE Customers

    ADD tmpOrder2INT

    GO

    -- Add Random Order

    UPDATE Customers

    SET tmpOrder2 = ABS(CHECKSUM(NEWID())) % (SELECT COUNT(*) FROM Customers)

    GO

    -- Update FirstName

    UPDATE c1

    SET FirstName = c2.FirstName

    FROM Customers c1

    INNER JOIN Customers c2 ON c2.tmpOrder1 = c1.tmpOrder2

    GO

    -- Re-randomise

    UPDATE Customers

    SET tmpOrder2 = ABS(CHECKSUM(NEWID()))%(SELECT COUNT(*) FROM Customers)

    GO

    -- Update MiddleName

    UPDATE c1

    SET MiddleName = c2.MiddleName

    FROM Customers c1

    INNER JOIN Customers c2 ON c2.tmpOrder1 = c1.tmpOrder2

    GO

    -- Re-randomise

    UPDATE Customers

    SET tmpOrder2 = ABS(CHECKSUM(NEWID()))%(SELECT COUNT(*) FROM Customers)

    GO

    -- Update LastName

    UPDATE c1

    SET LastName = c2.LastName

    FROM Customers c1

    INNER JOIN Customers c2 ON c2.tmpOrder1 = c1.tmpOrder2

    GO

    -- Re-randomise

    UPDATE Customers

    SET tmpOrder2 = ABS(CHECKSUM(NEWID()))%(SELECT COUNT(*) FROM Customers)

    GO

    -- Update BirthDate

    UPDATE c1

    SET BirthDate = c2.BirthDate

    FROM Customers c1

    INNER JOIN Customers c2 ON c2.tmpOrder1 = c1.tmpOrder2

    GO

    ALTER TABLE Customers

    DROP COLUMN tmpOrder1

    GO

    ALTER TABLE Customers

    DROP COLUMN tmpOrder2

    GO

    Very cool... thanks for posting your solution. I was just getting to the rework on this and you saved me a bit of time. 🙂 And thanks for the feedback on the performance!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Craig, BTW... how many rows was this for?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Oh... one more thing... since you added and then deleted columns from the "real" table instead of using a temp table as a driver, you should probably run a DBCC DBReindex to repack some of the page splits you may have caused on the Clustered Index.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • There's only 154,000 rows. That's how bad my row by row cursor solution was. :crazy:

    This is it below (with minimal tidy up)

    SET NOCOUNT ON

    DECLARE @customerID INT

    DECLARE @gender CHAR(1)

    DECLARE @dob DATETIME

    DECLARE @firstName VARCHAR(50)

    DECLARE @middleName VARCHAR(50)

    DECLARE @lastName VARCHAR(50)

    DECLARE @birthDate DATETIME

    DECLARE myCursor CURSOR FOR

    SELECT CustomerID, Gender, BirthDate

    FROM Customers

    WHERE CustomerID > -1

    ORDER BY CustomerID

    OPEN myCursor

    FETCH NEXT FROM myCursor INTO @customerID, @gender, @dob

    WHILE @@FETCH_STATUS = 0

    BEGIN

    SELECT

    @firstName = (SELECT TOP 1 firstname FROM Customers WHERE gender = @gender ORDER BY NEWID()),

    @middleName = (SELECT TOP 1 middlename FROM Customers WHERE gender = @gender ORDER BY NEWID()),

    @lastName = (SELECT TOP 1 lastname FROM Customers WHERE gender = @gender ORDER BY NEWID()),

    @birthDate = (SELECT TOP 1 birthdate FROM Customers WHERE (BirthDate IS NULL AND @dob IS NULL) OR

    (BirthDate IS NOT NULL AND @dob IS NOT NULL) ORDER BY NEWID())

    UPDATE Customers

    SET FirstName = @firstName,

    MiddleName = @middleName,

    LastName = @lastName,

    BirthDate = @birthDate

    WHERE CustomerID = @customerID

    FETCH NEXT FROM myCursor INTO @customerID, @gender, @dob

    END

    CLOSE myCursor

    DEALLOCATE myCursor

  • Holy moly! Thanks for posting that. It's always good to see the before'n'after.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply