March 17, 2004 at 8:05 am
I'm importing from csv text files & need to ensure row order is maintained.
I've already imported one file and then added an id field so order can be maintained from now on. But how do I know order was kept during DTS & adding the id field? I checked a few sample areas and it does seem to match the original file, but I can't check the entire 3 million rows manually.
So what import methods can I use to be sure order is kept?
Data: Easy to spill, hard to clean up!
March 18, 2004 at 3:26 am
If you are concerned with row order when retrieving data, you must use the ORDER BY clause in the SELECT statement. The ANSI standard says that without the ORDER BY clause rows may be returned in any order, and that is what you are likely to get, regardless of the physical order the rows are stored in.
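For example, a minimal sketch (hypothetical table and column names), assuming an identity column id was added at load time:

SELECT datacolumn1, datacolumn2
FROM tblImport
ORDER BY id -- the only way to get a guaranteed, repeatable sequence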
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
March 18, 2004 at 6:26 am
You could create a clustered index on the target table that uses the same columns that determine the order of the rows in the source csv file. That would ensure that the data rows in the table are physically ordered in the same manner as they are in your source data. This is (I think) the only way to be sure that the data stored in the table is ordered the way you want.
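For instance, a minimal sketch with hypothetical names, where rowseq stands for whatever column(s) define the order of the rows in the csv:

-- Physically order the table rows to match the source file order
CREATE CLUSTERED INDEX CIX_tblImport_rowseq ON tblImport (rowseq)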
March 18, 2004 at 7:12 am
Then I assume I can be sure rows are transferred to the destination SQL table in the same order as the source csv file.
I'll go with that (unless someone out there knows better) - and import into a table with a clustered index.
Data: Easy to spill, hard to clean up!
March 18, 2004 at 7:22 am
Yes, the clustered index will ensure that the rows are physically ordered based on the columns you specify in the index. Some things to bear in mind: You can only have one clustered index per table (since it determines the row order, it would not make sense to have more than one). By default the primary key is implemented as the clustered index, although the two do not have to be the same. You might want to take a look at the "Using Clustered Indexes" topic in Books Online; search for "clustered indexes, overview" in the BOL index. Make sure you use only the minimum number of columns you need in the clustered index to ensure correct ordering and define your key.
March 20, 2004 at 5:04 pm
I understand clustered indexes. What I don't fully understand is DTS.
My question: If DTS imports a csv file with no primary key, will the resulting table be in the same order as the source csv file? From testing the imported file, it seems to me that it is in the same order. However, is this behavior guaranteed or by chance?
Data: Easy to spill, hard to clean up!
March 21, 2004 at 6:32 am
Yes, the order should be maintained on import from the CSV file through DTS, unless your table has a clustered index that would cause a re-sort. Since you say you have an IDENTITY column, though, it should not.
March 22, 2004 at 3:00 am
The behaviour is NOT guaranteed. You need to understand what is meant by 'order' in a relational database, and how this interacts with a DTS load.
The physical order in which rows exist has no relevance to the actual order rows are presented to the user in a SELECT statement. There is one and only one way to obtain rows from a relational database in a specified sequence, and that is the ORDER BY clause. If you do not use SELECT ... ORDER BY ... then any sequence of rows you obtain may not be repeatable. There are many reasons for this, some of which are given below.
Even if the table you are loading into exists in a single physical file and there is no clustered index, the physical order in which the data is stored is not guaranteed. If the table exists in multiple physical files, the physical order is even less certain. Even with a clustered index, the physical order within the database is not guaranteed.
When you are inserting rows into a table, SQL Server will allocate an extent of space for your table. Assuming there is no cluster index and you have a large number of rows, this extent will be filled sequentially from start to end. If more rows need to be inserted, a new extent will be obtained and rows inserted into that. Extents are allocated to a table in the physical sequence they exist in the database.
If the current DTS load is the first load into the table, then at the end of the load, the rows will almost certainly be in the same physical order in the table as they were in the input file.
If you do any maintenance work on the database before your next DTS load you may have released an extent before the data for your table starts, or even in the middle of it. The next DTS load will again use free extents in the physical sequence they exist in the database, but as you can see there is no guarantee these extents will be after all previous extents for your table. In this situation, data for the current load will be physically interspersed between data previously loaded.
It is important to stress there is nothing wrong with this behaviour. On the contrary, this behaviour is desirable as it helps re-use space without extending the physical size of the database.
Even if a clustered index is used, if the key sequence of new data is intermixed with existing data, an extent split will eventually occur. The new extent will be the first available in the database, which could physically sit before extents containing data with lower key values. A clustered index will guarantee that all rows in a given extent are physically stored in the order specified by the key, but has no impact on the physical sequence in which extents are stored. Although a 'SELECT *' on a table with a clustered index but no ORDER BY will normally present all rows in the sequence given by the key, the order is not actually guaranteed, and if multiple physical files or parallelism are involved then the order of presentation may not be repeatable.
So to sum up, there is no point in worrying about maintaining the physical sequence of rows across multiple DTS loads, because there is no way to guarantee that sequence. Your concern should be directed towards obtaining the rows in a desired sequence in a SELECT statement, which will require an ORDER BY. You can reduce the time required to obtain the data in the desired sequence by defining a clustered index that matches your ORDER BY.
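For example, a minimal sketch of that pairing (hypothetical names, with id as the load-order column):

-- The clustered index makes the sort cheap...
CREATE CLUSTERED INDEX CIX_tblImport_id ON tblImport (id)
GO
-- ...but only the ORDER BY makes the sequence guaranteed
SELECT datacolumn1, datacolumn2
FROM tblImport
ORDER BY id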
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
March 22, 2004 at 8:10 am
Thanks EdVassie for the detailed & thoughtful reply. The problem, which I keep restating so it's understood, is that I have csv files that need to be imported to SQL Server. The order of the rows is important; however, there is no combination of columns that I can use to order the rows. Once in SQL Server, I can easily add an identity column as a primary key to keep the order. However, will the act of importing and creating a key upset the current order of rows? It seems the answer is: probably not.
I've decided that I want to be more certain of the order. Since no one came up with a DTS solution to ensure order is maintained, the best solution I can think of is to create a program (a batch file if I can, otherwise C++) that adds an additional id column to the csv file. The additional column will simply number the rows, providing the key for ordering the table in SQL Server.
Data: Easy to spill, hard to clean up!
March 22, 2004 at 8:21 am
If you define an ID column in your table you can load your data using DTS. All that should be needed is for the DTS package to name all the columns it is inserting data into. Any columns in the table that are not named must have a default value, or be an ID column.
This should save you writing some custom code.
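As a rough sketch of the idea (hypothetical table and column names): the target table carries the ID column, the load only ever names the data columns, and the ID assigns itself in insert order:

CREATE TABLE tblImport (
id int IDENTITY(1,1) NOT NULL,
datacolumn1 varchar(320) NULL,
datacolumn2 varchar(320) NULL
)
GO
-- Each row DTS pumps in is equivalent to an insert that omits id
INSERT INTO tblImport (datacolumn1, datacolumn2) VALUES ('a', 'b')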
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
March 22, 2004 at 5:59 pm
I recently had an unrelated problem (NUL ASCII characters in the text file that confused DTS about where rows ended). My solution was to read the file in line by line in an ActiveX task and re-write each line to a new file along with a row number. This should work for your situation too: the appended row number gives you a column that keeps your rows in the order of the original text file.
'******** This step takes about 5 minutes for ~5.7 million rows
Dim i ' row counter
Dim s ' current line of text
Const ForReading = 1, ForWriting = 2, ForAppending = 8
Const TristateUseDefault = -2, TristateTrue = -1, TristateFalse = 0
Dim fso, f, f2, ts, ts2
Set fso = CreateObject("Scripting.FileSystemObject")
fso.CreateTextFile "d:\Newfile.txt" ' create the output file
Set f = fso.GetFile("d:\Newfile.txt")
Set ts = f.OpenAsTextStream(ForWriting, TristateUseDefault)
Set f2 = fso.GetFile("d:\OriginalFile.txt")
Set ts2 = f2.OpenAsTextStream(ForReading, TristateUseDefault)
i = 1
Do While Not ts2.AtEndOfStream
    s = ts2.ReadLine ' read the next line from the original file
    ts.Write s & "*" & i & " " & vbCrLf ' append a "*" delimiter and the row number, then write to the new file
    i = i + 1
Loop
ts.Close
ts2.Close
Set fso = Nothing
April 12, 2004 at 11:45 am
I use the below.
While I haven't been rigorous about verifying the ordinal inserts, my experience says the rows do go in "in order" from the csv file.
By using the "entryid", you can order on that.
Also, I have a "get rid of duplicates" script. It's slow as molasses but gets the job done.
The second one uses dynamic SQL. Why? Because I can pass in which field has the primary id as a parameter. This is slower of course, but my imports aren't deadline critical.
Just use a 0 for @intWorkingimportjobid; I was setting it up so different people could use the import table at the same time.
My script sets the duplicates to rowstatus = 'D'. I can either delete them, or just ignore them when I run the import procedures.
Maybe this will help, I dunno.
if exists (select * from sysobjects
where id = object_id('tblWorkingImportTable'))
DROP TABLE tblWorkingImportTable
GO
CREATE TABLE tblWorkingImportTable (
datacolumn1 varchar(320) NULL ,
datacolumn2 varchar(320) NULL ,
datacolumn3 varchar(320) NULL ,
datacolumn4 varchar(320) NULL ,
datacolumn5 varchar(320) NULL ,
datacolumn6 varchar(320) NULL ,
datacolumn7 varchar(320) NULL ,
datacolumn8 varchar(320) NULL ,
datacolumn9 varchar(320) NULL ,
datacolumn10 varchar(320) NULL ,
datacolumn11 varchar(320) NULL ,
datacolumn12 varchar(320) NULL ,
datacolumn13 varchar(320) NULL ,
datacolumn14 varchar(320) NULL ,
datacolumn15 varchar(320) NULL ,
datacolumn16 varchar(320) NULL ,
datacolumn17 varchar(320) NULL ,
datacolumn18 varchar(320) NULL ,
datacolumn19 varchar(320) NULL ,
datacolumn20 varchar(320) NULL ,
relationalidforcolumn1 int NULL ,
relationalidforcolumn2 int NULL ,
relationalidforcolumn3 int NULL ,
relationalidforcolumn4 int NULL ,
relationalidforcolumn5 int NULL ,
relationalidforcolumn6 int NULL ,
relationalidforcolumn7 int NULL ,
relationalidforcolumn8 int NULL ,
relationalidforcolumn9 int NULL ,
relationalidforcolumn10 int NULL ,
relationalidforcolumn11 int NULL ,
relationalidforcolumn12 int NULL ,
relationalidforcolumn13 int NULL ,
relationalidforcolumn14 int NULL ,
relationalidforcolumn15 int NULL ,
relationalidforcolumn16 int NULL ,
relationalidforcolumn17 int NULL ,
relationalidforcolumn18 int NULL ,
relationalidforcolumn19 int NULL ,
relationalidforcolumn20 int NULL ,
entryid int IDENTITY (1,1) PRIMARY KEY NONCLUSTERED ,
createdate datetime NOT NULL DEFAULT getDate() , -- used to show when the data was put in the table
workingimportjobid int NULL DEFAULT (0) , -- used to distinguish this job from all others
rowstatus char(1) NULL
)
GO
GRANT REFERENCES , SELECT , INSERT , DELETE , UPDATE ON tblWorkingImportTable TO myUser
GO
And also
if exists (select * from sysobjects where id = object_id('dbo.prc_eliminate_workingtable_duplicates') and sysstat & 0xf = 4)
drop procedure dbo.prc_eliminate_workingtable_duplicates
GO
CREATE PROCEDURE prc_eliminate_workingtable_duplicates
(
@intWorkingimportjobid int , -- reserved so different people can use the import table at the same time; pass 0 for now
@strUniqueColumnName varchar(64) ,
@int_return_value int = 0 output ,
@str_return_message varchar(2000) = '' output
)
AS
declare @strSQLStatement varchar(8000)
/*
Select distinct datacolumn2 from tblWorkingImportTable where datacolumn2 IN (
Select datacolumn2 from tblWorkingImportTable
GROUP BY datacolumn2
HAVING Count(datacolumn2) >= 2
)
*/
DECLARE @uniqueid int
DECLARE @msg varchar(2000)
DECLARE @firstQueryFETCHSTATUS int
DECLARE @intCounter int
select @intCounter = 0 --//default value
/*
DECLARE curData CURSOR FAST_FORWARD FOR Select distinct datacolumn2 from tblWorkingImportTable where datacolumn2 IN (
Select datacolumn2 from tblWorkingImportTable
GROUP BY datacolumn2
HAVING Count(datacolumn2) >= 2
)
*/
set nocount on
select @strSQLStatement = 'DECLARE curData CURSOR FAST_FORWARD FOR '
    + 'Select Distinct ' + @strUniqueColumnName + ' from tblWorkingImportTable '
    + 'where ' + @strUniqueColumnName + ' IN (Select ' + @strUniqueColumnName
    + ' from tblWorkingImportTable GROUP BY ' + @strUniqueColumnName
    + ' HAVING Count( ' + @strUniqueColumnName + ' ) >= 2)'
EXEC (@strSQLStatement)
OPEN curData
-- Perform the first fetch.
fetch curData into @uniqueid
select @firstQueryFETCHSTATUS = @@FETCH_STATUS
IF @firstQueryFETCHSTATUS <> 0
begin
select @msg = '<<No Duplicates In Import File/Data.>>'
print @msg
end
WHILE @firstQueryFETCHSTATUS = 0
BEGIN
SELECT @msg = ' Duplicate ID (' + @strUniqueColumnName + ') = ' + convert(varchar(10), @uniqueid)
--PRINT @msg
select @intCounter = 0
DECLARE @entryid int
-- Build the inner cursor dynamically too, so the parameterised column name is used
-- (rather than hard-coding datacolumn2), and order by entryid so the earliest row is the one kept
select @strSQLStatement = 'DECLARE cursorInside CURSOR FAST_FORWARD FOR Select entryid from tblWorkingImportTable where ' + @strUniqueColumnName + ' = ''' + convert(varchar(10), @uniqueid) + ''' ORDER BY entryid'
EXEC (@strSQLStatement)
OPEN cursorInside
-- Perform the first fetch.
fetch cursorInside into @entryid
IF @@FETCH_STATUS <> 0
begin
select @msg = '<<No data.>>'
print @msg
end
WHILE @@FETCH_STATUS = 0
BEGIN
select @intCounter = @intCounter + 1
SELECT @msg = ' One of the duplicates has an entryid = ' + convert(varchar(10), @entryid)
--PRINT @msg
SELECT @msg = ' @intCounter = ' + convert(varchar(10), @intCounter)
--PRINT @msg
if @intCounter > 1
BEGIN
-- This is a duplicate
UPDATE tblWorkingImportTable
SET rowstatus = 'D'
WHERE entryid = @entryid
END
FETCH NEXT FROM cursorInside INTO @entryid
END
CLOSE cursorInside
DEALLOCATE cursorInside
--Select rowstatus from tblWorkingImportTable where datacolumn2 = @uniqueid
FETCH NEXT FROM curData INTO @uniqueid
select @firstQueryFETCHSTATUS = @@FETCH_STATUS
END
CLOSE curData
DEALLOCATE curData
set nocount off
select @int_return_value = 0
select @str_return_message = 'Success'
GO
--GRANT EXECUTE ON prc_eliminate_workingtable_duplicates TO myUser
GO
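In case it helps, a usage sketch for the procedure above (hypothetical values: datacolumn2 as the unique column, job id 0 as suggested, and assuming that column holds numeric ids):

DECLARE @rc int, @msg varchar(2000)
EXEC prc_eliminate_workingtable_duplicates
    @intWorkingimportjobid = 0,
    @strUniqueColumnName = 'datacolumn2',
    @int_return_value = @rc OUTPUT,
    @str_return_message = @msg OUTPUT
-- Rows flagged 'D' can then be deleted, or simply ignored by the import procedures
DELETE FROM tblWorkingImportTable WHERE rowstatus = 'D'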