March 17, 2004 at 8:05 am
I'm importing from csv text files & need to ensure row order is maintained.
I've already imported one file and then added an id field so order can be maintained from now on. But how do I know order was kept during DTS & adding the id field? I checked a few sample areas and it does seem to match the original file, but I can't check the entire 3 million rows manually.
So what import methods can I use to be sure order is kept?
Data: Easy to spill, hard to clean up!
March 18, 2004 at 3:26 am
If you are concerned with row order when retrieving data, you must use the ORDER BY clause in the SELECT statement. The ANSI standard says that without the ORDER BY clause rows may be returned in any order, and that is what you are likely to get, regardless of the physical order the rows are stored in.
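For example, a minimal sketch (hypothetical table and column names), assuming an identity column id was added at load time:

SELECT datacolumn1, datacolumn2
FROM tblImport
ORDER BY id -- the only way to get a guaranteed, repeatable sequence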
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
March 18, 2004 at 6:26 am
You could create a clustered index on the target table that uses the same columns that determine the order of the rows in the source csv file. That would ensure that the data rows in the table are physically ordered in the same manner as they are in your source data. This is (I think) the only way to be sure that the data stored in the table is ordered the way you want.
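For instance, a minimal sketch with hypothetical names, where rowseq stands for whatever column(s) define the order of the rows in the csv:

-- Physically order the table rows to match the source file order
CREATE CLUSTERED INDEX CIX_tblImport_rowseq ON tblImport (rowseq)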
March 18, 2004 at 7:12 am
Then I assume I can be sure rows are transferred to the destination SQL table in the same order as the source csv file.
I'll go with that (unless someone out there knows better) - and import into a table with a clustered index.
Data: Easy to spill, hard to clean up!
March 18, 2004 at 7:22 am
Yes, the clustered index will ensure that the rows are physically ordered based on the columns you specify in the index. Some things to bear in mind: You can only have one clustered index per table (since it determines the row order, it would not make sense to have more than one). By default the primary key is implemented as the clustered index, although the two do not have to be the same. You might want to take a look at the "Using Clustered Indexes" topic in Books Online; search for "clustered indexes, overview" in the BOL index. Make sure you use only the minimum number of columns you need in the clustered index to ensure correct ordering and define your key.
March 20, 2004 at 5:04 pm
I understand clustered indexes. What I don't fully understand is DTS.
My question: If DTS imports a csv file with no primary key, will the resulting table be in the same order as the source csv file? From testing the imported file, it seems to me that it is in the same order. However, is this behavior guaranteed or by chance?
Data: Easy to spill, hard to clean up!
March 21, 2004 at 6:32 am
Yes, the order should be maintained on import from the CSV file through DTS, unless your table has a clustered index that would cause a re-sort. Since you say you have an IDENTITY column, though, it should not.
March 22, 2004 at 3:00 am
The behaviour is NOT guaranteed. You need to understand what is meant by 'order' in a relational database, and how this interacts with a DTS load.
The physical order in which rows exist has no relevance to the actual order rows are presented to the user in a SELECT statement. There is one and only one way to obtain rows from a relational database in a specified sequence, and that is the ORDER BY clause. If you do not use SELECT ... ORDER BY ... then any sequence of rows you obtain may not be repeatable. There are many reasons for this, some of which are given below.
Even if the table you are loading into exists in a single physical file and there is no clustered index, the physical order in which the data is stored is not guaranteed. If the table exists in multiple physical files, the physical order is even less certain. Even with a clustered index, the physical order within the database is not guaranteed.
When you are inserting rows into a table, SQL Server will allocate an extent of space for your table. Assuming there is no cluster index and you have a large number of rows, this extent will be filled sequentially from start to end. If more rows need to be inserted, a new extent will be obtained and rows inserted into that. Extents are allocated to a table in the physical sequence they exist in the database.
If the current DTS load is the first load into the table, then at the end of the load, the rows will almost certainly be in the same physical order in the table as they were in the input file.
If you do any maintenance work on the database before your next DTS load you may have released an extent before the data for your table starts, or even in the middle of it. The next DTS load will again use free extents in the physical sequence they exist in the database, but as you can see there is no guarantee these extents will be after all previous extents for your table. In this situation, data for the current load will be physically interspersed between data previously loaded.
It is important to stress there is nothing wrong with this behaviour. On the contrary, this behaviour is desirable as it helps re-use space without extending the physical size of the database.
Even if a clustered index is used, if the key sequence of new data is intermixed with existing data, an extent split will eventually occur. The new extent will be the first available in the database, which could physically sit before extents containing data with lower key values. A clustered index will guarantee that all rows in a given extent are physically stored in the order specified by the key, but has no impact on the physical sequence in which extents are stored. Although a 'SELECT *' on a table with a clustered index but no ORDER BY will normally present all rows in the sequence given by the key, the order is not actually guaranteed, and if multiple physical files or parallelism are involved then the order of presentation may not be repeatable.
So to sum up, there is no point in worrying about maintaining the physical sequence of rows across multiple DTS loads, because there is no way to guarantee that sequence. Your concern should be directed towards obtaining the rows in a desired sequence in a SELECT statement, which will require an ORDER BY. You can reduce the time required to obtain the data in the desired sequence by defining a clustered index that matches your ORDER BY.
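For example, a minimal sketch of that pairing (hypothetical names, with id as the load-order column):

-- The clustered index makes the sort cheap...
CREATE CLUSTERED INDEX CIX_tblImport_id ON tblImport (id)
GO
-- ...but only the ORDER BY makes the sequence guaranteed
SELECT datacolumn1, datacolumn2
FROM tblImport
ORDER BY id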
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
March 22, 2004 at 8:10 am
Thanks EdVassie for the detailed & thoughtful reply. The problem, which I keep restating so it's understood, is that I have csv files that need to be imported to SQL Server. The order of the rows is important; however, there is no combination of columns that I can use to order the rows. Once in SQL Server, I can easily add an identity column as a primary key to keep the order. However, will the act of importing and creating a key upset the current order of rows? It seems the answer is: probably not.
I've decided that I want to be more certain of the order. Since no one came up with a DTS solution to ensure order is maintained, the best solution I can think of is to create a program (a batch file if I can, otherwise C++) that adds an additional id column to the csv file. The additional column will simply number the rows, providing the key for ordering the table in SQL Server.
Data: Easy to spill, hard to clean up!
March 22, 2004 at 8:21 am
If you define an ID column in your table you can load your data using DTS. All that should be needed is for the DTS package to name all the columns it is inserting data into. Any columns in the table that are not named must have a default value, or be an ID column.
This should save you writing some custom code.
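As a rough sketch of the idea (hypothetical table and column names): the target table carries the ID column, the load only ever names the data columns, and the ID assigns itself in insert order:

CREATE TABLE tblImport (
id int IDENTITY(1,1) NOT NULL,
datacolumn1 varchar(320) NULL,
datacolumn2 varchar(320) NULL
)
GO
-- Each row DTS pumps in is equivalent to an insert that omits id
INSERT INTO tblImport (datacolumn1, datacolumn2) VALUES ('a', 'b')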
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
March 22, 2004 at 5:59 pm
I recently had an unrelated problem (NUL ASCII characters in the text file that confused DTS about where rows ended). My solution was to read the file in line by line in an ActiveX task and re-write each line to a new file along with a row number. This should work for your situation too: the appended row number gives you a column that keeps your rows in the order of the original text file.
'******** This step takes about 5 minutes for ~5.7 million rows
Dim i ' row counter
Dim s ' current line of text
Const ForReading = 1, ForWriting = 2, ForAppending = 8
Const TristateUseDefault = -2, TristateTrue = -1, TristateFalse = 0
Dim fso, f, f2, ts, ts2
Set fso = CreateObject("Scripting.FileSystemObject")
fso.CreateTextFile "d:\Newfile.txt" ' create the output file
Set f = fso.GetFile("d:\Newfile.txt")
Set ts = f.OpenAsTextStream(ForWriting, TristateUseDefault)
Set f2 = fso.GetFile("d:\OriginalFile.txt")
Set ts2 = f2.OpenAsTextStream(ForReading, TristateUseDefault)
i = 1
Do While Not ts2.AtEndOfStream
    s = ts2.ReadLine ' read the next line from the original file
    ts.Write s & "*" & i & " " & vbCrLf ' append a "*" delimiter and the row number, then write to the new file
    i = i + 1
Loop
ts.Close
ts2.Close
Set fso = Nothing
April 12, 2004 at 11:45 am
I use the below.
While I haven't been rigorous about verifying the ordinal inserts, my experience says the rows do go in "in order" from the csv file.
By using the "entryid", you can order on that.
Also, I have a "get rid of duplicates" script. It's slow as molasses but gets the job done.
The second one uses dynamic SQL. Why? Because I can pass in which field has the primary id as a parameter. This is slower of course, but my imports aren't deadline critical.
Just use a 0 for @intWorkingimportjobid; I was setting it up so different people could use the import table at the same time.
My script sets the duplicates to rowstatus = 'D'. I can either delete them, or just ignore them when I run the import procedures.
Maybe this will help, I dunno.
if exists (select * from sysobjects
where id = object_id('tblWorkingImportTable'))
DROP TABLE tblWorkingImportTable
GO
CREATE TABLE tblWorkingImportTable (
datacolumn1 varchar(320) NULL ,
datacolumn2 varchar(320) NULL ,
datacolumn3 varchar(320) NULL ,
datacolumn4 varchar(320) NULL ,
datacolumn5 varchar(320) NULL ,
datacolumn6 varchar(320) NULL ,
datacolumn7 varchar(320) NULL ,
datacolumn8 varchar(320) NULL ,
datacolumn9 varchar(320) NULL ,
datacolumn10 varchar(320) NULL ,
datacolumn11 varchar(320) NULL ,
datacolumn12 varchar(320) NULL ,
datacolumn13 varchar(320) NULL ,
datacolumn14 varchar(320) NULL ,
datacolumn15 varchar(320) NULL ,
datacolumn16 varchar(320) NULL ,
datacolumn17 varchar(320) NULL ,
datacolumn18 varchar(320) NULL ,
datacolumn19 varchar(320) NULL ,
datacolumn20 varchar(320) NULL ,
relationalidforcolumn1 int NULL ,
relationalidforcolumn2 int NULL ,
relationalidforcolumn3 int NULL ,
relationalidforcolumn4 int NULL ,
relationalidforcolumn5 int NULL ,
relationalidforcolumn6 int NULL ,
relationalidforcolumn7 int NULL ,
relationalidforcolumn8 int NULL ,
relationalidforcolumn9 int NULL ,
relationalidforcolumn10 int NULL ,
relationalidforcolumn11 int NULL ,
relationalidforcolumn12 int NULL ,
relationalidforcolumn13 int NULL ,
relationalidforcolumn14 int NULL ,
relationalidforcolumn15 int NULL ,
relationalidforcolumn16 int NULL ,
relationalidforcolumn17 int NULL ,
relationalidforcolumn18 int NULL ,
relationalidforcolumn19 int NULL ,
relationalidforcolumn20 int NULL ,
entryid int IDENTITY (1,1) PRIMARY KEY NONCLUSTERED ,
createdate datetime NOT NULL DEFAULT getDate() , -- used to show when the data was put in the table
workingimportjobid int NULL DEFAULT (0) , -- used to distinguish this job from all others
rowstatus char(1) NULL
)
GO
GRANT REFERENCES , SELECT , INSERT , DELETE , UPDATE ON tblWorkingImportTable TO myUser
GO
And also
if exists (select * from sysobjects where id = object_id('dbo.prc_eliminate_workingtable_duplicates') and sysstat & 0xf = 4)
drop procedure dbo.prc_eliminate_workingtable_duplicates
GO
CREATE PROCEDURE prc_eliminate_workingtable_duplicates
(
@intWorkingimportjobid int , -- reserved so different people can use the import table at the same time; pass 0 for now
@strUniqueColumnName varchar(64) ,
@int_return_value int = 0 output ,
@str_return_message varchar(2000) = '' output
)
AS
declare @strSQLStatement varchar(8000)
/*
Select distinct datacolumn2 from tblWorkingImportTable where datacolumn2 IN (
Select datacolumn2 from tblWorkingImportTable
GROUP BY datacolumn2
HAVING Count(datacolumn2) >= 2
)
*/
DECLARE @uniqueid int
DECLARE @msg varchar(2000)
DECLARE @firstQueryFETCHSTATUS int
DECLARE @intCounter int
select @intCounter = 0 --//default value
/*
DECLARE curData CURSOR FAST_FORWARD FOR Select distinct datacolumn2 from tblWorkingImportTable where datacolumn2 IN (
Select datacolumn2 from tblWorkingImportTable
GROUP BY datacolumn2
HAVING Count(datacolumn2) >= 2
)
*/
set nocount on
select @strSQLStatement = 'DECLARE curData CURSOR FAST_FORWARD FOR '
    + 'Select Distinct ' + @strUniqueColumnName + ' from tblWorkingImportTable '
    + 'where ' + @strUniqueColumnName + ' IN (Select ' + @strUniqueColumnName
    + ' from tblWorkingImportTable GROUP BY ' + @strUniqueColumnName
    + ' HAVING Count( ' + @strUniqueColumnName + ' ) >= 2)'
EXEC (@strSQLStatement)
OPEN curData
-- Perform the first fetch.
fetch curData into @uniqueid
select @firstQueryFETCHSTATUS = @@FETCH_STATUS
IF @firstQueryFETCHSTATUS <> 0
begin
select @msg = '<<No Duplicates In Import File/Data.>>'
print @msg
end
WHILE @firstQueryFETCHSTATUS = 0
BEGIN
SELECT @msg = ' Duplicate ID (' + @strUniqueColumnName + ') = ' + convert(varchar(10), @uniqueid)
--PRINT @msg
select @intCounter = 0
DECLARE @entryid int
-- Build the inner cursor dynamically too, so the parameterised column name is used
-- (rather than hard-coding datacolumn2), and order by entryid so the earliest row is the one kept
select @strSQLStatement = 'DECLARE cursorInside CURSOR FAST_FORWARD FOR Select entryid from tblWorkingImportTable where ' + @strUniqueColumnName + ' = ''' + convert(varchar(10), @uniqueid) + ''' ORDER BY entryid'
EXEC (@strSQLStatement)
OPEN cursorInside
-- Perform the first fetch.
fetch cursorInside into @entryid
IF @@FETCH_STATUS <> 0
begin
select @msg = '<<No data.>>'
print @msg
end
WHILE @@FETCH_STATUS = 0
BEGIN
select @intCounter = @intCounter + 1
SELECT @msg = ' One of the duplicates has an entryid = ' + convert(varchar(10), @entryid)
--PRINT @msg
SELECT @msg = ' @intCounter = ' + convert(varchar(10), @intCounter)
--PRINT @msg
if @intCounter > 1
BEGIN
-- This is a duplicate
UPDATE tblWorkingImportTable
SET rowstatus = 'D'
WHERE entryid = @entryid
END
FETCH NEXT FROM cursorInside INTO @entryid
END
CLOSE cursorInside
DEALLOCATE cursorInside
--Select rowstatus from tblWorkingImportTable where datacolumn2 = @uniqueid
FETCH NEXT FROM curData INTO @uniqueid
select @firstQueryFETCHSTATUS = @@FETCH_STATUS
END
CLOSE curData
DEALLOCATE curData
set nocount off
select @int_return_value = 0
select @str_return_message = 'Success'
GO
--GRANT EXECUTE ON prc_eliminate_workingtable_duplicates TO myUser
GO
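In case it helps, a usage sketch for the procedure above (hypothetical values: datacolumn2 as the unique column, job id 0 as suggested, and assuming that column holds numeric ids):

DECLARE @rc int, @msg varchar(2000)
EXEC prc_eliminate_workingtable_duplicates
    @intWorkingimportjobid = 0,
    @strUniqueColumnName = 'datacolumn2',
    @int_return_value = @rc OUTPUT,
    @str_return_message = @msg OUTPUT
-- Rows flagged 'D' can then be deleted, or simply ignored by the import procedures
DELETE FROM tblWorkingImportTable WHERE rowstatus = 'D'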