December 30, 2014 at 9:10 am
hoping for a brainstorm about this problem and here's the other information I have:
Source file is txt, with 70,000,000+ (70 million).
-has a unique column called keyword_id
-if I grep the source file I can verify no keyword_id is the same aka a unique keyword_id per row
-if I use import wizard it imports the file properly, retaining unique keyword_id identities and correct rowcount
-if I use ssis Flat File Source to import to OLEDB table, having a keyword_id datatype of bigint, the end results if a table with thousands of keyword_ids whose last three digits have changed to 0000.
-in each of the above tests (grep, import wizard, ssis) the final number of rows imported is the same.
--flow is straightforward: FF Source, OLE DB destination. No other conversions.
When this problem first happened I told the vendor that they sent me a file with duplicates. Having processed other files containing same kind of data I thought one off error on their part. Well, they showed me with unix command line that no duplicates and they are right.
Always the rowcount checks out but with ssis, the keyword_ids are altered (from something like 778542125 to 778542000) so that they loose their uniqueness....
Anyone think of why this may be happening?
December 30, 2014 at 9:48 am
You could try using the Import Wizard, saving the package it produces and then comparing that with the one you've created manually.
But if I had to guess, it sounds like a datatype problem to me.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
December 31, 2014 at 7:47 am
I wrote in my opening statement that I used Import Wizard and in resulting table there are no 'duplicates'
last evening I used bulk insert to import the .txt file and also no duplicates. This will be my work around (script task) but I do not understand why flat file source task -> OLE DB task not importing file properly.
Last evening I also rebuilt the data flow flat file source and oledb connection to make sure I didn't make any inadvertent setting changes in ssis and repulled the 7 million rows and again. destination column datatype for this column is bigint, columns align, all rows imported, and no truncation errors or errors of any kind, but once again keyword_ids have substituted last three numbers of keyword_id to 000.
December 31, 2014 at 8:26 am
KoldCoffee (12/31/2014)
I wrote in my opening statement that I used Import Wizard and in resulting table there are no 'duplicates'last evening I used bulk insert to import the .txt file and also no duplicates. This will be my work around (script task) but I do not understand why flat file source task -> OLE DB task not importing file properly.
Last evening I also rebuilt the data flow flat file source and oledb connection to make sure I didn't make any inadvertent setting changes in ssis and repulled the 7 million rows and again. destination column datatype for this column is bigint, columns align, all rows imported, and no truncation errors or errors of any kind, but once again keyword_ids have substituted last three numbers of keyword_id to 000.
I understood your 'opening statement'.
That is why I suggested saving the SSIS package produced by the import wizard, as it clearly worked better than the one you created.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
December 31, 2014 at 9:02 am
I don't understand "saving the SSIS package produced by the import wizard"
the process of importing these files needs to be automated. Therefore, the object of the ssis package is not one of simply importing a single.txt file one time and putting the tools away.
I am not asking for someone to help me import a txt file...but to brainstorm why ssis is not doing the job. Ultimately the solution has to be in ssis, and if I have to I will, for consistency, change all my dataflows to use bulk insert script task....because I can't rely on the most simplest of FF source to OLE DB transformation flows....it appears.
This is a stumper for me and my colleague.
December 31, 2014 at 9:16 am
KoldCoffee (12/31/2014)
I don't understand "saving the SSIS package produced by the import wizard"the process of importing these files needs to be automated. Therefore, the object of the ssis package is not one of simply importing a single.txt file one time and putting the tools away.
I am not asking for someone to help me import a txt file...but to brainstorm why ssis is not doing the job. Ultimately the solution has to be in ssis, and if I have to I will, for consistency, change all my dataflows to use bulk insert script task....because I can't rely on the most simplest of FF source to OLE DB transformation flows....it appears.
This is a stumper for me and my colleague.
I'm not sure how much more I can say.
If you run your package, you have a duplicates problem.
If you run the package which is created by the import wizard, there is no problem.
If you edit both packages side by side, there will be a difference in the way that the import wizard performs the import which explains why it works and yours does not. Identify the difference and then apply it to your package.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
December 31, 2014 at 9:21 am
It sounds promising but I don't understand
"If you edit both packages side by side, there will be a difference in the way that the import wizard performs the import which explains why it works and yours does not. Identify the difference and then apply it to your package. "
how do you edit import export wizard side by side? Do you mean expose the xml or code layer of import wizard and same of ssis package and compare contrast line by line? If so, how to expose code layer of each?
December 31, 2014 at 9:29 am
Now it's my turn to be stumped! I don't know what you mean.
Are you aware that the import wizard creates a package "behind the scenes" & then runs it?
As you run through the import wizard steps, you will be asked whether you want to save this package (have a look here).
Choose yes, save it to the file system and then just edit it as usual.
The absence of evidence is not evidence of absence.
Martin Rees
You can lead a horse to water, but a pencil must be lead.
Stan Laurel
December 31, 2014 at 10:39 am
...:w00t:
December 31, 2014 at 10:40 am
Well I never! I will take a look at import wizard with new eyes as soon as in office. Btw, keyword id column is float in source column and I am setting destination as bigint....without explicit conversion.
January 1, 2015 at 12:46 am
i saw option to save as ssis package. examined it and found source text was importing keyword_id as string.
what I was doing in SSIS when the duplicates were happening was assigning the keyword_id in source to 'float' and on destination to bigint and then running the data flow task.
********
It turns out that in computer land it's well known that float will round numbers and in flight , before landing in the ole db destination table. So duplicates were being caused by rounding.
*********
It had been a while since I'd built the data flow, and I'd forgotten that by default the source file uses string as keyword_id.
Examining Import Wizard's ssis package revealed that the keyword_id is presented as a string.
*******
Solution
I added a dataconversion step between FF Source and OLEDB destination to convert keyword_id to decimal first, and then import to a bigint column. That worked without causing duplicates.
*******
thanks Phil, great help!
January 6, 2015 at 1:53 pm
Glad you worked this out, as I suspected from reading the thread that it was some strange conversion taking place. I have found that converting to string in these situations in general isolates the problem column.
----------------------------------------------------
Viewing 12 posts - 1 through 11 (of 11 total)
You must be logged in to reply to this topic. Login to reply