Text File Import Problem

  • Hi to all,

    I have recently joined a firm as a Data Analyst, and part of my daily job requires me to go through a lot of data. I usually carry out most of my work in Excel, as it's enough for most tasks, and occasionally use SQL if the data is really huge. The issue I face is this: until now, I have imported files into SQL (I'm using SQL Server 2008 Management Studio) from Excel sheets and had no problem, but now I have to import a text file of about 800 MB in size, and when I run the Import/Export Wizard, I'm greeted with an error that says 'Data Conversion failed'. I need help regarding this. Before I go into the problem itself, please let me tell you: I am a newbie to SQL.

    Now, please allow me to describe my predicament in further detail.

    I am trying to import a text file of about 800 mb in size that has data in the following format:

    (A sample of data from the txt file)

    Then I choose the Data Source as Flat File from the drop-down, and everything seems to be fine until I try to run the package, when the following error hits me in the face:

    I've tried to change the data types for source and destination, but to no avail... To be honest, I don't even know which data types to assign to the columns... The values of column 1 in the text file can go up to 1,000,000 or more.

    Please try and help me... without resolving this issue I'm stuck and everything has come to a standstill here... not many people to approach for help here either :-(

    Thanks in advance to anyone willing to take a look.

  • Can you please provide some sample data from the text file? And let me know what code page you are using?

  • The data in the text file is something like this:

    COLUMN0 COLUMN1

    yahoo510297

    kijiji376608

    facebook login350253

    gmail313800

    windows live292609

    img wlxrs286278

    hotmail.com272570

    you tube263347

    http://www.facebook.com252276

    google.ca237275

    yahoo mail214191

    microsoft messenger190257

    nbcsports msnbc186310

    msn184693

    image xboxlive181117

    ebay175051

    yahoo.ca146628

    plenty of fish134887

    google maps134701

    canada 411134323

    craigslist129121

    yahoo.com128963

    google.com128295

    face book126520

    youtube.com117249

    mapquest116769

    plentyoffish111863

    img2 wlxrs96811

    porn93996

    westjet91128

    air canada88518

    mls.ca88008

    As you can see, the data in 'COLUMN1' is strictly numbers (positive integers), while the data in 'COLUMN0' can be anything (text, symbols, numbers, etc.)

    ..........and the data is tab-delimited.

  • Would it be possible to attach an extract from the actual file you're having difficulties with? A hundred or so records in the actual format the import is attempting to handle would be useful.

    Typing the data yourself no doubt removes any of the problem data the import is encountering.

    BrainDonor

  • I had a similar problem. In regard to the truncation errors, what I initially did was create a staging table and set all my character fields to TEXT. That way, I was intentionally making them larger than my data so there'd be no truncation errors. As I gained more experience, I eventually learned how to use BCP to elegantly import text using proper column types and without using a staging table.

    As for your issue, I have two thoughts:

    • If your first column can contain "anything," you might want to check your collation to make sure it can handle it.
    • The numbers in your col2 look large, and it's possible (speculation on my part) that the int type might not be able to handle it. (I don't remember the int limit off the top of my head -- can anyone else expand on this?)

    Just my $0.02 . . .

    +--------------------------------------------------------------------------------------+
    Check out my blog at https://pianorayk.wordpress.com/

  • OK, I will provide a sample. Actually, I cannot open the original file because it's about 830 MB in size and my Notepad can barely handle it. I have tried opening it, but it takes an eon for Notepad to respond and it still doesn't open the file. So I have imported the data into Excel, and though I couldn't import all of it (as there's a row limit in Excel), a large amount of it got imported, and I'm now pasting over 100 rows from the Excel sheet:

    chfi5985

    alc5971

    garth brooks announces he's coming out of retirement thursday5968

    gamespot5950

    osap5944

    canada5938

    go transit5937

    saskatoon star phoenix5934

    192.168.1.15920

    e-bay5915

    cibc personal banking5913

    starfall5913

    charity village5909

    windows update5886

    http://www.kijiji.com5886

    google.5881

    kelli carpenter5875

    http://www.googleca5846

    gogole5839

    6495831

    future shop.ca5828

    680 news5823

    superstore5811

    "berlin heart" + charles tillman + washington5797

    disney5795

    funny games5790

    toyota5774

    rogers yahoo mail5773

    canada.com5771

    job bank canada5761

    750-pound shark5747

    flight status5742

    flyerland.ca5740

    great west life5732

    literotica.com5729

    toronto weather5723

    http://www.westjet.com5713

    gthl5708

    yahoo email5696

    avon.ca5694

    cat food diet5691

    yahoo answers5682

    torontostar5681

    allegiant air5676

    no frills5676

    http://www.f5675

    ford canada5674

    faceboo5672

    hoymail5669

    file extension pps5668

    big boobs5658

    scotia online5650

    globeandmail5640

    future shop canada5638

    http://www.hotmail5638

    waiter jon-barrett ingels5630

    cbs5628

    rosie o'donnell kelli carpenter5628

    rogers video5627

    ikea canada5620

    cam4.com5616

    download.com5602

    the province5593

    canadapost5584

    lotteries5584

    kate gosselin5568

    nasza klasa5555

    southwest airlines5552

    days of our lives5549

    mass murderer dies5548

    kate beckinsale5547

    blackberry5510

    facebook applications5500

    halloween costume ideas5496

    perezhilton5496

    plenty of fish in the sea5495

    surf the channel5487

    google news5486

    justintv5480

    http://www.sympatico.ca5438

    rogers.yahoo.com5427

    aol mail5426

    yahoo fantasy hockey5412

    cbc sports5410

    img microsoft5408

    maria shriver apologizes5408

    tim hortons5402

    farmville5395

    party packagers5395

    rogers.ca5385

    toyota canada5384

    royalbank.com5380

    yotube5376

    caa5362

    adam lambert5357

    pbskids5342

    the hun5338

    sex videos5332

    rona.ca5331

    meteomedia5324

    q1075322

    webmail5314

    silverdaddies5312

    home depot.ca5302

    soupy sales5300

    vancity5300

    tv shack5299

    coronation street5298

    dr oz5297

    localtvmatters.ca5291

    ups tracking5291

    servicecanada5289

    i3 microsoft5283

    sarnia observer5276

    shaw cable5274

    slutload5267

    new york times5254

    virgin mobile5242

    om5241

    tdcanadatrust.ca5241

    'jon & kate'5237

    avg free download5236

    hotwire5214

    mail5212

  • Sheez . . . that's quite the variety of data.

    Here's how I'd tackle it . . .

    • For the first column, make it a VARCHAR(MAX) column (equivalent to TEXT, which Microsoft is going to deprecate)
    • Make the second column INT

    One thing to bear in mind: the single quotes (') in the first column will likely cause problems, because T-SQL makes extensive use of them.


  • One other thought (speaking of stray characters): data cleansing is always a good thing. For example, since you're importing from Excel, it's probably using the TAB character as a delimiter. If any TAB characters exist in your first column, it's going to halt your import.


  • @ray K

    You mean change the column data types for the 'Data Source'? Because I can't... the defaults are as follows:

    Column 0 - string [DT_STR]

    Column 1 - single-byte signed integer [DT_I1]

    ....and they remain the same after pressing the 'Suggest types'....

    and there are no varchar or nvarchar choices available in the drop-down... I'm using the Import/Export Wizard, BTW...

  • No. What I'm saying is change the destination to VARCHAR(MAX). That would make the column large enough to handle variable-length character data up to whatever the upper limit is (I don't remember the limit off the top of my head). That should be enough to accommodate that column and take care of any truncation issues.
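
    On that note, the DT_I1 type the wizard suggested in the previous post is worth a second look: DT_I1 is a single-byte signed integer, so it can only hold values from -128 through 127, and a count like 510297 overflows it immediately. That alone could produce the 'Data Conversion failed' error. A quick Python sketch of the ranges involved (purely illustrative; the wizard itself obviously isn't Python):

```python
# Ranges of the integer types involved: DT_I1 (1 byte, signed) is the
# wizard's suggestion; SQL Server's INT is 4 bytes, signed.

def signed_range(num_bytes):
    """Return (min, max) of a signed two's-complement integer."""
    bits = num_bytes * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

dt_i1_min, dt_i1_max = signed_range(1)  # -128, 127
int_min, int_max = signed_range(4)      # -2147483648, 2147483647

sample_count = 510297                   # first count in the posted data
print(dt_i1_min <= sample_count <= dt_i1_max)  # False: overflows DT_I1
print(int_min <= sample_count <= int_max)      # True: fits INT easily
```

    So even values "up to 1,000,000 or more" sit comfortably inside a regular INT; it's the single-byte suggestion that's the problem.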

    Also (and if I'm mistaken with this, someone please correct me), I believe VARCHAR(MAX) is flexible enough to be able to handle different types of characters.

    At the same time, you also need to make sure your data actually can be imported. I had a situation where I was importing log files, and the format was changed in the middle of a month (specifically, an extra column was added). I had to make an adjustment in order to accommodate the extra column. But because these were log files, I wasn't allowed to make any edits to the data source files themselves.

    Your situation, however, looks like a different case. Your column contains inconsistent data, and I'm going to assume that you can safely edit the data source without compromising anything. Strange data characters (like the TAB character I mentioned before, for example) can stop an import in its tracks. That's what data cleansing is: making sure your data is in a consistent format so that it's usable by the database.
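
    To make the cleansing advice concrete, here is a minimal Python sketch that flags rows a two-column, tab-delimited import would choke on (the layout is taken from this thread; the file path in the usage comment is just an example):

```python
# Pre-scan a tab-delimited extract before importing it: every row should
# have exactly two fields, and the second field (the hit count) should
# be a positive integer. Anything else would trip up a two-column import.

def find_bad_rows(lines):
    """Return (line_number, row) for each row that breaks the layout."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        row = line.rstrip("\r\n")
        fields = row.split("\t")
        if len(fields) != 2 or not fields[1].isdigit():
            bad.append((lineno, row))
    return bad

# Streaming usage against a big file (path is just an example):
# with open(r"C:\TEMP\Imports\extract.txt", encoding="cp1252") as f:
#     for lineno, row in find_bad_rows(f):
#         print(lineno, repr(row))
```

    Because it reads line by line, it can cope with the full 800 MB file; any line numbers it reports are the ones to inspect and clean before retrying the wizard.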

    I also mentioned the single-quote; that shouldn't be too much of a problem with imports (unless you're using it as a text qualifier), but it could be an issue with INSERTS, UPDATES, and SELECTS. (For example: "select * from TABLE where COLUMN = 'O''Brien'" -- note that you need two single quotes, because one single quote would result in an error.)

    Hope this helps. Good luck.


  • I just remembered something . . .

    VARCHAR(MAX) is equivalent to the TEXT type in SQL Server, so you could change the column type to TEXT. However, Microsoft is deprecating that data type, and they suggest you use VARCHAR(MAX) instead.


  • I've taken the data you posted and turned it into something that may actually resemble your data. Because you just copied it into your message, it came out as one column, so I have attached a spreadsheet here that has two columns.

    In future try to include the precise data in the correct format so we're all working from the same starting point. It isn't always easy, but I also find that going through that process sometimes helps me find the answer before I complete the post.

    There's an article about this at: http://www.sqlservercentral.com/articles/Best+Practices/61537/.

    The code I have pasted below is a very basic import from an Excel spreadsheet into a temporary table. When you run it, I would like to know if it gives you any errors. We should then be able to work from this starting point.

    I understand that you have very little SQL knowledge, so we'll keep it as simple as possible.

    The core of it creates a temporary table with two varchar columns of the maximum size that SS 2008 can create.

    Then it uses the OPENROWSET command to import from the spreadsheet (you can change the filename and location to whatever you require). This should import the data from the first sheet of that spreadsheet into the temporary table.

    The SELECT statement that follows is just to show whatever is in that temp table.

    With the number of records in your spreadsheet it may take a while. I've never tried this method for any reason other than curiosity, so I couldn't begin to guess.

    When you run this code you may get a message along the lines of:

    "SQL Server blocked access to STATEMENT 'OpenRowset/OpenDatasource' of component 'Ad Hoc Distributed Queries' because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of 'Ad Hoc Distributed Queries' by using sp_configure. For more information about enabling 'Ad Hoc Distributed Queries', see "Surface Area Configuration" in SQL Server Books Online. "

    If you do see this, then you also need to run the code that is commented out. Remove the '--' from the start of each line. This changes the SQL Server configuration to allow the OPENROWSET command to run, and then removes the permission again at the end.

    The temporary table is just while we're messing about. You can create a permanent table instead of course.

    --DROP TABLE #ExcelImport

    --EXEC sp_configure 'show advanced options', 1

    --RECONFIGURE

    --GO

    --EXEC sp_configure 'Ad Hoc Distributed Queries', 1

    --RECONFIGURE

    --GO

    CREATE TABLE #ExcelImport(

    ImportText VARCHAR(MAX),

    ImportNumber VARCHAR(MAX))

    INSERT INTO #ExcelImport(ImportText, ImportNumber)

    SELECT * FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',

    'Excel 8.0;Database=C:\TEMP\Imports\TestImport.xls;HDR=YES',

    'SELECT * FROM [Sheet1$]')

    SELECT * FROM #ExcelImport

    --EXEC sp_configure 'Ad Hoc Distributed Queries', 0

    --RECONFIGURE

    --GO

    --EXEC sp_configure 'show advanced options', 0

    --RECONFIGURE

  • >RayK

    Hi,

    Thank you very Much for your suggestions:)

    >Braindonor

    Hi,

    Firstly, thank you very much for your interest and patience. One couldn't have approached my problem in a better way, I suppose, and I appreciate it.

    Secondly, I've read the Forum Posting Etiquette and I believe I now have a clue as to how to ask for help on a professional forum. Thanks, I'll make sure I follow it in the future :)

    Now -- I've tried the code you provided and was greeted by the error you already predicted. So I removed the comments from the code, tried to execute it, and encountered the following error:

    Msg 102, Level 15, State 1, Line 2

    Incorrect syntax near 'sp_configure'.

    Msg 102, Level 15, State 1, Line 19

    Incorrect syntax near 'show advanced options'.

    Msg 15123, Level 16, State 1, Procedure sp_configure, Line 51

    The configuration option 'Ad Hoc Distributed Queries' does not exist,

    or it may be an advanced option.

    So I also made a few searches for info about Surface Area Configuration and found that I can tweak the settings here: Server --> Facets --> Surface Area Configuration. While I couldn't find the 'Ad Hoc Distributed Queries' option, there was one which read 'AdHocRemoteQueriesEnabled', which was 'False'. I turned it to 'True', tried to execute the program, and it popped up the following error message:

    OLE DB provider "Microsoft.Jet.OLEDB.4.0" for linked server "(null)"

    returned message "Unspecified error".

    Msg 7303, Level 16, State 1, Line 14

    Cannot initialize the data source object of OLE DB provider

    "Microsoft.Jet.OLEDB.4.0" for linked server "(null)".

    I guess those two options are not simply interchangeable, after all...

    So, after that I've searched how to enable 'Ad Hoc Distributed Queries' and found some info here:

    http://msdn.microsoft.com/en-us/library/ms187569.aspx

    However, it was the same as what you had written in the code you provided. So I don't understand why:

    1. It throws up a syntax error when I try to execute your code.

    2. I can't see the 'Ad Hoc Distributed Queries' option itself in the Surface Area Configuration window.

  • You were right - AdHocRemoteQueriesEnabled is the option you wanted. To see what the settings are for the various options you can run

    SELECT * FROM sys.configurations

    ORDER BY name

    and you should now see 'Ad Hoc Distributed Queries' with a value of 1.

    The only reference I've found for the Jet error message is at http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/9605444f-a6e1-40ab-be9f-f40362672039, which then points to http://support.microsoft.com/default.aspx?scid=kb;en-us;239114 for the latest version of the Jet database engine.

    I would suggest taking the steps listed in it to check you have the latest files installed. Be careful though - there are different downloads depending upon which version of Windows you are using. If in doubt and you have PC support staff where you are, ask them.

    BrainDonor

  • Hi folks,

    >BrainDonor: Thank you, BrainDonor :-)

    I have finally got around the problem of importing the data from the text file into the SQL db. Here's what I did: I split the 800 MB file into 30 smaller text files and tried to upload each one. When they failed to import again, I did a little fiddling around and found that the whole problem was with the text file ENCODING... all of them (including the actual 800 MB original) were in ANSI text format, and my SQL Import/Export Wizard accepted only UNICODE. So I was able to import some files after changing their encoding from ANSI to UNICODE.

    Now I've got two simple questions:

    1.) I know there are a few editors out there that can open very large text files (PFE, Gvim, etc.), but is there one that would let me open my 800 MB file (ANSI) and save it as a UNICODE text file? I haven't seen that option in any of the text editors I've used so far :-(

    or

    2.) I have 30 text files (of about 25 MB each), and each time I import one of them a NEW table is created in the SQL db. So my question is: is there any way to append/combine the tables after/before importing, so I would have one big table instead of 30 smaller ones?

    Can someone help me? Any help appreciated.

    Thanks in Advance.
