Importing data with lots of columns

  • I have a whole bunch of tab-delimited data files with many columns (100 to 150, depending on the year of the file). I have about 30 years of this data, and the exact structure varies based on what they were collecting that year.

    A lot of the columns are race columns like "Korean 0/1" and "Hispanic 0/1".

    I tried importing the data as a flat file and using the suggest types function, but the 1,000-row preview limit seems to be causing it to mis-identify fields as something they're not. Then the import fails, and I have to go back and tweak that column, run it again, tweak again, etc. Very time consuming.

    The best luck I've had was importing all the data as varchar(50) and then going back and converting the data to the best data type (sketched at the end of this post).

    My questions are:

    Is there a better way/tool for doing this than "suggest types"?

    Is there a good way to clean up/determine the best data types after the fact (maybe an included tool)?

    This is a problem I come up against often, as I work with a lot of gov't data sets AND GIS data, and I find myself spending a lot of my time just cleaning up these datasets so that I can use them. I really don't think it should be this difficult/time consuming.

    I don't know, maybe it is wishful thinking on my part
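
    Roughly what I mean by the varchar-then-convert approach is below. This is only a rough sketch: the table and column names are made up, and TRY_CONVERT needs SQL Server 2012 or later (on older versions ISNUMERIC/ISDATE are much looser substitutes).

        -- Everything staged as varchar first, then each column profiled to pick a tighter type.
        -- Table and column names here are made up.
        CREATE TABLE dbo.DeathStage
        (
            DateOfDeath  varchar(50) NULL,
            AgeYears     varchar(50) NULL,
            HispanicFlag varchar(50) NULL
        );

        -- Profile one column: how many rows survive each candidate conversion, and how wide is it really?
        SELECT  COUNT(*) AS TotalRows,
                SUM(CASE WHEN AgeYears IS NULL OR TRY_CONVERT(int,  AgeYears) IS NOT NULL THEN 1 ELSE 0 END) AS OkAsInt,
                SUM(CASE WHEN AgeYears IS NULL OR TRY_CONVERT(date, AgeYears) IS NOT NULL THEN 1 ELSE 0 END) AS OkAsDate,
                MAX(LEN(AgeYears)) AS MaxLen
        FROM    dbo.DeathStage;

        -- If every row converts cleanly, tighten the column in place:
        ALTER TABLE dbo.DeathStage ALTER COLUMN AgeYears int NULL;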

  • If you find something that's really good, let me know, because I've been doing this for a long time and I don't know of anything I would call really good.

    The best luck I've had was importing all the data as a varchar 50 and then going back and converting the data to the best data type.

    Hah! That's exactly what I do (except I use varchar(255)).

    The one thing that I might suggest is to take a look at Access. Many people find its facilities for this kind of thing better or easier to use than SQL Server's. If you do too, then you can use the Access tools to bring a dataset into an MS Access database and then use the Data Upgrade facility to automatically migrate it to a SQL Server database.

    -- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • I'm actually surprised there's been only one response...

    Maybe it's not as common a task as I thought it was.

    Does anyone know of any good third-party products that are useful for identifying column data types and/or importing data?

    When you're importing from text files, or Excel/CSV/etc., one of the little-known (apparently) factors is that SQL uses the top few rows to determine the data type. If it finds numbers in those, it will render string values later in the same column as NULL.

    Is that the kind of thing you're running into? If so, there are solutions for that, but they require a registry edit and a computer restart, and you have to set the IMEX property on the connection string for the file (example at the end of this post).

    It sounds like you might be running into something else, but I'm not clear on what. Can you provide a couple of sample files, preferably with the data anonymized? If so, I can almost certainly streamline the import process for you.

    Another question is how many file-format variations you have. Are there dozens of different formats? Does every file have a different format? A dozen? What kind of range is there?
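
    For reference, this is roughly what that looks like. The path and sheet name below are placeholders, and the registry value shown is the one for the Jet 4.0 Excel driver (the newer ACE provider has an equivalent key).

        -- Read an Excel sheet with IMEX=1 so mixed-type columns come through as text
        -- rather than having the minority type nulled out (path and sheet name are placeholders).
        SELECT *
        FROM   OPENROWSET('Microsoft.Jet.OLEDB.4.0',
                          'Excel 8.0;Database=C:\data\deaths1998.xls;HDR=YES;IMEX=1',
                          'SELECT * FROM [Sheet1$]');

        -- The registry value that controls how many rows the driver samples when guessing types
        -- (this is the registry-edit-and-restart part mentioned above):
        -- HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows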

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Well, first up, I'm using SQL to analyze the data.

    In this particular project, I have a row for every person who's died in Texas since 1976: how they died, where, etc. The data is already de-identified, but it's still IRB protected, so I can't really post it.

    Part of my problem is that about every 5 to 10 years they decide to change what data points they collect, so the tables look different.

    Since I'm not adding to this, I don't care too much about how the data is stored in the database as long as:

    - none of it is lost (I keep getting truncation errors unless I make the varchar fields WAY too big)

    - the indexes are efficient (I think varchar indexes tend to be slower; I could be wrong)

    - it doesn't take a HUGE amount of hard drive space (I don't like the fact that the mdf file tends to be over twice the size of the original text file)

    Then quite a few of the fields are Yes/Nos, which convert pretty well to bit fields (rough sketch below), but overall I've found this to be a very time-consuming task. More and more, my bosses want access to and analysis of this data, so I'm looking for a good solution for this project and the ones to come.
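
    For those Yes/No columns, the cleanup ends up looking something like this (table and column names are made up, and the actual codes vary by year):

        -- Normalize whatever the year's file used for yes/no before tightening the column to bit.
        UPDATE dbo.DeathStage
        SET    HispanicFlag = CASE WHEN HispanicFlag IN ('1', 'Y', 'YES') THEN '1'
                                   WHEN HispanicFlag IN ('0', 'N', 'NO')  THEN '0'
                                   ELSE NULL
                              END;

        ALTER TABLE dbo.DeathStage ALTER COLUMN HispanicFlag bit NULL;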

  • In that case, with file formats changing every few years, I'd be inclined towards setting up XML definition files and using OpenRowset Bulk to import the data.

    I've used that method to import large, complex files. It's very efficient.
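
    The general shape of it is below, with made-up file and column names: one XML format file per year's layout (trimmed here to two fields), plus a generic load statement.

        -- Load one year's file through its own XML format file (names are made up).
        INSERT INTO dbo.DeathStage (DateOfDeath, HispanicFlag)
        SELECT  src.DateOfDeath, src.HispanicFlag
        FROM    OPENROWSET(BULK 'C:\data\deaths1998.txt',
                           FORMATFILE = 'C:\data\deaths1998.fmt.xml',
                           FIRSTROW   = 2) AS src;

    deaths1998.fmt.xml (tab-delimited, trimmed to two fields):

        <?xml version="1.0"?>
        <BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
          <RECORD>
            <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\t"   MAX_LENGTH="50"/>
            <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="50"/>
          </RECORD>
          <ROW>
            <COLUMN SOURCE="1" NAME="DateOfDeath"  xsi:type="SQLVARYCHAR"/>
            <COLUMN SOURCE="2" NAME="HispanicFlag" xsi:type="SQLVARYCHAR"/>
          </ROW>
        </BCPFORMAT>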

