Taking a single 80 character piece of Text and converting to columns in a table.

Question

Taking a single 80 character piece of Text and converting to columns in a table.

Viewing 15 posts - 16 through 30 (of 73 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 1

Thom A wrote:

How to you get 26 different columns from the string 'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P'? I see 6 at most. Where are the other 20 coming from?

Fixed field length, Thom... oldest format in the world. Look at the table the OP designed and realize all the columns are fixed width. This is truly a "Unit Record Definition".

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 2

Lynn Pettis wrote:

Is this coming to you in a file?

You beat me to that question, Lynn. If it is (and I think this is where your going with this), we can use a BCP format file along with BULK INSERT.

In case some of you don't know, it's actually one of the fastest forms of import there is because all of the "splitting" is done at the machine language level and it doesn't need to find delimiters.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 3

Jeff Moden wrote:

Lynn Pettis wrote:
Is this coming to you in a file?
You beat me to that question, Lynn. If it is (and I think this is where your going with this), we can use a BCP format file along with BULK INSERT.
In case some of you don't know, it's actually one of the fastest forms of import there is because all of the "splitting" is done at the machine language level and it doesn't need to find delimiters.

We are definitely thinking the same thing here.

Jonathan AC Roberts SSCoach Points: 18355 More actions · Answer 4

I don't think there is anything wrong with multiple substring commands, but if there are a lot of different row types and you want to make coding a bit simpler you could go with something like this:

DECLARE  @StringFormat TABLE(id  int identity(1,1) NOT NULL, Length int NOT NULL)
INSERT INTO @StringFormat (Length)
VALUES (2),(1),(6),(6),(6),(7),(1),(1),(2),(4),(4),(1),(8),(1),(3),(4),(3),(6),(1),(1),(1),(4),(4),(1),(1)

;WITH RawData AS 
(
    SELECT 1 RowNum, 'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P' Line
    UNION ALL
    SELECT 2, 'BSNC002647745191912080000001 PTE5P02 146650455 RTY 094D Q' 
),
CTE2 AS
(
    SELECT *,
           SUBSTRING(RawData.Line,X.StartPos+1,X.Length) Col
      FROM RawData
    CROSS APPLY(SELECT TOP(8000) 
                       s.Id,
                       s.Length,
                       SUM(s.Length) OVER (ORDER BY s.Id)-s.Length StartPos
                  FROM @StringFormat s ORDER BY s.id) X
)
SELECT MAX(IIF(Id=1,Col,NULL))  recordidentity,
       MAX(IIF(Id=2,Col,NULL))  transactiontype,
       MAX(IIF(Id=3,Col,NULL))  trainuid,
       MAX(IIF(Id=4,Col,NULL))  daterunsfrom,
       MAX(IIF(Id=5,Col,NULL))  daterunsto,
       MAX(IIF(Id=6,Col,NULL))  daysrun,
       MAX(IIF(Id=7,Col,NULL))  [bankholiday running],
       MAX(IIF(Id=8,Col,NULL))  headcode,
       MAX(IIF(Id=9,Col,NULL))  traincategory,
       MAX(IIF(Id=10,Col,NULL)) courseindicator,
       MAX(IIF(Id=11,Col,NULL)) trainservicecode,
       MAX(IIF(Id=12,Col,NULL)) portionid,
       MAX(IIF(Id=13,Col,NULL)) powertype,
       MAX(IIF(Id=14,Col,NULL)) timingload,
       MAX(IIF(Id=15,Col,NULL)) speed,
       MAX(IIF(Id=16,Col,NULL)) operatingcharacteristics,
       MAX(IIF(Id=17,Col,NULL)) seatingclass,
       MAX(IIF(Id=18,Col,NULL)) sleepers,
       MAX(IIF(Id=19,Col,NULL)) reservations,
       MAX(IIF(Id=20,Col,NULL)) connectionindicator,
       MAX(IIF(Id=21,Col,NULL)) cateringcode,
       MAX(IIF(Id=22,Col,NULL)) spare,
       MAX(IIF(Id=23,Col,NULL)) stpindicator
  FROM CTE2
 GROUP BY CTE2.RowNum

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 5

MVDBA (Mike Vessey) wrote:

my advice - fix the source file, put a delimiter in there and then have whoever designed that file format go and work for someone else.
In my experience it's a lot easier if you get good data in (The GIGO principal). plus it takes the liability for a failed import off your shoulders.

Nah... don't change a thing. The source is actually in a perfect format that requires no delimiters. No wasted space, no wasted time figuring out where delimiters are. Using a BCP format file and BULK INSERT would make this import/parse absolutely fly. That's if the source is actually a file.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

MVDBA (Mike Vessey) SSC-Insane Points: 21797 More actions · Answer 6

Jeff Moden wrote:

MVDBA (Mike Vessey) wrote:
my advice - fix the source file, put a delimiter in there and then have whoever designed that file format go and work for someone else.
In my experience it's a lot easier if you get good data in (The GIGO principal). plus it takes the liability for a failed import off your shoulders.
Nah... don't change a thing. The source is actually in a perfect format that requires no delimiters. No wasted space, no wasted time figuring out where delimiters are. Using a BCP format file and BULK INSERT would make this import/parse absolutely fly. That's if the source is actually a file.

jeff, I think this is the first time I've disagreed with you 🙂 lol

if some plonker decides to change a field length or a datetime to just a date then the string length changes and you have to recode all of the substrings from scratch.. delimeters all the way - just not CSV , I use ~

MVDBA

Jonathan AC Roberts SSCoach Points: 18355 More actions · Answer 7

MVDBA (Mike Vessey) wrote:

Jeff Moden wrote:
MVDBA (Mike Vessey) wrote:
my advice - fix the source file, put a delimiter in there and then have whoever designed that file format go and work for someone else.
In my experience it's a lot easier if you get good data in (The GIGO principal). plus it takes the liability for a failed import off your shoulders.
Nah... don't change a thing. The source is actually in a perfect format that requires no delimiters. No wasted space, no wasted time figuring out where delimiters are. Using a BCP format file and BULK INSERT would make this import/parse absolutely fly. That's if the source is actually a file.
jeff, I think this is the first time I've disagreed with you 🙂 lol
if some plonker decides to change a field length or a datetime to just a date then the string length changes and you have to recode all of the substrings from scratch.. delimeters all the way - just not CSV , I use ~

It looks like the format of files Cobol generates, fixed width exports were/are the standard export format from mainframes.

MVDBA (Mike Vessey) SSC-Insane Points: 21797 More actions · Answer 8

Jonathan AC Roberts wrote:

MVDBA (Mike Vessey) wrote:
Jeff Moden wrote:
MVDBA (Mike Vessey) wrote:
my advice - fix the source file, put a delimiter in there and then have whoever designed that file format go and work for someone else.
In my experience it's a lot easier if you get good data in (The GIGO principal). plus it takes the liability for a failed import off your shoulders.
Nah... don't change a thing. The source is actually in a perfect format that requires no delimiters. No wasted space, no wasted time figuring out where delimiters are. Using a BCP format file and BULK INSERT would make this import/parse absolutely fly. That's if the source is actually a file.
jeff, I think this is the first time I've disagreed with you 🙂 lol
if some plonker decides to change a field length or a datetime to just a date then the string length changes and you have to recode all of the substrings from scratch.. delimeters all the way - just not CSV , I use ~
It looks like the format of files Cobol generates, fixed width exports were a the standard export format from mainframes.

yuk, you just made me shiver we had an as/400 with cobol and oracle.... this was over 15 years ago . I still have nightmares

MVDBA

Jonathan AC Roberts SSCoach Points: 18355 More actions · Answer 9

MVDBA (Mike Vessey) wrote:

Jonathan AC Roberts wrote:
MVDBA (Mike Vessey) wrote:
Jeff Moden wrote:
MVDBA (Mike Vessey) wrote:
my advice - fix the source file, put a delimiter in there and then have whoever designed that file format go and work for someone else.
In my experience it's a lot easier if you get good data in (The GIGO principal). plus it takes the liability for a failed import off your shoulders.
Nah... don't change a thing. The source is actually in a perfect format that requires no delimiters. No wasted space, no wasted time figuring out where delimiters are. Using a BCP format file and BULK INSERT would make this import/parse absolutely fly. That's if the source is actually a file.
jeff, I think this is the first time I've disagreed with you 🙂 lol
if some plonker decides to change a field length or a datetime to just a date then the string length changes and you have to recode all of the substrings from scratch.. delimeters all the way - just not CSV , I use ~
It looks like the format of files Cobol generates, fixed width exports were a the standard export format from mainframes.
yuk, you just made me shiver we had an as/400 with cobol and oracle.... this was over 15 years ago . I still have nightmares

Also, the dates are formatted yymmdd it looks like they didn't fix the format for the y2k bug. Do you remember that? 😀

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 10

MVDBA (Mike Vessey) wrote:

Jeff Moden wrote:
MVDBA (Mike Vessey) wrote:
my advice - fix the source file, put a delimiter in there and then have whoever designed that file format go and work for someone else.
In my experience it's a lot easier if you get good data in (The GIGO principal). plus it takes the liability for a failed import off your shoulders.
Nah... don't change a thing. The source is actually in a perfect format that requires no delimiters. No wasted space, no wasted time figuring out where delimiters are. Using a BCP format file and BULK INSERT would make this import/parse absolutely fly. That's if the source is actually a file.
jeff, I think this is the first time I've disagreed with you 🙂 lol
if some plonker decides to change a field length or a datetime to just a date then the string length changes and you have to recode all of the substrings from scratch.. delimeters all the way - just not CSV , I use ~

Wrote quite a few COBOL programs to export data to flat files to export data for importing into SQL Server. Was the simplest and fastest way to do the exports. Also did it the for files we created for exports from SQL Server that were then imported using COBOL programs. If the format of a file changed, we knew we had other changes to make. I wouldn't bash fixed width files, they work.

Mike01 SSChampion Points: 11395 More actions · Answer 11

we still use mainframe and AS400. This is how we get files all the time. Bulk Insert works well and fast. I still have arguments with mainframers as they don't understand why we need these new-fangled technologies 😉

For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

MVDBA (Mike Vessey) SSC-Insane Points: 21797 More actions · Answer 12

Mike01 wrote:

we still use mainframe and AS400. This is how we get files all the time. Bulk Insert works well and fast. I still have arguments with mainframers as they don't understand why we need these new-fangled technologies 😉

Hahahahaha - I had the same discussion with an oracle developer during a due diligence check. he wasn't using any kind of joins, just listing tables and putting everything into the where clause.

when I asked him about it I got the reply " we don't use these new fangled join types" - this was in 2007 - I politely pointed him to ANSI 92 documentation and then very unpolitely explained that his code was 15 years out of date

MVDBA

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 13

MVDBA (Mike Vessey) wrote:

Jeff Moden wrote:
MVDBA (Mike Vessey) wrote:
my advice - fix the source file, put a delimiter in there and then have whoever designed that file format go and work for someone else.
In my experience it's a lot easier if you get good data in (The GIGO principal). plus it takes the liability for a failed import off your shoulders.
Nah... don't change a thing. The source is actually in a perfect format that requires no delimiters. No wasted space, no wasted time figuring out where delimiters are. Using a BCP format file and BULK INSERT would make this import/parse absolutely fly. That's if the source is actually a file.
jeff, I think this is the first time I've disagreed with you 🙂 lol
if some plonker decides to change a field length or a datetime to just a date then the string length changes and you have to recode all of the substrings from scratch.. delimeters all the way - just not CSV , I use ~

Heh... agreed... first disagreement but not really. Just a different experience.

My experience has been that people that create such a fixed field record type for data transmission actually have their act together much better than most. If you're transmitting non-variable width datatypes, it also means you don't have the extra overhead of both transmitting extra characters as delimiters and they know that if they need to change the record layout, they'll transmit a new one, which is frequently easily and automatically consumable. The really good ones transmit such a thing with the data.

There also nothing to prevent some plonker/wonk from totally trashing delimited data no matter what you use for a delimiter. Not only can they change things like you mention (changing date/time to just date), they can (and frequently do) change the order of columns and do really stupid things like not including the last column if it doesn't have data in it or, regardless of delimiter, do stupid things like include the delimiter as a part of data.

To me, it doesn't matter what the data provider uses so long as it's correct and consistent. Delimited data is cool, especially if they include column names on the first row (and I love it when they include the datatype on the second row) because I can auto-magically create a target staging table and, possibly, a BCP format file based solely on the content of the delimited file. I also have a great love for fixed field formats simply because of their blinding speed when done properly. "Done Properly", in such cases, does mean using a BCP format file instead of loading lines from the file as a blob and doing a bunch of substrings.

The bottom line, though, is that regardless of the record layout, delimited or otherwise, the data provider actually needs to know what their doing and then do it right very consistently or you're screwed.

Shifting gears a bit, the idea of using CSV or TSV is because people wanted to be able to do things like have variable width data like names, addresses, etc, and they wanted to be able to manually check/troubleshoot the data using (bleah!) spreadsheets. Of course, that morphed to using all manner of different delimiters because comma and tab characters can actually be a part of the data. All of that sucks, IMHO. So does XML, JSON, and whatever other markup language you want to consider.

With that in mind, I'd love to see people go back to the ancient method for the transmission of non-graphical data using ASCII control characters. In particular, I'd like to see people go back to using ASCII character #30 as a record separator and ASCII character #31 for a unit separator, also know as a "delimiter". There are a few other control characters that are extremely useful in controlling data transmission but, the use of at least the two I mentioned would greatly simplify life a whole lot. I also wouldn't mind it if they went back to sending integer data as Big Endian binary data and decimal data in a binary decimal manner.

Of course, none of that will happen because most of the world still depends on spreadsheets for I/O and, for those that don't, they still have some overwhelming personal need to be able to view and edit data in one (or Notepad). 😀

With that, I'll leave my parting thought as an image...

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 14

Jonathan AC Roberts wrote:

MVDBA (Mike Vessey) wrote:
Jeff Moden wrote:
MVDBA (Mike Vessey) wrote:
my advice - fix the source file, put a delimiter in there and then have whoever designed that file format go and work for someone else.
In my experience it's a lot easier if you get good data in (The GIGO principal). plus it takes the liability for a failed import off your shoulders.
Nah... don't change a thing. The source is actually in a perfect format that requires no delimiters. No wasted space, no wasted time figuring out where delimiters are. Using a BCP format file and BULK INSERT would make this import/parse absolutely fly. That's if the source is actually a file.
jeff, I think this is the first time I've disagreed with you 🙂 lol
if some plonker decides to change a field length or a datetime to just a date then the string length changes and you have to recode all of the substrings from scratch.. delimeters all the way - just not CSV , I use ~
It looks like the format of files Cobol generates, fixed width exports were/are the standard export format from mainframes.

Heh... careful now... you'll piss off Lynn Pettis!

Truth be told, it IS an ancient form but, oh my, is it ever useful and fast. In fact, it used to be the way DB-II and DB-III structured it's tables. One file per table or index structured in a very similar manner. Variable width stuff was treated as "out of row", which was great and it all led to much smaller backups (don't back something up unless it actually changed) and made it possible to easily restore a single table as a copy of a table that got messed up.

{EDIT} I should have scrolled down. I see Lynn Pettis picked up on the Cobol reference, as expected. 😀

This reply was modified 4 years, 9 months ago by Jeff Moden.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004362 More actions · Answer 15

Jonathan AC Roberts wrote:

MVDBA (Mike Vessey) wrote:
Jonathan AC Roberts wrote:
MVDBA (Mike Vessey) wrote:
Jeff Moden wrote:
MVDBA (Mike Vessey) wrote:
my advice - fix the source file, put a delimiter in there and then have whoever designed that file format go and work for someone else.
In my experience it's a lot easier if you get good data in (The GIGO principal). plus it takes the liability for a failed import off your shoulders.
Nah... don't change a thing. The source is actually in a perfect format that requires no delimiters. No wasted space, no wasted time figuring out where delimiters are. Using a BCP format file and BULK INSERT would make this import/parse absolutely fly. That's if the source is actually a file.
jeff, I think this is the first time I've disagreed with you 🙂 lol
if some plonker decides to change a field length or a datetime to just a date then the string length changes and you have to recode all of the substrings from scratch.. delimeters all the way - just not CSV , I use ~
It looks like the format of files Cobol generates, fixed width exports were a the standard export format from mainframes.
yuk, you just made me shiver we had an as/400 with cobol and oracle.... this was over 15 years ago . I still have nightmares
Also, the dates are formatted yymmdd it looks like they didn't fix the format for the y2k bug. Do you remember that? 😀

Boy howdy, do I EVER! I was the Director of MIS for a small telephone company at the time. I kept getting forms from the PUC of every bloody state in the union asking about how we were preparing, what the impact on our customers might be, etc, ect, ad infinitum. AND, the form clearly stated that you had to fill out THEIR form instead of just stapling a common response to them. Every one of the buggers had a similar question as the last question... "What has cost your company the most time and effort in preparing for Y2K". My answer was (and in all caps) "FILLING OUT THIS STUPID FORM!". 😀

I used to work for a large DOD company. I started prepping all data I worked with for Y2K way back in 1980. Except for the damned PUC response forms, Y2K was a "no action required" task for me.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)