Stored procedures and creating tables (problem)

  • Sergiy (7/19/2016)


    Been there, done that.

    Exactly that.

    So then I have some questions, because I am still struggling with parts of the code.

    How do you handle an EOL within fields? (Or, a bit less dramatically: how do you handle delimiters within a field?)

    Especially handling the EOL within fields is a bit problematic; although I have a 'solution', it is slow and rather cumbersome.

    So how do you differentiate between an EOL between text qualifiers and an EOL which is the actual end of the line or row? (I am using Transact-SQL for this at the moment and my wish is to keep my technology limited to this only.)

    I'm pretty sure it's not 'legal', and to me - it's not reasonable.

    'Not reasonable' is, for humanity, hardly a reason not to do it.

    Not 'legal'? Where is it determined that this is not 'legal'?

    For myself I can think of a reason to mix text qualifiers. If there is a column which does not contain a double quote, then a double quote is an excellent text qualifier for that column. The same goes for another column which does not contain a single quote. For a parser which parses from start to finish this is hardly a problem: if the first non-white symbol in a field is a text qualifier, this is the text qualifier for that field.
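    A minimal T-SQL illustration of that rule (the variable and its value are made up, not actual parser code):

    DECLARE @field nvarchar(max) = N'  "some, quoted value"';
    DECLARE @firstChar nchar(1) = LEFT(LTRIM(@field), 1);

    -- If the first non-blank character is one of the allowed qualifiers,
    -- treat it as the text qualifier for this field; otherwise there is none.
    SELECT CASE WHEN @firstChar IN (N'"', N'''') THEN @firstChar END AS TextQualifier;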

    With SQL Server I am not parsing from start to finish but using RBAR, and it is quite difficult to parse these constructs effectively.

    I honestly do not see how this process can benefit from dynamically created static tables.

    1. The CSV fields are not known at design time; they are not known at the start of the process. At the end of the process they must be static tables. So there is no other way than to determine which fields there are and construct the table from this.

    So this is not only a benefit, but a requirement.

    2. Creating and 'removing' files in a process makes the process independent of other processes and creation routines. So the code becomes self-contained.

    3. Static tables (instead of temporary tables) resulted in far better performance.

    I don't see why the clustered index has to be removed.

    Can you name a single reason for doing that?

    I did a fair number of tests. I did not test all combinations and did not test the tested combinations extensively. Most time was consumed filling the table; without the clustered index this worked reasonably well, so I stuck with that. I haven't retested the end solution with a clustered index during the fill.

    So the reason for removing the clustered index was that in a specific solution filling was faster without the clustered index.

    Thanks for your reply,

    Ben Brugman

  • GSquared (7/20/2016)


    I built something very similar, but it uses XML instead of EAV, and NULLs out empty columns. Used less space and XML parsing was faster than EAV reconstitution. (I got the idea from Sparse Columns tables, which use XML for their actual storage.)

    I did build a parser on XML with the same technology, but this worked far more efficiently: within XML each '<' you encounter starts something, and within the 'fields' you will never encounter a '<', because it is replaced by the '&lt;' entity. So splitting up the XML is a 'one-step' process. Also, the EOL is just whitespace and does not signify anything.

    Code for this was relatively easy to build in Transact-SQL without using RBAR.

    One advantage was that, with hardly any coding, after (partly) parsing an XML document it was very easy to create output which was 'beautified' with correct indents etc.
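    To illustrate the XML route, a minimal T-SQL sketch (the table, element and attribute names are made up for this example):

    -- One XML document per imported row; empty columns are simply omitted.
    DECLARE @StagedRows table (FileID int, RowNo int, RowData xml);

    INSERT @StagedRows (FileID, RowNo, RowData)
    VALUES (1, 1, N'<r><c n="1">abc</c><c n="3">xyz</c></r>');

    -- Read column 3 of row 1 back out of the XML.
    SELECT RowData.value('(/r/c[@n="3"])[1]', 'nvarchar(4000)') AS Col3
    FROM   @StagedRows
    WHERE  FileID = 1 AND RowNo = 1;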

    My solution keeps a record of the rows received, and assigns a hash value to each, since many of the files have very low delta rates and send the exact same rows day after day after day. Dramatically reduced the amount of processing needed for subsequent files after the first one.

    But a hash is only a guess that something is the same. How did you determine that it was actually the same?
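    For what it's worth, one way to combine the hash shortcut with an exact check in T-SQL (a sketch with made-up table variables; note that HASHBYTES input is limited to 8000 bytes before SQL Server 2016):

    DECLARE @Known table (RowHash varbinary(32), RowText nvarchar(4000));
    DECLARE @New   table (RowText nvarchar(4000));

    INSERT @Known VALUES (HASHBYTES('SHA2_256', N'a,b,c'), N'a,b,c');
    INSERT @New   VALUES (N'a,b,c'), (N'x,y,z');

    -- The hash is only a cheap filter; the exact text comparison confirms the match.
    SELECT n.RowText AS GenuinelyNewRow
    FROM   @New AS n
    WHERE  NOT EXISTS (SELECT 1
                       FROM   @Known AS k
                       WHERE  k.RowHash = HASHBYTES('SHA2_256', n.RowText)
                         AND  k.RowText = n.RowText);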

    But mine's a specialized solution for a specific datastream.

    Most generic solutions start off as a specialized solution. In my case 6 specialized solutions are required; I have only seen 2 of them, so I thought building a generalized solution would be a good start for all of them.

    (Within my XML solution I did also create and remove temporary and static tables).

    Thanks for sharing your information.

    Ben

    (Edit: Indents instead of indexes for the beautifier)

  • ben.brugman (7/21/2016)


    How do you handle an EOL within fields? (Or, a bit less dramatically: how do you handle delimiters within a field?)

    Especially handling the EOL within fields is a bit problematic; although I have a 'solution', it is slow and rather cumbersome.

    I did not have to.

    Our suppliers were kind enough to agree on a protocol where an EOL within a "Product Description" or "delivery instruction" field was replaced with an "impossible" character (like CHAR(255) or something).
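    A tiny sketch of turning such an 'impossible' character back into a real line break after the load (the variable is made up; NCHAR(255) stands in for whatever character the protocol uses):

    DECLARE @desc nvarchar(max) = N'line one' + NCHAR(255) + N'line two';

    -- Restore the embedded EOL once the value is safely inside a column.
    SELECT REPLACE(@desc, NCHAR(255), NCHAR(13) + NCHAR(10)) AS RestoredDescription;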

    So how do you differentiate between an EOL between text qualifiers and an EOL which is the actual end of the line or row? (I am using Transact-SQL for this at the moment and my wish is to keep my technology limited to this only.)

    I have an idea how it can be done, but the approach is in development stage, to be tested on practical cases, not ready to be published yet.

    'Not reasonable' is, for humanity, hardly a reason not to do it.

    Not 'legal'? Where is it determined that this is not 'legal'?

    For myself I can think of a reason to mix text qualifiers. If there is a column which does not contain a double quote, then a double quote is an excellent text qualifier for that column. The same goes for another column which does not contain a single quote. For a parser which parses from start to finish this is hardly a problem: if the first non-white symbol in a field is a text qualifier, this is the text qualifier for that field.

    It's unreasonable because it creates more trouble than there was without it.

    It all sounds OK when you are exporting a file.

    But when you are importing, you need to make decisions based on uncertain criteria.

    For example, if you have a string like this:

    'content of the field 1,here is field 2,field 3 has a description of a 'new item'

    How will you interpret such a string?

    Is it a single field quoted with single quotes?

    Or did a single quote just happen to be the first character in a string which did not need to be quoted because it did not have a delimiter within its content?

    Unless you have a strict rule that every value must be quoted, you have no way to tell whether that first character is a 'dynamic quote' or not.

    _____________
    Code for TallyGenerator

  • I honestly do not see how this process can benefit from dynamically created static tables.

    1. The CSV fields are not known at design time; they are not known at the start of the process. At the end of the process they must be static tables. So there is no other way than to determine which fields there are and construct the table from this.

    So this is not only a benefit, but a requirement.

    2. Creating and 'removing' files in a process makes the process independent of other processes and creation routines. So the code becomes self-contained.

    3. Static tables (instead of temporary tables) resulted in far better performance.

    1. What is the purpose of those static tables at the end of the process?

    What can read data from tables having an unpredictable number of columns of unpredictable data types?

    It's a dead end.

    At the end of the process the data should go into static data interfaces with a well-known set of columns and data types. Otherwise it's a dead end.

    In my system I used a 4-column EAV type of table, where FileIDs were in column 1, row numbers in column 2, field names in column 3, and field values in column 4.

    At the end I "pivoted" it (using cross-tab queries) into whatever set of columns the interface was mapped to.
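    A rough sketch of that shape and of a cross-tab query over it (the names and the three target columns are invented for the example, not the actual code):

    DECLARE @ParsedValues table
    (
        FileID     int,
        RowNo      int,
        FieldName  nvarchar(128),
        FieldValue nvarchar(4000)
    );

    INSERT @ParsedValues VALUES
    (1, 1, N'AccountNo', N'123456789'),
    (1, 1, N'Name',      N'Smith'),
    (1, 1, N'City',      N'Leiden');

    -- Cross-tab the vertical rows into whatever column set the interface expects.
    SELECT  RowNo,
            MAX(CASE WHEN FieldName = N'AccountNo' THEN FieldValue END) AS AccountNo,
            MAX(CASE WHEN FieldName = N'Name'      THEN FieldValue END) AS [Name],
            MAX(CASE WHEN FieldName = N'City'      THEN FieldValue END) AS City
    FROM    @ParsedValues
    WHERE   FileID = 1
    GROUP BY RowNo;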

    2. Did you mean "fields", not "files"?

    Parsing delimited strings into a 'vertical' static table makes the process not any less self-contained.

    It even allows you to have different numbers of columns for different rows.

    When my process saw this happening it knew there was a mistake which needed to be rectified or, if known methods did not work, reported upstairs (meaning - to me).

    3. I never saw a single case of dynamically created static tables performing better than properly defined temp tables.

    Is it that you do not bother creating constraints and indexes on temp tables?

    I don't see why the clustered index has to be removed.

    Can you name a single reason for doing that?

    I did a fair number of tests. I did not test all combinations and did not test the tested combinations extensively. Most time was consumed filling the table; without the clustered index this worked reasonably well, so I stuck with that. I haven't retested the end solution with a clustered index during the fill.

    So the reason for removing the clustered index was that in a specific solution filling was faster without the clustered index.

    Thanks for your reply,

    Ben Brugman

    From my experience (quite an extended one) I can tell that if you apply a correct ORDER BY clause to the inserted recordset (matching the definition of the clustered index) there will not be any noticeable overhead caused by the existence of the clustered index.

    This part should not be too hard to verify. 🙂
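    A bare-bones version of that verification (the table, columns and sample rows are illustrative):

    CREATE TABLE dbo.StagingTest
    (
        RowNo int NOT NULL,
        ColNo int NOT NULL,
        Value nvarchar(4000) NULL,
        CONSTRAINT PK_StagingTest PRIMARY KEY CLUSTERED (RowNo, ColNo)
    );

    -- Present the rows in clustered-key order so the insert arrives as an ordered stream.
    INSERT dbo.StagingTest (RowNo, ColNo, Value)
    SELECT RowNo, ColNo, Value
    FROM   (VALUES (2, 1, N'b'), (1, 1, N'a'), (1, 2, N'x')) AS src (RowNo, ColNo, Value)
    ORDER BY RowNo, ColNo;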

    _____________
    Code for TallyGenerator

    There were still some questions asked; I'll try to answer them briefly.

    Sergiy (7/26/2016)


    Our suppliers were kind enough to agree on a protocol where an EOL within a "Product Description" or "delivery instruction" field was replaced with an "impossible" character (like CHAR(255) or something).

    I am trying to build a generic and flexible solution; end-of-line characters, delimiters and text qualifiers are all possible characters within a CSV. So I am planning to be able to handle any well-formed CSV, current and future.

    It all sounds OK when you are exporting a file.

    If it sounds OK, somebody probably will do this. Handling mixed text qualifiers does take a performance hit, so I am excluding this at the moment. Because the usage of text qualifiers is strictly defined (as with SQL Server), for a linear parser it would not pose a problem.

    The example you gave is not a well-formed CSV. (The text within two delimiters cannot contain an odd number of text qualifiers.)

    1. What is the purpose of those static tables at the end of the process?

    What can read data from tables having an unpredictable number of columns of unpredictable data types?

    It's a dead end.

    The static tables are the result of the import. And this is not the end, and it is not a dead end. Of course there must be columns which contain information which is needed/used. But there is no requirement that these should be the only columns.

    2. Did you mean "fields", not "files"?

    Sorry, this should have read 'tables' and not 'files'. Because one table results in one file I mixed up the terminology. Sorry.

    Remark.

    A well-formed CSV has the same number of columns for each line/row. It is relatively easy to detect lines which do not conform to this.

    3. I never saw a single case of dynamically created static tables performing better than properly defined temp tables.

    Is it that you do not bother creating constraints and indexes on temp tables?

    In my test, the same constructs worked faster with static tables.

    I don't see why the clustered index has to be removed.

    Can you name a single reason for doing that?

    In my test, it was faster.

    From my experience (quite an extended one) I can tell that if you apply a correct ORDER BY clause to the inserted recordset (matching the definition of the clustered index) there will not be any noticeable overhead caused by the existence of the clustered index.

    This part should not be too hard to verify. 🙂

    Verification would take some time.

    Yes, I do also have some experience (this month 43 years) and I can tell that an ORDER BY can create quite a performance hit.

    Greetings

    Ben

    Mixed text qualifiers.

    Sometimes mixed qualifiers are used where the text qualifiers are equivalent. For example, an opening quote is different from a closing quote; this might happen when somebody has used Word to form the text. Word will often give different quotes for open and close. These situations are easy to handle because the qualifiers are both handled as equivalent.
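    A small sketch of normalising Word's 'smart' quotes to plain ones before parsing (NCHAR 8216/8217 are the curly single quotes, 8220/8221 the curly doubles; the sample line is made up):

    DECLARE @line nvarchar(max) = NCHAR(8220) + N'quoted field' + NCHAR(8221) + N',plain field';

    SET @line = REPLACE(REPLACE(@line, NCHAR(8216), N''''), NCHAR(8217), N'''');
    SET @line = REPLACE(REPLACE(@line, NCHAR(8220), N'"'),  NCHAR(8221), N'"');

    SELECT @line AS NormalisedLine;   -- "quoted field",plain field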

    Sometimes mixed qualifiers are used because some fields can contain a double quote where other fields can contain a single quote. I have seen this type of file, but not lately.

  • ben.brugman (8/2/2016)


    The static tables are the result of the import. And this is not the end, and it is not a dead end. Of course there must be columns which contain information which is needed/used. But there is no requirement that these should be the only columns.

    Here we are.

    You actually need tables with a well pre-defined list of columns.

    The rest is white noise which will be ignored in the end.

    I believe building tables to hold the unnecessary data, loading it, parsing it and storing it (even temporarily) will cause quite a performance hit. With no benefits at all.

    Don't you think?

    I don't see why the clustered index has to be removed.

    Can you name a single reason for doing that?

    In my test, it was faster.

    Faster for INSERT.

    But populating a heap table with a messy set of strings with unknown definitions is not the point of the exercise - right?

    At the end the data must be filtered, sorted and made into indexed data storage.

    Delaying doing that until populating the transactional tables will only add to the troubles.

    _____________
    Code for TallyGenerator

  • ben.brugman (8/2/2016)


    It all sounds OK when you are exporting a file.

    If it sounds OK, somebody probably will do this. Handling mixed text qualifiers does take a performance hit, so I am excluding this at the moment. Because the usage of text qualifiers is strictly defined (as with SQL Server), for a linear parser it would not pose a problem.

    The example you gave is not a well-formed CSV. (The text within two delimiters cannot contain an odd number of text qualifiers.)

    I gave you an actual example from a real business case.

    Customers requested long account numbers in CSV not to be distorted when the files are opened in Excel.

    To make it happen we had to add a single quote to the numeric value - so Excel reads it as text, not a float value.

    Account number was the first column in quite a wide set. And there were plenty of other single quotes in text fields to the right.

    Here is the actual string I just saved as CSV from my Excel 2010:

    abc,'123,456,789',here we are

    It was 5 columns in Excel, and it must be interpreted as 5 columns by any tool parsing this CSV string.
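    For illustration, a plain comma split - which knows nothing about quoting - returns the same five pieces (this assumes SQL Server 2016 or later for STRING_SPLIT):

    SELECT value AS Piece
    FROM   STRING_SPLIT(N'abc,''123,456,789'',here we are', N',');
    -- abc / '123 / 456 / 789' / here we are  -> five pieces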

    So, it's not about performance.

    You simply cannot build logic to consistently and correctly interpret files with mixed text qualifiers.

    Mixed text qualifiers.

    Sometimes mixed qualifiers are used where the text qualifiers are equivalent. For example, an opening quote is different from a closing quote; this might happen when somebody has used Word to form the text. Word will often give different quotes for open and close. These situations are easy to handle because the qualifiers are both handled as equivalent.

    Word uses fixed pairs of quotes. And it does not mix them with any other text qualifiers.

    There is no single set of text qualifiers; it's different text qualifiers per file format.

    Sorry, but you must know from some metadata what your text qualifiers are before you start reading a file.

    Same about delimiters.

    CSV is not the only format used in this world.

    There are TSV, semicolon-delimited, ^-delimited, and all sorts of other kinds of files.

    You may try to build software which will figure out the delimiters and text qualifiers used in the file from the file itself, but it will never get beyond "the best guess".

    Which is never good enough for an automated tool.

    _____________
    Code for TallyGenerator

  • Sergiy (8/2/2016)


    I believe building tables to hold the unnecessary data, loading it, parsing it and storing it (even temporarily) will cause quite a performance hit. With no benefits at all.

    Don't you think?

    No, I do not think so, but I am open to suggestions.

    So how would you load/parse/store the data where all unnecessary data is filtered out before it is inserted?

    Given are the CSV files.

    (Set 1 has 110 files; not all files/columns are used.

    5 other sets which are not all available yet.

    The CSV files are properly formed **).

    Future CSV files.).

    Faster for INSERT.

    Yes, at the moment all performance problems are during the insertion of data; after the data is inserted there are no performance problems.

    Ben

    **)

    Properly formed CSVs.

    At a number of places on the internet a definition can be found, for example:

    Wikipedia:

    https://en.wikipedia.org/wiki/Comma-separated_values

    References:

    https://www.ietf.org/rfc/rfc4180.txt

    Robustness Principle:

    CSV implementations will follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.

    (E.g. other delimiters such as TAB/semicolon are accepted.)

  • ben.brugman (7/18/2016)


    At the moment I am in a process converting a number of CSV files into XML files.

    Heh... as one of my favorite comic characters is known to say...

    GAH!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ben.brugman (8/15/2016)


    Sergiy (8/2/2016)


    I believe building tables to hold the unnecessary data, loading it, parsing it and storing it (even temporarily) will cause quite a performance hit. With no benefits at all.

    Don't you think?

    No, I do not think so, but I am open to suggestions.

    So how would you load/parse/store the data where all unnecessary data is filtered out before it is inserted?

    Given are the CSV files.

    (Set 1 has 110 files; not all files/columns are used.

    5 other sets which are not all available yet.

    The CSV files are properly formed **).

    Future CSV files.).

    OK, we need to start from the basics.

    There are no columns in files.

    A file - by definition - is a sequence of binary values.

    According to the common convention those binary values are grouped into bytes.

    That's it.

    We can't tell any more about a file unless we have some metadata.

    You need to know some facts about a file before you can make some sense out of it.

    1. Encoding.

    Knowing the file encoding we can turn the sequence of bytes into a sequence of text symbols.

    Depending on the encoding we choose, the output will be different.

    2. Row delimiters.

    You have to know what kind of characters are used for terminating rows.

    Knowing row delimiters you can turn your sequence of text symbols into a sequence of text blocks named "rows".

    3. Value delimiters and quotes.

    You have to know which symbols inside of the text blocks we have named "rows" separate smaller sequences (we name them "values") from each other. We name those symbols "delimiters".

    And we have to know which symbols are used to indicate that a symbol identical to a delimiter is not a delimiter but part of the value, so it should not be used for splitting a row. We name those symbols "quotes".

    Until you know these 4 facts about any of your CSV files you have no chance of parsing it correctly.

    You may try to figure them out on the fly, from the file content itself.

    But it does not change anything about the approach:

    1. get the file metadata first;

    2. then parse the file.

    Please note - up to this point there are no columns in the picture yet.

    Encoding, delimiters and quotes allow us to split a file into a sequence of values.

    The only thing we can do now is start collecting those values and allocating sequential numbers to each of them.

    When 2 values are separated not by a value delimiter but by a row terminator, we may say that a new row starts here. Then we increment another counter - Row Number - and reset the value counter to 1.

    After that we end up with a sequence of numbered values: RowNo, ColNo, Value.

    Up to this point we have no option of omitting/skipping unnecessary data.

    We must read every single byte of the file to interpret its content correctly and to find every single value we possibly need to import from it.

    But now we have a choice:

    a) Pivot the whole sequence, every value in it, into a table, including the unnecessary data;

    b) Using our knowledge about the placeholders of the values we actually need to take from the file, pick only those values which have the appropriate combinations of RowNo-ColNo numbers allocated to them.

    I personally believe that creating a staging table for only that subset of data which will actually be needed in further processing will save time and system resources.
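    A sketch of option (b) in T-SQL, with made-up table variables standing in for the parsed sequence and for the map of needed columns:

    DECLARE @ParsedSequence table (RowNo int, ColNo int, Value nvarchar(4000));
    DECLARE @ColumnMap      table (ColNo int, TargetColumn sysname);

    INSERT @ParsedSequence VALUES (1, 1, N'keep me'), (1, 2, N'noise'), (1, 3, N'keep me too');
    INSERT @ColumnMap      VALUES (1, N'AccountNo'), (3, N'City');   -- only columns 1 and 3 are needed

    -- Unmapped (unnecessary) values never reach the staging/pivot step.
    SELECT  s.RowNo, s.ColNo, s.Value, m.TargetColumn
    FROM    @ParsedSequence AS s
    JOIN    @ColumnMap      AS m ON m.ColNo = s.ColNo;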

    Can you argue that?

    _____________
    Code for TallyGenerator

  • The question was:

    So how would you load/parse/store the data where all unnecessary data is filtered out before it is inserted?

    Given are the CSV files.

    No additional metadata is supplied.

    The files contain information; this information is presented in columns.

    So I do not agree with you that 'there are no columns in files'. (Files for SQL Server and/or Oracle also contain columns, so I do not agree with your statement.)

    Encoding, row delimiters, value delimiters and text qualifiers can all be determined from the content.

    (That is what I normally do when I get a file, say, from an institution without the metadata. Normally there is enough information within the file to make sense of the file and parse it. In general there is no metadata about the existing columns.)

    So I import all the unnecessary data and start processing from there. This is against your 'method'.

    So how would you load/parse/store the data where all unnecessary data is filtered out before it is inserted?

    Ben

    About parsing: normally I determine the delimiter, EOL delimiter and text qualifiers by hand.

    But I made a script to count the number of commas, semicolons, tabs, CRs, LFs, CRLFs, quotes and double quotes.

    From these numbers, for most files you can guess what the different symbols are.

    From the number of each symbol in the file it is fairly easy to pick which symbol is used for what.

    When picked correctly, the number of fields for each row is the same. If the number of fields for each row is the same, it is highly likely that the symbols are correct and the file parsed correctly.
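    A rough sketch of such a counting script, assuming the whole file has been loaded into a single nvarchar(max) variable (for example via OPENROWSET ... SINGLE_NCLOB); the sample content is made up:

    DECLARE @f nvarchar(max) = N'a,b,"c"' + NCHAR(13) + NCHAR(10) + N'1,2,"3"';

    SELECT  LEN(@f) - LEN(REPLACE(@f, N',',  N''))                       AS Commas,
            LEN(@f) - LEN(REPLACE(@f, N';',  N''))                       AS Semicolons,
            LEN(@f) - LEN(REPLACE(@f, NCHAR(9),  N''))                   AS Tabs,
            LEN(@f) - LEN(REPLACE(@f, N'"',  N''))                       AS DoubleQuotes,
            LEN(@f) - LEN(REPLACE(@f, N'''', N''))                       AS SingleQuotes,
            (LEN(@f) - LEN(REPLACE(@f, NCHAR(13) + NCHAR(10), N''))) / 2 AS CRLFs;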

    Only after parsing and creating a table of the information do I start interpreting the information.

    So the above question remains: how to process the file so that only 'useful' data is inserted in the database?

  • ben.brugman (8/16/2016)


    The question was:

    So how would you load/parse/store the data where all unnecessary data is filtered out before it is inserted?

    Given are the CSV files.

    No additional metadata is supplied.

    The files contain information; this information is presented in columns.

    So I do not agree with you that 'there are no columns in files'. (Files for SQL Server and/or Oracle also contain columns, so I do not agree with your statement.)

    Encoding, row delimiters, value delimiters and text qualifiers can all be determined from the content.

    (That is what I normally do when I get a file, say, from an institution without the metadata. Normally there is enough information within the file to make sense of the file and parse it. In general there is no metadata about the existing columns.)

    So I import all the unnecessary data and start processing from there. This is against your 'method'.

    So how would you load/parse/store the data where all unnecessary data is filtered out before it is inserted?

    Ben

    About parsing: normally I determine the delimiter, EOL delimiter and text qualifiers by hand.

    But I made a script to count the number of commas, semicolons, tabs, CRs, LFs, CRLFs, quotes and double quotes.

    From these numbers, for most files you can guess what the different symbols are.

    From the number of each symbol in the file it is fairly easy to pick which symbol is used for what.

    When picked correctly, the number of fields for each row is the same. If the number of fields for each row is the same, it is highly likely that the symbols are correct and the file parsed correctly.

    Only after parsing and creating a table of the information do I start interpreting the information.

    So the above question remains: how to process the file so that only 'useful' data is inserted in the database?

    How do you currently distinguish between data which is 'useful' and data which is not?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (8/16/2016)


    How do you currently distinguish between data which is 'useful' and data which is not?

    After the data is inserted and put into columns, the column names (from the file) and the first 100 rows are made available to a 'mapping' team.

    The target database is known.

    Some examples of fairly common data are below the signature.

    For 'domain' knowledge it is far more complex. (A made-up example is below the signature.)

    So the mapping team makes a mapping between the supplied data and the required data.

    Things like Name, Address, City are fairly easily recognisable, although there is huge variation in the way this is stored. Birthdate is fairly easy as well.

    Other data can very often be recognised and sometimes has to be transformed. So very often I make views of the available 'tables' to match the model in the target database. (Sometimes a view is made which produces gibberish; sometimes it provides clear insight into the information.)

    Privately.

    I am a diabetic. Over the years I have used a number of different blood glucose meters and a number of different insulin pumps. I have also used some food-intake applications. Most (if not all) of these can deliver the data in a 'CSV' file. The formats, the type and the amount of data vary for each device. Metadata is not available. Manufacturers, although they have a very friendly helpdesk, do not supply information about the formats.

    Combining these datasets is 'impossible' in a spreadsheet; within SQL Server this is feasible.

    From this 'internal' data I can even infer some information that the manufacturer is not supplying.

    Also privately.

    Privately I also download different CSVs where limited or no metadata is supplied. For example the phone bills: there are a number of CSVs for the phone bills to download. They have different formats for their different downloads.

    For banking, different banks have different structures for their CSV downloads. (Sometimes some things can be chosen.)

    Conclusion:

    For my work and also privately there are a large number of CSVs I would like to be able to import into a database. Only after they are imported do I start processing the data. Because all data is imported, I do not have to decide beforehand which data is useful to me and which is not. Also, in general the provider of the data does not provide the metadata. In most cases I am able to recognise the information in the data, although not all information is immediately recognised by me (or the team).

    Ben

    But there are a huge number of variations, for example the date of birth.

    The format is a variation but that is hardly a problem.

    Sometimes a part of the birthdate is unknown.

    Sometimes only the day is unknown.

    Sometimes even the month is unknown.

    Sometimes a half year is unknown.

    Unknown day can be stored as:

    1935-08-00 or 1935-08-01 Uncertainty 1 or 1935-08-01 Uncertainty 31.

    Unknown month can be stored as

    1935-00-00 or 1935-01-01 Uncertainty 2 or 1935-00-01 Uncertainty 366.

    An unknown half year is stored as the first of the year, or the first of July of the year. And I do not know the indicator which comes with this uncertainty.

    There are probably other systems with other uncertainties.
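    One possible way to carry such a partially known birth date in a table - a sketch of just one of the encodings mentioned above, with made-up names:

    CREATE TABLE dbo.PersonBirthDate
    (
        PersonID          int  NOT NULL PRIMARY KEY,
        EarliestBirthDate date NOT NULL,   -- e.g. 1935-08-01 when the day is unknown
        UncertaintyDays   int  NOT NULL    -- e.g. 31 = day unknown, 366 = month unknown
    );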

    Similar things happen with names of persons.

    There the complexity arises with birth name, maiden name, partner name, family name. Each name can have a prefix. Then there is often a given name, a middle name, initials or a combination.

    Addresses can have some complexity.

    (Street name, number, other designation, city, postal code.)

    For example, an elderly person can live at an address, but within that address there are a large number of elderly people, each with their own apartment number or room number. The room number is still a 'legal' address for official reasons.

    Some systems just bundle the street, number and additional information in a single address line. Other systems have distinct fields for each part.

    A made-up domain structure.

    Might be:

    Three tables for an order request, the order and the order lines.

    Other systems put this in two tables or in even more than three tables.

  • This is really fascinating. Ben, thanks for a great discussion.

    If you or anyone else wanted to write an article, or series, on dealing with CSV data in a few specific ways, I think it would be great. I know I've had issues at times with CSVs that were problematic to load.

  • ben.brugman (8/16/2016)


    The question was:

    So how would you load/parse/store the data where all unnecessary data is filtered out before it is inserted?

    The answer was given.

    You probably should read it.

    Given are the CSV files.

    No additional metadata is supplied.

    What is additional?

    You have a row:

    1935-01-01

    How do you interpret it?

    Remember - no additional metadata supplied.

    What is that?

    Unknown birth date?

    Or 3 dimensions of a parcel size?

    Or 3 totally independent values separated by a dash which was used as a delimiter for the file?

    The files contain information; this information is presented in columns.

    Sorry, but it is, as the English would say, absolute bollocks.

    Open any CSV, Excel or mdf file in Notepad, WordPad or Photoshop - and find any columns in it.

    Columns only appear after parsing a file by an appropriate application according to the rules defined by metadata.

    Excel won't even attempt to parse a CSV file without asking you what the delimiter is.

    Only when you tell it which delimiter to use does it present you the data split into columns, according to the metadata you've just provided.

    BULK INSERT or BCP have "default" parameters which will be used if you have not provided explicit metadata in a form of parameters or a format file.
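    For example, handing BULK INSERT that metadata explicitly (the table, path and option values here are only illustrative):

    CREATE TABLE dbo.RawImport (Col1 nvarchar(4000), Col2 nvarchar(4000), Col3 nvarchar(4000));

    BULK INSERT dbo.RawImport
    FROM 'C:\import\example.csv'          -- made-up path
    WITH (FIELDTERMINATOR = ',',          -- value delimiter
          ROWTERMINATOR   = '\n',         -- row terminator ('\n' is read as CR+LF)
          FIRSTROW        = 2,            -- skip the header row
          CODEPAGE        = '65001');     -- encoding (UTF-8; needs SQL Server 2016 or later)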

    No automation tool can present data from a file in columns unless you provide parsing rules for the file.

    A file is a sequence of bytes.

    Nothing more.

    No columns in it.

    _____________
    Code for TallyGenerator
