How to store lots of data (1024 columns) for a particular reading

  • Hello, I am having some issues deciding how to store data taken from a specific device. The device is a spectrometer, which takes a sample every minute. Each sample contains 1024 readings, one for each channel of the spectrometer. So a typical sample record from this device would look like:

    Time: 2011-01-01 01:01:01

    DeviceID: 1

    Channel1Counts: 25

    Channel2Counts: 29

    Channel3Counts: 38

    ...

    Channel1024Counts: 72

    The goal of this project is to get a device logging its data to a database and then to build a web app to view the data, look at trends, etc. How could I express this data in a table and keep the ability to assemble this data back into a given sample?

    I see only two ways:

    1.) (Wide solution) Structure a table with 1024+2 columns. Pros: the sample reading is held together in the database. Cons: requires a lot of columns (not even sure if this is getting close to the SQL limit), and if more channels are required in the future, there will be lots of modifications to code.

    2.) (Tall solution) Structure two tables: one to create a unique sample ID and label it with a timestamp and device ID, and another that has the sample ID as a foreign key and columns for "Channel Number" and "Counts". Pros: a flexible solution, and a more relational design. Cons: makes assembling records harder; I would likely need a view to do this. Also makes updating data harder. (A rough sketch of this design follows.)
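    To make the tall option concrete, here is a rough sketch of the two tables I have in mind (table and column names are placeholders, not a final design):

        CREATE TABLE dbo.Sample (
            SampleID   INT IDENTITY(1,1) PRIMARY KEY,
            SampleTime DATETIME NOT NULL,
            DeviceID   INT NOT NULL
        );

        CREATE TABLE dbo.SampleReading (
            SampleID      INT NOT NULL REFERENCES dbo.Sample (SampleID),
            ChannelNumber SMALLINT NOT NULL,   -- 1 through 1024
            Counts        INT NOT NULL,
            CONSTRAINT PK_SampleReading PRIMARY KEY (SampleID, ChannelNumber)
        );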

    Suggestions?

  • loki1049 (11/16/2011)


    Hello, I am having some issues deciding how to store data taken from a specific device. [...] I see only two ways: 1.) a wide solution (one table with 1024+2 columns) or 2.) a tall solution (a sample table plus a channel/counts detail table). Suggestions?

    Data should be stored at a granular level, so I'd go with your "tall solution".

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • A hybrid of the wide-table idea?

    Do those 1024 columns logically break into groups that could be, say, 4 or 5 other tables (measurement types)? Then you have, dunno, 5 tables each with approx 200 columns apiece, all referring to the master table. The insert and update would involve 5 tables, but it might be a little easier to manage, maybe?

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and an INSERT INTO... statement to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (11/16/2011)


    A hybrid of the wide-table idea?

    Do those 1024 columns logically break into groups that could be, say, 4 or 5 other tables (measurement types)? Then you have, dunno, 5 tables each with approx 200 columns apiece, all referring to the master table. The insert and update would involve 5 tables, but it might be a little easier to manage, maybe?

    No, the 1024 readings have only one thing in common: they are radiation counts at various energies. It wouldn't make sense to break them down further into separate tables. I'm thinking the tall solution is the only one that actually seems feasible as well.

  • From a normalization perspective you are justified in using either model. Use whichever model will provide the easiest path for your development and will meet your performance requirements. The view you alluded to creating sounds awful. Predictably, the proper solution will depend on your situation.

    Side thought: what will the data types be? There is a hard limit of 8060 bytes on the width of a row containing only fixed-length data types, so if most of the columns are BIGINT or DECIMAL/NUMERIC you may run into an issue that forces you into the tall model.
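    A quick back-of-the-envelope check against that limit (my arithmetic, fixed-length types only, ignoring per-row overhead):

        -- 1024 channels * 4 bytes (INT)           = 4096 bytes -> fits within 8060
        -- 1024 channels * 8 bytes (BIGINT)        = 8192 bytes -> over 8060; the CREATE TABLE would fail
        -- 1024 channels * 9 bytes (DECIMAL(19,4)) = 9216 bytes -> over 8060 as well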

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • SQL Server limits the number of columns in a table and a view to 1024, so the wide solution is not acceptable as it stands: 1024 channel columns plus a sample number column exceeds the limit. For the record, the values stored in the 1024 entries would likely be INT. Our boss is pushing to store the additional 1024 pieces of data as comma-separated strings in 3 NVARCHAR(MAX) columns. Apparently, we don't need to manipulate this data dynamically. I'm not sure how I feel about this solution; opinions?

  • I would avoid storing a comma-separated list if at all possible. If your boss is married to the idea the least you could do is steer them towards coalescing the 1024 values into an XML document before storing it in an XML column. Personally I would push for storing each discrete value in its own column with a proper data type.
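    For what it's worth, a minimal sketch of the XML route, with hypothetical names of my own (not an agreed design from this thread):

        CREATE TABLE dbo.SampleXml (
            SampleID   INT IDENTITY(1,1) PRIMARY KEY,
            SampleTime DATETIME NOT NULL,
            DeviceID   INT NOT NULL,
            Readings   XML NOT NULL   -- e.g. <c n="1">25</c><c n="2">29</c> ... <c n="1024">72</c>
        );

        -- Shredding the XML back into rows when the data does need manipulating:
        SELECT s.SampleID,
               r.value('@n', 'int') AS ChannelNumber,
               r.value('.', 'int')  AS Counts
        FROM dbo.SampleXml AS s
        CROSS APPLY s.Readings.nodes('/c') AS x(r);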

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (11/18/2011)


    I would avoid storing a comma-separated list if at all possible. If your boss is married to the idea the least you could do is steer them towards coalescing the 1024 values into an XML document before storing it in an XML column. Personally I would push for storing each discrete value in its own column with a proper data type.

    +1 to this. At least with the XML you have more manipulation and transfer options. I personally would go with the 'tall' solution, but that's because I'd assume I was storing data to do something with it eventually rather than just stare at it. If all you're doing is recording for posterity, well, meh, shove it into a single VARCHAR(MAX) and call it good. Just get it in writing that no data manipulation is expected.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • For a spectrometer, you would expect all numeric values. Rather than bother with XML, why not store it all as "fixed-length" fields within a fixed-length BINARY datatype, which is easily "split" positionally?
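    A minimal sketch of that idea, assuming 4-byte INT counts and illustrative names of my own:

        CREATE TABLE dbo.SampleBlob (
            SampleTime DATETIME NOT NULL,
            DeviceID   INT NOT NULL,
            Counts     BINARY(4096) NOT NULL,   -- 1024 channels * 4 bytes each
            CONSTRAINT PK_SampleBlob PRIMARY KEY (DeviceID, SampleTime)
        );

        -- Channel N occupies bytes (N-1)*4+1 through N*4, so the positional "split" is just SUBSTRING:
        DECLARE @Channel INT = 3;
        SELECT SampleTime,
               CAST(SUBSTRING(Counts, (@Channel - 1) * 4 + 1, 4) AS INT) AS ChannelCounts
        FROM dbo.SampleBlob;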

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • 1.) (Wide solution) Structure a table with 1024+2 columns.

    Pros: the sample reading is held together in the database.

    It’s possible with Wide Tables.

    A wide table is a table that has defined a column set. Wide tables use sparse columns to increase the total number of columns that a table can have to 30,000.

    Cons: requires a lot of columns (not even sure if this is getting close to the SQL limit)

    Nope, it's not a limitation. But I won't go for it.

    and if more channels are required in the future there will be lots of modifications to code.

    Agree. And it should be your major worry.

    2.) (Tall solution) Structure two tables: one to create a unique sample ID and label it with a timestamp and device ID, and another that has the sample ID as a foreign key and columns for "Channel Number" and "Counts".

    Pros: flexible solution and seems to be more relational based design.

    Agree.

    Cons: Makes assembling records harder, would need to use a view to do this likely.

    If assembling means getting them into a row with a crosstab / pivot, then yes. But it's a presentation issue and can be managed well in the presentation layer (see the sketch at the end of this post).

    Also makes updating data harder.

    Nope. It should be faster. That's the advantage of Normalization.

    Just a suggestion: whenever you're in a dilemma over database design, you can rely on normalization principles. They will guide you well.
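    To illustrate that crosstab, reassembling a sample from the tall design might look like this (first three of the 1024 channels shown, using the hypothetical table names sketched earlier in the thread):

        SELECT SampleID,
               MAX(CASE WHEN ChannelNumber = 1 THEN Counts END) AS Channel1Counts,
               MAX(CASE WHEN ChannelNumber = 2 THEN Counts END) AS Channel2Counts,
               MAX(CASE WHEN ChannelNumber = 3 THEN Counts END) AS Channel3Counts
        FROM dbo.SampleReading
        GROUP BY SampleID;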

  • Dev (11/19/2011)


    1.) (Wide solution) Structure a table with 1024+2 columns.

    Pros: the sample reading is held together in the database.

    It’s possible with Wide Tables.

    A wide table is a table that has defined a column set. Wide tables use sparse columns to increase the total number of columns that a table can have to 30,000.

    Do you have a reference you could share that shows how to do that?

    Cons: requires a lot of columns (not even sure if this is getting close to the SQL limit)

    Nope, it's not a limitation. But I won't go for it.

    Why not?

    and if more channels are required in the future there will be lots of modifications to code.

    Agree. And it should be your major worry.

    2.) (Tall solution) Structure two tables: one to create a unique sample ID and label it with a timestamp and device ID, and another that has the sample ID as a foreign key and columns for "Channel Number" and "Counts".

    Pros: flexible solution and seems to be more relational based design.

    Agree.

    Cons: Makes assembling records harder, would need to use a view to do this likely.

    If assembling means getting them into a row with a crosstab / pivot, then yes. But it's a presentation issue and can be managed well in the presentation layer.

    Also makes updating data harder.

    Nope. It should be faster. That's the advantage of Normalization.

    IMHO, that's a bit contrary to the whole reason for data warehousing and high performance reporting/analysis in this case. Normally, denormalized data is quite a bit faster for lookups than normalized data for the very reasons you mentioned. Normalized data for this type of information would, indeed, require reassembly of rows. For spectrometer data of this nature, all 1024 columns are usually equally important and must be stored... even the columns with "0" data are important in spectral analysis.

    Just a suggestion: whenever you're in a dilemma over database design, you can rely on normalization principles. They will guide you well.

    Normally, I'd agree but I'm not sure that they actually apply for the data in this case because of the relatively major pain that reassembling rows with 1024 "plottable" columns would take.

    --Jeff Moden



  • Pros: the sample reading is held together in the database.

    It’s possible with Wide Tables.

    A wide table is a table that has defined a column set. Wide tables use sparse columns to increase the total number of columns that a table can have to 30,000.

    Do you have a reference you could share that shows how to do that?

    Special Table Types

    http://msdn.microsoft.com/en-us/library/ms186986.aspx

    Cons: requires a lot of columns (not even sure if this is getting close to the SQL limit)

    Nope, it's not a limitation. But I won't go for it.

    Why not?

    Performance issues and size considerations.

    Performance Considerations for Wide Tables

    http://msdn.microsoft.com/en-us/library/cc645884.aspx

    Using Sparse Columns

    http://msdn.microsoft.com/en-us/library/cc280604.aspx

    Also makes updating data harder.

    Nope. It should be faster. That's the advantage of Normalization.

    IMHO, that's a bit contrary to the whole reason for data warehousing and high performance reporting/analysis in this case. Normally, denormalized data is quite a bit faster for lookups than normalized data for the very reasons you mentioned. Normalized data for this type of information would, indeed, require reassembly of rows. For spectrometer data of this nature, all 1024 columns are usually equally important and must be stored... even the columns with "0" data are important in spectral analysis.

    Data warehousing applications are designed for READ ONLY (or READ MOSTLY) purposes. For faster DML we should stick to OLTP (normalized) systems.

    Just a suggestion: whenever you're in a dilemma over database design, you can rely on normalization principles. They will guide you well.

    Normally, I'd agree but I'm not sure that they actually apply for the data in this case because of the relatively major pain that reassembling rows with 1024 "plottable" columns would take.

    Having 1024 columns in one table is BAD design indeed. It's bad even for DW applications. I am not very sure why MS introduced wide tables in the first place. As a database designer / architect I don't find a single excuse for crossing 100 columns per table (MS supports 1024 for normal tables).

    If, for some reason, I had to allow this data to be stored in wide fashion, I would go for the XML data type.

  • No, the output of a spectrometer isn't what most people would consider to be "OLTP". It's just a wad of data much like call records or any other "wide row" data might consist of. Each row has its own integrity and joins with other tables won't be necessary for analysis.

  • Having a "tall skinny" (NVP, or name-value pair) table, like what you're suggesting, would actually more than triple the storage space required: not only must you store the data, but you also need to store the column name (the channel number), and where you were storing a row-number reference only once, you must now store it for every single piece of data. Using such a triple-size table would require three times the I/O to get anything done.
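    To put rough numbers on that (my assumptions: INT counts, an INT sample key, a SMALLINT channel number, and roughly 11 bytes of per-row overhead, which varies in practice):

        -- Wide/binary row : 1024 * 4 bytes of data + one row's key and overhead  ~  4.1 KB per sample
        -- Tall (NVP) rows : 1024 * (4 key + 2 channel + 4 count + ~11 overhead)  ~ 21.5 KB per sample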

    For all the reasons I just mentioned, I believe an NVP style of table would be one of the worst things that could be done for this problem.

    So far as being a Data Architect and not finding a reason to cross the proverbial 100 column border goes, I'd normally agree but this isn't normal data and the processes the data will be made to suffer won't be normal either.

    Thanks for the links.

    --Jeff Moden



  • Jeff Moden (11/19/2011)


    Dev (11/19/2011)


    1.) (Wide solution) Structure a table with 1024+2 columns.

    Pros: the sample reading is held together in the database.

    It’s possible with Wide Tables.

    A wide table is a table that has defined a column set. Wide tables use sparse columns to increase the total number of columns that a table can have to 30,000.

    Do you have a reference you could share that shows how to do that?

    Sparse columns (requires SQL 2008)

    The thing is, all the columns must be nullable, and sparse columns are intended for columns that are mostly (90%+) NULL.

    I'm not going to write out 1000+ columns, but the table design would look like this:

    CREATE TABLE Samples (
        SampleDate DATETIME,
        Reading1 int SPARSE NULL,
        Reading2 int SPARSE NULL,
        Reading3 int SPARSE NULL,
        ...
        Reading1024 int SPARSE NULL,
        cs xml column_set FOR ALL_SPARSE_COLUMNS);

    Considering that the main point of sparse columns and column sets is for designs where there are lots and lots of optional columns, it'd be the last thing I'd consider for this kind of design where all the columns will have values all the time.

    If this would be purely insert and select the entire set for processing (not ever selecting some columns and not others), I'd go with Jeff's initial suggestion. Save the whole lot in a binary blob and let the app handle the encoding and decoding.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Having a "tall skinny" (NVP, or name-value pair) table, like what you're suggesting, would actually more than triple the storage space required: not only must you store the data, but you also need to store the column name (the channel number), and where you were storing a row-number reference only once, you must now store it for every single piece of data. Using such a triple-size table would require three times the I/O to get anything done.

    It's a lookup / master table. We would populate it ONCE, and it would be flexible enough to accept business rule changes (adding / removing more values). That's a major advantage.

    The child tables (where the actual data needs to be stored) will reference the lookup table. We may intelligently populate child records (say, populate the records where you have values and ignore NULLs).

    We may also use table variables to pass values from the front end so SET-based operations can be done on the data.

    I don't understand how it would consume triple the space. Even if it does, space is cheap, but that's not the concern for the time being.

    Whatever solution we opt for, we need to perform operations on the same volume of data. The advantage we would get with the tall approach is that we could skip the unnecessary data-split operation.
