Big Data or Small Data

  • Comments posted to this topic are about the item Big Data or Small Data

  • And much of our data could be shrunk in size if the smallest suitable datatypes were used.

    I see databases whose datatypes are the defaults handed to them by the Code First mentality of Entity Framework. That means DATETIME2(7) where a DATE would do because only the date is relevant; INTs used in enumeration tables with 2 entries (my favourite is the YesNo table), whose FKs in tables with 80 million rows are then unnecessarily swollen; NVARCHAR where VARCHAR would perfectly suffice; and DECIMAL(15,7) where DECIMAL(7,2) would do just as well. Not everyone needs Unicode data all of the time, especially if one codepage will suffice and your entire fleet of instances and DBs uses that one codepage. That being said, I must properly investigate UTF-8.

    As an exercise, I once reduced a database's size by a third simply by creating a new database with more sensible datatypes and migrating the data across. If you have a 1TB database and you wish for more RAM (don't we all?), then this is one avenue to consider.
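
    As a rough illustration of the sort of right-sizing described above (table and column names here are invented for the example):

        -- Hypothetical table as a Code First default might generate it
        CREATE TABLE dbo.OrderHistory_Wide
        (
            OrderId    INT IDENTITY(1,1) PRIMARY KEY,
            YesNoId    INT            NOT NULL,  -- FK to a 2-row enumeration table
            Notes      NVARCHAR(400)  NOT NULL,  -- one codepage would suffice
            ChangedOn  DATETIME2(7)   NOT NULL,  -- only the date is relevant
            Amount     DECIMAL(15,7)  NOT NULL   -- 2 decimal places would do
        );

        -- The same table with right-sized datatypes
        CREATE TABLE dbo.OrderHistory_Narrow
        (
            OrderId    INT IDENTITY(1,1) PRIMARY KEY,
            YesNoId    TINYINT        NOT NULL,
            Notes      VARCHAR(400)   NOT NULL,
            ChangedOn  DATE           NOT NULL,
            Amount     DECIMAL(7,2)   NOT NULL
        );

    It is only a handful of bytes per column, but across 80 million rows it adds up, and narrower rows also mean more of the working set fits in the buffer pool, which is the RAM point above.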

  • A Data Scientist friend said that he can do more with a high-quality 1GB dataset than he ever could with many TB/PB from the data swamp.

    He explained that a lot of ML models have too many variables in them. If you have more than 7, your model is highly suspect. I came across an article that backed up that assertion; I wish I had bookmarked it for you. The gist was that you have to gain an understanding of the relationships between the parameters that go into an ML model. If they are highly correlated then you will amplify a signal that shouldn't be amplified. Of the business attributes available, many are correlated, so 7 truly independent variables is the top end of what you can expect.

    Let's suppose variables A & B are correlated, with C & D being non-correlated. As a scientist he would try the following combinations (a rough T-SQL check for whether a pair really is correlated is sketched after the list):

    • A, C, D
    • B, C, D
    • A, C
    • A, D
    • B, C
    • B, D
    • C, D
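
    If you want to sanity-check a pair of candidate variables before they go anywhere near a model, the Pearson correlation can be computed directly in T-SQL. A minimal sketch, assuming a hypothetical dbo.Observations table with FLOAT columns A and B:

        -- Pearson correlation between two candidate model inputs.
        -- dbo.Observations, A and B are invented names for illustration.
        SELECT (COUNT(*) * SUM(A * B) - SUM(A) * SUM(B))
               / NULLIF( SQRT(COUNT(*) * SUM(A * A) - SUM(A) * SUM(A))
                       * SQRT(COUNT(*) * SUM(B * B) - SUM(B) * SUM(B)), 0) AS PearsonR
        FROM   dbo.Observations
        WHERE  A IS NOT NULL
          AND  B IS NOT NULL;

    A PearsonR close to +1 or -1 suggests the pair is carrying largely the same signal, so only one of them belongs in the model.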

    I've found that data scientists are quick to point out that, while they can put together a query that gives them what they want, they need help to get it to perform. As Steve says, the opportunity cost of Big Data is time. The more grunt work data scientists have to do on filthy data, the less value they bring to an organisation. The way they write queries does not lend itself to extracting value from Big Data at speed.

    Some of the Big Data work I have done in the past revealed that 37 TB of data could be reduced to a few GB. But it took evaluating the 37 TB to make that determination.

  • Couple of things come to mind about this:

    I don't worry as much about big or small as I do about good or bad.

    Size doesn't matter as long as we are provided the time and resources to handle the volume.

    Of course, my perspective goes back to the days of merging data onto multiple reels of magnetic tape on the six-foot tape drives, when there was no such thing as multitasking and we ran processes during the lunch breaks of the Teamsters union data entry employees.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • I find that a whole lot of people have some really serious problems with smaller data sets. I've been told probably hundreds of times that "performance doesn't matter because it runs overnight" or "performance doesn't matter because the table is small and will never get any larger". They never consider what happens when they find out that the tables grew and they "ran out of night".

    I hate being right in those cases because it always happens at the worst times.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Someone here imported a large text file using ADF, and all the fields ended up as nvarchar(max). I was livid. I don't think the person who did this even bothered to look at the result.

    I had to run a 5-hour job (at night, ha, ha) to convert the fields to something reasonable.

    I tell our staff to create the most "efficient" table structures because small jobs become big jobs, and you don't want time wasted on poor design.

  • skeleton567 wrote:

    Couple of things come to mind about this:

    I don't worry as much about big or small as I do about good or bad.

    Size doesn't matter as long as we are provided the time and resources to handle the volume.

    Of course, my perspective goes back to the days of merging data onto multiple reels of magnetic tape on the six-foot tape drives, when there was no such thing as multitasking and we ran processes during the lunch breaks of the Teamsters union data entry employees.

    That "as long as" is a really big IF in many places

  • Jeff Moden wrote:

    I find that a whole lot of people have some really serious problems with smaller data sets. I've been told probably hundreds of times that "performance doesn't matter because it runs overnight" or "performance doesn't matter because the table is small and will never get any larger". They never consider what happens when they find out that the tables grew and they "ran out of night".

    I hate being right in those cases because it always happens at the worst times.

    Small data sets aren't for measuring perf in dev. That's a misuse.

    Small data is more for analytics, recognizing that we might do as well with the last 2 years' data as we do with the last 10. And small can still be 1TB. It's just not the 14TB we have now.

  • dplaut 49149 wrote:

    Someone here imported a large text file using ADF, and all the fields ended up as nvarchar(max). I was livid. I don't think the person who did this even bothered to look at the result.

    I had to run a 5-hour job (at night, ha, ha) to convert the fields to something reasonable.

    I tell our staff to create the most "efficient" table structures because small jobs become big jobs, and you don't want time wasted on poor design.

    I'm actually OK with imports like this, but to staging. This solves a lot of hassles in the L of ETL/ELT. It's the T that still needs to be done to move this to the correct sizes and whack what was loaded.
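
    A minimal sketch of that T step, with invented names (stg.RawFeed holding everything as NVARCHAR(MAX), dbo.Feed with the real types); TRY_CONVERT returns NULL for values that won't convert, so rejects can be inspected rather than failing the load:

        -- Move staged NVARCHAR(MAX) data into a properly typed table.
        INSERT INTO dbo.Feed (CustomerId, OrderDate, Amount)
        SELECT TRY_CONVERT(INT,          r.CustomerId),
               TRY_CONVERT(DATE,         r.OrderDate),
               TRY_CONVERT(DECIMAL(9,2), r.Amount)
        FROM   stg.RawFeed AS r;

        -- Whack the staged rows once the typed copy is confirmed.
        TRUNCATE TABLE stg.RawFeed;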

  • Steve Jones - SSC Editor wrote:

    Jeff Moden wrote:

    I find that a whole lot of people have some really serious problems with smaller data sets. I've been told probably hundreds of times that "performance doesn't matter because it runs overnight" or "performance doesn't matter because the table is small and will never get any larger". They never consider what happens when they find out that the tables grew and they "ran out of night".

    I hate being right in those cases because it always happens at the worst times.

    Small data sets aren't for measuring perf in dev. That's a misuse.

    Small data is more for analytics, recognizing that we might do as well with the last 2 years' data as we do with the last 10. And small can still be 1TB. It's just not the 14TB we have now.

    Who said anything about "Dev"?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • dplaut 49149 wrote:

    Someone here imported a large text file using ADF, and all the fields ended up as nvarchar(max). I was livid. I don't think the person who did this even bothered to look at the result.

    I had to run a 5-hour job (at night, ha, ha) to convert the fields to something reasonable.

    I tell our staff to create the most "efficient" table structures because small jobs become big jobs, and you don't want time wasted on poor design.

    With the caveat that I don't know your use case, this isn't necessarily something I worry about.

    We ingest a lot of data feeds (internal and external) whose reliability is... ahem... variable. After discussions with our stakeholders, they didn't want an absolute all-succeeds-or-all-fails pipeline.

    Our approach is to ingest the data into a tolerant structure and validate it. This allows us to separate the data into two streams (a rough sketch of the split appears at the end of this post):

    • Data that passes validation
    • Data that is WTF!

    We have tolerances on what % of an incoming feed is allowed to be WTF before we pronounce an absolute stop on that feed.  That can be zero % (everything must validate) up to a figure the business feels comfortable with.

    There are many reasons why a source can have variable quality. Some examples I would put forward are as follows:

    • Data manually assembled in spreadsheets external to the company
    • Glitches in data from a sensor
    • Uncommunicated change upstream

    Things like sensor glitches are a fact of life.  The loss of the occasional row of data is neither here nor there as long as we are measuring and evaluating the rows that fail validation.  Patterns in the glitches can be useful for spotting failing sensors.

    Inherent in this approach is that a transformation to a more disciplined layer must take place, especially where fine-grained accuracy is needed.
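
    A minimal sketch of that pass/fail split, with invented names (stg.SensorFeed, ReadingTime, ReadingValue) and TRY_CONVERT standing in for whatever validation rules the business agrees:

        DECLARE @FailThresholdPct DECIMAL(5,2) = 2.00;   -- % of WTF rows the business will tolerate
        DECLARE @WtfPct           DECIMAL(5,2);

        -- A row is WTF if its raw text values don't convert to the expected types.
        SELECT @WtfPct = 100.0 * SUM(CASE WHEN TRY_CONVERT(DATETIME2(0), r.ReadingTime)  IS NULL
                                            OR TRY_CONVERT(DECIMAL(9,3), r.ReadingValue) IS NULL
                                          THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0)
        FROM   stg.SensorFeed AS r;

        IF @WtfPct > @FailThresholdPct
            THROW 50001, 'Feed stopped: WTF percentage exceeds the agreed tolerance.', 1;

        -- Otherwise, rows that validate flow on to the disciplined layer and the
        -- failures go to a rejects table for measurement and pattern analysis.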

     

  • dplaut 49149 wrote:

    Someone here imported a large text file using ADF, and all the fields ended up as nvarchar(max). I was livid. I don't think the person who did this even bothered to look at the result.

    I had to run a 5-hour job (at night, ha, ha) to convert the fields to something reasonable.

    I tell our staff to create the most "efficient" table structures because small jobs become big jobs, and you don't want time wasted on poor design.

     

    DP, as a long-time (over 60 years now) user of digital data, I have found that I actually PREFER importing as varchar(max) and nvarchar(max), etc. This avoids task failures due to data that may not be submitted in the proper format. Once I get it safely INTO a database, I can first VALIDATE the data, easily report and eliminate any bad stuff, and report it back to the source without causing my task to fail. Then I can, under MY control, CONVERT the data to my desired format without problems. Especially in these days when we receive data from many different sources, such as online websites and even intra-company servers with various applications, there is always the risk of improper and invalid data hitting our processes.

    One good example would be for you to visit https://community.quicken.com/categories/errors-and-troubleshooting and review the many reports of data failures when downloading from various data sources.   The critical thing this illustrates is that data failures need to be reported back to the source instead of the destination or the middle-man.

    I always have preferred receiving data in the most innocuous format possible.

     

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • I personally don't write code to do imports any more, but I sure used to.

    I did pretty much the opposite of what others do. There were some exceptions, of course, but I usually created staging tables using the exact datatypes for the columns that I ultimately needed. Then I used the built-in error handling options of either BULK INSERT or BCP to sequester bad rows during the import. That meant that I didn't have to cycle through the data to find bad rows, it didn't stop a million-row import because of an error or two, the errors were super easy to troubleshoot, and it became really easy to provide feedback to the provider of the data as to what they might be doing incorrectly.
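
    For anyone who hasn't used those options, a minimal sketch with invented file and table names; ERRORFILE and MAXERRORS tell BULK INSERT to divert rows that won't convert instead of aborting the whole import:

        -- Load into a staging table that already has the exact target datatypes,
        -- sequestering bad rows rather than failing the import.
        BULK INSERT dbo.StageOrders
        FROM 'C:\Imports\orders_20240101.csv'
        WITH (
            FIELDTERMINATOR = ',',
            ROWTERMINATOR   = '\n',
            FIRSTROW        = 2,                                -- skip the header row
            ERRORFILE       = 'C:\Imports\orders_20240101.err', -- bad rows land here
            MAXERRORS       = 50                                -- tolerate up to 50 bad rows
        );

    The error file can then go straight back to whoever supplied the data.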

    I also was able to do things like programmatically create tables and BCP format files for imports that varied from run to run (Double-Click.net files). It used to take the company I worked for 45 minutes to massage a short/wide table of 30,000 rows with up to 800 columns just to get a single file ready for import. Business increased to where they were importing hundreds of files per day and they simply didn't have enough time to do it all (it used PERL, Active-X, some custom compiled programming, and more), and the DTS (predecessor of SSIS) part of it was a train wreck.

    I rewrote it all in T-SQL (32-bit machines and SQL Server 2000 with BCP back then; I don't remember if it had the error sequestration yet) and was doing the import, data validation, and final table loads for 8 files every 2 minutes, and that included the massive "unpivot" I wrote in T-SQL (they didn't have UNPIVOT back then) to normalize the data when moving it from the staging tables to the final table.

    And, yeah... that was one of the times where they said "It didn't run" when they tested it because it ran so comparatively fast.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
