Big Data, Many Challenges

  • I'm facing a technological shift in my role at present, and I imagine I'm not the only one. Big Data is coming to us all, and as SQL pros we need to be ready. I've outlined my particular situation below.

    I work for a major UK website. Our developers are encouraged to be agile and have adopted emerging technologies like Hadoop, Hive and MongoDB as replacements for SQL Server. I agree with their approach, but as a SQL Server Data Warehouse professional I find it presents significant problems.

    We are still required to support marketing and finance functions, but the data for products sold via our websites is now migrating to unstructured cloud storage. Microsoft have announced a partnership with Hortonworks to provide a connector to Hadoop from SQL Server but have released no firm dates. For SQL Server to stay relevant in my organisation, I need to be able to load this data and distribute it into existing reports, outgoing data feeds and cubes.

    In the next few months I have to find a way to satisfy traditional business requirements, such as maintaining a single customer view across product streams, using tools such as Hive / Pig to extract data from Hadoop. The performance of our existing DW, on a mix of SQL Server 2000/2008, is poor due to incremental development over time by a host of consultants without any over-arching plan. Even so, the performance from Hive is much, much worse, and there are very few support tools within this immature market.

    Our situation as an innovative, developer-led organisation may be atypical, but I would really love to hear from other SQL pros who've faced processing masses of unstructured data to support business-as-usual requirements and more traditional third-party tools.

  • In my view there are two very different things going on. On one side are "Business Support" systems like ledger, billing and financials, which should stay as they are. On the other are "Decision Support" systems, where unstructured data will sooner or later hit our data warehouses.

    In this context I see unstructured data as an extension of a traditional (please read "dimensional" when I say traditional) data warehouse. This approach means that the "unstructured data" issue is more of an ETL issue than a core data warehouse one.
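
    To illustrate the "ETL issue" point, here is a minimal sketch in Python, not anything from a real system. It assumes the unstructured source hands over nested, document-style records (as MongoDB or a Hadoop export might) and flattens them onto a fixed column list that a staging table could accept. The helper names, sample documents and column names are all hypothetical.

```python
from typing import Any


def flatten(doc: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with underscores."""
    row: dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into sub-documents, carrying the parent key as a prefix.
            row.update(flatten(value, prefix=f"{name}_"))
        else:
            row[name] = value
    return row


def to_staging_rows(docs: list[dict[str, Any]], columns: list[str]) -> list[tuple]:
    """Project each document onto a fixed column list.

    Attributes missing from a document become None (NULL in the warehouse);
    attributes not in the column list are dropped.
    """
    rows = []
    for doc in docs:
        flat = flatten(doc)
        rows.append(tuple(flat.get(col) for col in columns))
    return rows


# Hypothetical sample documents; the shapes differ from record to record.
docs = [
    {"customer": {"id": 42, "email": "a@example.com"}, "product": "insurance", "amount": 120.5},
    {"customer": {"id": 42}, "product": "loans", "extra": {"channel": "web"}},
]

# Hypothetical staging-table columns.
columns = ["customer_id", "customer_email", "product", "amount"]

rows = to_staging_rows(docs, columns)
for row in rows:
    print(row)
```

    The idea is only that the variability gets absorbed in the ETL layer: what reaches the warehouse is a fixed-shape row, so the dimensional model downstream does not need to change.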

    Few data warehouses are truly near-real-time, so it is a little unrealistic to expect near-real-time access to a myriad of unstructured, external data sources when you do not yet have near-real-time access to your own data.

    Last but not least, as somebody told me a couple of days ago, search engines like Google or Bing show us what happened yesterday, while Twitter shows us what is going on right now. It shouldn't come as a surprise that a growing number of companies are monitoring Twitter and other sources as a way to keep track of, and improve, customer service.

    Bottom line: I will put up a fight if somebody attempts to change the platform of "Business Support" systems or the core of "Decision Support" systems just because "big data" is coming.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge, but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • MonkeyMan (11/5/2011)


    The performance of our existing DW, on a mix of SQL Server 2000/2008, is poor due to incremental development over time by a host of consultants without any over-arching plan. Even so, the performance from Hive is much, much worse, and there are very few support tools within this immature market.

    Our situation as an innovative, developer-led organisation may be atypical, but I would really love to hear from other SQL pros who've faced processing masses of unstructured data to support business-as-usual requirements and more traditional third-party tools.

    I'm probably preaching to the choir but this is why I don't adopt new, supposedly "better" technology right away. It can sometimes be hell to live on the "bleeding edge". 🙂

    Anyway, my recommendation would be to do what hasn't been done yet. You've identified some problems with some of the tools, especially in the area of performance. It's time to do some performance comparisons, weigh up the ROI and, possibly, develop the "over-arching" plan, not only to keep things from getting worse but to begin making incremental improvements along with the incremental development.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


