Unstructured data sources into a structured DW

  • Jeff Moden (7/6/2016)


    Steve Jones - SSC Editor (7/6/2016)


    If it's low volume, do what's easiest. It won't matter enough.

    Heh... unless the low volume data also happens to be the most critical.

    Well, if it's low volume, performance implications don't come into play, so whether you shred ahead or in real time, I'm not sure this matters.

  • Steve Jones - SSC Editor (7/6/2016)


    Jeff Moden (7/6/2016)


    Steve Jones - SSC Editor (7/6/2016)


    If it's low volume, do what's easiest. It won't matter enough.

    Heh... unless the low volume data also happens to be the most critical.

    Well, if it's low volume, performance implications don't come into play, so whether you shred ahead or in real time, I'm not sure this matters.

    Understood but not the point I was trying to make. The point is that it's unstructured, the OP has no specs for what the data will be, the Devs are changing it all the time, and no one knows what's going on with the data from one transmission to the next.

    Like Eric said, this data doesn't belong in a DW to begin with and if it is critical data, then having to touch it with human hands isn't the way to go. It should be 100% consistent and fully automated.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This is semi-structured data really. It records new customers and has tags such as Name, Address + 60 other fields.

    When I said unstructured I mentioned this as JSON data has no schema.

    I appreciate this challenge is felt more widely regardless of data source however, I'm expecting JSON will change more frequently as it's administered by Marketing/Sales (they seem to bring in new requirements quite frequently) and the absence of a schema whens there is no strongly typed interface/contract.

    I need to bring it into the DW as it's a key business entity.

    One way or the other we have to parse the data. I'm looking to do something that is low maintenance as when the API changes and I get a call from marketing to add an attribute to the customer dim I won't have much time.

    I'm going to parse the fields I need and leave the others so that if the API does change there's less risk of a breaking change. Doesn't help for instances where fields are re-used !

    I reckon we'll see more of this in the future. At least it makes for interesting day 🙂

  • leehbi (7/7/2016)


    This is semi-structured data really. It records new customers and has tags such as Name, Address + 60 other fields.

    When I said unstructured I mentioned this as JSON data has no schema.

    I appreciate this challenge is felt more widely regardless of data source however, I'm expecting JSON will change more frequently as it's administered by Marketing/Sales (they seem to bring in new requirements quite frequently) and the absence of a schema whens there is no strongly typed interface/contract.

    I need to bring it into the DW as it's a key business entity.

    One way or the other we have to parse the data. I'm looking to do something that is low maintenance as when the API changes and I get a call from marketing to add an attribute to the customer dim I won't have much time.

    I'm going to parse the fields I need and leave the others so that if the API does change there's less risk of a breaking change. Doesn't help for instances where fields are re-used !

    I reckon we'll see more of this in the future. At least it makes for interesting day 🙂

    I actually work in marketing too. If you're dealing with the Google API's from Adwords and so forth, then you don't actually have a lot of changes there. What I do is just bring in most of everything they need or may need in the future. It all starts with a conversation as others have mentioned previously. You have to get them on board with being consistent and to be fairly honest, being I work in marketing too with data warehousing, the standard KPI's do not change that much unless you're doing something dramatically different.

    If you're working with other data sources and API's that are always changing internally or from a third-party, then yeah that always makes things difficult.

    Feel free to list out some of the third parties if you can. If these are digital marketing based, I can surely provide some insight on everything from Google Analytics to BrightEdge and beyond.

  • I actually work in marketing too. If you're dealing with the Google API's from Adwords and so forth, then you don't actually have a lot of changes there. What I do is just bring in most of everything they need or may need in the future. It all starts with a conversation as others have mentioned previously. You have to get them on board with being consistent and to be fairly honest, being I work in marketing too with data warehousing, the standard KPI's do not change that much unless you're doing something dramatically different.

    Google Adwords is another source on my horizon !

  • leehbi (7/7/2016)


    This is semi-structured data really. It records new customers and has tags such as Name, Address + 60 other fields.

    When I said unstructured I mentioned this as JSON data has no schema.

    I appreciate this challenge is felt more widely regardless of data source however, I'm expecting JSON will change more frequently as it's administered by Marketing/Sales (they seem to bring in new requirements quite frequently) and the absence of a schema whens there is no strongly typed interface/contract.

    I need to bring it into the DW as it's a key business entity.

    One way or the other we have to parse the data. I'm looking to do something that is low maintenance as when the API changes and I get a call from marketing to add an attribute to the customer dim I won't have much time.

    I'm going to parse the fields I need and leave the others so that if the API does change there's less risk of a breaking change. Doesn't help for instances where fields are re-used !

    I reckon we'll see more of this in the future. At least it makes for interesting day 🙂

    Yuck.

    That being said, the best bet might be a generic staging structure to store the shredded Json data into. Then you can have a secondary process pull out the elements it recognizes from this into your final DW product. Make sure to communicate ad nauseaum that there will be gaps between the final product and what was served from the .json source. Task someone to put together a workflow so the DW can be modified when needed to serve new data elements so one one is 'surprised' that they changed the source and it didn't automagically appear in the DW.

    If everyone plays nicely and does their pieces, it can actually work somewhat smoothly, but good luck on that one 🙂

    I've shredded .json data with powershell scripts into pre-defined tables that matched the .json structure, but my provider actually gave me the schema, so the task wasn't hard at all.

    If it were me, i'd be building something like a process that recognizes known data chunks and reports on things it doesn't recognize, with some sort of workflow to handle the unknowns.

  • leehbi (7/8/2016)


    I actually work in marketing too. If you're dealing with the Google API's from Adwords and so forth, then you don't actually have a lot of changes there. What I do is just bring in most of everything they need or may need in the future. It all starts with a conversation as others have mentioned previously. You have to get them on board with being consistent and to be fairly honest, being I work in marketing too with data warehousing, the standard KPI's do not change that much unless you're doing something dramatically different.

    Google Adwords is another source on my horizon !

    Then you're pretty solid because those API's do not change all that often. The data is highly structure, but the output is JSON. The end user may change, but that's with any data source. I would surely stick to having that conversation and get them on the same page with your needs if possible. I personally run into this all the time, that's why I pull in what they ideally want and some additional metrics and dimensions from the data sources that they may want in the future.

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply