Data Science Sanity Checks

  • Comments posted to this topic are about the item Data Science Sanity Checks

    Best wishes,
    Phil Factor

  • This was a very thought provoking editorial. I'm still cogitating but immediately...

    I see the value of having someone whose job it is to check these things out and know the gritty details of the data. I think this is a crucial role, but I'd encourage everyone involved in the production, storage, and use of the data to try and understand it at a fine level also.

    That being said, I'm not sold on the idea that there can ever be sufficient resources to identify every trend emerging within the data and investigate it before the rest of the business picks up on it. The other option, only 'releasing' data to folk after it has been vetted, seems like an impediment to work, and still couldn't guarantee success.

  • Remember that investment management is about risk, not return. In contemplating which opportunities to pursue, you have to understand what risks you are taking, and then acknowledge afterwards whether you were right or not.

  • The big problem with this is that business is about making money, and sometimes there is money to be made even from bad data. Take the recent Twitter hack of the AP news account: businesses that reacted quickly could short the stock market and make a bundle off that bad data, even though it was absolutely off base and a quick check would have shown that.

  • Awesome subject Phil, thanks. I relate it to driving a car: you have to keep your eyes on the road, the mirrors, and the dashboard/control panel (throw in some breakfast and a smartphone too for added trouble), and there's your business system. There are crazy drivers (the other drivers, naturally), road hazards, accidents, etc. It's a constant barrage, and it requires constant vigilance, monitoring input and output, to get where you need to go.

  • One of the real things that must be checked when confronted by an anomalous item is: is it an artifact? The way data is collected, the questions asked, the context of its selection can sometimes cause very subtle distortions, not noticeable in the small scale but visible when trying to pull signal out of a lot of noise.

    At times (often) the researcher doesn't have access to the context of the original data acquisition and there is plenty of room for serious errors.

    ...

    -- FORTRAN manual for Xerox Computers --

  • So the whole business should stop seeing crucial data that supports their daily decision-making, and wait for a data scientist to sanitise the data (however long it takes)?

    The fact is the end users of those data are the domain experts who can tell what is rogue data and what is real trend better than anybody else, including the data scientist who has generic data knowledge but not necessarily the domain knowledge.

    IMHO, we should just give the data to the business, along with tools that highlight abnormal trends and help them do the analysis. That way you don't stop them seeing the data, but you also help them identify rogue data.
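    The "tools that highlight abnormal trends" idea can be as lightweight as a robust outlier score attached to a report. A minimal sketch in Python, assuming a simple numeric series; the function name, the 3.5 threshold, and the sample data are all illustrative, not from the editorial:

```python
# Flag values that sit far from the series median, using the median
# absolute deviation (MAD). This is one common robust alternative to
# a plain standard-deviation cutoff; the threshold is illustrative.
from statistics import median

def flag_anomalies(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # series is too flat to score this way
    flagged = []
    for i, v in enumerate(values):
        # 0.6745 scales the MAD so the score is comparable to a z-score
        score = 0.6745 * (v - med) / mad
        if abs(score) > threshold:
            flagged.append(i)
    return flagged

daily_sales = [102, 98, 101, 97, 100, 350, 99, 103]
print(flag_anomalies(daily_sales))  # prints [5]: the 350 spike stands out
```

    Something like this warns the users to look twice at the flagged points without withholding the data from them; whether a flagged point is rogue data or a real trend is then a judgement call for the domain expert or the data scientist.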

  • I agree with Charles. At some point we have to trust our users.

  • charles.wong (5/13/2013)


    So the whole business should stop seeing crucial data that supports their daily decision-making, and wait for a data scientist to sanitise the data (however long it takes)?

    ...

    The point was caution, especially with market research and other determinations that aren't strictly dollars and cents.

    Rogue data and artifacts are not the same thing. Polling organizations (at least the good ones) have learned the pitfalls of categorization. The majority of potential customers may say they prefer product B, but unless you know what A, C, and D were, or whether other options were missing from the list (the old 'have you stopped beating your wife?' conundrum), you don't know what they would actually buy. Even priming questions, that is, seemingly unrelated questions asked before the choice, have been proven to make a big difference in the answers given.

    Disastrous business and political decisions have been made by not understanding the data. When dealing with large amounts of data from disparate and uncontrolled sources, the risk is higher. By all means listen to those close to the issue, but remember that everyone, including those close to the issue, can unintentionally bring in their own preferences and biases (remember 'New Coke'?).


  • charles.wong (5/13/2013)


    So the whole business should stop seeing crucial data that supports their daily decision-making, and wait for a data scientist to sanitise the data (however long it takes)?

    The fact is the end users of those data are the domain experts who can tell what is rogue data and what is real trend better than anybody else, including the data scientist who has generic data knowledge but not necessarily the domain knowledge.

    IMHO, we should just give the data to the business, along with tools that highlight abnormal trends and help them do the analysis. That way you don't stop them seeing the data, but you also help them identify rogue data.

    That really expresses where the efforts of data managers should be expended. Thanks for saying it so well.

  • The fact is the end users of those data are the domain experts who can tell what is rogue data and what is real trend better than anybody else, including the data scientist who has generic data knowledge but not necessarily the domain knowledge.

    I agree that this should be the case, but if that is your experience, then you've been far more fortunate than I have! The data scientist may not have the necessary domain knowledge in some cases, but he or she knows of all the specialist expertise that can be alerted in order to investigate. It would, I believe, be very wrong to neglect this and just leave it to individual areas of the business to figure it out. The data scientist has to check anyway that the changes in data aren't due to error within IT, and that requires a measure of understanding of the business domain.

    Best wishes,
    Phil Factor

  • Hi Phil

    Thanks for the response. Don't get me wrong, I'm not saying we don't need Data Scientists, and I totally agree that the Data Scientists should do the investigations you described in those circumstances.

    I was just a bit reserved about having to put things on hold until the investigation is complete, as the timeliness of some of this information could be crucial for the business. My thinking is along the lines of using intelligent tools to help highlight these anomalies and warn the users to take the data with a pinch of salt (or even allow them to do their own analysis). This way users have the data, but are also warned about the possible flaws. The data scientist could then tell them the result after finishing his investigation.

    I think we are more or less on the same page.

    Regards

    Charles

