Most of us know that data is being used to make more and more decisions inside all kinds of organizations, from retail giants to banks to sports teams. We are constantly asked for, or see reports of, data-driven decisions, and we often need to show data that supports and explains the rationale for making some choice. As our populace becomes more data savvy, I expect this trend to continue.
AI (Artificial Intelligence) and the related field of Machine Learning (ML) are becoming more widely used. From mobile phones to autos to trading systems, we regularly see new "AI capabilities" added to products and services. No business or industry seems immune, and I'm sure many of you are seeing AI incorporated into your work, or feeling pressure to start using it. As you work with AI, or start to, you'll quickly realize how important data is to your efforts.
This is true for the cleanliness of data, but perhaps even more so for the tagging of data sets. As Amazon learned, building an AI or ML system is hard. They scrapped one system that was being used to rate resumes and help their recruiters sort through the volume of applications they received. Why? Because of bias.
Apparently the system would downgrade women's resumes for various reasons. To me, this is a perfect example of a principle I've held throughout my career: garbage in, garbage out. In this case the problem wasn't necessarily bad data, but bad tagging of what constituted a good or a bad resume, probably stemming from the internal prejudices of a few people.
There will be more dangers as we use ML and AI technologies in our work. It won't be enough to clean the raw data used for training; we must also clean and properly manage the tagging that marks which results we are looking for. As in much of our software, it's easy to consider only the happy path and tag only those items we think are good results. That is useful, but we might also be unconsciously tagging other results as bad, which appears to be what happened at Amazon.
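To make that failure mode concrete, here is a minimal sketch in Python using scikit-learn. Everything in it is hypothetical, the features, the keyword, the numbers; this is not Amazon's data or system, just an illustration of how biased tagging, rather than bad raw data, gets learned by even a simple model.

```python
# A minimal, hypothetical sketch of biased tagging. We simulate reviewers
# who downgrade resumes carrying a "women's organization" keyword when
# they assign the good/bad labels, then inspect what a classifier learns.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000

# Raw data: two legitimate signals and one proxy attribute (all invented).
years_experience = rng.normal(5, 2, n)        # genuine qualification
skills_match     = rng.normal(0.5, 0.2, n)    # genuine qualification
womens_keyword   = rng.integers(0, 2, n)      # e.g. "women's chess club"

# Biased tagging: the label depends on real quality, plus a penalty the
# reviewers apply to the keyword, independent of qualifications.
quality = 0.6 * years_experience + 4.0 * skills_match
label = (quality + rng.normal(0, 1, n) - 1.5 * womens_keyword) > 4.0

X = np.column_stack([years_experience, skills_match, womens_keyword])
model = LogisticRegression().fit(X, label)

# The model faithfully reproduces the reviewers' prejudice: the keyword
# gets a negative weight even though it says nothing about skill.
for name, coef in zip(["experience", "skills", "womens_keyword"],
                      model.coef_[0]):
    print(f"{name:>15}: {coef:+.2f}")
```

The keyword says nothing about qualification, yet the classifier learns a strongly negative weight for it, simply because the people who tagged the training labels treated it as a negative signal. The raw data was fine; the tags were not.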
We can build systems that do a better, more rational job than most humans, but it takes extraordinary care to ensure our training data is free of bias. Unfortunately, most people think they're not biased, and most are unwilling to spend extra resources to deeply examine their data. Those are two things that worry me about the future of the AI/ML systems that will inform us.