Many of us have likely been asked about data science or machine learning in the last few years by someone in our organization. This has become a hot field, with many companies looking for ways to use the technology to improve their work. While I don't know how successful these projects have been in most organizations, I know that some areas are finding practical applications. Image recognition, translation, and even some application scoring systems have benefited from machine learning algorithms.
To build a model that works well, we need training data, often lots of it. We also need metadata that describes how the training data applies to a particular question. For example, if we have lots of pictures of dogs, we might need to tag each picture with its breed in order for a system to differentiate among them. If we want loan applications scored, we should have a corpus of applications that have already been scored. This metadata allows the system to learn.
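To make the idea of tagged training data concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two features (weight and height), the invented measurements, and the simple nearest-centroid approach, which is far cruder than what a real image recognition system would use. The point is only that the breed tags are what let the code learn anything at all.

```python
from collections import defaultdict
from math import dist

# Hypothetical labeled training data: (weight_kg, height_cm) -> breed tag.
# The numbers are invented for illustration, not real breed statistics.
TRAINING_DATA = [
    ((3.0, 20.0), "chihuahua"),
    ((2.5, 18.0), "chihuahua"),
    ((30.0, 58.0), "labrador"),
    ((32.0, 60.0), "labrador"),
]


def train(examples):
    """Average the features for each label to get one centroid per breed."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = train(TRAINING_DATA)
print(predict(centroids, (28.0, 55.0)))
```

Without the breed tags, `train` would have nothing to group by, which is the role the metadata plays in a real project as well.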
There is a company, Clearview AI, that markets a facial recognition system. To build their model, they scraped images from YouTube and other Internet sources, without consent from Google or the subjects of the videos. This is interesting, as the data itself is publicly available, but gathering it into a database and using it for other purposes might run afoul of data privacy laws, like the GDPR.
I don't quite know how to feel about this use of public data. While I don't mind people viewing my pictures, I'm not sure that I like the idea of them being copied into a database for some other purpose. That might seem silly, or even strange, but I do think there is power in data and more power in more data. Allowing others to put my images in a database and use them, perhaps to train a model to recognize me, feels like overreach.
If you need data for your company, or your idea, what can you do? Many people just scrape Google, Facebook, etc., to get data. That might cause legal issues in some places, and you ought to be aware of the implications if you choose to do this, or are asked to do this. I don't think this is how we want data to be gathered. I know Microsoft has some guidelines for responsible AI, but many companies don't necessarily have rules in place. Hopefully that will change over time.