One of the frustrations of medical research is getting good, representative data. It should, you might suppose, be easy: you just scoop up all the medical records into a huge medical database, anonymize the data, and then sharpen up your Python skills. Sigh; nope.
The problem with the main 'big-data' medical sets such as the IBM Watson/Truven database is that they largely represent only people with medical insurance. This doesn't always cause problems, provided all conclusions bear in mind that it is a skewed population. This is why the medical records of a country such as the UK are potentially so valuable for research: they represent an entire population. With the help of a database that provides information on this scale, we can do a great deal more to advance medical and pharmaceutical science.
What, if anything, stands in our way? The first problem is that such records can't reliably be pseudonymized or 'de-identified'. If you are researching individuals, and you already know quite a bit about them, there are likely to be a couple of unusual characteristics that will allow you to match the records to the individual. Researchers have shown that this kind of matching can achieve a match rate of over 90%.
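To see why a handful of characteristics is enough, here is a minimal sketch that counts how many records in a 'de-identified' table are the only ones with their particular combination of attributes. The records, the field names (birth_year, postcode_area, sex) and the helper unique_fraction are all invented for illustration; they are not drawn from any real dataset or from the studies mentioned above.

```python
from collections import Counter

# Hypothetical "de-identified" records: names removed, quasi-identifiers kept.
records = [
    {"birth_year": 1954, "postcode_area": "CB1", "sex": "F"},
    {"birth_year": 1954, "postcode_area": "CB1", "sex": "F"},
    {"birth_year": 1987, "postcode_area": "CB1", "sex": "M"},
    {"birth_year": 1962, "postcode_area": "NR2", "sex": "F"},
    {"birth_year": 1990, "postcode_area": "NR2", "sex": "M"},
]


def unique_fraction(rows, keys):
    """Return the fraction of rows whose combination of `keys` occurs once."""
    if not rows:
        return 0.0
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


# With one attribute nobody stands out; with three, most records do.
one_field = unique_fraction(records, ["sex"])
three_fields = unique_fraction(records, ["birth_year", "postcode_area", "sex"])
print(one_field, three_fields)
```

In this toy table no record is unique on sex alone, but three of the five become unique once birth year and postcode area are added. Anyone who already knows those three facts about a person can pick out that person's row.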
The second problem is a regulatory problem. Medical records in Europe belong not to the entity that collects or stores the data but to the patient. Although few people will disagree with the intention of allowing their records to be used, many will refuse consent because of doubts about security. If there is a breach, you can't change your medical history as you would your password. The GDPR's guidance is that confidential patient information can only be used by hospital and university researchers, medical royal colleges and pharmaceutical companies researching new treatments.
The third problem is the poor general understanding of the constraints of statistics. Statistical methods should come with the same warnings as a chainsaw. We still haven't reversed out of the statistical nightmares that presented a false '40%' conclusion about the value of statins in reducing cholesterol, certain types of which we now discover we need in spadefuls for good health, unlike statins.
The fourth problem is that there is no central database for medical records. Creating one is difficult because of the poor "interoperability" of the data, and because the many health information systems cannot work together to join up the many separate, and sometimes warring, 'care settings' making up the NHS (National Health Service).
A fifth problem is bad data, by which I mean poor data quality, completeness and accuracy. There have been many reports of potentially incorrect codes being used to record illnesses and treatments, as well as missing or invalid identifiers, such as NHS numbers.
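Invalid identifiers are at least partly detectable. An NHS number is ten digits long, and the tenth is a check digit calculated from the first nine with a Modulus 11 scheme. The sketch below assumes that scheme; the function name is_valid_nhs_number is my own. It only shows that a number is well formed, not that it was ever issued or belongs to the right patient.

```python
def is_valid_nhs_number(value):
    """Check the format and Modulus 11 check digit of an NHS number."""
    digits = value.replace(" ", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        # No single digit can represent 10, so such numbers are never valid.
        return False
    return check == int(digits[9])


print(is_valid_nhs_number("943 476 5919"))  # passes the checksum
print(is_valid_nhs_number("943 476 5918"))  # last digit altered
```

A check like this catches mistyped and truncated numbers cheaply, but it cannot catch a valid number attached to the wrong record, which is the harder part of the data-quality problem.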
It is typical of projects of this sort that few, if any, problems involve database technology. We have all the analysis tools we need. The tasks we face are mainly organizational and they'll take time to resolve.