Disclaimer: I am not a medical, radiology or epidemiology professional. This article was an experiment, from an engineering and data scientist perspective, and should be regarded as such.
To apply Deep Learning for COVID-19, we need a good dataset – one with lots of samples, edge cases, metadata, and different looking images. We want our model to generalize to the data, such that it can make accurate predictions on new, unseen data. All the work for this article is available on GitHub.
Unfortunately, not much data is available, yet there are already posts on LinkedIn and Medium claiming over 90%, or in some cases 100%, accuracy at detecting COVID-19 cases. These posts usually contain mistakes that go unacknowledged.
We have actively tried to mitigate some of the usual mistakes by thinking through the following problems and coming up with solutions to them:
- You might not believe the following is an issue, but it is a typical fallacy for newcomers to machine learning and data science: violating the first rule of machine learning, which is to never, ever test your model's performance with the same data you used to train it. Skipping a held-out test set and instead measuring accuracy on the training data does not give an accurate picture of how well the model generalizes to new, unseen data.
- Not using computer vision techniques to achieve better generalization – augmentation being an absolute necessity, especially in our case where there are very few samples for our model to learn from.
- Not thinking about what the model learns. Will our model actually learn the pattern of what COVID-19 looks like on an X-Ray image, or is it likely that there is some other noisy pattern in our dataset that it will learn instead?
One amusing story from early machine learning is the tank detection problem: "photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days."
- Not using the correct metrics. If you use the accuracy metric on a heavily imbalanced dataset, your model might look like it's performing well in the general case, even though it performs poorly on COVID-19 cases. 95% overall accuracy is not that impressive if your accuracy on COVID-19 cases is 30%.
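The first point above, holding out a test set, can be sketched with scikit-learn's `train_test_split`. The image arrays and labels below are hypothetical placeholders standing in for a real chest X-ray dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 grayscale images (224x224) and binary labels
# (1 = COVID-19, 0 = other). In practice these come from the real dataset.
rng = np.random.default_rng(0)
images = rng.random((100, 224, 224))
labels = rng.integers(0, 2, size=100)

# Hold out 20% for testing; stratify keeps the class balance
# roughly equal in both splits, which matters with so few positives.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 224, 224) (20, 224, 224)
```

Every accuracy number you report should come from `X_test`/`y_test`, which the model never sees during training.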
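For the augmentation point, a minimal NumPy-only sketch of random flips and shifts is shown below; a real pipeline would use a library such as Keras' `ImageDataGenerator` or `torchvision.transforms`, and the shift here wraps pixels around via `np.roll` rather than padding, which is a simplification:

```python
import numpy as np

def augment(image, rng):
    """Apply simple random augmentations to a 2D image array.

    A dependency-free sketch: random horizontal flip plus a random
    shift of up to 10 pixels. np.roll wraps pixels around the edges,
    which a production pipeline would replace with proper padding.
    """
    if rng.random() < 0.5:
        image = np.fliplr(image)
    dy, dx = rng.integers(-10, 11, size=2)
    image = np.roll(image, (dy, dx), axis=(0, 1))
    return image

rng = np.random.default_rng(0)
img = np.arange(224 * 224, dtype=float).reshape(224, 224)
aug = augment(img, rng)
print(aug.shape)  # (224, 224)
```

Applying such random transformations each epoch effectively multiplies the number of distinct training samples the model sees, which is crucial when only a handful of COVID-19 images exist.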
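The metrics point can be made concrete with a hypothetical imbalanced test set: a model that almost always predicts "normal" scores 96% accuracy while catching only one in five COVID-19 cases. Per-class recall from scikit-learn exposes this:

```python
from sklearn.metrics import accuracy_score, recall_score, classification_report

# Hypothetical imbalanced test set: 95 "normal" (0) and 5 COVID-19 (1) cases.
y_true = [0] * 95 + [1] * 5
# A model that predicts "normal" nearly every time looks accurate overall...
y_pred = [0] * 95 + [1, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))                # 0.96
# ...but recall on the COVID-19 class exposes the failure.
print(recall_score(y_true, y_pred, pos_label=1))     # 0.2
print(classification_report(y_true, y_pred, target_names=["normal", "covid-19"]))
```

Reporting per-class recall (and precision) alongside accuracy is the simple fix for imbalanced datasets like this one.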
Here is a list of different data sources that I have accumulated over the course of the coronavirus outbreak. Currently, this is the best we are going to get from the internet, because the data being collected in many countries and hospitals is classified information. Even if it were not classified, you would still need consent from each patient.
- GitHub: COVID-19 Chest X-R
Source - Continue Reading: https://mlfromscratch.com/covid-19-modeling/