COVID-19 Study: Part I – My Journey into Data Science and Machine Learning

Background and Motivation

Since the start of the COVID-19 pandemic, numerous studies have aimed to understand how the virus propagates, to assess how effective different measures are in slowing its spread, and to forecast the number of positive cases and/or deaths over various horizons (typically up to six months ahead).

While it is important to understand what needs to be done to stop or slow the spread of the virus, measures like face covering, social distancing, quarantines, and business shutdowns are difficult to quantify. In addition, these measures are clearly not applied uniformly across countries, which makes it difficult to draw universally applicable conclusions.

In the study presented here, we adopt a different approach by asking and attempting to answer the question: Is it possible to predict the deaths caused by the virus in different countries based on common, well-established factors which are expected to affect, to a lesser or greater degree, the spread of the virus and its mortality rate?

These factors together with the sources used for collecting the data are listed below:

Population density – population per square km (2018 data from https://data.worldbank.org/indicator/EN.POP.DNST);
Agglomerates (%) – percentage of people living in large metropolitan areas with population of one million and above (2019 data from https://data.worldbank.org/indicator/EN.URB.MCTY.TL.ZS);
Age demographics – percentage of the population in the age groups of 0-14, 15-64, 65- (2019 data from http://wdi.worldbank.org/table/2.1);
GDP per capita in USD (2019 data from https://data.worldbank.org/indicator/NY.GDP.PCAP.CD);
Health access and quality (HAQ) index (2016 data from https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(18)30994-2/fulltext).

The first two factors are expected to be critical in creating favorable or unfavorable conditions for the spread of the virus. Age demographics is clearly an important factor, since most viruses are known to affect people in different age groups to a different extent. The final two factors relate to the resources and the quality of healthcare a country can draw on in dealing with health disasters or similar events.

The COVID-19 data used in this study is taken from the Johns Hopkins University site, https://coronavirus.jhu.edu/data/mortality, as of July 22, 2020, and includes:

Confirmed positive cases
Number of deaths
Case fatality – percentage of deaths among positive cases
Deaths per 100K of the country population
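To make the last two quantities concrete, here is a minimal sketch of how they follow from the raw counts. The function names and the numbers are made up for illustration and are not taken from the dataset.

```python
def case_fatality_pct(deaths: int, confirmed_cases: int) -> float:
    """Deaths as a percentage of confirmed positive cases."""
    return 100.0 * deaths / confirmed_cases


def deaths_per_100k(deaths: int, population: int) -> float:
    """Deaths scaled to a population of 100,000."""
    return 100_000.0 * deaths / population


# Made-up numbers for a hypothetical country, for illustration only.
cfr = case_fatality_pct(deaths=500, confirmed_cases=20_000)
rate = deaths_per_100k(deaths=500, population=10_000_000)
print(f"Case fatality: {cfr:.1f}%  Deaths per 100K: {rate:.1f}")
```

Note that the first quantity depends on how many cases were confirmed, that is, on testing, while the second depends only on the death count and the population.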

Although Case fatality is briefly considered here, this is done mostly for illustration and completeness. Case fatality depends greatly on the ability of a country to perform large-scale testing, and to draw any meaningful conclusions the people tested have to comprise a statistically representative sample of the country’s population. However, to the best of our knowledge, COVID-19 testing meeting this criterion has not been done in the majority of the world’s countries, which makes the Case fatality values unreliable.

The COVID-19 record which arguably provides an adequate basis for comparison across different countries is Deaths per 100K of the country’s population, and this is the focus of our study.

All of the above-mentioned data has been compiled in “covid19_2020_07_22.csv” and can be found at the link below: https://github.com/marin-stoytchev/data-science-projects/tree/master/covid19_project.

A preview of the tabulated data used in the study is shown in Fig. 1.

**Fig. 1: Tabulated data for 159 countries with COVID-19 records as of 07/22/2020**

The table contains information for 159 countries for which COVID-19 deaths per 100K records were available as of July 22.

The study is organized in five sections as described below:

Background and motivation
Examining the distributions, correlations and relationships between features
Predicting the HAQ Index
Predicting the number of COVID-19 deaths per 100K
Discussion of results and next steps

For reference, the complete code for the analysis and modeling of the data can be found at the link given above for the tabulated data file.

Distributions and Relationships of Different Features

Feature Values Distributions

The bar plot in Fig. 2 shows the number of countries per continent which are included in the collected data.

**Fig. 2: Number of countries in the study by continent**

The exact number of countries in each continent is as follows:

Africa 48
Europe 43
Asia 38
N. America 16
S. America 12
Australia 2

Box plots of different features per continent are used as the most natural way to gain an understanding of the magnitude and distributions of the values of these features. We first examine the distributions of the two population density related features which are presented in Fig. 3 below.

**Fig. 3: Distributions of countries’ population per square km and of percentage of population in large metropolitan areas by continent**

The plots show two trends:

On average, high population density with low(er) percentage of people living in large (> 1 mill.) metro areas is observed in Asia, Europe, Africa, and North America
On average, low population density with high(er) percentage of people living in large (> 1 mill.) metro areas is observed in South America and Australia

Considering that population density could certainly play a role in the spread of COVID-19, one could argue that these two trends cancel each other, since a larger portion of the people in countries with low population density live in large metropolitan areas. In other words, it is reasonable to expect that differences in population density should not cause significant differences in the COVID-19 data for different countries.

We turn our attention next to the age demographics, which is naturally expected to have an impact on the number of COVID-19 deaths. The reason is that numerous records and studies have shown a substantially higher mortality rate for people of age 65 and older.

**Fig. 4: Distributions of countries’ age demographics by continent**

From the plots above one can conclude that:

Africa is the “youngest” continent with average of ~ 40 % of population below 15 years old and only ~ 3 % of the population on average older than 64
Europe is the “oldest” continent with average of ~ 15 % of population below 15 years old and ~ 20 % older than 64
The rest of the continents can be classified as “middle-aged”, with similar age demographics and some deviations which are more pronounced for the age group of 65 and older.

The distributions of GDP per capita and HAQ Index values are shown in Fig. 5.

**Fig. 5: Distributions of countries’ GDP per capita and HAQ Index by continent**

It is clear that these two features have very similar behavior. Highest scores in both charts belong to countries in Australia and Europe. Countries in Africa have the lowest scores in both charts. Countries in Asia, North and South America show middle scores.

Finally, the charts for Case Fatality and Deaths per 100k are presented in Fig. 6.

**Fig. 6: Distributions of countries’ Case fatality and Deaths per 100K by continent**

Both of these features show large variations and their distributions do not seem to correlate well. For the reasons mentioned above, we do not consider Case Fatality data to be reliable. Thus, Deaths per 100k is selected as the target of our study.

Correlation and Relationships Between Features

The correlation between different features and Deaths per 100K is presented in Fig. 7. Here, the age group of 0-14 is omitted, because it is fully determined by the other two age groups (the sum of all three age groups equals 100%).

**Fig. 7: Heatmap of correlations between different features**
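As an illustration of what each cell of such a heatmap contains, the sketch below computes Pearson correlation coefficients for a tiny made-up table and shows why the 0-14 age share adds no information. The values are invented for four hypothetical countries, and the study's own code may well use a library routine instead of the hand-written function.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up values for four hypothetical countries (not taken from the dataset).
table = {
    "Age 15-64": [55.0, 60.0, 65.0, 66.0],
    "Age 65-": [3.0, 8.0, 15.0, 20.0],
    "Deaths per 100K": [0.5, 2.0, 30.0, 12.0],
}

# The 0-14 share is fixed by the other two shares, so it can be left out.
age_0_14 = [100.0 - a - b for a, b in zip(table["Age 15-64"], table["Age 65-"])]

# Full correlation matrix, stored as a dictionary keyed by feature pairs.
names = list(table)
corr = {(a, b): pearson(table[a], table[b]) for a in names for b in names}
for a in names:
    print(a, [round(corr[(a, b)], 2) for b in names])
```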

The observations from the correlation heatmap are as follows:

No strong correlation is observed between most of the features (for clarification, strong correlation here is considered correlation above 0.7)
The highest degree of correlation is observed between
- HAQ Index and Age 65-
- HAQ Index and Age 15-64
- HAQ Index and GDP per capita ($)

all of which is natural to expect.

Regarding correlation between Deaths per 100K and the other features
- Overall, low degree of correlation is observed (below 0.5)
- Highest degree of correlation appears to be with HAQ Index, Age 65-, and GDP – thus, these relationships need to be examined more closely
- Lowest degree of correlation is observed with the two population density related features

Following on the above observations, the relationships between Deaths per 100K and the three features most correlated with it are examined. Figs. 8-10 provide scatter plots which are typically used to visualize the relationship between different features.

**Fig. 8: Scatter plot of Deaths per 100K vs. Age 65-**

**Fig. 9: Scatter plot of Deaths per 100K vs. HAQ Index**

**Fig. 10: Scatter plot of Deaths per 100K vs. GDP per capita**

The following observations can be made from these plots:

No well-defined relationships are observed between the number of COVID-19 deaths and any of these features.
Countries from Africa and Asia exhibit very low numbers of COVID-19 deaths, despite the large spread of the values of all three features among these countries.
Countries from South and North America and from Europe show a wide range of COVID-19 death numbers (including very large values), with Europe showing the widest range.
Countries with the lowest GDP and HAQ scores consistently show low numbers of COVID-19 deaths, while countries with high scores in these two categories show a higher likelihood of large numbers of COVID-19 deaths.

All of the above observations raise questions about the COVID-19 data, with the last observation being particularly puzzling since it shows a strongly counter-intuitive relationship.

ML Algorithm and Model Validation: Predicting HAQ Index

In this section, we apply the Machine Learning (ML) algorithm XGBoost to predict the HAQ Index of randomly selected countries from the values of the other features considered here. The motivation is three-fold: 1) validate the applicability of the ML algorithm; 2) establish the relevance of the selected features; 3) gain confidence in the accuracy of the model used.

The HAQ Index is selected for this exercise because none of the features used for predicting it enters explicitly into the calculation of its values. As described in the June 2, 2018 article in the medical journal The Lancet (https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(18)30994-2/fulltext), the HAQ Index values for each country are calculated based on the success rate of preventing and treating serious and common illnesses, which include several types of cancer, tuberculosis, tetanus, measles, and others. Thus, the HAQ Index is a highly specialized medical index and is well suited for the validation of the ML algorithm applied here and for establishing the accuracy of the model’s predictions.

As mentioned above, XGBoost is used to create the predictive model. Currently, XGBoost is one of the most widely and successfully used ML algorithms for regression and classification on tabular data. A detailed description of XGBoost is beyond the scope of this work. A good reference on XGBoost is the paper by Tianqi Chen and Carlos Guestrin (https://arxiv.org/pdf/1603.02754.pdf).

To make predictions with the XGBoost model, 80% of the countries are selected completely at random and the data for these countries is used to train the model. The remaining 20% of the data is used as a test set for validating the model’s predictions. After training, the model predicts the HAQ Index of the test set by using the test data for all features except the HAQ Index. Then, the predicted HAQ Index values are compared to the true test HAQ Index values. To achieve the best possible accuracy, the model is optimized before making the predictions.
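The random 80/20 split described above can be sketched with the Python standard library alone. The function name, the seed, and the placeholder country labels are assumptions made for this illustration; the study's own code may rely on a library helper for this step.

```python
import random


def split_countries(countries, test_fraction=0.2, seed=0):
    """Shuffle the countries and hold out a fraction of them as the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(countries)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder labels standing in for the 159 countries in the table.
countries = [f"country_{i}" for i in range(159)]
train, test = split_countries(countries)
print(len(train), len(test))
```

The model is then fitted on the training countries only, and the held-out countries are used solely for comparing predicted and true values.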

The result from the model’s predictions of the unknown HAQ Index values is shown in Fig. 11.

**Fig. 11: Model predictions of unknown HAQ Index values**

We would like to note that the error presented in the plot is the RMSE of the predictions divided by the average of the test HAQ Index values. This allows for adequate comparison between the accuracy of predictions from different models and with different data.
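Written out as a minimal sketch, this error measure looks as follows. The function name and the three made-up values are assumptions for illustration only.

```python
import math


def relative_rmse(y_true, y_pred):
    """RMSE of the predictions divided by the mean of the true values."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return rmse / (sum(y_true) / n)


# Made-up true and predicted values; every prediction is off by 6 units.
y_true = [40.0, 60.0, 80.0]
y_pred = [46.0, 54.0, 86.0]
err = relative_rmse(y_true, y_pred)
print(f"Relative error: {100 * err:.0f}% of the test average")
```

Because the RMSE is divided by the average of the true values, the result is dimensionless and can be compared across targets measured on very different scales.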

As can be seen from the plot, the model predicts the HAQ Index values accurately – predicted data points are closely aligned with the perfect fit line – and the error is only 10% of the average of the test data values.

In addition, the XGBoost model makes it possible to extract the importance of each feature in predicting the target values, and it is informative to see the importance the model assigns to the different features in this case. We note that feature importances may vary significantly during optimization, which makes it difficult to draw meaningful conclusions from a single trial. To obtain more reliable results, the optimization and training (fit) cycle was performed ten times, and the feature importances provided here are the averages over these ten trials. We also note that there were no significant trial-to-trial variations in the predicted values or the prediction error, which further supports the accuracy of the predictions.
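The averaging step can be sketched as below. For brevity the example uses two trials with made-up importance values instead of ten, and the function name is hypothetical; in the actual workflow each dictionary would hold the importances reported by one fitted model.

```python
def average_importances(trials):
    """Average each feature's importance over all trials, then rescale to sum to one."""
    features = list(trials[0])
    mean = {f: sum(t[f] for t in trials) / len(trials) for f in features}
    total = sum(mean.values())
    return {f: value / total for f, value in mean.items()}


# Made-up importances from two hypothetical optimization and fit cycles.
trials = [
    {"GDP per capita": 0.50, "Age 65-": 0.30, "Population density": 0.20},
    {"GDP per capita": 0.40, "Age 65-": 0.40, "Population density": 0.20},
]
avg = average_importances(trials)
for feature, score in sorted(avg.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {score:.2f}")
```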

The feature importances in predicting the HAQ Index are shown in Fig. 12.

**Fig. 12: Feature importance in predicting HAQ Index.**
**The sum of all importance scores is one.**

The plot reveals that the model places very high importance on GDP per capita and the age demographics, which is consistent with the degree of correlation of these features with the HAQ Index shown earlier (Continent was not included when computing the correlation matrix).

The results from predicting the HAQ Index unambiguously show that using XGBoost with the features (factors) used in this study one can accurately predict the values of a strictly medically-derived feature. Thus, a high level of confidence in the model and its accuracy has been established.

Main Goal: Predicting COVID-19 Deaths per 100K

The same ML algorithm and procedure are used in predicting unknown Deaths per 100K values. The only difference here is that the HAQ Index is added as a known feature, i.e. the information about the HAQ Index together with all other features is used in making the predictions.

The result of the predictions for the number of COVID-19 deaths per 100K is shown in Fig. 13.

**Fig. 13: Model predictions of unknown Deaths per 100K values**

In contrast with the HAQ Index predictions, there is a significant number of predicted Deaths per 100K values which deviate dramatically from the true values. This, in turn, results in a large prediction error of 134% of the test data average.

Although the model accuracy is poor, it is still informative to find the feature importance according to the model. The result is presented in Fig. 14.

**Fig. 14: Feature importance in predicting Deaths per 100K**

In this case, the model ranks Continent, GDP per capita, and HAQ Index as the features with the highest importance in predicting the Deaths per 100K by country. It is surprising, however, to see that the age demographics is assigned very little importance, which is not what one would expect.

Discussion of Results and Next Steps

In our study, we have demonstrated that by using XGBoost, which is currently one of the most powerful ML algorithms, together with well-established, universally accepted features (factors), one can accurately predict the values of a very specific medical index, the HAQ Index.

On the other hand, the predictions made for arguably the most reliable COVID-19 data, Deaths per 100K, using the same algorithm, methodology, and features are extremely inaccurate. In addition, we have observed several anomalous relationships of the number of COVID-19 deaths with key features, GDP per capita and HAQ Index in particular.

To understand these seemingly contradictory results, we need to consider the following factors.

Temporal Epidemic Profile

Perhaps the most important factor to consider in analyzing the COVID-19 data is the different temporal epidemic profile for different countries.

To clarify, we can roughly separate the epidemic progression in different countries into two categories:

Complete Epidemic Cycle (CEC) Profile: In this case, the first few deaths caused by the virus are observed at a certain date; as the epidemic progresses, the number of deaths per day increases until reaching a maximum, after which it slowly declines to levels equal to those at the beginning of the epidemic.
Incomplete Epidemic Cycle (IEC) Profile: This case can be separated into three different sub-categories.
- Early-stage IEC – the epidemic is in its beginning with only a few deaths per day (the numbers here are typically significantly smaller than the eventual maximum).
- Middle-stage IEC – the number of deaths per day has reached its peak or is close to it (can be on either side of the peak).
- Late-stage IEC – the number of deaths is significantly lower than the peak value, but is still larger than the numbers at the start of the epidemic.
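The categories above could be assigned from a daily-deaths series with a simple heuristic such as the one sketched below. This is only an illustration under stated assumptions: the function name, the 7-day smoothing window, and both thresholds are hypothetical choices, not values used in the study. The heuristic also cannot separate early-stage from middle-stage IEC, because the eventual maximum is unknown while the curve is still rising.

```python
def classify_profile(daily_deaths, window=7, near_peak=0.5, back_to_baseline=0.1):
    """Label a daily-deaths series by comparing its latest level with its peak."""
    if len(daily_deaths) < window:
        raise ValueError("series is shorter than the smoothing window")
    # Rolling mean to damp day-to-day reporting noise.
    rolling = [
        sum(daily_deaths[i:i + window]) / window
        for i in range(len(daily_deaths) - window + 1)
    ]
    peak = max(rolling)
    if peak == 0:
        return "no recorded deaths"
    ratio = rolling[-1] / peak  # latest smoothed level relative to the peak
    if ratio >= near_peak:
        return "early- or middle-stage IEC"
    if ratio > back_to_baseline:
        return "late-stage IEC"
    return "CEC"


# Made-up series for three hypothetical countries.
complete = [0] * 7 + [1, 5, 20, 50, 80, 50, 20, 5, 1] + [0] * 14
rising = list(range(1, 31))
declining = list(range(1, 31)) + list(range(30, 9, -1))
for name, series in [("complete", complete), ("rising", rising), ("declining", declining)]:
    print(name, "->", classify_profile(series))
```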

The temporal profile of the COVID-19 spread in Spain is an example of a CEC profile as shown in Fig. 15.

**Fig. 15: Number of COVID-19 deaths per day: Spain, March 1-Aug. 17** (https://covid19.who.int/region/euro/country/es)

The three different stages of IEC are illustrated in Fig. 16.

**Fig. 16: COVID-19 deaths per day:**
**a) early-stage IEC – Paraguay** (https://covid19.who.int/region/amro/country/py);
**b) middle-stage IEC – Ethiopia** (https://covid19.who.int/region/afro/country/et);
**c) late-stage IEC – Egypt** (https://covid19.who.int/region/emro/country/eg)

From the graphs presented above, it is clear that depending on the temporal epidemic profile for a particular country the total number of deaths would vary significantly. In order to be able to evaluate the accuracy of a model’s predictions for Deaths per 100K and draw meaningful conclusions, ideally all countries would have CEC profiles or at least a combination of CEC and late-stage IEC profiles. In reality, however, this is not the case. We note that examining the temporal epidemic profiles for countries in Europe and Africa reveals that, as of the date of the COVID-19 data used in the study, the majority of European countries were in the CEC category, while most African countries were in different stages of IEC. In our opinion, this difference is perhaps the main reason for the prediction results for Deaths per 100K and the anomalous relationships observed.

We would like to note that there are more complex temporal epidemic profiles than the ones described above. In addition, at the time of this writing, many countries have started experiencing a new increase in COVID-19 cases and, eventually, deaths. However, we believe that the epidemic profiles adopted here are a good first approximation.

In order to correct for these differences, one possible approach is to project the total number of deaths for countries with IEC profiles by using, for example, the average CEC profile of their closest neighbors where applicable, or perhaps of a larger region of the continent, or even of the entire continent in case the number of countries with CEC profiles is small. This is something we intend to investigate next in our COVID-19 study.

Short-term vs. Long-term Response

As already discussed, we were able to accurately predict unknown HAQ Index values, yet failed to do so when predicting the number of COVID-19 deaths. In connection with this, we would like to point out a substantial difference in the nature of these two entities.

We note that the HAQ Index is a long-term factor, a result of years of sustained quality of healthcare in a country. The universal features used here for predicting both the HAQ Index and Deaths per 100K are also long-term in nature. On the other hand, the number of COVID-19 deaths is determined by a short-term, emergency response, which does not necessarily correlate well with long-term factors such as the HAQ Index and the other features used in the predictions. Thus, the different outcomes from modeling the HAQ Index and Deaths per 100K suggest the possibility that there might be other, short-term factors which dominate the outcome of the COVID-19 epidemic in a country. Perhaps countries which have been subjected to numerous dangerous viral outbreaks in the past have put in place a strong emergency response for such outbreaks, regardless of their GDP and the overall quality of healthcare.

Thus, it appears that one would need to investigate what measures are in place in countries with CEC profiles which have low GDP and HAQ scores, and yet have low numbers of COVID-19 deaths. Since such measures are difficult, if not impossible, to quantify, the best approach would be to assign a categorical value to these emergency measures for each country – for example Bad, Marginal, Good, Very Good, and Excellent (or numerically 1 through 5) – and incorporate these values in the model. In our opinion, if this can be done, it would definitely affect the outcome of the predictions and could give us meaningful insight into the most efficient ways to combat the COVID-19 epidemic. This is something which we plan to investigate as well moving forward.
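As a minimal sketch of the encoding suggested above, the five categories can be mapped to the integers 1 through 5 before being passed to the model. The country labels and their ratings are placeholders invented for the example; no such ratings have been assigned in this study.

```python
# Ordered scale from the text: Bad (1) through Excellent (5).
RESPONSE_SCALE = {"Bad": 1, "Marginal": 2, "Good": 3, "Very Good": 4, "Excellent": 5}


def encode_response(rating: str) -> int:
    """Map a categorical emergency-response rating to its numeric value."""
    return RESPONSE_SCALE[rating]


# Placeholder ratings for two hypothetical countries.
ratings = {"Country A": "Good", "Country B": "Excellent"}
encoded = {country: encode_response(r) for country, r in ratings.items()}
print(encoded)
```

An ordered integer encoding keeps the ranking of the categories, which a tree-based model can exploit through threshold splits.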

Medical Factors Exacerbating the COVID-19 Outcome

One could argue that certain medical conditions, such as a weak immune system, obesity, and severe respiratory problems, contribute to the increased mortality rate among people who have contracted the virus. Thus, perhaps all of these conditions need to be incorporated in the model in some fashion. However, we believe that the value of the HAQ Index reflects, if not all of these conditions, at least a large part of them. Because of this, at the moment, we do not intend to pursue this line of improving the model.

Accuracy of COVID-19 Data

Last, but not least, any data could possibly contain inaccurate observations or records. The COVID-19 data is no exception. Thus, by necessity, one should consider the accuracy of the COVID-19 data as reported by different countries.

Without any intention to single out particular countries, we would like to present an example of data which strongly suggest that the recorded numbers may be inaccurate.

The COVID-19 data for Peru is presented in Fig. 17 below.

**Fig. 17: COVID-19 data for Peru: a) confirmed daily cases; b) number of deaths per day** (https://covid19.who.int/region/amro/country/pe)

As observed in the bottom graph, there are two extreme data points in the number of deaths which are approximately two orders of magnitude higher than the rest of the recorded numbers. We note that the sum of these two numbers exceeds the sum of all other data points. These spikes in the number of deaths cannot be explained by an increase in the number of confirmed cases, since no spikes of similar magnitude are observed in the top graph in any of the preceding days. The one preceding spike in the confirmed cases is not nearly large enough to explain the number of deaths recorded on Aug. 15.

One could possibly attribute these two outliers in the recorded deaths per day to a severe backlog in record keeping. However, if such a severe backlog is possible, there is a very high likelihood that the death numbers presented here lack accuracy in general. We would like to note that there are other reports of (severe) backlogs in the recording of COVID-19 numbers in different countries, including the USA.
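A simple first check for records of this kind is to flag days whose death count is far above the typical daily value. The sketch below does this by comparing each day with a multiple of the series median. The function name, the factor of 10, and the series itself are assumptions made for illustration; the numbers are not Peru's actual figures.

```python
from statistics import median


def flag_spikes(daily_deaths, factor=10.0):
    """Return the indices of days whose count exceeds factor times the median."""
    baseline = median(daily_deaths)
    if baseline <= 0:
        return []  # no meaningful baseline to compare against
    return [i for i, value in enumerate(daily_deaths) if value > factor * baseline]


# Made-up daily deaths with two isolated spikes.
series = [180, 200, 190, 3900, 210, 195, 205, 4100, 185]
spikes = flag_spikes(series)
spike_share = sum(series[i] for i in spikes) / sum(series)
print(spikes, f"{100 * spike_share:.0f}% of all recorded deaths")
```

The median is used as the baseline because, unlike the mean, it is barely affected by the spikes themselves.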

Thus, it appears that in order to have confidence in the model performance and make meaningful conclusions, a separate model predicting the likelihood of data accuracy (from 0 to 100%) will need to be employed before making the more general predictions.

All of the factors mentioned above require additional research and additional work in order to be implemented in our analysis and model. Thus, it appears that this study is very likely to extend into several more parts before being completed in full.

To be continued …

Acknowledgments

I would like to thank Victoria Wong, Temenuzka Gesheva, Julian Geshev, and Stoytcho Stoytchev for the insightful discussions regarding the methodology and the results of this study. Their invaluable input made this work much better as a whole.