Live coding session taught by Dr Hugo Bowne-Anderson on April 10, 2020 via DataCamp.

Imports and data

Let's import the necessary packages from the SciPy stack and get the data.

In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Set style & figures inline
sns.set()
%matplotlib inline
In [2]:
# Data urls
base_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'
confirmed_cases_data_url = base_url + 'time_series_covid19_confirmed_global.csv'
death_cases_data_url = base_url + 'time_series_covid19_deaths_global.csv'
recovery_cases_data_url = base_url + 'time_series_covid19_recovered_global.csv'
# Import datasets as pandas dataframes
raw_data_confirmed = pd.read_csv(confirmed_cases_data_url)
raw_data_deaths = pd.read_csv(death_cases_data_url)
raw_data_recovered = pd.read_csv(recovery_cases_data_url)

Confirmed cases of COVID-19

We'll first check out the confirmed cases data by looking at the head of the dataframe:

In [3]:
raw_data_confirmed.head(n=10)
Out[3]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 3/31/20 4/1/20 4/2/20 4/3/20 4/4/20 4/5/20 4/6/20 4/7/20 4/8/20 4/9/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 174 237 273 281 299 349 367 423 444 484
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 243 259 277 304 333 361 377 383 400 409
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 716 847 986 1171 1251 1320 1423 1468 1572 1666
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 ... 376 390 428 439 466 501 525 545 564 583
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 ... 7 8 8 8 10 14 16 17 19 19
5 NaN Antigua and Barbuda 17.0608 -61.7964 0 0 0 0 0 0 ... 7 7 9 15 15 15 15 19 19 19
6 NaN Argentina -38.4161 -63.6167 0 0 0 0 0 0 ... 1054 1054 1133 1265 1451 1451 1554 1628 1715 1795
7 NaN Armenia 40.0691 45.0382 0 0 0 0 0 0 ... 532 571 663 736 770 822 833 853 881 921
8 Australian Capital Territory Australia -35.4735 149.0124 0 0 0 0 0 0 ... 80 84 87 91 93 96 96 96 99 100
9 New South Wales Australia -33.8688 151.2093 0 0 0 0 3 4 ... 2032 2182 2298 2389 2493 2580 2637 2686 2734 2773

10 rows × 83 columns

Discuss: What do you see here? We can also see a lot about the data by using the .info() and .describe() dataframe methods:

In [4]:
raw_data_confirmed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 83 columns):
Province/State    82 non-null object
Country/Region    263 non-null object
Lat               263 non-null float64
Long              263 non-null float64
1/22/20           263 non-null int64
1/23/20           263 non-null int64
1/24/20           263 non-null int64
1/25/20           263 non-null int64
1/26/20           263 non-null int64
1/27/20           263 non-null int64
1/28/20           263 non-null int64
1/29/20           263 non-null int64
1/30/20           263 non-null int64
1/31/20           263 non-null int64
2/1/20            263 non-null int64
2/2/20            263 non-null int64
2/3/20            263 non-null int64
2/4/20            263 non-null int64
2/5/20            263 non-null int64
2/6/20            263 non-null int64
2/7/20            263 non-null int64
2/8/20            263 non-null int64
2/9/20            263 non-null int64
2/10/20           263 non-null int64
2/11/20           263 non-null int64
2/12/20           263 non-null int64
2/13/20           263 non-null int64
2/14/20           263 non-null int64
2/15/20           263 non-null int64
2/16/20           263 non-null int64
2/17/20           263 non-null int64
2/18/20           263 non-null int64
2/19/20           263 non-null int64
2/20/20           263 non-null int64
2/21/20           263 non-null int64
2/22/20           263 non-null int64
2/23/20           263 non-null int64
2/24/20           263 non-null int64
2/25/20           263 non-null int64
2/26/20           263 non-null int64
2/27/20           263 non-null int64
2/28/20           263 non-null int64
2/29/20           263 non-null int64
3/1/20            263 non-null int64
3/2/20            263 non-null int64
3/3/20            263 non-null int64
3/4/20            263 non-null int64
3/5/20            263 non-null int64
3/6/20            263 non-null int64
3/7/20            263 non-null int64
3/8/20            263 non-null int64
3/9/20            263 non-null int64
3/10/20           263 non-null int64
3/11/20           263 non-null int64
3/12/20           263 non-null int64
3/13/20           263 non-null int64
3/14/20           263 non-null int64
3/15/20           263 non-null int64
3/16/20           263 non-null int64
3/17/20           263 non-null int64
3/18/20           263 non-null int64
3/19/20           263 non-null int64
3/20/20           263 non-null int64
3/21/20           263 non-null int64
3/22/20           263 non-null int64
3/23/20           263 non-null int64
3/24/20           263 non-null int64
3/25/20           263 non-null int64
3/26/20           263 non-null int64
3/27/20           263 non-null int64
3/28/20           263 non-null int64
3/29/20           263 non-null int64
3/30/20           263 non-null int64
3/31/20           263 non-null int64
4/1/20            263 non-null int64
4/2/20            263 non-null int64
4/3/20            263 non-null int64
4/4/20            263 non-null int64
4/5/20            263 non-null int64
4/6/20            263 non-null int64
4/7/20            263 non-null int64
4/8/20            263 non-null int64
4/9/20            263 non-null int64
dtypes: float64(2), int64(79), object(2)
memory usage: 170.7+ KB
In [5]:
raw_data_confirmed.describe()
Out[5]:
Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 ... 3/31/20 4/1/20 4/2/20 4/3/20 4/4/20 4/5/20 4/6/20 4/7/20 4/8/20 4/9/20
count 263.000000 263.000000 263.000000 263.000000 263.000000 263.000000 263.000000 263.000000 263.000000 263.000000 ... 263.000000 263.000000 263.000000 263.000000 263.000000 263.000000 263.000000 263.000000 263.000000 263.000000
mean 21.339244 22.068133 2.110266 2.486692 3.577947 5.452471 8.053232 11.129278 21.209125 23.444867 ... 3260.406844 3546.026616 3853.482890 4166.984791 4552.882129 4836.939163 5114.452471 5422.418251 5745.642586 6065.969582
std 24.779585 70.785949 27.434015 27.532888 34.275498 47.702207 66.662110 89.815834 220.427512 221.769901 ... 16274.718201 17892.269613 19747.178551 21707.026686 23984.073766 25717.561274 27517.452168 29418.401918 31466.358777 33481.088534
min -51.796300 -135.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 6.938500 -21.031300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 15.000000 17.000000 19.500000 20.500000 21.000000 22.000000 24.000000 27.000000 29.500000 30.500000
50% 23.634500 20.168300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 143.000000 168.000000 176.000000 184.000000 195.000000 214.000000 226.000000 237.000000 248.000000 255.000000
75% 41.178850 79.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 715.000000 780.000000 881.000000 949.000000 983.500000 1020.000000 1068.500000 1135.500000 1193.500000 1235.500000
max 71.706900 178.065000 444.000000 444.000000 549.000000 761.000000 1058.000000 1423.000000 3554.000000 3554.000000 ... 188172.000000 213372.000000 243762.000000 275586.000000 308853.000000 337072.000000 366667.000000 396223.000000 429052.000000 461437.000000

8 rows × 81 columns

Number of confirmed cases by country

Look at the head (or tail) of our dataframe again and notice that each row is the data for a particular province or state of a given country:

In [6]:
raw_data_confirmed.head()
Out[6]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 3/31/20 4/1/20 4/2/20 4/3/20 4/4/20 4/5/20 4/6/20 4/7/20 4/8/20 4/9/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 174 237 273 281 299 349 367 423 444 484
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 243 259 277 304 333 361 377 383 400 409
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 716 847 986 1171 1251 1320 1423 1468 1572 1666
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 ... 376 390 428 439 466 501 525 545 564 583
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 ... 7 8 8 8 10 14 16 17 19 19

5 rows × 83 columns

We want the numbers for each country, though. So the way to think about this is, for each country, we want to take all the rows (regions/provinces) that correspond to that country and add up the numbers for each. To put this in data-analytic-speak, we want to group by the country column and sum up all the values for the other columns.

This is a common pattern in data analysis that we humans have been using for centuries. Interestingly, it was only formalized in 2011 by Hadley Wickham in his seminal paper The Split-Apply-Combine Strategy for Data Analysis. The pattern we're discussing is now called Split-Apply-Combine and, in the case at hand, we

  • Split the data into new datasets for each country,
  • Apply the function of "sum" for each new dataset (that is, we add/sum up the values for each column) to sum over territories/provinces/states for each country, and
  • Combine these datasets into a new dataframe.

The pandas API has the groupby method, which allows us to do this.

Side note: For more on split-apply-combine and pandas check out my post here.
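
As a toy illustration of the split-apply-combine pattern with pandas groupby (hypothetical mini-data, not the JHU dataset):

```python
import pandas as pd

# Hypothetical mini-dataset: country A is split across two provinces
df = pd.DataFrame({
    'Country/Region': ['A', 'A', 'B'],
    'Province/State': ['A1', 'A2', None],
    '4/8/20': [10, 5, 7],
    '4/9/20': [12, 6, 9],
})

# Split by country, apply "sum" to each date column, combine into one frame
totals = df.groupby('Country/Region')[['4/8/20', '4/9/20']].sum()
print(totals.loc['A', '4/9/20'])  # 18
```

Country A's provinces (12 + 6) collapse into a single row, which is exactly what we want to do to the confirmed-cases dataframe below.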

In [7]:
# Group by region (also drop 'Lat', 'Long' as it doesn't make sense to sum them here)
confirmed_country = raw_data_confirmed.groupby(by=['Country/Region']).sum().drop(['Lat', 'Long'], axis=1)
confirmed_country.head()
Out[7]:
1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 ... 3/31/20 4/1/20 4/2/20 4/3/20 4/4/20 4/5/20 4/6/20 4/7/20 4/8/20 4/9/20
Country/Region
Afghanistan 0 0 0 0 0 0 0 0 0 0 ... 174 237 273 281 299 349 367 423 444 484
Albania 0 0 0 0 0 0 0 0 0 0 ... 243 259 277 304 333 361 377 383 400 409
Algeria 0 0 0 0 0 0 0 0 0 0 ... 716 847 986 1171 1251 1320 1423 1468 1572 1666
Andorra 0 0 0 0 0 0 0 0 0 0 ... 376 390 428 439 466 501 525 545 564 583
Angola 0 0 0 0 0 0 0 0 0 0 ... 7 8 8 8 10 14 16 17 19 19

5 rows × 79 columns

So each row of our new dataframe confirmed_country is a time series of the number of confirmed cases for each country. Cool! Now, a dataframe has an associated object called an Index, which is essentially a set of unique identifiers for each row. Let's check out the index of confirmed_country:

In [8]:
confirmed_country.index
Out[8]:
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
       'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Western Sahara',
       'Zambia', 'Zimbabwe'],
      dtype='object', name='Country/Region', length=184)

It's indexed by Country/Region. That's fine, but if we index by date instead, we'll be able to produce some visualizations almost immediately. This is a nice aspect of the pandas API: you can make basic visualizations directly from dataframes and, if the index consists of DateTimes, pandas knows you're plotting time series and plays nicely with them. To make the dates the index, notice that the column names are currently the dates, so we essentially want to make the columns the rows (and the rows the columns). This corresponds to taking the transpose of the dataframe:

In [9]:
confirmed_country = confirmed_country.transpose()
confirmed_country.head()
Out[9]:
Country/Region Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria ... United Arab Emirates United Kingdom Uruguay Uzbekistan Venezuela Vietnam West Bank and Gaza Western Sahara Zambia Zimbabwe
1/22/20 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1/23/20 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 2 0 0 0 0
1/24/20 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 2 0 0 0 0
1/25/20 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 2 0 0 0 0
1/26/20 0 0 0 0 0 0 0 0 4 0 ... 0 0 0 0 0 2 0 0 0 0

5 rows × 184 columns
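
The effect of transpose can be seen on a toy frame (hypothetical values) shaped like confirmed_country before transposing:

```python
import pandas as pd

# Rows are countries, columns are dates, as in confirmed_country pre-transpose
toy = pd.DataFrame({'1/22/20': [0, 1], '1/23/20': [2, 3]},
                   index=['CountryA', 'CountryB'])

# Transposing swaps rows and columns: the dates become the index
toy_t = toy.T
print(toy_t.loc['1/23/20', 'CountryA'])  # 2
```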

Let's have a look at our index to see whether it actually consists of DateTimes:

In [10]:
confirmed_country.index
Out[10]:
Index(['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       '1/28/20', '1/29/20', '1/30/20', '1/31/20', '2/1/20', '2/2/20',
       '2/3/20', '2/4/20', '2/5/20', '2/6/20', '2/7/20', '2/8/20', '2/9/20',
       '2/10/20', '2/11/20', '2/12/20', '2/13/20', '2/14/20', '2/15/20',
       '2/16/20', '2/17/20', '2/18/20', '2/19/20', '2/20/20', '2/21/20',
       '2/22/20', '2/23/20', '2/24/20', '2/25/20', '2/26/20', '2/27/20',
       '2/28/20', '2/29/20', '3/1/20', '3/2/20', '3/3/20', '3/4/20', '3/5/20',
       '3/6/20', '3/7/20', '3/8/20', '3/9/20', '3/10/20', '3/11/20', '3/12/20',
       '3/13/20', '3/14/20', '3/15/20', '3/16/20', '3/17/20', '3/18/20',
       '3/19/20', '3/20/20', '3/21/20', '3/22/20', '3/23/20', '3/24/20',
       '3/25/20', '3/26/20', '3/27/20', '3/28/20', '3/29/20', '3/30/20',
       '3/31/20', '4/1/20', '4/2/20', '4/3/20', '4/4/20', '4/5/20', '4/6/20',
       '4/7/20', '4/8/20', '4/9/20'],
      dtype='object')

Note that dtype='object', which means that these are strings, not DateTimes. We can use pandas to turn the index into a DateTimeIndex:

In [11]:
# Set index as DateTimeIndex (assigning to .index, since set_index returns a
# new dataframe rather than modifying confirmed_country in place)
confirmed_country.index = pd.DatetimeIndex(confirmed_country.index)
# Confirm the index is now datetime-typed
confirmed_country.index.dtype
Out[11]:
dtype('<M8[ns]')
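
As a side check, pd.to_datetime parses these m/d/yy strings to the same dates (a minimal sketch):

```python
import pandas as pd

# pd.to_datetime is an alternative route to a DatetimeIndex
idx = pd.to_datetime(['1/22/20', '1/23/20', '4/9/20'])
print(idx[0])  # 2020-01-22 00:00:00
```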

Now that we have a DateTimeIndex and countries for columns, we can use the dataframe plotting method to visualize the time series of confirmed cases by country. As there are so many countries, we'll plot a subset of them:

Plotting confirmed cases by country

In [12]:
# Plot time series of several countries of interest
poi = ['China', 'US', 'Italy', 'France', 'Spain', 'Australia']
confirmed_country[poi].plot(figsize=(20, 10), linewidth=5, colormap='brg', fontsize=20)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cfe5630160>

Let's label our axes and give the figure a title. We'll also thin the line and add points for the data so that the sampling is evident in our plots:

In [13]:
# Plot time series of several countries of interest
confirmed_country[poi].plot(figsize=(20,10), linewidth=2, marker='.', colormap='brg', fontsize=20)
plt.xlabel('Date', fontsize=20);
plt.ylabel('Reported Confirmed cases count', fontsize=20);
plt.title('Reported Confirmed Cases Time Series', fontsize=20);

Let's do this again but make the y-axis logarithmic:

In [14]:
# Plot time series of several countries of interest
confirmed_country[poi].plot(figsize=(20,10), linewidth=2, marker='.', fontsize=20, logy=True)
plt.xlabel('Date', fontsize=20);
plt.ylabel('Reported Confirmed cases count', fontsize=20);
plt.title('Reported Confirmed Cases Time Series', fontsize=20);

Discuss: Why do we plot with a log y-axis? How do we interpret the log plot? Key points:

  • If a variable takes on values over several orders of magnitude (e.g. in the 10s, 100s, and 1000s), we use a log axis so that the data is not all crammed into a small region of the visualization.
  • If a curve is approximately linear on a log axis, then it's growing approximately exponentially, and the gradient/slope of the line tells us the growth rate (the exponent).

ESSENTIAL POINT: A logarithmic scale is good for visualization BUT remember, in the thoughtful words of Justin Bois, "on the ground, in the hospitals, we live with the linear scale. The flattening of the US curve, for example, is more evident on the log scale, but the growth is still rapid on a linear scale, which is what we feel."
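
To make the second bullet concrete: for a hypothetical exponentially growing series, the successive differences of log(y) are constant and equal the log of the daily growth factor, which is why the curve is a straight line on a log axis:

```python
import numpy as np

# Hypothetical case counts growing 30% per day: y = 100 * 1.3**t
t = np.arange(30)
y = 100 * 1.3 ** t

# On a log scale this is a straight line: log(y) = log(100) + t * log(1.3),
# so the day-to-day slope of log(y) is the constant log(1.3)
slopes = np.diff(np.log(y))
print(np.allclose(slopes, np.log(1.3)))  # True
```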

Summary: We've

  • looked at the JHU data repository and imported the data,
  • looked at the dataset containing the number of reported confirmed cases for each region,
  • wrangled the data to look at the number of reported confirmed cases by country,
  • plotted the number of reported confirmed cases by country (on both linear and semi-log axes),
  • discussed why log plots are important for visualization and that we need to remember that we, as humans, families, communities, and society, experience COVID-19 linearly.

Number of reported deaths

As we did above for raw_data_confirmed, let's check out the head and the info of the raw_data_deaths dataframe:

In [15]:
raw_data_deaths.head()
Out[15]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 3/31/20 4/1/20 4/2/20 4/3/20 4/4/20 4/5/20 4/6/20 4/7/20 4/8/20 4/9/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 4 4 6 6 7 7 11 14 14 15
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 15 15 16 17 20 20 21 22 22 23
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 44 58 86 105 130 152 173 193 205 235
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 ... 12 14 15 16 17 18 21 22 23 25
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 ... 2 2 2 2 2 2 2 2 2 2

5 rows × 83 columns

In [16]:
raw_data_deaths.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 83 columns):
Province/State    82 non-null object
Country/Region    263 non-null object
Lat               263 non-null float64
Long              263 non-null float64
1/22/20           263 non-null int64
1/23/20           263 non-null int64
1/24/20           263 non-null int64
1/25/20           263 non-null int64
1/26/20           263 non-null int64
1/27/20           263 non-null int64
1/28/20           263 non-null int64
1/29/20           263 non-null int64
1/30/20           263 non-null int64
1/31/20           263 non-null int64
2/1/20            263 non-null int64
2/2/20            263 non-null int64
2/3/20            263 non-null int64
2/4/20            263 non-null int64
2/5/20            263 non-null int64
2/6/20            263 non-null int64
2/7/20            263 non-null int64
2/8/20            263 non-null int64
2/9/20            263 non-null int64
2/10/20           263 non-null int64
2/11/20           263 non-null int64
2/12/20           263 non-null int64
2/13/20           263 non-null int64
2/14/20           263 non-null int64
2/15/20           263 non-null int64
2/16/20           263 non-null int64
2/17/20           263 non-null int64
2/18/20           263 non-null int64
2/19/20           263 non-null int64
2/20/20           263 non-null int64
2/21/20           263 non-null int64
2/22/20           263 non-null int64
2/23/20           263 non-null int64
2/24/20           263 non-null int64
2/25/20           263 non-null int64
2/26/20           263 non-null int64
2/27/20           263 non-null int64
2/28/20           263 non-null int64
2/29/20           263 non-null int64
3/1/20            263 non-null int64
3/2/20            263 non-null int64
3/3/20            263 non-null int64
3/4/20            263 non-null int64
3/5/20            263 non-null int64
3/6/20            263 non-null int64
3/7/20            263 non-null int64
3/8/20            263 non-null int64
3/9/20            263 non-null int64
3/10/20           263 non-null int64
3/11/20           263 non-null int64
3/12/20           263 non-null int64
3/13/20           263 non-null int64
3/14/20           263 non-null int64
3/15/20           263 non-null int64
3/16/20           263 non-null int64
3/17/20           263 non-null int64
3/18/20           263 non-null int64
3/19/20           263 non-null int64
3/20/20           263 non-null int64
3/21/20           263 non-null int64
3/22/20           263 non-null int64
3/23/20           263 non-null int64
3/24/20           263 non-null int64
3/25/20           263 non-null int64
3/26/20           263 non-null int64
3/27/20           263 non-null int64
3/28/20           263 non-null int64
3/29/20           263 non-null int64
3/30/20           263 non-null int64
3/31/20           263 non-null int64
4/1/20            263 non-null int64
4/2/20            263 non-null int64
4/3/20            263 non-null int64
4/4/20            263 non-null int64
4/5/20            263 non-null int64
4/6/20            263 non-null int64
4/7/20            263 non-null int64
4/8/20            263 non-null int64
4/9/20            263 non-null int64
dtypes: float64(2), int64(79), object(2)
memory usage: 170.7+ KB

It seems to be structured similarly to raw_data_confirmed. I have checked it out in detail and can confirm that it is! This is good data design, as it means that users like us can explore, munge, and visualize it in a fashion analogous to the above. Can you remember what we did? We

  • Split-Apply-Combined it (and dropped 'Lat'/'Long'),
  • Transposed it,
  • Made the index a DateTimeIndex, and
  • Visualized it (linear and semi-log).

Let's now do the first three steps here for raw_data_deaths and see how we go:

Number of reported deaths by country

In [17]:
# Split-Apply-Combine
deaths_country = raw_data_deaths.groupby(by=['Country/Region']).sum().drop(['Lat', 'Long'], axis=1)

# Transpose
deaths_country = deaths_country.transpose()

# Set index as DateTimeIndex (assigning to .index, since set_index returns a
# new dataframe rather than modifying deaths_country in place)
deaths_country.index = pd.DatetimeIndex(deaths_country.index)

# Check out head
deaths_country.head()
Out[17]:
Country/Region Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria ... United Arab Emirates United Kingdom Uruguay Uzbekistan Venezuela Vietnam West Bank and Gaza Western Sahara Zambia Zimbabwe
2020-01-22 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2020-01-23 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2020-01-24 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2020-01-25 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2020-01-26 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 184 columns

In [18]:
# Confirm the index is now datetime-typed
deaths_country.index.dtype
Out[18]:
dtype('<M8[ns]')

Plotting number of reported deaths by country

Let's now visualize the number of reported deaths:

In [19]:
# Plot time series of several countries of interest
deaths_country[poi].plot(figsize=(20, 10), linewidth=2, marker='.', colormap='brg', fontsize=20)
plt.xlabel('Date', fontsize=20);
plt.ylabel('Number of Reported Deaths', fontsize=20);
plt.title('Reported Deaths Time Series', fontsize=20);

Now on a semi-log plot:

In [20]:
# Plot time series of several countries of interest
deaths_country[poi].plot(figsize=(20,10), linewidth=2, marker='.', fontsize=20, colormap='brg', logy=True)
plt.xlabel('Date', fontsize=20);
plt.ylabel('Number of Reported Deaths', fontsize=20);
plt.title('Reported Deaths Time Series', fontsize=20);

Aligning growth curves to start on the day the number of known deaths reached 25

To compare what's happening in different countries, we can align each country's growth curve to start on the day its number of known deaths first reached 25, as in the first figure here. To achieve this, let's first set all values less than 25 to NaN so that the associated data points don't get plotted at all when we visualize the data:

In [21]:
# Loop over columns & set values < 25 to None
for col in deaths_country.columns:
    deaths_country.loc[(deaths_country[col] < 25), col] = None

# Check out tail
deaths_country.tail()
Out[21]:
Country/Region Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria ... United Arab Emirates United Kingdom Uruguay Uzbekistan Venezuela Vietnam West Bank and Gaza Western Sahara Zambia Zimbabwe
2020-04-05 NaN NaN 152.0 NaN NaN NaN 44.0 NaN 35.0 204.0 ... NaN 4943.0 NaN NaN NaN NaN NaN NaN NaN NaN
2020-04-06 NaN NaN 173.0 NaN NaN NaN 48.0 NaN 40.0 220.0 ... NaN 5385.0 NaN NaN NaN NaN NaN NaN NaN NaN
2020-04-07 NaN NaN 193.0 NaN NaN NaN 56.0 NaN 45.0 243.0 ... NaN 6171.0 NaN NaN NaN NaN NaN NaN NaN NaN
2020-04-08 NaN NaN 205.0 NaN NaN NaN 63.0 NaN 50.0 273.0 ... NaN 7111.0 NaN NaN NaN NaN NaN NaN NaN NaN
2020-04-09 NaN NaN 235.0 25.0 NaN NaN 72.0 NaN 51.0 295.0 ... NaN 7993.0 NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 184 columns
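
Side note: the same masking can be done without an explicit loop via DataFrame.where, which keeps values where the condition holds and inserts NaN elsewhere (a sketch on toy data):

```python
import pandas as pd

toy = pd.DataFrame({'X': [10, 30, 50], 'Y': [5, 20, 40]})

# Keep values >= 25; everything else becomes NaN
masked = toy.where(toy >= 25)
print(masked['X'].tolist())  # [nan, 30.0, 50.0]
```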

Now let's plot as above to make sure we see what we think we should see:

In [22]:
# Plot time series of several countries of interest
poi = ['China', 'US', 'Italy', 'France', 'Australia']
deaths_country[poi].plot(figsize=(20,10), linewidth=2, marker='.', colormap='brg', fontsize=20)
plt.xlabel('Date', fontsize=20);
plt.ylabel('Number of Reported Deaths', fontsize=20);
plt.title('Reported Deaths Time Series', fontsize=20);

Countries that have seen fewer than 25 total deaths will now have columns that are all NaNs, so let's drop those columns and then see how many we have left:

In [23]:
# Drop columns that are all NaNs (i.e. countries that haven't yet reached 25 deaths)
deaths_country.dropna(axis=1, how='all', inplace=True)
deaths_country.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 79 entries, 2020-01-22 to 2020-04-09
Data columns (total 60 columns):
Algeria                   15 non-null float64
Andorra                   1 non-null float64
Argentina                 10 non-null float64
Australia                 7 non-null float64
Austria                   17 non-null float64
Belgium                   21 non-null float64
Bosnia and Herzegovina    4 non-null float64
Brazil                    19 non-null float64
Canada                    18 non-null float64
Chile                     6 non-null float64
China                     77 non-null float64
Colombia                  7 non-null float64
Czechia                   10 non-null float64
Denmark                   17 non-null float64
Dominican Republic        13 non-null float64
Ecuador                   17 non-null float64
Egypt                     14 non-null float64
Finland                   6 non-null float64
France                    31 non-null float64
Germany                   23 non-null float64
Greece                    15 non-null float64
Hungary                   7 non-null float64
India                     12 non-null float64
Indonesia                 22 non-null float64
Iran                      43 non-null float64
Iraq                      17 non-null float64
Ireland                   13 non-null float64
Israel                    9 non-null float64
Italy                     41 non-null float64
Japan                     25 non-null float64
Korea, South              39 non-null float64
Luxembourg                9 non-null float64
Malaysia                  14 non-null float64
Mexico                    10 non-null float64
Moldova                   2 non-null float64
Morocco                   13 non-null float64
Netherlands               24 non-null float64
North Macedonia           3 non-null float64
Norway                    12 non-null float64
Pakistan                  10 non-null float64
Panama                    10 non-null float64
Peru                      10 non-null float64
Philippines               19 non-null float64
Poland                    11 non-null float64
Portugal                  17 non-null float64
Romania                   14 non-null float64
Russia                    8 non-null float64
San Marino                11 non-null float64
Saudi Arabia              7 non-null float64
Serbia                    9 non-null float64
Slovenia                  5 non-null float64
Spain                     32 non-null float64
Sweden                    18 non-null float64
Switzerland               24 non-null float64
Thailand                  4 non-null float64
Tunisia                   1 non-null float64
Turkey                    19 non-null float64
US                        31 non-null float64
Ukraine                   7 non-null float64
United Kingdom            25 non-null float64
dtypes: float64(60)
memory usage: 37.6+ KB
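
The how='all' behavior can be verified on a toy frame: a column is dropped only when every entry in it is NaN:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'kept': [1.0, np.nan], 'dropped': [np.nan, np.nan]})

# how='all' drops only fully-NaN columns; how='any' would drop both here
toy = toy.dropna(axis=1, how='all')
print(list(toy.columns))  # ['kept']
```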

As we're going to align the countries from the day they first had at least 25 deaths, we won't need the DateTimeIndex. In fact, we won't need the date at all. So we can

  • Reset the index, which gives us an ordinal index and turns the dates into a regular column, and
  • Drop the date column (which will be called 'index' after the reset).
In [24]:
# Reset the index and drop the old date column
deaths_country_drop = deaths_country.reset_index().drop(['index'], axis=1)
deaths_country_drop.head()
Out[24]:
Country/Region Algeria Andorra Argentina Australia Austria Belgium Bosnia and Herzegovina Brazil Canada Chile ... Slovenia Spain Sweden Switzerland Thailand Tunisia Turkey US Ukraine United Kingdom
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 60 columns

Now it's time to shift each column up so that its first entry is the first non-NaN value it contains! To do this, we can use the shift() method on each column. How much do we shift each column, though? The size of the shift is the number of NaNs at the start of the column, which is exactly what the column's first_valid_index() method returns (after the reset, the index is ordinal, so the label is the position). We want to shift up, which is the negative direction by convention (and perhaps intuition). So let's do it:
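
On a toy column, here is how first_valid_index and a negative shift fit together:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 25.0, 31.0, 40.0])

# Two leading NaNs, so the first valid label (== position, on an ordinal index) is 2
print(s.first_valid_index())  # 2

# Shifting by -2 moves the first valid value to the top; NaNs fill in below
aligned = s.shift(-s.first_valid_index())
print(aligned.iloc[0])  # 25.0
```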

In [25]:
# shift
for col in deaths_country_drop.columns:
    deaths_country_drop[col] = deaths_country_drop[col].shift(-deaths_country_drop[col].first_valid_index())
# check out head
deaths_country_drop.head()
Out[25]:
Country/Region Algeria Andorra Argentina Australia Austria Belgium Bosnia and Herzegovina Brazil Canada Chile ... Slovenia Spain Sweden Switzerland Thailand Tunisia Turkey US Ukraine United Kingdom
0 25.0 25.0 27.0 28.0 28.0 37.0 29.0 25.0 25.0 27.0 ... 28.0 28.0 25.0 27.0 26.0 25.0 30.0 28.0 27.0 56.0
1 26.0 NaN 28.0 30.0 30.0 67.0 33.0 34.0 26.0 34.0 ... 30.0 35.0 36.0 28.0 27.0 NaN 37.0 36.0 32.0 56.0
2 29.0 NaN 36.0 35.0 49.0 75.0 34.0 46.0 30.0 37.0 ... 36.0 54.0 62.0 41.0 30.0 NaN 44.0 40.0 37.0 72.0
3 31.0 NaN 39.0 40.0 58.0 88.0 35.0 59.0 38.0 43.0 ... 40.0 55.0 77.0 54.0 32.0 NaN 59.0 47.0 38.0 138.0
4 35.0 NaN 43.0 45.0 68.0 122.0 NaN 77.0 54.0 48.0 ... 43.0 133.0 105.0 75.0 NaN NaN 75.0 54.0 45.0 178.0

5 rows × 60 columns

Side note: instead of looping over columns, we could have applied a lambda function to the columns of the dataframe, as follows:

In [26]:
# shift using lambda function
# deaths_country_drop = deaths_country_drop.apply(lambda x: x.shift(-x.first_valid_index()))

Now we get to plot our time series, first with linear axes, then semi-log:

In [27]:
# Plot time series 
ax = deaths_country_drop.plot(figsize=(20,10), linewidth=2, marker='.', fontsize=20)
ax.legend(ncol=3, loc='upper right')
plt.xlabel('Days', fontsize=20);
plt.ylabel('Number of Reported Deaths', fontsize=20);
plt.title('Total reported coronavirus deaths for places with at least 25 deaths', fontsize=20);
In [28]:
# Plot semi log time series 
ax = deaths_country_drop.plot(figsize=(20,10), linewidth=2, marker='.', fontsize=20, logy=True)
ax.legend(ncol=3, loc='upper right')
plt.xlabel('Days', fontsize=20);
plt.ylabel('Number of Reported Deaths', fontsize=20);
plt.title('Total reported coronavirus deaths for places with at least 25 deaths', fontsize=20);