This is a notebook from the live coding session by Dr Hugo Bowne-Anderson on April 10, 2020 via DataCamp.

Imports and data¶

Let's import the necessary packages from the SciPy stack and get the data.

In [1]:

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Set style & figures inline
sns.set()
%matplotlib inline

In [2]:

# Data urls
base_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'
confirmed_cases_data_url = base_url + 'time_series_covid19_confirmed_global.csv'
death_cases_data_url = base_url + 'time_series_covid19_deaths_global.csv'
recovery_cases_data_url = base_url+ 'time_series_covid19_recovered_global.csv'
# Import datasets as pandas dataframes
raw_data_confirmed = pd.read_csv(confirmed_cases_data_url)
raw_data_deaths = pd.read_csv(death_cases_data_url)
raw_data_recovered = pd.read_csv(recovery_cases_data_url)

Confirmed cases of COVID-19¶

We'll first check out the confirmed cases data by looking at the head of the dataframe:

In [3]:

raw_data_confirmed.head(n=10)

Out[3]:

	Province/State	Country/Region	Lat	Long	1/26/20	1/27/20	...	3/31/20	4/1/20	4/2/20	4/3/20	4/4/20	4/5/20	4/6/20	4/7/20	4/8/20	4/9/20
0	NaN	Afghanistan	33.0000	65.0000	0	0	...	174	237	273	281	299	349	367	423	444	484
1	NaN	Albania	41.1533	20.1683	0	0	...	243	259	277	304	333	361	377	383	400	409
2	NaN	Algeria	28.0339	1.6596	0	0	...	716	847	986	1171	1251	1320	1423	1468	1572	1666
3	NaN	Andorra	42.5063	1.5218	0	0	...	376	390	428	439	466	501	525	545	564	583
4	NaN	Angola	-11.2027	17.8739	0	0	...	7	8	8	8	10	14	16	17	19	19
5	NaN	Antigua and Barbuda	17.0608	-61.7964	0	0	...	7	7	9	15	15	15	15	19	19	19
6	NaN	Argentina	-38.4161	-63.6167	0	0	...	1054	1054	1133	1265	1451	1451	1554	1628	1715	1795
7	NaN	Armenia	40.0691	45.0382	0	0	...	532	571	663	736	770	822	833	853	881	921
8	Australian Capital Territory	Australia	-35.4735	149.0124	0	0	...	80	84	87	91	93	96	96	96	99	100
9	New South Wales	Australia	-33.8688	151.2093	3	4	...	2032	2182	2298	2389	2493	2580	2637	2686	2734	2773

10 rows × 83 columns

Discuss: What do you see here? We can also see a lot about the data by using the .info() and .describe() dataframe methods:

In [4]:

raw_data_confirmed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 83 columns):
Province/State    82 non-null object
Country/Region    263 non-null object
Lat               263 non-null float64
Long              263 non-null float64
1/22/20           263 non-null int64
1/23/20           263 non-null int64
1/24/20           263 non-null int64
1/25/20           263 non-null int64
1/26/20           263 non-null int64
1/27/20           263 non-null int64
1/28/20           263 non-null int64
1/29/20           263 non-null int64
1/30/20           263 non-null int64
1/31/20           263 non-null int64
2/1/20            263 non-null int64
2/2/20            263 non-null int64
2/3/20            263 non-null int64
2/4/20            263 non-null int64
2/5/20            263 non-null int64
2/6/20            263 non-null int64
2/7/20            263 non-null int64
2/8/20            263 non-null int64
2/9/20            263 non-null int64
2/10/20           263 non-null int64
2/11/20           263 non-null int64
2/12/20           263 non-null int64
2/13/20           263 non-null int64
2/14/20           263 non-null int64
2/15/20           263 non-null int64
2/16/20           263 non-null int64
2/17/20           263 non-null int64
2/18/20           263 non-null int64
2/19/20           263 non-null int64
2/20/20           263 non-null int64
2/21/20           263 non-null int64
2/22/20           263 non-null int64
2/23/20           263 non-null int64
2/24/20           263 non-null int64
2/25/20           263 non-null int64
2/26/20           263 non-null int64
2/27/20           263 non-null int64
2/28/20           263 non-null int64
2/29/20           263 non-null int64
3/1/20            263 non-null int64
3/2/20            263 non-null int64
3/3/20            263 non-null int64
3/4/20            263 non-null int64
3/5/20            263 non-null int64
3/6/20            263 non-null int64
3/7/20            263 non-null int64
3/8/20            263 non-null int64
3/9/20            263 non-null int64
3/10/20           263 non-null int64
3/11/20           263 non-null int64
3/12/20           263 non-null int64
3/13/20           263 non-null int64
3/14/20           263 non-null int64
3/15/20           263 non-null int64
3/16/20           263 non-null int64
3/17/20           263 non-null int64
3/18/20           263 non-null int64
3/19/20           263 non-null int64
3/20/20           263 non-null int64
3/21/20           263 non-null int64
3/22/20           263 non-null int64
3/23/20           263 non-null int64
3/24/20           263 non-null int64
3/25/20           263 non-null int64
3/26/20           263 non-null int64
3/27/20           263 non-null int64
3/28/20           263 non-null int64
3/29/20           263 non-null int64
3/30/20           263 non-null int64
3/31/20           263 non-null int64
4/1/20            263 non-null int64
4/2/20            263 non-null int64
4/3/20            263 non-null int64
4/4/20            263 non-null int64
4/5/20            263 non-null int64
4/6/20            263 non-null int64
4/7/20            263 non-null int64
4/8/20            263 non-null int64
4/9/20            263 non-null int64
dtypes: float64(2), int64(79), object(2)
memory usage: 170.7+ KB

In [5]:

raw_data_confirmed.describe()

Out[5]:

	Lat	Long	1/22/20	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	1/28/20	1/29/20	...	3/31/20	4/1/20	4/2/20	4/3/20	4/4/20	4/5/20	4/6/20	4/7/20	4/8/20	4/9/20
count	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	...	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000	263.000000
mean	21.339244	22.068133	2.110266	2.486692	3.577947	5.452471	8.053232	11.129278	21.209125	23.444867	...	3260.406844	3546.026616	3853.482890	4166.984791	4552.882129	4836.939163	5114.452471	5422.418251	5745.642586	6065.969582
std	24.779585	70.785949	27.434015	27.532888	34.275498	47.702207	66.662110	89.815834	220.427512	221.769901	...	16274.718201	17892.269613	19747.178551	21707.026686	23984.073766	25717.561274	27517.452168	29418.401918	31466.358777	33481.088534
min	-51.796300	-135.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	6.938500	-21.031300	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	15.000000	17.000000	19.500000	20.500000	21.000000	22.000000	24.000000	27.000000	29.500000	30.500000
50%	23.634500	20.168300	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	143.000000	168.000000	176.000000	184.000000	195.000000	214.000000	226.000000	237.000000	248.000000	255.000000
75%	41.178850	79.500000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	715.000000	780.000000	881.000000	949.000000	983.500000	1020.000000	1068.500000	1135.500000	1193.500000	1235.500000
max	71.706900	178.065000	444.000000	444.000000	549.000000	761.000000	1058.000000	1423.000000	3554.000000	3554.000000	...	188172.000000	213372.000000	243762.000000	275586.000000	308853.000000	337072.000000	366667.000000	396223.000000	429052.000000	461437.000000

8 rows × 81 columns

Number of confirmed cases by country¶

Look at the head (or tail) of our dataframe again and notice that each row is the data for a particular province or state of a given country:

In [6]:

raw_data_confirmed.head()

Out[6]:

	Province/State	Country/Region	Lat	Long	...	3/31/20	4/1/20	4/2/20	4/3/20	4/4/20	4/5/20	4/6/20	4/7/20	4/8/20	4/9/20
0	NaN	Afghanistan	33.0000	65.0000	...	174	237	273	281	299	349	367	423	444	484
1	NaN	Albania	41.1533	20.1683	...	243	259	277	304	333	361	377	383	400	409
2	NaN	Algeria	28.0339	1.6596	...	716	847	986	1171	1251	1320	1423	1468	1572	1666
3	NaN	Andorra	42.5063	1.5218	...	376	390	428	439	466	501	525	545	564	583
4	NaN	Angola	-11.2027	17.8739	...	7	8	8	8	10	14	16	17	19	19

5 rows × 83 columns

We want the numbers for each country, though. So the way to think about this is, for each country, we want to take all the rows (regions/provinces) that correspond to that country and add up the numbers for each. To put this in data-analytic-speak, we want to group by the country column and sum up all the values for the other columns.

This is a common pattern in data analysis that we humans have been using for centuries. Interestingly, it was only formalized in 2011 by Hadley Wickham in his seminal paper The Split-Apply-Combine Strategy for Data Analysis. The pattern we're discussing is now called Split-Apply-Combine and, in the case at hand, we

Split the data into new datasets for each country,
Apply the function of "sum" for each new dataset (that is, we add/sum up the values for each column) to sum over territories/provinces/states for each country, and
Combine these datasets into a new dataframe.

The pandas API has the groupby method, which allows us to do this.

Side note: For more on split-apply-combine and pandas check out my post here.

In [7]:

# Group by region (also drop 'Lat', 'Long' as it doesn't make sense to sum them here)
confirmed_country = raw_data_confirmed.groupby(by=['Country/Region']).sum().drop(['Lat', 'Long'], axis=1)
confirmed_country.head()

Out[7]:

	1/22/20	1/23/20	1/24/20	1/25/20	1/26/20	1/27/20	1/28/20	1/29/20	1/30/20	1/31/20	...	3/31/20	4/1/20	4/2/20	4/3/20	4/4/20	4/5/20	4/6/20	4/7/20	4/8/20	4/9/20
Country/Region
Afghanistan	0	0	0	0	0	0	0	0	0	0	...	174	237	273	281	299	349	367	423	444	484
Albania	0	0	0	0	0	0	0	0	0	0	...	243	259	277	304	333	361	377	383	400	409
Algeria	0	0	0	0	0	0	0	0	0	0	...	716	847	986	1171	1251	1320	1423	1468	1572	1666
Andorra	0	0	0	0	0	0	0	0	0	0	...	376	390	428	439	466	501	525	545	564	583
Angola	0	0	0	0	0	0	0	0	0	0	...	7	8	8	8	10	14	16	17	19	19

5 rows × 79 columns

So each row of our new dataframe confirmed_country is a time series of the number of confirmed cases for each country. Cool! Now a dataframe has an associated object called an Index, which is essentially a set of unique indentifiers for each row. Let's check out the index of confirmed_country:

In [8]:

confirmed_country.index

Out[8]:

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'United Arab Emirates', 'United Kingdom', 'Uruguay', 'Uzbekistan',
       'Venezuela', 'Vietnam', 'West Bank and Gaza', 'Western Sahara',
       'Zambia', 'Zimbabwe'],
      dtype='object', name='Country/Region', length=184)

It's indexed by Country/Region. That's all good but if we index by date instead, it will allow us to produce some visualizations almost immediately. This is a nice aspect of the pandas API: you can make basic visualizations with it and, if your index consists of DateTimes, it knows that you're plotting time series and plays nicely with them. To make the index the set of dates, notice that the column names are the dates. To turn column names into the index, we essentially want to make the columns the rows (and the rows the columns). This corresponds to taking the transpose of the dataframe:

In [9]:

confirmed_country = confirmed_country.transpose()
confirmed_country.head()

Out[9]:

Country/Region	Australia	...	Vietnam
1/22/20	0	...	0
1/23/20	0	...	2
1/24/20	0	...	2
1/25/20	0	...	2
1/26/20	4	...	2

5 rows × 184 columns

Let's have a look at our index to see whether it actually consists of DateTimes:

In [10]:

confirmed_country.index

Out[10]:

Index(['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       '1/28/20', '1/29/20', '1/30/20', '1/31/20', '2/1/20', '2/2/20',
       '2/3/20', '2/4/20', '2/5/20', '2/6/20', '2/7/20', '2/8/20', '2/9/20',
       '2/10/20', '2/11/20', '2/12/20', '2/13/20', '2/14/20', '2/15/20',
       '2/16/20', '2/17/20', '2/18/20', '2/19/20', '2/20/20', '2/21/20',
       '2/22/20', '2/23/20', '2/24/20', '2/25/20', '2/26/20', '2/27/20',
       '2/28/20', '2/29/20', '3/1/20', '3/2/20', '3/3/20', '3/4/20', '3/5/20',
       '3/6/20', '3/7/20', '3/8/20', '3/9/20', '3/10/20', '3/11/20', '3/12/20',
       '3/13/20', '3/14/20', '3/15/20', '3/16/20', '3/17/20', '3/18/20',
       '3/19/20', '3/20/20', '3/21/20', '3/22/20', '3/23/20', '3/24/20',
       '3/25/20', '3/26/20', '3/27/20', '3/28/20', '3/29/20', '3/30/20',
       '3/31/20', '4/1/20', '4/2/20', '4/3/20', '4/4/20', '4/5/20', '4/6/20',
       '4/7/20', '4/8/20', '4/9/20'],
      dtype='object')

Note that dtype='object'which means that these are strings, not DateTimes. We can use pandas to turn it into a DateTimeIndex:

In [11]:

# Set index as DateTimeIndex
datetime_index = pd.DatetimeIndex(confirmed_country.index)
confirmed_country.set_index(datetime_index)
# Check out index
confirmed_country.index

Out[11]:

Index(['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       '1/28/20', '1/29/20', '1/30/20', '1/31/20', '2/1/20', '2/2/20',
       '2/3/20', '2/4/20', '2/5/20', '2/6/20', '2/7/20', '2/8/20', '2/9/20',
       '2/10/20', '2/11/20', '2/12/20', '2/13/20', '2/14/20', '2/15/20',
       '2/16/20', '2/17/20', '2/18/20', '2/19/20', '2/20/20', '2/21/20',
       '2/22/20', '2/23/20', '2/24/20', '2/25/20', '2/26/20', '2/27/20',
       '2/28/20', '2/29/20', '3/1/20', '3/2/20', '3/3/20', '3/4/20', '3/5/20',
       '3/6/20', '3/7/20', '3/8/20', '3/9/20', '3/10/20', '3/11/20', '3/12/20',
       '3/13/20', '3/14/20', '3/15/20', '3/16/20', '3/17/20', '3/18/20',
       '3/19/20', '3/20/20', '3/21/20', '3/22/20', '3/23/20', '3/24/20',
       '3/25/20', '3/26/20', '3/27/20', '3/28/20', '3/29/20', '3/30/20',
       '3/31/20', '4/1/20', '4/2/20', '4/3/20', '4/4/20', '4/5/20', '4/6/20',
       '4/7/20', '4/8/20', '4/9/20'],
      dtype='object')

Now we have a DateTimeIndex and Countries for columns, we can use the dataframe plotting method to visualize the time series of confirmed number of cases by country. As there are so many coutries, we'll plot a subset of them:

Plotting confirmed cases by country¶

In [12]:

# Plot time series of several countries of interest
poi = ['China', 'US', 'Italy', 'France', 'Spain', 'Australia']
confirmed_country[poi].plot(figsize=(20, 10), linewidth=5, colormap='brg', fontsize=20)

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x1cfe5630160>

Let's label our axes and give the figure a title. We'll also thin the line and add points for the data so that the sampling is evident in our plots:

In [13]:

# Plot time series of several countries of interest
confirmed_country[poi].plot(figsize=(20,10), linewidth=2, marker='.', colormap='brg', fontsize=20)
plt.xlabel('Date', fontsize=20);
plt.ylabel('Reported Confirmed cases count', fontsize=20);
plt.title('Reported Confirmed Cases Time Series', fontsize=20);

Let's do this again but make the y-axis logarithmic:

In [14]:

# Plot time series of several countries of interest
confirmed_country[poi].plot(figsize=(20,10), linewidth=2, marker='.', fontsize=20, logy=True)
plt.xlabel('Date', fontsize=20);
plt.ylabel('Reported Confirmed cases count', fontsize=20);
plt.title('Reported Confirmed Cases Time Series', fontsize=20);

Discuss: Why do we plot with a log y-axis? How do we interpret the log plot? Key points:

If a variable takes on values over several orders of magnitude (e.g. in the 10s, 100s, and 1000s), we use a log axis so that the data is not all crammed into a small region of the visualization.
If a curve is approximately linear on a log axis, then its approximately exponential growth and the gradient/slope of the line tells us about the exponent.

ESSENTIAL POINT: A logarithm scale is good for visualization BUT remember, in the thoughtful words of Justin Bois, "on the ground, in the hospitals, we live with the linear scale. The flattening of the US curve, for example is more evident on the log scale, but the growth is still rapid on a linear scale, which is what we feel."

Summary: We've

looked at the JHU data repository and imported the data,
looked at the dataset containing the number of reported confirmed cases for each region,
wrangled the data to look at the number of reported confirmed cases by country,
plotted the number of reported confirmed cases by country (both log and semi-log),
discussed why log plots are important for visualization and that we need to remember that we, as humans, families, communities, and society, experience COVID-19 linearly.

Number of reported deaths¶

As we did above for raw_data_confirmed, let's check out the head and the info of the raw_data_deaths dataframe:

In [15]:

raw_data_deaths.head()

Out[15]:

	Province/State	Country/Region	Lat	Long	...	3/31/20	4/1/20	4/2/20	4/3/20	4/4/20	4/5/20	4/6/20	4/7/20	4/8/20	4/9/20
0	NaN	Afghanistan	33.0000	65.0000	...	4	4	6	6	7	7	11	14	14	15
1	NaN	Albania	41.1533	20.1683	...	15	15	16	17	20	20	21	22	22	23
2	NaN	Algeria	28.0339	1.6596	...	44	58	86	105	130	152	173	193	205	235
3	NaN	Andorra	42.5063	1.5218	...	12	14	15	16	17	18	21	22	23	25
4	NaN	Angola	-11.2027	17.8739	...	2	2	2	2	2	2	2	2	2	2

5 rows × 83 columns

In [16]:

raw_data_deaths.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 83 columns):
Province/State    82 non-null object
Country/Region    263 non-null object
Lat               263 non-null float64
Long              263 non-null float64
1/22/20           263 non-null int64
1/23/20           263 non-null int64
1/24/20           263 non-null int64
1/25/20           263 non-null int64
1/26/20           263 non-null int64
1/27/20           263 non-null int64
1/28/20           263 non-null int64
1/29/20           263 non-null int64
1/30/20           263 non-null int64
1/31/20           263 non-null int64
2/1/20            263 non-null int64
2/2/20            263 non-null int64
2/3/20            263 non-null int64
2/4/20            263 non-null int64
2/5/20            263 non-null int64
2/6/20            263 non-null int64
2/7/20            263 non-null int64
2/8/20            263 non-null int64
2/9/20            263 non-null int64
2/10/20           263 non-null int64
2/11/20           263 non-null int64
2/12/20           263 non-null int64
2/13/20           263 non-null int64
2/14/20           263 non-null int64
2/15/20           263 non-null int64
2/16/20           263 non-null int64
2/17/20           263 non-null int64
2/18/20           263 non-null int64
2/19/20           263 non-null int64
2/20/20           263 non-null int64
2/21/20           263 non-null int64
2/22/20           263 non-null int64
2/23/20           263 non-null int64
2/24/20           263 non-null int64
2/25/20           263 non-null int64
2/26/20           263 non-null int64
2/27/20           263 non-null int64
2/28/20           263 non-null int64
2/29/20           263 non-null int64
3/1/20            263 non-null int64
3/2/20            263 non-null int64
3/3/20            263 non-null int64
3/4/20            263 non-null int64
3/5/20            263 non-null int64
3/6/20            263 non-null int64
3/7/20            263 non-null int64
3/8/20            263 non-null int64
3/9/20            263 non-null int64
3/10/20           263 non-null int64
3/11/20           263 non-null int64
3/12/20           263 non-null int64
3/13/20           263 non-null int64
3/14/20           263 non-null int64
3/15/20           263 non-null int64
3/16/20           263 non-null int64
3/17/20           263 non-null int64
3/18/20           263 non-null int64
3/19/20           263 non-null int64
3/20/20           263 non-null int64
3/21/20           263 non-null int64
3/22/20           263 non-null int64
3/23/20           263 non-null int64
3/24/20           263 non-null int64
3/25/20           263 non-null int64
3/26/20           263 non-null int64
3/27/20           263 non-null int64
3/28/20           263 non-null int64
3/29/20           263 non-null int64
3/30/20           263 non-null int64
3/31/20           263 non-null int64
4/1/20            263 non-null int64
4/2/20            263 non-null int64
4/3/20            263 non-null int64
4/4/20            263 non-null int64
4/5/20            263 non-null int64
4/6/20            263 non-null int64
4/7/20            263 non-null int64
4/8/20            263 non-null int64
4/9/20            263 non-null int64
dtypes: float64(2), int64(79), object(2)
memory usage: 170.7+ KB

It seems to be structured similarly to raw_data_confirmed. I have checked it out in detail and can confirm that it is! This is good data design as it means that users like can explore, munge, and visualize it in a fashion analogous to the above. Can you remember what we did? We

Split-Apply-Combined it (and dropped 'Lat'/'Long'),
Transposed it,
Made the index a DateTimeIndex, and
Visualized it (linear and semi-log).

Let's now do the first three steps here for raw_data_deaths and see how we go:

Number of reported deaths by country¶

In [17]:

# Split-Apply-Combine
deaths_country = raw_data_deaths.groupby(by=['Country/Region']).sum().drop(['Lat', 'Long'], axis=1)

# Transpose
deaths_country = deaths_country.transpose()

# Set index as DateTimeIndex
datetime_index = pd.DatetimeIndex(deaths_country.index)
deaths_country.set_index(datetime_index)

# Check out head
deaths_country.head()

Out[17]:

Country/Region	Afghanistan	Albania	Algeria	Andorra	Angola	Antigua and Barbuda	Argentina	Armenia	Australia	Austria	...	United Arab Emirates	United Kingdom	Uruguay	Uzbekistan	Venezuela	Vietnam	West Bank and Gaza	Western Sahara	Zambia	Zimbabwe
1/22/20	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1/23/20	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1/24/20	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1/25/20	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1/26/20	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 184 columns

In [18]:

# Check out the index
deaths_country.index

Out[18]:

Index(['1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       '1/28/20', '1/29/20', '1/30/20', '1/31/20', '2/1/20', '2/2/20',
       '2/3/20', '2/4/20', '2/5/20', '2/6/20', '2/7/20', '2/8/20', '2/9/20',
       '2/10/20', '2/11/20', '2/12/20', '2/13/20', '2/14/20', '2/15/20',
       '2/16/20', '2/17/20', '2/18/20', '2/19/20', '2/20/20', '2/21/20',
       '2/22/20', '2/23/20', '2/24/20', '2/25/20', '2/26/20', '2/27/20',
       '2/28/20', '2/29/20', '3/1/20', '3/2/20', '3/3/20', '3/4/20', '3/5/20',
       '3/6/20', '3/7/20', '3/8/20', '3/9/20', '3/10/20', '3/11/20', '3/12/20',
       '3/13/20', '3/14/20', '3/15/20', '3/16/20', '3/17/20', '3/18/20',
       '3/19/20', '3/20/20', '3/21/20', '3/22/20', '3/23/20', '3/24/20',
       '3/25/20', '3/26/20', '3/27/20', '3/28/20', '3/29/20', '3/30/20',
       '3/31/20', '4/1/20', '4/2/20', '4/3/20', '4/4/20', '4/5/20', '4/6/20',
       '4/7/20', '4/8/20', '4/9/20'],
      dtype='object')

Plotting number of reported deaths by country¶

Let's now visualize the number of reported deaths:

In [19]:

# Plot time series of several countries of interest
deaths_country[poi].plot(figsize=(20, 10), linewidth=2, marker='.', colormap='brg', fontsize=20)
plt.xlabel('Date', fontsize=20);
plt.ylabel('Number of Reported Deaths', fontsize=20);
plt.title('Reported Deaths Time Series', fontsize=20);

Now on a semi-log plot:

In [20]:

# Plot time series of several countries of interest
deaths_country[poi].plot(figsize=(20,10), linewidth=2, marker='.', fontsize=20, colormap='brg', logy=True)
plt.xlabel('Date', fontsize=20);
plt.ylabel('Number of Reported Deaths', fontsize=20);
plt.title('Reported Deaths Time Series', fontsize=20);

Aligning growth curves to start with day of number of known deaths ≥ 25¶

To compare what's happening in different countries, we can align each country's growth curves to all start on the day when the number of known deaths ≥ 25, such as reported in the first figure here. To achieve this, first off, let's set set all values less than 25 to NaN so that the associated data points don't get plotted at all when we visualize the data:

In [21]:

# Loop over columns & set values < 25 to None
for col in deaths_country.columns:
    deaths_country.loc[(deaths_country[col] < 25), col] = None

# Check out tail
deaths_country.tail()

Out[21]:

Country/Region	Afghanistan	Albania	Algeria	Andorra	Angola	Antigua and Barbuda	Argentina	Armenia	Australia	Austria	...	United Arab Emirates	United Kingdom	Uruguay	Uzbekistan	Venezuela	Vietnam	West Bank and Gaza	Western Sahara	Zambia	Zimbabwe
4/5/20	NaN	NaN	152.0	NaN	NaN	NaN	44.0	NaN	35.0	204.0	...	NaN	4943.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4/6/20	NaN	NaN	173.0	NaN	NaN	NaN	48.0	NaN	40.0	220.0	...	NaN	5385.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4/7/20	NaN	NaN	193.0	NaN	NaN	NaN	56.0	NaN	45.0	243.0	...	NaN	6171.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4/8/20	NaN	NaN	205.0	NaN	NaN	NaN	63.0	NaN	50.0	273.0	...	NaN	7111.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4/9/20	NaN	NaN	235.0	25.0	NaN	NaN	72.0	NaN	51.0	295.0	...	NaN	7993.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 184 columns

Now let's plot as above to make sure we see what we think we should see:

In [22]:

# Plot time series of several countries of interest
poi = ['China', 'US', 'Italy', 'France', 'Australia']
deaths_country[poi].plot(figsize=(20,10), linewidth=2, marker='.', colormap='brg', fontsize=20)
plt.xlabel('Date', fontsize=20);
plt.ylabel('Number of Reported Deaths', fontsize=20);
plt.title('Reported Deaths Time Series', fontsize=20);

The countries that have seen less than 25 total deaths will have columns of all NaNs now so let's drop these and then see how many columns we have left:

In [23]:

# Drop columns that are all NaNs (i.e. countries that haven't yet reached 25 deaths)
deaths_country.dropna(axis=1, how='all', inplace=True)
deaths_country.info()

<class 'pandas.core.frame.DataFrame'>
Index: 79 entries, 1/22/20 to 4/9/20
Data columns (total 60 columns):
Algeria                   15 non-null float64
Andorra                   1 non-null float64
Argentina                 10 non-null float64
Australia                 7 non-null float64
Austria                   17 non-null float64
Belgium                   21 non-null float64
Bosnia and Herzegovina    4 non-null float64
Brazil                    19 non-null float64
Canada                    18 non-null float64
Chile                     6 non-null float64
China                     77 non-null float64
Colombia                  7 non-null float64
Czechia                   10 non-null float64
Denmark                   17 non-null float64
Dominican Republic        13 non-null float64
Ecuador                   17 non-null float64
Egypt                     14 non-null float64
Finland                   6 non-null float64
France                    31 non-null float64
Germany                   23 non-null float64
Greece                    15 non-null float64
Hungary                   7 non-null float64
India                     12 non-null float64
Indonesia                 22 non-null float64
Iran                      43 non-null float64
Iraq                      17 non-null float64
Ireland                   13 non-null float64
Israel                    9 non-null float64
Italy                     41 non-null float64
Japan                     25 non-null float64
Korea, South              39 non-null float64
Luxembourg                9 non-null float64
Malaysia                  14 non-null float64
Mexico                    10 non-null float64
Moldova                   2 non-null float64
Morocco                   13 non-null float64
Netherlands               24 non-null float64
North Macedonia           3 non-null float64
Norway                    12 non-null float64
Pakistan                  10 non-null float64
Panama                    10 non-null float64
Peru                      10 non-null float64
Philippines               19 non-null float64
Poland                    11 non-null float64
Portugal                  17 non-null float64
Romania                   14 non-null float64
Russia                    8 non-null float64
San Marino                11 non-null float64
Saudi Arabia              7 non-null float64
Serbia                    9 non-null float64
Slovenia                  5 non-null float64
Spain                     32 non-null float64
Sweden                    18 non-null float64
Switzerland               24 non-null float64
Thailand                  4 non-null float64
Tunisia                   1 non-null float64
Turkey                    19 non-null float64
US                        31 non-null float64
Ukraine                   7 non-null float64
United Kingdom            25 non-null float64
dtypes: float64(60)
memory usage: 37.6+ KB

As we're going to align the countries from the day they first had at least 25 deaths, we won't need the DateTimeIndex. In fact, we won't need the date at all. So we can

Reset the Index, which will give us an ordinal index (which turns the date into a regular column) and
Drop the date column (which will be called 'index) after the reset.

In [24]:

# drop index, sort date column
deaths_country_drop = deaths_country.reset_index().drop(['index'], axis=1)
deaths_country_drop.head()

Out[24]:

Country/Region	Algeria	Andorra	Argentina	Australia	Austria	Belgium	Bosnia and Herzegovina	Brazil	Canada	Chile	...	Slovenia	Spain	Sweden	Switzerland	Thailand	Tunisia	Turkey	US	Ukraine	United Kingdom
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 60 columns

Now it's time to shift each column so that the first entry is the first NaN value that it contains! To do this, we can use the shift() method on each column. How much do we shift each column, though? The magnitude of the shift is given by how many NaNs there are at the start of the column, which we can retrieve using the first_valid_index() method on the column but we want to shift up, which is negative in direction (by convention and perhaps intuition). SO let's do it.

In [25]:

# shift
for col in deaths_country_drop.columns:
    deaths_country_drop[col] = deaths_country_drop[col].shift(-deaths_country_drop[col].first_valid_index())
# check out head
deaths_country_drop.head()

Out[25]:

Country/Region	Algeria	Andorra	Argentina	Australia	Austria	Belgium	Bosnia and Herzegovina	Brazil	Canada	Chile	...	Slovenia	Spain	Sweden	Switzerland	Thailand	Tunisia	Turkey	US	Ukraine	United Kingdom
0	25.0	25.0	27.0	28.0	28.0	37.0	29.0	25.0	25.0	27.0	...	28.0	28.0	25.0	27.0	26.0	25.0	30.0	28.0	27.0	56.0
1	26.0	NaN	28.0	30.0	30.0	67.0	33.0	34.0	26.0	34.0	...	30.0	35.0	36.0	28.0	27.0	NaN	37.0	36.0	32.0	56.0
2	29.0	NaN	36.0	35.0	49.0	75.0	34.0	46.0	30.0	37.0	...	36.0	54.0	62.0	41.0	30.0	NaN	44.0	40.0	37.0	72.0
3	31.0	NaN	39.0	40.0	58.0	88.0	35.0	59.0	38.0	43.0	...	40.0	55.0	77.0	54.0	32.0	NaN	59.0	47.0	38.0	138.0
4	35.0	NaN	43.0	45.0	68.0	122.0	NaN	77.0	54.0	48.0	...	43.0	133.0	105.0	75.0	NaN	NaN	75.0	54.0	45.0	178.0

5 rows × 60 columns

Side note: instead of looping over columns, we could have applied a lambda function to the columns of the dataframe, as follows:

In [26]:

# shift using lambda function
#deaths_country = deaths_country.apply(lambda x: x.shift(-x.first_valid_index()))

Now we get to plot our time series, first with linear axes, then semi-log:

In [27]:

# Plot time series 
ax = deaths_country_drop.plot(figsize=(20,10), linewidth=2, marker='.', fontsize=20)
ax.legend(ncol=3, loc='upper right')
plt.xlabel('Days', fontsize=20);
plt.ylabel('Number of Reported Deaths', fontsize=20);
plt.title('Total reported coronavirus deaths for places with at least 25 deaths', fontsize=20);

In [28]:

# Plot semi log time series 
ax = deaths_country_drop.plot(figsize=(20,10), linewidth=2, marker='.', fontsize=20, logy=True)
ax.legend(ncol=3, loc='upper right')
plt.xlabel('Days', fontsize=20);
plt.ylabel('Deaths Patients count', fontsize=20);
plt.title('Total reported coronavirus deaths for places with at least 25 deaths', fontsize=20);

Note: although we have managed to plot what we wanted, the above plots are challenging to retrieve any meaningful information from. There are too many growth curves so that it's very crowded and too many colours look the same so it's difficult to tell which country is which from the legend. Below, we'll plot less curves and further down in the notebook we'll use the python package Altair to introduce interactivity into the plot in order to deal with this challenge.

In [29]:

# Plot semi log time series 
ax = deaths_country_drop.plot(figsize=(20,10), linewidth=2, marker='.', fontsize=20, logy=True)
ax.legend(ncol=3, loc='upper right')
plt.xlabel('Days', fontsize=20);
plt.ylabel('Deaths Patients count', fontsize=20);
plt.title('Total reported coronavirus deaths for places with at least 25 deaths', fontsize=20);

Summary: We've

looked at the dataset containing the number of reported deaths for each region,
wrangled the data to look at the number of reported deaths by country,
plotted the number of reported deaths by country (both log and semi-log),
aligned growth curves to start with day of number of known deaths ≥ 25.

Plotting number of recovered people¶

The third dataset in the Hopkins repository is the number of recovered. We want to do similar data wrangling as in the two cases above so we could copy and paste our code again but, if you're writing the same code three times, it's likely time to write a function.

In [30]:

# Function for grouping countries by region
def group_by_country(raw_data):
    """Returns data for countries indexed by date"""
    # Group by
    data = raw_data.groupby(by='Country/Region').sum().drop(['Lat', 'Long'], axis=1)
    # Transpose
    data = data.transpose()
    # Set index as DateTimeIndex
    datetime_index = pd.DatetimeIndex(data.index)
    data.set_index(datetime_index, inplace=True)
    return data

In [31]:

# Function to align growth curves
def align_curves(data, min_val):
    """Align growth curves  to start on the day when the number of known deaths = min_val"""
    # Loop over columns & set values < min_val to None
    for col in data.columns:
        data.loc[(data[col] < min_val), col] = None
    # Drop columns with all NaNs
    data.dropna(axis=1, how='all', inplace=True)
    # Reset index, drop date
    data = data.reset_index().drop(['index'], axis=1)
    # Shift each column to begin with first valid index
    for col in data.columns:
        data[col] = data[col].shift(-data[col].first_valid_index())
    return data

In [32]:

# Function to plot time series
def plot_time_series(df, plot_title, x_label, y_label, logy=False):
    """Plot time series and make looks a bit nice"""
    ax = df.plot(figsize=(20,10), linewidth=2, marker='.', fontsize=20, logy=logy)
    ax.legend(ncol=3, loc='lower right')
    plt.xlabel(x_label, fontsize=20);
    plt.ylabel(y_label, fontsize=20);
    plt.title(plot_title, fontsize=20);

For a sanity check, let's see these functions at work on the 'number of deaths' data:

In [33]:

deaths_country_drop = group_by_country(raw_data_deaths)
deaths_country_drop = align_curves(deaths_country_drop, min_val=25)
plot_time_series(deaths_country_drop, 'Number of Reported Deaths', 'Days', 'Reported Deaths by Country', logy=True)

Now let's check use our functions to group, wrangle, and plot the recovered patients data:

In [34]:

# group by country and check out tail
recovered_country = group_by_country(raw_data_recovered)
recovered_country.tail()

Out[34]:

Country/Region	Afghanistan	Albania	Algeria	Andorra	Angola	Argentina	Armenia	Australia	Austria	...	United Arab Emirates	United Kingdom	Uruguay	Uzbekistan	Venezuela	Vietnam	West Bank and Gaza	Zambia
2020-04-05	15	104	90	26	2	280	57	757	2998	...	144	229	93	30	52	90	25	3
2020-04-06	18	116	90	31	2	325	62	1080	3463	...	167	287	104	30	65	95	24	5
2020-04-07	18	131	113	39	2	338	87	1080	4046	...	186	325	150	30	65	123	42	7
2020-04-08	29	154	237	52	2	358	114	1080	4512	...	239	345	150	30	65	126	44	7
2020-04-09	32	165	347	58	2	365	138	1472	5240	...	268	359	192	38	84	128	44	24

5 rows × 184 columns

In [35]:

# align curves and check out head
recovered_country_drop = align_curves(recovered_country, min_val=25)
recovered_country_drop.head()

Out[35]:

Country/Region	Afghanistan	Albania	Algeria	Andorra	Argentina	Armenia	Australia	Austria	Azerbaijan	Bahrain	...	Turkey	US	Ukraine	United Arab Emirates	United Kingdom	Uruguay	Uzbekistan	Venezuela	Vietnam	West Bank and Gaza
0	29.0	31.0	32.0	26.0	52.0	28.0	26.0	112.0	26.0	35.0	...	26.0	105.0	25.0	26.0	53.0	41.0	25.0	31.0	25.0	25.0
1	32.0	31.0	32.0	31.0	52.0	30.0	26.0	225.0	26.0	35.0	...	26.0	121.0	28.0	31.0	67.0	41.0	25.0	39.0	55.0	NaN
2	NaN	33.0	32.0	39.0	63.0	30.0	26.0	225.0	26.0	44.0	...	42.0	147.0	28.0	31.0	67.0	62.0	25.0	39.0	58.0	42.0
3	NaN	44.0	65.0	52.0	72.0	30.0	88.0	479.0	26.0	44.0	...	70.0	176.0	28.0	38.0	67.0	68.0	30.0	39.0	63.0	44.0
4	NaN	52.0	65.0	58.0	72.0	30.0	88.0	636.0	32.0	60.0	...	105.0	178.0	35.0	38.0	67.0	93.0	30.0	39.0	75.0	44.0

5 rows × 105 columns

Plot time series:

In [36]:

plot_time_series(recovered_country_drop, 'Recovered Patients Time Series', 'Days', 'Recovered Patients count')

In [37]:

plot_time_series(recovered_country_drop, 'Recovered Patients Time Series', 'Days', 'Recovered Patients count', True)

Note: once again, the above plots are challenging to retrieve any meaningful information from. There are too many growth curves so that it's very crowded and too many colours look the same so it's difficult to tell which country is which from the legend. Let's plot less curves and in the next section we'll use the python package Altair to introduce interactivity into such a plot in order to deal with this challenge.

In [38]:

plot_time_series(recovered_country_drop[poi], 'Recovered Patients Time Series', 'Days', 'Recovered Patients count', True)

Summary: We've

looked at the dataset containing the number of reported recoveries for each region,
written function for grouping, wrangling, and plotting the data,
grouped, wrangled, and plotted the data for the number of reported recoveries.

Interactive plots with altair¶

We're now going to build some interactive data visualizations. I was recently inspired by this one in the NYTimes, a chart of confirmed number of deaths by country for places with at least 25 deaths, similar to ours above, but with informative hover tools. This one is also interesting.

We're going to use a tool called Altair. I like Altair for several reasons, including precisely what they state on their website:

With Altair, you can spend more time understanding your data and its meaning. Altair’s API is simple, friendly and consistent and built on top of the powerful Vega-Lite visualization grammar. This elegant simplicity produces beautiful and effective visualizations with a minimal amount of code.

Before jumping into Altair, let's reshape our deaths_country dataset. Notice that it's currently in wide data format, with a column for each country and a row for each "day" (where day 1 is the first day with over 25 confirmed deaths). This worked with the pandas plotting API for reasons discussed above.

In [39]:

# Look at head
deaths_country_drop.head()

Out[39]:

Country/Region	Algeria	Andorra	Argentina	Australia	Austria	Belgium	Bosnia and Herzegovina	Brazil	Canada	Chile	...	Slovenia	Spain	Sweden	Switzerland	Thailand	Tunisia	Turkey	US	Ukraine	United Kingdom
0	25.0	25.0	27.0	28.0	28.0	37.0	29.0	25.0	25.0	27.0	...	28.0	28.0	25.0	27.0	26.0	25.0	30.0	28.0	27.0	56.0
1	26.0	NaN	28.0	30.0	30.0	67.0	33.0	34.0	26.0	34.0	...	30.0	35.0	36.0	28.0	27.0	NaN	37.0	36.0	32.0	56.0
2	29.0	NaN	36.0	35.0	49.0	75.0	34.0	46.0	30.0	37.0	...	36.0	54.0	62.0	41.0	30.0	NaN	44.0	40.0	37.0	72.0
3	31.0	NaN	39.0	40.0	58.0	88.0	35.0	59.0	38.0	43.0	...	40.0	55.0	77.0	54.0	32.0	NaN	59.0	47.0	38.0	138.0
4	35.0	NaN	43.0	45.0	68.0	122.0	NaN	77.0	54.0	48.0	...	43.0	133.0	105.0	75.0	NaN	NaN	75.0	54.0	45.0	178.0

5 rows × 60 columns

For Altair, we'll want to convert the data into long data format. What this will do essentially have a row for each country/day pair so our columns will be 'Day', 'Country', and number of 'Deaths'. We do this using the dataframe method .melt() as follows:

In [40]:

# create long data for deaths
deaths_long = deaths_country_drop.reset_index().melt(id_vars='index', value_name='Deaths').rename(columns={ 'index': 'Day' })
deaths_long.head()

Out[40]:

	Day	Country/Region	Deaths
0	0	Algeria	25.0
1	1	Algeria	26.0
2	2	Algeria	29.0
3	3	Algeria	31.0
4	4	Algeria	35.0

We'll see the power of having long data when using Altair. Such transformations have been performed for a long time, however it wasn't until 2014 that Hadley Wickham formalized the language in his paper Tidy Data. Note that Wickham prefers to avoid the terms long and wide because, in his words, 'they are imprecise'. I generally agree but for our purposes here of giving the flavour, they suffice.

Now having transformed our data, let's import Altair and get a sense of its API.

In [41]:

import altair as alt

# altair plot 
alt.Chart(deaths_long).mark_line().encode(
    x='Day', 
    y='Deaths', 
    color='Country/Region'
)

Out[41]:

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor

It is nice to be able to build such an informative and elegant chart in four lines of code (which is also elegant). And, looking at the simplicity of the code we just wrote, we can see why it was great to have long data: a column for each variable allowed us to explicitly and easily tell Altair what we wanted on each axis and what we wanted for the colour.

As the Altair documentation (which is great, by the way!) states,

The key idea is that you are declaring links between data columns and visual encoding channels, such as the x-axis, y-axis, color, etc. The rest of the plot details are handled automatically. Building on this declarative plotting idea, a surprising range of simple to sophisticated plots and visualizations can be created using a relatively concise grammar.

We can now customize the code to thicken the line width, to alter the opacity, and to make the chart larger:

In [42]:

# altair plot 
alt.Chart(deaths_long).mark_line(strokeWidth=4, opacity=0.7).encode(
    x='Day',
    y='Deaths',
    color='Country/Region'
).properties(
    width=800,
    height=650
)

Out[42]:

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor

We can also add a log y-axis. To do this, The long-form, we express the types using the long-form alt.X('Day',...), which is, in the words of the Altair documentation

useful when doing more fine-tuned adjustments to the encoding, such as binning, axis and scale properties, or more.

We'll also now add a hover tooltip so that, when we hover our cursor over any point on any of the lines, it will tell us the 'Country', the 'Day', and the number of 'Deaths'.

In [43]:

# altair plot 
alt.Chart(deaths_long).mark_line(strokeWidth=4, opacity=0.7).encode(
    x=alt.X('Day'),
    y=alt.Y('Deaths', scale=alt.Scale(type='log')),
    color='Country/Region',
    tooltip=['Country/Region', 'Day', 'Deaths']
).properties(
    width=800,
    height=650
)

Out[43]:

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor

It's great that we could add that useful hover tooltip with one line of code tooltip=['Country/Region', 'Day','Deaths'], particularly as it adds such information rich interaction to the chart. One useful aspect of the NYTimes chart was that, when you hovered over a particular curve, it made it stand out against the other. We're going to do something similar here: in the resulting chart, when you click on a curve, the others turn grey.

Note: When first attempting to build this chart, I discovered here that "multiple conditional values in one encoding are not allowed by the Vega-Lite spec," which is what Altair uses. For this reason, we build the chart, then an overlay, and then combine them.

In [44]:

# Selection tool
selection = alt.selection_single(fields=['Country/Region'])
# Color change when clicked
color = alt.condition(selection,
                     alt.Color('Country/Region:N'),
                     alt.value('lightgray'))


# Base altair plot 
base = alt.Chart(deaths_long).mark_line(strokeWidth=4, opacity=0.7).encode(
    x=alt.X('Day'),
    y=alt.Y('Deaths', scale=alt.Scale(type='log')),
    color='Country/Region',
    tooltip=['Country/Region', 'Day','Deaths']
).properties(
    width=800,
    height=650
)

# Chart
chart = base.encode(
    color=alt.condition(selection, 'Country/Region:N', alt.value('lightgray'))
).add_selection(
    selection
)

# Overlay
overlay = base.encode(
    color='Country/Region',
  opacity=alt.value(0.5),
  tooltip=['Country/Region:N', 'Name:N']
).transform_filter(
  selection
)

# Sum em up!
chart + overlay

Out[44]:

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor

It's not super easy to line up the legend with the curves on the chart so let's put the labels on the chart itself. Thanks to Jake Vanderplas for this suggestion, and for the code.

In [45]:

# drop NaNs
deaths_long = deaths_long.dropna()

# Selection tool
selection = alt.selection_single(fields=['Country/Region'])
# Color change when clicked
color = alt.condition(selection,
                    alt.Color('Country/Region:N'),
                    alt.value('lightgray'))


# Base altair plot 
base = alt.Chart(deaths_long).mark_line(strokeWidth=4, opacity=0.7).encode(
    x=alt.X('Day'),
    y=alt.Y('Deaths', scale=alt.Scale(type='log')),
    color=alt.Color('Country/Region', legend=None),
).properties(
    width=800,
    height=650
)

# Chart
chart = base.encode(
  color=alt.condition(selection, 'Country/Region:N', alt.value('lightgray'))
).add_selection(
  selection
)

# Overlay
overlay = base.encode(
  color='Country/Region',
  opacity=alt.value(0.5),
  tooltip=['Country/Region:N', 'Name:N']
).transform_filter(
  selection
)

# Text labels
text = base.mark_text(
    align='left',
    dx=5,
    size=10
).encode(
    x=alt.X('Day', aggregate='max',  axis=alt.Axis(title='Day')),
    y=alt.Y('Deaths', aggregate={'argmax': 'Day'}, axis=alt.Axis(title='Reported Deaths')),
    text='Country/Region',  
).transform_filter(
    selection
)

# Sum em up!
chart + overlay + text

Out[45]:

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor

Summary: We've

melted the data into long format,
used Altair to make interactive plots of increasing richness,
admired the elegance & simplicity of the Altair API and the visualizations produced.

That's all for the time being. I'd be interested to see how you all can make these charts more information rich and comprehensible. I encourage you to raise ideas in issues on the issue tracker in this github repository and then to make pull requests. A couple of ideas are

Adding lines to the above chart that show curves for deaths doubling each X days, as in the first chart here,
Figuring out a way to make the chart less crowded with names by perhaps only showing 10 of them.

Country/Region	Algeria	Andorra	Argentina	Australia	Austria	Belgium	Bosnia and Herzegovina	Brazil	Canada	Chile	...	Slovenia	Spain	Sweden	Switzerland	Thailand	Tunisia	Turkey	US	Ukraine	United Kingdom
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Country/Region	Algeria	Andorra	Argentina	Australia	Austria	Belgium	Bosnia and Herzegovina	Brazil	Canada	Chile	...	Slovenia	Spain	Sweden	Switzerland	Thailand	Tunisia	Turkey	US	Ukraine	United Kingdom
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

COVID-19 Exploratory Data Analysis

Imports and data¶

Confirmed cases of COVID-19¶

Number of confirmed cases by country¶

Plotting confirmed cases by country¶

Number of reported deaths¶

Number of reported deaths by country¶

Plotting number of reported deaths by country¶

Aligning growth curves to start with day of number of known deaths ≥ 25¶

Plotting number of recovered people¶

Interactive plots with altair¶

Country/Region	Algeria	Andorra	Argentina	Australia	Austria	Belgium	Bosnia and Herzegovina	Brazil	Canada	Chile	...	Slovenia	Spain	Sweden	Switzerland	Thailand	Tunisia	Turkey	US	Ukraine	United Kingdom
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN