-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.qmd
407 lines (322 loc) · 20.2 KB
/
index.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
---
title: "Annotated area charts"
subtitle: "A visualisation of coal production data for the 2024 `plotnine` contest"
author: "Nicola Rennie"
format:
html:
theme: lux
toc: true
highlight-style: pygments
---
[Plotnine](https://plotnine.org/) is a visualisation library that brings the Grammar of Graphics to Python, and the [2024 Plotnine Contest](https://posit.co/blog/announcing-the-2024-plotnine-contest/) aims to bring the community together to create and share with others great `plotnine` examples! In this tutorial, I'll be walking you through the process of creating the following plot for my entry into the `plotnine` contest!
![](plot.png)
This entry for the `plotnine` contest was inspired by a visualisation I previously created using `{ggplot2}` in R for [#TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-05-21/readme.md). You can see the original R version on [GitHub](https://github.com/nrennie/tidytuesday/tree/main/2024/2024-05-21).
# Preparing for plotting
Before we dive into plotting, we need to get some data and tidy it up!
## Python libraries
Let's start by loading the Python libraries that we'll need to create the plot. Beyond `plotnine` for actually plotting, we also need `pandas` for data wrangling, `textwrap` for wrapping long strings of text, `matplotlib.pyplot` for some additional plot tinkering, `highlight_text` for adding coloured and bold text to annotations, and `matplotlib.font_manager` for dealing with fonts.
```{python}
import plotnine as gg
import pandas as pd
import textwrap
import matplotlib.pyplot as plt
import highlight_text as ht
import matplotlib.font_manager
```
## Reading in data
The data set we'll be using is the *Carbon Majors* emissions data. As I mentioned earlier, this dataset was used as a [#TidyTuesday dataset](https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-05-21/readme.md). This means we can read the data in directly from the #TidyTuesday GitHub repository:
```{python}
emissions = pd.read_csv(
'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-21/emissions.csv')
```
You can download the `emissions.csv` file from [GitHub](https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-05-21/emissions.csv) if you prefer to load the data from a local file.
Carbon Majors is a database of historical production data from 122 of the world’s largest oil, gas, coal, and cement producers. The dataset has 12,551 rows and 7 columns with the following variables: `year`, `parent_entity`, `parent_type`, `commodity`, `production_value`, `production_unit`, and `total_emissions_MtCO2e`. The #TidyTuesday [README](https://github.com/rfordatascience/tidytuesday/blob/master/data/2024/2024-05-21/readme.md#emissionscsv) file has more information on what the variables are.
::: {.callout-note}
## Carbon Majors Data
The original data can be downloaded from [carbonmajors.org/Downloads](https://carbonmajors.org/Downloads). The Carbon Majors dataset is available for download and for non-commercial use, subject to InfluenceMap's [Terms and Conditions](https://influencemap.org/terms). The version used in this plot was downloaded in May 2024.
:::
## Data wrangling
For this plot, let's focus on emissions relating to coal only. There are six different types of coal in the data set and we'll filter the data to keep only rows relating to those six types of coal. We'll also remove the word `"Coal"` from each of the category labels since it will be obvious they are types of coal.
```{python}
# Prep data for plotting
plot_data = emissions[
emissions['commodity'].isin([
'Sub-Bituminous Coal', 'Metallurgical Coal', 'Bituminous Coal',
'Thermal Coal', 'Anthracite Coal', 'Lignite Coal'
])
].copy()
plot_data['commodity'] = plot_data['commodity'].str.replace(' Coal', '')
```
We'll also focus on production levels, so we'll keep only the columns we need to plot: `year`, type of coal (`commodity`), and amount produced (`production_value`). The data README tells us that the `production_value` column is given in million tonnes for coal production.
We need to sum up production across the different entities to get a total production value per year. We'll also filter the data to consider only the years since 1900:
```{python}
# Total production per year since 1900
plot_data = plot_data[['year', 'commodity', 'production_value']]
plot_data = plot_data.groupby(['year', 'commodity'], as_index=False).agg(
{'production_value': 'sum'}).rename(columns={'production_value': 'n'})
plot_data = plot_data[plot_data['year'] >= 1900]
```
Let's also sort the categories based on their production amount in the last year of the data (2022), to make our plot a little bit clearer:
```{python}
# Sort values by 2022 levels
orders = plot_data[plot_data['year'] == 2022].sort_values(
by='n', ascending=False)['commodity']
plot_data['commodity'] = pd.Categorical(
plot_data['commodity'],
categories=orders,
ordered=True)
```
We're going to be adding some annotations to the plot, and one of the annotations will show the first year that production exceeded 100 million tonnes per year. Let's calculate that year (and therefore the position of the annotation) in a data-driven way:
```{python}
# Values for annotations
exceeds100 = plot_data.groupby('year')['n'].sum()
exceeds100 = exceeds100[exceeds100 > 100].index.min()
```
I often find that traditional grid lines make a plot look a little bit too *busy*. Instead, we're going to create our own *grid lines* by drawing some vertical lines. To do this, let's create some data with the desired positions of where the vertical lines hit the x-axis:
```{python}
# Create data for x-axis labels
segment_data = pd.DataFrame({
'year': list(range(1900, 2021, 20))
})
```
We'll do something similar with the y-axis, by creating custom labels that encompass both the values and the units, instead of having a separate y-axis title. We'll put these values on the right hand side of the plot (rather than the left as normal) because this is often where people's eyes end up looking when they're reading a chart of data over time.
> Note the `\n` creates a linebreak in the text.
```{python}
# y-axis labels
y_axis_data = pd.DataFrame({
'value': [0, 2000, 4000, 6000, 8000],
'label': ['0', '2,000', '4,000', '6,000', '8,000\nmillion\ntonnes']
})
```
## Colour, font, and text variables
To make it easier to change up colour schemes, it's useful to define colours as variables. Here, we define a background colour (`bg_col`), a text colour (`text_col`), and a colour palette (`col_palette`) that will be used to colour the different areas.
```{python}
# Define background colour, text colour, and colour palette
bg_col = '#FFFFFA'
text_col = '#0D5C63'
col_palette = [
'#E58606',
'#5D69B1',
'#52BCA3',
'#99C945',
'#CC61B0',
'#24796C']
```
Similarly, we can define variable(s) for which font we want to use. But we also need to make sure that the font we want is actually available. We can get a list of available fonts using `matplotlib.font_manager.findSystemFonts()`.
```{python}
# Available fonts
flist = matplotlib.font_manager.findSystemFonts()
```
If `'Arial'` is installed, we'll use that as the main font (`body_font`) and if not, we'll use the default sans serif font.
```{python}
# Check if 'Arial' in list of installed fonts
flist = ''.join(flist).lower()
if 'arial' in flist:
body_font = 'Arial'
else:
body_font = 'sans'
```
> If you're running this code on your own laptop, feel free to choose a different font!
Let's also create some variables to store our title and subtitle text. Although these could be passed straight into title and subtitle arguments in the plotting function, keeping them separate leaves the plotting code looking a little bit cleaner. The subtitle text is also quite long, so we'll use `textwrap.wrap()` to wrap the text to 50 characters (without breaking words onto multiple lines):
```{python}
# title, subtitle
title_text = 'Coal production since 1900'
st = 'Carbon Majors is a database of historical production data from 122 of the world’s largest oil, gas, coal, and cement producers. This data is used to quantify the direct operational emissions and emissions from the combustion of marketed products that can be attributed to these entities.'
wrapped_subtitle = '\n'.join(textwrap.wrap(st, width=50))
```
Instead of a traditional legend, we're going to use coloured text to show which categories the different colours map to. So let's prepare some text that we'll use later in functions from the [`highlight-text` library](https://pypi.org/project/highlight-text/).
We create a string as normal with the text we want to display, then format words within the strings in the following way:
```{python}
#| eval: false
'Normal text then <coloured text::{"color": "red"}>'
```
Instead, of `"red"` we can use our different hex colours. We also add line breaks using `\n`. We can add some text formatting to the caption in a similar way. Instead of colouring the text, we'll make some of it bold. Therefore, we use `"fontweight": "bold"` instead of `"color": "red"`:
```{python}
# annotation labels
coal_types_label = 'Total coal production includes\nproduction of <Bituminous::{"color": "#E58606"}>,\n<Sub-bituminous::{"color": "#5D69B1"}>, <Metallurgical::{"color": "#52BCA3"}>,\n<Lignite::{"color": "#99C945"}>, <Anthracite::{"color": "#CC61B0"}>, and <Thermal::{"color": "#24796C"}>\ncoal. Bituminous accounts\nfor around half.'
# caption
cap = '<Data::{"fontweight": "bold"}>: Carbon Majors\n<Graphic::{"fontweight": "bold"}>: Nicola Rennie (@nrennie)'
```
Now we're ready to plot!
# Plotting with `plotnine`
`Plotnine` creates plots by adding *layers* e.g. a layer for the background, a layer for points, a layer for text, and so on. For each layer, we need to tell `plotnine` which data should be used for the layer, and how the variables in the data map to the different *aesthetics* in the plot using the `aes()` function.
We start every `plotnine` plot using the `ggplot()` function, and in here we can define the data and the aesthetic mapping *globally* i.e. for all layers in the plot. Let's pass in the `plot_data` and put the `year` on the x-axis and `n` on the `y-axis`.
> If you're a {ggplot2} user, note that the column names needs to passed in quotation marks.
This creates an empty plot, but with the axes limits set according to the `year` and `n` columns:
```{python}
p = (gg.ggplot(plot_data, gg.aes(x='year', y='n')))
```
```{python}
#| echo: false
#| eval: true
p + gg.theme(figure_size = (8, 6))
```
We then start adding on more layers, and we can override the global data and aesthetic mappings that we passed into `ggplot()` for each individual layer. Let's start by adding our custom gridlines. We already created the data for this in the `segment_data` DataFrame earlier so we pass this into the `data` argument of `geom_segment()`. The `geom_segment()` function needs four aesthetics: the x- and y- co-ordinates of the start and end points of the lines. Since the lines are vertical, the `year` will be both the start and end x-coordinates. We want the lines to start below the area chart we'll be adding so the y-values go between `0` and `-1700` (this took a little bit of trial and error!)
Again, although the axis labels are added automatically, we'll add our own custom text instead. We can add text using the `geom_text()` function, where the required aesthetics are the x- and y- coordinates of where the text should be as well as the `label` defining what text should appear. We do this for both the x- and y- axis labels. Further arguments are used to define how the text appears: `color` defines the colour of the text, `size` defines the size of the text, `family` defines the font family to be used, `ha='left'` left aligns the text horizontally, and `va='top'` aligns the text vertically with the top of the y- coordinate given.
> If you're used to {ggplot2} in R, note that these alignment arguments are different: for example `hjust = 0` is the R equivalent to `ha='left'`. Similarly for `vjust` and `va`.
```{python}
p = (p +
# Axis lines
gg.geom_segment(data=segment_data, mapping=gg.aes(x='year', xend='year', y=0, yend=-1700),
linetype='dashed', alpha=0.4, color=text_col) +
# Axis labels
gg.geom_text(data=segment_data, mapping=gg.aes(x='year', y=-1900, label='year'),
color=text_col, size=8, family=body_font, ha='left') +
gg.geom_text(data=y_axis_data, mapping=gg.aes(x=2023, y='value', label='label'),
color=text_col, size=8, family=body_font, ha='left', va='top'))
```
```{python}
#| echo: false
#| eval: true
p + gg.theme(figure_size = (8, 6))
```
We'll deal with the fact that there are now duplicated grid lines a little bit later. Let's add some annotations. We'll add two annotations, with each annotation consisting of three parts: (i) a line indicating a time point, (ii) a title, and (iii) some more information. We can add these using the `annotate()` function - similar to the `geom_()` functions we've used already but designed to add a small number of plot elements directly instead of from a DataFrame. We need to specify which *geometry* we want to use, any required aesthetics, and some optional aesthetics.
> We *could* create a DataFrame with this information, and use the `geom_()` functions instead, but since it's only two annotations, `annotate()` will work just fine.
We add the lines using the `'segment'` geometry where we specify the x- and y-coordinates again, as well as the optional `size` (width of the line) and `color` (text colour) arguments. We add the text annotations using the `'text'` geometry, passing in the x- and y- coordinates and well as the text to appear in the `label` argument. The `fontweight='bold'` argument sets the font to bold.
For the second annotation, the larger text will be the coloured text we defined earlier. Unfortunately, this doesn't (yet) work directly in `plotnine` so we'll leave it out for now and add it a little bit later.
```{python}
p = (p +
# Annotation 1
gg.annotate(
'segment',
x=exceeds100, xend=exceeds100,
y=0, yend=5000,
size=1,
color=text_col
) +
gg.annotate(
'text',
x=exceeds100 + 2, y=5000,
label=exceeds100,
color=text_col,
family=body_font,
ha='left',
va='top',
size=10,
fontweight='bold'
) +
gg.annotate(
'text',
x=exceeds100 + 2, y=5000 - 600,
label='Total coal production first\nexceeds 100 million tonnes\nper year.',
color=text_col,
family=body_font,
ha='left',
size=9,
va='top'
) +
# Annotation 2
gg.annotate(
'segment',
size=1,
x=1975, xend=1975,
y=0, yend=10000,
color=text_col
) +
gg.annotate(
'text',
x=1975 + 2, y=10000,
label='Coal types',
color=text_col,
family=body_font,
ha='left',
va='top',
size=10,
fontweight='bold'
))
```
```{python}
#| echo: false
#| eval: true
p + gg.theme(figure_size = (8, 6))
```
Now let's finally add the area chart! The `geom_area()` function inherits the `data` and `mapping` that we passed into `ggplot()` earlier, so the only additional mapping we need to specify is the `fill` option. We want to colour the different areas based on the `commodity` column. In `plotnine`, `color` is typically used for outline colours, and `fill` is used for inner shape colours. We also pass in the hex colours we defined earlier in `col_palette` to the `scale_fill_manual()` function to tell `plotnine` which colours to use. You'll notice that a legend is added automatically to the right hand side - we'll remove this and replace with coloured text a little bit later.
We also use the `scale_x_continuous()` and `scale_y_continuous()` functions to set the range of the x- and y- axes, respectively. We add a little bit of padding to each side to leave room for some extra text.
Finally, we also add the title and subtitle. Although the `labs()` function is typically used to add title and subtitle in `plotnine`, here I wanted a little bit more control over the positioning of the text (within the plot area) so I've used the `annotate()` function again. Again, the choice of y values for the positioning were a little bit of trial and error.
```{python}
p = (p +
# Add area plot
gg.geom_area(gg.aes(fill='commodity')) +
# Colour, x-axis, and y-axis scales
gg.scale_fill_manual(values=col_palette) +
gg.scale_x_continuous(limits=(1896, 2034)) +
gg.scale_y_continuous(limits=(-3300, 12000)) +
# Text for title and subtitle
gg.annotate(
'text',
x=1900, y=11400,
label=title_text,
color=text_col,
family=body_font,
ha='left',
va='top',
size=13,
fontweight='bold'
) +
gg.annotate(
'text',
x=1900, y=10500,
label=wrapped_subtitle,
color=text_col,
family=body_font,
size=9.5,
ha='left',
va='top'
))
```
```{python}
#| echo: false
#| eval: true
p + gg.theme(figure_size = (8, 6))
```
Let's add some final styling with `plotnine` by trimming the white space between the plot and the axis using `coord_cartesian(expand=False)`. In `plotnine` (and {ggplot2} in R) themes are used to control the styling of all non-data related elements e.g. the background colour, the grid lines, and so on. We add `theme_void()` to remove all axes, gridlines, and unwanted text. Then we add a little bit more styling using different arguments of `theme()` to remove the legend and set the background colour to the variable we defined earlier. Note that the background elements use `element_rect` since the background is essentially just a big rectangle!
```{python}
p = (p +
# Styling
gg.coord_cartesian(expand=False) +
gg.theme_void(base_size=8) +
gg.theme(
legend_position='none',
plot_background=gg.element_rect(fill=bg_col, color=bg_col),
panel_background=gg.element_rect(fill=bg_col, color=bg_col)
))
```
Here's what our plot looks like now:
```{python}
#| echo: false
#| eval: true
p + gg.theme(figure_size = (8, 6))
```
We still need to add the caption, and the coloured text as an alternative to a traditional legend. Unfortunately, there isn't currently a native way in `plotnine` to add coloured text through `highlight-text`. Luckily, `plotnine` is built on top of `matplotlib` - and we can exploit that to add the text directly using `matplotlib`.
## Applying text styling with `highlight-text`
We start by converting from a `plotnine` plot to a `matplotlib` plot using the `draw()` function. We can also set the size of the plot (and resolution if you want using `fig.set_dpi(300)`). The `plt.gca()` extracts the axes from the existing `plotnine` plot.
The `ax_text()` function from `highlight-text` adds text at the desired x- and y- coordinates, where the coordinates are based on the data range in the original plot. The `vsep` argument controls the line spacing, and `fontname` is used instead of `family`, but otherwise the arguments work similarly as in `plotnine`. We use the `ax_text()` function twice - once to add the coloured annotation, and once to add the caption. Finally, we show the plot.
```{python}
# Add coloured text with matplotlib and highlight-text
# Convert to matplotlib and set plot options
fig = p.draw()
fig.set_size_inches(8, 6, forward=True)
ax = plt.gca()
# add coloured text to annotation
ht.ax_text(
1977,
9400,
coal_types_label,
vsep=3,
color=text_col,
fontname=body_font,
fontsize=9,
va='top')
# add caption
ht.ax_text(1900, -2300, cap, color=text_col,
fontname=body_font, fontsize=7.5, va='top')
plt.show()
```
If we want to save our plot to a file, we can use `plt.savefig()` and specify the resolution and `bbox_inches='tight'` to avoid any extra whitespace around the edges of the plot.
```{python}
#| eval: false
# Save image
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
```
Check out the [`plotnine` Gallery](https://plotnine.org/gallery/) for more examples of plots created using `plotnine`!
> You can also read this tutorial on my [website](https://nrennie.rbind.io/blog/plotnine-annotated-area-chart/)!