Population growth and Doubling times with tidyverse

Feb 14, 2018 00:00 · 3071 words · 15 minute read tidyverse population studies demographic data

Roses are red, violets are blue

This is a forced rhyme, here’s blog post two!

Background

Ever since I worked on data about populations at my internship at Perscio, a healthcare data analysis firm in Indianapolis, as well as worked with a Professor of Demography and Social Policy on a paper about demographic data, I have gained interest in population problems - mostly through readings.

The best way to restart this journey would be to do so using what population problems often involve: Data analysis. In this post, we define and calculate population growth rates as well as doubling times of several countries and then finally produce intuitive visualizations of these numbers.

Loading libraries and data

The data used throughout this post is from United Nations’ Population Divison and consists of population numbers of all countries between 1970 and 2015 (in intervals of 5 years).

library(tidyverse)
library(kani)
library(scales)
library(geofacet)

population_raw <- read_csv("../../static/data/population.csv")

population_raw
## # A tibble: 273 x 68
##    Country    code `1950` `1951` `1952` `1953` `1954` `1955` `1956` `1957`
##    <chr>     <int> <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
##  1 WORLD       900 2 536… 2 583… 2 630… 2 677… 2 724… 2 772… 2 821… 2 871…
##  2 More dev…   901 814 8… 824 2… 834 0… 844 2… 854 6… 865 0… 875 5… 885 9…
##  3 Less dev…   902 1 721… 1 759… 1 796… 1 832… 1 869… 1 907… 1 945… 1 986…
##  4 Least de…   941 195 2… 199 0… 202 9… 206 8… 211 0… 215 4… 220 0… 224 8…
##  5 Less dev…   934 1 526… 1 560… 1 593… 1 626… 1 658… 1 691… 1 725… 1 761…
##  6 Less dev…   948 1 157… 1 179… 1 203… 1 229… 1 256… 1 284… 1 314… 1 344…
##  7 High-inc…  1503 672 8… 680 6… 688 8… 697 4… 706 2… 715 2… 724 3… 733 4…
##  8 Middle-i…  1517 1 734… 1 772… 1 808… 1 844… 1 880… 1 916… 1 954… 1 992…
##  9 Upper-mi…  1502 956 2… 980 1… 1 001… 1 022… 1 041… 1 060… 1 079… 1 099…
## 10 Lower-mi…  1501 778 2… 792 1… 807 0… 822 6… 839 1… 856 4… 874 4… 893 2…
## # ... with 263 more rows, and 58 more variables: `1958` <chr>,
## #   `1959` <chr>, `1960` <chr>, `1961` <chr>, `1962` <chr>, `1963` <chr>,
## #   `1964` <chr>, `1965` <chr>, `1966` <chr>, `1967` <chr>, `1968` <chr>,
## #   `1969` <chr>, `1970` <chr>, `1971` <chr>, `1972` <chr>, `1973` <chr>,
## #   `1974` <chr>, `1975` <chr>, `1976` <chr>, `1977` <chr>, `1978` <chr>,
## #   `1979` <chr>, `1980` <chr>, `1981` <chr>, `1982` <chr>, `1983` <chr>,
## #   `1984` <chr>, `1985` <chr>, `1986` <chr>, `1987` <chr>, `1988` <chr>,
## #   `1989` <chr>, `1990` <chr>, `1991` <chr>, `1992` <chr>, `1993` <chr>,
## #   `1994` <chr>, `1995` <chr>, `1996` <chr>, `1997` <chr>, `1998` <chr>,
## #   `1999` <chr>, `2000` <chr>, `2001` <chr>, `2002` <chr>, `2003` <chr>,
## #   `2004` <chr>, `2005` <chr>, `2006` <chr>, `2007` <chr>, `2008` <chr>,
## #   `2009` <chr>, `2010` <chr>, `2011` <chr>, `2012` <chr>, `2013` <chr>,
## #   `2014` <chr>, `2015` <chr>

The data looks a little weird:

  1. It’s in a wide format than a long one, each year seems to be a single column.
  2. The population values look to be parsed as characters, this is mostly because I didn’t provide any parsing formats to read_csv()

Tidying data

We can all fix this using some of the helper functions in the tidyverse package!

population <- population_raw %>%
  gather(`1950`:`2015`, key = "year", value = "population") %>%
  mutate(
    population = as.numeric(str_replace_all(population, " ", "")),
    year = as.numeric(year)
  )

population
## # A tibble: 18,018 x 4
##    Country                                           code  year population
##    <chr>                                            <int> <dbl>      <dbl>
##  1 WORLD                                              900  1950    2536275
##  2 More developed regions                             901  1950     814865
##  3 Less developed regions                             902  1950    1721410
##  4 Least developed countries                          941  1950     195259
##  5 Less developed regions, excluding least develop…   934  1950    1526151
##  6 Less developed regions, excluding China            948  1950    1157197
##  7 High-income countries                             1503  1950     672896
##  8 Middle-income countries                           1517  1950    1734481
##  9 Upper-middle-income countries                     1502  1950     956204
## 10 Lower-middle-income countries                     1501  1950     778277
## # ... with 18,008 more rows

Now that the data is more readable, we can look at what each column describes:

  1. Country: Country names (also contains data about regions and world)
  2. code: Country code specified by the UN
  3. population: Total population of the country in 1000s
  4. Year: .. The year

As an example, we can now plot how the population grew for the world, as well as countries with different income situations:

population %>%
  filter(str_detect(Country, "WORLD|income")) %>%
  ggplot(aes(year, population/1000, group = Country, color = Country)) + 
  geom_line(size = 1) + 
  scale_y_continuous(breaks = pretty_breaks(n = 6)) +
  scale_x_continuous(breaks = pretty_breaks(n = 6)) +
  scale_color_kani() + 
  theme_minimal() + 
  theme(
    plot.title = element_text(face = "bold", size = rel(1.8), family = "Merriweather"),
    plot.subtitle = element_text(size = rel(1.2), family = "Merriweather Light", margin = margin(0,0,20,0)),
    text = element_text(family = "Noto Sans CJK JP Light"),
    axis.title.x = element_text(margin = margin(20, 0, 0, 0)),
    axis.text = element_text(size = rel(1)),
    legend.position = "top",
    panel.grid.minor = element_blank(),
    legend.text = element_text(size = rel(0.8))
  ) + 
  labs(
    title = "Population growth rates in countries\ndifferentiated by income",
    y = "Population per million",
    x = "Year",
    color = "",
    subtitle = "Middle income countries have been experiencing\nhigher population growth than other countries"
  )

Population Growth Rate

Studying total population numbers is great, but what’s even useful is to look at the rate by which the population changes in regions. The population growth rate of a country can be defined as the rate at which the number of individuals changes over a period of time expressed as a percentage of the population at the beginning of that time period.

Mathematically,

\[ Population\ growth \ rate = \frac{Pop(t_2) - Pop(t_1)}{Pop(t_1)(t_2 - t_1)} \]

where,

\(t_1\) and \(t_2\) are beginning and end times of the time period. In our data these are successive years so the difference is always 1.

and

\(Pop(t)\) is the number of individuals at time \(t\).

We can use the lag() function in dplyr to calculate the yearly growth rate for each country/region in the dataset. As an example, we can see the population growth rate of the world starting from 1950 as shown in this plot:

population %>%
  filter(Country == "WORLD") %>%
  mutate(growth_rate = population/lag(population, 1) - 1) %>%
  ggplot(aes(year, growth_rate)) +
  geom_line(size = 1, color = "#f15c5c") + 
  scale_y_continuous(labels = percent_format(), limits = c(0, 0.022)) +
  scale_x_continuous(breaks = pretty_breaks(n = 6)) + 
  theme_minimal() + 
  theme(
    plot.title = element_text(face = "bold", size = rel(1.8), family = "Merriweather"),
    plot.subtitle = element_text(size = rel(1.2), family = "Merriweather Light", margin = margin(0,0,20,0)),
    text = element_text(family = "Noto Sans CJK JP Light"),
    axis.title.x = element_text(margin = margin(20, 0, 0, 0)),
    axis.text = element_text(size = rel(1)),
    panel.grid.minor = element_blank()
  ) + 
  labs(
    x = "Year",
    y = "Population Growth Rate (%)",
    title = "Population Growth Rate of the World",
    subtitle = "Average yearly change in population between 1950-2015"
  )

But this was for one region in the entire dataset! How can we fit this model for all regions? Easy, we just use map() from the purrr package which lets us extend a function to different kinds of groups within the data which in this case are countries/regions. This can be done by first nesting all the yearly population changes for each country as a dataframe, fitting the desired function for each country, and then unnesting to get rates for all countries.

growth_rate <- function(df) {
  return(df %>% transmute(growth_rate = population/lag(population, 1) - 1))
}

population_growth <- population %>%
  group_by(Country) %>%
  nest() %>%
  mutate(growth = map(data, growth_rate)) %>%
  unnest()

population_growth
## # A tibble: 18,018 x 5
##    Country  code  year population growth_rate
##    <chr>   <int> <dbl>      <dbl>       <dbl>
##  1 WORLD     900  1950    2536275     NA     
##  2 WORLD     900  1951    2583817      0.0187
##  3 WORLD     900  1952    2630584      0.0181
##  4 WORLD     900  1953    2677230      0.0177
##  5 WORLD     900  1954    2724302      0.0176
##  6 WORLD     900  1955    2772243      0.0176
##  7 WORLD     900  1956    2821383      0.0177
##  8 WORLD     900  1957    2871952      0.0179
##  9 WORLD     900  1958    2924081      0.0182
## 10 WORLD     900  1959    2977825      0.0184
## # ... with 18,008 more rows

Let’s look at the first plot in this post, but from the perspective of population growth rate:

population_growth %>%
  filter(str_detect(Country, "WORLD|income")) %>%
  ggplot(aes(year, growth_rate, group = Country, color = Country)) +
  geom_line(size = 1) + 
  scale_y_continuous(breaks = seq(0, 0.03, by = 0.005), limits = c(0, 0.03), labels = percent_format()) +
  scale_x_continuous(breaks = pretty_breaks(n = 6)) +
  scale_color_kani() + 
  theme_minimal() + 
  theme(
    plot.title = element_text(face = "bold", size = rel(1.8), family = "Merriweather"),
    plot.subtitle = element_text(size = rel(1.2), family = "Merriweather Light", margin = margin(0,0,20,0)),
    text = element_text(family = "Noto Sans CJK JP Light"),
    axis.title.x = element_text(margin = margin(20, 0, 0, 0)),
    axis.text = element_text(size = rel(1)),
    legend.position = "top",
    panel.grid.minor = element_blank(),
    legend.text = element_text(size = rel(0.8))
  ) + 
  labs(
    title = "Population growth rates in countries\ndifferentiated by income",
    y = "Population Growth Rate (%)",
    x = "Year",
    color = "",
    subtitle = "As the world population growth rate falls,\nlow income countries are experiencing higher growth rates."
  )

We see that while the low-income countries line was at the bottom of the chart in the first plot indicating their population numbers havent gone up by much, they still experience the highest percentage changes in their population. Low income counties started at 1.4% growth rate and then jumped up to being the highest in comparison to countries with higher income, 2.7%. This is mostly because of a dual effect: high birth rates and presence of a younger population compared to the rest, but I will most probably explore this further in future posts.

Doubling Times

We now focus on doubling times, or the metric which looks at how long does it take for a region to double its population. This is important because the onset of modernity (starting in mid 20th century), something that brought in better standards of living and health has resulted in a rapid population growth, but that historical growth has now slowed down greatly. The peak growth rate was in 1960s at about 2.1% and has since fallen to about half of that. It would be interesting to see how long it took for the population to double in the 60s versus now.

Mathematically, the doubling time for a given year can be given as follows:

\[ Doubling \ Time = \frac{\ln(2)}{r_t} \]

Where \(r\) is the growth rate of the region at time \(t\). We assume that human population growth follows a exponential curve that explains the \(\ln(2)\) component.

We can now use this and fit it to all regions described in the dataset.

population_rates <- population_growth %>%
  group_by(Country) %>%
  nest() %>%
  mutate(doubling_time = map(data, function(df) {return(log(2)/df$growth_rate)})) %>%
  unnest()

population_rates
## # A tibble: 18,018 x 6
##    Country doubling_time  code  year population growth_rate
##    <chr>           <dbl> <int> <dbl>      <dbl>       <dbl>
##  1 WORLD            NA     900  1950    2536275     NA     
##  2 WORLD            37.0   900  1951    2583817      0.0187
##  3 WORLD            38.3   900  1952    2630584      0.0181
##  4 WORLD            39.1   900  1953    2677230      0.0177
##  5 WORLD            39.4   900  1954    2724302      0.0176
##  6 WORLD            39.4   900  1955    2772243      0.0176
##  7 WORLD            39.1   900  1956    2821383      0.0177
##  8 WORLD            38.7   900  1957    2871952      0.0179
##  9 WORLD            38.2   900  1958    2924081      0.0182
## 10 WORLD            37.7   900  1959    2977825      0.0184
## # ... with 18,008 more rows

Let’s look at the doubling times of countries based differentiated by income levels as an example:

population_rates %>%
  filter(str_detect(Country, "WORLD|income")) %>%
  ggplot(aes(year, doubling_time, group = Country, color = Country)) +
  geom_line(size = 1) + 
  scale_y_continuous(breaks = seq(0, 150, by = 25), limits = c(0, 150)) +
  scale_x_continuous(breaks = pretty_breaks(n = 6)) +
  scale_color_kani() + 
  theme_minimal() + 
  theme(
    plot.title = element_text(face = "bold", size = rel(1.8), family = "Merriweather"),
    plot.subtitle = element_text(size = rel(1.2), family = "Merriweather Light", margin = margin(0,0,20,0)),
    text = element_text(family = "Noto Sans CJK JP Light"),
    axis.title.x = element_text(margin = margin(20, 0, 0, 0)),
    axis.text = element_text(size = rel(1)),
    legend.position = "top",
    panel.grid.minor = element_blank(),
    legend.text = element_text(size = rel(0.8))
  ) + 
  labs(
    title = "Population Doubling times in the world",
    subtitle = "Higher income countries take the longest time to double their\npopulation while the lower income ones take the least time",
    y = "Doubling time in years",
    x = "Year",
    color = ""
  )

Visualizing Growth and Doubling Times in Different regions

So far, we’ve seen growth rates in countries grouped together in bins or buckets based on income levels, what if we wanted to decompose these and actually look at countries? We can always select a bunch of countries and show them in a single graph, or even make separate graphs and show them in the same plot as different boxes using facet_wrap().

This is great, but it can also mask regional patterns, what if all Scandinavian countries experienced similar trends? What is an intelligent way to group them together? One way is to manually do it, but this is where the geofacet package comes into play. With the geofacet package, one can create a grid as we will see below and pre-define the positions of each country/region so that they can mimic a world map!

As an example, we look at European Countries:

europe_grid <- data.frame(
  row = c(1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9),
  col = c(1, 4, 5, 6, 7, 1, 2, 5, 7, 8, 4, 5, 6, 7, 8, 2, 3, 4, 5, 6, 7, 8, 1, 2, 4, 5, 6, 7, 8, 4, 6, 7, 8, 7, 8, 6, 7, 8),
  code = c("ISL", "NOR", "SWE", "FIN", "EST", "IRL", "GBR", "DEN", "LAT", "RUS", "NLD", "DEU", "POL", "LTU", "BLR", "FRA", "BEL", "LUX", "AUT", "CZE", "SVK", "UKR", "PRT", "ESP", "CHE", "SVN", "HUN", "ROU", "MDA", "ITA", "HRV", "SRB", "BGR", "MNE", "MKD", "BIH", "ALB", "GRC"),
  name = c("Iceland", "Norway", "Sweden", "Finland", "Estonia", "Ireland", "United Kingdom", "Denmark", "Latvia", "Russian Federation", "Netherlands", "Germany", "Poland", "Lithuania", "Belarus", "France", "Belgium", "Luxembourg", "Austria", "Czechia", "Slovakia", "Ukraine", "Portugal", "Spain", "Switzerland", "Slovenia", "Hungary", "Romania", "Republic of Moldova", "Italy", "Croatia", "Serbia", "Bulgaria", "Montenegro", "TFYR Macedonia", "Bosnia and Herzegovina", "Albania", "Greece"),
  stringsAsFactors = FALSE
)

euro_facets <- population_rates %>%
  filter(Country %in% europe_grid$name) %>%
  ggplot(aes(year, growth_rate, group = Country)) + 
  geom_line(color = "#79bd9a", size = 1) + 
  scale_y_continuous(labels = percent_format()) +
  facet_geo(~Country, grid = europe_grid) + 
  theme_kani() + 
  theme(
    plot.title = element_text(face = "bold", size = rel(1.8), family = "Merriweather"),
    plot.subtitle = element_text(size = rel(1.2), family = "Merriweather Light", margin = margin(0,0,20,0)),
    text = element_text(family = "Noto Sans CJK JP Light"),
    axis.title.x = element_text(margin = margin(20, 0, 0, 0)),
    axis.text = element_text(size = rel(1)),
    panel.grid.minor = element_blank(),
    plot.background = element_rect(fill = "white"),
    panel.background = element_rect(fill = "white"),
    strip.background = element_rect(fill = "white"),
    strip.text.x = element_text(face = "bold")
  ) + 
  labs(
    title = "Population growth rates in Europe",
    y = "Population Growth Rate (%)",
    x = "",
    color = "",
    subtitle = "Europe has been facing a bit of a population decline. "
  )

ggsave("../../static/img/eu_population_growth.png", euro_facets, height = 15, width = 20)

euro_facets

Check enlarged version. We see that most of Europe is beginning to enter the population decline phase, there is a small upward trend in some countries, but this is mostly because of the mass-migration. Most of Europe has already entered the phase of population decline.

What about the doubling times in South America?

south_america_grid <- data.frame(
  row = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4),
  col = c(1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4),
  code = c("COL", "VEN", "GUY", "SUR", "GUF", "ECU", "PER", "BOL", "BRA", "CHL", "PRY", "URY", "ARG"),
  name = c("Colombia", "Venezuela (Bolivarian Republic of)", "Guyana", "Suriname", "French Guiana", "Ecuador", "Peru", "Bolivia (Plurinational State of)", "Brazil", "Chile", "Paraguay", "Uruguay", "Argentina"),
  stringsAsFactors = FALSE
)

sa_facets <- population_rates %>%
  filter(Country %in% south_america_grid$name) %>%
  ggplot(aes(year, doubling_time, group = Country)) + 
  geom_line(color = "#8283a7", size = 1) + 
  scale_x_continuous(breaks = seq(1950, 2010, length = 5)) +
  scale_y_continuous(breaks = pretty_breaks(7)) +
  facet_geo(~Country, grid = south_america_grid, scales = "free") + 
  theme_kani() + 
  theme(
    plot.title = element_text(face = "bold", size = rel(1.8), family = "Merriweather"),
    plot.subtitle = element_text(size = rel(1.2), family = "Merriweather Light", margin = margin(0,0,20,0)),
    text = element_text(family = "Noto Sans CJK JP Light"),
    axis.title.x = element_text(margin = margin(20, 0, 0, 0)),
    axis.text = element_text(size = rel(1)),
    panel.grid.minor = element_blank(),
    plot.background = element_rect(fill = "white"),
    panel.background = element_rect(fill = "white"),
    strip.background = element_rect(fill = "white"),
    strip.text.x = element_text(face = "bold", size = rel(1.1))
  ) + 
  labs(
    title = "Population Doubling Times in South America",
    x = "",
    y = "Doubling Time in years",
    color = "",
    subtitle = "More stable trend for larger countries, less so for the smaller ones"
  )

ggsave("../../static/img/sa_doubling.png", sa_facets, height = 12, width = 16)

sa_facets

Check enlarged version. The population boom between 60s and 80s did affect most of South America since there was a decline in the doubling times (for at least the larger countries, by size). This was also a time when the fastest doubling of the world population happened, from 2.5 billion people to 5 billion people in just 37 years (1950 - 1987)! The UN projections with the most likely scenario (SSP2) indicate that by 2088, it will take another 100 years for the world population to double (Our World in Data, 2015).

Conclusion

This was a simple post that introduces some helpful rates and measures to understand population change in the world. The visualizations in the post showed how countries with different income levels (as categorized by the UN) differ in their respective population growth rates as well as doubling times, and then we further decomposed these groupings by plotting the country specific measures using geofacet.

It is exciting to see what the future holds in terms of population changes and hope to continue working with more complex demographic data to produce interesting analyses to blog about! I am very happy to get feedback on this post so please reach out to me via Twitter if you have any comments to make!

tweet Share