Hello! For this week’s exercise, I will explore the gapminder dataset that contains health and income outcomes from 184 countries between 1960 and 2016. This exercise will further evaluate any association between life expectancy and infant mortality or population size.

First, let’s load all R packages that will be used for this exercise

## load R packages
library(tidyverse) # tidyverse offers great tools for data wrangling and visualization
library(scales) # to access break formatting functions
library(dslabs) # to use the gapminder data from this package

I then loaded the gapminder data and checked the data structure, summary and class

#Use help to access to the documentation page for gapminder data
help(gapminder)

## starting httpd help server ... done

#get an overview of data structure
str(gapminder)

## 'data.frame':    10545 obs. of  9 variables:
##  $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
##  $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
##  $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
##  $ population      : num  1636054 11124892 5270844 54681 20619075 ...
##  $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
##  $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
##  $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...

#Here is the data summary
summary(gapminder)

##                 country           year      infant_mortality life_expectancy
##  Albania            :   57   Min.   :1960   Min.   :  1.50   Min.   :13.20  
##  Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
##  Angola             :   57   Median :1988   Median : 41.50   Median :67.54  
##  Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31   Mean   :64.81  
##  Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
##  Armenia            :   57   Max.   :2016   Max.   :276.90   Max.   :83.90  
##  (Other)            :10203                  NA's   :1453                    
##    fertility       population             gdp               continent   
##  Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907  
##  1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052  
##  Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679  
##  Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223  
##  3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684  
##  Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                  
##  NA's   :187     NA's   :185         NA's   :2972                       
##              region    
##  Western Asia   :1026  
##  Eastern Africa : 912  
##  Western Africa : 912  
##  Caribbean      : 741  
##  South America  : 684  
##  Southern Europe: 684  
##  (Other)        :5586

#determine the type of object gapminder is
class(gapminder)

## [1] "data.frame"

Processing data

Before working on some data visualization and fitting model. I ran a couple lines of code to create new objects and check their data structure and summary. First of all, I use subset function to select only African countries by specifying where the continent variable is “Africa”, and named the object africadata. Next, I ran str() and summary() to see the structure and summary of both objects.

#assign African countries to a new object
africadata <- subset(gapminder, continent == "Africa")

#create two new objects
imr.le <- africadata[,c("infant_mortality","life_expectancy")]#first one contains "infant_mortality" and "life_expectancy"
pop.le <- africadata[,c("population","life_expectancy")]#first one contains "population " and "life_expectancy"

#get an overview of the data
str(africadata)

## 'data.frame':    2907 obs. of  9 variables:
##  $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
##  $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $ infant_mortality: num  148 208 187 116 161 ...
##  $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
##  $ fertility       : num  7.65 7.32 6.28 6.62 6.29 6.95 5.65 6.89 5.84 6.25 ...
##  $ population      : num  11124892 5270844 2431620 524029 4829291 ...
##  $ gdp             : num  1.38e+10 NA 6.22e+08 1.24e+08 5.97e+08 ...
##  $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...

#check the summary of the data
summary(africadata)

##          country          year      infant_mortality life_expectancy
##  Algeria     :  57   Min.   :1960   Min.   : 11.40   Min.   :13.20  
##  Angola      :  57   1st Qu.:1974   1st Qu.: 62.20   1st Qu.:48.23  
##  Benin       :  57   Median :1988   Median : 93.40   Median :53.98  
##  Botswana    :  57   Mean   :1988   Mean   : 95.12   Mean   :54.38  
##  Burkina Faso:  57   3rd Qu.:2002   3rd Qu.:124.70   3rd Qu.:60.10  
##  Burundi     :  57   Max.   :2016   Max.   :237.40   Max.   :77.60  
##  (Other)     :2565                  NA's   :226                     
##    fertility       population             gdp               continent   
##  Min.   :1.500   Min.   :    41538   Min.   :4.659e+07   Africa  :2907  
##  1st Qu.:5.160   1st Qu.:  1605232   1st Qu.:8.373e+08   Americas:   0  
##  Median :6.160   Median :  5570982   Median :2.448e+09   Asia    :   0  
##  Mean   :5.851   Mean   : 12235961   Mean   :9.346e+09   Europe  :   0  
##  3rd Qu.:6.860   3rd Qu.: 13888152   3rd Qu.:6.552e+09   Oceania :   0  
##  Max.   :8.450   Max.   :182201962   Max.   :1.935e+11                  
##  NA's   :51      NA's   :51          NA's   :637                        
##                        region   
##  Eastern Africa           :912  
##  Western Africa           :912  
##  Middle Africa            :456  
##  Northern Africa          :342  
##  Southern Africa          :285  
##  Australia and New Zealand:  0  
##  (Other)                  :  0

#take a look at data structure and summary for the new object imr.le
str(imr.le)

## 'data.frame':    2907 obs. of  2 variables:
##  $ infant_mortality: num  148 208 187 116 161 ...
##  $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...

summary(imr.le)

##  infant_mortality life_expectancy
##  Min.   : 11.40   Min.   :13.20  
##  1st Qu.: 62.20   1st Qu.:48.23  
##  Median : 93.40   Median :53.98  
##  Mean   : 95.12   Mean   :54.38  
##  3rd Qu.:124.70   3rd Qu.:60.10  
##  Max.   :237.40   Max.   :77.60  
##  NA's   :226

#take a look at data structure and summary for the new object pop.le
str(pop.le)

## 'data.frame':    2907 obs. of  2 variables:
##  $ population     : num  11124892 5270844 2431620 524029 4829291 ...
##  $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...

summary(pop.le)

##    population        life_expectancy
##  Min.   :    41538   Min.   :13.20  
##  1st Qu.:  1605232   1st Qu.:48.23  
##  Median :  5570982   Median :53.98  
##  Mean   : 12235961   Mean   :54.38  
##  3rd Qu.: 13888152   3rd Qu.:60.10  
##  Max.   :182201962   Max.   :77.60  
##  NA's   :51

Plotting

I used ggplot() to create two separate scatter plots from the objects I just created, The first one plotted life expectancy against infant mortality, while the second one plotted life expectancy against population size.

#scatter plot of infant mortality and life expectancy from 1960 to 2016
imr.le %>% 
  ggplot(aes(x=infant_mortality, y=life_expectancy)) +
  geom_point() +
  #add plot title
  ggtitle("Relationship between infant mortality and life expectancy for African countries\n from 1960 to 2016") + 
  #change x-axis label
  xlab("Infant mortality (death per 1000)") + 
  #change y-axis label
  ylab("Life expectancy (years)")

## Warning: Removed 226 rows containing missing values (geom_point).

The plot shows that there was a negative correlation between infant mortality and life expectancy.

#scatterplot of population and life expectancy from 1960 to 2016
pop.le %>% 
  ggplot(aes(x=population, y=life_expectancy)) +
  geom_point() +
  #transform and reformat x-axis to log scale
  scale_x_continuous(trans = 'log10',
                     labels = trans_format("log10", math_format(10^.x))) + 
  #add plot title
  ggtitle("Relationship between population size and life expectancy for African countries\n from 1960 to 2016") + 
  #change x-axis label
  xlab("Population (log scale)") + 
  #change y-axis label
  ylab("Life expectancy (years)")

## Warning: Removed 51 rows containing missing values (geom_point).

We can see that there were some streaks following a positive and linear trend between population and life expectancy. This could result from the source that mixed data from different countries across multiple years, so my next step is to try to avoid this noise and only look into the data from a particular year.

More data processing and plotting

I grouped the data by year and summarized how many missing data in infant mortality across each year, then I filtered out only those years with missing data.

#write code that figures out which years have missing data for infant mortality
africadata %>% 
      group_by(year) %>% 
      summarise(infant_mortality_na = sum(is.na(infant_mortality))) %>% 
      filter(infant_mortality_na > 0) #missing up to 1981 and 2016

## # A tibble: 23 x 2
##     year infant_mortality_na
##    <int>               <int>
##  1  1960                  10
##  2  1961                  17
##  3  1962                  16
##  4  1963                  16
##  5  1964                  15
##  6  1965                  14
##  7  1966                  13
##  8  1967                  11
##  9  1968                  11
## 10  1969                   7
## # ... with 13 more rows

There were some missing data in infant mortality from 1960 to 1981 and 2016 (n=226). Hence, I will only use the data from 2000 to create the same two plots again.

#create new object to extract data for 2000
africadata2000 <- africadata[africadata$year == 2000,]

#take a look at data structure and summary
str(africadata2000)

## 'data.frame':    51 obs. of  9 variables:
##  $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
##  $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ infant_mortality: num  33.9 128.3 89.3 52.4 96.2 ...
##  $ life_expectancy : num  73.3 52.3 57.2 47.6 52.6 46.7 54.3 68.4 45.3 51.5 ...
##  $ fertility       : num  2.51 6.84 5.98 3.41 6.59 7.06 5.62 3.7 5.45 7.35 ...
##  $ population      : num  31183658 15058638 6949366 1736579 11607944 ...
##  $ gdp             : num  5.48e+10 9.13e+09 2.25e+09 5.63e+09 2.61e+09 ...
##  $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...

summary(africadata2000)

##          country        year      infant_mortality life_expectancy
##  Algeria     : 1   Min.   :2000   Min.   : 12.30   Min.   :37.60  
##  Angola      : 1   1st Qu.:2000   1st Qu.: 60.80   1st Qu.:51.75  
##  Benin       : 1   Median :2000   Median : 80.30   Median :54.30  
##  Botswana    : 1   Mean   :2000   Mean   : 78.93   Mean   :56.36  
##  Burkina Faso: 1   3rd Qu.:2000   3rd Qu.:103.30   3rd Qu.:60.00  
##  Burundi     : 1   Max.   :2000   Max.   :143.30   Max.   :75.00  
##  (Other)     :45                                                  
##    fertility       population             gdp               continent 
##  Min.   :1.990   Min.   :    81154   Min.   :2.019e+08   Africa  :51  
##  1st Qu.:4.150   1st Qu.:  2304687   1st Qu.:1.274e+09   Americas: 0  
##  Median :5.550   Median :  8799165   Median :3.238e+09   Asia    : 0  
##  Mean   :5.156   Mean   : 15659800   Mean   :1.155e+10   Europe  : 0  
##  3rd Qu.:5.960   3rd Qu.: 17391242   3rd Qu.:8.654e+09   Oceania : 0  
##  Max.   :7.730   Max.   :122876723   Max.   :1.329e+11                
##                                                                       
##                        region  
##  Eastern Africa           :16  
##  Western Africa           :16  
##  Middle Africa            : 8  
##  Northern Africa          : 6  
##  Southern Africa          : 5  
##  Australia and New Zealand: 0  
##  (Other)                  : 0

#scatterplot of infant mortality and life expectancy only for year of 2000
africadata2000 %>% 
  ggplot(aes(x=infant_mortality, y=life_expectancy)) +
  geom_point() +
  ggtitle("Relationship between infant mortality and life expectancy for African countries\n in 2000") + #add plot title
  xlab("Infant mortality (death per 1000)") + #change x-axis label
  ylab("Life expectancy (years)") #change y-axis label

We still observed a negative correlation between infant mortality and life expectancy from the data in 2000.

#scatterplot of population and life expectancy only for year of 2000
africadata2000 %>% 
  ggplot(aes(x=population, y=life_expectancy)) +
  geom_point() +
  scale_x_continuous(trans = 'log10',
              labels = trans_format("log10", math_format(10^.x))) + #transform and reformat x-axis to log scale
  ggtitle("Relationship between population size and life expectancy for African countries\n in 2000") + #add plot title
  xlab("Population (log scale)") + #change x-axis label
  ylab("Life expectancy (years)") #change y-axis label

However, we no longer saw any apparent correlation between population size and life expectancy when we looked into only the data in 2000.

Model fit

The final step is to fit two linear regression models by regressing life expectancy on infant mortality and population size separately.

# fit a linear model to regress life expectancy on infant mortality using only 2000 data
fit1 <- lm(life_expectancy ~ infant_mortality, data = africadata2000)
summary(fit1)

## 
## Call:
## lm(formula = life_expectancy ~ infant_mortality, data = africadata2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.6651  -3.7087   0.9914   4.0408   8.6817 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      71.29331    2.42611  29.386  < 2e-16 ***
## infant_mortality -0.18916    0.02869  -6.594 2.83e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.221 on 49 degrees of freedom
## Multiple R-squared:  0.4701, Adjusted R-squared:  0.4593 
## F-statistic: 43.48 on 1 and 49 DF,  p-value: 2.826e-08

Conclusion

The significant p-value (p=<0.0001) indicated a linear relationship between infant mortality and life expectancy, and the coefficient showed that the life expectancy goes down by 0.19 for every 1 unit increase in infant mortality.

# fit a linear model to regress life expectancy on population size using only 2000 data
fit2 <- lm(life_expectancy ~ population, data = africadata2000)
summary(fit2)

## 
## Call:
## lm(formula = life_expectancy ~ population, data = africadata2000)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.429  -4.602  -2.568   3.800  18.802 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.593e+01  1.468e+00  38.097   <2e-16 ***
## population  2.756e-08  5.459e-08   0.505    0.616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.524 on 49 degrees of freedom
## Multiple R-squared:  0.005176,   Adjusted R-squared:  -0.01513 
## F-statistic: 0.2549 on 1 and 49 DF,  p-value: 0.6159

Conclusion

The insignificant p-value (p=0.6158781) and a coefficient close to zero indicated that an absence of linear association between population size and life expectancy.

More Analysis (by Savannah)

I am interested in seeing a similar analysis to what we have already done with the Africa data, but this time apply it to other continents!

I will use one additional package, broom, to help present my output, so I’ll load that now.

library(broom)

First, I want to visualize the relationship between infant mortality and life expectancy, and I’ll use colors to differentiate between the continents.

# Create an object with just the infant mortality, life expectancy, and continent data
v1smh <- 
  select(gapminder, infant_mortality, life_expectancy, continent)

# Plot the data with different colors being different continents
ggplot(data = v1smh, aes(infant_mortality, life_expectancy, color = continent)) +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 1453 rows containing non-finite values (stat_smooth).

## Warning: Removed 1453 rows containing missing values (geom_point).

This is pretty messy, and doesn’t show much, so I want to try and visualize it a different way.

# Plotting the data the same way, this time with different continents each having their own plots 
ggplot(data = v1smh, aes(infant_mortality, life_expectancy)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ continent)

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 1453 rows containing non-finite values (stat_smooth).

## Warning: Removed 1453 rows containing missing values (geom_point).

This is a little more clear, and shows the same basic trend (which we would expect) - as infant mortality increases, life expectancy decreases. One noticible difference between the continents is Africa and Asia seem to have increased infant mortality as compared to the others.

Now I’ll do the same thing with the population and life expectancy.

#Create another object, now with population instead of infant mortality 
v2smh <- 
  select(gapminder, population, life_expectancy, continent)

# Plot the data with different colors being different continents, with population on a log scale 
ggplot(data = v2smh, aes(population, life_expectancy, color = continent)) +
  scale_x_continuous(trans = 'log10') +
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 185 rows containing non-finite values (stat_smooth).

## Warning: Removed 185 rows containing missing values (geom_point).

Again, this is a MESS. Let’s try it again.

# Plot life expectancy against log population
ggplot(data = v2smh, aes(population, life_expectancy)) +
  scale_x_continuous(trans = 'log10') +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ continent) +
  xlab('Population (log scale)') +
  ylab('Life Expectancy')

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 185 rows containing non-finite values (stat_smooth).

## Warning: Removed 185 rows containing missing values (geom_point).

Definitely more readable, but as with our first run-through on the analysis, there doesn’t seem to be a true trend. I’m going to focus on this for an actual analysis.

First I want to see if there are any years that are missing data, so that I can choose one year to focus on without missing data.

# Create object that only contains missing population information according to year and continent 
whatsmissing <- 
  gapminder %>%
  select(year, population, continent) %>%
  filter(is.na(population))

2016 appears to be the only year with missing population data, so for consistency’s sake, I’ll use the year 2000 again.

#Create an object containing population, life expectancy, continent, and year data, and then narrow that down to only data from the year 2000 
poplife2000 <- 
  select(gapminder, population, life_expectancy, continent, year) %>%
  filter(year == 2000)

Now I’ll recreate the plots with the year 2000 data.

#Plot life expectancy against log population for the year 2000
ggplot(data = poplife2000, aes(population, life_expectancy)) +
  scale_x_continuous(trans = 'log10') +
  geom_point() +
  geom_smooth() +
  facet_wrap(~ continent) +
  xlab('Population (log scale)') +
  ylab('Life Expectancy')

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Less cluttered, but not super clear. I’ll run a simple linear regression to see if there’s any significant relationship overall.

# Fit a simple linear model with population and life expectancy across all continents 
fitall <- lm(life_expectancy ~ population, data = poplife2000)
summary(fitall)

## 
## Call:
## lm(formula = life_expectancy ~ population, data = poplife2000)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30.19  -6.09   3.01   6.91  13.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.779e+01  7.303e-01  92.826   <2e-16 ***
## population  3.197e-11  5.677e-09   0.006    0.996    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.613 on 183 degrees of freedom
## Multiple R-squared:  1.734e-07,  Adjusted R-squared:  -0.005464 
## F-statistic: 3.173e-05 on 1 and 183 DF,  p-value: 0.9955

tidy(fitall)

## # A tibble: 2 x 5
##   term        estimate     std.error statistic   p.value
##   <chr>          <dbl>         <dbl>     <dbl>     <dbl>
## 1 (Intercept) 6.78e+ 1 0.730          92.8     7.42e-156
## 2 population  3.20e-11 0.00000000568   0.00563 9.96e-  1

If we look at the relationship overall, there is definitely no significant relationship here. What if we take it to individual continents?

# Create object with just Africa from year 2000 data
afr2000 <-
  poplife2000 %>%
  filter(continent == "Africa") 

# Fit simple linear model with population and life expectancy for Africa 
fitafr <- lm(life_expectancy ~ population, data = afr2000)
summary(fitafr)

## 
## Call:
## lm(formula = life_expectancy ~ population, data = afr2000)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.429  -4.602  -2.568   3.800  18.802 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.593e+01  1.468e+00  38.097   <2e-16 ***
## population  2.756e-08  5.459e-08   0.505    0.616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.524 on 49 degrees of freedom
## Multiple R-squared:  0.005176,   Adjusted R-squared:  -0.01513 
## F-statistic: 0.2549 on 1 and 49 DF,  p-value: 0.6159

tidy(fitafr)

## # A tibble: 2 x 5
##   term             estimate    std.error statistic  p.value
##   <chr>               <dbl>        <dbl>     <dbl>    <dbl>
## 1 (Intercept) 55.9          1.47            38.1   4.51e-38
## 2 population   0.0000000276 0.0000000546     0.505 6.16e- 1

# Create object with just Americas from year 2000 data
ame2000 <-
  poplife2000 %>%
  filter(continent == "Americas") 

# Fit simple linear model with population and life expectancy for Americas 
fitame <- lm(life_expectancy ~ population, data = ame2000)
summary(fitame)

## 
## Call:
## lm(formula = life_expectancy ~ population, data = ame2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5126  -2.4481   0.9723   2.1696   6.6652 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.209e+01  7.293e-01  98.846   <2e-16 ***
## population  1.454e-08  1.224e-08   1.188    0.243    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.027 on 34 degrees of freedom
## Multiple R-squared:  0.03986,    Adjusted R-squared:  0.01162 
## F-statistic: 1.411 on 1 and 34 DF,  p-value: 0.2431

tidy(fitame)

## # A tibble: 2 x 5
##   term             estimate    std.error statistic  p.value
##   <chr>               <dbl>        <dbl>     <dbl>    <dbl>
## 1 (Intercept) 72.1          0.729            98.8  2.06e-43
## 2 population   0.0000000145 0.0000000122      1.19 2.43e- 1

# Create object with just Asia from year 2000 data
asi2000 <-
  poplife2000 %>%
  filter(continent == "Asia") 

# Fit simple linear model with population and life expectancy for Asia 
fitasi <- lm(life_expectancy ~ population, data = asi2000)
summary(fitasi)

## 
## Call:
## lm(formula = life_expectancy ~ population, data = asi2000)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.695  -4.915   1.533   4.466  10.732 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.082e+01  9.298e-01  76.164   <2e-16 ***
## population  -4.377e-09  3.775e-09  -1.159    0.252    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.058 on 45 degrees of freedom
## Multiple R-squared:  0.02901,    Adjusted R-squared:  0.007432 
## F-statistic: 1.344 on 1 and 45 DF,  p-value: 0.2524

tidy(fitasi)

## # A tibble: 2 x 5
##   term        estimate     std.error statistic  p.value
##   <chr>          <dbl>         <dbl>     <dbl>    <dbl>
## 1 (Intercept)  7.08e+1 0.930             76.2  3.29e-49
## 2 population  -4.38e-9 0.00000000378     -1.16 2.52e- 1

# Create object with just Europe from year 2000 data
eur2000 <-
  poplife2000 %>%
  filter(continent == "Europe") 

# Fit simple linear model with population and life expectancy for Europe 
fiteur <- lm(life_expectancy ~ population, data = eur2000)
summary(fiteur)

## 
## Call:
## lm(formula = life_expectancy ~ population, data = eur2000)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2647 -3.3597  0.5178  3.1744  5.5239 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.564e+01  7.586e-01  99.707   <2e-16 ***
## population  -2.730e-08  2.212e-08  -1.234    0.225    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.979 on 37 degrees of freedom
## Multiple R-squared:  0.03954,    Adjusted R-squared:  0.01358 
## F-statistic: 1.523 on 1 and 37 DF,  p-value: 0.2249

tidy(fiteur)

## # A tibble: 2 x 5
##   term             estimate    std.error statistic  p.value
##   <chr>               <dbl>        <dbl>     <dbl>    <dbl>
## 1 (Intercept) 75.6          0.759            99.7  1.40e-46
## 2 population  -0.0000000273 0.0000000221     -1.23 2.25e- 1

# Create object with just Oceania from year 2000 data
oci2000 <-
  poplife2000 %>%
  filter(continent == "Oceania") 

# Fit simple linear model with population and life expectancy for Africa 
fitoci <- lm(life_expectancy ~ population, data = oci2000)
summary(fitoci)

## 
## Call:
## lm(formula = life_expectancy ~ population, data = oci2000)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2796  -4.3414   0.8319   4.5676   9.5473 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.649e+01  2.144e+00  31.020 2.85e-11 ***
## population  6.115e-07  3.668e-07   1.667    0.126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.678 on 10 degrees of freedom
## Multiple R-squared:  0.2175, Adjusted R-squared:  0.1392 
## F-statistic: 2.779 on 1 and 10 DF,  p-value: 0.1265

tidy(fitoci)

## # A tibble: 2 x 5
##   term            estimate   std.error statistic  p.value
##   <chr>              <dbl>       <dbl>     <dbl>    <dbl>
## 1 (Intercept) 66.5         2.14            31.0  2.85e-11
## 2 population   0.000000611 0.000000367      1.67 1.26e- 1

Unfortunately, it doesn’t look like there are relationships, at least at this level. However, I think looking for these things is important, even if you don’t find any statistical relationships! And it’s always fun to do quick analyses to check!

Tzu-Chun’s R coding Exercise

Processing data

Plotting

More data processing and plotting

Model fit

Conclusion

Conclusion

More Analysis (by Savannah)