These are the projects and labs completed for Data Visualization (STAT 302) - Spring 2020.

Favorite Labs
Lab 05
Lab 08

L01 ggplot overview

Overview

The goals of this lab are to (1) ensure that the major software for this course is properly installed and functional, (2) develop and follow a proper workflow, and (3) work together to construct a few plots to explore a dataset using ggplot2 — demonstration of the utility and power of ggplot2.

Don’t worry if you cannot do everything here by yourself. You are just getting started and the learning curve is steep, but remember that the instructional team and your classmates will be there to provide support. Persevere and put forth an honest effort and this course will payoff.

Load Packages tidyverse, ggstance, skimr

# Load package(s) 
suppressPackageStartupMessages(library(tidyverse))
#suppressPackageStartupMessages(library(ggstance))
suppressPackageStartupMessages(library(skimr))

Dataset

We’ll be using data from the lego package which is already in the /data subdirectory, along with many other processed datasets, as part of the zipped folder for this lab.

Exercise 1

Let’s look at some interesting patterns in the history of LEGO! We’ll be using data from the lego package located data/legosets.rda. We will work through this exercise together in class.

load(file = "data/legosets.rda")

1a Inspect the data

The lego package provides a helpful dataset some interesting variables. Let’s take a quick look at the data.

#print the top of the data sets
#head(legosets)

#focus of the variables 
glimpse(legosets)

## Rows: 6,172
## Columns: 14
## $ Item_Number  <chr> "10246", "10247", "10248", "10249", "10581", "10582", ...
## $ Name         <chr> "Detective's Office", "Ferris Wheel", "Ferrari F40", "...
## $ Year         <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ Theme        <chr> "Advanced Models", "Advanced Models", "Advanced Models...
## $ Subtheme     <chr> "Modular Buildings", "Fairground", "Vehicles", "Winter...
## $ Pieces       <int> 2262, 2464, 1158, 898, 13, 39, 32, 105, 13, 11, 52, 13...
## $ Minifigures  <int> 6, 10, NA, NA, 1, 2, 2, 3, 2, 2, 3, 1, NA, NA, NA, NA,...
## $ Image_URL    <chr> "http://images.brickset.com/sets/images/10246-1.jpg", ...
## $ GBP_MSRP     <dbl> 132.99, 149.99, 69.99, 59.99, 9.99, 16.99, 19.99, 49.9...
## $ USD_MSRP     <dbl> 159.99, 199.99, 99.99, 79.99, 9.99, 19.99, 24.99, 59.9...
## $ CAD_MSRP     <dbl> 199.99, 229.99, 119.99, NA, 12.99, 24.99, 29.99, 69.99...
## $ EUR_MSRP     <dbl> 149.99, 179.99, 89.99, 69.99, 9.99, 19.99, 24.99, 59.9...
## $ Packaging    <chr> "Box", "Box", "Box", "Box", "Box", "Box", "Box", "Box"...
## $ Availability <chr> "Retail - limited", "Retail - limited", "LEGO exclusiv...

#short eda
skim_without_charts(legosets)

Data summary
Name	legosets
Number of rows	6172
Number of columns	14
_______________________
Column type frequency:
character	7
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique	whitespace
Item_Number	1	1	13	0	5854	0
Name	1	2	73	0	5519	4
Theme	1	4	28	0	115	0
Subtheme	1	0	32	2206	358	1
Image_URL	1	46	58	0	6172	0
Packaging	1	3	21	0	14	0
Availability	1	6	21	0	8	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
Year	0	1.00	2004.71	8.91	1971.00	2000.00	2006.00	2012.00	2015.00
Pieces	112	0.98	215.17	356.20	0.00	30.00	82.00	256.25	5922.00
Minifigures	2672	0.57	2.85	2.72	1.00	1.00	2.00	4.00	32.00
GBP_MSRP	1980	0.68	23.45	31.93	0.00	5.99	12.99	29.99	509.99
USD_MSRP	355	0.94	27.90	39.32	0.00	6.00	14.99	34.99	789.99
CAD_MSRP	4190	0.32	46.34	58.46	2.99	12.99	24.99	54.99	789.99
EUR_MSRP	4399	0.29	35.98	46.61	0.00	9.99	19.99	39.99	699.99

Notice there are a lot of missing variables, especially when it comes to pricing - this will be important for when we calculate the means.

1b Price per year

First, let’s look at the average cost of LEGO sets over time. The main variable of interest here is USD_MSRP, or the manufacturer’s suggested retail price in constant dollars (i.e. not adjusted for inflation).

#`drop` NA's at top  
df.1b <- legosets %>% 
  drop_na(USD_MSRP) %>% 
  group_by(Year) %>% 
  summarize(
    AveragePrice = mean(USD_MSRP)
  )

## `summarise()` ungrouping output (override with `.groups` argument)

#`remove` NA's in mean function
#equivalent to `drop` NAs at top for this task 
#df.1b <- legosets %>%
#  group_by(Year) %>%
#  summarize(
#    AverageCost = mean(USD_MSRP, na.rm=TRUE)
#    )

ggplot(df.1b, aes(Year, AveragePrice))+
  geom_point(size=1.25)+
  geom_line(size=1)+
  ggtitle("LEGO Sets: Average Price vs Year")+
  theme_minimal()+
  labs( #lables on the plot
     x = "Year"
   , y = "Average Price (USD)"
   , subtitle = "sets without price are removed from calculation of average price"
   , caption = "Source: LEGO"
    )

Is the increase in price simply due to inflation? Could get data from federal resersve and plot inflation-adjusted price vs year.

1c Pieces per year

Next, let’s look at how the number of pieces per set has changed over time. Because Duplo sets are much smaller (since they’re designed for toddlers), we’ll make a special indicator variable for them.

All Years

df.1c <- legosets %>% 
  mutate(IsDuplo = ifelse(Theme == "Duplo", "Duplo", "Lego")) %>%
  drop_na(Pieces) %>% 
  group_by(Year, IsDuplo) %>% 
  summarize(
     AveragePieces = mean(Pieces)
    ,num_sets = n()
    )

## `summarise()` regrouping output by 'Year' (override with `.groups` argument)

#same plot 
ggplot(df.1c, aes(Year, AveragePieces, color=IsDuplo))+
  geom_point()+
  geom_line()+
  ggtitle("Lego Sets: Average Pieces vs Year")+
  labs( #lables on the plot
     x = "Year"
   , y = "Average Pieces"
   , subtitle = "sets without pieces are removed from calculation of average pieces"
   , caption = "Source: LEGO"
    )

#facet plots
ggplot(df.1c, aes(Year, AveragePieces, color=IsDuplo))+
  facet_wrap(~IsDuplo, ncol=2, scales="fixed")+
  geom_point()+
  geom_line()+
  ggtitle("Lego Sets: Average Pieces vs Year")+
  labs( #lables on the plot
     x = "Year"
   , y = "Average Pieces"
   , subtitle = "sets without pieces are removed from calculation of average pieces"
   , caption = "Source: LEGO"
    )+
  theme(
    legend.position = "none"
  )

Data after 1985

df.1c <- legosets %>% 
  mutate(IsDuplo = ifelse(Theme == "Duplo", "Duplo", "Lego")) %>%
  drop_na(Pieces) %>% 
  filter(Year > 1985) %>%
  group_by(Year, IsDuplo) %>% 
  summarize(
     AveragePieces = mean(Pieces)
    ,num_sets = n()
    )

## `summarise()` regrouping output by 'Year' (override with `.groups` argument)

#same plot 
ggplot(df.1c, aes(Year, AveragePieces, color=IsDuplo))+
  geom_point()+
  geom_line()+
  ggtitle("Lego Sets: Average Pieces vs Year")+
  labs( #lables on the plot
     x = "Year"
   , y = "Average Pieces"
   , subtitle = "sets without pieces are removed from calculation of average pieces"
   , caption = "Source: LEGO"
    )

#facet plots
ggplot(df.1c, aes(Year, AveragePieces, color=IsDuplo))+
  facet_wrap(~IsDuplo, ncol=2, scales="fixed")+
  geom_point()+
  geom_line()+
  ggtitle("Lego Sets: Average Pieces vs Year")+
  labs( #lables on the plot
     x = "Year"
   , y = "Average Pieces"
   , subtitle = "sets without pieces are removed from calculation of average pieces"
   , caption = "Source: LEGO"
    )+
  theme(
    legend.position = "none"
  )

1d LEGO set themes

In the 1990s, LEGO began partnering with famous brands and franchises to boost its own sales. First, let’s see how many different “themes” LEGO now offers:

length(unique(legosets$Theme))

## [1] 115

legosets %>% 
  distinct(Theme) %>% 
  summarize ( n_themes = n())

## # A tibble: 1 x 1
##   n_themes
##      <int>
## 1      115

#counts of the themes 
df.1d <- legosets %>% 
  #count data, automatically labeled 'n' 
  count(Theme, sort = TRUE
        , name ="theme_count" #personalize name if wnat 
        )  %>% 
  #turning character chr into factor fct -- meaning a categorical variable 
  mutate(Theme = fct_inorder(Theme, ordered = TRUE)) 



df.1d %>% 
  filter(theme_count > 150, Theme != 'Duplo') %>% #only look at top ones
  ggplot(aes(x=fct_rev(Theme), y= theme_count)) +
    geom_col()+
    coord_flip()+
    labs(y="Number of Sets", x=NULL)

L02 aesthetics

Overview

The goal of this lab is to begin the process of unlocking the power of ggplot2 through constructing and experimenting with a few basic plots.

Datasets

We’ll be using data from the blue_jays.rda dataset which is already in the /data subdirectory in our data_vis_labs project. Below is a description of the variables contained in the dataset.

BirdID - ID tag for bird
KnownSex - Sex coded as F or M
BillDepth - Thickness of the bill measured at the nostril (in mm)
BillWidth - Width of the bill (in mm)
BillLength - Length of the bill (in mm)
Head - Distance from tip of bill to back of head (in mm)
Mass - Body mass (in grams)
Skull - Distance from base of bill to back of skull (in mm)
Sex - Sex coded as 0 = female or 1 = male

We’ll also be using a subset of the BRFSS (Behavioral Risk Factor Surveillance System) survey collected annually by the Centers for Disease Control and Prevention (CDC). The data can be found in the provided cdc.txt file — place this file in your /data subdirectory. The dataset contains 20,000 complete observations/records of 9 variables/fields, described below.

genhlth - How would you rate your general health? (excellent, very good, good, fair, poor)
exerany - Have you exercised in the past month? (1 = yes, 0 = no)
hlthplan - Do you have some form of health coverage? (1 = yes, 0 = no)
smoke100 - Have you smoked at least 100 cigarettes in your life time? (1 = yes, 0 = no)
height - height in inches
weight - weight in pounds
wtdesire - weight desired in pounds
age - in years
gender - m for males and f for females

Notice we are setting a seed. This signifies we will be doing something that relies on a random process (e.g., random sampling). In order for our results to be reproducible we set the seed. This ensures that every time you run the code or someone else does, it will produce the exact same output. It is good coding etiquette to set the seed towards the top of your document/code.

# Set the seed for reproducibility
set.seed(31412718)

# Load package(s)
library(tidyverse)

Exercises

Complete the following exercises.

Exercise 1

Using blue_jay dataset, construct the following scatterplots of Head by Mass:

One with the color aesthetic set to Northwestern purple (#4E2A84), shape aesthetic set a solid/filled triangle, and size aesthetic set to 2.
One using Sex or KnownSex mapped to the color aesthetic. That is, determine which is more appropriate and explain why. Also set the size aesthetic to 2.

Consider the color aesthetic in the plots for (1) and (2). Explain why these two usages of the color aesthetic are meaningfully different.

load(file = "data/blue_jays.rda")
head(blue_jays)

##       BirdID KnownSex BillDepth BillWidth BillLength  Head  Mass Skull Sex
## 1 0000-00000        M      8.26      9.21      25.92 56.58 73.30 30.66   1
## 2 1142-05901        M      8.54      8.76      24.99 56.36 75.10 31.38   1
## 3 1142-05905        M      8.39      8.78      26.07 57.32 70.25 31.25   1
## 4 1142-05907        F      7.78      9.30      23.48 53.77 65.50 30.29   0
## 5 1142-05909        M      8.71      9.84      25.47 57.32 74.90 31.85   1
## 6 1142-05911        F      7.28      9.30      22.25 52.25 63.90 30.00   0

Exercise 1.1

ggplot(blue_jays, aes(Mass, Head))+
  geom_point(color="#4E2A84", shape=17, size=2)+
  ggtitle("Blue Jays: Head vs Mass")

Exercise 1.2

ggplot(blue_jays, aes(Mass, Head, color=KnownSex))+
  geom_point(size=2)+
  ggtitle("Blue Jays: Head vs Mass")

The first plot uses colors for a purley subjective reason, a preferance for purple, but does not have any aesthetic effect on the data. The second plot uses the aesthetic color to section the data into two groups (Female and Male); this plot is using color as part of the mapping of the data to aesthetics.

Exercise 2

Using a random subsample of size 100 from the cdc dataset (code provided below), construct a scatterplot of weight by height. Construct 5 more scatterplots of weight by height that make use of aesthetic attributes color and shape (maybe size too). You can define both aesthetics at the same time in each plot or one at a time. Just experiment. — Should be six total plots.

# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor")
  ))

# Selecting a random subset of size 100
cdc_small <- cdc %>% sample_n(100)

#plot 1 plain
ggplot(cdc_small, aes(height, weight))+
  geom_point(size=2, alpha=0.5)+
  ggtitle("Weight by Height: Plain")

#plot 2 gender
ggplot(cdc_small, aes(height, weight, shape=gender, color=gender))+
  geom_point(size=2, alpha=.5)+
  ggtitle("Weight by Height: Gender")

#plot 3 general health
ggplot(cdc_small, aes(height, weight, color=genhlth, shape=genhlth))+
  geom_point(size=2)+
  ggtitle("Weight by Height: General Health")

#plot 4 age
ggplot(cdc_small, aes(height, weight, color=age, size=age))+
  geom_point()+
  ggtitle("Weight by Height: Age")

#plot 5 smoking
ggplot(cdc_small, aes(height, weight, color=as.factor(smoke100)))+
  geom_point(size=2)+
  ggtitle("Weight by Height: Smoking")+
  scale_color_discrete(
     name="Smoked at least \n100 cigarettes"
    ,label=c("No", "Yes")
    )

#plot 5 healthcare coverage
ggplot(cdc_small, aes(height, weight, color=as.factor(hlthplan)))+
  geom_point(size=2)+
  ggtitle("Weight by Height: Healthcare")+
  scale_color_discrete(name="Health Care \nCoverage", label=c("No", "Yes"))

L03 categories & geom_smooth

# Load package(s)
library(tidyverse)
library(lubridate) #for ymd() function

Overview

The goal of this lab is to continue the process of unlocking the power of ggplot2 through constructing and experimenting with a few basic plots.

Datasets

We’ll be using data from the BA_degrees.rda and dow_jones_industrial.rda datasets which are already in the /data subdirectory in our data_vis_labs project. Below is a description of the variables contained in each dataset.

BA_degrees.rda

field - field of study
year_str - academic year (e.g. 1970-71)
year - closing year of academic year
count - number of degrees conferred within a field for the year
perc - field’s percentage of degrees conferred for the year

load(file="data/BA_degrees.rda")

dow_jones_industrial.rda

date - date
open - Dow Jones Industrial Average at open
high - Day’s high for the Dow Jones Industrial Average
low - Day’s low for the Dow Jones Industrial Average
close - Dow Jones Industrial Average at close
volume - number of trades for the day

load(file="data/dow_jones_industrial.rda")

genhlth - How would you rate your general health? (excellent, very good, good, fair, poor)
exerany - Have you exercised in the past month? (1 = yes, 0 = no)
hlthplan - Do you have some form of health coverage? (1 = yes, 0 = no)
smoke100 - Have you smoked at least 100 cigarettes in your life time? (1 = yes, 0 = no)
height - height in inches
weight - weight in pounds
wtdesire - weight desired in pounds
age - in years
gender - m for males and f for females

cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor")
  ))

Exercises

Exercise 1

The following exercises use the BA_degrees data set.

# Wrangling for plotting
ba_dat <- BA_degrees %>% 
  # mean % per field
  group_by(field) %>% 
  mutate(mean_perc = mean(perc)) %>% 
  # Only fields with mean >= 5%
  filter(mean_perc >= 0.05) %>%
  # Organizing for plotting
  arrange(desc(mean_perc), year) %>% 
  ungroup() %>%
  mutate(field = fct_inorder(field))

#take a look at the data to see variable names and types
head(ba_dat)

## # A tibble: 6 x 6
##   field    year_str  year  count  perc mean_perc
##   <fct>    <chr>    <dbl>  <dbl> <dbl>     <dbl>
## 1 Business 1970-71   1971 115396 0.137     0.204
## 2 Business 1975-76   1976 143171 0.155     0.204
## 3 Business 1980-81   1981 200521 0.214     0.204
## 4 Business 1985-86   1986 236700 0.240     0.204
## 5 Business 1990-91   1991 249165 0.228     0.204
## 6 Business 1995-96   1996 226623 0.195     0.204

Plot 1

#percent of degrees conferred by year
ggplot(ba_dat, aes(year, perc)) + 
  geom_line(size=1) + 
  #make chart for each field 
  facet_wrap(~field) + 
  xlab("Year") + 
  ylab("Proportion of degrees")

Plot 2

#percent of degrees conferred by year
ggplot(ba_dat, aes(year, perc)) + 
   #colored area may make comparison between field easier  
  geom_area(alpha=0.5, fill="red") +
  #line on areaa (may want a different color )
  geom_line(color="red", size=1) + 
  #make chart for each field 
  facet_wrap(~field) + 
  xlab("Year") + 
  ylab("Proportion of degrees")

Plot 3

#percent of degrees conferred by year
ggplot(ba_dat
       , aes(
           year
         , perc
         #sample plot, differnet line color by field 
         , color=field 
         )
       ) + 
  geom_line(size=1) + 
  xlab("Year") + 
  ylab("Proportion of degrees") + 
  #change title on legend
  labs(color="Field")

Exercise 2

The following exercises use the dow_jones-industrial data set.

# Restrict data to useful range
djia_date_range <- dow_jones_industrial %>%
  filter(date >= ymd("2008/12/31") & date <= ymd("2010/01/10"))

#take a look at the data to see variable names and types
head(djia_date_range)

## # A tibble: 6 x 6
##   date        open  high   low close    volume
##   <date>     <dbl> <dbl> <dbl> <dbl>     <int>
## 1 2008-12-31 8666. 8843. 8665. 8776. 226760000
## 2 2009-01-02 8772. 9065. 8761. 9035. 213700000
## 3 2009-01-05 9027. 9034. 8892. 8953. 233760000
## 4 2009-01-06 8955. 9088. 8941. 9015. 215410000
## 5 2009-01-07 8997. 8997. 8720. 8770. 266710000
## 6 2009-01-08 8770. 8770. 8651. 8742. 226620000

Plot 1

#closing price of DJIA by date 
ggplot(djia_date_range, aes(date, close)) + 
  geom_line(size = 1, color="purple") + 
  geom_smooth(
      method = "loess"
    , formula = y ~ x
    #line color
    , color = "green" 
    #se shading color 
    , fill = "red" 
    ) +
  xlab(NULL) + 
  ylab("Dow Jones Industrial Average")

Plot 2

#closing price of DJIA by date 
ggplot(djia_date_range, aes(date, close)) + 
  geom_line(size=1) + 
  geom_smooth(
      method="loess"
    , formula= y ~ x
    #wiggliness for loess curve  
    , span=0.3 
    #don't show se 
    , se=FALSE 
    ) + 
  xlab(NULL) + 
  ylab("Dow Jones Industrial Average")

Plot 3

ggplot(djia_date_range, aes(date, close)) + 
  geom_line(size=1) + 
  geom_smooth(
      method="lm"
      #splines makes the line more curvy and less 'chopped up' 
      #splines use piecewise polynomials
      #lowess is 'local regression'
    , formula= y ~ splines::ns(x, 6)
    , span=0.3
    , se = FALSE
    ) + 
  xlab(NULL) + 
  ylab("Dow Jones Industrial Average")

Exercise 3

The following exercises use the cdc dataset.

#take a look at the data to see variable names and types
head(cdc)

## # A tibble: 6 x 9
##   genhlth   exerany hlthplan smoke100 height weight wtdesire   age gender
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>  <dbl>    <dbl> <dbl> <chr> 
## 1 good            0        1        0     70    175      175    77 m     
## 2 good            0        1        1     64    125      115    33 f     
## 3 good            1        1        1     60    105      105    49 f     
## 4 good            1        1        0     66    132      124    42 f     
## 5 very good       0        1        0     61    150      130    55 f     
## 6 very good       1        1        0     64    114      114    55 f

Plot 1

Using cdc and `geom_bar()

ggplot(cdc, aes(genhlth)) + 
  #count genhlth 
  geom_bar()

Using genhlth_count and geom_bar(stat="identity")

#count genhlth and put into new data set 
genhlth_count <- cdc %>%
  count(genhlth)

#take a look at the data to see variable names and types
genhlth_count

## # A tibble: 5 x 2
##   genhlth       n
##   <fct>     <int>
## 1 excellent  4657
## 2 very good  6972
## 3 good       5675
## 4 fair       2019
## 5 poor        677

ggplot(genhlth_count, aes(genhlth, n)) + 
  #use stat='identity' to plot actual values; rather than counting values 
  geom_bar(stat="identity") + 
  ylab("count")

Using genhlth_count and geom_col()

ggplot(genhlth_count, aes(genhlth, n)) + 
  #geom_col will authomatically plot values 
  # where as geom_bar requires stat="identity" to do so 
  geom_col() + 
  ylab("count")

Plot 2

ggplot(cdc, aes(genhlth, fill=as.factor(hlthplan))) + 
  geom_bar(position="dodge") + 
  #change title on legend
  labs(fill="Health Plan")

Plot 3

ggplot(cdc
       , aes(
         #variable used to  calculate density 
           weight
          #color of area
         , fill = genhlth 
         #color of line
         , color = genhlth 
         )
       ) + 
  geom_density(alpha = 0.2) + 
  # create plots for each gender 
  facet_wrap(~gender)

Plot 4

The weight variable should have a lower limit of 50 and an upper limit of 300.

Fix x-axis limits

ggplot(cdc, aes(weight, fill = gender, color = gender)) + 
  geom_density(alpha = 0.5) + 
  facet_wrap(~genhlth, scales = "fixed") + 
  xlim(50, 300)

Filter cdc df to weight values [50, 300]

cdc %>%
  #filter data to only include values within the limts 
  filter(weight >= 50, weight <= 300) %>%
#pip directly into ggplot 
ggplot(aes(weight, fill = gender, color=gender)) + 
  geom_density(alpha=0.5) + 
  facet_wrap(~genhlth, scales = "fixed")

Plot 5

ggplot(cdc, aes(height, gender,  fill = gender)) + 
  geom_boxplot(alpha = 0.4) + 
  facet_wrap(~genhlth, nrow = 1) + 
  coord_flip()

Plot 6

ggplot(cdc, 
       aes(
           height
         , weight
         #put color here, inherited to geom_point and geom_smooth as well 
         , color = gender
         )
       ) + 
  geom_point(alpha = 0.2) + 
  geom_smooth(
      method = "lm"
    , formula = y~x
    , se = FALSE
    #extend line to cover whole plot 
    , fullrange = TRUE 
    )

L04 geom_text & annotation

Overview

The goal of this lab is to continue the process of unlocking the power of ggplot2 through constructing and experimenting with a few basic plots.

Load Packages: tidyverse, gridExtra, ggrepel

# Load package(s)
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(ggrepel))

knitr::opts_chunk$set(dpi = 300)

Datasets

We’ll be using data from the blue_jays.rda, tech_stocks.rda, corruption.rda, and cdc.txt datasets which are already in the /data subdirectory in our data_vis_labs project.

# Load datasets
load(file = "data/blue_jays.rda")
load(file = "data/tech_stocks.rda")
load(file = "data/corruption.rda")

# Read in the cdc dataset
cdc <- read_delim(
  file = "data/cdc.txt"
  , delim = "|"
  ) %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor")
  ))

Exercises

Exercise 1

The following plot uses the blue_jays.rda dataset.

#create caption that automatically grabs number of blue jays  
caption <- paste("Head length versus body mass for", nrow(blue_jays), "blue jays")

#add string wrap 
#will break the caption into multiple lines if longer than 40 characters
# '\n' is the line break code 
caption_print <- paste(strwrap(caption, 40), collapse ="\n") 

#create data set for top head size for each sex
topHead <- blue_jays %>% 
  #arrange largest to smallest
  arrange(desc(Head)) %>%
  # group by sex
  group_by(KnownSex) %>% 
  #take the top 2 head sizes  for each group 
  top_n(n = 2, wt = Head)

#'M' label will be put on the top male head size 
#'F' label will be put on the 2nd top female head size 
Labels <- topHead[c(1,4),]

#ANOTHER OPTIONS: lable dataframe 
#search by BirdID
Labels_anotheroption <- blue_jays  %>% 
  #select specific bird where you want the labels 
  filter(BirdID %in% c("1142-05914", "702-90567"))

#get range for x and y variables 
xrng <- range(blue_jays$Mass)
yrng <- range(blue_jays$Head)

#head length by body mass 
ggplot(blue_jays, aes(Mass, Head, color = KnownSex)) +
  geom_point(alpha = 0.6, size = 2) +
  annotate(
     "text"
    #put text in the top left corner of plot 
    , x = xrng[1], y = yrng[2]
    #label is the caption already create
    , label = caption
    #left justify 
    , hjust = 0
    # bottom justify 
    , vjust = 1
    #size the font
    , size = 4
    ) +
  xlab("Body mass (g)") +
  ylab("Head length (mm)") +
  #remove all legends; to remove just one legend put show.legend = FALSE into geom
  theme(legend.position = "none") +
  # add labels 
  geom_text(
    #use labels data set 
      data = Labels
    , aes(label = KnownSex)
    #nudget labels to the right 
    , nudge_x = 0.5
    )

Exercise 2

The following plots use the tech_stocks dataset.

Plot 1

#create caption 
caption <- paste("Stock price over time for four major tech companies")

#add string wrap 
#will break the caption into multiple lines if longer than 40 characters
# '\n' is the line break code 
caption_print <- paste(strwrap(caption, 40), collapse = "\n")  

#create labels df with most recent stock values 
Labels <- tech_stocks %>%
  ungroup() %>%
  #order by date 
  arrange(desc(date)) %>% 
  # return distinct company
  distinct(company
           #keep all variables not just company 
           , .keep_all = TRUE
           ) 

#get range for x and y variables 
xrng <- range(tech_stocks$date)
yrng <- range(tech_stocks$price_indexed)

tech_stocks <- tech_stocks %>%
  ungroup()

#stock price by date 
ggplot(tech_stocks, aes(date, price_indexed)) +
  #put color in geom_line so labels are not colored
  geom_line(aes(color = company)) +
  # remove all legends 
  theme(legend.position  =  "none") +
  #remove x label 
  xlab(NULL) +
  ylab("Stock price, indexed") +
  annotate(
    "text"
    #put text in the top left corner of plot 
    , x = xrng[1], y = yrng[2]
     #label is the caption already create
    , label = caption
    #left justify 
    , hjust = 0
    # bottom justify 
    , vjust = 1
    #use serif font 
    , family = "serif"
    #size the font
    , size = 4
    ) +
  #add company labels to the most recent stock price 
  geom_text(data = Labels, aes(label = company))

Plot 2

#stock price by date 
ggplot(tech_stocks, aes(date, price_indexed)) +
  #put color in geom_line so labels are not colored
  geom_line(aes(color = company)) +
  # remove all legends 
  theme(legend.position  =  "none") +
  #remove x label 
  xlab(NULL) +
  ylab("Stock price, indexed") +
  annotate(
    "text"
    #put text in the top left corner of plot 
    , x = xrng[1], y = yrng[2]
     #label is the caption already create
    , label = caption
    #left justify 
    , hjust = 0
    # bottom justify 
    , vjust = 1
    #use serif font 
    , family = "serif"
    #size the font
    , size = 4
    ) +
  #add company labels to the most recent stock price 
  geom_text_repel(
      data = Labels
    , aes(label = company)
    #padding around label 
    , box.padding = 0.6
    , min.segment.length = 0
    #right align text (so doesn't go off the plot)
    , hjust = 1
    #set seed for repel iterations 
    , seed = 9876
    )

Exercise 3

The following plot uses the corruption.rda dataset.

#look at 2015 data 
corruption_2015 <- corruption %>%
  #remove countries that don't have cpi or hdi values
  drop_na(cpi, hdi) %>%
  #only look at 2015 data 
  filter(year == 2015)

#the contries we want to label 
CountriesToLabel <- c(
    "Niger"
  , "Iraq"
  , "China"
  , "Ghana"
  , "Argentina"
  , "Chile"
  , "Japan"
  , "United States"
  , "Singapore"
  )

#create labels data frame 
Labels <- corruption_2015 %>%
  #only select specific countries already specified 
  filter( country %in% CountriesToLabel)

#hdi by cpi 
ggplot(corruption_2015, aes(cpi, hdi)) +
  #put color in geom_line so labels are not colored, color by geom 
  geom_point(aes(color = region), alpha = 0.6, size = 2) +
    geom_smooth(
        method = "lm"
      # use log(x) when modeling 
      , formula =  y~log(x)
      # specify grey color for line 
      , color = "grey60"
      # remove standard error bars 
      , se = FALSE
      ) +
  xlab("Corruption Perceptions Index, 2015 (100 = least corrupt)") +
  #add y label; '\n' creates line break 
  ylab("Human Development Index, 2015\n (1.0 = most developed)") +
  #add title
  ggtitle("Corruption and human development (2015)")+
  #add labels to specific countries 
  geom_text_repel(
      data = Labels
    , aes(label = country)
    #set seed for repel iterations
    , seed  =  9876
    , box.padding  =  0.6
    , min.segment.length = 0
    )

Exercise 4

The next plot uses the cdc dataset.

Using Bilbo Baggins’ responses below to the CDC BRSFF questions, add Bilbo’s data point as a transparent (0.5) solid red circle of size 4 to a scatterplot of weight by height with transparent (0.1) solid blue circles of size 2 as the plotting characters. In addition, label the point with his name in red. Left justify and rotate the label so it reads vertically from bottom to top — shift it up by 10 pounds too. Plot should use appropriately formatted axis labels. Remember that the default shape is a solid circle.

genhlth - How would you rate your general health? fair
exerany - Have you exercised in the past month? 1=yes
hlthplan - Do you have some form of health coverage? 0=no
smoke100 - Have you smoked at least 100 cigarettes in your life time? 1=yes
height - height in inches: 46
weight - weight in pounds: 120
wtdesire - weight desired in pounds: 120
age - in years: 45
gender - m for males and f for females: m

Hint: Create a new dataset (maybe call it bilbo or bilbo_baggins) using either data.frame() (base R - example in book) or tibble() (tidyverse - see help documentation for the function). Make sure to use variable names that exactly match cdc’s variable names

bilbo <- tibble(
  genhlth = "fair",
  exerancy = 1, 
  hlthplan = 0, 
  smoke100 = 1, 
  height = 46, 
  weight = 120, 
  wtdesire = 120, 
  age = 45, 
  gender = "m"
)

ggplot(cdc, aes(height, weight)) +
  geom_point(alpha = 0.1, color = "blue", size = 2) +
  ggtitle("Bilbo Baggins is on the small side") +
  labs(
      x = "Height (inches)"
    , y = "Weight (pounds)"
    , subtitle = "CDC Data: Weight by Height"
  ) +
  #add Bilbo Baggins special red point onto graph 
  geom_point(data = bilbo, color = "red", size = 4, alpha = 0.5) +
  #add sing label 
  geom_text(
      data = bilbo
    , aes(label = "Bilbo Baggins")
    , color = "red"
    #left justify label 
    , hjust = 0 
    #rotate label 
    , angle = 90
    #nudge label up 10 pounds 
    , nudge_y = 10
    )

L05 2d surface & geospatial

Overview

The goal of this lab is to explore more useful plots in ggplot2. Specifically we will be focusing on surface plots and geospatial plots (maps).

Challenges are not mandatory for students to complete. We highly recommend students attempt them though. We would expect graduate students to attempt the challenges.

# Load package(s)

library(tidyverse)
library(gridExtra)

 #for geom_hex 
library(hexbin)
library(maps)

#https://github.com/hrbrmstr/statebins
library(statebins)

#color pallet https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html
library(viridis) 

#for geom_sf; will not work with discrete fill unless the package is loaded 
library(sf)

#for challegne with census data 
library(tidycensus)
library(mapview)
library(tigris)

library(scales) #percent() 

knitr::opts_chunk$set(dpi = 300)

Datasets

We’ll be using data from the blue_jays.rda, cdc.txt, and the US_income.rda datasets which are already in the /data subdirectory in our data_vis_labs project.

# Load datasets
load(file = "data/blue_jays.rda")
load(file = "data/US_income.rda")

# Read in the cdc dataset
cdc <- read_delim(
  file = "data/cdc.txt"
  , delim = "|"
  #add col types 
  , col_types = cols(
                      genhlth = col_character(),
                      exerany = col_double(),
                      hlthplan = col_double(),
                      smoke100 = col_double(),
                      height = col_double(),
                      weight = col_double(),
                      wtdesire = col_double(),
                      age = col_double(),
                      gender = col_character()
                    )
  ) %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor")
  ))

Exercises

Complete the following exercises.

Exercise 1

The following plot uses the blue_jays.rda dataset.

#body mass by head
ggplot(blue_jays, aes(Mass, Head)) +
  #scatter plot 
  geom_point(alpha = 1/3, size = 1.5) +
  #2D density plot 
  geom_density_2d(
    #set binwidth for density plot
      binwidth = 0.004
    , color = "black"
    , size = 0.4
    ) +
  #set limits on x-axis 
  xlim(57, 82) +
  # add labels 
  labs(
      x = "Body Mass (g)"
    , y = "Head length (mm)"
  ) +
  #add minimal theme; white background on plot
  theme_minimal()

Exercise 2

The following plots uses the cdc dataset

Plot 1

#set base plot that can be used for plot1&plot2 
cdc_plot <- ggplot(cdc, aes(height, weight)) +
  #set axis titles 
  labs(
      x = "Height (in)"
    , y = "Weight (lbs)"
  )+
  #add minimal theme; white background on plot
  theme_minimal()

#2d density plot with hexagons
# requires `hexbins` package 
cdc_plot + geom_hex(bins=35)

Plot 2

cdc_plot +
  stat_density_2d(
    #make shading based on density 
    # can do after_stat(level) or ..level..
      aes(fill = ..level..)
    #fill density plot (solid not just lines)
    , geom = "polygon"
    #don't show legend 
    , show.legend = FALSE
    ) +
  #separate plots for gender
  facet_wrap(~ gender)

Exercise 3

The following plot is made using the maps library.

#requires `maps` package  

#the best state in the union 
mn <- map_data("county", "minnesota") %>% 
  select(long, lat, group, id = subregion)

ggplot(mn, aes(x = long, y = lat)) +
  #add title+
  ggtitle("Minnesota") +
  #draws the shapes 
  geom_polygon(aes(group = group), fill = "white" , color = "grey35") +
  #helps size correctely 
  coord_quickmap() +
  #shows just the plotted features 
  theme_void()

Exercise 4

The following plots use the US_income dataset.

# Setting income levels
US_income <- mutate(
  US_income,
  income_bins = cut(
    ifelse(is.na(median_income), 25000, median_income),
    breaks = c(0, 40000, 50000, 60000, 70000, 80000),
    labels = c("< $40k", "$40k to $50k", "$50k to $60k", "$60k to $70k", "> $70k"),
    right = FALSE
  )
)

Plot 1

ggplot(US_income) +
  # 'simple features', need `sf` package
  geom_sf(
      aes(
        #specify geometry shape to be used 
          geometry = geometry
        #fill based on median income 
        , fill = income_bins
        )
    #set boundary color
    , color = "grey80"
    #set boundary line size
    , size = 0.2
    ) +
  #scale color fill to viridis color pallet 
  viridis::scale_fill_viridis(
    #change to continous scale
      discrete = TRUE
    #change legend name
    , name = "Median\nIncome"
    ) +
  #helps size correctely 
  coord_sf() +
  #shows just the plotted features 
  theme_void()

Plot 2

ggplot(US_income, aes(state = name, fill = income_bins)) +
  # need `statebins` package  
  geom_statebins() +
  #scale color fill to viridis color pallet 
  viridis::scale_fill_viridis(
    #change to continous scale
      discrete = TRUE
    #change legend name
    , name = "Median\nIncome"
    ) +
  theme_statebins()

#want to make each bin an equal square? can add
#  coord_equal()

Challenge(s)

The following plots use the tidycensus package and few others, as well as using these directions.

Try using a different geographical area and a different variable from the ACS.

#GET CENSUS KEY
#install census key; ONLY NEED TO DO THIS ONCE 
census_api_key("your census data api", install = TRUE, overwrite = TRUE)

#get census data api here:  https://api.census.gov/data/key_signup.html
#run above; it is now stored and can be accessed using 
Sys.getenv("CENSUS_API_KEY")

# https://walkerke.github.io/tidycensus/articles/basic-usage.html  

#look through variables
var <- load_variables(2017, "acs5", cache = TRUE)
#then use View(var) to filter

Plot 1: Manhatten and Household Median Income

# SIZE: out.width="100%", fig.height=2.5
# https://walkerke.github.io/tidycensus/articles/spatial-data.html
options(tigris_use_cache = TRUE)

ny <- get_acs(geography = "tract", 
              variables = "B19013_001", #household median income
              state = "NY", 
              county = "New York", 
              geometry = TRUE, 
              cb = FALSE)

#remove polygons over the water 
st_erase <- function(x, y){
  st_difference(x, st_union(y))
}  

ny_water <- area_water("NY", "New York", class = "sf")
NY <- st_erase(ny, ny_water) 

mapview(NY
        , zcol = "estimate"
        , legend = TRUE
        #change size of lines 
        , lwd = 0.2)

Plot 2: Twin Cities and Percentage of Renter-occupied Units

# SIZE: out.width="100%", fig.height=2.5
options(tigris_use_cache = TRUE)

twin_cities <- c("Hennepin", "Ramsey")
vars <- c(
    renters =  "B07013_003" #total renter-occupied units 
  , total   =  "B07013_001" #TOTAL 
)

#get values 
mn_val <- get_acs(
          geography = "tract" 
        , variables =  vars
        , state = "MN", 
        , county = twin_cities
        )

#great df with new values
#can't get pivot_wider to work when geometry = TRUE
#put together values seperatly and then add to mn_geom
MN_val <- mn_val %>%
  pivot_wider(
    names_from = variable, 
    values_from = c(estimate, moe)
      ) %>%
  mutate(rent_perc = round(estimate_renters / estimate_total, 2)) %>%
  select(GEOID, rent_perc)

#get geometric files 
MN_geom <- get_acs(
          geography = "tract" 
        , variables =  vars[1]
        , state = "MN", 
        , county = twin_cities
        , geometry = TRUE
        )

MN <- inner_join(MN_geom, MN_val, by = "GEOID")

mapview(MN
        , zcol = "rent_perc"
        , legend = TRUE
        #make lines thinner 
        , lwd = 0.2
        , layer.name = "Est % of Renters"
        )

L06 errorsbars & layers

Overview

The goal of this lab is to explore more plots in ggplot2. Specifically we will be focusing on error bars for uncertainty and practice using multiple layers.

Datasets

We’ll be using the cows.rda, cdc.txt, and tech_stocks.rda datasets which are already in the /data subdirectory in our data_vis_labs project. We will also be using the mpg dataset which comes packaged with ggplot2 — use ?ggplot2::mpg to access its codebook.

#Load packages 
library(tidyverse) 

# Set seed
set.seed(9876)

#chunk options
knitr::opts_chunk$set(dpi = 300)

# load data sets  
load(file = "data/cows.rda") 
load(file = "data/tech_stocks.rda")

tech_stocks <- tech_stocks %>%
  ungroup()

# Read in the cdc dataset
cdc <- read_delim(
  file = "data/cdc.txt"
  , delim = "|"
  #add col types 
  , col_types = cols(
                      genhlth = col_character(),
                      exerany = col_double(),
                      hlthplan = col_double(),
                      smoke100 = col_double(),
                      height = col_double(),
                      weight = col_double(),
                      wtdesire = col_double(),
                      age = col_double(),
                      gender = col_character()
                    )
  ) %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor")
  ))

Exercises

Exercise 1

The following plot uses the mpg dataset.

# Additional dataset for plot
class_dat <- mpg %>%
  group_by(class) %>%
  summarise(
    n = n(),
    hwy = mean(hwy),
    label = str_c("n = ", n, sep = "")
  )

ggplot(mpg, aes(class, hwy)) +
  #addjitter plot with all of the points 
  geom_jitter(width = 0.1) +
  #add red points at mean hwy mpg for each class 
  geom_point(data = class_dat, size = 5, color = "red", alpha = 0.6) +
  #add labels with number of vechicals in each class
  geom_text(data = class_dat, aes(label = label, y = 10), vjust = "indward") +
  #change theme
  theme_minimal() +
  #add axis titles 
  labs(
      x = "Vehicle Class"
    , y = "Highway miles per gallon"
  )

Exercise 2

The following plot uses the cows dataset.

# Graphic dataset
cow_means <- cows %>%
  filter(breed != "Canadian") %>%
  group_by(breed) %>%
  summarize(
    mean = mean(butterfat),
    se = sd(butterfat) / sqrt(n())
  ) %>%
  mutate(breed = fct_reorder(factor(breed), desc(mean)))

z_star <- qnorm(0.975)

ggplot(cow_means, aes(breed, mean)) +
  #add color and width 
  geom_col(fill = "#56B4E9", width = 0.7) +
  geom_errorbar(
    #add error bars
      aes(
        ymin = mean - se*z_star
      , ymax = mean + se*z_star
      )
    , width = 0.1
    ) +
  #change theme
  theme_minimal() +
  #add axis titles 
  labs(
      x = "Cattle breed"
    , y = "Mean percent butterfat\ncontent in milk"
  )

Exercise 3

The following plot uses the tech_stocks dataset.

# percentage increase data
perc_increase <- tech_stocks %>%
  arrange(desc(date)) %>%
  distinct(company, .keep_all = TRUE) %>%
  mutate(
    perc = 100 * (price - index_price) / index_price,
    label = str_c(round(perc), "%", sep = ""),
    company = fct_reorder(factor(company), perc)
  )

ggplot(perc_increase, aes(perc, company)) +
  geom_col(fill = "#56B4E9") + 
  geom_text(aes(label = label), size = 5, hjust = 1.1, color = "white") +
  #change theme
  theme_minimal() +
  #remove axis titles 
  labs(
      x = NULL
    , y = NULL
  )

Exercise 4

The following plot uses the cdc dataset.

# 95% CI for weight for genhlth, gender groups
cdc_weight_95ci <- cdc %>%
  group_by(genhlth, gender) %>%
  summarise(
    mean_wt = mean(weight),
    se = sd(weight) / sqrt(n()),
    moe = qt(0.975, n() - 1) * se
  )

ggplot(cdc_weight_95ci, aes(mean_wt, gender, color = genhlth)) +
  geom_point(position = position_dodge(width = 0.5)) +
  #add HORIZONTAL error bars 
  geom_errorbarh(
      aes(  xmin = mean_wt - moe
          , xmax = mean_wt + moe
          #change height 
          , height = 0.1
          )
    #width refers to size of dodged position; not error bars
   , position = position_dodge(width = 0.5)
    ) +
  theme_minimal() +
  labs(
      x = "Weight (lbs)"
    , y = "Gender"
    ) +
  scale_color_discrete(name = "General health\n(self reported)")

L07 scales, axes, & legends

library(knitr)
opts_chunk$set(dpi = 300)

Overview

The goal of this lab is to explore ways to manage and manipulate scales, axes, and legends within ggplot2.

Datasets

We’ll be using the tech_stocks.rda, cdc.txt, and a few toy datasets.

# Load package(s)
library(tidyverse) 
library(scales)

# Load datasets 
load(file = "data/tech_stocks.rda")

# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor")
  ))

# Set seed
set.seed(8221984)

# Selecting a random subset of size 100
cdc_small <- cdc %>% sample_n(100)

# Generating toy datasets for exercise 2
dat1 <- tibble(theta = c(0, 2 * pi))

dat2 <- tibble(
  theta = seq(0, 2 * pi, length.out = 100),
  obs = rnorm(100, sin(theta), 0.1),
  larger_than = ifelse(abs(obs) < abs(sin(theta)), "1", "0")
)

Exercises

Exercise 1

The next plot uses the tech_stocks dataset.

Hints:

Size of lines in legend is 1.3
0.75 and 0.85 will be useful

tech_stocks %>%
  ungroup() %>%
ggplot(
      aes(
          date
        , price_indexed
        #reorder based on final values in plot 
        , color = fct_reorder2(company, date, price_indexed)
      )
  ) +
  theme_minimal() +
  geom_line() +
  #change line in legends to 1.3
  #use color as that is the what is used in aes() for lines 
  guides(color = guide_legend(override.aes = list(size = 1.3))) +
  scale_y_continuous(
      name = NULL
    , breaks = seq(0, 500, 100)
    , labels = scales::dollar
    , position = "right"
  ) +
  scale_color_discrete(
      name = NULL
      #if don't reorder factors; can manually set in limits 
#    , limits = c("Facebook", "Alphabet", "Microsoft", "Apple")
  )+
  theme(
      legend.position = c(0.75, 0.85)
  ) +
  scale_x_date(
      name = NULL
    #remove extra padding on plot 
    , expand = c(0, 0)
  )+
  ggtitle("Stock price, indexed")

Exercise 2

The next plot uses the toy datasets dat1 and dat2.

Hints:

Sizes used 2 (points) and 1.3 (line)
Transparency 0.8
Colors: #56B4E9, darkgreen, & red

dat2 %>%
  mutate(sin_theta = sin(theta)) %>%
ggplot(aes(theta, obs, color = larger_than)) +
  geom_point(
      size = 2
    , alpha = 0.8
  ) +
  stat_function(
      fun =  sin
    , size = 1.3
    , color = "#56B4E9"
  ) +
  #instead of stat_funnction could use geom_line 
  #geom_line(aes(y = sin_theta), size = 1.3, color = "#56B4E9") +
  scale_color_manual(values = c("darkgreen", "red")) +
  xlab( quote(theta) ) +
  ylab( quote(sin(theta)) ) +
  theme_minimal() +
  theme(legend.position = "none")

Exercise 3

Using cdc_smallconstruct a scatterplot of weight by height with the following requirements:

Size of plotting characters should be 3.
Color and shape should both identify genhlth.
One legend for both color and shape.
Legend title should be “General Health?” with a newline starting after general.
Legend categories should be ordered from excellent (top) to poor (bottom) with each word in category capitalized in the legend.
Legend should be placed in the lower right-hand corner of the plotting area.
Color should follow the "Set1" pallete.
Shape should have a solid triangle (17) for excellent, solid circle (19) for very good, an x (4) for poor, an hollow rotated square with an x in it (9) for fair, and a solid square (15) for good.
height values should be limited between 55 and 80.
height axis should display every 5th number between 55 and 80 and be appropriately labeled (i.e. 55 in, 60 in, …, 80 in). No axis title is necessary.
weight values should be limited between 100 and 300.
weight axis should be on log base 10 scale, but still display weights in pounds starting at 100 and displaying every 25 pounds until 300. Must be appropriately labeled (i.e. 100 lbs, 125 lbs, …, 300 lbs). No axis title is necessary.
Graph title should be CDC BRFSS: Weight by Height.
Minimal theme.

#limits and breaks for HEIGHT
htmin <- 55
htmax <- 80
htdiv <- 5
htnumb <- (htmax-htmin)/htdiv
htbreaks <- seq(htmin, htmax, htdiv)

#limits and breaks for WEIGHT
wtmin <- 100
wtmax <- 300
wtdiv <- 25
wtnumb <- (wtmax-wtmin)/wtdiv
wtbreaks <-seq(wtmin, wtmax, wtdiv)

#Legend title should be "General Health?" with a newline starting after general
leg_title <- "General\nHealth?"

#with each word in category capitalized in the legend
cap_labels <- c("Excellent", "VeryGood", "Good", "Fair", "Poor")

#PLOT 
ggplot(cdc_small, aes(height, weight)) +
  
  geom_point(aes(color = genhlth, shape = genhlth), size = 3) +
  scale_color_brewer(
      name = leg_title
    , labels = cap_labels
    , palette = "Set1"
  ) +
  scale_shape_manual(
      name = leg_title
    , labels = cap_labels
    , values = c(17, 19, 15, 9, 4) 
  ) +
  scale_x_continuous(
      name = NULL
    , limits = c(htmin, htmax)
    , breaks = htbreaks
    , labels = scales::unit_format(suffix = " in", accuracy = 1)
  ) +
  scale_y_log10(
      name = NULL
    , limits = c(wtmin, wtmax)
    , breaks = wtbreaks 
    , labels = scales::unit_format(unit = "lbs")
  ) +
  theme_minimal() +
  theme(
      legend.position = c(1, 0)
    , legend.justification = c(1, 0)
  ) +
  ggtitle("CDC BRFSS: Weight by Height")

L08 positioning

library(knitr)
opts_chunk$set(dpi = 300)

Overview

The goal of this lab is to develop an understanding of facets, position, continue the exploration of other ggplot2 options/features.

Datasets

We’ll be using the titanic.rda, Aus_athletes.rda, and cdc.txt datasets.

# Load package(s)
library(tidyverse) 
library(scales)
library(cowplot)

# Load datasets
load(file = "data/titanic.rda")
load(file = "data/Aus_athletes.rda")

# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor"),
    labels = c("Excellent", "Very Good", "Good", "Fair", "Poor")
  ))

# Set seed
set.seed(8221984)

# Selecting a random subset of size 1000
cdc_small <- cdc %>% sample_n(1000)

Exercises

Exercise 1

The following plot uses the titanic.rda dataset .

Hints:

Hex colors: #D55E00D0, #0072B2D0

ggplot(titanic, aes(sex, fill = sex)) +
  geom_bar() +
  facet_grid(
    factor(survived, labels = c("died", "survived"))
    ~
    class
  ) +
  scale_fill_manual(values = c("#D55E00D0", "#0072B2D0")) +
  theme_minimal() +
  theme(legend.position = "none")

Exercise 2

Use the athletes_dat dataset — extracted from Aus_althetes.rd — to recreate the following graphic as precisely as possible. The cowplot package will be useful.

# Get list of sports played by BOTH sexes
both_sports <- Aus_athletes %>%
  distinct(sex, sport) %>%
  count(sport) %>%
  filter(n == 2) %>%
  pull(sport)

# Process data
athletes_dat <- Aus_athletes %>%
  filter(sport %in% both_sports) %>%
  mutate(sport = case_when(
    sport == "track (400m)" ~ "track",
    sport == "track (sprint)" ~ "track",
    TRUE ~ sport
    )
  )

Hints:

Build each plot separately
Use cowplot::plot_grid() to combine them
Hex values for shading: #D55E0040 and #0072B240 (bottom plot), #D55E00D0 & #0072B2D0 (for top two plots) — no alpha
Hex values for outline of boxplots: #D55E00 and #0072B2
Boxplots should be made narrower; 0.5
Legend is in top-right corner of bottom plot
Legend shading matches hex values for top two plots
Bar plot lower limit 0, upper limit 95
rcc: red blood cell count; wcc: white blood cell count
Size 3 will be useful

plotA <- ggplot(athletes_dat, aes(sex, fill = sex)) + 
  geom_bar(show.legend = FALSE)+
  scale_fill_manual(values = c("#D55E00D0", "#0072B2D0")) +
  scale_x_discrete(
      name = NULL
    , labels = c("female", "male")
  ) +
  scale_y_continuous(
      name = "number"
    , breaks = seq(0, 100, 25)
    , limits = c(0, 95)
  ) + 
  theme_minimal()

plotB <- ggplot(athletes_dat, aes(rcc, wcc)) +
  geom_point(
      aes(fill = sex)
    , shape = 21
    , color = "white"
    , size = 3
    , show.legend = FALSE
    ) + 
  scale_fill_manual(values = c("#D55E00D0", "#0072B2D0")) + 
  scale_x_continuous(name = "RBC Count") + 
  scale_y_continuous(name = "WBC count") + 
  theme_minimal()

sex_labs <- c("female", "male")

plotC <- ggplot(athletes_dat, aes(sport, pcBfat)) +
  geom_boxplot(
      aes(color = sex, fill = sex)
    , width = 0.5
    ) +
  scale_fill_manual(
      name = NULL
    , labels = sex_labs 
    , values = c("#D55E0040", "#0072B240")) + 
  scale_color_manual(
      name = NULL
    , labels = sex_labs
    , values = c("#D55E00D0", "#0072B2D0")) +
  guides(fill = guide_legend(
      ncol = 2
    , override.aes = list(
        fill = c("#D55E00D0", "#0072B2D0")
      , color = "transparent"
      )
    )
  ) +
  xlab(NULL) + 
  ylab("% body fat") + 
  theme_minimal() + 
  theme(
      legend.position = c(1, 1)
    , legend.justification = c(1, 1)
    #set top legend to zero
    , legend.margin = margin(t = 0)
  )

plot_grid(
    plot_grid(plotA, plotB, nrow = 1)
  , plotC
  , nrow = 2
)

Exercise 3

Using cdc_smallconstruct a the following graphic as precisely as possible.

Hints:

Need to do a little more data processing
Hex values: "#D55E00D0" and #0072B2D0
grey80 might be useful
size 2

cdc_small_adj <- cdc_small %>%
  mutate(wtloss = wtdesire - weight) 


ggplot(cdc_small_adj, aes(weight, wtloss)) + 
  geom_point(
      data = select(cdc_small_adj, -gender)
    , color = "grey80"
    , size = 2
  ) +
  geom_point(
      aes(color = gender)
    , size = 2
    , show.legend = FALSE
  ) + 
  scale_color_manual(values = c("#D55E00D0", "#0072B2D0")) + 
  facet_grid(
    factor(gender, labels = c("Women", "Men"))
    ~ 
    fct_rev(genhlth)
  ) + 
  xlab("Weight (lbs)") + 
  ylab( "Weight Lost/Gain in Pounds") + 
  theme_minimal()

L09 themes

library(knitr)
opts_chunk$set(dpi = 300)

Overview

The goal of this lab is to play around with the theme options in ggplot2

Datasets

We’ll be using the cdc.txt datasets.

# Load package(s)
library(tidyverse)
library(gridExtra)

#import fonts - only once, take ~5min
#font_import()

#load fonts 
extrafont::loadfonts(device = "win")

# Read in the cdc dataset
cdc <- read_delim(file = "data/cdc.txt", delim = "|") %>%
  mutate(genhlth = factor(genhlth,
    levels = c("excellent", "very good", "good", "fair", "poor"),
    labels = c("Excellent", "Very Good", "Good", "Fair", "Poor")
  ))

# Set seed
set.seed(8221984)

# Selecting a random subset of size 100
cdc_small <- cdc %>% sample_n(100)

Exercises

Exercise 1

Use the cdc_small dataset to explore several pre-set ggthemes. The code below constructs the familiar scatterplot of weight by height and stores it in plot_01. Display plot_01 to observe the default theme. Explore/apply, and display at least 7 other pre-set themes from the ggplot2 or ggthemes package. Don’t worry about making adjustments to the figures under the new themes. Just get a sense of what the themes are doing to the original figure plot_01.

There should be at least 8 plots for this task. temp1 is pictured below.

# Building plot
base <- ggplot(data = cdc_small, aes(x = height, y = weight)) +
  geom_point(size = 3, aes(shape = genhlth, color = genhlth)) +
  scale_y_continuous(
    name = "Weight in Pounds",
    limits = c(100, 300),
    breaks = seq(100, 350, 25),
    trans = "log10",
    labels = scales::unit_format(unit = "lbs")
  ) +
  scale_x_continuous(
    name = "Height in Inches",
    limits = c(55, 80),
    breaks = seq(55, 80, 5),
    labels = scales::unit_format(unit = "in", accuracy = 1)
  ) +
  scale_shape_manual(
    name = "General\nHealth?",
    labels = c("Excellent", "Very Good", "Good", "Fair", "Poor"),
    values = c(17, 19, 15, 9, 4)
  ) +
  scale_color_brewer(
    name = "General\nHealth?",
    labels = c("Excellent", "Very Good", "Good", "Fair", "Poor"),
    palette = "Set1"
  ) +
  labs(subtitle = "CDC BRFSS: Weight by Height")

legend_theme <- function(...){
  theme(
    legend.position = c(1, 0),
    legend.justification = c(1, 0)
  )
}

#out.width = '50%'
#fig.show='hold' - print at the end of code
p1 <- base + ggtitle("Plot 1: Base") + legend_theme()
p1

p2 <- base + theme_bw() + ggtitle("Plot 2: theme_bw") + legend_theme()
p2 

p3 <- base + theme_linedraw() + ggtitle("Plot 3: theme_linedraw") + legend_theme()
p3

p4 <- base + theme_light() + ggtitle("Plot 4: theme_light") + legend_theme()
p4

p5 <- base + theme_minimal() + ggtitle("Plot 5: theme_minimal") + legend_theme()
p5

p6 <- base + theme_classic() + ggtitle("Plot 6: theme_classic") + legend_theme()
p6

p7 <- base + theme_dark() + ggtitle("Plot 7: theme_dark") + legend_theme()
p7

p8 <- base + theme_void() + ggtitle("Plot 8: theme_void") + legend_theme()
p8

Which theme or themes do you particularly like? Why? I think the default or theme_minmal() provide a clean option to view the day. If lines are not as important, theme_classic would work well to focus on the data shape (not necessarly the values). Void is great for maps when you don’t want the axes to show.

Exercise 2

Using plot_01 from and the theme() function, attempt to construct the ugliest plot possible (example pictured below).

base + 
    theme(
      legend.position = c(1, 0)
    , legend.justification = c(1, 0)
    , axis.title = element_text(color="blue", face = "bold")
    , axis.text = element_text(color = "red", angle = -90)
    , axis.ticks.length = unit(1, "in")
    , panel.grid.major = element_line(color = "black", size = .5)
    , panel.grid.minor = element_line(color="white", size = 4, linetype="dotted")
    , panel.background = element_rect(fill="yellow")
    , plot.background = element_rect(color="darkgreen", fill="purple", size = 10)
    , plot.subtitle = element_text(size = 20, family="serif", face="italic")
    , legend.background = element_rect(fill="lemonchiffon", color="grey50", size = 1, linetype="dashed")
  )

Exercise 3

Using data from cdc_small create a few (at least two) graphics (maybe one scatterplot and one barplot). Style the plots so they follow a “Northwestern” theme. Check out the following webpages to help create the theme:

Visual Identity

Web Colors
Secondary Palette
Fonts & Typography — Need to use substitute fonts

NW_purple   <- "#4E2A84"
NW_purple30 <- "#B6ACD1"
NW_black80  <- "#342F2E"
NW_black50  <- "#716C6B"
NW_black20  <- "#BBB8B8"
NW_black10  <- "#D8D6D6"

ggplot(cdc_small, aes(weight, height)) + 
  geom_point(
      data = select(cdc_small, -gender)
    , color = NW_black20
    , size = 2.5
  ) +
  geom_point(
      color = NW_purple
    , alpha = 0.5
    , size = 2.5
    , show.legend = FALSE
  ) + 
  scale_x_continuous(
      labels = scales::unit_format(unit = "lbs")
    , breaks = seq(125, 325, 50)
  ) + 
  scale_y_continuous(labels = scales::unit_format(unit = "in", accuracy = 1))+
  facet_wrap(~ factor(gender, labels = c("Women", "Men"))) + 
  xlab("Weight") + 
  ylab( "Height") + 
  theme_minimal() +
  ggtitle("Gender Comparison for Height by Weight")+
  theme(
      plot.title = element_text(family = "Georgia", color = NW_black80, hjust = 0.5
                                , size = 15
                                )
    , strip.text.x = element_text(family ="Georgia", face = "bold", color = NW_purple, size=13)
    , axis.title = element_text(color = NW_black80, size=10)
    , axis.text = element_text(color = NW_black50, size = 8)
  )

cdc_small_adj <- cdc_small %>%
  mutate(exercise = factor(exerany, labels=c("No", "Yes")))

labels <- cdc_small_adj %>%
  group_by(gender, genhlth) %>%
  summarize(count = paste("n =", n()))

## `summarise()` regrouping output by 'gender' (override with `.groups` argument)

ggplot(cdc_small_adj, aes(gender))+
  geom_bar(
      aes(fill = exercise)
    , position = "fill"
    , alpha = 0.7
  )+
  facet_wrap(~genhlth, nrow=1) +
  scale_x_discrete(
      name = NULL
    , labels = c("Female", "Male")
  )+
  scale_y_continuous(
      name = NULL
    , labels = scales::percent_format()
  ) +
  scale_fill_manual(
      name = "Have you exercised in the past month?"
    , labels = c("No", "Yes")
    , values = c(NW_black50, NW_purple)
    , guide = guide_legend(reverse = TRUE)
  ) +
  geom_text(
      data = labels
    , aes(label = count, y = -0.05)
    , size = 3
    , color = NW_black80
  ) +
  ggtitle("Exercies Percent by Gender and Reported Health") +
  theme_minimal() +
  theme(
      plot.title = element_text(family = "Georgia", color = NW_black80, hjust = 0.5
                                , size = 15
                                )
    , strip.text.x = element_text(family ="Georgia", face = "bold", color = NW_purple, size=10)
    , axis.title = element_text(color = NW_black80, size=10)
    , axis.text = element_text(color = NW_black50, size = 8)
    , legend.position =  "bottom"
    , legend.text = element_text(color = NW_black80)
    , legend.title =element_text(family = "Georgia", color = NW_black80) 
  )

L10 data wrangling

Overview

The goal of this lab is to use data manipulation and transformation techniques to help use build a few plots.

Datasets

We’ll be using the mod_nba2014_15_advanced.txt and NU_admission_data.csv datasets — add both to the project’s data subdirectory. The codebook_mod_nba2014_15_advanced.txt file supplies a quick description of the variables in the mod_nba2014_15_advanced.txt dataset — suggest adding it to the data subdirectory as well. The undergraduate-admissions-statistics.pdf is the source for the NU_admission_data.csv dataset and it also contains the graphic/plot we will be attempting to re-create in the second exercise.

# Load package(s)
library(tidyverse)
library(cowplot)

extrafont::loadfonts(device = "win")

# Read in the datasets
nba <- read_delim(file = "data/mod_nba2014_15_advanced.txt", delim = "|")
admission <- read_csv(file ="data/NU_admission_data.csv") %>%
  janitor::clean_names() #lowercase name

NW_purple   <- "#4E2A84"
NW_purple60 <- "#836EAA"
NW_purple30 <- "#B6ACD1"
NW_black80  <- "#342F2E"
NW_black50  <- "#716C6B"
NW_black20  <- "#BBB8B8"
NW_black10  <- "#D8D6D6"

Exercises

Exercise 1

Using the mod_nba2014_15.txt dataset try to recreate/approximate the plot type featured in the http://fivethirtyeight.com/ article Kawhi Leonard Is The Most Well-Rounded Elite Shooter Since Larry Bird for any player of your choice for the 2014-2015 season. When calculating quartiles or considering players you may want to exclude players that played less than 10 games or played less than 5 minutes a game. That is, we only want to look for “qualified” players.

quartile_rank <- function(x = 0:99) {
  
  # Set quartile
  quart_breaks <- c(
    -Inf,
    quantile(x,
      probs = c(.25, .5, .75),
      na.rm = TRUE
    ),
    Inf
  )

  cut(x = x, breaks = quart_breaks, labels = FALSE)
}

nba_5metrics <- nba %>%
  filter(G >= 10, MP/G >= 5) %>%
  mutate(
     qTS_perc  = quartile_rank(TS_perc)
    , qUSG_perc = quartile_rank(USG_perc)
    , qDBPM     = quartile_rank(DBPM)
    , qTRB_perc = quartile_rank(TRB_perc)
    , qAST_perc = quartile_rank(AST_perc)
  ) %>%
  select(Player, contains("q")) %>%
  pivot_longer(cols = c(-Player), names_to = "metric", values_to = "rank")

plot_nba <- function(fivemetrics, playername){
  #create tidy df for specific player
  player_df <- fivemetrics %>%
    filter(Player == playername) 
    
  #denote limts, breaks and labels 
  x_limits <- player_df$metric
  x_labels <- c(
      "True\nShooting"
    , "Usage\nRate"
    , "Defensive\nBPM"
    , "Rebound\nRate"
    , "Assist\nRate"
  )
  y_breaks <- c(1, 2, 3, 4)
  y_labels <- c("1st-25th", "25th-50th", "50th-75th", "75th-99th")
  
  y_labels_df <- data.frame(
      metric = c(rep("qDBPM", 4 ))
    , y_breaks = y_breaks
    , y_labels = y_labels
  )
  
  #data frame for vertical lines between metrics
  v_lines <- data.frame(
      x    = seq(1.5, 5.5, 1)
    , xend = seq(1.5, 5.5, 1)
    , y    = rep(0, 5)
    , yend = rep(4, 5)
  )
  
  #create plot
  ggplot(player_df, aes(metric, rank)) +
    geom_col(
        width = 1
      , fill  = "#F28291"
    ) +
    coord_polar() + 
    scale_y_continuous(
        NULL
      , expand = c(0, 0)
    ) +
    scale_x_discrete(
        NULL
      , expand = c(0, 0)
      , limits = x_limits 
    ) +
    geom_segment(
        x = seq(0.5, 4.5, 1)
      , y = 0
      , xend = seq(0.5, 4.5, 1)
      , yend = 4
    ) +
    geom_hline(
        yintercept = y_breaks
      , linetype = "dotted"
    )+
    #METRIC LABELS 
    geom_text(
        aes(label = toupper(x_labels), y = 5)
      , size = 5
    ) +
    #RANK LABELS 
    geom_text(
        data = y_labels_df 
      , aes(
            label = y_labels
          , y=y_breaks
          , x=metric
        )
      , vjust = -0.75
      , size = 4
    ) +
    theme_void() +
    ggtitle(paste(playername, "\n(2015)")) +
    theme(
        panel.background = element_rect(
            fill = "grey90"
          , color = NA
        )
      , plot.title = element_text(
            hjust = 0.5
          , size = 24
        )
    )  
  
}

# {r, fig.width = 8, fig.height = 10}
#also can print directly
plot_nba(nba_5metrics, "Kevin Durant")

plot_nba(nba_5metrics, "Jeremy Lin")

# {r, out.width="100%", out.height="100%"}
#if you want to save the file and then use it 
png(
    file = "img/LeBron James.png"
  , width = 8
  , height = 10
  , units = "in"
  , res = 300
)

plot_nba(nba_5metrics, "LeBron James")

stop <- dev.off()

knitr::include_graphics("img/LeBron James.png")

Exercise 2

Using NU_admission_data.csv create two separate plots derived from the single plot depicted in undergraduate-admissions-statistics.pdf. They overlaid two plots on one another by using two y-axes. Create two separate plots that display the same information instead of trying to put it all in one single plot — consider stacking them with cowplot::plot_grid(). Also, improve upon them by (1) fixing their error with the bar heights and (2) by using a “Northwestern” theme.

Also, practice placing all the text information on the appropriate plots. While I’m not a fan and think it is unnecessary for telling the actual story of the data, sometimes clients want this and there are those that think detailed labeling enhances the plot’s value — they do have a point. When including detailed labeling like this take care to pick label fonts and colors so the text doesn’t take away the from the message of the data (the trend in these plots). With these labels you could image removing the y-axes altogether so they don’t distracts the reader/consumer.

Which approach do you find communicates the information better, their single plot or the two plot approach? Why?

admission_bar <- admission %>%
  mutate(
      non_admitted = applications - admitted_students
    , non_attended = admitted_students - matriculants
    , attended = matriculants
         ) %>%
  select(year, non_admitted, non_attended, attended)  %>%
  pivot_longer(
      cols = c(-year)
    , names_to = "status"
    , values_to = "count"
  ) %>%
  mutate(
    status = factor(
        status
      , levels = c("non_admitted", "non_attended", "attended")
      )
  )

admission_apps <- admission %>%
  select(applications, year) %>%
  mutate(apps = paste("n =", scales::comma(applications)))

plot_bar <-  ggplot(admission_bar, aes(year, count, fill = status)) +
  geom_col(width = 0.9) +
  scale_fill_manual(
      "Applications:"
    , values = c(NW_black20, NW_purple30, NW_purple60)
    , labels = c("Not Admitted", "Admitted but did not attend", "Matriculants")
    , guide = guide_legend(reverse = TRUE)
  ) +
  scale_x_continuous(
      name = NULL
    , limits = c(1998.5, 2018.5)
    , breaks = seq(1999, 2018)
    , minor_breaks = NULL
    , expand = c(0, 0)
  ) +
  scale_y_continuous(
      name = NULL
    , expand = c(0, 0)
    , breaks = NULL
    , limits = c(0, max(admission_apps$applications) + 3000)
  ) +
  geom_text(
      aes(label = scales::comma(count))
    , position = position_stack(vjust = 0.6)
    , size = 3
    , color = NW_black80
  ) +
  annotate( 
    "text"
    , x = admission_apps$year
    , y = admission_apps$applications + 2000
    , label = scales::comma(admission_apps$applications)
    , size = 3
    , color = NW_black80
    , fontface = "bold"
  ) +
  theme_minimal() +
  theme(
      legend.position = "top"
    , axis.title = element_text(color = NW_black80, family = "Georgia")
    , axis.text = element_text(color = NW_black80, family = "Georgia")
    , axis.text.y = element_text(color = NW_black50, face = "bold")
    , legend.text = element_text(color = NW_black80)
    , legend.title = element_text(family="Georgia", color=NW_black80)
    , line = element_blank()
  )

admission_line <- admission %>%
  select(year, admission_rate, yield_rate) %>%
  pivot_longer(cols = c(-year), names_to = "status", values_to = "rate" ) %>%
  mutate(
    status = factor(status
                    , levels = c("yield_rate", "admission_rate")
                    )
  )

plot_line <-  ggplot(
      admission_line
    , aes(year, rate, color = status, shape = status)
  ) +
  geom_point(size = 2) +
  geom_line(size = 1) +
  scale_color_manual(
      "Rate:"
    , values = c(NW_purple, NW_purple60)
    , labels = c("Yield Rate", "Admission Rate")
  ) +
  scale_shape_discrete(
      "Rate:"
    , labels = c("Yield Rate", "Admission Rate") 
  ) +
  scale_x_continuous(
      name = NULL
    , limits = c(1998.5, 2018.5)
    , breaks = seq(1999, 2018)
    , minor_breaks = NULL
    , expand = c(0, 0)
  ) +
  scale_y_continuous(
      name = NULL
    , limits = c(0, 62)
    , expand = c(0, 0)
    , breaks = NULL
  ) + 
  geom_text(
      data = filter(admission_line, status == "yield_rate")
    , aes(y=rate, x=year, label = paste(rate, "%", sep = ""))
    , size = 3
    , color = NW_purple
    , vjust = 0
    , nudge_y = 2
  ) +
  geom_text(
      data = filter(admission_line, status == "admission_rate")
    , aes(y=rate, x=year, label = paste(rate, "%", sep = ""))
    , size = 3
    , color = NW_purple60
    , vjust = 1
    , nudge_y = -2
  ) + 
  theme_minimal() +
  theme(
      legend.position = "top"
    , axis.title = element_text(color = NW_black80, family = "Georgia")
    , axis.text = element_text(color = NW_black80, family = "Georgia")
    , axis.text.y = element_blank()
    , legend.text = element_text(color = NW_black80)
    , legend.title = element_text(
          family="Georgia"
        , color=NW_black80
      )
  )

title <- ggdraw() + 
  draw_label(
      "NORTHWESTERN UNIVERSITY"
    , fontface = 'bold'
    , fontfamily = "Georgia"
    , color = NW_purple
    , x = .5
    , hjust = .5
    , size = 14
  ) +
  theme(
    plot.margin = margin(5, 5, 5, 5)
  )

subtitle <- ggdraw() + 
  draw_label(
      "Undergraduate Admission History by Entering Year"
    , fontfamily = "Georgia"
    , color = NW_black80
    , x = .5
    , hjust = .5
    , size = 12
  ) +
  theme(
    plot.margin = margin(5, 5, 5, 5)
  )

caption <- ggdraw() + 
  draw_label(
      "SOURCE: Information Systems Office for Admissions & Financial Aid."
    , color = NW_black50
    , x = 1
    , hjust = 1
    , size = 10
  )

plot_grid(
    plot_grid(title, subtitle, ncol = 1)
  , plot_grid(plot_bar, plot_line, ncol = 1, align = "v")
  , caption
  , ncol = 1
  , rel_heights = c(.1, 1.5, .05)
  )

This information is directly connected (one is the the raw counts, the other the rate) - however, putting it all onto one plot makes it difficult to read and see trends. I think having two plots on top of eachother is a better way to show that the data are related but still allowing enough space to read.

Challenge(s)

No not have to complete.

Using NU_admission_data.csv try to re-create/approximate the single plot depicted in undergraduate-admissions-statistics.pdf. Fix their error concerning the bar heights. Might want to simply start with one of your plots from Exercise 2 and see if it can be modified.

#more colors
NW_orange <- "#CA7C1B"
NW_yellow <- "#FFC520"

#primary/main y-axis limit and breaks 
y1_lim  <- 40000
y1_breaks <- seq(0, y1_lim, 10000)

#secondary y-axis limit and breaks 
y2_lim <- 60
y2_breaks <- seq(0,y2_lim, 15)


transform <- y1_lim*(1/y2_lim)

ggplot(admission_bar, aes(year)) +
  
  #PRIMARY AXIS
  geom_col( aes(y=count, fill = status )) +
  scale_fill_manual(
      name = NULL
    , values = c(NW_black20, NW_yellow, NW_purple60)
    , labels = c("Not Admitted", "Admitted but did not attend", "Matriculants")
    , guide = guide_legend(reverse = TRUE)
  ) +
  
  #SECONDARY AXIS 
  geom_line( 
      data = admission_line
    , aes(y = rate*transform, color = status)
    , size = 1
  ) +
  geom_point( 
      data = admission_line
    , aes(y = rate*transform, color = status, shape = status)
    , size = 2
  ) +
  scale_color_manual(
      name = NULL
    , values = c(NW_purple, NW_orange)
    , labels = c("Yield Rate", "Admission Rate")
  ) +
  scale_shape_discrete(
      name = NULL
    , labels = c("Yield Rate", "Admission Rate")
  ) +
  
  #SCALING
  scale_x_continuous(
      NULL
    , limits = c(1998.25, 2018.75)
    , breaks = seq(1999, 2018)
    , minor_break = NULL
    , expand = c(0, 0)
  ) +
    scale_y_continuous(
      name = "Applications"
    , expand = c(0, 0)
    , breaks = y1_breaks
    , labels = scales::comma
    , sec.axis = sec_axis(
          ~./transform
        , breaks = y2_breaks
        , labels = scales::unit_format(suffix = "%")
        , name = "Rate"
      )
  ) +
    
    
  #GENERAL THEMES
  theme_minimal() +
  labs(
      title = "NORTHWESTERN UNIVERSITY"
    , subtitle = "Undergradudate Admission History by Entering Year"
    , caption = "SOURCE: Information Systems Office for Admissions & Financial Aid."
  ) +
  theme(
      legend.position = "top"
    , axis.title = element_text(color = NW_black80, family = "Georgia")
    , axis.text = element_text(color = NW_black80, family = "Georgia")
    , legend.text = element_text(color = NW_black80)
    , plot.title = element_text(
          color = NW_purple
        , family ="Georgia"
        , face = "bold"
        , hjust = 0.5
      )
    , plot.subtitle = element_text(
          color = NW_black80
        , family ="Georgia"
        , hjust = 0.5
      )
    , plot.caption = element_text(color = NW_black50)
  )

M1 Midterm 1

Datasets

We will need the instructor provided stephen_curry_shotdata_2014_15.txt dataset and the nbahalfcourt.jpg. The variables should be fairly straightforward after some inspection of the dataset, but we have provided a description of variables just in case.

GAME_ID - Unique ID for each game during the season
PLAYER_ID - Unique player ID
PLAYER_NAME - Player’s name
TEAM_ID - Unique team ID
TEAM_NAME - Team name
PERIOD - Quarter or period of the game
MINUTES_REMAINING - Minutes remaining in quarter/period
SECONDS_REMAINING - Seconds remaining in quarter/period
EVENT_TYPE - Missed Shot or Made Shot
SHOT_DISTANCE - Shot distance in feet
LOC_X - X location of shot attempt according to tracking system
LOC_Y - Y location of shot attempt according to tracking system

# Load package(s)
library(tidyverse)
library(hexbin)

# Read in the dataset(s)
shots <- read_delim(file = "data/stephen_curry_shotdata_2014_15.txt", delim = "|") %>%
  janitor::clean_names()

Exercises

Exercise 1

After replicating the plots provide a summary of what the graphics indicate about Stephen Curry’s shot selection (i.e. distance from hoop) and shot make/miss rate and how they relate/compare across distance and game time (i.e. across quarters/periods).

Plot 1

m1_ex1_p1 <- ggplot(shots, aes(as.factor(period), shot_distance)) + 
  geom_boxplot(varwidth = TRUE) + 
  facet_wrap( ~ event_type) + 
  scale_x_discrete(
      name = "Quarter/Period"
    , labels = c("Q1", "Q2", "Q3", "Q4", "OT")
  ) +
  scale_y_continuous(
      name = NULL
    , labels = scales::unit_format(unit = "ft")
    , limits = c(0, NA)
    , breaks = seq(0, 40, 10)
    , minor_breaks = NULL
  ) +
  ggtitle("Stephen Curry\n2014-2015") +
  theme_minimal() +
  theme(
      plot.title = element_text(face = "bold", size = 14)
    , strip.text = element_text(face = "bold", size = 12)
    , axis.title.x = element_text(face = "bold", size = 12)
    , panel.grid.major.x = element_blank()
     
  )

m1_ex1_p1

Unsurprisingly, there are the fewest shots during OT for both missed and made- which makes sense since not every game has an OT period. For all periods, the made shots for each quarter each have a longer IQR and lower median than the missed shots. The difference in the medians makes sense: larger distances make scoring points more difficult so shots made have a shorter distance that shots missed. In addition, there seems to be some variety between the medians by quarter for made shot (Q2 is the highest, where Q4 is significantly lower and a huge drop-off in OT); compared to missed shots where each of the regular periods (Q1, Q2, Q3, Q4) have roughly the same median and OT is the only one noticeably different. There also seems to be slightly less shots taken in the Q2 and Q4 than in Q1 and Q3 (both for shots made and shots missed) which may be do to playing time (some fancy baseball knowledge I don’t know).

Plot 2

ggplot(shots, aes(shot_distance, fill = event_type)) + 
  geom_density(alpha = 0.3) + 
  scale_fill_manual(values = c("green", "red")) + 
  scale_x_continuous(
      name = NULL
    , breaks = seq(0, 40, 10)
    , minor_breaks = NULL
    , labels = scales::unit_format(unit = "ft")
  ) +
  scale_y_continuous(
      name = NULL
    , breaks = NULL
    , limits = c(0, 0.081)
    , expand = c(0, 0)) +
  annotate(
      "text"
    , x = 3
    , y = 0.04
    , label  = "Made Shots"
    , hjust = 0
    , vjust = 0
  ) +
  annotate(
      "text"
    , x = 27
    , y = 0.07
    , label  = "Missed Shots"
    , hjust = 0
    , vjust = 0
  ) + 
  ggtitle("Stephen Curry\nShot Densities (2014-2015)") +
  theme_minimal() +
  theme(
      plot.title = element_text(size = 14)
    , legend.position = "none"
    , panel.grid.major.x = element_blank()
  )

The density graph shows a similar trend from the box plots: made shots on average have a shorter distance than missed shots. These densities are both bi-modal, peaking around 5f and again around 25 ft. The made shots has the higher pick at 5ft (easier to make shots at a shorter distance) and the missed shots has a higher peak at 25 ft (harder to make shots at a longer distance. Why is there this decrease in shots at the distances between? The 3pt-line is at ~24ft, which is why there is a large increase in shots taken just around and above that height. There is not much difference in difficult between taking a 23ft shot and a 25ft shot, but the former only gives you 2pts where the later gives you 3pts.

Exercise 2

After examining the two graphics, what do you conclude about Stephen Curry’s shot selection (i.e. distance form hoop) for the 2014-2015 season? Out of the four graphics (two from Exercise 1 and two from Exercise 2), which graphic(s) do you find the most useful when trying to understand Stephen Curry’s shot selection? If you find them all useful, explain what information is better communicated in each.

# picture of half a basketball court to show the location of shots 
court <- grid::rasterGrob(jpeg::readJPEG(source = "img/nbahalfcourt.jpg"),
  width = unit(1, "npc"), height = unit(1, "npc")
)


#create base plot with court as background
basecourt <- ggplot() +
  #insert half-court to be used as plot background 
  annotation_custom(
    grob = court,
    xmin = -250, xmax = 250,
    ymin = -52, ymax = 418
  ) +
  coord_fixed() +
  xlim(250, -250) +
  ylim(-52, 418)

Plot 1

m1_ex2_p1 <-  basecourt +
  geom_hex(
      data = shots
    , aes(loc_x, loc_y)
    , bins = 20
    , alpha = 0.7
    , color = "grey"
  ) +
  scale_fill_gradient(
      name = "Shot\nAttempts"
    , low = "yellow"
    , high = "red"
    , breaks = c(0, 5, 10, 15)
    , labels = c(0, 5, 10, "15+")
    #set limit of shots to 0 to 15
    , limits = c(0, 15)
    # any shots above 15 will be 'na' so set color to high value = red
    , na.value = "red"
    ) +
  ggtitle("Shot Chart\nStephen Curry") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 14))

m1_ex2_p1

m1_ex2_p2 <- basecourt +
  geom_point(
      data = shots
    , aes(loc_x, loc_y, shape = event_type, color = event_type)
    , fill = NA
    , size = 5
  ) +
  scale_color_manual(
      name = NULL
    , values = c("green", "red")
  ) +
  scale_shape_manual(
      name = NULL
    , values = c(1, 4)
  ) +
  ggtitle("Shot Chart\nStephen Curry") +
  theme_void() +
  theme(
      plot.title = element_text(face = "bold", size = 14)
    , legend.position = "bottom"
    , legend.text = element_text(size = 12)
    )

m1_ex2_p2

To understand Curry’s shot selection, the hex map is better as it shows the density of where he take shots. The point chart gives the same idea (more shots around the hoop and outside the 3pt line), but it is difficult to tell the difference in density because there are so many points - definitely a case of over-plotting The idea of looking at the location of the made and missed shots is interesting and I think it would be better to view a faceted hex map to compare shot density between made and missed shots. The results mirror the info from above; there is a higher density of missed shots outside the 3pt line compared to made shots and a higher density of made shots around the hoop compared to missed shots.

m1_ex2_p1 + 
  facet_wrap(~ event_type) +
  theme(
      plot.title = element_text(face = "bold", size = 14, margin = margin(0, 0, 10, 0))
    , strip.text = element_text(face = "bold", size = 10)
  )

Exercise 3

Part 1

In 3-5 sentences, describe the core concept/idea and structure of the ggplot2 package.

ggplot2 is a data visualization tool that is more powerful and versatile than base R plots. ggplot2 allows the user to break down the plot into individual components that can be added, removed, or edited. ggplot2 is based on the Grammar of Graphics; which uses scales and layers to build a plot in a systematic fashion.

Part 2

Describe each of the following:

ggplot(): creates a new plot; this plot is blank without any other specifications
aes(): maps data to visual aesthetics
geoms: specifies the type of geometric object to visualize (i.e. points, bars, etc.)
stats: statistical transformation on the data rather than visualizing the data
scales: translation of data to visual properties
theme(): visual appearances that are not related to the data (i.e. background, legend position, title size, etc.)

Part 3

Explain the difference between using this code geom_point(aes(color = VARIABLE)) as opposed to using geom_point(color = VARIABLE).

geom_point(aes(color = VARIABLE)): This maps data to color based on value(s) of VARIABLE, e.g. if there are three levels of VARIABLE, the data will be mapped to three different color based on the level. This option also automatically creates a legend to show which color is assigned to which level(s) of VARIABLE.

geom_point(color = VARIABLE): This makes all points one color called VARIABLE. No legend is created because there is no mapping from data to aesthetics.

M2 Midterm 2

Datasets

We will be using the NH_2016pp_dem.txt and NH_2016pp_rep.txt datasets. NH_2016pp_dem.txt contains the raw vote count by county for each candidate for the 2016 New Hampshire Democratic presidential primary. NH_2016pp_rep.txt contains the raw vote count by county for each candidate for the 2016 New Hampshire Republican presidential primary.

# Load package(s)
library(tidyverse)
library(sf)

# Read in the dataset(s)
dem <- read_delim(file = "data/NH_2016pp_dem.txt", delim = "|") %>%
  janitor::clean_names()
rep <- read_delim(file = "data/NH_2016pp_rep.txt", delim = "|") %>%
  janitor::clean_names()

Exercises

Complete the following exercises.

Exercise 1

Using the NH_2016pp_dem.txt and NH_2016pp_rep.txt datasets in conjunction with mapping data from the maps package replicate, as close as possible, the graphic below. Note the graphic is comprised of two plots displayed side-by-side. The plots both use the same shading scheme (i.e. scale limits and fill options). Hint: You will have to use pivot_longer(), left_join(), and other data manipulation functions.

Background Information: On Tuesday, February 9th 2016, New Hampshire held primaries for the Democratic and Republican Parties to apportion state delegates to nominate their respective party’s presidential nominee. Bernie Sanders (D) and Donald Trump (R) had very good nights. They both won every county in New Hampshire within their respective party’s primary. In the Democratic primary, Hillary Clinton was second in each county. In the Republican primary John Kasich was second in all but one county where he was in a virtual tie for second, but technically in third (thus the asterisk). The graphic below visualizes the ratio of the candidate with the highest vote total to the candidate with the second highest vote total in each county (a measure of strength for each county winner). The fact that we can replace first and second place with candidate names is simply a byproduct of how the primaries played out. Otherwise, the legend labels should have been “First to Second Ratio” or we might have implemented an alternative shading scheme.

After replicating the graphic provide a summary of how the two primaries relate to one another.

# Setup NH map dataset
nh_dat <- maps::map(
  database = "county",
  regions = "new hampshire",
  plot = FALSE,
  fill = TRUE
) %>%
  #convert data into sf object (simple features)
  st_as_sf() %>%
  # remove state name, so that county name is only thing remaining
  mutate(ID = str_remove(ID, ".*,")) %>%
  # rename column to county 
  rename(county = ID)

#check the coord reference system 
#sf::st_crs(nh_dat)

#looking at just rato of top 2
nh_plot1 <- function(df, ratio_name, party_name){
df %>%
  pivot_longer(col = -county, names_to = "candidate", values_to = "votes") %>%
  group_by(county) %>%
  summarise(
    ratio = max(votes) / nth(votes, n = 2, order_by = desc(votes))
  ) %>%
  left_join(., nh_dat, by = "county") %>%
ggplot() +
  geom_sf(
      aes(geometry = geom, fill = ratio)
    , color = "grey25"
  ) +
  scale_fill_gradient(
      name = paste(ratio_name, "Ratio", sep="\n")
    , limits = c(1, 3)
    , breaks = c(seq(1, 3, 0.5))
    , low = "grey95"
    , high = "grey5"
  ) + 
  theme_void() +
  labs(
      title = paste(party_name, "Presidential Primary")
    , subtitle = "New Hampshire (2016)"
  ) +
  theme(
      legend.position = c(0, .90) 
    , legend.justification = c(0, 1)
    , legend.background = element_rect(fill = NA, color =NA)
  )
}

p1 <- nh_plot1(dem, "Sanders to Clinton", "Democratic")

## `summarise()` ungrouping output (override with `.groups` argument)

p2 <- nh_plot1(rep, "Trump to Kaisch*", "Republican")

## `summarise()` ungrouping output (override with `.groups` argument)

gridExtra::grid.arrange(p1, p2, ncol = 2, top = "Ratio between First and Second")

#look at 2 specific people ratio 
votes  <- left_join(dem, rep, "county")
nh_primary <- left_join(votes, nh_dat, by = "county") 

nh_plot2 <- function(map_df, name1, name2, ratio_name, party){
map_df %>%
  mutate(ratio = map_df[[name1]]/map_df[[name2]]) %>%
ggplot() +
  geom_sf(
      aes(geometry = geom, fill = ratio)
    , color = "grey25"
  ) +
  scale_fill_gradient(
      name = paste(ratio_name, "Ratio", sep="\n")
    , limits = c(1, 3)
    , breaks = c(seq(1, 3, 0.5))
    , low = "grey95"
    , high = "grey5"
  ) + 
  theme_void() +
  labs(
      title = paste(party, "Presidential Primary")
    , subtitle = "New Hampshire (2016)"
  ) +
  theme(
      legend.position = c(0, .90) 
    , legend.justification = c(0, 1)
    , legend.background = element_rect(fill = NA, color =NA)
  )
}

p1 <- nh_plot2(nh_primary, "sanders", "clinton", "Sanders to Clinton", "Democratic")
# ast not needed because looking at Trump:Kasich at each county
p2 <- nh_plot2(nh_primary, "trump", "kasich", "Trump to Kasich", "Republican")

gridExtra::grid.arrange(p1, p2, ncol = 2, top = "Ratio between Specific Candidates")

votes %>%
  mutate(
      ratio_rep = trump/kasich
    , ratio_dem = sanders/clinton
         ) %>%
  select(county, ratio_rep, ratio_dem)

## # A tibble: 10 x 3
##    county       ratio_rep ratio_dem
##    <chr>            <dbl>     <dbl>
##  1 belknap           2.24      1.72
##  2 carroll           1.83      1.75
##  3 cheshire          2.08      2.43
##  4 coos              2.32      1.81
##  5 grafton           1.45      2.07
##  6 hillsborough      2.34      1.39
##  7 merrimack         1.77      1.48
##  8 rockingham        2.77      1.36
##  9 strafford         2.30      1.80
## 10 sullivan          2.16      2.37

There is a larger margin of victory in the democratic primary in the bottom left county, and a larger margin of victory in the Rebulican primary in the bottom right hand sign. This could mean that in these two counties there is very strong support for the lead candiate in the repsective parties. For example, Rockingham has a very high Trump to Kasich ratio but the smallest Sanders to Clinton ratio. This could mean that Republicans are Rockingham is very pro-Trump and in agreement; but the Democrats in this county do not have a consensus on a democratic candidate - is this because the county is so pro-Trump and more conservative?

Exercise 2

Create a fake/randomly generated dataset to approximate a scatterplot graphic shown.

set.seed(2468)
low_x <- rnorm(20, mean = 5,  sd = 2)
med_x <- rnorm(20, mean = 10, sd = 2)
hig_x <- rnorm(20, mean = 15, sd = 2)

low_y <- low_x + rnorm(20, mean = 0, sd = 1)
med_y <- med_x + rnorm(20, mean = 0, sd = 2)
hig_y <- hig_x + rnorm(20, mean = 0, sd = 3)

df <- tibble(
    x = c(low_x, med_x, hig_x)
  , y = c(low_y, med_y, hig_y)
  , Group = c(
      rep("Low", 20)
    , rep("Medium", 20)
    , rep("High", 20)
  )
) %>%
  mutate(Group = factor(Group, levels = c("High", "Medium", "Low")))

ggplot(df, aes(x, y, color = Group, shape = Group, fill = Group)) + 
  stat_ellipse(geom = "polygon", alpha = 0.2) +
  geom_point(size = 3) +
  scale_shape_manual(values = c(16, 17, 15)) +
  theme_classic() +
  scale_x_continuous(breaks = seq(0, 20, 5)) +
  scale_y_continuous(breaks = seq(0, 20, 5)) +
  theme(legend.position = c(0.1, 0.9), legend.justification = c(0, 1) )

Exercise 3

Create a data set and a graphic based on tile’s for each x and y value.

The function can be drawn using the equation provided on the graphic. The color of the function is a 50-50 mix of red and blue.

base_seq <- seq(-1, 1, 0.01)
n_seq <- length(base_seq)


x <- rep(base_seq, n_seq)
y <- sort(rep(base_seq, n_seq))

cubic_df <- tibble(x, y) %>%
  mutate( fill_amount = x^3 - y)

func_cube <- function(x){x^3}

ggplot(cubic_df, aes(x, y, fill = fill_amount)) + 
  geom_raster() +
  scale_fill_gradient2(
      name = NULL
    , high = "blue"
    , low = "red"
    , mid = "white"
    , midpoint = 0
  ) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  stat_function(
      fun = func_cube
    , color = rgb(.5*255, 0, .5*255, maxColorValue = 255)
    , size = 0.6
  ) +
  annotate(
    "text"
    , x = 0
    , y = 0.15
    , label = "x^3"
    , parse = TRUE
    , size = 8
  )

L11 Shiny 1

Exercies

Exercies 1

Work through lesson 1 and lesson 2 which both contain an exercise for practice. In the case of lesson 1’s exercise you should have replaced the base R graphics code with ggplot code (i.e. make the histogram using ggplot2). Submit your shiny applications for lessons 1 and 2.

Lesson 1

#LESSON 1
library(shiny)
library(tidyverse)

# Define UI for app that draws a histogram ----
ui <- fluidPage(
  
  # App title ----
  titlePanel("Hello World!"),
  
  # Sidebar layout with input and output definitions ----
  sidebarLayout(
    
    # Sidebar panel for inputs ----
    sidebarPanel(
      
      # Input: Slider for the number of bins ----
      sliderInput(inputId = "bins",
                  label = "Number of bins:",
                  min = 5,
                  max = 50,
                  value = 30)
      
    ),
    
    # Main panel for displaying outputs ----
    mainPanel(
      
      # Output: Histogram ----
      plotOutput(outputId = "distPlot")
      
    )
  )
)

# Define server logic required to draw a histogram ----
server <- function(input, output) {
  
  # Histogram of the Old Faithful Geyser Data ----
  # with requested number of bins
  # This expression that generates a histogram is wrapped in a call
  # to renderPlot to indicate that:
  #
  # 1. It is "reactive" and therefore should be automatically
  #    re-executed when inputs (input$bins) change
  # 2. Its output type is a plot
  output$distPlot <- renderPlot({
    
    
    ggplot(faithful, aes(waiting)) + 
      geom_histogram(
          color = "orange"
        , fill = "#75AADB"
        , bins = input$bins
      )+
      labs(
          x = "Waiting time to next erruption (in min)"
        , title = "Histogram of waiting times"
        , y = "Frequency"
      ) +
      theme_classic() + 
      theme(plot.title = element_text(size = 20))
    
#TEMPLATE HISTOGRAM EXAMPLE 
#    x    <- faithful$waiting
#    bins <- seq(min(x), max(x), length.out = input$bins + 1)
#    hist(x, breaks = bins, col = "#75AADB", border = "white",
#         xlab = "Waiting time to next eruption (in mins)",
#         main = "Histogram of waiting times")
    
  })
  
}

shinyApp(ui = ui, server = server)

Lesson 2

library(shiny)
library(tidyverse)



# Define UI ----
ui <- fluidPage(
  titlePanel("Lesson 2"),
  
  sidebarLayout(position = "left",
                sidebarPanel("Check out my Guinea Pigs"),
                mainPanel(
                  h1("Daffodil and Blossom", align = "center"),
                  p("These are my two guinea pigs.  I adopted them in January 2019.", style = "font-family: 'times'; font-si16pt"), 
                  strong("They are are very cute."),
                  br(),
                  code("They have no idea how to code."),
                  div("They do know how to run around the cage and wheek for food.", style = "color:blue"),
                  br(),
                  p("Their favorite foods are",
                    span("cilantro, parsely,", style = "color:green"),
                    "and", 
                    span("carrots", style = "color:orange"), 
                    "."), 
                  img(src = "peegs.jpg", width = "80%")
                )
  )
)
# Define server logic ----
server <- function(input, output) {
  
}

# Run the app ----
shinyApp(ui = ui, server = server)

Exercise 2

Create a shiny for the CDC data where the input is the number of bins and the output is a histogram of weight by gender.

library(shiny)
library(tidyverse)

cdc <- read_delim(file = "cdc.txt", delim = "|") %>%
  mutate(genhlth = factor(genhlth,
                          levels = c("excellent", "very good", "good", "fair", "poor"),
                          labels = c("Excellent", "Very Good", "Good", "Fair", "Poor")
  ))

# Define UI for app that draws a histogram ----
ui <- fluidPage(
  
  # App title ----
  titlePanel(NULL),
  
  sidebarLayout(
    position = "right", 
    
    sidebarPanel(
      
      # Input: Slider for the number of bins ----
      sliderInput(inputId = "bins",
                  label = "Number of bins:",
                  min = 5,
                  max = 50,
                  value = 30)
      
    ),
    
    # Main panel for displaying outputs ----
    mainPanel(
      
      # Output: Histogram ----
      plotOutput(outputId = "distPlot")
      
    )
  )
)

# Define server logic required to draw a histogram ----
server <- function(input, output) {
  
  output$distPlot <- renderPlot({
    
    
    ggplot(cdc, aes(weight, fill = gender)) + 
      geom_histogram(
          color = "black"
        , bins = input$bins
      ) +
      scale_fill_discrete(
          name = "Gender"
        , labels = c("Female", "Male")
      ) +
      labs(
          x = "Weight in Pounds"
        , title = "CDC BRFSS Histogram of Weight Grouped by Gender"
      ) +
      theme_minimal() + 
      theme(
          plot.title = element_text(size = 20)
        , legend.position = c(0.55, 0.6)
        , legend.justification = c(1, 0)
        , legend.background = element_rect(fill = NA, color = NA)
      )
    

    
  })
  
}

shinyApp(ui = ui, server = server)

Arend’s app (on shinnyapps server):

#{r, out.width = "100%"}
knitr::include_app("https://amkuyper.shinyapps.io/CDCapp/", height = "500px")

Exercise 3

Create an app with a state or country; explore options for text, images, and HTML tags.

#MINNESOTA MAP 
library(shiny)
library(tidyverse)
library(rvest) #for scraping data from mn web 
library(maps)


mn <- map_data("county", "minnesota") %>% 
  select(long, lat, group, id = subregion)


url <- "https://www.minnesota-demographics.com/counties_by_population"
webpage <- read_html(url)
tbls <- html_nodes(webpage, "table")

tbls_ls <- webpage %>%
  html_nodes("table") %>%
  html_table()

pop_mn_counties <- tbls_ls[[1]][-88, ] %>%
  mutate(
    Population = as.numeric(gsub(",", "",  Population))
    , County = tolower(gsub(" County", "", County))
    , County = ifelse(County == "st. louis", "st louis", County)
    , perc_pop = Population / sum(Population)
  ) %>%
  janitor::clean_names() 

mn_pop_map <- mn %>%
  left_join(., pop_mn_counties, by = c("id" = "county"))

# Define UI ----
ui <- fluidPage(
  titlePanel("Minnesota"),
  
  sidebarLayout(position = "left",
                sidebarPanel(
                    h3("Fun Facts:")
                  , strong("State Nickname:"), " Land of 10,000 Lakes",  br()
                  , strong("State Motto:"), " The Star of the North", br()
                  , strong("Population:"), "5.576 millsion (2017 est.)", br()
                  , strong("State Flag:")
                  , img(src = "flag.png", width = "80%"), br()
                  , strong("State Bird:"), " Loon"
                  , img(src = "loon.jpg", width = "80%")
                ),
                mainPanel(
                    plotOutput(outputId = "statemap")
                  , tags$ul(
                    tags$li("Minnesota was admitted as the 32nd state on May 11, 1858"), 
                    tags$li("Minnesota is ranked 12th in largest area and 22nd in population; nearly 55% of residents live in the Twin Cities metro area"), 
                    tags$li("Minnesota is known for a population that enjoys playing and watching hockey"), 
                    tags$li( "Minnesota still has a Virginia confederate flag that was caputred during the Battle of Gettysburg in 1863; Virigina has asked for the flag to be returned multiple times but ",
                        tags$a(href = "https://www.twincities.com/2017/08/20/minnesota-has-a-confederate-symbol-and-it-is-going-to-keep-it/"
                                      , "Minnesota has refused")
                              ), 
                    tags$li("Learn more on ", tags$a(href="https://en.wikipedia.org/wiki/Minnesota", "Wikipedia"))
                  )
                )
  )
)
# Define server logic ----
server <- function(input, output) {
  output$statemap <-  renderPlot({
    ggplot(mn_pop_map, aes(x = long, y = lat, fill = perc_pop)) +
      geom_polygon(aes(group = group) , color = "grey35") +
      scale_fill_continuous(
        name = "Percent of Population"
        , breaks = c(0.05, 0.10, 0.15, 0.20)
        , labels = c("5%", "10%", "15%", "20%")
      ) +
    coord_quickmap() +
      theme_void()
  })
  
}

# Run the app ----
shinyApp(ui = ui, server = server)

L12 Shiny 2

Exercises

Exercise 1

Work through lesson 3 and lesson 4 which both contain an exercise for practice. In this case, the exercise for lesson 4 simply builds directly upon lesson 3’s exercise. Submit one shiny app that corresponds with the completion of the exercises for lessons 3 and 4.

#LESSON 3 & 4
library(shiny)

ui <- fluidPage(
  titlePanel("censusVis"),
  
  sidebarLayout(
    sidebarPanel(
      helpText("Create demographic maps with 
               information from the 2010 US Census."),
      
      selectInput("var", 
                  label = "Choose a variable to display",
                  choices = c("Percent White", 
                              "Percent Black",
                              "Percent Hispanic", 
                              "Percent Asian"),
                  selected = "Percent White"),
      
      sliderInput("range", 
                  label = "Range of interest:",
                  min = 0, max = 100, value = c(0, 100))
    ),
    
    mainPanel(
      textOutput("selected_var"),
      textOutput("min_max")
    )
  )
)

server <- function(input, output) {
  
  output$selected_var <- renderText({ 
    paste("You have selected", input$var)
  })
  
  output$min_max <- renderText({ 
    paste("You have chosen a range that goes from",
          input$range[1], "to", input$range[2])
  })
  
}

shinyApp(ui, server)

runApp('Eichlersmith_Martha_L12/lesson_4', display.mode = "showcase")

Exercise 2

Create a shiny for the CDC data where the input is the number of bins and variable and the output is a histogram of weight by the given variable.

library(shiny)
library(tidyverse)
library(skimr)

# https://stackoverflow.com/questions/48106504/r-shiny-how-to-display-choice-label-in-selectinput 
# https://shiny.rstudio.com/reference/shiny/1.4.0/tabsetPanel.html

cdc <- read_delim(file = "cdc.txt", delim = "|") %>%
  mutate(
    genhlth = factor(
      genhlth
      , levels = c("excellent", "very good", "good", "fair", "poor")
      , labels = c("Excellent", "Very Good", "Good", "Fair", "Poor")
    )
    , hlthplan = factor(
      hlthplan
      , levels = c(0, 1)
      , labels = c("No", "Yes")
    )
    , exerany = factor(
      exerany
      , levels = c(0, 1)
      , labels = c("No", "Yes")
    )
    , smoke100 = factor(
      smoke100
      , levels = c(0, 1)
      , labels = c("No", "Yes")
    )
    , gender = factor(
      gender
      , levels = c("f", "m")
      , labels = c("Female", "Male")
    )
  )

# Define UI for app that draws a histogram ----
ui <- fluidPage(
  
  # App title ----
  titlePanel("CDC BRFSS Histograms"),
  
  sidebarLayout(
    position = "right", 
    
    sidebarPanel(
      
      selectInput("x_var", label = "Select Variable:",
                  choices = list(
                      "Actual Weight" = "weight"
                    , "Desired Weight" = "wtdesire"
                    , "Height" = "height"
                  ),  selected = "weight"
          ), #end of select input 
      
      
      sliderInput(inputId = "bins",
                  label = "Number of bins:",
                  min = 5,
                  max = 50,
                  value = 30,
                  animate = animationOptions(interval = 1000, loop = FALSE)
        ),  #end of slider input 
      
      radioButtons("fill_var", label = "Select Fill/Legend Variable", 
                   choices = list(
                       "General Health" = "genhlth"
                     , "Health Coverage" = "hlthplan"
                     , "Exercised in Past Month" = "exerany"
                     , "Smoked 100 Cigarettes" = "smoke100"
                     , "Gender" = "gender"
                     , "None" = "None"
                     ), selected = "genhlth"
          ) #end of radio buttons 
      
    ), #end of side bar panel 
    
    # Main panel for displaying outputs ----
    mainPanel(
      tabsetPanel(
        tabPanel("Plot", plotOutput("plot")),
        tabPanel("Summary", verbatimTextOutput("summary_x"), verbatimTextOutput("summary_fill")) 
      ) #end tabsetPanel 
    ) #end mainPanel
  ) #end sidebar layout 
) # end fluid page 

fill_names <- c(
  "General Health" = "genhlth"
  , "Health Coverage" = "hlthplan"
  , "Exercised in Past Month" = "exerany"
  , "Smoked 100 Cigarettes" = "smoke100"
  , "Gender" = "gender"
)

x_names <- c(
  "Actual Weight in Pounds" = "weight"
  , "Desired Weight in Pounds" = "wtdesire"
  , "Height in Inches" = "height"
)

# Define server logic required to draw a histogram ----
server <- function(input, output) {

#PLOT OUTPUT   
  output$plot <- renderPlot({
    
    #other option 
    #var_hist <- case_when(input$var_x == "Actual Weight" ~ pull(cdc, weight))
    
    #x_labe. <- case_when(input$var_x == "Actual Weight" ~ "Actual Weight in Pounds")
    
    fill <- cdc[[input$fill_var]]
    fill_name <- names(fill_names)[fill_names == input$fill_var]
    
    
    x <- cdc[[input$x_var]]
    x_name <- names(x_names)[x_names == input$x_var]
    
    
      base_plot <- ggplot(cdc, aes(x)) + 
        scale_fill_discrete(name = NULL) +
        labs(
          x =  x_name 
          , y = "Count"
          , subtitle = fill_name
        ) +
        theme_minimal() +
        theme(
          plot.subtitle = element_text(hjust = 0.5)
          , legend.position = "top"
          , plot.background = element_rect(fill = "grey95", color = NA)
          , panel.grid.major = element_line(color = "grey80")
        ) 
      
      if (input$fill_var == "None") {
        base_plot + geom_histogram(
            fill = "skyblue"
          , color = "black"
          , bins = input$bins
        )
      } else {
        base_plot + geom_histogram(
          aes(fill = fill)
          , color = "black"
          , bins = input$bins
        )
      } #end of if else 
      
  }) #end of renderplot
  
############ END OF PLOT OUTPUT
  
# SUMMARY X OUTPUT
  output$summary_x <- renderPrint({ 
    print(skim(cdc, input$x_var), include_summary = FALSE, strip_metadata = TRUE)
  }) #end of render
  
########### END OF SUMMARY X OUTPUT 
  
  # SUMMARY FILL OUTPUT
  output$summary_fill <- renderPrint({
    if (input$fill_var == "None") {
      X <- NULL 
    } else {
    print(skim(cdc, input$fill_var), include_summary = FALSE, strip_metadata = TRUE)
    } #end of if else 
  }) #end of render
  
########### END OF SUMMARY FILL  OUTPUT 
  
} #end of server 

shinyApp(ui = ui, server = server)

# rsconnect::deployApp('Eichlersmith_Martha_L12/CDC_plot', account = "mareichler-nw")

This app can be accessed at mareichler-nw.shinyapps.io/cdc_plot/ ; it is also embedded below.

#{r, out.width = "100%"}
knitr::include_app("https://mareichler-nw.shinyapps.io/cdc_plot/", height = "600px")

L13 Shiny 3

Exercises

Exercise 1

Work through lesson 5 and lesson 6 which both contain an exercise for practice.

Lesson 5

#lesson 5
library(maps)
library(mapproj)
source("helpers.R")
counties <- readRDS("data/counties.rds")

# User interface ----
ui <- fluidPage(
    titlePanel("censusVis"),
    
    sidebarLayout(
        sidebarPanel(
            helpText("Create demographic maps with 
        information from the 2010 US Census."),
            
            selectInput("var", 
                        label = "Choose a variable to display",
                        choices = c("Percent White", "Percent Black",
                                    "Percent Hispanic", "Percent Asian"),
                        selected = "Percent White"),
            
            sliderInput("range", 
                        label = "Range of interest:",
                        min = 0, max = 100, value = c(0, 100))
        ),
        
        mainPanel(plotOutput("map"))
    )
)

# Server logic ----
server <- function(input, output) {
    output$map <- renderPlot({
        args <- switch(input$var,
                       "Percent White" = list(counties$white, "darkgreen", "% White"),
                       "Percent Black" = list(counties$black, "black", "% Black"),
                       "Percent Hispanic" = list(counties$hispanic, "darkorange", "% Hispanic"),
                       "Percent Asian" = list(counties$asian, "darkviolet", "% Asian"))
        
        args$min <- input$range[1]
        args$max <- input$range[2]
        
        do.call(percent_map, args)
    })
}

# Run app ----
shinyApp(ui, server)

#lesson 5 helper function 
# Note: percent map is designed to work with the counties data set
# It may not work correctly with other data sets if their row order does 
# not exactly match the order in which the maps package plots counties
percent_map <- function(var, color, legend.title, min = 0, max = 100) {

  # generate vector of fill colors for map
  shades <- colorRampPalette(c("white", color))(100)
  
  # constrain gradient to percents that occur between min and max
  var <- pmax(var, min)
  var <- pmin(var, max)
  percents <- as.integer(cut(var, 100, 
    include.lowest = TRUE, ordered = TRUE))
  fills <- shades[percents]

  # plot choropleth map
  map("county", fill = TRUE, col = fills, 
    resolution = 0, lty = 0, projection = "polyconic", 
    myborder = 0, mar = c(0,0,0,0))
  
  # overlay state borders
  map("state", col = "white", fill = FALSE, add = TRUE,
    lty = 1, lwd = 1, projection = "polyconic", 
    myborder = 0, mar = c(0,0,0,0))
  
  # add a legend
  inc <- (max - min) / 4
  legend.text <- c(paste0(min, " % or less"),
    paste0(min + inc, " %"),
    paste0(min + 2 * inc, " %"),
    paste0(min + 3 * inc, " %"),
    paste0(max, " % or more"))
  
  legend("bottomleft", 
    legend = legend.text, 
    fill = shades[c(1, 25, 50, 75, 100)], 
    title = legend.title)
}

Lesson 6

#lesson 6

# Load packages ----
library(shiny)
library(quantmod)

# Source helpers ----
source("helpers.R")

# User interface ----
ui <- fluidPage(
    titlePanel("stockVis"),
    
    sidebarLayout(
        sidebarPanel(
            helpText("Select a stock to examine.

        Information will be collected from Yahoo finance."),
            textInput("symb", "Symbol", "SPY"),
            
            dateRangeInput("dates",
                           "Date range",
                           start = "2013-01-01",
                           end = as.character(Sys.Date())),
            
            br(),
            br(),
            
            checkboxInput("log", "Plot y axis on log scale",
                          value = FALSE),
            
            checkboxInput("adjust",
                          "Adjust prices for inflation", value = FALSE)
        ),
        
        mainPanel(plotOutput("plot"))
    )
)

# Server logic
server <- function(input, output) {
    
    dataInput <- reactive({  
        getSymbols(input$symb, src = "yahoo",
                   from = input$dates[1],
                   to = input$dates[2],
                   auto.assign = FALSE)
    })
    
    finalInput <- reactive({
        if (!input$adjust) return(dataInput())
        adjust(dataInput())
    })
    
    output$plot <- renderPlot({
        chartSeries(finalInput(), theme = chartTheme("white"),
                    type = "line", log.scale = input$log, TA = NULL)
    })
}

# Run the app
shinyApp(ui, server)

#lesson 6 helper 

if (!exists(".inflation")) {
  .inflation <- getSymbols('CPIAUCNS', src = 'FRED', 
     auto.assign = FALSE)
}  

# adjusts Google finance data with the monthly consumer price index 
# values provided by the Federal Reserve of St. Louis
# historical prices are returned in present values 
adjust <- function(data) {

      latestcpi <- last(.inflation)[[1]]
      inf.latest <- time(last(.inflation))
      months <- split(data)               
      
      adjust_month <- function(month) {               
        date <- substr(min(time(month[1]), inf.latest), 1, 7)
        coredata(month) * latestcpi / .inflation[date][[1]]
      }
      
      adjs <- lapply(months, adjust_month)
      adj <- do.call("rbind", adjs)
      axts <- xts(adj, order.by = time(data))
      axts[ , 5] <- Vo(data)
      axts
}

Exercise 2

Create a copy of lesson 5 and use this to create a more complex app using ggplot.
This app can be accessed at https://amkuyper.shinyapps.io/census-app-ggplot2/

#census app 
library(shiny)
library(maps)
library(tidyverse)
library(tools) #for toTitleCase()
library(statebins)
source("func_statemap.R")
source("func_countymap.R")
source("func_binmap.R")

counties <- readRDS("data/counties.rds")  %>%
    separate(name, c("state", "county"), ",")

counties_map <- map_data("county") %>% 
    select(long, lat, group, county = subregion, state = region) %>% 
    left_join(., counties, by = c("county" = "county",  "state" = "state"))

bin_data <- counties %>% 
    mutate(
        white.pop = total.pop*white
        , black.pop = total.pop*black
        , hispanic.pop = total.pop*hispanic
        , asian.pop = total.pop*asian
    ) %>% 
    select(state, total.pop, white.pop, black.pop, hispanic.pop, asian.pop) %>%
    drop_na() %>%
    group_by(state) %>%
    summarize_all(sum) %>%
    mutate(
        white = white.pop / total.pop
        , black = black.pop / total.pop
        , hispanic = hispanic.pop / total.pop
        , asian = asian.pop / total.pop
        , state = toTitleCase(state)
    ) %>%
    select(state, white, black, hispanic, asian)

choices <- c(state.name, "Contiguous 48 States", "Contiguous 48 States, Counties")

# User interface ----
ui <- fluidPage(
    titlePanel("censusVis"),
    
    sidebarLayout(
        sidebarPanel(
            helpText("Create demographic maps with 
        information from the 2010 US Census."),
            
            selectInput("area", label = "Select a state or the contiguous 48 states to display", 
                        choices = choices, selected = choices[51]),
            
            selectInput("var", 
                        label = "Choose a variable to display",
                        choices = c("Percent White", "Percent Black",
                                    "Percent Hispanic", "Percent Asian"),
                        selected = "Percent White"),
            
            sliderInput("range", 
                        label = "Range of interest:",
                        min = 0, max = 100, value = c(0, 100))
        ),
        
        mainPanel(plotOutput("map"))
    )
)

# Server logic ----
server <- function(input, output) {
    
    dataInput <- reactive({
        if (input$area %in% state.name) { 
            counties_map %>%
            filter(state == tolower(input$area))
        }
        else if (input$area == choices[52]) {counties_map}
        else {bin_data}
    }) 
    
    genArgs <- reactive({
        
        args <- switch(input$var,
                       "Percent White" = list(dataInput()$white, "darkgreen", "% White"),
                       "Percent Black" = list(dataInput()$black, "dodgerblue", "% Black"),
                       "Percent Hispanic" = list(dataInput()$hispanic, "darkorange", "% Hispanic"),
                       "Percent Asian" = list(dataInput()$asian, "darkviolet", "% Asian"))
        args$perc_min <- input$range[1]
        args$perc_max <- input$range[2]
        args$area_name <- input$area
        args$map_data <- dataInput() 
        args
    })
    
    output$map <- renderPlot({
        if ( input$area %in% state.name ) { do.call(state_map, genArgs()) } 
        else if (input$area == choices[52]) {do.call(county_map, genArgs())}
        else {do.call(bin_map, genArgs())}
    })
}

# Run app ----
shinyApp(ui, server)

## rsconnect::deployApp('Eichlersmith_Martha_L13/census-app', account = "mareichler-nw")

#func_statemap
#function to map a given state by county 
state_map <- function(map_data, fill_var, fill_color, legend_name, perc_min, perc_max, area_name){

  perc_breaks <- seq(perc_min, perc_max, (perc_max-perc_min)/4)
  perc_labels <- paste(perc_breaks, "%", sep = "")
  if(perc_max < 100){perc_labels[5] <- paste(perc_labels[5], "or more")}
  
  
  ggplot(map_data, aes(x = long, y = lat)) +
    geom_polygon(aes(group = group, fill = fill_var)) +
    ggtitle(area_name) + 
    borders("state", area_name, colour = "black") +
    scale_fill_gradient(
      name = legend_name
      , high = fill_color
      , low = "white"
      , na.value = "grey50"
      , limits = c(perc_min, perc_max)
      , breaks = perc_breaks 
      , labels = perc_labels 
      , oob = scales::squish
    ) +
    coord_quickmap() +
    theme_void() +
    theme(
        plot.title = element_text(size = 30, hjust = 0.5)
      , legend.title = element_text(size = 20)
      , legend.text = element_text(size = 14)
      , legend.key.width = unit(.8, "cm")
      , legend.key.height = unit(.8, "cm")
    )
}

#func_countymap
#map of whole 48 states by county
county_map <- function(map_data, fill_var, fill_color, legend_name, perc_min, perc_max, area_name){

  perc_breaks <- seq(perc_min, perc_max, (perc_max-perc_min)/4)
  perc_labels <- paste(perc_breaks, "%", sep = "")
  if(perc_max < 100){perc_labels[5] <- paste(perc_labels[5], "or more")}
  
  ggplot(map_data, aes(x = long, y = lat)) +
    geom_polygon(aes(group = group, fill = fill_var)) +
    borders("state", colour = "white") +
    ggtitle(area_name) + 
    scale_fill_gradient(
      name = legend_name
      , high = fill_color
      , low = "white"
      , na.value = "grey50"
      , limits = c(perc_min, perc_max)
      , breaks = perc_breaks 
      , labels = perc_labels 
      , oob = scales::squish
    ) +
    coord_map("conic", lat0=30) +
    theme_void() +
    theme(
      plot.title = element_text(size = 30, hjust = 0.5)
      , legend.title = element_text(size = 20)
      , legend.text = element_text(size = 14)
      , legend.key.width = unit(.8, "cm")
      , legend.key.height = unit(.8, "cm")
    )
}

#func_binmap 
#create state bin chlo. map 
bin_map <- function(map_data, fill_var, fill_color, legend_name, perc_min, perc_max, area_name){
  
  perc_breaks <- seq(perc_min, perc_max, (perc_max-perc_min)/4)
  perc_labels <- paste(perc_breaks, "%", sep = "")
  if(perc_max < 100){perc_labels[5] <- paste(perc_labels[5], "or more")}
  
  ggplot(map_data, aes(state = state, fill = fill_var)) +
    geom_statebins(border_size = 0.5) + 
    ggtitle(area_name) + 
    scale_fill_gradient(
      name = legend_name
      , high = fill_color
      , low = "white"
      , na.value = "grey50"
      , limits = c(perc_min, perc_max)
      , breaks = perc_breaks 
      , labels = perc_labels 
      , oob = scales::squish
      , guide = guide_legend(reverse = TRUE, override.aes = list(color = "black"))
    ) +
    theme_statebins() +
    coord_equal() + 
    theme(
      plot.title = element_text(size = 30, hjust = 0.5)
      , legend.position = "right"
      , legend.title = element_text(size = 20)
      , legend.text = element_text(size = 14)
      , legend.key.height = unit(.8, "cm")
      , legend.key.width = unit(.8, "cm")
    )

}

Data Visualization Class Work

Spring 2020

Martha Eichlersmith

L01 ggplot overview

Overview

Dataset

Exercise 1

1a Inspect the data

1b Price per year

1c Pieces per year

All Years

Data after 1985

1d LEGO set themes

L02 aesthetics

Overview

Datasets

Exercises

Exercise 1

Exercise 1.1

Exercise 1.2

Exercise 2

L03 categories & geom_smooth

Overview

Datasets

Exercises

Exercise 1

Plot 1

Plot 2

Plot 3

Exercise 2

Plot 1

Plot 2

Plot 3

Exercise 3

Plot 1

Plot 2

Plot 3

Plot 4

Plot 5

Plot 6

L04 geom_text & annotation

Overview

Datasets

Exercises

Exercise 1

Exercise 2

Plot 1

Plot 2

Exercise 3

Exercise 4

L05 2d surface & geospatial

Overview

Datasets

Exercises

Exercise 1

Exercise 2

Plot 1

Plot 2

Exercise 3

Exercise 4

Plot 1

Plot 2

Challenge(s)

Plot 1: Manhatten and Household Median Income

Plot 2: Twin Cities and Percentage of Renter-occupied Units

L06 errorsbars & layers

Overview

Datasets

Exercises

Exercise 1

Exercise 2

Exercise 3

Exercise 4

L07 scales, axes, & legends

Overview

Datasets

Exercises

Exercise 1

Exercise 2

Exercise 3