6 Data manipulation with dplyr

Let’s talk about data manipulation. We’ll be using the dplyr package, which is part of tidyverse. You can find a quick reference page under Help -> Cheatsheets in RStudio. First, we need to load our packages & data.

library(tidyverse) 
library(cowplot)
theme_set(theme_cowplot())
lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this data

6.1 The Pipe (|>)

The pipe ( |> ) operator strings functions together in a sequence. It takes the result of the function on its left and makes it the first argument to the function on the right. Let’s say you wanted to calculate the base-12 log of the mean of the square root of the absolute value of numbers between -50 and 50. The traditional way to write that would be:

log(mean(sqrt(abs(-50:50))), base = 12)
## [1] 0.6256332

This is rather difficult to read; it has a lot of nested parentheses, and you need to start from the inside and work your way out to see what’s happening. With the pipe, you could re-write it like this:

-50:50 |> abs() |> sqrt() |> 
  mean() |> log(base = 12)
## [1] 0.6256332

Using pipes can make your code much clearer, and is quite helpful when creating a sequence of related transformations on data.

In RStudio, you can insert the pipe by pressing Ctrl+Shift+M.

When looking for help online, you may also find a different version of the pipe (%>%); this is an older version of the pipe that is part of the tidyverse. It works similarly to |> but has a number of technical drawbacks. In most cases, you can use them interchangeably.

6.2 Adding/modifying columns (mutate)

The mutate() function creates a new column in a data frame. For example, the total length of a lizard is defined as its snout-vent length (SVL) plus it’s tail length.

mutate(.data = lizards, total_length = SVL + Tail) |> 
  View() # Use View to look at the results in RStudio
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type total_length
A Green 14.3 6.46 8 164 61.8 43.9 Other 105.7
A Brown 12.3 5.82 18 151 57.1 42.2 Tree 99.3
A Blue 10.5 4.29 36 130 49.1 25.0 Building 74.1
A Brown 10.3 5.29 31 131 51.2 38.2 Tree 89.4
A Brown 10.9 5.69 20 138 51.5 46.9 Shrub 98.4
A Brown 10.4 5.84 25 137 45.3 59.0 Shrub 104.3
A Green 11.1 5.91 7 138 49.7 47.6 Building 97.3
A Brown 10.0 5.09 20 141 48.0 34.9 Tree 82.9
A Brown 12.3 7.20 19 129 54.9 61.0 Tree 115.9
A Brown 11.2 6.66 15 134 52.7 50.4 Tree 103.1

This uses the lizards data to create a new column, total_length. Within the mutate command, you can reference columns directly by their names (like you do for aes() in ggplot). Mutate can create multiple new columns in a single command.

mutate(lizards, # generally, the .data argument is not named
       # All subsequent arguments refer to new columns
       total_length = SVL + Tail,
       rel_limb = Limb/SVL,
       log_total_length = log(total_length),
       # You can also change an existing column by saving something to its name
       Color_morph = paste(Color_morph, "morph") # add "morph" after each color
       ) |> View()
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type total_length rel_limb log_total_length
A Green morph 14.3 6.46 8 164 61.8 43.9 Other 105.7 0.2313916 4.660605
A Brown morph 12.3 5.82 18 151 57.1 42.2 Tree 99.3 0.2154116 4.598146
A Blue morph 10.5 4.29 36 130 49.1 25.0 Building 74.1 0.2138493 4.305415
A Brown morph 10.3 5.29 31 131 51.2 38.2 Tree 89.4 0.2011719 4.493121
A Brown morph 10.9 5.69 20 138 51.5 46.9 Shrub 98.4 0.2116505 4.589041
A Brown morph 10.4 5.84 25 137 45.3 59.0 Shrub 104.3 0.2295806 4.647271
A Green morph 11.1 5.91 7 138 49.7 47.6 Building 97.3 0.2233400 4.577799
A Brown morph 10.0 5.09 20 141 48.0 34.9 Tree 82.9 0.2083333 4.417635
A Brown morph 12.3 7.20 19 129 54.9 61.0 Tree 115.9 0.2240437 4.752728
A Brown morph 11.2 6.66 15 134 52.7 50.4 Tree 103.1 0.2125237 4.635699

Note that this doesn’t modify the lizards dataset. It creates a new data frame that has an additional column. You’ll need to save it as a new variable to use it.

lizards_full = lizards |> # It's also traditional to pipe the data argument in
  mutate(total_length = SVL + Tail,
         rel_limb = Limb/SVL,
         log_total_length = log(total_length)) 

Tidyverse functions are designed to be piped together. For example:

lizards |> 
  mutate(Total_length = SVL + Tail) |> 
  ggplot(aes(x = Site, y = Total_length)) +
  geom_boxplot()

Creating a quick plot at the end of a data manipulation step can be a good way to get a visual idea of what you’re doing.

Here are a few other helpful things to do with mutate():

lizards |> mutate(
  intercept = 1, # Add a constant
  row_number = 1:n() # the n() function tells you how many rows are 
    # in the current data frame (it only works in mutate & related functions)
) |>  View()
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type intercept row_number
A Green 14.3 6.46 8 164 61.8 43.9 Other 1 1
A Brown 12.3 5.82 18 151 57.1 42.2 Tree 1 2
A Blue 10.5 4.29 36 130 49.1 25.0 Building 1 3
A Brown 10.3 5.29 31 131 51.2 38.2 Tree 1 4
A Brown 10.9 5.69 20 138 51.5 46.9 Shrub 1 5
A Brown 10.4 5.84 25 137 45.3 59.0 Shrub 1 6
A Green 11.1 5.91 7 138 49.7 47.6 Building 1 7
A Brown 10.0 5.09 20 141 48.0 34.9 Tree 1 8
A Brown 12.3 7.20 19 129 54.9 61.0 Tree 1 9
A Brown 11.2 6.66 15 134 52.7 50.4 Tree 1 10

Here are a few exercises to try:

  1. Add a column to the lizards dataset that gives the lizard’s height relative to the maximum height of any lizard (hint: use max(Height) in a mutate command to find that value).
  2. Calculate perch circumference (Diameter * pi), then pipe that result into a scatter plot of relative limb length vs. circumference. Note that pi is a pre-defined variable in R.

6.3 Subsetting by row (filter)

Let’s define “large lizards” as:

lizards |> 
  mutate(large = SVL > 60) |> 
  View()
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type large
A Green 14.3 6.46 8 164 61.8 43.9 Other TRUE
A Brown 12.3 5.82 18 151 57.1 42.2 Tree FALSE
A Blue 10.5 4.29 36 130 49.1 25.0 Building FALSE
A Brown 10.3 5.29 31 131 51.2 38.2 Tree FALSE
A Brown 10.9 5.69 20 138 51.5 46.9 Shrub FALSE
A Brown 10.4 5.84 25 137 45.3 59.0 Shrub FALSE
A Green 11.1 5.91 7 138 49.7 47.6 Building FALSE
A Brown 10.0 5.09 20 141 48.0 34.9 Tree FALSE
A Brown 12.3 7.20 19 129 54.9 61.0 Tree FALSE
A Brown 11.2 6.66 15 134 52.7 50.4 Tree FALSE

The large column is a logical vector, with TRUE & FALSE values. We can use logical vectors to get a subset of the data frame where the vector is TRUE with the filter() function.

lizards |>
  mutate(large = SVL > 60) |> 
  filter(large) |> 
  View()
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type large
A Green 14.3 6.46 8 164 61.8 43.9 Other TRUE
B Brown 13.7 7.57 23 162 67.0 73.8 Building TRUE
B Green 14.4 7.75 8 174 68.8 65.4 Tree TRUE
B Blue 13.8 7.61 35 163 63.5 69.7 Building TRUE
B Green 15.2 8.27 14 188 67.0 50.6 Tree TRUE
B Green 14.1 6.73 17 176 64.8 49.0 Tree TRUE
B Blue 14.2 8.43 29 177 69.6 85.3 Tree TRUE
B Green 14.3 13.30 9 179 68.3 74.0 Tree TRUE
B Blue 14.4 7.71 33 182 65.5 59.6 Shrub TRUE
B Brown 14.1 7.95 32 178 70.1 69.0 Tree TRUE

Note that only TRUE values of large remain the the data frame. It’s not actually necessary to create a column before filtering:

lizards |> filter(SVL > 60) |> View()
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type
A Green 14.3 6.46 8 164 61.8 43.9 Other
B Brown 13.7 7.57 23 162 67.0 73.8 Building
B Green 14.4 7.75 8 174 68.8 65.4 Tree
B Blue 13.8 7.61 35 163 63.5 69.7 Building
B Green 15.2 8.27 14 188 67.0 50.6 Tree
B Green 14.1 6.73 17 176 64.8 49.0 Tree
B Blue 14.2 8.43 29 177 69.6 85.3 Tree
B Green 14.3 13.30 9 179 68.3 74.0 Tree
B Blue 14.4 7.71 33 182 65.5 59.6 Shrub
B Brown 14.1 7.95 32 178 70.1 69.0 Tree

The filter() command returns every row where its logical conditions are TRUE. Conditional statements that create logical statements include the following:

  • x == y: TRUE if x equals y (This is not the same as x = y)!
  • x != y: TRUE if x does not equal y
  • x > y: x >= y;
  • x < y: x <= y;
  • between(x, y, z): TRUE if x >= y AND x <= z
  • is.na(x): TRUE if x is a missing value (NA)
  • x %in% y: TRUE if x is an element in y

Note that all of these (except for x %in% y) are vectorized.

c(1, 2) == c(2, 3) - 1
## [1] TRUE TRUE
c(1, 2, 3) == c(1, 4, 9)
## [1]  TRUE FALSE FALSE
c(1, 2) == c(2, 1) # positions don't match
## [1] FALSE FALSE
c(1, 2, 3, 4, 5) %in% c(1, 2) # for %in%, position only matters for the left argument
## [1]  TRUE  TRUE FALSE FALSE FALSE

You can also combine and modify logical values:

  • x & y: returns TRUE if both x and y are TRUE
  • x | y: returns TRUE if either x or y are TRUE
  • !x: returns the opposite of x; this one is particularly useful to combine with other logical functions; for example filter(data, !is.na(x)) will return all rows where x is not a missing value.

If you give filter() more than one condition, it applies all of them by combining them with &.

lizards |> 
  filter(Color_morph == "Blue",
         Site %in% c("A", "B", "C", "D", "E")) |> 
  ggplot(aes(x = SVL, y = Height, color = Site)) + geom_point()

Exercises:

  1. Print a dataframe that shows only the lizards higher than 150 cm. How many are there (the console printout should tell you).
  2. How many lizards perching on trees or shrubs are not brown? Visualize the height to diameter relationship between them. Hint: you can use either %in% or a combination of == and | to meet the first condition.

6.4 Subsetting by column (select)

You can subset certain columns with the select() function. There are several ways to do this. The simplest is by name:

lizards |> select(Site, Color_morph, SVL) |> View()
Site Color_morph SVL
A Green 61.8
A Brown 57.1
A Blue 49.1
A Brown 51.2
A Brown 51.5
A Brown 45.3
A Green 49.7
A Brown 48.0
A Brown 54.9
A Brown 52.7

You can also select by position:

lizards |> select(1, 2, 7) |> View()
Site Color_morph SVL
A Green 61.8
A Brown 57.1
A Blue 49.1
A Brown 51.2
A Brown 51.5
A Brown 45.3
A Green 49.7
A Brown 48.0
A Brown 54.9
A Brown 52.7

This is more useful for ranges of values:

lizards |> select(1:4, 7) |> View()
Site Color_morph Limb Mass SVL
A Green 14.3 6.46 61.8
A Brown 12.3 5.82 57.1
A Blue 10.5 4.29 49.1
A Brown 10.3 5.29 51.2
A Brown 10.9 5.69 51.5
A Brown 10.4 5.84 45.3
A Green 11.1 5.91 49.7
A Brown 10.0 5.09 48.0
A Brown 12.3 7.20 54.9
A Brown 11.2 6.66 52.7

You can also use character vectors:

lizards |> select("Site", "Color_morph", "SVL") |> View()
Site Color_morph SVL
A Green 61.8
A Brown 57.1
A Blue 49.1
A Brown 51.2
A Brown 51.5
A Brown 45.3
A Green 49.7
A Brown 48.0
A Brown 54.9
A Brown 52.7

Note that if you want to use a variable that has column names saved as a character vector, you’ll need to use a helper function (all_of) to tell select that you want to look for the contents of the variable, not the name of the variable:

select_vars = c("Site", "Color_morph", "SVL")
lizards |> 
  select(all_of(select_vars)) |> # without all_of, it would try to look for a column called "select_vars"
  View()
Site Color_morph SVL
A Green 61.8
A Brown 57.1
A Blue 49.1
A Brown 51.2
A Brown 51.5
A Brown 45.3
A Green 49.7
A Brown 48.0
A Brown 54.9
A Brown 52.7

You can remove columns by using a negative sign. (Note that negative signs are ignored if you have any names without negative signs).

lizards |> select(-Color_morph, -Limb) |> View()
Site Mass Diameter Height SVL Tail Perch_type
A 6.46 8 164 61.8 43.9 Other
A 5.82 18 151 57.1 42.2 Tree
A 4.29 36 130 49.1 25.0 Building
A 5.29 31 131 51.2 38.2 Tree
A 5.69 20 138 51.5 46.9 Shrub
A 5.84 25 137 45.3 59.0 Shrub
A 5.91 7 138 49.7 47.6 Building
A 5.09 20 141 48.0 34.9 Tree
A 7.20 19 129 54.9 61.0 Tree
A 6.66 15 134 52.7 50.4 Tree
lizards |> select(-(1:5)) |> View()
Height SVL Tail Perch_type
164 61.8 43.9 Other
151 57.1 42.2 Tree
130 49.1 25.0 Building
131 51.2 38.2 Tree
138 51.5 46.9 Shrub
137 45.3 59.0 Shrub
138 49.7 47.6 Building
141 48.0 34.9 Tree
129 54.9 61.0 Tree
134 52.7 50.4 Tree

You can also use the where helper function to select columns based on their characteristics. For example, the is.numeric function returns TRUE if its argument is a number; you can use it to select all numeric columns.

lizards |> 
  select(where(is.numeric)) |> 
  View()
Limb Mass Diameter Height SVL Tail
14.3 6.46 8 164 61.8 43.9
12.3 5.82 18 151 57.1 42.2
10.5 4.29 36 130 49.1 25.0
10.3 5.29 31 131 51.2 38.2
10.9 5.69 20 138 51.5 46.9
10.4 5.84 25 137 45.3 59.0
11.1 5.91 7 138 49.7 47.6
10.0 5.09 20 141 48.0 34.9
12.3 7.20 19 129 54.9 61.0
11.2 6.66 15 134 52.7 50.4

You could do the same for text or logical vectors with is.character or is.logical, respectively. Note that there aren’t parentheses after is.numeric in the above code. that’s because we aren’t calling it on any particular value; instead, the where function calls it on every column of the data frame, and we’re just telling it what function to use.

There’s a lot more you can do with this if you want to get fancy; the documentation is available at ?tidyselect::language.

6.5 Sorting by columns (arrange)

You can use arrange() to sort by one or more column values. To sort lizards from lowest to highest mass:

lizards |> arrange(Mass) |> View()
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type
A Blue 10.5 4.29 36 130 49.1 25.0 Building
L Brown 9.5 4.33 22 209 40.5 34.7 Other
H Blue 9.4 4.42 28 161 40.3 41.5 Tree
H Green 9.3 4.82 14 150 44.5 43.7 Tree
L Blue 9.2 4.82 31 191 42.6 44.5 Tree
L Blue 10.5 4.82 31 215 44.6 39.9 Shrub
L Green 10.2 4.85 15 214 45.2 35.6 Tree
K Green 9.7 4.89 18 58 47.7 23.7 Other
H Blue 10.4 4.96 37 174 43.3 40.0 Shrub
H Green 10.8 4.99 6 166 47.8 33.0 Tree

If you wish to sort from highest to lowest, use the desc() helper function:

lizards |> arrange(desc(Mass)) |> View()
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type
B Green 14.3 13.30 9 179 68.3 74.0 Tree
J Brown 16.7 11.83 31 145 80.4 69.2 Other
N Brown 15.8 10.13 25 130 73.2 90.0 Tree
M Brown 11.9 9.79 24 210 56.0 70.7 Tree
P Blue 15.4 9.74 41 221 76.3 95.7 Tree
O Brown 16.4 9.63 24 217 74.3 65.7 Tree
P Brown 15.4 9.58 29 216 70.4 81.6 Tree
N Blue 14.9 9.54 38 128 68.4 67.5 Tree
J Green 17.2 9.35 0 166 79.1 79.9 Tree
O Blue 13.9 9.35 27 203 67.8 77.8 Shrub

When you have categorical variables, you’ll often have ties:

lizards |> arrange(Site) |> View()
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type
A Green 14.3 6.46 8 164 61.8 43.9 Other
A Brown 12.3 5.82 18 151 57.1 42.2 Tree
A Blue 10.5 4.29 36 130 49.1 25.0 Building
A Brown 10.3 5.29 31 131 51.2 38.2 Tree
A Brown 10.9 5.69 20 138 51.5 46.9 Shrub
A Brown 10.4 5.84 25 137 45.3 59.0 Shrub
A Green 11.1 5.91 7 138 49.7 47.6 Building
A Brown 10.0 5.09 20 141 48.0 34.9 Tree
A Brown 12.3 7.20 19 129 54.9 61.0 Tree
A Brown 11.2 6.66 15 134 52.7 50.4 Tree

In this case, it’s helpful to sort by multiple variables; the following code orders by Site, then by color morph within site, then by SVL.

lizards |> arrange(Site, Color_morph, SVL) |> View()
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type
A Blue 10.6 5.53 36 134 46.1 50.3 Other
A Blue 10.9 5.73 29 146 47.0 56.5 Other
A Blue 10.1 5.40 27 133 48.2 45.7 Tree
A Blue 10.5 4.29 36 130 49.1 25.0 Building
A Blue 10.8 6.02 30 148 50.1 57.9 Tree
A Blue 11.4 5.72 32 134 50.7 48.8 Tree
A Blue 12.1 5.67 35 150 50.8 51.7 Other
A Blue 10.8 5.27 30 141 51.2 43.3 Tree
A Blue 12.5 6.18 28 140 52.9 48.4 Tree
A Blue 11.0 6.68 31 143 53.8 49.5 Shrub

One particularly useful thing you can do with this is create rankings.

lizards |> 
  arrange(desc(SVL)) |> 
  mutate(size_rank = 1:n()) |>
  View()
Site Color_morph Limb Mass Diameter Height SVL Tail Perch_type size_rank
O Blue 18.7 9.33 43 230 84.5 88.9 Tree 1
J Green 18.2 8.65 2 169 84.3 76.3 Tree 2
J Brown 16.4 9.21 30 147 81.0 86.8 Shrub 3
J Brown 16.7 11.83 31 145 80.4 69.2 Other 4
O Brown 17.3 8.04 23 223 80.2 56.6 Other 5
R Green 17.1 8.83 6 231 80.0 71.1 Tree 6
O Blue 16.1 8.50 36 221 79.3 73.8 Tree 7
J Green 17.2 9.35 0 166 79.1 79.9 Tree 8
J Blue 17.0 8.31 41 164 78.6 74.4 Tree 9
R Brown 17.1 8.30 23 217 78.5 50.3 Building 10

6.6 Summarizing data

Summarize is like mutate, but it generally produces columns that are shorter than the input. It’s typically used for summary stats. For example, this calculates several characteristics of SVL.

lizards |> 
  summarize(mean_SVL = mean(SVL), 
            sd_SVL = sd(SVL),
            med_SVL = median(SVL), 
            count = n()) |> 
  View()
mean_SVL sd_SVL med_SVL count
60.75282 8.708924 59.6 657

A useful trick for summarize is to take the mean of a logical vector; TRUE and FALSE are interpreted as 1 and 0, so this gives you a frequency. For example, if you wanted to get the proportion of color morphs:

lizards |> 
  summarize(freq_Blue = mean(Color_morph == "Blue"),
            freq_Brown = mean(Color_morph == "Brown"),
            freq_Green = mean(Color_morph == "Green")
            )
## # A tibble: 1 × 3
##   freq_Blue freq_Brown freq_Green
##       <dbl>      <dbl>      <dbl>
## 1     0.327      0.336      0.336

You can use the across helper function to apply the same summary function to multiple rows.

lizards |> 
  summarize(
    across(.cols = c(SVL, Tail),
           .fns = mean)
    ) |> View()
SVL Tail
60.75282 57.55205

The .cols argument identifies the columns to use for the summary, using the same methods as select(), the .fns should be one or more functions to apply to each column. If you wish to use more than one summary function, you need to create a named vector:

lizards |> 
  summarize(
    across(.cols = where(is.numeric), # apply to all numeric functions
           .fns = c(Mean = mean, StDev = sd)) # named vector (Mean and StDev)
    ) |> View()
Limb_Mean Limb_StDev Mass_Mean Mass_StDev Diameter_Mean Diameter_StDev Height_Mean Height_StDev SVL_Mean SVL_StDev Tail_Mean Tail_StDev
13.16088 1.83096 7.024368 1.088521 22.73364 10.13449 168.7382 42.69773 60.75282 8.708924 57.55205 12.12895

This applies the functions mean and sd to all numeric columns; The results have the names "Mean" and "StDev" that we gave each function applied to the end of the column.

6.7 Group Operations

The real power of the dplyr package comes from being able to apply all of the above functions to grouped subsets of a data frame. To create a grouped table use the group_by function:

lizards |> group_by(Site)
## # A tibble: 657 × 9
## # Groups:   Site [19]
##    Site  Color_morph  Limb  Mass Diameter Height   SVL  Tail Perch_type
##    <chr> <chr>       <dbl> <dbl>    <dbl>  <dbl> <dbl> <dbl> <chr>     
##  1 A     Green        14.3  6.46        8    164  61.8  43.9 Other     
##  2 A     Brown        12.3  5.82       18    151  57.1  42.2 Tree      
##  3 A     Blue         10.5  4.29       36    130  49.1  25   Building  
##  4 A     Brown        10.3  5.29       31    131  51.2  38.2 Tree      
##  5 A     Brown        10.9  5.69       20    138  51.5  46.9 Shrub     
##  6 A     Brown        10.4  5.84       25    137  45.3  59   Shrub     
##  7 A     Green        11.1  5.91        7    138  49.7  47.6 Building  
##  8 A     Brown        10    5.09       20    141  48    34.9 Tree      
##  9 A     Brown        12.3  7.2        19    129  54.9  61   Tree      
## 10 A     Brown        11.2  6.66       15    134  52.7  50.4 Tree      
## # … with 647 more rows
## # ℹ Use `print(n = ...)` to see more rows

This doesn’t appear to do much on its own; however, look what happens when you combine it with summarize:

lizards |> group_by(Site) |> 
  summarize(mean_SVL = mean(SVL),
            sd_SVL = sd(SVL),
            count = n()) |> 
  View()
Site mean_SVL sd_SVL count
A 51.69615 3.679944 26
B 66.92222 2.182673 36
C 58.09722 5.576250 36
D 70.70000 3.058280 27
E 53.17727 2.776887 44
F 58.77955 2.734583 44
G 63.87308 6.027507 26
H 52.18571 5.101008 42
I 57.04412 4.973639 34
J 74.34857 3.625512 35

This calculates the mean, SD of SVL for each site. You can group by multiple factors

lizards |> group_by(Site, Color_morph) |> 
  summarize(mean_SVL = mean(SVL), 
            count = n()) |> 
  ungroup() |> # Removes grouping; usually a good idea at the end unless you want surprises
  View()
Site Color_morph mean_SVL count
A Blue 49.99000 10
A Brown 52.01538 13
A Green 56.00000 3
B Blue 67.54000 10
B Brown 66.74667 15
B Green 66.60000 11
C Blue 58.96250 8
C Brown 57.03125 16
C Green 58.94167 12
D Blue 70.43750 8

Grouping doesn’t just work with summarize; for example, you can use it with mutate to find the relative height of each lizard within its site:

lizards |> group_by(Site) |> 
  mutate(
    rel_height = Height/max(Height)) |> 
    # max(Height) returns the max height in each site
  ungroup() |> 
  ggplot(aes(x = Site, y = rel_height)) + 
  geom_jitter(width = .2, height = 0)

You can also use this to easily calculate frequencies:

An alternate way to do this would be to combine rsummariseandmutate`.

lizards |> 
  group_by(Site, Color_morph) |> 
  summarize(count = n()) |> 
  # count is the total number of each morph at each site
  group_by(Site) |> 
  # Calculate color morph frequency at each site
  mutate(site_frequency = count / sum(count)) |> 
  View()
Site Color_morph count site_frequency
A Blue 10 0.3846154
A Brown 13 0.5000000
A Green 3 0.1153846
B Blue 10 0.2777778
B Brown 15 0.4166667
B Green 11 0.3055556
C Blue 8 0.2222222
C Brown 16 0.4444444
C Green 12 0.3333333
D Blue 8 0.2962963

Some Exercises:

  1. Visualize the relationship between the maximum height at a site and the average limb length. Use this to help:
 lizards |> 
  # Put your summarize() code here
  # You should name your new columns max_height and mean_limb
    ggplot(aes(x = max_height, y = mean_limb)) + 
    geom_smooth(method = "lm") + geom_point() 
  1. For each site, what’s the mean limb length of the five largest individuals by SVL? What proportion of these individuals is blue?