7 Performing some useful analyses in R
I’m going to give some examples of how to do common analyses you’ll need for this class. I won’t be spending much time on the statistical assumptions or diagnostics.
library(tidyverse)
library(cowplot)
theme_set(theme_cowplot())
lizards <- read_csv("example_data/anoles.csv") # See Appendix A if you don't have this data
7.1 A note on factors
R has two ways of representing textual data: character vectors (also called strings) and factors.
- Character vectors are just text; they have no inherent underlying meaning.
- Factors are a data type with a specific number of levels; they’re often used to represent different experimental treatments. Examples could include {low, medium, high} or {control, treatment}. Each level of a factor is associated with a number.
For the most part, it’s easier and safer to work with character vectors. Most functions we’ll be using know how to convert them when it’s necessary.
One important thing to note is that ggplot arranges character vectors alphabetically on its categorical scales, but orders factors by their level number. Thus, to change the order of categorical x-axes (and other scales), you need to make your categories into a factor. This is done with the fct_inorder, fct_infreq, and related functions, which are part of the forcats package and loaded with tidyverse. The easiest one to use is fct_inorder, which changes the level values to be in the order of your data; when combined with arrange() and other dplyr functions, this is quite flexible and powerful. For more information on these and other factor functions, take a look at the forcats website.
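For instance, here's a minimal sketch (using the lizards data loaded above; the mean_diameter helper column is made up for this example) of combining arrange() and fct_inorder() so a categorical axis follows the group means instead of alphabetical order:
lizards |>
  group_by(Color_morph) |>
  mutate(mean_diameter = mean(Diameter)) |> # per-group mean, used only for sorting
  ungroup() |>
  arrange(mean_diameter) |> # sort rows so groups appear from smallest to largest mean
  mutate(Color_morph = fct_inorder(Color_morph)) |> # levels now follow that sorted order
  ggplot() +
  aes(x = Color_morph, y = Diameter) +
  geom_boxplot() # x-axis is ordered by mean perch diameter instead of alphabetically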
If you want to manually create a factor, you can use the factor() command, which is in base R (no package).
color_levels = c("Red", "Green", "Blue")
# Randomly select 20 colors from color_levels
color_example = sample(color_levels, size = 20, replace = TRUE)
color_example
## [1] "Green" "Blue" "Green" "Green" "Blue" "Green" "Green" "Green" "Blue"
## [10] "Blue" "Red" "Blue" "Green" "Green" "Blue" "Green" "Red" "Blue"
## [19] "Blue" "Blue"
# Convert it into a factor
color_as_factor = factor(color_example, levels = color_levels)
color_as_factor
## [1] Green Blue Green Green Blue Green Green Green Blue Blue Red Blue
## [13] Green Green Blue Green Red Blue Blue Blue
## Levels: Red Green Blue
7.2 Comparing categorical frequencies (Contingency Tables & related analyses)
These tests generally compare the frequencies of count data.
7.2.1 Contingency Tables
The easiest way to make a contingency table is from a data frame where each column is a categorical variable you want in the table & each row is an observation. Let's say we wanted to create a color morph by perch type contingency table with our lizard data.
color_by_perch_tbl <- lizards |>
  select(Color_morph, Perch_type) |> # select only the columns you want
  table() # feed them into the table command
color_by_perch_tbl
## Perch_type
## Color_morph Building Other Shrub Tree
## Blue 17 41 27 130
## Brown 31 25 25 140
## Green 30 25 32 134
If you want to switch your contingency table from counts to frequencies, just divide it by its sum:
color_by_perch_tbl / sum(color_by_perch_tbl)
## Perch_type
## Color_morph Building Other Shrub Tree
## Blue 0.02587519 0.06240487 0.04109589 0.19786910
## Brown 0.04718417 0.03805175 0.03805175 0.21308980
## Green 0.04566210 0.03805175 0.04870624 0.20395738
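As a side note (not part of the original output), base R's prop.table() can compute proportions within rows or columns instead of over the whole table; a quick sketch:
prop.table(color_by_perch_tbl, margin = 1) # margin = 1: each row (color morph) sums to 1; margin = 2 would do columns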
7.2.2 Chi-squared & Fisher’s Exact Tests
Once you have a contingency table, you can test for independence between the rows and columns with a Chi-squared test. This provides you with your test statistic (X-squared), degrees of freedom, and p-value.
chisq.test(color_by_perch_tbl)
##
## Pearson's Chi-squared test
##
## data: color_by_perch_tbl
## X-squared = 11.603, df = 6, p-value = 0.07143
Chi-squared tests assume that there are at least five observations in each cell of the contingency table. If this assumption fails, the resulting values aren't accurate. For example:
site_a_contingency <- lizards |>
  filter(Site == "A") |> # cut down the data size
  select(Color_morph, Perch_type) |> # select only the columns you want
  table() # feed them into the table command
print(site_a_contingency)
## Perch_type
## Color_morph Building Other Shrub Tree
## Blue 1 3 1 5
## Brown 0 0 2 11
## Green 1 1 0 1
chisq.test(site_a_contingency)
## Warning in chisq.test(site_a_contingency): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: site_a_contingency
## X-squared = 9.752, df = 6, p-value = 0.1355
Note the warning that “Chi-squared approximation may be incorrect.” In this case, it's a good idea to run Fisher's exact test, which tests the same null hypothesis but works with low counts. Fisher's test is less powerful than the Chi-squared test, so it should only be used when it's the only option.
fisher.test(site_a_contingency)
##
## Fisher's Exact Test for Count Data
##
## data: site_a_contingency
## p-value = 0.06251
## alternative hypothesis: two.sided
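If you want to see the low expected counts behind that earlier warning, you can pull them out of the chi-squared test object; a quick sketch (site_a_test is a name made up here, and suppressWarnings() just hides the repeated warning):
site_a_test <- suppressWarnings(chisq.test(site_a_contingency))
site_a_test$expected # several cells fall well below 5, which is what triggers the warning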
You can also run a chi-squared or Fisher’s exact test directly on two vectors or data frame columns:
chisq.test(lizards$Color_morph, lizards$Perch_type)
##
## Pearson's Chi-squared test
##
## data: lizards$Color_morph and lizards$Perch_type
## X-squared = 11.603, df = 6, p-value = 0.07143
7.3 Linear Models: Regression, ANOVA, and t-tests
Linear regression and Analysis of Variance (ANOVA) are both special cases of the general linear model (LM), which fits a continuous response (y) to one or more predictors (x). You specify linear models in R with a formula syntax, which generally takes the form response ~ predictor. Combining this formula with the lm() function and a dataset gives you the basis of a linear model.
7.3.1 Regression
Let's say we wanted to see how snout-vent length (SVL) affects mass:
simple_reg <- lm(Mass ~ SVL, data = lizards)
simple_reg
##
## Call:
## lm(formula = Mass ~ SVL, data = lizards)
##
## Coefficients:
## (Intercept) SVL
## 1.62577 0.08886
By default, this creates an lm object; printing it shows the regression coefficients. For a linear regression (continuous response), these give us the regression equation; in this case, for every \(1 \text{ mm}\) increase in SVL, mass increases by \(0.089 \text{ g}\). Note that in this case, it may make sense to re-scale SVL to be in cm so that the coefficient is easier to interpret (e.g., use lizards |> mutate(SVL_cm = SVL/10)). To extract the coefficients directly, use:
coef(simple_reg)
## (Intercept) SVL
## 1.6257700 0.0888617
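As a quick sketch of the re-scaling mentioned above (lizards_cm and SVL_cm are names made up for this example), the slope simply becomes ten times larger when the predictor is in cm:
lizards_cm <- lizards |> mutate(SVL_cm = SVL / 10) # convert mm to cm
coef(lm(Mass ~ SVL_cm, data = lizards_cm)) # the slope is now ~0.89 g per cm of SVL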
For more information, use the summary function:
summary(simple_reg)
##
## Call:
## lm(formula = Mass ~ SVL, data = lizards)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2672 -0.4699 -0.0530 0.3959 5.6050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.625770 0.210781 7.713 4.59e-14 ***
## SVL 0.088862 0.003434 25.874 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7661 on 655 degrees of freedom
## Multiple R-squared: 0.5055, Adjusted R-squared: 0.5047
## F-statistic: 669.5 on 1 and 655 DF, p-value: < 2.2e-16
The most important components of this are the \(R^2\) (which is listed as Multiple R-squared), the standard errors and p-values for each of your coefficients (the Coefficients section), and the overall F-statistic and p-value (the last line).
A quick note on p-values:
- Don’t base all of your interpretation on p-values; the \(R^2\) and adjusted \(R^2\) of a model are more important.
- The overall p-value relates to how the whole model explains the variance in the data; the coefficient p-value relates to whether the specific coefficient is different from zero.
- Coefficient-level p-values tend to be rather fragile, and shouldn’t be used.
7.3.1.1 Plotting linear regressions
For a simple linear regression, ggplot can automatically plot the trendline (a.k.a., the fitted values) and confidence intervals with geom_smooth().
lizards |>
  ggplot() +
  aes(x = SVL, y = Mass) +
  geom_point() +
  geom_smooth(method = "lm", # use a linear regression
              se = TRUE, # include a confidence interval around the line
              level = .95) # the level of the confidence interval; default = 95%
For more complicated models, this approach may not work very well; it can be helpful to calculate the fitted values & confidence intervals from the model object and plot them directly.
You can get these values directly with the predict() function. The code below calculates the trendline and 95% confidence interval for the regression, and adds them to a data frame with the original data.
simple_reg_predictions <- simple_reg |>
  # Predict fitted values with 95% confidence intervals
  predict(interval = "confidence", level = .95) |>
  # The output of predict() is a matrix; that's hard to work with, so...
  as_tibble() # we convert the output of predict() to a data frame (tibble)

simple_reg_plot_data <- lizards |>
  select(SVL, Mass) |> # we only need these columns
  bind_cols(simple_reg_predictions) # adds the columns of simple_reg_predictions to our lizards data
View(simple_reg_plot_data)
SVL | Mass | fit | lwr | upr |
---|---|---|---|---|
61.8 | 6.46 | 7.117423 | 7.058313 | 7.176533 |
57.1 | 5.82 | 6.699773 | 6.636126 | 6.763420 |
49.1 | 4.29 | 5.988879 | 5.890800 | 6.086959 |
51.2 | 5.29 | 6.175489 | 6.088343 | 6.262634 |
51.5 | 5.69 | 6.202147 | 6.116487 | 6.287808 |
45.3 | 5.84 | 5.651205 | 5.531606 | 5.770804 |
49.7 | 5.91 | 6.042196 | 5.947328 | 6.137065 |
48.0 | 5.09 | 5.891132 | 5.787013 | 5.995249 |
54.9 | 7.20 | 6.504277 | 6.433552 | 6.575002 |
52.7 | 6.66 | 6.308782 | 6.228823 | 6.388740 |
To plot this, you'd use a combination of geom_line() for the fitted values and geom_ribbon(), which creates the confidence interval region.
ggplot(simple_reg_plot_data) +
aes(x = SVL) +
geom_point(aes(y = Mass), color = "cornflowerblue") +
geom_ribbon(aes(ymin = lwr, ymax = upr),
fill = alpha("black", .2), # dark fill with 80% transparency (alpha)
color = grey(.4), # dark-ish grey border line
linetype = 2) + # dotted border line
geom_line(aes(y = fit))
You can also use predict() to calculate the predicted value of your response variable given new predictor(s); this is useful for interpolation and extrapolation.
# Create a data frame with new predictors
svl_predictors = tibble(SVL = c(20, 150, 600)) # New predictors
mass_predictions <- predict(simple_reg,
                            newdata = svl_predictors) # The newdata argument is key here;
                                                      # if unspecified, it uses the original data
svl_predictors |>
  mutate(Mass_estimate = mass_predictions)
## # A tibble: 3 × 2
## SVL Mass_estimate
## <dbl> <dbl>
## 1 20 3.40
## 2 150 15.0
## 3 600 54.9
For more information, see the help page ?predict.lm.
7.3.2 ANOVA
Analysis of variance (ANOVA) is a special case of the linear model in which all of the predictors are categorical. However, ANOVAs are usually treated differently from regressions for historical reasons. In R, you fit an ANOVA in the same way as a regression (with the lm() command); however, it's common to use the aov() command on the lm's output to reformat the results into a more traditional style. A simple (one-way) ANOVA follows:
# Fit the model with a regression
simple_anova_lm <- lm(Diameter ~ Color_morph, data = lizards)
# Reformat into traditional ANOVA style; this is usually done with a pipe as part of the previous step
simple_anova <- aov(simple_anova_lm)

# Look at the ANOVA table
simple_anova |> summary()
## Df Sum Sq Mean Sq F value Pr(>F)
## Color_morph 2 50586 25293 985.2 <2e-16 ***
## Residuals 654 16790 26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table gives us estimates of variation explained by the predictor and the residuals. Note how different the output is from the summary of a regression, even though the underlying math is the same:
summary(simple_anova_lm)
##
## Call:
## lm(formula = Diameter ~ Color_morph, data = lizards)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8837 -3.1176 -0.1176 3.1163 18.1163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.8837 0.3456 95.16 <2e-16 ***
## Color_morphBrown -8.7661 0.4854 -18.06 <2e-16 ***
## Color_morphGreen -21.4086 0.4854 -44.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.067 on 654 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.75
## F-statistic: 985.2 on 2 and 654 DF, p-value: < 2.2e-16
Two quick notes about this:
- While \(R^2\) isn't included in the aov() summary, it can be calculated as the sum of squares of your predictors (in this case, Color_morph) divided by the total sum of squares including the residuals; for this example, we have \(50586 / (50586 + 16790) = 0.751\).
- The aov() table provides a line for each predicting factor and the residuals; the lm() summary includes estimates for the intercept and the specific levels of the factors (e.g., Color_morphBrown). These represent the dummy coded variables that R uses under the hood; you don't need to worry about this, but I've got an explanation for them below if you're interested.
Dummy coding: R converts an ANOVA into a (multiple) regression model by changing a categorical predictor with \(n\) levels into \(n-1\) predictor variables that can have values of 0 or 1. The first level of the factor doesn't get a dummy variable; its mean is represented by the regression's intercept. The other dummy coefficients represent the mean difference between the reference level and the level they represent. In our example, the reference category is Blue, so the regression's intercept is the mean Diameter for blue lizards. The average value for brown lizards would be the intercept plus the Color_morphBrown coefficient (and similarly for green). The aov() function reinterprets the dummy coded results as a traditional categorical analysis.
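If you're curious what that dummy coding looks like, here's a minimal sketch using base R's model.matrix(); the first level (Blue) has no column of its own because it's absorbed into the intercept:
model.matrix(~ Color_morph, data = lizards) |> head() # one 0/1 column per non-reference level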
7.3.2.1 ANOVA means & confidence intervals
You can use predict() on the fitted model to get the estimated mean and 95% confidence interval for each group:
# make a data frame with just your predictors (distinct removes duplicates)
color_levels <- lizards |> distinct(Color_morph)
color_levels
## # A tibble: 3 × 1
## Color_morph
## <chr>
## 1 Green
## 2 Brown
## 3 Blue
simple_anova_means <-
  # Calculate means & 95% CI
  predict(simple_anova, newdata = color_levels,
          interval = "confidence", level = 0.95) |>
  as_tibble() |> # Convert to data frame
  bind_cols(color_levels) # add the predictor data as more columns
simple_anova_means
## # A tibble: 3 × 4
## fit lwr upr Color_morph
## <dbl> <dbl> <dbl> <chr>
## 1 11.5 10.8 12.1 Green
## 2 24.1 23.4 24.8 Brown
## 3 32.9 32.2 33.6 Blue
7.3.2.2 Post-hoc tests
In general, the ANOVA tests whether a factor (such as color morph) explains a significant amount of variation in the response. To test whether there are significant differences between specific levels of a factor, you need to run post-hoc tests on your ANOVA. A common post-hoc is Tukey’s HSD (honest significant difference).
simple_anova |> TukeyHSD()
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = simple_anova_lm)
##
## $Color_morph
## diff lwr upr p adj
## Brown-Blue -8.766074 -9.906219 -7.625929 0
## Green-Blue -21.408608 -22.548753 -20.268463 0
## Green-Brown -12.642534 -13.774806 -11.510261 0
You can use the which argument to restrict the comparisons to specific factors, but by default it compares all of them.
A traditional way of summarizing post-hoc tests is to assign one or more letters to significantly different levels, where levels with different letters are significantly different from each other. For example, the letter grouping c("A", "AB", "B", "C")
would indicate that group 1 and group 3 are different from each other, but not from group 2; group 4 is different from everything. In the above example, everything is different, so we could assign the letters A through C.
# We'll store these in a data frame for later use
anova_means_letters <- simple_anova_means |>
  arrange(Color_morph) |> # These will be plotted alphabetically, so we'll sort them that way to keep things easy
  mutate(anova_labels = c("A", "B", "C"))
anova_means_letters
## # A tibble: 3 × 5
## fit lwr upr Color_morph anova_labels
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 32.9 32.2 33.6 Blue A
## 2 24.1 23.4 24.8 Brown B
## 3 11.5 10.8 12.1 Green C
7.3.2.3 Plotting ANOVAs
The basics of plotting ANOVA results were covered in the last part of Section @ref(ggp_discx_conty), but we'll refine them a bit here. The main additions are the use of geom_text() to include the labels above each category and scale_y_continuous(), which we use to customize the y axis a bit so that it doesn't cut off the labels. Be sure to indicate in your figure captions what the letters mean.
library(ggforce) # for geom_sina()
# We're going to add post-hoc letters to this plot
letter_y_position <- max(lizards$Diameter) * 1.05 # Put letters at the top

anova_plot <- ggplot(lizards) +
  aes(x = Color_morph) + # Color_morph column is in both lizards AND anova_means_letters
  # Show the raw data
  geom_sina(aes(y = Diameter), # Diameter column is in the lizards data frame
            color = alpha("blue", .2)) + # semi-transparent points
  geom_errorbar(
    aes(ymin = lwr, ymax = upr), # columns lwr and upr are in anova_means_letters
    data = anova_means_letters, # Use the summary data frame instead of lizards
    color = "black",
    width = .3 # How wide the error bars are
  ) +
  geom_point(aes(y = fit), # fit column is in anova_means_letters
             color = "black", size = 2,
             data = anova_means_letters) +
  geom_text(aes(label = anova_labels), # label is the aesthetic used to indicate text
            y = letter_y_position, # This is a fixed spot, not an aesthetic
            fontface = "bold",
            size = 6, # font size
            data = anova_means_letters) +
  scale_y_continuous(
    # ensure the letters aren't cut off at the top
    limits = c(NA, letter_y_position), # This is the same as ylim()
    breaks = seq(from = 0, to = 50, by = 10), # set axis ticks every 10 points
    name = "Perch Diameter (mm)" # this is the same as ylab()
  ) +
  xlab("Color Morph")

anova_plot
7.3.3 Comparing two means (t-tests)
While it's perfectly valid to use an ANOVA to compare the means of two samples, the t-test is a more traditional approach. You can do this with the t.test() function.
First, let's create a 2-level categorical variable to work with. Normally, this would be done during data collection, but this dataset doesn't have one. Let's define lizards as big (SVL >= 65 mm) or small (SVL <= 55 mm) and remove the intermediates.
lizards_by_size <- lizards |>
  filter(SVL <= 55 | SVL >= 65) |> # Remove the lizards of intermediate size
  mutate(Size_class = if_else(SVL >= 65, "big", "small")) # Define Size_class as big or small, based on SVL
Next, let's use a t-test to see if limb length differs between big and small lizards. First, we should look at the mean and standard deviation of each group. The classic t-test requires that the variances of the two groups are equal; if not, an adjustment (the Welch correction) needs to be applied. In practice, this adjustment has no ill effect when the variances are equal, so you may as well apply it. In any case, it's a good idea to look at data summaries before running a test.
lizard_size_sumary_table <- lizards_by_size |>
  group_by(Size_class) |> # We want to summarize each size class
  summarize(mean = mean(Limb), sd = sd(Limb), var = var(Limb), n = n()) |>
  ungroup()
lizard_size_sumary_table
## # A tibble: 2 × 5
## Size_class mean sd var n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 big 15.1 0.984 0.969 240
## 2 small 11.2 0.800 0.641 206
In this case, the standard deviations are somewhat different, but not dramatically so. In any case, we'll run the test without assuming equal variances.
lizard_t_test <- t.test(Limb ~ Size_class, # formula specifies response (on left) and predictor (on right)
                        data = lizards_by_size,
                        var.equal = FALSE) # Assume unequal variances; this is the default option
print(lizard_t_test)
##
## Welch Two Sample t-test
##
## data: Limb by Size_class
## t = 45.668, df = 442.74, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group big and group small is not equal to 0
## 95 percent confidence interval:
## 3.694599 4.026898
## sample estimates:
## mean in group big mean in group small
## 15.09958 11.23883
When reporting a t-test, you should report t, the degrees of freedom, the p-value, and the effect size (Cohen's d). The effect size isn't provided for you directly, so you'll need to calculate it yourself.
7.3.3.1 Effect size: Cohen’s d
Cohen’s d is essentially a standardized way of measuring the difference between two groups. Conceptually, it’s the number of standard deviations of difference between the two groups. You can calculate it by taking the difference in the group means and dividing by the pooled standard deviation. To get the pooled standard deviation, calculate the weighted average of the variance (weighted by population size) and take the square root.
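In symbols, \(d = |\bar{x}_1 - \bar{x}_2| / s_{pooled}\), where \(s_{pooled} = \sqrt{(n_1 s_1^2 + n_2 s_2^2) / (n_1 + n_2)}\); this matches the n-weighted pooling used in the code below.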
lizard_size_sumary_table |>
  summarize(mean_diff = diff(mean),
            pooled_variance = sum(var * n) / sum(n)) |> # weighted mean of variance
  mutate(cohen_d = abs(mean_diff) / sqrt(pooled_variance)) # Cohen's d is always positive, so take abs() of mean_diff
## # A tibble: 1 × 3
## mean_diff pooled_variance cohen_d
## <dbl> <dbl> <dbl>
## 1 -3.86 0.817 4.27
You could report your results as follows (assuming that figure 1 compared the two groups):
Big lizards have significantly longer limbs than small lizards (d = 4.27, t = 45.7, df = 442.7, p < 0.0001; Figure 1)
7.3.4 Comparing two rates or relationships (ANCOVA)
Analysis of covariance (ANCOVA) is a statistical model that combines a linear regression with an ANOVA or t-test. Essentially, it is a regression that also allows the slope and/or intercept of the fit line to vary between levels of a categorical variable. To keep things simple, we're going to focus on ANCOVAs with two-level categorical variables; however, there's no inherent reason they can't handle more levels.
Let's make two versions of the lizard data, contrasting Brown with Blue and Brown with Green. I'm also re-defining Color_morph as a factor so that Brown is taken as the reference category (see dummy coding under the ANOVA section).
lizards_brown_green = lizards |> filter(Color_morph != "Blue") |>
  mutate(Color_morph = factor(Color_morph, levels = c("Brown", "Green")))
lizards_brown_blue = lizards |> filter(Color_morph != "Green") |>
  mutate(Color_morph = factor(Color_morph, levels = c("Brown", "Blue")))
Let's test to see if the relationship between body size (SVL) and limb length (Limb) differs by color morph. To run an ANCOVA, you can use the lm() function with a formula like Limb ~ SVL * Color_morph.
ancova_limb_green = lm(Limb ~ SVL * Color_morph, data = lizards_brown_green)
print(ancova_limb_green)
##
## Call:
## lm(formula = Limb ~ SVL * Color_morph, data = lizards_brown_green)
##
## Coefficients:
## (Intercept) SVL Color_morphGreen
## 0.8483960 0.2018953 0.0182944
## SVL:Color_morphGreen
## 0.0008839
ancova_limb_blue = lm(Limb ~ SVL * Color_morph, data = lizards_brown_blue)
print(ancova_limb_blue)
##
## Call:
## lm(formula = Limb ~ SVL * Color_morph, data = lizards_brown_blue)
##
## Coefficients:
## (Intercept) SVL Color_morphBlue
## 0.848396 0.201895 0.441016
## SVL:Color_morphBlue
## -0.006167
This tells us that the ANCOVA has four coefficients:
- Intercept (for the regression on Brown)
- SVL, the baseline regression slope for the reference category, Brown
- Color_morphGreen (or Blue): how the intercept differs between Brown and the other color; this is the main effect of color morph.
- SVL:Color_morphGreen (or Blue): how the slope differs between Brown and the other color; this is the SVL / color morph interaction term.
For both data sets, both the main effect and the interaction of color morph are near zero; you can test this statistically with summary().
summary(ancova_limb_green)
##
## Call:
## lm(formula = Limb ~ SVL * Color_morph, data = lizards_brown_green)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08012 -0.48111 0.03624 0.46773 1.04263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8483960 0.2841340 2.986 0.00299 **
## SVL 0.2018953 0.0045739 44.140 < 2e-16 ***
## Color_morphGreen 0.0182944 0.3844295 0.048 0.96207
## SVL:Color_morphGreen 0.0008839 0.0062369 0.142 0.88737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5664 on 438 degrees of freedom
## Multiple R-squared: 0.9065, Adjusted R-squared: 0.9058
## F-statistic: 1415 on 3 and 438 DF, p-value: < 2.2e-16
summary(ancova_limb_blue)
##
## Call:
## lm(formula = Limb ~ SVL * Color_morph, data = lizards_brown_blue)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11070 -0.45341 0.00414 0.47641 1.04449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.848396 0.284746 2.979 0.00305 **
## SVL 0.201895 0.004584 44.046 < 2e-16 ***
## Color_morphBlue 0.441016 0.392663 1.123 0.26200
## SVL:Color_morphBlue -0.006167 0.006381 -0.966 0.33434
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5677 on 432 degrees of freedom
## Multiple R-squared: 0.9002, Adjusted R-squared: 0.8995
## F-statistic: 1299 on 3 and 432 DF, p-value: < 2.2e-16
This means that the two regression lines are essentially the same. Let’s plot one of them to see what that looks like:
lizards_brown_green |>
  ggplot() + aes(x = SVL, y = Limb, color = Color_morph) +
  geom_point() + geom_smooth(method = 'lm') + scale_color_viridis_d(begin = 0.3, end = 1)
As you can see, the regression lines are almost indistinguishable.
What would an ANCOVA show if there are differences? Let’s see if color morph affects the Tail:Diameter relationship instead:
ancova_dia_blue = lm(Diameter ~ Tail * Color_morph, data = lizards_brown_blue)
print(ancova_dia_blue)
##
## Call:
## lm(formula = Diameter ~ Tail * Color_morph, data = lizards_brown_blue)
##
## Coefficients:
## (Intercept) Tail Color_morphBlue
## 20.38163 0.06444 7.91466
## Tail:Color_morphBlue
## 0.01617
summary(ancova_dia_blue)
##
## Call:
## lm(formula = Diameter ~ Tail * Color_morph, data = lizards_brown_blue)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5127 -2.9652 0.0357 3.2047 19.6002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.38163 1.69334 12.036 < 2e-16 ***
## Tail 0.06444 0.02863 2.251 0.024918 *
## Color_morphBlue 7.91466 2.32069 3.410 0.000709 ***
## Tail:Color_morphBlue 0.01617 0.03952 0.409 0.682632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.966 on 432 degrees of freedom
## Multiple R-squared: 0.4499, Adjusted R-squared: 0.4461
## F-statistic: 117.8 on 3 and 432 DF, p-value: < 2.2e-16
lizards_brown_blue |>
  ggplot() + aes(x = Tail, y = Diameter, color = Color_morph) +
  geom_point() + geom_smooth(method = 'lm') + scale_color_viridis_d(begin = 0.3, end = 1)
## `geom_smooth()` using formula 'y ~ x'
In this case, you can see that there's a significant main effect but no significant interaction; this results in a separation between the two regression lines on the figure. If you get this result, re-run the ANCOVA but replace the * with a +; this will remove the interaction effect and give you a better estimate of the main effect.
ancova_dia_blue_noint = lm(Diameter ~ Tail + Color_morph, data = lizards_brown_blue)
summary(ancova_dia_blue_noint)
##
## Call:
## lm(formula = Diameter ~ Tail + Color_morph, data = lizards_brown_blue)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6433 -2.9521 0.0416 3.1976 19.4588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.88954 1.19088 16.701 < 2e-16 ***
## Tail 0.07293 0.01972 3.699 0.000245 ***
## Color_morphBlue 8.84397 0.47572 18.591 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.961 on 433 degrees of freedom
## Multiple R-squared: 0.4497, Adjusted R-squared: 0.4472
## F-statistic: 176.9 on 2 and 433 DF, p-value: < 2.2e-16
Biologically, you can interpret this as blue lizards having a larger perch diameter than brown lizards across a variety of tail sizes. Specifically, blue perch diameter is on average 8.84 mm wider (you get this from the main effect coefficient of the no-interaction model).
When reporting this, be sure to provide the regression equation (intercept and slope of Tail) and the main effect.
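For example, the no-interaction fit above corresponds to two parallel lines (coefficients rounded from the summary):
Brown: Diameter = 19.89 + 0.073 * Tail
Blue: Diameter = (19.89 + 8.84) + 0.073 * Tail = 28.73 + 0.073 * Tail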
But what if you have a significant interaction? Let’s look at the Brown/Green comparison:
ancova_dia_green = lm(Diameter ~ Tail * Color_morph, data = lizards_brown_green)
print(ancova_dia_green)
##
## Call:
## lm(formula = Diameter ~ Tail * Color_morph, data = lizards_brown_green)
##
## Coefficients:
## (Intercept) Tail Color_morphGreen
## 20.38163 0.06444 -0.01528
## Tail:Color_morphGreen
## -0.21839
summary(ancova_dia_green)
##
## Call:
## lm(formula = Diameter ~ Tail * Color_morph, data = lizards_brown_green)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4841 -3.2126 0.0082 2.9184 14.6412
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.38163 1.58128 12.889 < 2e-16 ***
## Tail 0.06444 0.02674 2.410 0.0164 *
## Color_morphGreen -0.01528 2.18355 -0.007 0.9944
## Tail:Color_morphGreen -0.21839 0.03695 -5.910 6.89e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.637 on 438 degrees of freedom
## Multiple R-squared: 0.6635, Adjusted R-squared: 0.6612
## F-statistic: 287.8 on 3 and 438 DF, p-value: < 2.2e-16
lizards_brown_green |>
  ggplot() + aes(x = Tail, y = Diameter, color = Color_morph) +
  geom_point() + geom_smooth(method = 'lm') + scale_color_viridis_d(begin = 0.3, end = 1)
## `geom_smooth()` using formula 'y ~ x'
In this case, we have a significant interaction, which indicates that the relationship between Diameter and Tail is different between color morphs; the graph backs this up.
The main effect is not significant in this case, but that generally doesn't matter (as it can't be interpreted without the slope).
To understand and report this, it's best to think of it as a pair of separate regressions. The regression for Brown is determined by your intercept and slope (e.g., Diameter = Intercept + slope * Tail). The regression for Green adds the Green-specific coefficients to those effects: Diameter = (Intercept + Color_morphGreen) + (Tail slope + Tail:Color_morphGreen) * Tail.
In this case, you'd report your regression equations as:
Brown: Diameter = 20.38 + 0.06 * Tail
Green: Diameter = 20.37 - 0.15 * Tail
You should also include the \(R^2\) and the p-values for the non-intercept coefficients for both cases.
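If you'd rather pull those numbers straight from the model object than copy them from the output, here's a minimal sketch (the green_intercept / green_slope names are made up for illustration):
cf <- coef(ancova_dia_green)
brown_intercept <- cf["(Intercept)"]
brown_slope     <- cf["Tail"]
green_intercept <- cf["(Intercept)"] + cf["Color_morphGreen"]      # 20.38 - 0.02 = 20.37
green_slope     <- cf["Tail"] + cf["Tail:Color_morphGreen"]        # 0.064 - 0.218 = -0.15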
ggplot(lizards, aes(x = SVL, y = Diameter, color = Color_morph)) +
geom_point() +
geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'