1. Overview

This tutorial is designed to follow up on concepts and skills covered in the previous tutorial, “Analyzing Quantitative Trait Data in R: an example in Arabidopsis lyrata.” Here we introduce genotype data from genetic markers in specific chromosomal regions in addition to the trait data from the previous tutorial, and use it to infer whether nearby genes (quantitative trait loci, or QTLs) affect traits.

By the time you complete this tutorial, you should be able to do the following:

Do basic population genetic analyses on molecular marker data, such as testing for Mendelian genotypic ratios in crosses.
Evaluate the size and significance of QTL effects linked to molecular markers, including additive and dominance effects, using linear models.
Plot QTL effects as regressions and as barplots.
Use transformations and generalized linear models to analyze effects on non-normal traits.
Use additional approaches to generate calculated fields.
Use a hypothesis-based approach to develop predictions and interpret results from QTL analyses.

2. Loading data and adding calculated variables:

Find the .csv file named QTLStudyTraitAnalysis_survMarkers.csv, which should be in the folder on Box labeled F2_Marker_Trait_Photo_Data. This is similar to the data file you used with the previous tutorial, but also contains columns for genotype data. Download it into the folder you will be running R Studio from.

We will first read in the .csv file and generate a data frame with the data:

mtDataAll <- read.csv("QTLStudyTraitAnalysis_survMarkers.csv", header=TRUE)

Next, we will generate some of the same calculated fields we did with the previous data frame; i.e. vegetative diameter changes in the different time periods, mean number of basal leaves on sampled inflorescences on each plant, and mean values of inflorescence architecture traits:

mtDataAll$dDiam21 <- with(mtDataAll,Diam2 - Diam1)
mtDataAll$dDiam32 <- with(mtDataAll,Diam3 - Diam2)
mtDataAll$meanBLvs <- apply(mtDataAll[,c(12:14)],1,mean,na.rm=TRUE)
mtDataAll$meanOrder <- apply(mtDataAll[,c(15,18,21)],1,mean,na.rm=TRUE)
mtDataAll$meanBrNode <- apply(mtDataAll[,c(16,19,22)],1,mean,na.rm=TRUE)
mtDataAll$meanFlNode <- apply(mtDataAll[,c(17,20,23)],1,mean,na.rm=TRUE)

In addition, we will calculate a field for the mean number of lateral branches per sampled inflorescences on each plant:

mtDataAll$meanBrNumber <- mtDataAll$meanFlNode - mtDataAll$meanBrNode

We will summarize the names of all the data fields and which columns of the data frame they are in:

names(mtDataAll)

##  [1] "Pop"                  "Fam"                  "Num"                 
##  [4] "X2nd_Data_colec_date" "LatRating0215"        "PShootStatus0215"    
##  [7] "LShootStatus0215"     "Rhiz0215"             "Remarks1"            
## [10] "Remarks2"             "Collector"            "S1BLvs"              
## [13] "S2BLvs"               "S3BLvs"               "S4Order"             
## [16] "S4BrNode"             "S4FlNode"             "S5Order"             
## [19] "S5BrNode"             "S5FlNode"             "S6Order"             
## [22] "S6BrNode"             "S6FlNode"             "Rhiz0615"            
## [25] "CollDate"             "Comments"             "Diam1"               
## [28] "Diam2"                "Diam3"                "Infl"                
## [31] "Surv1216"             "PIN1"                 "PIN3"                
## [34] "PIN_combined"         "LFY"                  "GI"                  
## [37] "SOC1"                 "GenoComment"          "dDiam21"             
## [40] "dDiam32"              "meanBLvs"             "meanOrder"           
## [43] "meanBrNode"           "meanFlNode"           "meanBrNumber"

Notice that the data frame includes some variables that weren’t present in the trait data file we used in the first tutorial. These variables (PIN1, PIN3, PIN_combined, and LFY) are molecular marker genotypes, which we will explain in detail in the next section.

We want to check whether our version of R has treated two categorical variables as factors, which we may need:

class(mtDataAll$Pop)

## [1] "character"

class(mtDataAll$Fam)

## [1] "character"

As the results show, they are both being treated as characters, not factors, so we need to fix this:

mtDataAll$Pop <- as.factor(mtDataAll$Pop)
mtDataAll$Fam <- as.factor(mtDataAll$Fam)
levels(mtDataAll$Fam)

##  [1] "A"            "A1"           "A2"           "AM"           "AS"          
##  [6] "B"            "B2"           "BM"           "BS"           "BS1"         
## [11] "BS2"          "C1"           "C2"           "CS"           "D1"          
## [16] "D2"           "DS"           "E"            "F"            "FM"          
## [21] "FS"           "G"            "H"            "header=TRUE)" "HM"          
## [26] "HS"           "IM"           "J"            "M"            "P1"          
## [31] "P2"           "Q1"           "Q2"

We will now generate a new data frame with just the F2 individuals:

mtDataF2 <- with(mtDataAll,mtDataAll[Pop=="F2",])

Remember: The square brackets identify the [row,column] coordinates you are selecting, separated by a comma. Since there is nothing after the comma in this case, all columns will be selected in generating the new data frame.

3. Understanding molecular marker genotypes:

Let us look at the contents of one of the molecular marker variables.

mtDataF2$PIN_combined

##   [1]  1  0  1  0  0  2  1 NA  1  1  2  1 NA  1  0  2 NA NA  1  1  1  2  0 NA  0
##  [26]  0  1  1  2  1  0  0 NA NA  1  2  2  1  1  1  1  1  0  2  2 NA  2  0  1  0
##  [51]  0  1  1  1  2  1  1  0 NA  1  1  2  1  1  2  1 NA  1 NA  0  0  2 NA  2  1
##  [76]  1  1  0  1  1  0 NA  0  1  1  1  1  0  1  0  1  2  1  1  1  1  2  0  2  2
## [101]  0 NA  0  2  1  0  0  1  1  1  1  0 NA  1  1  1  0  2  1  1  1  2 NA NA  0
## [126]  1  1  1  2  1  1  1  2 NA NA NA NA NA  1 NA NA  1  2  1  1 NA  0  2  1  0
## [151] NA NA  1  0  0  0  2  1 NA  1  0  1 NA  1  2  0 NA  1  2  2  1 NA  1  1  1
## [176]  1  1  2  2  1  2  2  1  1  2  2  1  0 NA NA  1  1  2  1  1  1  1  0  1  1
## [201]  1  1  1  2  0  2  2  1  2 NA  1  2  2  2  1  1  1  1  1  1  2  1  2  2  1
## [226] NA  2  1 NA  1  1  2  1 NA  1  0 NA  2  0  2  1  2 NA  0  0  0  1  1 NA  1
## [251]  0  2  1  1 NA  1  2  1  1 NA  0  0  1 NA NA NA  2  0  1  0 NA NA  1 NA  1
## [276]  1  1  0  0  2  2  1  1  1  2  1 NA  1  2  0 NA  0  1  1  2  0  2  2 NA  1
## [301]  1  2 NA  2  1  0  1  1  1  0  2  0  1  2  0 NA  2  2  2 NA  1  1  2  1  2
## [326]  1  2 NA NA NA  1 NA  2  1 NA NA  2  1  2  2 NA  1  0  1  2  1 NA NA NA  1
## [351]  1  0  1  0  0  1  1  0  2  1 NA  1 NA  1  0 NA  0 NA  1  1  0 NA  0 NA  1
## [376]  2  1  0 NA  1  1 NA  1  2  0  1  1  1  1  2  2 NA  2  1  1  2  2  0 NA NA
## [401] NA  1  1  0  1  2  2  1  1  0 NA  2 NA  1  2  2  1  0 NA NA  1 NA  1 NA NA
## [426]  1  2  1 NA  1  2  1  0 NA NA  0  1  2  2  1  1  1  2  1  2 NA  0  2 NA  2
## [451]  0  1  1  1  1 NA  0 NA  0  2 NA  2  0 NA  0  2  1  1  1  2  2 NA NA  2  0
## [476]  2  2  1  1  0  0  1  1  2  1  2  1  1 NA  0  1  2  1 NA  1  1  1  1  2  1
## [501]  1  0  0 NA NA  1  1  1  1  2 NA NA  2  2  2 NA NA  1  1 NA  0  2  1  2  1
## [526] NA  2 NA  2  2  1  1  2  0  0  1  1  2  0  0  1 NA  0  0  1  0  1  0  2  0
## [551] NA  2 NA NA NA  1  0  0  0  1  1  0  0  1  1  1 NA  1  0  0  0  0  0  1 NA
## [576] NA  0  2  1  2  1  1 NA  1  1  0  2  2  0 NA  0  0  1 NA NA  0 NA  2 NA  2

Note that this variable is a vector of 600 entries, each corresponding to a different F2 plant, consisting of a string of 0, 1, and 2 values, along with quite a few NA entries for samples that didn’t get genotyped. If we want to know how many samples did have a genotype, we can do the following to find the length of the PIN_combined vector with all the NA entries omitted:

length(na.omit(mtDataF2$PIN_combined))

## [1] 481

But what do the values of 0, 1, or 2 mean?

Since these samples are F2 plants from a crosses between Spiterstulen (SP) plants from Norway and Mayodan (MY) plants from North Carolina, at any one location (or locus) in the genome, they can have two SP alleles (SP/SP), one SP and one MY allele (MY/SP), or two MY alleles (MY/MY). We have coded these three genotypes as 0, 1, and 2, respectively. We can think of this as plants with 0, 1, or 2 MY alleles at that particular locus.

We obtained these genotypes by amplifying gene segments by PCR and identifying length differences between alleles by running the PCR products on an agarose gel. This ended up being a little bit complicated, because our F2 plants actually came from two different sets of crosses. In the first locus we genotyped (part of the PIN1 gene), some SP plants had alleles identical to those of MY plants, so we couldn’t determine the population origin of the alleles by genotyping PIN1 in one of the crosses. For this cross, we instead genotyped a part of the PIN3 gene, which is very close to PIN1 on A. lyrata chromosome 2. To simplify our analyses, we then combined the data for the PIN1 and PIN3 genotypes in a single column and called it PIN_combined.

But why are we so interested in genotyping this particular segment of chromosome 2? An earlier genetic mapping study, using markers from across the entire genome, identified a QTL with large effects on a trade-off between reproduction and vegetative growth during the reproductive season in this part of chromosome 2. The purpose of the current study is to get clues on how a gene (or genes) in this segment causes this trade-off. Since we don’t have funds yet to do genome-wide genotyping, we have been focusing on markers in the regions we already know have QTLs.

As we explained in the previous tutorial, we have evidence that at least some QTLs function by affecting the rate at which newly-formed shoots switch over from vegetative growth to reproductive growth. We call this the time allocation hypothesis. If MY alleles cause plants to make this switch more quickly than SP alleles, then we would expect MY alleles to reduce the number of leaves produced during the reproductive period, which would also result in vegetative rosettes becoming smaller during this period, and increase the number of infloresecences produced.

Given this long-winded introduction, the first thing we might like to do is test whether the ratios of SP/SP, MY/SP, and MY/MY genotypes follow the expected 1:2:1 ratio for F2 plants. (If you don’t remember why we expect those ratios, remember that the F2 plants came from a cross between two F1 plants that each had MY/SP genotypes. Do a monohybrid Punnett square if you need to in order to verify the expected ratio.)

We first find the number of F2 plants with each genotype using the table function. This function is usually used to generate crosstab or contingency tables of counts across a pair of variables, but creates as simple vector of counts when only one variable is selected like this:

table(na.omit(mtDataF2$PIN_combined))

## 
##   0   1   2 
## 113 239 129

This gave us a vector of the number of individuals with 0, 1, or 2 recorded for the genotype; i.e. SP/SP, MY/SP, or MY/MY. This looks pretty close to a 1:2:1 ratio. To test this, we can save it as a vector object in order to do a chi-square test.

PIN_geno_counts <- table(na.omit(mtDataF2$PIN_combined))

To use the chisq.test function, the first argument, x is the vector of genotype counts we just generated, and the second argument, p is the corresponding vector of genotype probabilities, which we will just enter directly:

chisq.test(x = PIN_geno_counts, p = c(0.25,0.5,0.25))

## 
##  Chi-squared test for given probabilities
## 
## data:  PIN_geno_counts
## X-squared = 1.0832, df = 2, p-value = 0.5818

The output shows us that the deviation from expected values is well within the range we would expect by chance (P-value ~= 0.62). If we had seen a significant deviation, it would have suggested that one or more of the genotypes had reduced viability, or possibly one of the gametes was less likely to be transmitted.

4. Estimating genotypic additive effects on traits:

Does this new study verify that the PIN region on chromosome 2 contains a QTL for reproduction vs. growth trade-offs? And, do the results support the time allocation hypothesis we described in the previous section? We will now start exploring the answers to these questions using the linear model tools lm function introduced in the previous tutorial.

In this section, we begin with the simplest model, in which the trait value is treated as a function of the number of MY alleles in the PIN region of chromosome 2. We will start with the trait dDiam32, which if you recall measures the net change in vegetative rosette diameter during the reproductive season and has clear predictions in the time allocation model. If MY alleles reduce the length of the vegetative growth stage on each shoot, we expect them to reduce the size of the rosette during the reproductive season. The model is set up as follows:

dD32_mt <- lm(dDiam32 ~ PIN_combined, data=mtDataF2)

Before we look at the summary of the output object dD32_mt, we need to consider that the predictor PIN_combined is numeric, not a factor, unlike Pop, which we used as a predictor when we introduced linear models in the previous tutorial. This will cause lm to treat its effects on dDiam32 as a linear regression on the number of MY alleles, rather than treat the three possible genotypes as three categories.

Understanding that, let us now look at the summary:

summary(dD32_mt)

## 
## Call:
## lm(formula = dDiam32 ~ PIN_combined, data = mtDataF2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.816 -17.816  -0.816  15.512  90.855 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -15.512      2.195  -7.068 6.31e-12 ***
## PIN_combined   -9.671      1.745  -5.541 5.22e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.97 on 435 degrees of freedom
##   (163 observations deleted due to missingness)
## Multiple R-squared:  0.06593,    Adjusted R-squared:  0.06378 
## F-statistic:  30.7 on 1 and 435 DF,  p-value: 5.221e-08

We interpret the output as follows:

The Intercept estimate gives us the y-intercept of dDiam32; in other words, the fitted value of dDiam32 when PIN_combined = 0, which happens when thegenotype is SP/SP. The PIN_combined estimate is the regression slope, or the net change in dDiam32 with each additional unit of increase in PIN_combined. This tells us that each additional MY allele in the PIN_combined genotype reduces the value of dDiam32 by 9.63 mm, which is consistent with what our time allocation hypothesis predicts!

Moreover, the P-value for the effect of PIN_combined is very small; less than 1 in ten million. This means it is extremely unlikely that the fitted slope of the model would be this steep if the PIN region genotype had no effect on dDiam32.

Looking below the table, we see that the multiple R-squared value is ~= 0.065. This means that the PIN region genotype only explains a small amount of the variance in dDiam32, but for a quantitative trait this is actually pretty substantial. Looking at the effects in another way, the fitted net reduction in rosette diameter in MY/MY genotypes is nearly 20 mm greater (or 2 x 9.630) than that in SP/SP genotypes. Given that the difference between plants from the parental SP and MY populations in dDiam32 was ~= 34 mm (which we found in the previous tutorial), it turns out that the PIN region of chromosome 2 explains more than half that difference!

We can visualize the regression graphically, which might make it easier to interpret, using the same argument structure with the function plot rather than lm and then using abline to add the regression line from the lm object:

plot(dDiam32 ~ PIN_combined, data=mtDataF2); abline(dD32_mt)

The scatterplot is a little hard to visualize since the x-axis values are all 0, 1, or 2, but the regression line slope shows the decreasing fitted value of dDiam32 as the number of MY alleles in the PIN genotype increases.

We can also use plot with the lm object as its argument to obtain the diagnostic plots as we did in the previous unit:

plot(dD32_mt)

To see the different diagnostic plots, when you run these scripts yourself, you need to click on the console and hit several times to observe the individual plots in the Plots tab in the lower right part of the RStudio window. Once they are all plotted, you can use the left and right arrows to scroll through the plots.

Since the fitted model is a linear regression, the first plot, Residuals vs Fitted, is actually meaningful. If the relationship between dDiam32 and its predictor (PIN_combined) is truly linear, we would expect the residual (i.e. the difference between the observed and fitted value of dDiam32) to have a mean of zero across the whole range of PIN_combined values. The red line on the plot tracks the local relationship of the residual mean to the predictor, and we see that it indeed stays close to zero.

The second plot, the Normal Q-Q plot shows that the overall distribution of residual dDiam32 values is slightly skewed, with higher values more spread out than what would be expected. Since the P-values for linear models are based on the assumption that the residuals are normally-distributed, one might want to be a little careful interpreting marginal P-values (close to 0.05, say), but that’s certainly not the case here!

5. Assignment #1:

Run all the scripts above to re-create all of the objects and analyses we have run. If you get any errors, check carefully for typos, missing punctuation, etc.
Based on the time allocation hypothesis described above, predict what effects the PIN region of chromosome 2 will have on these traits: meanBLvs (the mean number of basal leaves on sampled inflorescences), and Infl (the number of inflorescences produced by each plant).
Write scripts to run lm on meanBLvs and Infl to test whether those predictions are supported. Be sure to give each lm run a unique object name. Briefly describe and interpret the results.
In addition, run lm on the following traits, and similarly describe and interpret the results. What other traits, if any, show evidence of being affected by genes in the PIN region?

LatRating0215: the branchiness of the rosettes at the time the second diameter measurement was taken, around the time flowering began.

Diam1: the rosette diameter at the end of the first fall after sowing.

dDiam21: the net change in rosette diameter overwinter.

meanOrder: the average complexity of branching on sampled inflorescences.

meanBrNumber: the average number of branches on sampled inflorescences.

6. Genotypic effects on non-normal traits:

As we stated above, one of the assumptions of linear models is that the residual of the dependent (y-axis) variable is normally-distributed. So, what do we do if it isn’t? We will look at a couple of examples here.

First, let us go back to one of the traits you analyzed in the assignment you just completed: meanBLvs. You might or might not have run the diagnostic plots on this, but if you had you would have seen something a little disconcerting. You should have obtained the following results of the lm analysis.

BLvs_mt <- lm(meanBLvs ~ PIN_combined, data=mtDataF2)
summary(BLvs_mt)

## 
## Call:
## lm(formula = meanBLvs ~ PIN_combined, data = mtDataF2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4013 -1.0680 -0.1777  0.9320  5.7105 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.1777     0.1350  30.941   <2e-16 ***
## PIN_combined  -0.2216     0.1077  -2.056   0.0403 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.657 on 466 degrees of freedom
##   (132 observations deleted due to missingness)
## Multiple R-squared:  0.008993,   Adjusted R-squared:  0.006866 
## F-statistic: 4.229 on 1 and 466 DF,  p-value: 0.0403

Note that the P-value was really close to that “magic” cutoff of 0.05. If we’re sensible, we don’t really treat this as “magic” at all, but instead see values around the 0.05 level as providing some evidence of a real effect, but only weak evidence. Moreover, if we run the diagnostic Normal Q-Q plot, it looks quite non-normal:

plot(BLvs_mt, which = 2)

The which argument to plot specifies which of the set of diagnostics to generate if you don’t want to see all of them. Since there are a lot of rather extreme values on the upper end of the distribution, we might want to try a transformation of the dependent variable. Replacing the mean number of basal leaves per inflorescence with its square root would make extreme values on the high end of the distribution less extreme, so we’ll try that, as follows:

BLvs_sqrt_mt <- lm(sqrt(meanBLvs) ~ PIN_combined, data=mtDataF2)
summary(BLvs_sqrt_mt)

## 
## Call:
## lm(formula = sqrt(meanBLvs) ~ PIN_combined, data = mtDataF2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30682 -0.25118 -0.00514  0.27607  1.16447 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.00514    0.03397  59.023   <2e-16 ***
## PIN_combined -0.06048    0.02711  -2.231   0.0261 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4169 on 466 degrees of freedom
##   (132 observations deleted due to missingness)
## Multiple R-squared:  0.01057,    Adjusted R-squared:  0.008446 
## F-statistic: 4.978 on 1 and 466 DF,  p-value: 0.02615

plot(BLvs_sqrt_mt, which = 2)

Notice two things. First, using the square root of meanBLvs actually made the P-value a little more significant. Second, the plot of the residuals definitely looks much closer to the normal expectations. While the P-value is not overwhelmingly convincing, the direction of the effect (MY alleles associated with fewer basal leaves on shoots before they transition to becoming inflorescences) is consistent with our time allocation hypothesis.

Second, let’s look at a trait that is obviously non-normal because it is categorical: whether or not a plant produced rhizomatous shoots. This is a binary trait, coded as 0 if a plant did not produce rhizomatous shoots, and as 1 if it did. First, let’s ignore the fact that the trait is non-normal and just run lm as we did with the other traits:

Rhiz_mt <- lm(Rhiz0615 ~ PIN_combined, data=mtDataF2)
summary(Rhiz_mt)

## 
## Call:
## lm(formula = Rhiz0615 ~ PIN_combined, data = mtDataF2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6534 -0.6496  0.3466  0.3504  0.3542 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.645761   0.038798  16.644   <2e-16 ***
## PIN_combined 0.003807   0.031016   0.123    0.902    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4781 on 469 degrees of freedom
##   (129 observations deleted due to missingness)
## Multiple R-squared:  3.212e-05,  Adjusted R-squared:  -0.0021 
## F-statistic: 0.01506 on 1 and 469 DF,  p-value: 0.9024

The results provide no evidence that the PIN region affects this trait. The estimates are the fitted effects on the frequency of plants with rhizomatous shoots. However, it would be much better to use a model that actually treats Rhiz0615 as the binary trait that it is. To do this, we can fit a generalized linear model using the function glm. The arguments with glm are similar to those with lm. However, we also specify a family argument that identifies the type of dependent variable. We will specify family = binomial since the probability of an individual with a given genotype having rhizomatous shoots is binomial. Doing so will transform the binomial probabilities from the 0-1 range to an unbounded range using a link function, in this case a logit function in which the probability y is transformed to ln(y/(1-y)). This will have properties more similar to a normal distribution.

Rhiz_mt_glm <- glm(Rhiz0615 ~ PIN_combined, family=binomial, data=mtDataF2)
summary(Rhiz_mt_glm)

## 
## Call:
## glm(formula = Rhiz0615 ~ PIN_combined, family = binomial, data = mtDataF2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4557  -1.4482   0.9226   0.9289   0.9352  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.60044    0.16982   3.536 0.000407 ***
## PIN_combined  0.01673    0.13599   0.123 0.902110    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 610.08  on 470  degrees of freedom
## Residual deviance: 610.07  on 469  degrees of freedom
##   (129 observations deleted due to missingness)
## AIC: 614.07
## 
## Number of Fisher Scoring iterations: 4

In this case, the P-value is nearly the same as we obtained with lm. Notice that the summary output appears somewhat different from that using lm. Instead of R-squared values, we are given null and residual deviance. This is because the model is fit using maximum likelihood rather than ordinary least squares, and involves a bunch of statistical theory we won’t complicate ourselves with here. The coefficients are in logit-transformed values so the estimates no longer give us the probabilities of rhizomatous shoots directly, and are a little difficult to interpret.

7. Models including dominance:

So far, we’ve used models that assume that alleles of the QTL in the PIN region affect traits additively; in other words, that the MY/SP heterozygotes have a mean trait value halfway between the two homozygotes. This is a big assumption that might or might not be correct.

To address this, we can convert our PIN_combined genotype into additive and dominance components. We do this as follows:

mtDataF2$PIN_combAdd <- mtDataF2$PIN_combined-1
mtDataF2$PIN_combDom <- 1-abs(1-mtDataF2$PIN_combined)

The variable PIN_combAdd is generated by simply subtracting 1 from PIN_comb, so its value for SP/SP, MY/SP, and MY/MY genotypes are -1, 0, and 1 respectively. The rather contrived expression for PIN_combDom makes its value 1 when the genotype is MY/SP (i.e. heterozygous), and 0 for both homozygotes.

We can put both of these components into the genotype-trait model, in this case a multiple regression model:

dD32_AD <- lm(dDiam32 ~ PIN_combAdd + PIN_combDom, data=mtDataF2)
summary(dD32_AD)

## 
## Call:
## lm(formula = dDiam32 ~ PIN_combAdd + PIN_combDom, data = mtDataF2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.660 -18.660  -2.092  16.262  88.908 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -23.085      1.744 -13.238  < 2e-16 ***
## PIN_combAdd   -9.823      1.744  -5.633 3.19e-08 ***
## PIN_combDom   -4.255      2.483  -1.714   0.0873 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.91 on 434 degrees of freedom
##   (163 observations deleted due to missingness)
## Multiple R-squared:  0.0722, Adjusted R-squared:  0.06793 
## F-statistic: 16.89 on 2 and 434 DF,  p-value: 8.653e-08

The estimates from this additive-dominance model are interpreted as follows:

The Intercept is now the midpoint between the fitted values for the SP/SP and MY/MY genotypes. Thus, the fitted values for MY/MY and SP/SP are obtained by adding and subtracting, respectively, the estimate for PIN_combAdd from the Intercept. Thus, the fitted value for MY/MY is -23.038 + (-9.776) = -32.814. The fitted value for SP/SP is -23.038 - (-9.776) = -13.262.

The fitted value for the MY/SP genotype is obtained by adding the estimate for PIN_combDom to the Intercept; i.e. -23.038 - 4.355 = -27.393.

If there were absolutely no dominance, we would expect the effect of PIN_combDom to equal zero. While this isn’t the case, the P-value for PIN_combDom is ~=0.08, not very strong evidence for its true value being different from zero. If we compare the model F-statistic P-value to that obtained with the simple linear model above, we find that the additive-dominance model is actually less significant. The degree of freedom we use up in estimating the extra parameter is not worth it in this case – but we wouldn’t know that without testing!

If either the MY or SP allele were completely dominant, the dominance coefficient would be essentially equal to +/- the additive coefficient. However, in many cases the degree of dominance is somewhere in between absent and complete, or we could even see overdominance, where the heterozygote mean is greater than or less than both homozygotes.

Another model we can use is simply to treat the PIN_combined genotype as a factor rather than a numeric value. We can do this as follows:

dD32_cat <- lm(dDiam32 ~ as.factor(PIN_combined), data=mtDataF2)
summary(dD32_cat)

## 
## Call:
## lm(formula = dDiam32 ~ as.factor(PIN_combined), data = mtDataF2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.660 -18.660  -2.092  16.262  88.908 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -13.262      2.553  -5.194 3.17e-07 ***
## as.factor(PIN_combined)1  -14.077      3.105  -4.533 7.53e-06 ***
## as.factor(PIN_combined)2  -19.645      3.488  -5.633 3.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.91 on 434 degrees of freedom
##   (163 observations deleted due to missingness)
## Multiple R-squared:  0.0722, Adjusted R-squared:  0.06793 
## F-statistic: 16.89 on 2 and 434 DF,  p-value: 8.653e-08

Here, the Intercept estimate is the fitted value of dDiam32 for SP/SP, which is identical to that in the additive-dominance model. The other two estimates are added to the Intercept to obtain the fitted values for MY/SP and MY/MY respectively. If you do the calculations, these also come out identical to the fitted values of the additive-dominance model, possibly with rounding error. In fact, the values for R-squared, F-statistic, and P-value are all identical to those of the additive-dominance model. The two models are both equivalent; they both use all the information we can glean from the genotypes but just package it in a different way.

The additive-dominance formulation is more enlightening from a genetic standpoint since it provides estimates and P-values for the separate additive and dominance components. However, the factor approach is nice because it can be used to generate a nice boxplot showing the phenotypic distributions of each genotype, as follows:

plot(dDiam32 ~ as.factor(PIN_combined), data=mtDataF2)

8. Assignment #2:

Write scripts and run additive-dominance models for all the traits you analyzed in Assignment #1. (Don’t forget to copy and run the script to generate the additive and dominance components for the PIN_combined) genotypes first!)

Are the dominance components significant for any of the traits? If so, describe the dominance effect. Does incuding dominance in the model result in significant effects of the PIN region for any additional traits?

#9. Estimating QTL effects on a more complicated calculated trait:

Lastly, we will investigate the effect of the PIN QTL region on early development of lateral inflorescences. The trait we’re interested in is whether or not plants had started developing lateral inflorescences at the time the second rosette diameter measurement (Diam2) was taken, near the time plants were beginning to flower. Unfortunately, the way we recorded the data didn’t directly measure what we are after, so we will first have to do some rather complicated recoding.

The two recorded traits are labeled PShootStatus0215 and LShootStatus0215, and refer to the status of primary shoots (the main stem of the plant) and lateral shoots respectively. In each case, these are recorded as “N” (no reproductive development), “B” (the primary shoot or at least one lateral shoot had bolted, i.e. started elongating, but not flowered yet), or “F” (the primary shoot or at least one lateral shoot had open flowers). Also, in a few cases these were erroneously coded “Y”, indicating that they had either bolted or flowered.

Note that shoots that flower had already bolted, so the “B”, “F”, and “Y” categories had all bolted already. We want to generate variables with numeric codes to indicate whether the primary and lateral shoots had bolted yet (combining the “bolted” and “flowered” categories), and separately whether they had actually flowered or not.

First, we make sure that our version of R is treating the two variables as factors and not simply as characters.

mtDataF2$PShootStatus0215 <- as.factor(mtDataF2$PShootStatus0215)
mtDataF2$LShootStatus0215 <- as.factor(mtDataF2$LShootStatus0215)

Now we list the levels of each variable.

levels(mtDataF2$PShootStatus0215)

## [1] ""  "B" "F" "N" "Y"

levels(mtDataF2$LShootStatus0215)

## [1] ""  "B" "F" "N" "Y"

Notice that the first level in each case is ““; in other words, blank.

Now, we define four new variables that establish the codes for whether primary shoots and lateral shoots had bolted yet or flowered yet.

mtDataF2$PBolted0215 <- as.numeric(c(NA,1,1,0,1)[match(mtDataF2$PShootStatus0215,
                                        levels(mtDataF2$PShootStatus0215))])
mtDataF2$PFlowered0215 <- as.numeric(c(NA,0,1,0,NA)
                                     [match(mtDataF2$PShootStatus0215,
                                        levels(mtDataF2$PShootStatus0215))])
mtDataF2$LBolted0215 <- as.numeric(c(NA,1,1,0,1)[match(mtDataF2$LShootStatus0215,
                                        levels(mtDataF2$LShootStatus0215))])
mtDataF2$LFlowered0215 <- as.numeric(c(NA,0,1,0,NA)
                                     [match(mtDataF2$LShootStatus0215,
                                        levels(mtDataF2$LShootStatus0215))])

What each of these scripts does is establish the vector of values that go with the levels of the existing variables in order (““,B”,“F”,“N”,“Y”). (in the case of PBolted0215, these are NA, 1, 1, 0, and 1). The match function assigns these values to the new variable in place of the values of the old variable. (In place of the levels argument, we could have simply put in the vector c(““,”B”,“F”,“N”,“Y”).)

Since we aren’t sure whether “Y” meant “bolted” or “flowered,” we enter these as NA for the Flowered variables.

We can now test whether the QTL in the PIN chromosomal region affects these traits. We are mainly interested in whether MY alleles increase the probablility that lateral shoots become reproductive early. To do this, we use LBolted0215 in the additive-dominance model:

LBolted_AD <- lm(LBolted0215 ~ PIN_combAdd + PIN_combDom, data=mtDataF2)
summary(LBolted_AD)

## 
## Call:
## lm(formula = LBolted0215 ~ PIN_combAdd + PIN_combDom, data = mtDataF2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7559 -0.5781  0.2441  0.4219  0.7080 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.52397    0.03037  17.250  < 2e-16 ***
## PIN_combAdd  0.23194    0.03037   7.636 1.25e-13 ***
## PIN_combDom  0.05409    0.04306   1.256     0.21    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4698 on 474 degrees of freedom
##   (123 observations deleted due to missingness)
## Multiple R-squared:  0.111,  Adjusted R-squared:  0.1073 
## F-statistic:  29.6 on 2 and 474 DF,  p-value: 7.722e-13

We see that the additive effect of the PIN genotype is highly significant and large. If we add and subtract the additive estimate (PIN_combAdd) from the intercept, the fitted values for MY/MY and SP/SP genotypes, which represent the probabilities of bolted lateral shoots are ~= 0.76 and 0.29 respectively – a huge difference!! The dominance coefficient is not significantly different from zero, suggesting a largely additive effect.

Since LBolted_AD is a binary variable, we really should re-run the analysis using glm using a binomial family.

LBolted_AD_glm <- glm(LBolted0215 ~ PIN_combAdd + PIN_combDom, family=binomial,
                   data=mtDataF2)
summary(LBolted_AD_glm)

## 
## Call:
## glm(formula = LBolted0215 ~ PIN_combAdd + PIN_combDom, family = binomial, 
##     data = mtDataF2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6794  -1.3137   0.7481   1.0470   1.5690  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.1224     0.1462   0.837    0.402    
## PIN_combAdd   1.0079     0.1462   6.895 5.38e-12 ***
## PIN_combDom   0.1924     0.1966   0.978    0.328    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 654.91  on 476  degrees of freedom
## Residual deviance: 600.41  on 474  degrees of freedom
##   (123 observations deleted due to missingness)
## AIC: 606.41
## 
## Number of Fisher Scoring iterations: 4

When we do this, we see that the P-value isn’t quite as tiny but is still highly significant. It is hard to interpret the coefficients since they are based on a logit transformation (explained above). However, if we were to do the inverse of the logit transformation on the glm predicted values, we would obtain the exact same fitted values as the lm. (You could run the function fitted.values(LBolted_AD_glm) and see a vector of the fitted values for each sample to verify this.)

Let’s consider the implications of these results. When you did Assignments #1 and #2, you should have noticed that MY alleles in the PIN region reduced the extent of lateral shoot development at the time the second diameter measurement was taken (reduced values of LatRating0215). In spite of this, MY alleles led to a greater probability that lateral shoots had already bolted at the same time point. Thus, MY alleles delayed formation of lateral shoots but led to earlier reproductive transitions on these shoots, providing additional powerful evidence for our hypothesis that the PIN QTL region affects time allocation.

10. A concluding statistical warning:

To conclude this tutorial, I should point out that the approach we took in the statistical analysis of the PIN region can be highly misleading if we’re not careful. This is because we did statistical tests for effects on a lot of different traits, increasing the opportunity to obtain significant P-values by chance alone. By definition, if we were to test 20 independent traits with absolutely no QTL effect, we would expect one of them on average to result in a P-value less than 0.05 (= 1/20). This is known as the multiple testing problem. A similar problem occurs when we test many different regions of the genome for QTL effects on a single trait.

To reduce the chance of making false inferences when many independent tests are done, the best practice is to use a smaller P-value threshold. There are a few different ways to do this P-value correction, which we won’t go into here. However, these corrrections also reduce the statistical power to detect true effects.

In our case, we are relatively safe using uncorrected P-values to make inferences. First, previous studies had already shown that the PIN region has a QTL affecting growth and reproduction. Those studies used much smaller P-values to identify which regions across the entire A. lyrata genome harbored QTLs. Our goals here are to understand more specifically how this already-known QTL functons. Second, we had plenty of evidence that many of these traits are not independent, but instead represent different measures of the same resource allocation process. Finally, many though not all of our “significant” P-values were much less than 0.05. Thus, we can be safe here in drawing relatively confident conclusions from our analyses.

Analyzing Genotype/Trait Associations in R: an example in Arabidopsis lyrata

David Remington

UPDATED June 27, 2023