1. Overview

This tutorial guides users through the basics of creating a data frame from quantitative trait data, generating calculated variables, generating summary statistics, and doing basic statistical analyses.

By completing this tutorial, you should be able to do the following:

  1. Generate a data frame, which is a collection of variables and observations to be analyzed, by importing a spreadsheet file or by modifying an existing data frame.

  2. Display information about variables in a data frame such as their names, types, and categories.

  3. Calculate new variables from existing ones.

  4. Generate summary statistics from variables in a data frame.

  5. Estimate and interpret correlations between variables.

  6. Set up and interpret a principal components analysis.

  7. Set up and interpret the results from statistical analyses using linear models.

Our data come from a pair of F2 crosses between two populations of the perennial herbaceous plant Arabidopsis lyrata from contrasting climatic extremes of its range. In addition to the large set of 600 F2 plants, samples from the two parental populations, Mayodan (MY) and Spiterstulen (SP), and from F1 plants from crosses between them are also included.

2. Installing R and RStudio:

You will need to download and install the correct version of R for your computer from https://www.r-project.org/. There are versions for Windows, Mac, and Linux. The process is pretty self-explanatory.

You should also download and install RStudio. This provides a nicer “front end” for viewing and organizing your R scripts, viewing output, and accessing Help resources. Go to https://www.rstudio.com/products/rstudio/ to download and install. Select the free open-source version.

To use R, you can click on the RStudio icon to open RStudio, navigate to the folder where your data file is (lower right, Files tab), and set that folder as your working directory. Create or open a script file (File/New File, or File/Open File). If you have already created output, you can also open the workspace .RData file containing the output.

You can also navigate to the folders where you already have script (.R) and workspace (.RData) and click on those files. Usually these will open directly using RStudio; if they don’t, right-click and select “Open with. . .”
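These setup steps can also be done from the R console. As a sketch (the path shown is just a placeholder for wherever your own files live):

```r
# Set the working directory to the folder containing your data file
# (replace the path with your own folder)
setwd("C:/Users/yourname/Documents/Rtutorial")

# Confirm the current working directory and list the files in it
getwd()
list.files()
```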

3. Generating a data frame:

Before generating a data frame from a CSV file, make sure you have the most recent version of the file in the same directory as your R script file (.R). For the purposes of this tutorial, the file can be downloaded from GitHub using the link under “Resources” on the Remington Lab website.

The line of script below reads the CSV file QTLStudyTraitAnalysis_surv.csv and creates a data frame called traitDataAll from it. The argument header=TRUE tells R that the first row of the CSV file contains the names of each variable.

To run one or more lines of script, we highlight them and press the Run button above the script window. Alternatively, we can press CTRL+ENTER (CMD+RETURN on a Mac).

traitDataAll <- read.csv("QTLStudyTraitAnalysis_surv.csv", header=TRUE)

Reminder:

In a line of R script, the expression to the right of the “<-” is evaluated; in this case it is a call to the function read.csv.

The name to the left of the “<-” specifies the object that will contain the output of the function; in this case we name it traitDataAll.

Now, suppose we want to inspect the structure of our data frame.

In this case it will have too many rows and columns to print out neatly, so we will look at just the first twenty rows and the first six columns (variables). The two items in the square brackets, separated by a comma, specify which rows to display (first item) and which columns to display (second item). The function c combines values into a vector (here the c is actually optional, since 1:20 is already a vector).

traitDataAll[c(1:20),c(1:6)]
##    Pop Fam Num X2nd_Data_colec_date LatRating0215 PShootStatus0215
## 1   MY  A1   6             02_12_15             2                F
## 2   MY  A2   2             02_12_15             3                F
## 3   MY  A2   6             02_06_15             2                F
## 4   MY  A2   3                                 NA                 
## 5   MY  A2   4             02_06_15             2                F
## 6   MY  A2   5             02_12_15             2                F
## 7   MY  B2   2             02_14_15             2                F
## 8   MY  B2   4             02_12_15             3                B
## 9   MY  B2   6             02_12_15             2                F
## 10  MY  B2   8             02_12_15             2                F
## 11  MY  C1   2             02_11_15             2                F
## 12  MY  C1   2             02_12_15             2                F
## 13  MY  C2 2-B                                 NA                 
## 14  MY  C2   2             02_12_15             3                F
## 15  MY  C2   4             02_05_15             2                F
## 16  MY  C2   5             02_11_15             3                F
## 17  MY  C2   8             02_12_15             3                F
## 18  MY  D1   2             02_11_15             2                B
## 19  MY  D1   4             02_15_15             2                F
## 20  MY  D2   2             02_12_15             2                F

Note that since this line had no “<-”, the output was displayed directly in the console.

Another very useful tool is to display the names of the variables in the data frame; i.e. the column headings. This provides us a convenient reminder of the variable names and which columns they are in:

names(traitDataAll)
##  [1] "Pop"                  "Fam"                  "Num"                 
##  [4] "X2nd_Data_colec_date" "LatRating0215"        "PShootStatus0215"    
##  [7] "LShootStatus0215"     "Rhiz0215"             "Remarks1"            
## [10] "Remarks2"             "Collector"            "S1BLvs"              
## [13] "S2BLvs"               "S3BLvs"               "S4Order"             
## [16] "S4BrNode"             "S4FlNode"             "S5Order"             
## [19] "S5BrNode"             "S5FlNode"             "S6Order"             
## [22] "S6BrNode"             "S6FlNode"             "Rhiz0615"            
## [25] "CollDate"             "Comments"             "Diam1"               
## [28] "Diam2"                "Diam3"                "Infl"                
## [31] "Surv1216"

Note that some of the variables in the data frame are numeric, while others are alphanumeric and represent categories.

For the categorical variables, we might like to see what all the different categories (or “levels”) are; for example, which populations our samples come from. There are two ways to do this:

First, we can use the square bracket notation to select the column number of the variable we want, in this case column 1, as follows. (Note that subsetting traitDataAll with square brackets requires the [row,column] specification, so we just leave the row blank to get all rows.)

levels(traitDataAll[,1])
## NULL

Alternatively, we can specify the variable by placing a $ after the data frame object name, followed by the variable name:

levels(traitDataAll$Pop)
## NULL

In each case, running this function produces an output of NULL, meaning that there were no levels! Since we expected to see a list of categories, what is wrong?

We may want to check what type of variable (i.e. what “class”) R interpreted it as when it created the data frame. To do this, we can do the following:

class(traitDataAll$Pop)
## [1] "character"

This result shows us that R interpreted Pop as a character variable, not as a categorical factor. (Sometimes different versions of R will interpret the very same input differently. That’s what happened here; the version of R I was using previously with this script had correctly interpreted Pop as a factor!)

We can fix this as follows:

traitDataAll$Pop <- as.factor(traitDataAll$Pop)

In this case, the function as.factor converts its argument (the Pop variable) to a factor, and we assign the result back to the same variable.

To check this, we can again check its class:

class(traitDataAll$Pop)
## [1] "factor"

. . . and now we can see what the different populations are:

levels(traitDataAll$Pop)
## [1] "F1" "F2" "MY" "SP"
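If other character columns should also be treated as factors, the same conversion can be applied to each one in a single step. As a sketch (assuming PShootStatus0215 and Collector are also categorical, which the column listing above suggests):

```r
# Convert several character columns to factors at once
catVars <- c("Pop", "Fam", "PShootStatus0215", "Collector")
traitDataAll[catVars] <- lapply(traitDataAll[catVars], as.factor)
```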

4. Generating summary statistics:

For this study, plants were grown in a greenhouse under natural light conditions, with cooler temperatures in the winter. Seeds were sown in September. Vegetative rosette diameters were measured from photos taken at three time points: early December going into the winter (Diam1); in February when flowering was just beginning (Diam2); and in May near the end of flowering (Diam3).

First, let’s calculate the mean diameters (in mm) at each measurement. (The argument na.rm=TRUE removes samples for which the diameter data are missing, which were assigned values of NA when the data frame was generated.)

mean(traitDataAll$Diam1,na.rm=TRUE)
## [1] 100.2218
mean(traitDataAll$Diam2,na.rm=TRUE)
## [1] 98.41066
mean(traitDataAll$Diam3,na.rm=TRUE)
## [1] 73.12148

Note that the diameters were actually somewhat smaller on average after flowering. This makes biological sense if there are trade-offs of some kind between vegetative and reproductive growth. In fact, understanding the nature of these trade-offs is the main focus of our research.

Of course, what we’d really like to know is how the populations differed from each other in mean diameter. We can use the tapply function to find this:

with(traitDataAll,tapply(Diam1,Pop,mean,na.rm=TRUE))
##        F1        F2        MY        SP 
##  99.36667 100.35271  96.48276 103.84211

By using the with function, we can specify the data frame to use, which allows us to specify which variables to use in tapply without naming the data frame each time. The first argument to tapply specifies the measurement to be used, the second argument specifies the categories for which it will be estimated, and the third argument specifies the statistic to be estimated. Thus, the first line gives us the mean of Diam1 for each population.

Note that for Diam1, the SP plants are the largest on average and the MY plants the smallest, with the two hybrid sets (not surprisingly) in between.

Now let us look at Diam2:

with(traitDataAll,tapply(Diam2,Pop,mean,na.rm=TRUE))
##        F1        F2        MY        SP 
## 106.51429  97.89331 103.03571  92.63636

After the winter, when the plants start flowering, the F1 hybrids are now the largest, and the SP plants are the smallest.

with(traitDataAll,tapply(Diam3,Pop,mean,na.rm=TRUE))
##       F1       F2       MY       SP 
## 87.00000 73.06667 51.82143 77.72000

For Diam3, the F1s are still the largest, but MY and SP have again switched ranks and the MY plants are by far the smallest.

We can estimate the standard deviations of each measurement in the same way by making the third argument sd instead of mean:

with(traitDataAll,tapply(Diam1,Pop,sd,na.rm=TRUE))
##       F1       F2       MY       SP 
## 19.33102 21.43403 21.04013 21.17195
with(traitDataAll,tapply(Diam2,Pop,sd,na.rm=TRUE))
##       F1       F2       MY       SP 
## 15.16065 16.58808 15.45719 22.11344
with(traitDataAll,tapply(Diam3,Pop,sd,na.rm=TRUE))
##       F1       F2       MY       SP 
## 25.54190 20.19250 10.56217 18.41539

We can visualize these data distributions more easily with box-and-whisker plots. The structure of the boxplot arguments is a little different from the previous functions. Think of it as plotting rosette diameters as a function of population. Rather than using the with function, boxplot specifies the data frame with a data= argument.

boxplot(Diam1 ~ Pop, data=traitDataAll)

In these plots, the thick bar is the median (which might be a little different from the mean), the shaded box is the range between the 1st and 3rd quartiles (i.e. the 25th and 75th percentiles), and the whiskers extend an additional 1.5 times the interquartile range beyond the box, or to the most extreme values if those are closer.
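The quartiles underlying the boxes can be computed directly with the quantile function; for example, for Diam1 across the whole data set:

```r
# 0%, 25%, 50%, 75%, and 100% quantiles of Diam1, ignoring missing values
with(traitDataAll, quantile(Diam1, na.rm=TRUE))
```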

We see here that the population differences in Diam1 are pretty small compared to the variation within populations.

boxplot(Diam2 ~ Pop, data=traitDataAll)

The differences between populations in Diam2 seem more substantial. (Determining how statistically significant these differences are would require some additional analysis, which we will cover later.)

boxplot(Diam3 ~ Pop, data=traitDataAll)

Here the differences between populations in Diam3 look pretty large. The third quartile in MY is less than or equal to the first quartile in the remaining populations.

We can also plot histograms of the rosette diameter distributions, either for the entire data set or for individual populations. Here, we just do it for the F2 plants.

with(traitDataAll[traitDataAll$Pop=="F2",],
     hist(Diam3, breaks=c(0:20*10)))

The syntax above may seem a little odd since we specify the data frame twice. We need to use with so R knows to look for Diam3 in the traitDataAll data frame. We then use the square brackets to specify that we are only using rows of the data frame for which Pop == "F2". (== means “is identical to”.) The bracket subsetting happens before with sets up the data-frame environment, so inside the brackets Pop still has to be prefixed with the data frame name.
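An equivalent, and arguably tidier, way to do the row selection is the subset function, which (like with) evaluates its condition inside the data frame, so Pop can be used without a prefix:

```r
# Keep only the F2 rows, then plot the histogram
F2only <- subset(traitDataAll, Pop == "F2")
hist(F2only$Diam3, breaks=c(0:20*10))
```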

We see that the data look more-or-less normally distributed. Another way to visualize just how normally distributed the data are is to use the function qqnorm, as follows:

with(traitDataAll, qqnorm(Diam3))
with(traitDataAll, qqline(Diam3))

In this plot, the x-axis shows the expected distribution of measurements in standard deviation units from the mean. The y-axis shows the actual distribution of trait values, and the line shows the expected relationship if the data were normally distributed.

The plot for the F2 samples shows the data are pretty close to a normal distribution although the values are somewhat compressed at the lower end. (Notice that there is also one outlier with a measurement of 0 for Diam3. This dramatizes why we should not put in a zero for a missing measurement rather than leaving it blank!)
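If we want to track down that outlier, we can ask which rows have a Diam3 of exactly zero (which, conveniently, drops the NAs automatically); a quick sketch:

```r
# Identify F2 rows where Diam3 was recorded as 0
zeroRows <- which(traitDataAll$Pop == "F2" & traitDataAll$Diam3 == 0)
traitDataAll[zeroRows, c("Pop", "Fam", "Num", "Diam3")]
```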

5. Calculated variables:

We can add additional variables to our data frame that we calculate from other variables. For example, with our diameter measurements we may prefer to replace Diam2 and Diam3 with the net growth between diameter measurements. Thus, we do the following:

traitDataAll$dDiam21 <- with(traitDataAll,Diam2 - Diam1)
traitDataAll$dDiam32 <- with(traitDataAll,Diam3 - Diam2)

Looking at rosette size in this way, Diam1, dDiam21, and dDiam32 represent the net vegetative growth in three independent time periods. When we compare dDiam21 and dDiam32 between populations, we might gain new insights on the differences between them:

with(traitDataAll,tapply(dDiam21,Pop,mean,na.rm=TRUE))
##        F1        F2        MY        SP 
##  9.964286 -2.854978  5.607143 -8.705882
with(traitDataAll,tapply(dDiam32,Pop,mean,na.rm=TRUE))
##        F1        F2        MY        SP 
## -20.20000 -25.16088 -51.21429 -17.22727

We see here that the MY plants had a much greater net loss of vegetative tissue during the reproductive period on average than SP plants, with the two hybrid classes showing intermediate values.

We also need to generate a calculated variable to make use of a key set of measurements. On each plant, we sampled three inflorescences (which are shoots that had switched from being compact shoots producing normal leaves to greatly elongated shoots that produce flowers), and counted the number of normal basal leaves produced on each one before the switch. This can be thought of as representing how long each new shoot grew vegetatively before its growing tip became reproductive. These counts are in three separate columns, and we are interested in the mean value for each plant. To further complicate things, some plants did not have three inflorescences in the first place, so we need the mean of fewer inflorescences on those plants.

We can generate this mean correctly as follows using the apply function:

traitDataAll$meanBLvs <- apply(traitDataAll[,c(12:14)],1,mean,na.rm=TRUE)

The first argument to apply specifies that the data to use are columns 12 through 14 of traitDataAll. (If we refer back to our vector of names we generated earlier, we see that these are three variables named S1BLvs, S2BLvs, and S3BLvs.)

The second argument (the numeral 1) specifies that the calculation we apply to these data will be across rows; in other words separately to the data in columns 12 through 14 of each sample. If we instead put in “2”, the operation would be applied to each of the three columns across all samples.

The third argument specifies that the calculation is to take the mean. Finally, na.rm=TRUE removes any missing values.
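The row-versus-column distinction in the second argument is easy to see with a toy matrix:

```r
m <- matrix(1:6, nrow=2)  # 2 rows, 3 columns, filled column by column
apply(m, 1, mean)         # one mean per row:    3 4
apply(m, 2, mean)         # one mean per column: 1.5 3.5 5.5
```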

We generate means in a similar way for three sets of measurements we took on three different inflorescences to evaluate their branching patterns:

traitDataAll$meanOrder <- apply(traitDataAll[,c(15,18,21)],1,mean,na.rm=TRUE)
traitDataAll$meanBrNode <- apply(traitDataAll[,c(16,19,22)],1,mean,na.rm=TRUE)
traitDataAll$meanFlNode <- apply(traitDataAll[,c(17,20,23)],1,mean,na.rm=TRUE)

meanOrder measures the mean complexity of branching on the sampled inflorescences. meanBrNode and meanFlNode describe the lowest node on the inflorescence on average that produced the first branch and the first flower, respectively.

Let’s look now at how the populations differ in their mean values for these traits:

with(traitDataAll,tapply(meanBLvs,Pop,mean,na.rm=TRUE))
##       F1       F2       MY       SP 
## 3.885714 3.976477 3.057471 4.101449

We see that MY plants tended to produce fewer basal leaves on new shoots before they started elongating (about 3 basal leaves on average) than the other three populations (about 4 on average)

with(traitDataAll,tapply(meanOrder,Pop,mean,na.rm=TRUE))
##       F1       F2       MY       SP 
## 2.095238 2.265688 2.816092 1.492754
with(traitDataAll,tapply(meanBrNode,Pop,mean,na.rm=TRUE))
##       F1       F2       MY       SP 
## 2.104762 2.190182 2.683908 1.600000
with(traitDataAll,tapply(meanFlNode,Pop,mean,na.rm=TRUE))
##       F1       F2       MY       SP 
## 4.695238 4.584629 4.712644 3.202899

In essence, these data show that MY plants tended to make branchier inflorescences than SP plants, and to produce branches higher up on the inflorescence than SP plants. The F1 and F2 hybrids tended to be intermediate for these traits.

6. An assignment:

Download the data file QTLStudyTraitAnalysis_surv.csv from the Github repository (using the link under “Resources” on the Remington Lab website), and run all the scripts shown above to regenerate the work we’ve done so far. Then use tapply and boxplot to examine how the following traits vary between populations.

Infl: Count of the number of separate inflorescences at end of flowering season; i.e. the number of vegetative shoots that underwent transition to being reproductive.

LatRating0215: Visual scoring of how much vegetative branching had taken place by February when plants started flowering. Scored on a 1-5 scale, where 1 = no branching and 5 = extensive branching.

Rhiz0615: This indicates whether or not the plant had produced rhizomatous shoots (new vegetative rosettes that develop from roots away from the main stem) by the end of flowering. “1” indicates that a plant had rhizomatous shoots, and “0” indicates that it did not.

Surv1216: This is whether the plants were still alive 18 months later.

For each trait, describe briefly the results, including how large any differences between populations appear to be. (The box-and-whisker plot for Rhiz0615 won’t be very meaningful since the only measurements are 0 or 1.) Remember that these statistics are descriptive only, and can’t by themselves tell us whether or not the differences are statistically significant.

7. Trait correlations and principal components:

So far, we have just looked at differences between populations in individual traits, one at a time. Understanding the nature of trade-offs requires that we look at the pattern of correlations between traits.

To do this, it will be most informative to look specifically at the F2 plants. In the F2s, independent assortment and crossovers between MY and SP chromosomes in their F1 parents resulted in different chromosomal segments being inherited independently. Thus, a correlation between traits in the F2s suggests that variation in the same genes, or in multiple genes very close together, affected both traits at once.

We will first generate a new data frame that just includes the F2 plants:

traitDataF2 <- with(traitDataAll,traitDataAll[Pop=="F2",])

To see what our sample size is, we can use the function length with one of the variables:

length(traitDataF2$Fam)
## [1] 600

We see that there are exactly 600 F2 plants in our sample. (This wasn’t intentional; it just turned out that way!)
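An alternative that doesn’t depend on picking a particular variable is nrow, which returns the number of rows in the data frame directly:

```r
nrow(traitDataF2)
```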

To explore correlations between traits, let’s again use names to display all our variables and the column numbers, since we’ve added a few new ones.

names(traitDataF2)
##  [1] "Pop"                  "Fam"                  "Num"                 
##  [4] "X2nd_Data_colec_date" "LatRating0215"        "PShootStatus0215"    
##  [7] "LShootStatus0215"     "Rhiz0215"             "Remarks1"            
## [10] "Remarks2"             "Collector"            "S1BLvs"              
## [13] "S2BLvs"               "S3BLvs"               "S4Order"             
## [16] "S4BrNode"             "S4FlNode"             "S5Order"             
## [19] "S5BrNode"             "S5FlNode"             "S6Order"             
## [22] "S6BrNode"             "S6FlNode"             "Rhiz0615"            
## [25] "CollDate"             "Comments"             "Diam1"               
## [28] "Diam2"                "Diam3"                "Infl"                
## [31] "Surv1216"             "dDiam21"              "dDiam32"             
## [34] "meanBLvs"             "meanOrder"            "meanBrNode"          
## [37] "meanFlNode"

We would like to explore correlations between the following nine traits: Diam1, dDiam21, dDiam32, Infl, LatRating0215, meanBLvs, Rhiz0615, meanOrder, and Surv1216.

To do this we use the function cor, specifying the column numbers for the variables we want to include. We will also specify that each pairwise correlation should use all samples that have data for both traits, and that Pearson correlations of the actual trait values (not of their rank orders) will be used. This will produce a 9x9 table of the correlation values.

cor(traitDataF2[,c(27,32,33,30,5,34,24,35,31)],
    use="pairwise.complete.obs", method="pearson")
##                     Diam1       dDiam21     dDiam32         Infl LatRating0215
## Diam1          1.00000000 -0.6809898199 -0.38544755  0.211396175   -0.01138508
## dDiam21       -0.68098982  1.0000000000 -0.17487380  0.006498393   -0.03886277
## dDiam32       -0.38544755 -0.1748738023  1.00000000 -0.506062614    0.08451300
## Infl           0.21139618  0.0064983927 -0.50606261  1.000000000    0.03489757
## LatRating0215 -0.01138508 -0.0388627662  0.08451300  0.034897572    1.00000000
## meanBLvs      -0.15926406  0.0141061300  0.26879632 -0.212745609    0.06403816
## Rhiz0615       0.01156672  0.0249990731  0.10708639 -0.055386708    0.10304944
## meanOrder     -0.01299630 -0.0029787963 -0.11038235  0.121832334   -0.07891854
## Surv1216      -0.02988549  0.0001290826  0.06428793  0.031700604    0.03209561
##                  meanBLvs    Rhiz0615    meanOrder      Surv1216
## Diam1         -0.15926406  0.01156672 -0.012996304 -0.0298854944
## dDiam21        0.01410613  0.02499907 -0.002978796  0.0001290826
## dDiam32        0.26879632  0.10708639 -0.110382351  0.0642879332
## Infl          -0.21274561 -0.05538671  0.121832334  0.0317006036
## LatRating0215  0.06403816  0.10304944 -0.078918542  0.0320956065
## meanBLvs       1.00000000 -0.02959943 -0.109175269 -0.0295029645
## Rhiz0615      -0.02959943  1.00000000  0.012680628  0.1201576013
## meanOrder     -0.10917527  0.01268063  1.000000000  0.0738436185
## Surv1216      -0.02950296  0.12015760  0.073843619  1.0000000000

The correlations can range from -1 (completely negatively correlated) to +1 (completely positively correlated). A value of zero means that there is no correlation at all. Note that each variable has a correlation of 1.0000 with itself, as we would hope!

A key correlation to look for is between dDiam32 and Infl, which would be a strong negative correlation if there were a trade-off between vegetative growth and reproduction during the flowering season. Indeed, this is one of the strongest correlations in the entire table (r ~= -0.51). Initial and overwinter diameter growth (Diam1 and dDiam21) are also strongly negatively correlated.
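A single correlation of interest can also be computed (and tested) on its own rather than as part of the full table; for example, for dDiam32 and Infl:

```r
# Just this one pairwise correlation
with(traitDataF2, cor(dDiam32, Infl, use="pairwise.complete.obs"))

# cor.test adds a P-value and a confidence interval for the correlation
with(traitDataF2, cor.test(dDiam32, Infl))
```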

meanBLvs also has a moderately strong negative correlation with Infl (-0.21), and a moderately strong positive correlation with dDiam32 (0.27). This provides some support for a previous hypothesis that the growth-reproduction trade-offs are due in part to genetic variation in how rapidly the tips of new vegetative shoots transition to becoming reproductive. The more basal vegetative leaves on a shoot, the slower the transition, which we hypothesized would lead to fewer inflorescences and less net loss of leaf tissue during the reproductive season.

Some correlations that are not very strong are between Infl and the measures of inflorescence branching, between rhizomatous shoots and everything else, and between Surv1216 and everything else. This suggests that the genes affecting differences in number of inflorescences don’t affect inflorescence branching, that genes affecting variation in rhizomatous shoot development don’t affect much of anything else, and that genes affecting any of the other traits had little impact on survival.

A more comprehensive way to look at correlations among the entire set of traits at once is to do a principal components analysis (PCA). This identifies the correlation pattern among traits that explains the most total variation (called PC1), then the pattern that explains the most of the remaining variation (PC2), and so on. These PCs represent a new set of axes, which are called eigenvectors, to describe the data. If trade-offs between vegetative and reproductive growth dominate the correlations among traits, we might expect PC1 to involve this trade-off.

To set up our PCA, we first create a new data frame with just the variables we want to use. We will use the same variables we used for the correlation analysis but leave out the survival variable:

PCTraitsF2 <- traitDataF2[,c(27,32,33,30,5,34,24,35)]

Next, we remove all samples (rows) that have missing data for any of the variables using the function na.omit, and put the results in another new data frame:

PCTraitsF2Trim <- na.omit(PCTraitsF2)

We can check to make sure we selected the columns we wanted:

names(PCTraitsF2Trim)
## [1] "Diam1"         "dDiam21"       "dDiam32"       "Infl"         
## [5] "LatRating0215" "meanBLvs"      "Rhiz0615"      "meanOrder"

. . . and see how many samples are remaining:

length(PCTraitsF2Trim$Diam1)
## [1] 440

This reduced the number of samples from 600 to 440, but PCA cannot make use of missing data, so the trimming is necessary.

Now we use the function prcomp to run the PCA on the trimmed data frame. We will save the output in a new object (technically a list), which will allow us to run different analyses on the results. The argument scale. = TRUE (note the “.” after “scale”) tells prcomp to standardize each variable first by centering it on its mean and dividing by its standard deviation.

PCModelF2 <- prcomp(PCTraitsF2Trim, scale. = TRUE)

We can view a summary of the results as follows:

summary(PCModelF2)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6    PC7
## Standard deviation     1.4213 1.2307 1.0420 1.0165 0.9511 0.88209 0.7327
## Proportion of Variance 0.2525 0.1893 0.1357 0.1292 0.1131 0.09726 0.0671
## Cumulative Proportion  0.2525 0.4418 0.5776 0.7067 0.8198 0.91706 0.9842
##                            PC8
## Standard deviation     0.35602
## Proportion of Variance 0.01584
## Cumulative Proportion  1.00000

This shows us that the first two principal components (PC1 and PC2) have standard deviations substantially greater than 1.0. Since the individual traits were standardized to have standard deviations of 1.0, this shows us that the data are “stretched” along these two axes, telling us that they represent correlations among the traits.
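The same proportions of variance can be visualized as a scree plot, which makes the drop-off after the first two PCs easy to see:

```r
# Bar plot of the variance accounted for by each PC
screeplot(PCModelF2, main="Variance explained by each PC")
```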

To see the actual axes (eigenvectors) each PC represents, we do the following to view the rotation item in the PCModelF2 object, which is technically a list:

PCModelF2$rotation
##                       PC1         PC2          PC3          PC4         PC5
## Diam1          0.54630646 -0.43597398  0.035158296 -0.008936645  0.06185517
## dDiam21       -0.28696115  0.69109882 -0.062872885 -0.142911501  0.09556237
## dDiam32       -0.50378269 -0.35282772  0.001983747  0.208349185 -0.07068949
## Infl           0.46915441  0.29951831 -0.125079156 -0.211365802 -0.22182080
## LatRating0215 -0.05956755 -0.14925977 -0.658214363 -0.481447381 -0.41311751
## meanBLvs      -0.33082584 -0.20706306  0.207035743 -0.193233563 -0.50140456
## Rhiz0615      -0.06646145 -0.04989468 -0.708723563  0.465628196  0.24456883
## meanOrder      0.16699858  0.22464955  0.027624769  0.636766163 -0.67153092
##                      PC6         PC7           PC8
## Diam1          0.1644490  0.23450823 -0.6513803170
## dDiam21        0.1200222  0.19573200 -0.5948197256
## dDiam32       -0.3435940 -0.49340982 -0.4602121044
## Infl           0.1024847 -0.75120268 -0.0764825578
## LatRating0215 -0.2912501  0.23145641 -0.0084109791
## meanBLvs       0.7180024 -0.02328479  0.0002265174
## Rhiz0615       0.4525597 -0.08880751  0.0385190544
## meanOrder     -0.1575657  0.19218516 -0.0519027979

We see that PC1 is approximately 0.55 times the standardized value of Diam1, plus -0.29 times the standardized value of dDiam21, etc.
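We can check this by reconstructing the PC1 score of the first sample by hand and comparing it with the score prcomp stored in the x item of the output list:

```r
# Standardize the trimmed data the same way prcomp did (center, then scale)
stdTraits <- scale(PCTraitsF2Trim)

# PC1 score of sample 1, built from the coefficients by hand...
sum(stdTraits[1,] * PCModelF2$rotation[,"PC1"])

# ...should match the stored score
PCModelF2$x[1, "PC1"]
```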

The largest coefficients in PC1, in terms of absolute value, are those for Diam1, dDiam32, and Infl, followed by meanBLvs. Thus, both the trade-off between reproduction and reproductive-season vegetative growth, along with its hypothesized cause (how rapidly new shoot tips transition from being vegetative to being reproductive), contribute strongly to PC1. However, initial diameter growth (Diam1) also contributes, suggesting that plants that acquired more photosynthetic resources (leaf size) early on were able to flower more.

The strongest contributors to PC2 are a negative correlation between initial diameter growth (Diam1) and subsequent overwinter growth (dDiam21). However, the same vegetative-reproductive trade-off during the reproductive season also contributes fairly strongly to this PC.

The actual signs of the coefficients for an individual PC are arbitrary. For PC1, if we made all the positive coefficients negative and vice versa, we would be describing the exact same relationship among the variables. It is the relative signs, representing positive vs. negative correlations among variables, that are important.

Finally, we can plot the first two PCs, showing the predicted values for each individual (in orange) along with the coefficients for each trait (vectors in black). This gets messy with such a large sample, though!

biplot(PCModelF2, pc.biplot = TRUE, col = c("orange","black"))

One last warning here: since Rhiz0615 is a binary (0/1) trait, it clearly violates the normal-distribution assumptions on which PCA is based. In practice, this might not make a lot of difference, but one should be aware of the potential complications.

8. Some basic statistical analysis:

We said earlier that our descriptive statistics and plots did not give us specific information on the statistical significance of differences between populations. To test for statistical significance we can generate some basic linear models using the lm function in R.

Let’s analyze the effects of population on the three rosette diameter growth measurements we looked at earlier. For Diam1, the model specification is as follows. Note that the arguments to lm are very similar to those used earlier for boxplot. However, we will generate an object for each trait, from which we can then produce a summary and other output.

Diam1ByPop <- lm(Diam1 ~ Pop, data = traitDataAll)

To view a summary of the output, we use summary:

summary(Diam1ByPop)
## 
## Call:
## lm(formula = Diam1 ~ Pop, data = traitDataAll)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.353 -14.483   0.647  14.517  67.647 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   99.367      3.890  25.546   <2e-16 ***
## PopF2          0.986      4.005   0.246    0.806    
## PopMY         -2.884      5.548  -0.520    0.603    
## PopSP          4.475      6.247   0.716    0.474    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.31 on 573 degrees of freedom
##   (115 observations deleted due to missingness)
## Multiple R-squared:  0.002627,   Adjusted R-squared:  -0.002595 
## F-statistic: 0.503 on 3 and 573 DF,  p-value: 0.6804

The first part gives us the distribution of the residuals, showing the extremes and quartiles of the Diam1 variation among samples that is not explained by population.

Coefficients tells us about the relationship between population and Diam1. The Estimate for (Intercept) is the modeled Diam1 value for the first population when the population names are arranged alphabetically or numerically; in this case F1. The remaining estimates are the differences in mean Diam1 between each of the other populations and the F1s. In other words, the modeled Diam1 value for the F1s is 99.367 mm, the mean effect of the F2 population is to increase Diam1 by 0.986 mm over that of the F1, etc. Thus, the model estimate for Diam1 in the F2s is 99.367 + 0.986 = 100.353.
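This intercept-plus-coefficient arithmetic can be verified directly in R. The sketch below uses simulated data with the same population labels (not the tutorial's traitDataAll), and shows that intercept plus group coefficient recovers the raw group mean:

```r
# Sketch: group mean = intercept + group coefficient (simulated data).
set.seed(42)
pop <- factor(rep(c("F1", "F2", "MY", "SP"), each = 25))
y <- rnorm(100, mean = 100) + (pop == "F2") * 1   # F2s slightly larger
fit <- lm(y ~ pop)

co <- coef(fit)
f2Mean <- co["(Intercept)"] + co["popF2"]   # modeled F2 mean

# The modeled F2 mean equals the raw F2 group mean:
all.equal(unname(f2Mean), mean(y[pop == "F2"]))
```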

The Std. Error is the standard error of each estimate, a measure of its expected sampling variation. The t value is a test statistic: roughly, the estimate expressed in units of its standard error. The last column is the P-value: how probable the observed data would be under the hypothesis that there is no real difference between populations. These P-values by themselves don’t tell us much. The P-value for the intercept will always be extremely small, because it represents the probability of the data if the true value of Diam1 in the F1s were zero. The other P-values represent the probability of the data if the other populations had a true Diam1 difference of zero from the F1s.

The more useful P-value is the one in the last line of output associated with the F-statistic. This is the probability of observing the actual amount of variation if population has no effect overall on Diam1. In this case, the P-value is 0.68, which means that sample sets with no true variation between populations would show at least this much sample variation about 68% of the time. This shouldn’t be surprising since we already noted that there wasn’t much difference in Diam1 between populations compared to the within population variation.

Another useful bit of information, just above the F-statistic line, is the Multiple R-squared value. This tells us how much of the overall variance in Diam1 values is explained by the effect of population. As we might expect, this is very small: only about one quarter of 1% of the overall Diam1 variation involves differences between populations.
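Both the R-squared and the overall F-test P-value can be pulled out of the summary object programmatically, which is handy when fitting many traits. A minimal sketch on simulated data (not traitDataAll):

```r
# Sketch: extract R-squared and the overall F-test P-value from summary.lm.
set.seed(7)
pop <- factor(rep(c("F1", "F2", "MY", "SP"), each = 25))
y <- rnorm(100, mean = 100)
fit <- lm(y ~ pop)
s <- summary(fit)

s$r.squared                 # Multiple R-squared
f <- s$fstatistic           # F value plus numerator/denominator df
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)  # overall P-value
```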

Now let’s examine dDiam21, for which the population differences appeared to be more substantial:

dDiam21ByPop <- lm(dDiam21 ~ Pop, data = traitDataAll)
summary(dDiam21ByPop)
## 
## Call:
## lm(formula = dDiam21 ~ Pop, data = traitDataAll)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.145 -12.145  -2.145  10.855  83.855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    9.964      3.968   2.511  0.01233 * 
## PopF2        -12.819      4.086  -3.137  0.00180 **
## PopMY         -4.357      5.612  -0.776  0.43782   
## PopSP        -18.670      6.456  -2.892  0.00398 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21 on 531 degrees of freedom
##   (157 observations deleted due to missingness)
## Multiple R-squared:  0.02794,    Adjusted R-squared:  0.02244 
## F-statistic: 5.087 on 3 and 531 DF,  p-value: 0.001766

Here we see a somewhat different story. Two of the populations have observed differences from the F1s that are quite improbable if there were no real difference (P < 0.01). The overall P-value for the model (P = 0.001766) indicates clearly that the data would be improbable if there were no real differences in winter diameter growth between populations – leading to the conclusion that such differences exist.
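As an aside, the same overall test can be obtained with the anova function, which presents it as a classical ANOVA table. A sketch on simulated data (not the tutorial's dataset):

```r
# Sketch: anova() on an lm object gives the same overall F-test
# as the last line of summary(), in ANOVA-table form.
set.seed(5)
pop <- factor(rep(c("F1", "F2", "MY", "SP"), each = 25))
y <- rnorm(100) + (pop == "SP") * 1
fit <- lm(y ~ pop)

anova(fit)   # the Pr(>F) entry for pop matches the P-value in summary(fit)
```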

Finally, we analyze the effects of population on reproductive season diameter growth (dDiam32):

dDiam32ByPop <- lm(dDiam32 ~ Pop, data = traitDataAll)
summary(dDiam32ByPop)
## 
## Call:
## lm(formula = dDiam32 ~ Pop, data = traitDataAll)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -125.839  -19.839   -0.839   17.161   96.161 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -20.200      4.641  -4.352 1.57e-05 ***
## PopF2         -4.961      4.788  -1.036    0.301    
## PopMY        -31.014      6.962  -4.455 9.95e-06 ***
## PopSP          2.973      7.471   0.398    0.691    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.46 on 628 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.04251,    Adjusted R-squared:  0.03793 
## F-statistic: 9.294 on 3 and 628 DF,  p-value: 5.091e-06

This time, the data would be very unlikely if population had no effect, with a P-value of less than 1 in 100,000. The Coefficients table shows us that the main contributor is the MY population having a much greater diameter loss than the other populations. Perhaps surprisingly, neither the F2 nor the SP population differs significantly from the F1s.
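Remember that each coefficient is a comparison against the baseline population (the first factor level, here F1). If you wanted to compare the other populations against MY instead, you could change the reference level with relevel. A sketch on simulated data (hypothetical values, not traitDataAll):

```r
# Sketch: changing the baseline population with relevel().
set.seed(3)
pop <- factor(rep(c("F1", "F2", "MY", "SP"), each = 25))
y <- rnorm(100, mean = 50)
fitF1 <- lm(y ~ pop)                        # baseline = F1 (alphabetical order)
fitMY <- lm(y ~ relevel(pop, ref = "MY"))   # baseline = MY

coef(fitF1)["(Intercept)"]   # modeled F1 mean
coef(fitMY)["(Intercept)"]   # modeled MY mean; other coefficients now
                             # give differences from MY
```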

The proportion of overall variance explained by population is still pretty small, a little over 4%. However, for a quantitative genetic dataset, this is actually pretty large, as we often find considerable genetic variation within as well as between populations, plus large amounts of non-genetic variation due to environmental factors and simple developmental “noise.”

Finally, we can use the plot function with lm objects to produce a set of diagnostic plots that can sometimes be useful:

plot(dDiam32ByPop)

Most of these plots aren’t very useful with categorical predictors such as Pop. They become more important when the predictor (independent variable; the X-axis variable) is quantitative, making the relationship a linear regression. We’ll deal with this in the next tutorial!

The second plot, analogous to the qqnorm function we used earlier, examines how well the residuals (the remaining variation not explained by population) fit a normal distribution. If the fit is poor, the statistical assumptions that underlie tests of linear models are violated and the resulting P-values might be suspect. However, in this case the fit looks quite strong.
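The same residual check can be done by hand with the qqnorm and qqline functions used earlier, applied to the model residuals. A sketch on simulated data (not one of the tutorial's fitted models):

```r
# Sketch: normal Q-Q plot of model residuals, done manually.
set.seed(9)
pop <- factor(rep(c("F1", "F2"), each = 50))
y <- rnorm(100) + (pop == "F2") * 2
fit <- lm(y ~ pop)

qqnorm(residuals(fit))   # points should fall near a straight line
qqline(residuals(fit))   # reference line for a normal distribution
```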

9. Another assignment:

Analyze linear models for the following traits: Infl, meanBLvs, LatRating0215, meanOrder, and meanFLNode. Examine the summaries and write a brief interpretation of the results.