Intro to RDA as a flexible tool

Instructor: Riginos

RDA as a flexible tool

GEAs: Genotype-by-environment associations

( = environmental associations)

Genome-wide: related to demographic history, ecological speciation/diversification, isolation by environment

Processes leading to IBE - Wang & Bradburd 2014

Tutorial on RDAs, GDM, and GF mostly sit in this category.

Finding loci contribute to heritable genetic variation of selected traits

Rellstab et al 2015 (Image modified from Sork et al 2013)

There is a huge literature on the topic of finding candidate loci for environmental selection that we cannot cover in one week. Population genetic tests of selection such as outlier tests or genomic scans are frequently used. There are also tests of selection that specifically look for associations of individual loci to environmental attributes - often called GEA, genotype-environment association or EAA, environment association analysis - this will be more of our focus given that we are studying landscape genomics. We will come back to this topic on Friday. (Rellstab et al. 2015 and Storfer et al 2018 have good reviews of the various methods if you are looking for further reading on this topic.)

Why RDA (and other multivariate methods)?

RDA allows you to have multiple response variables (loci) and multiple predictors (spatial or environmental attributes)

Predictive variables	Response variables	Appropriate family of methods *
One	One	Univariate/Simple regression
Many	One	Multiple regression
Many	Many	Multivariate regression

* Note that your genetic data not conform to expectations of parametric methods, including independence of data points

RDA was developed for community ecology data and has been adopted for genetic analyses. In an ecological analysis, there will be various spatial and environmental predictive variables (n sites by m predictive variables) and biological response variables (typically, species abundance organised in n sites by p species). Much of the literature about using ordination and multivariate analyses for environmental analyses will be described as ecological communities).

RDA is an example of methods that focus on nodes, following the schematic below.

Statistical models - Wagner & Fortin 2016

A key element of multivariate methods (including the RDA family) is ordination (which falls within the broader descriptor of eigenanalysis).

Ordination

Ordination allows

Compression and visualization of complex data
Flexibility to use distances to describe spatial and biological distances (between populations or individuals)
Flexibility to use allele frequencies (populations) or genotypes (individuals)

Pitfalls on using ordination with genetic data

Missing observations need to be excluded or imputed
Uneven sampling can skew results
Ordination is pattern description: many processes can yield the same patterns

We focus on RDA as a multivariate method because

it is flexible structure is very useful for landscape genomic data
it can describe genome-wide patterns
RDA performs well in genotype-environment-association analyses with low false positives and high true positives
The regression on one matrix on another.
Uses a combination of linear regression and PCA.

RDA can address the following questions (citing Capblancq et al. 2021):

1. What environmental/spatial processes drive patterns of genetic variation?

2. What is the genetic basis of local adaptation to the environment?

3. How is adaptive genetic variation distributed across landscapes?

4. What are the impacts of climate and/or landscape change on the distribution of adaptive genetic variation?

RDA takes linear combinations of the explanatory variables (X) and uses them to maximise the variance explained in linear combinations of Y (where X and Y are matrices)

Keep this linearity in mind: it might not be appropriate for your data

A schematic for how RDA works:

Conceptual overview of RDA - From Capblancq et al. 2021

Review of unconstrained ordination (PCA & PCoA)

PCA - principal components analysis

No predictive variables: unconstrained.
Fits a line through the data along an axis that maximizes the variance described by the first PC axis (PC1).
Fits the next line to maximize the variance described for the second PC axis (PC2) where PC1 and PC2 are orthogonal (their variances are not correlated).
And so on.
Note that this is a linear procedure: lines are being fitted.
Eigenvalues ($\lambda$) describe the amount of variance explained by each eigenvector (= line = PC axis)

PCoA or PCO: Principal coordinates analysis

same as metric multidimensional scaling
takes pairwise distances ($d_{ij}$, sometimes called dissimilarities) between points and uses ordination to find axes that maximally explain dissimilarities
PCoA on Euclidean distances is the same as PCA
Sometimes you get negative eigenvalues and have to use a correction
(non metric multidimensional scaling is similar but does not require linear data)

Simple redundancy analysis (constrained ordination)

Example from Capblancq & Forester 2021 - lodgepole pine

use ~3000 “intergenic” SNPs as neutral loci (genotyped from 50K SNP chip)
geography, climate and neutral structure are all confounded

Goals

to undertake a GEA and ignore effect of geography+structure -> possible false positives
to undertake a GEA and remove geography+structure -> possible false negatives
There is no clear solution to the problems of false positives and false negatives

Data preparation for RDA

get environmental variables and scale them to a mean of 0, SD = 1
categorical predictors can be used as dummy variables (0/1)
look for colinearity among variables - might remove variables with variance inflation factor greater than ?

Variable selection with forward model building

Build up a predictive model and assists with variable reduction

Test significance of global model (= all variables)
Start with “empty” model (= intercept only) and sequentially add variables
Two stopping criteria to avoid overfitting - end with either criterion
- permutation based significance test
- adjusted $R^2$ from global model

Simple RDA output - Capblancq & Forester 2021

partial RDA

Allows independent estimation of sets of variables together with confounded effects caused by collinearity.

Complex models and variance partitioning

Models upon models - Zbinden et al. 2023

Variance partitioning - averages across species - Zbinden et al 2023

Variance partitioning - species details - Zbinden et al 2023

Extra topic - dealing with spatial correlation structures

Variety of approaches (to be explored in the computer activities that follow)

Include spatial position as a covariate
Use genetic distances and linearize with PCoA
Use Moran Eigenvector Mapping

Further resources

The absolutely go-to package in R for all ecological eigenanalysis is vegan. It is very well documented and has excellent vignettes that are well-worth working through.
Really nice slide deck that demonstrates some vegan functions with excellent illustrations Intro to vegan
Another great slide deck that runs through RDA and associated vegan functions Redundancy Analysis

Notes developed from:

Capblancq, T., & Forester, B. R. (2021). Redundancy analysis: A Swiss Army Knife for landscape genomics. Methods in Ecology and Evolution. doi:10.1111/2041-210x.13722

Rellstab, C., Gugerli, F., Eckert, A. J., Hancock, A. M., & Holderegger, R. (2015). A practical guide to environmental association analysis in landscape genomics. Molecular Ecology, 24 (17), 4348-4370.

Wang, I. J., & Bradburd, G. S. (2014). Isolation by environment. Molecular Ecology, 23(23), 5649-5662. doi:papers3://publication/doi/10.1111/mec.12938

Zbinden, Z. D., Douglas, M. R., Chafin, T. K., & Douglas, M. E. (2022). Riverscape community genomics: A comparative analytical approach to identify common drivers of spatial structure. Molecular Ecology, 32, XXX-XXX. doi:10.1111/mec.16806

Computer tutorial - undertaking RDA

Tutorial based on “RDA applications in landscape genomics, by Thibaut Capblancq & Brenna Forester, 2021

The original tutorial can be found on GitHub and shows how to locate and prepare original data files.

This 2023 update uses files that have been compiled and tidied. Much of the code is also updated to make use of sf and terra but some of the code is from the original tutorial. The original code is likely to break soon as support for older spatial packages is removed - you will likely see some warnings.

Original comments by Capblancq & Forester will be indented

like this!

Note that after publication, the authors discovered some minor errors in their code - the updated results (that should match our analyses) can be found here.

Introduction

This tutorial provides code and explanation associated with a review of the different applications of RDA in the field of landscape genomics written by Thibaut Capblancq & Brenna Forester (2021) Redundancy Analysis (RDA): a Swiss-army knife for landscape genomics.

We highly recommend the following book for those interested in RDA: Borcard D, Gillet F, Legendre P (2018) Numerical Ecology with R, 2nd edition.

(Cynthia also endorses the same book)

Datafiles

Climatic variables

The values of 27 bioclimate variables were extracted for all 281 populations from the ClimateNA database for the period 1961-1990, and projections for 2050 and 2080. The projections are based on an ensemble of 15 AOGCMs using CMIP5.

These data have been compiled and projected into the WGS84 reference coordinate system following instructions from Capblancq & Forester and saved as a spatial raster.

ras_6190.tif = present day

Population allele frequencies

Original genetic data and metadata from Mahony et al. 2019, available here. SNP genotypes were derived for individual seedlings from 281 populations of lodgepole pine (Pinus contorta).

Pre-processing:

Called genotypes were recoded as numeric in order to run RDA using the vegan library. Although some other packages will conduct PCAs and RDAs, vegan is recommended. In the numeric formatting, 0 represents an individual homozygous for the major allele, 1 represents a heterozygote, and 2 is homozygous for the alternative allele.
Dataframes were subset to individuals with spatially relevant metadata.
The mean allele frequency per population (281 populations) was computed
Loci with missing data from 12 or more populations were excluded
For population-by-locus combinations with NA values, the values were imputed as the median across all populations (cannot have missing data for PCA and RDA)
MAFs < 0.05 or > 0.95 were excluded from full SNP data set (but not “neutral” SNPs)

Datafiles

AllFreq: 281 populations, 28658 loci
AllFreq_neutral: 281 populations, 3936 loci
PlSeedlots.csv: the geographic coordinates of each source population.

Look at all these files using tools and techniques that you have learned. What is in the ras_6190.tif object? (try plotting it)

Some comments from C & F:

An RDA model can either be conducted with individual-based genotypes (0, 1, 2 format) or population-based allele frequencies (ranging from 0 to 1). We decided here to work with allele frequencies for the main reason that several individuals were genotyped at each sampling site (i.e., source population) and experienced the exact same climatic conditions. Plus, the sample sizes varied across populations.

(For full data set) SNPs with a minor allele frequency inferior to 5% were filtered out to avoid giving too much importance to rare alleles when looking for loci associated with environmental variation. Doing so means assuming that local adaptation is driven by consequent changes in adaptive allele frequency along environmental gradients.

(For neutral data set) No filtering on MAF was applied here because small genetic variations are expected to be involved in differentiating neutral genetic groups.

Load libraries and data files

library(terra)
library(vegan)
library(corrplot)

path<-"/home/data/rda/" #if using Amazon server
#path<-"/resources/riverlandsea/exercise_data/rda" #if using Lund server


#Climate data
ras_6190<-terra::rast(paste0(path, "ras_6190.tif"))
names(ras_6190) <- c("AHM","bFFP","CMD","DD_0","DD_18","DD18","DD5","eFFP","EMT", "Eref", "EXT", "FFP", "MAP", "MAR", "MAT", "MCMT", "MSP", "MWMT", "NFFD", "PAS", "PPT_sm", "PPT_wt", "RH", "SHM", "Tave_sm", "Tave_wt", "TD")

# Genetic data
load(paste0(path, "AllFreq.RData"))
load(paste0(path, "Neutral.RData"))

#Sampling sites
Coordinates <- read.table(paste0(path, "PlSeedlots.csv"), sep = ",", header = T, row.names = 1)

# Tidy up the coordinates data and check them
Coordinates <- Coordinates[match(row.names(AllFreq), Coordinates$id2, nomatch = 0),]
colnames(Coordinates) <- c("Population", "Latitude", "Longitude", "Elevation")
Coordinates[1:5,]

Preparing data for RDA

A series of RDAs will be the heart of the analyses. Before you can do an RDA, however, you need to have all your data prepared and organised. The genetic data are already largely prepared, but some extra work is needed for the environmental data and for estimating neutral population structure.

Extract environmental data from rasters and scale them

#Extract environmental data from the sampling site locations
Env<-extract(ras_6190, Coordinates[,3:2])

# look at your extracted data
Env[1:5,1:8]

Env <- scale(Env, center=TRUE, scale=TRUE) # center=TRUE, scale=TRUE are the defaults for scale()

row.names(Env) <- c(Coordinates$Population)

Use “neutral” SNPs to estimate population structure

To account for population structure in some of the following RDA-based procedures we conducted a principal component analysis (PCA) on the set of 3,934 intergenic SNPs and retained the first three PCs as proxy of population evolutionary history.

## Running a PCA on neutral genetic markers
pca <- rda(Neutral[,-1], scale=T) # PCA in vegan uses the rda() call without any predictors
# scale = T will take care of scaling for you

Examine the pca object (call pca and summary(pca)) - think about how many PC’s are informative. When you call pca you will see the formula, the “Inertia” and the Eigenvalues for the PC axes. Notice that all the Inertia is unconstrained. Later you will compare this to the full RDA output. Divide the Eigenvalue of PC1 but the total Inertia: this will be percent of variance explained by PC1. The “Species scores” are your column variables, which are loci in this case. The “Site scores” are row variables or populations. The values are loadings, which enable you to plot loci or populations in PC space.

If this were my data and analysis, I would plot the PCA at this point and make sure it all made sense.

biplot(pca, choices=c(1,2), scaling = "symmetric", type= c("points", "text"))

What I like to see in a pca is balanced clouds of points. If there is very little population structure, you might see one big cloud of points. If you have substantial structuring, there could be multiple clouds of points. Smears along a single axis (as we see here) are a bit worrisome especially if they do not make geographical sense. Another attribute to be aware of is that imputing allele frequencies can pull values to the middle. For the purposes of this tutorial, we will just continue, but in a real analysis I would try to understand my data at this point and make sure that there are no odd dynamics that might influence later analyses.

People often use screeplots to look at variance and decide on the number of PCs to keep.

# look at screeplot to decide importance
screeplot(pca, type = "barplot", npcs=10, main="PCA Eigenvalues")

Based on the screeplot, two or three PCs would be a reasonable set to retain as a proxy for neutral population structure in downstream analyses. In this case, we decided to keep the first three PCs.

Values from these first three PCs can now be extracted. It can be very useful to construct a dataframe containing all the spatial predictors. That’s the strategy that C & F follow:

PCs <- scores(pca, choices=c(1:3), display="sites", scaling=0)
PopStruct <- data.frame(Population = Neutral[,1], PCs)
colnames(PopStruct) <- c("Population", "PC1", "PC2", "PC3")

#check the object 
PopStruct[1:5,]

## Table gathering all variables inlcuding environment
Variables <- data.frame(Coordinates, PopStruct[,-1], Env) #original scripts include traits - we are ignoring those for now

Variables[1:5,]

Note that PC eigenvectors will already be scaled. You can test this using sum(PopStruct$PC1). There might be a small value due to rounding error but it will be very small.

Building a full RDA model for environmental variables with forward selection

Forward selection starts from a “null” model where the response is explained only by an intercept. Variables are then added to the model one by one to try to reach the amount of variance explained by a “full” model (i.e., model including all the explanatory variables), while limiting the amount of redundancy among included variables.

Whether forward selection is the best approach or not is debatable, but it is the most common procedure for dealing with multiple variables. To do this, you first build the intercept only model and then define a full model.

## Null model
RDA0 <- rda(AllFreq ~ 1,  Variables) 

## Full model - this will take a bit of time to run
RDAfull <- rda(AllFreq ~ AHM + bFFP + CMD + DD_0 + DD_18 + DD18 + DD5 + eFFP + EMT + Eref + EXT + FFP + MAP + MAR + MAT + MCMT + MSP + MWMT + NFFD + PAS + PPT_sm + PPT_wt + RH + SHM + Tave_sm + Tave_wt + TD, Variables)

Examine the structure of RDA0 by just typing RDA0 and also summary(RDA0). There is a lot of information but the structure is essentially a PCA, since we did not define any explanatory variables. As before, the “Species scores” are your column variables, which are loci in this case. The “Site scores” are row variables or populations.

Once you have looked at RDA0, take a look at RDAfull. Now you will see constrained inertia that represents the variance explained by your predictor variables. (Notice that most variance remains unconstrained, this is quite common). Remember that the predictor variables have now been ordinated as RDA eigenvectors where each RDA axis is orthogonal to the others. The eigenvalues represent how much variance in response variables (allele frequencies) is predicted by that eigenvector, e.g., RDA1 predicts 29.4/578 percent of the variance in allele frequencies.

The last table in summary(RDAfull) shows how the different environmental variables load onto each RDA axis.

To make a quick plot of your RDA, now use ordiplot. This will make a biplot that combines your populations (points) and predictor variables (arrows).

ordiplot(RDAfull)

The vegan package is very well documented and it is worth spending time learning some of the options if you are going to be using ordination in your toolbox of analyses.

To conduct the selection procedure we used the ordiR2step function of the package vegan and the following stopping criteria: variable significance of p < 0.01 using 1000 permutations, and the adjusted R2 of the global model.

Because this command will take a long time to run, I suggest that you start it and watch what it is doing, but I have saved the output mod for you to import, so you can stop the process to speed things up. In

Interrupt the processing for the sake of time and load the output of the full ordiR2step output after you see the initial round of results:

## Stepwise procedure with ordiR2step function
mod <- ordiR2step(RDA0, RDAfull, Pin = 0.01, R2permutations = 1000, R2scope = T) #this will take time to run - about 30 min on my laptop

# load the output
load(paste0(path, "mod.Rdata"))
# if you can't load this file, don't worry about it. Read below and move on.

Here is what my screen output looks like when running ordiR2step. It starts by evaluating which of the predictors you supplied explains the greatest amount of variance (MAR) in this case. Once it adds one predictor, it will assess which of the remaining variables should be added (EMT, in this case). It evaluates whether the model fit improves substantially with each variable using the adjusted R2. See ?ordiR2step for more information.

Step: R2.adj= 0 
Call: AllFreq ~ 1 
 
                R2.adjusted
<All variables> 0.081012244
+ MAR           0.022167460
+ EMT           0.019112369
+ PPT_wt        0.016854043
+ Tave_wt       0.016052084
+ RH            0.015642759
+ MCMT          0.015420933
+ eFFP          0.015105975
+ DD_0          0.014822828
+ NFFD          0.014717984
+ PPT_sm        0.014492755
+ TD            0.014274591
+ MAP           0.013217792
+ DD_18         0.012420377
+ MAT           0.012363359
+ MSP           0.010931973
+ FFP           0.010056901
+ CMD           0.009675979
+ SHM           0.008713027
+ Eref          0.007020628
+ bFFP          0.006082542
+ AHM           0.006006697
+ EXT           0.005173569
+ DD5           0.004382995
+ DD18          0.004240597
+ MWMT          0.003519758
+ Tave_sm       0.003343331
+ PAS           0.002501768
<none>          0.000000000

      Df    AIC      F Pr(>F)   
+ MAR  1 1782.8 7.3476  0.002 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step: R2.adj= 0.02216746 
Call: AllFreq ~ MAR 
 
                R2.adjusted
<All variables>  0.08101224
+ EMT            0.04451733
+ TD             0.04380193
(and so on)

As before, examine the mod object in detail. How much variance does RDA1 explain? Is RDA1 dominated by a single environmental variable or spread among variables?

mod$anova will give you most of the output shown in Table 1 from C & F. You would just need to do some subtraction to get the R2 for each term - for example, EMT = 0.044517-0.022167 = 0.02235.

 > mod$anova
 
                 R2.adj Df    AIC      F Pr(>F)   
+ MAR           0.022167  1 1782.8 7.3476  0.002 **
+ EMT           0.044517  1 1777.2 7.5261  0.002 **
+ MWMT          0.054435  1 1775.3 3.9158  0.002 **
+ CMD           0.063109  1 1773.7 3.5645  0.002 **
+ Tave_wt       0.067203  1 1773.4 2.2115  0.002 **
+ DD_18         0.070171  1 1773.5 1.8775  0.002 **
+ MAP           0.071918  1 1774.0 1.5159  0.002 **
+ Eref          0.073516  1 1774.5 1.4708  0.002 **
+ PAS           0.075318  1 1774.9 1.5302  0.002 **
<All variables> 0.081012                           
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(And you don’t need to report all those decimal places, IMO)

From C & F:

In total, nine of the 27 bioclimate variables were selected: MAR, EMT, MWMT, CMD, Tave_wt, DD_18, MAP, Eref and PAS.

Notes on interpretation and best practices: We remind users that this predictive approach to variable selection optimizes the variance explained, but does not necessarily identify the ecological or mechanistic drivers of genetic variation. Additionally, pairwise predictor correlations can be very high, e.g., among seasonal calculations of temperature or precipitation. While one variable may maximize variance explained, it may be another, correlated variable, potentially even unmeasured, that is the mechanistic driver of variation. The ubiquitous nature of environmental correlation means that it is critical to carefully investigate selected variables but also avoid overinterpretation of variable importance in downstream analyses unless mechanistic data support observed relationships.

Variance partitioning of the RDA

Variance partitioning with partial RDA (pRDA) can identify the contribution of different factors to reducing gene flow and triggering genetic divergence among populations. We apply pRDA-based variance partitioning to the lodgepole pine data to decompose the contribution of climate, neutral population structure, and geography in explaining genetic variation. We used three sets of variables: 1) the nine selected bioclimate variables (‘clim’); 2) three proxies of neutral genetic structure (population scores along the first three axes of a genetic PCA conducted on the 3,934 neutral loci; ‘struct’); and 3) population coordinates (longitude and latitude) to characterize geographic variation (‘geog’).

Full model

Build the full model with population structure, geography, and environmental variables

## Full model
pRDAfull <- rda(AllFreq ~ PC1 + PC2 + PC3 + Longitude + Latitude + MAR + EMT + MWMT + CMD + Tave_wt + DD_18 + MAP + Eref + PAS,  Variables)

RsquareAdj(pRDAfull)

#anova(pRDAfull) this can take a long time to run - I have saved an output below... it is giving you the significance of the whole model using permutation

> anova(pRDAfull)

Permutation test for rda under reduced model
Permutation: free
Number of permutations: 999

Model: rda(formula = AllFreq ~ PC1 + PC2 + PC3 + Longitude + Latitude + MAR + EMT + MWMT + CMD + Tave_wt + DD_18 + MAP + Eref + PAS, data = Variables)
          Df Variance      F Pr(>F)    
Model     14    84.62 3.2586  0.001 ***
Residual 266   493.41                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

These values and the subsequent models correspond to Table 2. (There might be some minor discrepancies due to data preparation and permutation based tests.) For some reason, in Table 2, R2 rather than adjusted R2 is reported. You should use adjusted R2: it is penalized by the number of predictor variables (somewhat analogous to AIC).

Climate model

The climate model is the first of a series of conditioned models. This structure estimates the variance of constrained variables, given conditioning on series of other variables. In this instance, the model is exploring the effect of environmental predictors conditioned upon geography and neutral population structure.

Conditioned models are also called partial RDAs. The conditioned variables show up in the regression model following the “|”: Climate model is $F \sim clim | geog + struct)$

## Pure climate model
pRDAclim <- rda(AllFreq ~ MAR + EMT + MWMT + CMD + Tave_wt + DD_18 + MAP + Eref + PAS + Condition(Longitude + Latitude + PC1 + PC2 + PC3),  Variables)

RsquareAdj(pRDAclim)

# can skip to save time 
#anova(pRDAclim)

Take a look at pRDAclim and you will see how Inertia is partitioned among various categories. Information from this model is feeding into the second row of Table 2.

Conditional = the amount of variance explained by conditional variables and removed
Constrained = amount of variance uniquely explained by explanatory variables
Unconstrained = rest of the variance that is unexplained

Population structure model

Similarly a partial RDA can be undertaken for population structure…

## Pure neutral population structure model  
pRDAstruct <- rda(AllFreq ~ PC1 + PC2 + PC3 + Condition(Longitude + Latitude + MAR + EMT + MWMT + CMD + Tave_wt + DD_18 + MAP + Eref + PAS),  Variables)

RsquareAdj(pRDAstruct)

#anova(pRDAstruct)

Geography

And another pRDA for geography

##Pure geography model 
pRDAgeog <- rda(AllFreq ~ Longitude + Latitude + Condition(MAR + EMT + MWMT + CMD + Tave_wt + DD_18 + MAP + Eref + PAS + PC1 + PC2 + PC3),  Variables)

RsquareAdj(pRDAgeog)

#anova(pRDAgeog)

To replicate Table 2 in the manuscript, we extract the following from the above results:

Total inertia (aka variance)

Constrained inertia

Proportion of variance explained by constraints

Model R^2^

Model p-value

For the “confounded” row, this should be the full model minus the sum of the partial models, for example, the proportion of explainable variance is 1 - (0.27+0.22+0.05). You can dig into this further with the function varpart().

Note: It is interesting to look at the degree of correlation among variables using a correlogram:

#explore other options
corrplot(cor(Variables[, c("PC1","PC2","PC3","Longitude","Latitude","MAR","EMT","MWMT","CMD","Tave_wt","DD_18","MAP","Eref","PAS")]), type="upper")

Notes on interpretation and best practices: In this case, the largest proportion of genetic variance could not be uniquely attributed to any of the three sets of predictors, a common occurrence given the ubiquitous nature of spatial autocorrelation in environmental and genetic data sets. This confounded effect reflects a high degree of collinearity among explanatory variables. This is critical information given that most landscape genomic studies look for correlation between climatic and genetic variation (i.e., GEA) and either assume no collinearity or, on the contrary, totally remove this commonly explained variation. In the first case, GEA detections could potentially be subject to high false positive rates, while in the latter case detections might show high false negative rates. Selecting an appropriate approach to account for demographic history and geographic distance is of major importance when searching for selection in the genome. Variance partitioning can be a useful step to explore the (statistical) association among available descriptors, to better understand the covariation of environmental and genetic gradients, and to determine how much overall genetic variation is shaped by environmental, geographic, and demographic factors before conducting further landscape genomics study.

Another way to look for correlations is to use variance inflation factors, where the square root of the value > 2 indicates multicollinearity.

sqrt(vif.cca(pRDAfull))

(Not really a good sign!)

Points for class discussion

What are your thoughts on the filtering and preprocessing of genetic data? Could there be unintentional biases?
RDAs can be undertaken on either population allele frequencies or individual genotypes. When would it be strategic to pick one over the other? Are there any special considerations with either data type?
Notice that this tutorial does not look for correlations among environmental variables. What arguments can you make for or against reducing your environmental variables using approaches like variance inflation factors? Would you look at correlations across your whole landscape or across your study sites only?
What do you think about exploring environmental variables first and then building a model with population structure, geopositioning, and environmental variables?

RDA as a flexible tool

GEAs: Genotype-by-environment associations

Why RDA (and other multivariate methods)?

Ordination

We focus on RDA as a multivariate method because

RDA can address the following questions (citing Capblancq et al. 2021):

A schematic for how RDA works:

Review of unconstrained ordination (PCA & PCoA)

PCA - principal components analysis

PCoA or PCO: Principal coordinates analysis

Simple redundancy analysis (constrained ordination)

Example from Capblancq & Forester 2021 - lodgepole pine

Goals

Data preparation for RDA

Variable selection with forward model building

partial RDA

Complex models and variance partitioning

Extra topic - dealing with spatial correlation structures

Further resources

Notes developed from:

Computer tutorial - undertaking RDA

Tutorial based on “RDA applications in landscape genomics, by Thibaut Capblancq & Brenna Forester, 2021

Introduction

Datafiles

Climatic variables

Population allele frequencies

Load libraries and data files

Preparing data for RDA

Extract environmental data from rasters and scale them

Use “neutral” SNPs to estimate population structure

Building a full RDA model for environmental variables with forward selection

Variance partitioning of the RDA

Full model

Climate model

Population structure model

Geography

Other important topics (not from C&F tutorial)

The amazing flexibility of RDA

Populations or individuals

Using link based predictors

Moran Eigenvector Maps

Points for class discussion