Genetic differentiation

Describing genetic differentiation and genetic structuring

library(tidyverse)

Investigating population structure with PCA

The first thing we will do is investigate population structure using principal components analysis. Examining population structure can give us a great deal of insight into the history and origin of populations. Model-free methods for examining population structure and ancestry, such as principal components analysis are extremely popular in population genomic research. This is because it is typically simple to apply and relatively easy to interpret.

Essentially, PCA aims to identify the main axes of variation in a dataset with each axis being independent of the next (i.e. there should be no correlation between them). The first component summarizes the major axis variation and the second the next largest and so on, until cumulatively all the available variation is explained. In the context of genetic data, PCA summarizes the major axes of variation in allele frequencies and then produces the coordinates of individuals along these axes.

To perform a PCA on our cichlid data, we will use plink - specifically version 1.9 (although be aware older and newer versions are available). Note that plink was originally written with human data in mind and has also subsequently been extended to include some model species. As a result, we need to provide a bit of extra info to get it to work on our dataset.

Linkage pruning

One of the major assumptions of PCA is that the data we use is independent - i.e. there are no spurious correlations among the measured variables. This is obviously not the case for most genomic data as allele frequencies are correlated due to physical linkage and linkage disequilibrium. So as a first step, we need to prune our dataset of variants that are in linkage.

First things first, we will make a directory called pca

# move to your home directory
cd ~
# make a pca directory
mkdir pca
# move into it
cd pca

Next we need to get our data, which we can do like so:

cp /resources/riverlandsea/exercise_data/pca/Pundamilia_subset.vcf.gz .

We will also simplify our code using some environmental variables. Primarily we set one for our filtered VCF.

VCF=~/pca/Pundamilia_subset.vcf.gz

This will make it very easy for plink to read in our data. Next we run the linkage pruning. Run the command and we will breakdown what all the arguments mean.

# perform linkage pruning - i.e. identify prune sites
plink --vcf $VCF --double-id --allow-extra-chr \
--set-missing-var-ids @:# \
--indep-pairwise 50 10 0.1 --out cichlids

So for our plink command, we did the following:

--vcf - specified the location of our VCF file.
--double-id - told plink to duplicate the id of our samples (this is because plink typically expects a family and individual id - i.e. for pedigree data - this is not necessary for us.
--allow-extra-chr - allow additional chromosomes beyond the human chromosome set. This is necessary as otherwise plink expects chromosomes 1-22 and the human X chromosome.
--set-missing-var-ids - also necessary to set a variant ID for our SNPs. Human and model organisms often have annotated SNP names and so plink will look for these. We do not have them so instead we set ours to default to chromosome:position which can be achieved in plink by setting the option @:# - see here for more info.
--indep-pairwise - finally we are actually on the command that performs our linkage pruning! The first argument, 50 denotes we have set a window of 50 Kb. The second argument, 10 is our window step size - meaning we move 10 bp each time we calculate linkage. Finally, we set an r² threshold - i.e. the threshold of linkage we are willing to tolerate. Here we prune any variables that show an r² of greater than 0.1.
--out Produce the prefix for the output data.

As well as being versatile, plink is very fast. It will quickly produce a linkage analysis for all our data and write plenty of information to the screen. When complete, it will write out two files cichlids.prune.in and cichlids.prune.out. The first of these is a list of sites which fell below our linkage threshold - i.e. those we should retain. The other file is the opposite of this. In the next step, we will produce a PCA from these linkage-pruned sites.

Perform a PCA

Next we rerun plink with a few additional arguments to get it to conduct a PCA. We will run the command and then break it down as it is running.

# prune and create pca
plink --vcf $VCF --double-id --allow-extra-chr --set-missing-var-ids @:# \
--extract cichlids.prune.in \
--make-bed --pca --out cichlids

This is very similar to our previous command. What did we do here?

--extract - this just lets plink know we want to extract only these positions from our VCF - in other words, the analysis will only be conducted on these.
--make-bed - this is necessary to write out some additional files for another type of population structure analysis - a model based approach with admixture.
--pca - fairly self explanatory, this tells plink to calculate a principal components analysis.

Once the command is run, we will see a series of new files. We will break these down too:

PCA output:

cichlids.eigenval - the eigenvalues from our analysis
cichlids.eigenvec- the eigenvectors from our analysis

plink binary output

cichlids.bed - the cichlids bed file - this is a binary file necessary for admixture analysis. It is essentially the genotypes of the pruned dataset recoded as 1s and 0s.
cichlids.bim - a map file (i.e. information file) of the variants contained in the bed file.
cichlids.fam - a map file for the individuals contained in the bed file.

Plotting the PCA output

Next we turn to R to plot the analysis we have produced!

Setting up the R environment

First load the tidyverse package and ensure you have moved the plink output into the working directory you are operating in. You may want to set up an RStudio Project to manage this analysis. See here for a guide on how to do this.

# load tidyverse package
library(tidyverse)

Then we will use a combination of readr and the standard scan function to read in the data. NB - you will need to edit the path to the files if you are writing your own R scripts.

pca <- read_table("./data/cichlids.eigenvec", col_names = FALSE)
eigenval <- scan("./data/cichlids.eigenval")

Cleaning up the data

Unfortunately, we need to do a bit of legwork to get our data into reasonable shape. First we will remove a nuisance column (plink outputs the individual ID twice). We will also give our pca data.frame proper column names.

# sort out the pca data
# remove nuisance column
pca <- pca[,-1]
# set names
names(pca)[1] <- "ind"
names(pca)[2:ncol(pca)] <- paste0("PC", 1:(ncol(pca)-1))

Next we can add a species, location and if required, a species x location vector. We will do this using the R version of grep. We then use paste0 to combine the columns.

# sort out the individual species and pops
# spp
spp <- rep(NA, length(pca$ind))
spp[grep("PunPund", pca$ind)] <- "pundamilia"
spp[grep("PunNyer", pca$ind)] <- "nyererei"
# location
loc <- rep(NA, length(pca$ind))
loc[grep("Mak", pca$ind)] <- "makobe"
loc[grep("Pyt", pca$ind)] <- "python"
# combine - if you want to plot each in different colours
spp_loc <- paste0(spp, "_", loc)

With these variables created, we can remake our data.frame like so. Note the use of as.tibble to ensure that we make a tibble for easy summaries etc.

# remake data.frame
pca <- as_tibble(data.frame(pca, spp, loc, spp_loc))

Plotting the data

Now that we have done our housekeeping, we have everything in place to actually visualise the data properly. First we will plot the eigenvalues. It is quite straightforward to translate these into percentage variance explained (although note, you could just plot these raw if you wished).

# first convert to percentage variance explained
pve <- data.frame(PC = 1:20, pve = eigenval/sum(eigenval)*100)

With that done, it is very simple to create a bar plot showing the percentage of variance each principal component explains.

# make plot
a <- ggplot(pve, aes(PC, pve)) + geom_bar(stat = "identity")
a + ylab("Percentage variance explained") + theme_light()

Cumulatively, they explain 100% of the variance but PC1, PC2 and possible PC3 together explain about 30% of the variance. We could calculate this with the cumsum function, like so:

# calculate the cumulative sum of the percentage variance explained
cumsum(pve$pve)

Next we move on to actually plotting our PCA. Given the work we did earlier to get our data into shape, this doesn’t take much effort at all.

# plot pca
b <- ggplot(pca, aes(PC1, PC2, col = spp, shape = loc)) + geom_point(size = 3)
b <- b + scale_colour_manual(values = c("red", "blue"))
b <- b + coord_equal() + theme_light()
b + xlab(paste0("PC1 (", signif(pve$pve[1], 3), "%)")) + ylab(paste0("PC2 (", signif(pve$pve[2], 3), "%)"))

Note that this R code block also includes arguments to display the percentage of variance explained on each axis. Here we only plot PC1 and PC2. From this figure, we can see PC1 separates out the geographical locations and PC2 separates out the species.

Admixture

ADMIXTURE is a clustering software similar to STRUCTURE with the aim of inferring populations and individual ancestries. It was developed by David Alexander, John Novembre and Kenneth Lange. You can find the manual here.

Generating the input file

ADMIXTURE requires unlinked (i.e. LD-pruned) SNPs in the format that plink writes out. As we saw in the last practical, this is very easy to produce from a VCF. However when we were producing a PCA, we used a vcf with only whole-genome resequenced individuals (i.e. a much smaller sample size). So for this practical we will use a RAD dataset from the same Pundamilia species which includes more than 4 individuals per population and some putative hybrid individuals.

Linked sites, monomorphic or multiallelic sites, or sites with more than 25% missing data have already been filtered out. Also sites with MAF smaller than 0.05 or Phred quality lower than 30 were removed. So there is no need to redo the filtering or linkage-pruning we learned about earlier. We proceed to uisng plink to generate the .bed file which can be read by ADMIXTURE (nb it will also produce other files we do not need).

First we will make a directory specific for this analysis in our home directory and move into it:

# move to your home directory
cd ~
# make a pca directory
mkdir admixture
# move into it
cd admixture

Next we need to get our data, which we can do like so:

cp /resources/riverlandsea/exercise_data/admixture/Pundamilia.RAD.vcf.gz .

As before, we will simplify our code using some environmental variables. Primarily we set one for our filtered VCF.

VCF=~/admixture/Pundamilia.RAD.vcf.gz

We can also create one to make it easier to create an output file from plink and to perform our downstream tweaks to the file to get it to work with ADMIXTURE

FILE=Pundamilia.RAD

Now we are ready to use plink to generate the input file that we will use in our analysis.

# Generate the input file in plink format
plink --vcf $VCF --make-bed --out $FILE --allow-extra-chr

Once we’ve run this, use ls to have a look at the files written out. An important note. The PLINK bed file is a binary biallelic genotype table (not to be confused with UCSC bed files).

Like with many tools, we have to do a bit of coding in order to get a file in a suitable shape for a downstream analysis. ADMIXTURE does not accept chromosome names that are not human chromosomes. We therefore need to change the first column with 0. We can do this relatively easily with awk. We make a temporary file and when we’re sure it is how we want it to look, we can move it to become our new input file.

awk '{$1="0";print $0}' $FILE.bim > $FILE.bim.tmp
mv $FILE.bim.tmp $FILE.bim

Running `ADMIXTURE`

Now, we are ready to run ADMIXTURE. An important step is to run the program with cross-validation. This allows us to determine the error in our estimates of K, providing insight into what the ‘true’ value of K might be. We will first run the program to test a K of 2.

admixture --cv $FILE.bed 2 > log2.out

Note that here we just the default CV option, which is for 5 times cross validation. IF you want to increase this, you can set an option like so cv=10.

Looking at our directory, we can see that ADMIXTURE produced 2 files: .Q which contains cluster assignments for each individual and .P which contains for each SNP the population allele frequencies.

Let’s now run it in a for loop with K=3 to K=5 and direct the output into log files.

for i in {3..5}
do
 admixture --cv $FILE.bed $i > log${i}.out
done

To identify the best value of K clusters which is the value with lowest cross-validation error, we need to collect the cv error output into a single files. Combining awk and cut, we can easily extract them like so:


grep "CV" *out | awk '{print $3,$4}' | cut -c 4,7-20 > $FILE.cv.error

To make plotting easier, we can also make a file with the individual names in one column and the species names in the second column. As the species name is in the individual name, it is very straightforward to extract the species name from the individual name using awk:

awk '{split($1,name,"."); print $1,name[2]}' $FILE.nosex > $FILE.list

Now we’ve performed our analyses, we are ready to visualise the results.

Visualising results using R

There are many different ways to plot ADMIXTURE output and other tutorials go into this in more detail (e.g., this is nice and clear). There are also specific programs i.e.pong. However, to make things easy for today’s practical, we will use an R written by Joana Meier.

This script can be used via the command line and takes four arguments: 1. the prefix for the ADMIXTURE output files (-p ) 2. the file with the species information (-i <file.list>) 3. the maximum number of K to be plotted (-k 5) 4. a list with the populations or species separated by commas (-l <pop1,pop2…>)

This last option, i.e. the list of populations provided with -l gives the order in which the populations or species will be plotted. The script is available here but for today, we can also get it from the data directory:

cp /resources/riverlandsea/exercise_data/admixture/plotADMIXTURE.r .

Now, we can run it like below. Remember our $FILE variable makes it much easier to run without inputting the full filenames.

Rscript plotADMIXTURE.r -p $FILE -i $FILE.list -k 5 -l PunNyerMak,PunPundMak,PunNyerPyt,PunHybrPyt,PunPundPyt

By default, the script generates a tiff file that uses the same prefix as the one provided with -p. In our case $FILE.tiff. This can be changed with -o

if necessary, but for now we will keep it as it is.

To examine the file, you need to download it to your local machine using either scp (Mac OS, Linux) or an scp client such as filezilla.

Some points to keep in mind

Remember that ADMIXTURE/STRUCTURE plots can be quite misleading as different demographic histories can lead to the same results (see. e.g. Lawson et al, 2018) These plots are great to detect recent hybrids but they are not ideal to infer complex demographic histories with hybrid origins of entire populations

To evaluate if the ADMIXTURE plot is a good fit, you can use evaladmix and we highly recommend to also use other methods to help infer the demographic history and evidence of hybridisation such as independent tests of gene flow or demographic analyses.