Self training

PCA using 1000 Genome Project Dataset

In this self-learning module, we would like you to put your hands on the 1000 Genome Project data and apply the skills you have learned to this mini-project.

Aim

Aim:

Download 1000 Genome VCF files.
Perform PCA using 1000 Genome samples.
Plot the PCs of these individuals.
Interpret the results.

Here is a brief overview of this mini project.

The ultimate goal of this assignment is simple, which is to help you get familiar with the skills and the most commonly used datasets in complex trait genomics.

Tip

Please pay attention to the details of each step. Understanding why and how we do certain steps is much more important than running the sample code itself.

1. Download the publicly available 1000 Genome VCF

Download the files we need from 1000 Genomes Project FTP site:

Autosome VCF files
Ancestry information file
Reference genome sequence
Strict mask

Tip

Autosome VCF: chr1 - chr22 files: in this directory.
Ancestry information file: download here
Reference genome sequence: in this directory.
Strict mask: download here

Note

If it takes too long or if you are using your local laptop, you can just download the files for chr1.

Sample shell script for downloading the files

#!/bin/bash
for chr in $(seq 1 22)  #Note: If it takes too long, you can download just chr1.
do
wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
done

wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai

wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
wget https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed

2. Re-align, normalize and remove duplication

We need to use bcftools to process the raw vcf files.

Install bcftools

http://www.htslib.org/download/

Since the variants are not normalized and also have many duplications, we need to clean the vcf files.

Normalize: https://samtools.github.io/bcftools/bcftools.html#norm
Remove duplication: https://samtools.github.io/bcftools/bcftools.html#common_options

Re-align with the reference genome, normalize variants and remove duplications

#!/bin/bash
for chr in $(seq 1 22)
do
    bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
      ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \
      bcftools annotate -I +'%CHROM:%POS:%REF:%ALT' | \
        bcftools norm -Ob --rm-dup both \
          > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf 
    bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf
done

3. Convert VCF files to plink binary format

Example

#!/bin/bash
for chr in $(seq 1 22)
do
plink \
      --bcf ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.bcf \
      --keep-allele-order \
      --vcf-idspace-to _ \
      --const-fid \
      --allow-extra-chr 0 \
      --split-x b37 no-fail \
      --make-bed \
      --out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes
done

4. Using SNPs only in strict masks

Strict masks are in this directory.

Strict mask

The overlapped region with this mask is “callable” (or credible variant calls). This mask was developed in the 1KG main paper and it is well explained in https://www.biostars.org/p/219634/

Tip

Use plink --make-set option with the BED files to extract SNPs in the strict mask.

5. QC it and prune it to ~ 100K variants.

Tip

Use PLINK.

QC: only SNPs (exclude indels), MAF>0.1

Pruning: plink --indep-pariwise

6. Perform PCA

Tip

plink --pca

7. Visualization and interpretation.

Draw PC1 - PC2 plot and color each individual by ancestry information (from ALL.panel file). Interpret the result.

Tip

You can use R, python, or any other tools you like (even Excel can do the job.)

(If you are having trouble performing any of the steps, you can also refer to: https://www.biostars.org/p/335605/.)

Checklist

What does variant normalization mean and What are the two principles for variant normalization?
For chromosome 1, what is the proportion of commom variants(MAF >5%) / low-frequency-variants(1<=MAF <5%) / and rare variants(MAF <1%) ? If possible, please draw a figure showing the distribution of MAF. (plink --freq)
What pattern did you observe from the 1KG PCA plot?

Reference

https://www.biostars.org/p/335605/
1000 Genomes Project Consortium. "A global reference for human genetic variation." Nature 526.7571 (2015): 68.