Sample Dataset

504 East Asian (EAS) individuals from the 1000 Genomes Project Phase 3 version 5

CHB: Han Chinese in Beijing, China
JPT: Japanese in Tokyo, Japan
CHS: Southern Han Chinese
CDX: Chinese Dai in Xishuangbanna, China
KHV: Kinh in Ho Chi Minh City, Vietnam

URL: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

Genome build: human_g1k_v37.fasta (hg19)

Genotype Data Processing

Selected only autosomal variants
Split multi-allelic variants
Normalized variants
Removed duplicated variants
Selected only SNPs (A, T, C, G)
Selected 2% of rare SNPs (plink --mac 2 --max-maf 0.01 --thin 0.02)
Selected 15% of common SNPs (plink --maf 0.01 --thin 0.15)
Converted to PLINK bed format and merged into a single file
Randomly added some missing data points
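
The rare/common selection above can be illustrated with a small sketch. This is not how the dataset was actually produced (that was done with plink). It only mimics the filter logic in plain Python on made-up variants. The function names, the toy variant list, and the seed are hypothetical. It assumes --thin keeps each variant independently with the given probability, and that --maf / --max-maf use inclusive thresholds.

```python
import random


def thin(variants, keep_prob, rng):
    """Keep each variant independently with probability keep_prob (like --thin)."""
    return [v for v in variants if rng.random() < keep_prob]


def select_sample_snps(variants, n_alleles, seed=0):
    """variants: list of (snp_id, minor_allele_count) tuples (toy input)."""
    rng = random.Random(seed)
    # Rare set: minor allele count >= 2 and MAF <= 0.01 (--mac 2 --max-maf 0.01)
    rare = [(s, mac) for s, mac in variants
            if mac >= 2 and mac / n_alleles <= 0.01]
    # Common set: MAF >= 0.01 (--maf 0.01)
    common = [(s, mac) for s, mac in variants if mac / n_alleles >= 0.01]
    # Thin the two sets separately, then they would be merged into one file.
    return thin(rare, 0.02, rng), thin(common, 0.15, rng)


# Toy example: 504 diploid individuals carry 1008 alleles per variant.
toy_variants = [("snp%d" % i, i % 40) for i in range(2000)]
rare_kept, common_kept = select_sample_snps(toy_variants, n_alleles=1008, seed=1)
print(len(rare_kept), "rare and", len(common_kept), "common SNPs kept")
```

With 1008 alleles, a MAF of 0.01 corresponds to a count of 10.08, so in this sketch no variant falls into both sets.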

Download

Note

The sample dataset 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip is included in 01_Dataset when you clone the repository, so there is no need to download it again.

You can also simply run download_sampledata.sh in 01_Dataset and the dataset will be downloaded and decompressed.

./download_sampledata.sh

The sample dataset is currently hosted on Dropbox, which may not be accessible for users in certain regions.

Alternatively, you can manually download it from this link.

Unzip the dataset with unzip -j 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip, and you will get the following files:

1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed
1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim
1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam
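
As a quick sanity check after unzipping, the three files can be compared for consistency. The sketch below is only an illustration. The helper names are hypothetical, and it assumes the standard PLINK 1 binary layout: a 3-byte header followed by one block of ceil(n_samples / 4) bytes per variant.

```python
import os

# Header of a variant-major PLINK 1 .bed file
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


def count_lines(path):
    """Count non-empty lines (one per sample in .fam, one per variant in .bim)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_plink_fileset(prefix):
    """Return (n_samples, n_variants) if .bed/.bim/.fam are mutually consistent."""
    n_samples = count_lines(prefix + ".fam")
    n_variants = count_lines(prefix + ".bim")
    with open(prefix + ".bed", "rb") as handle:
        magic = handle.read(3)
    if magic != BED_MAGIC:
        raise ValueError("not a variant-major PLINK 1 .bed file")
    # Each variant stores 2 bits per sample, padded to whole bytes.
    expected = 3 + n_variants * ((n_samples + 3) // 4)
    actual = os.path.getsize(prefix + ".bed")
    if actual != expected:
        raise ValueError("expected %d bytes in .bed, found %d" % (expected, actual))
    return n_samples, n_variants


prefix = "1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing"
if os.path.exists(prefix + ".bed"):
    print(check_plink_fileset(prefix))
```

For this dataset the first number should be 504, matching the number of EAS individuals.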

Phenotype Simulation

Binary phenotypes were simulated using GCTA with the 1KG EAS dataset.

gcta  \
  --bfile 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015 \
  --simu-cc 250 254  \
  --simu-causal-loci causal.snplist  \
  --simu-hsq 0.8  \
  --simu-k 0.5  \
  --simu-rep 1  \
  --out 1kgeas_binary

$ cat causal.snplist
2:55620927:G:A 3
8:97094292:C:T 3
20:42758834:T:C 3
7:134326056:G:T 3
1:167562605:G:A 3
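
To give an intuition for what --simu-hsq and --simu-k control, here is a minimal sketch of a liability threshold simulation. It is not GCTA's implementation. The genotypes and allele frequencies are made up, the function name is hypothetical, and it assumes that the second column of causal.snplist acts as a per-SNP effect on the standardized genotype and that cases are the individuals in the top K fraction of liability.

```python
import random
import statistics


def simulate_case_control(genotypes, freqs, effects, hsq, prevalence, seed=0):
    """genotypes: one list of allele counts (0/1/2) at the causal SNPs per individual."""
    rng = random.Random(seed)
    # Genetic value: sum of effect * standardized genotype,
    # with w = (x - 2p) / sqrt(2p(1 - p)).
    genetic = [
        sum(b * (x - 2 * p) / (2 * p * (1 - p)) ** 0.5
            for x, p, b in zip(row, freqs, effects))
        for row in genotypes
    ]
    var_g = statistics.pvariance(genetic)
    # Residual SD chosen so that var_g / (var_g + var_e) equals hsq.
    sd_e = (var_g * (1 / hsq - 1)) ** 0.5
    liability = [g + rng.gauss(0, sd_e) for g in genetic]
    # Individuals in the top `prevalence` fraction of liability become cases.
    n_cases = round(len(liability) * prevalence)
    cutoff = sorted(liability, reverse=True)[n_cases - 1]
    return [1 if y >= cutoff else 0 for y in liability]


# Toy data: 504 individuals, 5 causal SNPs with hypothetical allele frequencies.
rng = random.Random(1)
freqs = [0.30, 0.10, 0.25, 0.40, 0.15]
effects = [3, 3, 3, 3, 3]  # same effect sizes as in causal.snplist
genotypes = [[sum(rng.random() < p for _ in range(2)) for p in freqs]
             for _ in range(504)]
status = simulate_case_control(genotypes, freqs, effects, hsq=0.8, prevalence=0.5)
print(sum(status), "cases,", len(status) - sum(status), "controls")
```

In this sketch a higher hsq shrinks the residual noise, so case status tracks the causal genotypes more closely.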

Warning

This simulation is only used to demonstrate the analysis pipeline and data formats. The trait was simulated under unrealistic conditions (the effect sizes are extremely large), so the results themselves are meaningless.

Allele frequency and Effect size

Reference

1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526(7571), 68.
Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82.