Sample Dataset
504 EAS individuals from 1000 Genomes Project Phase 3 version 5
- CHB: Han Chinese in Beijing, China
- JPT: Japanese in Tokyo, Japan
- CHS: Southern Han Chinese
- CDX: Chinese Dai in Xishuanagbanna, China
- KHV: Kinh in Ho Chi Minh City, Vietnam
Url: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
Genome build: human_g1k_v37.fasta (hg19)
Genotype Data Processing
- Selected only autosomal variants
- Split multi-allelic variants
- Variants were normalized
- Remove duplicated variants
- Selected only SNP (ATCG)
- Selected 2% rare SNPs (
plink --mac 2 --max--maf 0.01 --thin 0.02
) - Selected 15% common SNPs (
plink --maf 0.01 --thin 0.15
) - Converted to plink bed format and merged to a single file
- Randomly added some missing data points
Download
Note
The sample dataset 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip
has been included in 01_Dataset
when you clone the repository. There is no need to download it again if you clone this repository.
You can also simply run download_sampledata.sh
in 01_Dataset
and the dataset will be downloaded and decompressed.
./download_sampledata.sh
Sample dataset is currently hosted on Dropbox which may not be accessible for users in certain regions.
or you can manually download it from this link.
Unzip the dataset unzip -j 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.zip
, and you will get the following files:
1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bed
1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim
1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.fam
Phenotype Simulation
Phenotypes were simply simulated using GCTA with the 1KG EAS dataset.
gcta \
--bfile 1KG.EAS.auto.snp.norm.nodup.split.rare002.common015 \
--simu-cc 250 254 \
--simu-causal-loci causal.snplist \
--simu-hsq 0.8 \
--simu-k 0.5 \
--simu-rep 1 \
--out 1kgeas_binary
$ cat causal.snplist
2:55620927:G:A 3
8:97094292:C:T 3
20:42758834:T:C 3
7:134326056:G:T 3
1:167562605:G:A 3
Warning
This simulation is just used for showing the analysis pipeline and data format. The trait was simulated under an unreal condition (effect sizes are extremely large) so the result itself is meaningless.
Allele frequency and Effect size
Reference
- 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526(7571), 68.
- Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1), 76-82.