Simulation of genotypes and phenotypes for MR-link¶

This document describes how genotypes, and phenotypes for causal relationships where simulated in the MR-link manuscript.

We will simulate data in two steps:

Simulation of genotypes from non finnish european individuals. (fairly long)
Simulation of phenotypes that can be used for causal relationships. (shorter)

After running these steps, a user can run MR-link to identify causal relationships.

If you already have two genotype files from which you want to simulate phenotypes in two separate cohort please go to step 2.

Requirements for simulations of genotypes¶

For the simulation of genotypes, we require the following:

30 Gb of empty disk space and about 20Gb of RAM memory
bash shell with awk and wget
internet access to download the 1000g chromosome 2 (1.2Gb),
The HAPGEN2 program
Python version >= 3.6.
The downloaded genome integration package on your filesystem.
plink version 1.9

Requirements for the simulation of phenotypes¶

The genome_integration package should be installed. 2 genotype files in plink bed format, with the same set of variants.

Step 1. Simulation of genotypes from 1000 genomes.¶

Before we start. Please go to the directory relative to where genome_integration was downloaded / cloned ./mr_link/simulation_of_genotypes/.

In this folder, the genotypes for our region of interest will be simulated.

Download chromosome 2 from the 1000 genomes FTP website

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Isolate a region 100 - 105 Mb from chromosome 2 from the file that was just downloaded which were downloaded and the header and region was retained using an awk script.

awk '{ if( (substr($1, 1, 1) == "#") || (($2 > 100000000) && ($2 < 105000000))) print $0 }' <(zcat ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz) | gzip -c > ALL.chr2.phase3.100.to.105.Mb.1000g.vcf.gz

To save disk space, the full file will be removed below and only the small region was kept.

rm ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Using another awk script, we make another file with only the snp names and the positions used in the next step for the interpolation.

awk '{if(substr($1, 1, 1) != "#") print $1, $2, $3}' <(zcat ALL.chr2.phase3.100.to.105.Mb.1000g.vcf.gz) > genetic_variants_in_region.txt 

For these genotypes, the recombination rates were interpolated based on the HAPMAP3 recombination rates

Download this file using the following command

wget ftp://ftp.ncbi.nlm.nih.gov/hapmap/recombination/2011-01_phaseII_B37/genetic_map_HapMapII_GRCh37.tar.gz

and unpack only the second chromosome using

tar -xvf genetic_map_HapMapII_GRCh37.tar.gz genetic_map_GRCh37_chr2.txt

Interpolation is done using the following python script

python3 interpolate_genetic_map.py

Haplotypes of biallelic snps were isolated from the individuals which are part of the European superpopulation, except for Finnish individuals. The list of individuals to include are saved in the file non_finnish_european_individuals.txt. which was made based on the sample table from samples in the 20130606_sample_info.xlsx. from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/.

We have made .haps and .legend files for input into HAPgen2 using the following python script:

python3 isolate_variants_make_hapgen_compatible.py

Using the HAPGEN 2 program genotypes were simulated for 25,000 individuals. The hapgen2 executable should be in your working directory, please download it if you have not done so already.

./hapgen2 -m interpolated_genetic_map.100Mb.to.105Mb.map -l europeans_1000g_chr2_100_to_105_Mb.legend -h europeans_1000g_chr2_100_to_105_Mb.hap -o simulated_geno -int 100000000 105000000 -n 25000 0 -dl 100202391 0 0.1 0.2

which are stored in the file prepend simulated_geno.*, where the summary filetype shows the command used. Control individuals are only considered in later analysis. chromosome names were set using sed:

sed 's/^snp_[0-9]\+ /2 /g' simulated_geno.controls.gen > tmp_output.gen
mv tmp_output.gen simulated_geno.controls.gen

Using plink the files are converted into bed files for use in the subsequent simulations. Filtering for MAF 1% in the full population, and then divided up into two 5,000 individual cohorts and one 15,000 individual which are subsequently not MAF filtered, to ensure variants remain the same across cohorts.

The full dataset is saved into full_simulated.(bed|bim|fam), and is divided up into three cohorts: exposure, reference, and an outcome cohort (15,000 individuals.).

plink  --data simulated_geno.controls --maf 0.01 --make-bed --out all_simulated 

Make the three separate plink genotype bed files: exposure_cohort, reference_cohort and outcome_cohort

awk '{print $1, $2} all_simulated.fam | sort -R > all_individuals_random.txt'
head -n 5000 all_individuals_random.txt > exposure_cohort_individuals.txt
plink --bfile all_simulated --chr 2 --from-mb 102 --to-mb 103 --keep exposure_cohort_individuals.txt --make-bed --out exposure_cohort   

head -n 10000 | tail -n 5000 all_individuals_random.txt > reference_cohort_individuals.txt
plink --bfile all_simulated --chr 2 --from-mb 102 --to-mb 103 --keep reference_cohort_individuals.txt --make-bed --out reference_cohort

tail -n 15000 all_indivduals_random.txt > outcome_cohort_individuals.txt
plink --bfile all_simulated --chr 2 --from-mb 102 --to-mb 103 --keep outcome_cohort_individuals.txt --make-bed --out outcome_cohort

And this finalizes the simulation of genotypes for the cohorts, so that we can simulate causal relationships.

STEP 2. simulation of causal relationships¶

The python script mr_link/simulate_phenotypes.py can be used to simulate phenotypes which are causally related in 2 different cohorts, which are then by default stored in the mr_link/simulated_phenotypes/ directory in a file format that the MR-link script (mr_link/MRlink.py) can understand.

In the following example the exposure plink genotype is named exposure_cohort.(bed|bim|fam), the outcome plink genotype files are named exposure_cohort.(bed|bim|fam)

It is possible to simulate phenotypes using the simulate_phenotypes.py script, if you run

python3 mr_link/simulate_phenotypes.py --help

you will read the following:

usage: simulate_phenotypes.py [-h] --bed_cohort_1 BED_COHORT_1 --bed_cohort_2
                              BED_COHORT_2 [--out_prepend OUT_PREPEND]
                              [--save_as {numpy,text}]
                              [--exposure_1_causal EXPOSURE_1_CAUSAL]
                              [--exposure_2_causal EXPOSURE_2_CAUSAL]
                              [--n_causal_exposure_1 N_CAUSAL_EXPOSURE_1]
                              [--n_causal_exposure_2 N_CAUSAL_EXPOSURE_2]
                              [--overlapping_causal OVERLAPPING_CAUSAL]
                              [--directional_pleiotropy DIRECTIONAL_PLEIOTROPY]
                              [--phenotypes_to_simulate PHENOTYPES_TO_SIMULATE]

optional arguments:
  -h, --help            show this help message and exit
  --bed_cohort_1 BED_COHORT_1
                        The bed file for cohort 1 (from which the exposure is
                        calculated and saved), alleles need to be harmonized
                        with bed_cohort_2
  --bed_cohort_2 BED_COHORT_2
                        The bed file for cohort 2 (from which the outcome is
                        calculated and saved)alleles need to be harmonized
                        with bed_cohort_1
  --out_prepend OUT_PREPEND
                        The prepend for where to place the file
  --save_as {numpy,text}
                        How to save the results. 'numpy' will be more
                        efficient and saves all data to a single file.'text'
                        will be human readable but save to eight separate
                        files per simulation.The latter can blow up the number
                        of files in the file system.
  --exposure_1_causal EXPOSURE_1_CAUSAL
                        causal effect of exposure 1
  --exposure_2_causal EXPOSURE_2_CAUSAL
                        causal effect of the pleiotropic exposure 2
  --n_causal_exposure_1 N_CAUSAL_EXPOSURE_1
                        number of causal SNPs for exposure 1
  --n_causal_exposure_2 N_CAUSAL_EXPOSURE_2
                        number of causal SNPs for the pleiotropic exposure 2
  --overlapping_causal OVERLAPPING_CAUSAL
                        number of causal SNPs that overlap between exposure 1
                        and exposure 2
  --directional_pleiotropy DIRECTIONAL_PLEIOTROPY
                        If the effects of the second exposure should be
                        directional
  --phenotypes_to_simulate PHENOTYPES_TO_SIMULATE
                        How many phenotypes are saved could take a long time.
                        Results are nonetheless saved after each simulation

and you can simulate phenotypes using the default settings by running:

python3 simulate_phenotypes.py --bed_cohort_1 exposure_cohort --bed_cohort_2 outcome_cohort

Which will save simulated phenotypes in the ./mr_link/simulated_files/ folder. By default, the exposure phenotype has no causal effect on the outcome, and the unobserved (pleiotropic) exposure does. You can edit the simulation parameters by changing the options, which are explained in the above help.

With these genotypes and phenotypes simulated and the summary statistics calculated, it’s now possible to run MR-link. Please read the section that explains how to run MR-link