module: genome_integration.simulate_mr

genome_integration.simulate_mr.geno_functions.do_gcta_cojo_conditional(reference_geno, associations, indices_of_interest, ref_loci)

Recreates the conditional analysis of the GCTA cojo paper. Does NOT do the stepwise regression, only does joint effects.

In testing are accurate up to 3 significant digits due to (I’m assuming) floating point differences.

See the COJO paper (JIAN YANG 2011, Nature Genetics) supplemental methods for the derivation. Their software is also open source, for an implementation, there.

Parameters:
  • reference_geno – Genotype matrix
  • associations – dict Dict of the GeneticAssociation class containing a superset of associations to do cojo from,.
  • indices_of_interest – nparray of indices. Indices of ref_loci with which we want to do associations.
  • ref_loci – list of str The names of the SNPs which are selected for COJO (keys for associations)
Returns:

numpy array of shape (len(indices_of_interest), 2) COJO results: conditional beta and se on the columns.

genome_integration.simulate_mr.geno_functions.do_gwas_on_scaled_variants(geno_vec, dependent, remove_three_encoding=True)

Will do a univariate association on a genotype vector and a phenotype vector.

Parameters:
  • geno_vec – numpy array of (n): genotypes (0,1,2, (missing:3) )
  • dependent – numpy array of (n): phenotypes (R)
  • remove_three_encoding – bool, to remove genotypes that are encoded as three.
Returns:

tuple of beta and se estimate.

genome_integration.simulate_mr.geno_functions.do_gwas_on_scaled_variants_fast(geno_vec, dependent, remove_three_encoding=True)

Will do a univariate association on a genotype vector and a phenotype vector.

This is the faster version, based on some simultaitons, the standard errors are a little less conservative than the slower statsmodels version. In simulations, the absolute difference is about 0.0004 in very significant variants. Something you may want to live with to more efficiently iterate over simulations.

Parameters:
  • geno_vec – numpy array of (n): genotypes (0,1,2, (missing:3) )
  • dependent – numpy array of (n): phenotypes (R)
  • remove_three_encoding – bool, to remove genotypes that are encoded as three.
Returns:

tuple of beta and se estimate.

genome_integration.simulate_mr.geno_functions.geno_frq(geno_vec, remove_three_encoding=True, encoding_of_a1=1)

Return the frequency of a genotype vector

Parameters:
  • geno_vec – numpy vector of genotypes.
  • remove_three_encoding – remove three encoding from plinkio
Returns:

allele frequency of the allele encoding the one.

genome_integration.simulate_mr.geno_functions.ld_from_geno_mat(geno_mat)

make an LD mat, :param geno_mat: n x m numpy matrix,

n number of individuals, m number of variants, has values 0, 1, 2 and missing is 3
Parameters:axis – int

axis on which to apply the function. :return: pearson correlation matrix

genome_integration.simulate_mr.geno_functions.read_geno_mat_plinkio(bed_location)

Reads in a genotype matrix, using plinkio, Takes a lot of mem, and a lot of time if there are a lot of individuals.

Parameters:bed_location – path of the bed file for reading. make sure you’ve limited the it enough before you continue.
Returns:genotype matrix, and the plinkio documentation
genome_integration.simulate_mr.geno_functions.scale_geno_vec(geno_vec, remove_three_encoding=True)

Scale a genotype vector in (0,1,2) to mean zero, variance one. based on the allele frequency.

Parameters:
  • geno_vec – numpy vector of genotypes
  • remove_three_encoding – remove the three encoding from plinkio, sets values to zero.
Returns:

scaled genotype vector.

genome_integration.simulate_mr.simulation_functions.choose_exposure_snps(exposure_1_n_causal, exposure_2_n_causal, overlapping_causal_snps, exposure_geno, exposure_ld, upper_ld_bound, known_exposure_lower_ld_bound, lower_ld_bound)

This is a helper function for simulate_phenotypes to choose which indices will be considered for causal SNPs. It’s in it’s own function because it can fail to find SNPs from a selection, so we make the parent function more robust to failures.

It implements steps 1 and 2 in the simulation function:

  1. exposure 1 SNPs are chosen, and they are chosen not to be in higher ld than 0.95 R ^ 2
  2. exposure 2 SNPs are chosen in two ways. overlapping_causal_snps are chosen without replacement from the exposure 1 SNPs, the rest are chosen from SNPs that are in LD with exposure 1.
Parameters:
  • exposure_1_n_causal
  • exposure_2_n_causal
  • overlapping_causal_snps
  • exposure_geno
  • exposure_ld
  • upper_ld_bound
  • known_exposure_lower_ld_bound
  • lower_ld_bound
Returns:

genome_integration.simulate_mr.simulation_functions.simulate_phenotypes(exposure_1_causal, exposure_2_causal, exposure_1_n_causal, exposure_2_n_causal, exposure_geno, exposure_ld, outcome_geno, exposure_ld_r_sq=None, overlapping_causal_snps=0, error_sd=1.0, confounder_sd=0.5, inside_phi=0.0, directional_pleiotropy=True, known_exposure_lower_ld_bound=0.0, center_phenotypes=True, known_outcome_covariate=None, known_outcome_covariate_effect_size=None, max_tries=100)

This function simulates two cohorts which are under the same genetic control This function is present to ensure that across simulation scenarios, the same parameters are used.

Step by step:

  1. exposure 1 SNPs are chosen, and they are chosen not to be in higher ld than 0.95 R ^ 2
  2. exposure 2 SNPs are chosen in two ways. overlapping_causal_snps are chosen without replacement from the exposure 1 SNPs, the rest are chosen from SNPs that are in LD with exposure 1.
  3. The confounder is simulated, based on some error and if inside_phi is non-zero, it is given a genetic effect.
  4. The effect sizes of exposure 1 and of exposure 2 eQTLs are chosen.
  5. The phenotypes are simulated based on the function simulate_phenotypes_extended, in the exposure cohort, and the outcome cohort indepdently.
  6. returning the following:
    exposure_phenotype, outcome_phenotype, exposure_1_causal_snps, exposure_1_betas, exposure_2_causal_snps, exposure_2_betas
Parameters:exposure_1_causal

The causal effect on exposure 1

Parameters:exposure_2_causal

The causal effect on exposure 2

Parameters:exposure_1_n_causal

The number of causal SNPs for exposure 1e

Parameters:exposure_2_n_causal

The number of causal SNPs for exposure 2

Parameters:exposure_geno

The genotype matrix for the exposure

Parameters:exposure_ld

The ld R matrix for the exposure genotypes – NB. NOT squared correlation

Parameters:outcome_geno

The genotype matrix for the outcome

Parameters:outcome_ld

The ld R matrix for the outcome genotypes. currently not used.

Parameters:overlapping_causal_snps

An integer number representing the number of causal SNPs for exposure 2 that overlap exposure 1.

Parameters:error_sd

The scale (standard deviation) parameter of the error that is added to a phenotype

Parameters:phi (inside) –

The phi parameter (from the egger paper notation), that makes the confounder dependent of the genotype. if non zero, a genetic effect with effect size (0,2) is added to the confounder, using the exposure 2 causal SNPs

Parameters:directional_pleiotropy

if there should be directional pleiotropy or not.

:param known_exposure_lower_ld_bound
float in the range 0-1 The lower LD bound to find the instruments for. May make it more difficult to find causal variants if this value is large. Default: 0.0
:param center_phenotypes
Boolean indicating if phenotypes need to be centered default=True
:param known_outcome_covariate,
list or numpy array of the same size as there are individuals in the outcome genotypes. This will add a covariate to the phenotype simulation that the causal inference method will need to account for. default: None (No covariate will be applied to the outcome)
:param known_outcome_covariate_effect_size
float, The effect size for the known covariate specified in. defualt: None (No covariate will be applied to the outcome.)
:param max_tries
int: the number of tries to give the instrument selection. default 100
Returns:

Returns a tuple of the following numpy arrays:

1 by n vector of simulated phenotypes for the exposure, 1 by n vector simulated phenotypes for the outcome, 1 by n causal vector of the indices of the causal snps of exposure 1. 1 by n causal vector of the effect sizes of the causal snps of exposure 1. 1 by n causal vector of the indices of the causal snps of exposure 2. 1 by n causal vector of the effect sizes of the causal snps of exposure 2.

genome_integration.simulate_mr.simulation_functions.simulate_phenotypes_binary_outcome(exposure_1_causal, exposure_2_causal, exposure_1_n_causal, exposure_2_n_causal, exposure_geno, exposure_ld, outcome_geno, exposure_ld_r_sq=None, overlapping_causal_snps=0, error_sd=1.0, confounder_sd=0.5, inside_phi=0.0, directional_pleiotropy=True, known_exposure_lower_ld_bound=0.0, proportion_cases=0.5, known_outcome_covariate=None, known_outcome_covariate_effect_size=None)

This function simulates two binary outcome phenotypes in cohorts which are under the same genetic control

Step by step:

  1. exposure 1 SNPs are chosen, and they are chosen not to be in higher ld than 0.95 R ^ 2

  2. exposure 2 SNPs are chosen in two ways. overlapping_causal_snps are chosen without replacement from the exposure 1 SNPs, the rest are chosen from SNPs that are in LD with exposure 1.

  3. The confounder is simulated, based on some error and if inside_phi is non-zero, it is given a genetic effect.

  4. The effect sizes of exposure 1 and of exposure 2 eQTLs are chosen.

  5. The phenotypes are simulated based on the function simulate_phenotypes_extended, in the exposure cohort, and the outcome cohort independently.

  6. The outcome phenotypes are transformed using a logistic function. The they are converted into case control variables following the binomial function with the logistically transformed variables as probabilities.

    If the parameter proportion_cases is not 0.5, we shift the simulated phenotypes by a factor so that the proportion cases is correct. As far as I know, there is no analytical solution to this, so this is done using an optimization function.

  7. returning the following:

    exposure_phenotype, outcome_phenotype, exposure_1_causal_snps, exposure_1_betas, exposure_2_causal_snps, exposure_2_betas

Parameters:exposure_1_causal

The causal effect on exposure 1

Parameters:exposure_2_causal

The causal effect on exposure 2

Parameters:exposure_1_n_causal

The number of causal SNPs for exposure 1e

Parameters:exposure_2_n_causal

The number of causal SNPs for exposure 2

Parameters:exposure_geno

The genotype matrix for the exposure

Parameters:exposure_ld

The ld R matrix for the exposure genotypes

Parameters:outcome_geno

The genotype matrix for the outcome encoded

Parameters:outcome_ld

The ld R matrix for the outcome genotypes.

Parameters:overlapping_causal_snps – int

An integer number representing the number of causal SNPs for exposure 2 that overlap exposure 1.

Parameters:error_sd – float

The scale (standard deviation) parameter of the error that is added to a phenotype

Parameters:phi (inside) – float

The phi parameter (from the egger paper notation), that makes the confounder dependent of the genotype. if non zero, a genetic effect with effect size (0,2) is added to the confounder, using the exposure 2 causal SNPs

Parameters:directional_pleiotropy – bool

if there should be directional pleiotropy or not.

Parameters:proportion_cases – float The proportion of cases and for the outcome phenotypes.
Returns:

Returns a tuple of the following numpy arrays:

1 by n vector of simulated phenotypes for the exposure, 1 by n vector simulated binary phenotypes for the outcome, 1 = case, 0 = control 1 by n causal vector of the indices of the causal snps of exposure 1. 1 by n causal vector of the effect sizes of the causal snps of exposure 1. 1 by n causal vector of the indices of the causal snps of exposure 2. 1 by n causal vector of the effect sizes of the causal snps of exposure 2.

genome_integration.simulate_mr.simulation_functions.simulate_phenotypes_extended(geno, exposure_1_causal_snps, exposure_1_causal_effect, exposure_1_betas, exposure_2_causal_snps, exposure_2_causal_effect, exposure_2_betas, error_scale, confounder)

This function will simulate phenotypes based on parameters passed through it. Simulation is done through up two exposures, which can both be causal for some outcome.

The math behind it is the following: exposure 1: y_{e1} = X_{1} cdot b_{e1} + C + e exposure 2: y_{e2} = X_{2} cdot b_{e2} + C + e

outcome: y_{o} = y_{e1} b_{1} + X_{1} cdot b_{p,1} + X_{2} cdot b_{p,1} + y_{e2} b_{2} + C + e

Parameters:geno

A genotype matrix, and n individuals by m variants, numpy integer matrix, representing the genotypes of individuals.

Parameters:exposure_1_causal_snps

A vector of length i (<= m), number of causal snps for exposure 1, with indices in the genotype matrix from which to choose causal snps. X_{1} is the genotype matrix, with only these cauasal indices.

Parameters:exposure_1_causal_effect

A float representing the causal effect of exposure 1 on the outcome

Parameters:exposure_1_betas

b_{1} is a i by 1 vector of floats, representing the causal effect of the snp on the exposure.

Parameters:exposure_2_causal_snps

A vector of length j (<= m), number of causal snps for exposure 2, with indices in the genotype matrix from which to choose causal snps. X_{2} is the genotype matrix, with only these causal indices.

Parameters:exposure_2_causal_effect

A float representing the causal effect of exposure 1 on the outcome

Parameters:exposure_2_betas

b_{2} is a j by 1 vector of floats, representing the causal effect of the snp on the exposure.

Parameters:error_scale

e ~ N(0, error_scale) The scale parameter of the e term, drawn from a normal distribution with this standard deviation

Parameters:confounder

a n by 1 vector of confounder values per individual.

Returns:

three length n vectors representing in order: phenotypes for exposure 1, exposure 2 and the outcome.