module: genome_integration.association

class genome_integration.association.association_classes.Association(dependent_name, explanatory_name, n_observations, beta, se, r_squared=None)

This class holds the summary statistics of a univariate linear association without intercept. In contrast to the parent class BaseAssociation, data on the association is required.

dependent_name: str

Name of the dependent variable, i.e. the y in the solved equation y=xb+e

This name is not used anywhere.

explanatory_name: str
Name of the explanatory variable, i.e. the x in the solved equation y=xb+e
beta: float or castable to float
value of the slope variable, i.e. the b in the solved equation y=xb+e
se: float or castable to float
value of the standard error of the slope variable, i.e. the se(b) in the solved equation y=xb+e
n_observations: int or castable to int
The number of observations upon which the equation y=xb+e is solved.
r_squared: float or castable to float
The coefficient of determination or how much variance the model explains out of total variance.
set_wald_p_val():
Sets the Wald test p value of the estimate. Warning: this p value is only adequate if you have sufficient observations; otherwise a t statistic is more meaningful.
set_p_val(pval)

This method sets a p value you calculated yourself.

Parameters:pval
Returns:self
set_wald_p_val(pval)

This method sets a p value you calculated yourself.

Parameters:pval
Returns:self
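
As an illustration of what a Wald test p value is, the sketch below computes a two-sided p value from beta and se under the assumption that beta / se follows a standard normal distribution. The function name wald_p_value is hypothetical and is not part of genome_integration; the library's own computation may differ.

```python
import math


def wald_p_value(beta: float, se: float) -> float:
    """Two-sided Wald p value, assuming beta / se is standard normal.

    Hypothetical helper for illustration only.
    """
    if se <= 0:
        raise ValueError("se must be positive")
    z = beta / se
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Example: beta = 0.12 and se = 0.03 give z = 4.0.
p = wald_p_value(beta=0.12, se=0.03)
```

With few observations the normal approximation is poor, which is why the warning above recommends a t statistic in that case.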
class genome_integration.association.association_classes.BaseAssociation(dependent_name=None, explanatory_name=None, n_observations=None, beta=None, se=None, r_squared=None)

This class holds the summary statistics of a univariate linear association without intercept. No data on the association is necessary, as this is a parent class; the data itself should be in a marginal association.

dependent_name: str

Name of the dependent variable, i.e. the y in the solved equation y=xb+e

This name is not used anywhere.

explanatory_name: str
Name of the explanatory variable, i.e. the x in the solved equation y=xb+e
beta: float or castable to float
value of the slope variable, i.e. the b in the solved equation y=xb+e
se: float or castable to float
value of the standard error of the slope variable, i.e. the se(b) in the solved equation y=xb+e
n_observations: int or castable to int
The number of observations upon which the equation y=xb+e is solved.
r_squared: float or castable to float
The coefficient of determination or how much variance the model explains out of total variance.
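
To make the attributes above concrete, here is a minimal sketch of how beta, se, r_squared and n_observations could be derived for the model y=xb+e without intercept. The function fit_no_intercept is hypothetical. The residual variance uses n - 1 degrees of freedom and r_squared uses the uncentered total sum of squares, which is one common convention for models without intercept; the package may use a different one.

```python
import math


def fit_no_intercept(x, y):
    """Least-squares fit of y = x*b + e without intercept.

    Hypothetical helper; returns (beta, se, r_squared, n_observations).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    beta = sum_xy / sum_xx
    # Residual sum of squares around the fitted line through the origin.
    ss_res = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    # One estimated parameter, so n - 1 residual degrees of freedom (assumption).
    se = math.sqrt(ss_res / (n - 1) / sum_xx)
    # Uncentered R^2, because the model has no intercept (assumption).
    r_squared = 1.0 - ss_res / sum(yi * yi for yi in y)
    return beta, se, r_squared, n


beta, se, r_squared, n_observations = fit_no_intercept([1.0, 2.0, 3.0], [1.0, 2.0, 2.0])
```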


class genome_integration.association.association_classes.GeneticAssociation(dependent_name, explanatory_name, n_observations, beta, se, r_squared=None, chromosome=None, position=None, major_allele=None, minor_allele=None, minor_allele_frequency=None, reference_allele=None, effect_allele=None)

This class will represent a genetic association. Depends on the SNP and association parent classes.

By definition of this class:

definition: THE MINOR ALLELE IS THE EFFECT ALLELE. This decision is not great, but it’s grandfathered in; for now it’s good enough.

Attributes specific to the GeneticAssociation class.

effect_allele: str
Which allele is used as the effect allele, meaning the allele which increases the variable x. This is important to know, as it makes it possible to identify the risk allele.

Other attributes are inherited from their respective classes, so they are subject to change.

Specific to the GeneticAssociation class:

add_snp_data(self, snp_data, overwrite=False)
Adds data from another SNP class-like object. This method masks the one from the SNP class, because it also multiplies b by -1 if the alleles are flipped. Otherwise it does the same.

Other methods are inherited from the other classes. See their documentation, as it is subject to change.

add_snp_data(snp_data, overwrite=False)

This method returns itself with updated snp data. It will only change data from a class if the snp_name is the same, or if the position

Parameters:snp_data – a SNP object or something that was extended from it
Returns:self, but with updated allele and name information.
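
The sign flip mentioned for add_snp_data can be sketched as follows. ToyAssociation and align_to_reference are hypothetical stand-ins and do not reproduce the GeneticAssociation API; they only show why beta is multiplied by -1 when the effect allele and the other allele are swapped relative to the incoming SNP data.

```python
from dataclasses import dataclass


@dataclass
class ToyAssociation:
    """Hypothetical stand-in, not the library's GeneticAssociation."""
    beta: float
    effect_allele: str
    other_allele: str


def align_to_reference(assoc, ref_effect, ref_other):
    """Return an association expressed for the reference effect allele."""
    alleles = (assoc.effect_allele, assoc.other_allele)
    if alleles == (ref_effect, ref_other):
        # Same orientation: the slope is unchanged.
        return assoc
    if alleles == (ref_other, ref_effect):
        # Flipped orientation: the same effect with the opposite sign.
        return ToyAssociation(-assoc.beta, ref_effect, ref_other)
    raise ValueError("alleles do not match the reference")


flipped = align_to_reference(ToyAssociation(0.2, "A", "G"), "G", "A")
```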
genome_integration.association.binary_files.add_p_values_to_associations_dict(associations)

This function adds Wald p values to Association objects. As this is expensive to estimate one at a time, it is done in vector form.

Parameters:associations
Returns:
genome_integration.association.binary_files.read_bin_file(file_name, bim_data)

Turns a compressed file of associations into a dict of fully populated GeneticAssociation objects.

Parameters:
  • file_name – File where the binary associations are stored
  • bim_data – the BimFile object (see genome_integration.variants) that is used to know which allele was used as the effect allele.
Returns:

dictionary of all Genetic associations

genome_integration.association.binary_files.read_bin_file_region(file_name, bim_data, gene_region)

Turns a compressed file of associations into a dict of fully populated GeneticAssociation objects, restricted to a region.

Parameters:
  • file_name – File where the binary associations are stored
  • bim_data – the BimFile object (see genome_integration.variants) that is used to know which allele was used as the effect allele.
  • gene_region – The StartEndRegion class, which is used to filter for a specific region only, making it a bit faster and less expensive in memory to read than the full file.
Returns:

dictionary of all Genetic associations

Turns a byte array into a tuple of associations.

Parameters:bytes
Returns:tuple of ints and floats, representing the summary statistics.

This function parses a plink association file and turns it into a binary association file.

Parameters:
  • in_file – File name of the plink .qassoc file
  • out_file – File name to output to.
  • gene_name – name of the gene for which the associations are made.
Returns:

This turns a non-header line from a standard plink repeated univariate association file (*.qassoc) into a byte format. The format represents, in order:

  • short (h): chromosome (only numeric chromosomes allowed)
  • int (i): position
  • int (i): number of observations
  • float (f): b, the slope estimate of the equation y = xb + e, where y is the phenotype vector, x is the genotype vector of a variant, and e is the residual error
  • float (f): se(b), the standard error of the beta estimate
  • float (f): R^2, the variance explained by the model

Parameters:line
Returns:either a byte array with the format, or if something went wrong, an empty byte array.
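
The record layout described above maps directly onto Python's struct module. The sketch below is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, the byte order (little-endian, no padding) is an assumption because the source only gives the field order, and the column order CHR SNP BP NMISS BETA SE R2 T P is assumed for the .qassoc line.

```python
import struct

# Assumption: little-endian without padding. The source only states the
# field order: short, int, int, float, float, float.
RECORD = struct.Struct("<hiifff")


def qassoc_line_to_bytes(line: str) -> bytes:
    """Pack one non-header .qassoc line; return b"" if it cannot be parsed."""
    try:
        # Assumed column order: CHR SNP BP NMISS BETA SE R2 T P
        fields = line.split()
        chrom, pos, n_obs = int(fields[0]), int(fields[2]), int(fields[3])
        beta, se, r2 = float(fields[4]), float(fields[5]), float(fields[6])
        return RECORD.pack(chrom, pos, n_obs, beta, se, r2)
    except (ValueError, IndexError, struct.error):
        # Non-numeric chromosome, missing values such as "NA", or too few columns.
        return b""


def bytes_to_association(record: bytes) -> tuple:
    """Unpack one record into (chromosome, position, n, beta, se, r_squared)."""
    return RECORD.unpack(record)


# "rs123" is a made-up variant name used only for this example.
packed = qassoc_line_to_bytes("1 rs123 10583 500 0.5 0.25 0.125 2.0 0.046")
```

Because the floats are stored in 32 bits, values read back from such a file carry single precision only.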