module: genome_integration.utils¶

These classes and functions provide miscellaneous utilities.

genome_integration.utils.file_utils.read_newline_separated_file_into_list(file_name)¶

Reads a file into a list of strings.

Parameters:	file_name – str filename to read
Returns:	list of str; newline characters are removed.

genome_integration.utils.file_utils.read_newline_separated_gz_file_into_list(filename)¶

Reads a gzipped file into a list of strings.

Parameters:	filename – str filename to read
Returns:	list of str; newline characters are removed.

genome_integration.utils.file_utils.write_list_to_newline_separated_file(in_list, file_name)¶

Convenience function: Writes list into newline separated file.

Parameters:

in_list – list of str; the lines to write. Items must not already end with a newline character.
file_name – str; output file name
Returns:	None
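
The behaviour of the three file helpers above can be approximated with the standard library. This is a minimal stand-alone sketch of what the descriptions say, not the package's own implementation; the names `write_lines`, `read_lines` and `read_gz_lines` are placeholders chosen for this example.

```python
import gzip
import os
import tempfile


def write_lines(in_list, file_name):
    """Write each string on its own line; items must not contain newlines."""
    with open(file_name, "w") as handle:
        handle.writelines(item + "\n" for item in in_list)


def read_lines(file_name):
    """Read a text file into a list of strings, newline characters removed."""
    with open(file_name, "r") as handle:
        return [line.rstrip("\n") for line in handle]


def read_gz_lines(file_name):
    """Same as read_lines, but for a gzip-compressed text file."""
    with gzip.open(file_name, "rt") as handle:
        return [line.rstrip("\n") for line in handle]


# Round trip through a temporary directory with made-up SNP names.
snps = ["rs123", "rs456", "rs789"]
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "snps.txt")
gz_path = os.path.join(tmp_dir, "snps.txt.gz")

write_lines(snps, plain_path)
with gzip.open(gz_path, "wt") as handle:
    handle.writelines(item + "\n" for item in snps)

round_trip_plain = read_lines(plain_path)
round_trip_gz = read_gz_lines(gz_path)
```

Both reads return the list that was written, without trailing newline characters.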

class genome_integration.utils.gcta_cojo_utils.CojoCmaFile(file_loc, name)¶

Contains the GCTA COJO CMA file results. The conditional joint associations are stored in the ma_results attribute.

name: str: name of the association
ma_results: dict: dict with snp_names as keys, and the CojoCmaLine as values.

class genome_integration.utils.gcta_cojo_utils.CojoCmaLine(line, name)¶

This is a helper class around the GeneticAssociation class that adds the information GCTA also keeps.

From the PCTG GCTA documentation, the columns are:

chromosome; SNP; physical position; frequency of the effect allele in the original data; the effect allele; effect size, standard error and p-value from the original GWAS or meta-analysis; estimated effective sample size; frequency of the effect allele in the reference sample; effect size, standard error and p-value from a joint analysis of all the selected SNPs; LD correlation between the SNP i and SNP i + 1 for the SNPs on the list.

Attributes specific to this class, rest is inherited from GeneticAssociation:

beta_initial: float: initial beta from input
se_intitial: float: initial se from input
p_intitial: float: initial p value from input
freq_geno: float: frequency of the effect allele in the reference population
n_estimated: estimated effective sample size
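
To make the column description concrete, the sketch below splits one result line into named fields. It is an illustration only: the field names are placeholders, the field order simply follows the column description quoted above, whitespace-separated columns are an assumption, and the example line contains made-up values. Check the header of a real output file before relying on any particular order.

```python
from collections import namedtuple

# Placeholder field names, in the order of the column description above.
CmaColumns = namedtuple(
    "CmaColumns",
    [
        "chromosome", "snp", "position", "freq", "effect_allele",
        "beta_initial", "se_initial", "p_initial", "n_estimated", "freq_geno",
        "beta_joint", "se_joint", "p_joint", "ld_r",
    ],
)


def parse_cma_line(line):
    """Split one whitespace-separated line and convert the numeric fields."""
    fields = line.rstrip("\n").split()
    if len(fields) != len(CmaColumns._fields):
        raise ValueError("expected %d columns, found %d"
                         % (len(CmaColumns._fields), len(fields)))
    chrom, snp, pos, freq, allele = fields[:5]
    numeric = [float(value) for value in fields[5:]]
    return CmaColumns(chrom, snp, int(pos), float(freq), allele, *numeric)


# Made-up example line, for illustration only.
example = "1\trs123\t1000\t0.25\tA\t0.12\t0.03\t6.3e-05\t4999.5\t0.26\t0.11\t0.03\t2.5e-04\t0.0"
parsed = parse_cma_line(example)
```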

genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_joint(bfile_prepend, ma_file, out_prepend, p_val='1e-8', maf='0.01', gc=1.0)¶

Does GCTA COJO stepwise selection and joint effects

Parameters:

bfile_prepend – bed file location
ma_file – ma file location
out_prepend – where to output
p_val – p value threshold for stepwise selection
maf – minor allele frequency threshold for stepwise selection
gc – genomic correction factor, default is 1.0. Make sure to check this in your associations.
Returns:	CojoCmaFile object with the results.

genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_joint_on_only_snps_genetic_associations(genetic_associations, bfile, tmp_prepend, _keep_ma_files=False)¶

Parameters:

genetic_associations – a dict of genetic associations, keys should be explanatory names
bfile – plink bed file
snps_to_condition_on – list like object of SNPs that are in the genetic associations object on which we condition.
tmp_prepend – temporary name of files where to store.
_keep_ma_files – This is used for testing. MA files are used to know the exact floating point input.

Returns:

Cojo results: a CojoCmaFile object, which is an extension of the genetic association file.

genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_on_genetic_associations(genetic_associations, bfile, tmp_prepend, p_val_thresh=0.05, maf=0.01, calculate_ld=False, clump=False, create_tmp_subset_of_bed=True, individuals_to_analyze=None)¶

Parameters:

genetic_associations – a dict of genetic associations, keys should be explanatory names
bfile – plink bed file
tmp_prepend – temporary name of files where to store.
p_val_thresh – p value threshold as a float
maf – minor allele frequency as a float

Returns:

Cojo results: a CojoCmaFile object, which is an extension of the genetic association file.

genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_slct(bfile_prepend, ma_file, out_prepend, p_val=1e-08, maf=0.01, gc=1.0, n_threads=1)¶

Does GCTA COJO stepwise selection (no joint effects).

Parameters:

bfile_prepend – bed file location
ma_file – ma file location
out_prepend – where to output
p_val – p value threshold for stepwise selection
maf – minor allele frequency threshold for stepwise selection
gc – genomic correction factor, default is 1.0. Make sure to check this in your associations.
Returns:	CojoCmaFile object with the results.
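
How the package invokes GCTA is not shown on this page. As a rough sketch, a stepwise-selection call could be assembled from the options described in the GCTA-COJO documentation (`--bfile`, `--cojo-file`, `--cojo-slct`, `--cojo-p`, `--maf`, `--cojo-gc`, `--thread-num`, `--out`). The helper name, the executable name `gcta64` and the file paths are assumptions made for this example.

```python
def build_cojo_slct_command(bfile_prepend, ma_file, out_prepend,
                            p_val=1e-8, maf=0.01, gc=1.0, n_threads=1,
                            gcta_executable="gcta64"):
    """Return the argument list for a GCTA-COJO stepwise selection run.

    Hypothetical helper: it only builds the list and does not run GCTA.
    """
    return [
        gcta_executable,
        "--bfile", bfile_prepend,      # plink bed/bim/fam prefix
        "--cojo-file", ma_file,        # summary statistics in ma format
        "--cojo-slct",                 # stepwise model selection
        "--cojo-p", str(p_val),        # p value threshold
        "--maf", str(maf),             # minor allele frequency threshold
        "--cojo-gc", str(gc),          # genomic control factor
        "--thread-num", str(n_threads),
        "--out", out_prepend,
    ]


# Made-up paths, for illustration only.
command = build_cojo_slct_command("data/genotypes", "tmp/exposure.ma",
                                  "tmp/exposure_cojo")
```

A list built this way could be passed to `subprocess.run(command, check=True)` on a machine where GCTA is installed.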

genome_integration.utils.gcta_cojo_utils.do_gcta_joint_on_specified_snps(bfile_prepend, ma_file, out_prepend, gc=1.0)¶

Does GCTA COJO joint.

Parameters:

bfile_prepend – bed file location
ma_file – ma file location
out_prepend – where to output
snp_file_name – this contains the variants on which the analysis should be conditional
p_val – p value threshold for stepwise selection
gc – genomic correction factor, default is 1.0. Make sure to check this in your associations.
Returns:	CojoCmaFile object with the results.

genome_integration.utils.gcta_cojo_utils.make_gcta_ma_header()¶

Creates an ma header for GCTA-COJO.

Returns:	String with an ma file header.

genome_integration.utils.gcta_cojo_utils.make_gcta_ma_line(genetic_association)¶

Makes a GCTA line of the genetic variant.

Returns only a string without a trailing newline and does not write to a file; the caller is expected to do that.

Parameters:	genetic_association – GeneticAssociation class object
Returns:	tab separated string that can be part of an ma file
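
The GCTA-COJO input format has eight columns: SNP, effect allele, other allele, effect allele frequency, effect size, standard error, p value and sample size. A minimal sketch of a header and a line in that layout follows. The `ExampleAssociation` class and its attribute names are hypothetical stand-ins for a GeneticAssociation object, the tab-separated header is an assumption, and the values are made up.

```python
from dataclasses import dataclass


@dataclass
class ExampleAssociation:
    """Hypothetical stand-in for a GeneticAssociation object."""
    snp_name: str
    effect_allele: str
    other_allele: str
    effect_allele_freq: float
    beta: float
    se: float
    p_value: float
    n_observations: int


def example_ma_header():
    """Column names as used in the GCTA-COJO documentation."""
    return "SNP\tA1\tA2\tfreq\tb\tse\tp\tN"


def example_ma_line(assoc):
    """Tab separated line without a trailing newline."""
    fields = [
        assoc.snp_name, assoc.effect_allele, assoc.other_allele,
        assoc.effect_allele_freq, assoc.beta, assoc.se,
        assoc.p_value, assoc.n_observations,
    ]
    return "\t".join(str(field) for field in fields)


association = ExampleAssociation("rs123", "A", "G", 0.25, 0.12, 0.03, 6.3e-05, 5000)
header = example_ma_header()
line = example_ma_line(association)

# The caller adds the newlines when writing the file.
ma_file_text = "\n".join([header, line]) + "\n"
```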

class genome_integration.utils.gcta_ma_utils.MaFile(file_loc, name)¶

Implements a reader and writer to the MA file as described by GCTA COJO

name: str: name of the file.
ma_results: dict: keys are the SNP names and values are of class MaLine

snp_names(self, no_palindromic = False): return the snp names in the mafile
write_results(self, file_name): write the results to a file name
add_bim_data(self, bim_data): add the position and chromosome from a BimFile reference to all the SNPS.

add_bim_data(bim_data)¶

Adds the position and chromosome from a BimFile reference to all the SNPs.

Parameters:	bim_data – BimFile reference that supplies the positions and chromosomes
Returns:	None

snp_names(no_palindromic=False)¶

Returns the SNP names in the ma file.

Parameters:	no_palindromic – bool; if True, palindromic SNPs are left out of the result (default is False, so palindromic SNPs are also returned)
Returns:	list of SNP names.
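
A SNP is palindromic when its two alleles are each other's complement (A/T or C/G), so the strand cannot be inferred from the alleles alone. The sketch below shows one way such SNPs could be filtered out. It is not the package's implementation, and the helper names and the example alleles are made up.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(allele_1, allele_2):
    """True for A/T and C/G pairs, False for anything else."""
    return COMPLEMENT.get(allele_1.upper()) == allele_2.upper()


def example_snp_names(alleles_per_snp, no_palindromic=False):
    """Return SNP names, optionally leaving out palindromic SNPs."""
    return [
        name
        for name, (allele_1, allele_2) in alleles_per_snp.items()
        if not (no_palindromic and is_palindromic(allele_1, allele_2))
    ]


# Made-up alleles, for illustration only.
alleles = {"rs1": ("A", "G"), "rs2": ("A", "T"), "rs3": ("C", "T"), "rs4": ("G", "C")}
all_names = example_snp_names(alleles)
non_palindromic_names = example_snp_names(alleles, no_palindromic=True)
```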

write_result(file_name)¶

Writes the results to a file.

Parameters:	file_name – str file name to write to
Returns:	None

class genome_integration.utils.gcta_ma_utils.MaLine(line)¶

Class that inherits from GeneticAssociation, reading an MA line.

All the attributes of the GeneticAssociation class.

class genome_integration.utils.plink_utils.FamFile(fam_loc)¶

Implements a reader for a plink fam file.

fam_loc: str: path to the fam file.
sample_names: list: sample names of the fam. Currently using <fid>~__~<iid>
fam_samples: dict: dictionary with sample names as keys and FamSample as the value.
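
A plink fam file has six whitespace-separated columns per sample: family ID, individual ID, father ID, mother ID, sex and phenotype. The sketch below builds sample names in the `<fid>~__~<iid>` form mentioned above. It uses plain dicts instead of FamSample objects, and the function name and example lines are made up.

```python
def parse_fam_lines(lines, separator="~__~"):
    """Return (sample_names, samples) from the lines of a fam file."""
    sample_names = []
    samples = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            raise ValueError("a fam line should have 6 columns: %r" % line)
        fid, iid, _father, _mother, sex, phenotype = fields
        name = fid + separator + iid
        sample_names.append(name)
        samples[name] = {"fid": fid, "iid": iid, "sex": sex,
                         "phenotype": phenotype}
    return sample_names, samples


# Made-up fam content, for illustration only.
fam_lines = ["fam1 ind1 0 0 1 2", "fam1 ind2 0 0 2 1"]
names, fam_samples = parse_fam_lines(fam_lines)
```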

class genome_integration.utils.plink_utils.FamSample(fid, iid, sex, name, phenotype=None)¶

Extends the sample class, will also contain family ID and sample ID.

Attributes specific to this class:

fid: fid from the plink file
iid: iid from the plink file

class genome_integration.utils.plink_utils.PlinkFile(bfile_location)¶

A reader for the plink bed file format.

requires a valid location of the bed file.

bed_loc: str: of the bed file location
bim_loc: str: of the bim file location
fam_loc: str: of the fam file location
genotypes: float numpy array with samples on the rows and variants on the columns: genotypes, by default the number of minor alleles at each variant. Not initialized; call read_bed_file_into_numpy_array first.
bim_data: BimFile object: contains all the variants from the bim file.
fam_data: FamFile object: contains all the samples from the fam file.
_decoder: dict: bit dict of the file encoding.

read_bed_file_into_numpy_array(self, allele_2_as_zero=True, missing_encoding=3): reads in the bed file and saves it to the class. Takes about 5 seconds for 5,000 individuals and ~14,000 variants.
prune_for_a_region(self, region): prunes for a StartEndRegion
harmonize_genotypes(self, other_plink_file): harmonizes another PlinkFile object to this file's alleles (flips alleles and flips genotypes; MAJOR and minor will then be false names).

harmonize_genotypes(other_plink_file)¶

Harmonizes the other plink file to the alleles of self. Requires that the variants are the same between the files and that the variants have the same alleles.

WARNING: As a result, the other plink file will have flipped major and minor alleles.

Parameters:	other_plink_file – Plink File to harmonize
Returns:	PlinkFile
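
When the two files list the same alleles in opposite order, harmonizing amounts to swapping the allele labels and replacing each allele count g by 2 - g, while leaving missing values alone. The sketch below illustrates that idea for a single variant. It is not the package's implementation; the helper names are made up, and the missing value of 3 follows the missing_encoding default mentioned on this page.

```python
def flip_genotypes(genotypes, missing_encoding=3):
    """Swap the counted allele: 0 <-> 2, 1 stays 1, missing stays missing."""
    return [g if g == missing_encoding else 2 - g for g in genotypes]


def harmonize_variant(reference_alleles, other_alleles, other_genotypes,
                      missing_encoding=3):
    """Return (alleles, genotypes) of the other file, aligned to the reference."""
    if other_alleles == reference_alleles:
        return other_alleles, list(other_genotypes)
    if other_alleles == reference_alleles[::-1]:
        flipped = flip_genotypes(other_genotypes, missing_encoding)
        return reference_alleles, flipped
    raise ValueError("the variants do not have the same alleles")


# Made-up alleles and genotypes; the last genotype is missing.
alleles, genotypes = harmonize_variant(("A", "G"), ("G", "A"), [0, 1, 2, 3])
```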

output_genotypes_to_bed_file(output_prepend)¶

Writes the genotypes to a bed file.

Parameters:	output_prepend – the output prefix, to which .bed, .bim and .fam are appended.
Returns:	Nothing, but the bed, bim and fam files are written.

prune_for_a_list_of_snps(snp_list, verbose=False)¶

Prunes the file for a list of variants.

Parameters:

snp_list – list of variants that should at least partially overlap with the variants in the .bim_data attribute of this class
verbose – print information about what is happening
Returns:	self, with only the variants specified in the snp_list.

prune_for_a_region(region)¶

Prunes for a region in the plink file.

Parameters:	region – StartEndRegion region to prune for
Returns:	self with the variants outside the region removed.
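
Pruning for a region comes down to keeping the variants whose chromosome matches and whose position falls within the region, and dropping the other genotype columns. The sketch below shows this on plain Python lists. The `Region` and `Variant` tuples are hypothetical stand-ins for StartEndRegion and the bim data, and treating both region boundaries as inclusive is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-ins for StartEndRegion and the bim file entries.
Region = namedtuple("Region", ["chromosome", "start", "end"])
Variant = namedtuple("Variant", ["name", "chromosome", "position"])


def prune_for_region(variants, genotypes, region):
    """Keep only the variants (and genotype columns) inside the region."""
    keep = [
        index
        for index, variant in enumerate(variants)
        if variant.chromosome == region.chromosome
        and region.start <= variant.position <= region.end
    ]
    pruned_variants = [variants[index] for index in keep]
    pruned_genotypes = [[row[index] for index in keep] for row in genotypes]
    return pruned_variants, pruned_genotypes


# Made-up data: two samples (rows) by four variants (columns).
variants = [Variant("rs1", "1", 500), Variant("rs2", "1", 1500),
            Variant("rs3", "2", 1500), Variant("rs4", "1", 2000)]
genotypes = [[0, 1, 2, 0], [2, 1, 0, 1]]
kept_variants, kept_genotypes = prune_for_region(
    variants, genotypes, Region("1", 1000, 2000))
```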

read_bed_file_into_numpy_array(allele_2_as_zero=True, missing_encoding=3, dtype=<class 'float'>)¶

Reads a bed file into a numpy array

Parameters:

allele_2_as_zero – bool; allele 2 (often the major allele in plink) is encoded as zero, so that a higher number means more copies of the minor allele. This is opposite to the plinkio encoding, but it often makes the minor allele the effect allele as well.
missing_encoding – int; value used to encode missing genotypes. The default is three, as in plinkio.

Returns:

n individuals by m variants numpy array (floats) of genotypes.
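
In the plink 1 binary format, a variant-major bed file starts with the three bytes 0x6c 0x1b 0x01, after which every variant occupies ceil(n_samples / 4) bytes. Each byte holds four genotypes of two bits each, starting at the low-order bits: 00 is homozygous for allele 1, 01 is missing, 10 is heterozygous and 11 is homozygous for allele 2. The sketch below decodes such bytes into allele 1 counts, which corresponds to the "allele 2 as zero" encoding. It returns nested lists instead of a numpy array, and the function name is made up for this example.

```python
def decode_bed_bytes(raw, n_samples, n_variants, missing_encoding=3):
    """Decode variant-major bed bytes into a samples x variants list of lists."""
    if raw[:3] != b"\x6c\x1b\x01":
        raise ValueError("not a variant-major plink bed file")
    # Two-bit code -> number of copies of allele 1 (allele 2 counts as zero).
    decoder = {0b00: 2, 0b01: missing_encoding, 0b10: 1, 0b11: 0}
    bytes_per_variant = (n_samples + 3) // 4
    genotypes = [[0] * n_variants for _ in range(n_samples)]
    for variant in range(n_variants):
        start = 3 + variant * bytes_per_variant
        block = raw[start:start + bytes_per_variant]
        for sample in range(n_samples):
            byte = block[sample // 4]
            code = (byte >> (2 * (sample % 4))) & 0b11
            genotypes[sample][variant] = decoder[code]
    return genotypes


# Five samples and two variants, written by hand for this example.
# Variant 1 codes per sample: 00, 01, 10, 11, 00 -> bytes 0xE4, 0x00.
# Variant 2 codes per sample: 11 for everyone   -> bytes 0xFF, 0x03.
raw = bytes([0x6C, 0x1B, 0x01, 0xE4, 0x00, 0xFF, 0x03])
decoded = decode_bed_bytes(raw, n_samples=5, n_variants=2)
```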

genome_integration.utils.plink_utils.isolate_snps_of_interest_make_bed(ma_file, exposure_name, b_file, tmp_file_prepend, plink_files_out, calculate_ld=False, individuals_to_isolate=None, no_palindromic=False)¶

Isolate snps of interest for a gene, and make a bed file

Parameters:	ma_file – exposure_name – b_file – tmp_file_prepend – plink_files_out – calculate_ld –
Returns:	the name of the bed file with only the SNPs

genome_integration.utils.plink_utils.plink_isolate_clump(bed_file, associations, threshold, r_sq=0.5, tmploc='', return_snp_file=False)¶

Prunes a list of SNPs for LD, using a bed file location, and outputs the list of SNPs that remain after pruning.

Parameters:	bed_file – associations –
Returns:	list of snps after prune

genome_integration.utils.plink_utils.read_region_from_plink(bed_file, out_location, region, variants=None)¶

Reads a region from a plink file and writes it to an output.

This function can be used if you want to read only a small part of a plink file.

Parameters:

bed_file – str; prepend file location of a bed file.
out_location – prepend file location of the pruned file.
region – StartEndRegion; region to look for.
variants – iterable of str; iterable containing the variant names to keep for analysis.
Returns:	None

genome_integration.utils.plink_utils.score_and_assess_auc(genetic_associations, bed_file, tmp_file='tmp_score', p_value_thresh=1.0, resolution=500)¶

Uses scoring to determine the AUC.

Parameters:	genetic_associations – bed_file – tmp_file – p_value_thresh – resolution –
Returns:
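
The exact procedure is not documented on this page. One plausible reading of the `resolution` parameter is the number of score thresholds used to trace the ROC curve, and the sketch below follows that reading: it sweeps evenly spaced thresholds, collects (false positive rate, true positive rate) points and integrates them with the trapezoid rule. This is an assumption and not the package's confirmed method; the function name and the data are made up.

```python
def auc_by_threshold_sweep(scores, labels, resolution=500):
    """Approximate the AUC by sweeping evenly spaced score thresholds."""
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both cases and controls are needed")
    low, high = min(scores), max(scores)
    points = [(0.0, 0.0)]  # a threshold above every score calls nobody a case
    for step in range(resolution + 1):
        threshold = low + (high - low) * step / resolution
        tp = sum(1 for s, l in zip(scores, labels) if l and s >= threshold)
        fp = sum(1 for s, l in zip(scores, labels) if not l and s >= threshold)
        points.append((fp / n_neg, tp / n_pos))
    points.sort()
    # Trapezoid rule over the sorted ROC points.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc


# Made-up scores: cases (label 1) score higher than controls (label 0).
separated = auc_by_threshold_sweep([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
reversed_order = auc_by_threshold_sweep([0.8, 0.9, 0.1, 0.2], [0, 0, 1, 1])
```

Perfectly separated scores give an AUC of 1, and perfectly reversed scores give an AUC of 0.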

genome_integration.utils.plink_utils.score_individuals(genetic_associations, bed_file, tmp_file='tmp_score', p_value_thresh=1)¶

Used to score individuals.

Parameters:

genetic_associations –
bed_file – prepend of a bed file
tmp_file – prepend of temporary files.
p_value_thresh – p value threshold that the genetic associations must meet to be included in the score.

Returns:

dict with keys corresponding to individuals; values: tuple with the phenotype [0] and score [1] of the individual.
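
A common way to compute such a score is the sum, over the selected variants, of effect size times allele count. The sketch below shows that calculation together with the return shape described above, a dict of (phenotype, score) tuples keyed by individual. It is a stand-alone illustration, not the package's implementation. The input layout, the function name and the choice to include associations with a p value at or below the threshold are assumptions, and all values are made up.

```python
def score_individuals_sketch(associations, dosages, phenotypes,
                             p_value_thresh=1.0):
    """Return {individual: (phenotype, score)}.

    associations: {snp: (beta, p_value)}
    dosages:      {individual: {snp: effect allele count}}
    phenotypes:   {individual: phenotype}
    """
    # Keep only the associations that pass the p value threshold.
    selected = {snp: beta for snp, (beta, p_value) in associations.items()
                if p_value <= p_value_thresh}
    scores = {}
    for individual, counts in dosages.items():
        score = sum(beta * counts[snp]
                    for snp, beta in selected.items() if snp in counts)
        scores[individual] = (phenotypes[individual], score)
    return scores


# Made-up data, for illustration only.
associations = {"rs1": (0.5, 1e-9), "rs2": (-0.2, 0.3)}
dosages = {"fam1~__~ind1": {"rs1": 2, "rs2": 1},
           "fam1~__~ind2": {"rs1": 0, "rs2": 2}}
phenotypes = {"fam1~__~ind1": 1, "fam1~__~ind2": 0}

all_variants = score_individuals_sketch(associations, dosages, phenotypes)
strict = score_individuals_sketch(associations, dosages, phenotypes,
                                  p_value_thresh=0.05)
```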