module: genome_integration.utils

These classes and functions provide miscellaneous utilities.

genome_integration.utils.file_utils.read_newline_separated_file_into_list(file_name)

Reads a file into a list of strings.

Parameters:file_name – str filename to read
Returns:list of str newline characters are removed.
genome_integration.utils.file_utils.read_newline_separated_gz_file_into_list(filename)

Reads a gzipped file into a list of strings.

Parameters:file_name – str filename to read
Returns:list of str newline characters are removed.
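
As a rough illustration of the behaviour described above, the following sketch reads a gzipped text file into a list of strings with the newline characters removed. It is not the library's implementation: read_gz_lines_sketch is a hypothetical name, and UTF-8 text is assumed.

```python
import gzip
import os
import tempfile


def read_gz_lines_sketch(file_name):
    """Return the lines of a gzipped text file without trailing newlines."""
    # "rt" opens the compressed stream in text mode instead of bytes.
    with gzip.open(file_name, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Small round trip on a temporary file to show the shape of the result.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "snps.txt.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("rs1\nrs2\nrs3\n")
    gz_lines = read_gz_lines_sketch(path)
```
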
genome_integration.utils.file_utils.write_list_to_newline_separated_file(in_list, file_name)

Convenience function: Writes list into newline separated file.

Parameters:
  • in_list – list of str list to write. The strings must not already end with a newline.
  • file_name – str out file name
Returns:

None
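
The write and read helpers above are meant to be each other's inverse. A minimal sketch of that round trip, assuming UTF-8 text files, is given below; write_lines_sketch and read_lines_sketch are hypothetical stand-ins, not the functions shipped in file_utils.

```python
import os
import tempfile


def write_lines_sketch(in_list, file_name):
    """Write every string on its own line; items must not end in a newline."""
    with open(file_name, "w", encoding="utf-8") as handle:
        for item in in_list:
            handle.write(item + "\n")


def read_lines_sketch(file_name):
    """Read a text file into a list of strings without newline characters."""
    with open(file_name, "r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Write a list of SNP names and read it back from a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "snps.txt")
    original = ["rs1", "rs2", "rs3"]
    write_lines_sketch(original, path)
    round_trip = read_lines_sketch(path)
```

If an item already ended with a newline, the file would gain an empty line and the round trip would no longer return the original list, which is why the input strings should be newline free.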

class genome_integration.utils.gcta_cojo_utils.CojoCmaFile(file_loc, name)

Contains the GCTA COJO CMA file results. The conditional joint associations are stored in ma_results.

name: str
name of the association
ma_results: dict
dict with SNP names as keys and CojoCmaLine objects as values.
class genome_integration.utils.gcta_cojo_utils.CojoCmaLine(line, name)

This is a helper class around the GeneticAssociation class that adds the information that GCTA also keeps.

From the PCTG GCTA documentation, the columns are:

chromosome; SNP; physical position; frequency of the effect allele in the original data; the effect allele; effect size, standard error and p-value from the original GWAS or meta-analysis; estimated effective sample size; frequency of the effect allele in the reference sample; effect size, standard error and p-value from a joint analysis of all the selected SNPs; LD correlation between the SNP i and SNP i + 1 for the SNPs on the list.

Attributes specific to this class, rest is inherited from GeneticAssociation:

beta_initial: float
initial beta from input
se_intitial: float
initial se from input
p_intitial: float
initial p value from input
freq_geno: float
frequency of the effect allele in the reference population
n_estimated:
estimated effective sample size (number of individuals)
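
The column description above can be read as a mapping from column names to values. The sketch below parses one whitespace-separated result line against its header into a dict. The header names are assumed to follow the layout of a GCTA joint-analysis output file, the numbers are made up for illustration, and parse_cojo_line_sketch is a hypothetical helper rather than the CojoCmaLine constructor.

```python
def parse_cojo_line_sketch(header_line, result_line):
    """Pair header names with the fields of one result line."""
    names = header_line.split()
    values = result_line.split()
    if len(names) != len(values):
        raise ValueError("header and line have a different number of columns")

    text_columns = {"Chr", "SNP", "refA"}  # kept as strings
    parsed = {}
    for name, value in zip(names, values):
        if name in text_columns:
            parsed[name] = value
        elif name == "bp":
            parsed[name] = int(value)  # physical position
        else:
            parsed[name] = float(value)  # frequencies, effects, p values
    return parsed


# Illustrative header and values only; not output of a real analysis.
header = "Chr SNP bp refA freq b se p n freq_geno bJ bJ_se pJ LD_r"
line = "1 rs123 1000 A 0.25 0.10 0.02 5.7e-07 50000.0 0.24 0.11 0.02 3.8e-08 0.0"
record = parse_cojo_line_sketch(header, line)
```

Parsing by header name rather than by fixed position keeps the sketch independent of the exact column order in a given GCTA version.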
genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_joint(bfile_prepend, ma_file, out_prepend, p_val='1e-8', maf='0.01', gc=1.0)

Does GCTA COJO stepwise selection and joint effects

Parameters:
  • bfile_prepend – bedfile location
  • ma_file – ma file location
  • out_prepend – where to output
  • p_val – p value threshold for stepwise selection
  • maf – minor allele frequency threshold for stepwise selection
  • gc – genomic correction factor, default is 1.0. Make sure to check this in your associations
Returns:

CojoCmaFile object with the results.

genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_joint_on_only_snps_genetic_associations(genetic_associations, bfile, tmp_prepend, _keep_ma_files=False)
Parameters:
  • genetic_associations – a dict of genetic associations, keys should be explanatory names
  • bfile – plink bed file
  • snps_to_condition_on – list like object of SNPs that are in the genetic associations object on which we condition.
  • tmp_prepend – temporary name of files where to store.
  • _keep_ma_files – This is used for testing. MA files are used to know the exact floating point input.
Returns:

COJO results as a CojoCmaFile object, which is an extension of the genetic association file.

genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_on_genetic_associations(genetic_associations, bfile, tmp_prepend, p_val_thresh=0.05, maf=0.01, calculate_ld=False, clump=False, create_tmp_subset_of_bed=True, individuals_to_analyze=None)
Parameters:
  • genetic_associations – a dict of genetic associations, keys should be explanatory names
  • bfile – plink bed file
  • tmp_prepend – temporary name of files where to store.
  • p_val_thresh – p value threshold as a float
  • maf – minor allele frequency as a float
Returns:

COJO results as a CojoCmaFile object, which is an extension of the genetic association file.

genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_slct(bfile_prepend, ma_file, out_prepend, p_val=1e-08, maf=0.01, gc=1.0, n_threads=1)

Does GCTA COJO stepwise selection (no joint effects)

Parameters:
  • bfile_prepend – bedfile location
  • ma_file – ma file location
  • out_prepend – where to output
  • p_val – p value threshold for stepwise selection
  • maf – minor allele frequency threshold for stepwise selection
  • gc – genomic correction factor, default is 1.0. Make sure to check this in your associations
Returns:

CojoCmaFile object with the results.
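
The parameters above map naturally onto GCTA command-line options. The sketch below only builds an argument list for a stepwise selection run and does not execute anything. How the library itself invokes GCTA is not shown in this reference, so the binary name gcta64, the choice of options and the helper name build_cojo_slct_command_sketch are all assumptions.

```python
def build_cojo_slct_command_sketch(bfile_prepend, ma_file, out_prepend,
                                   p_val=1e-8, maf=0.01, gc=1.0,
                                   n_threads=1, gcta_binary="gcta64"):
    """Build (but do not run) a GCTA COJO stepwise selection command."""
    return [
        gcta_binary,                     # assumed name of the GCTA executable
        "--bfile", str(bfile_prepend),   # plink bed/bim/fam prefix
        "--cojo-file", str(ma_file),     # summary statistics in ma format
        "--cojo-slct",                   # stepwise model selection
        "--cojo-p", str(p_val),          # p value threshold for selection
        "--maf", str(maf),               # minor allele frequency filter
        "--cojo-gc", str(gc),            # genomic control factor
        "--thread-num", str(n_threads),  # number of threads
        "--out", str(out_prepend),       # output prefix
    ]


command = build_cojo_slct_command_sketch("ref_panel", "trait.ma", "trait_cojo")
```

A list like this could be handed to subprocess.run(command, check=True); passing a list rather than a single string avoids shell quoting problems with file names.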

genome_integration.utils.gcta_cojo_utils.do_gcta_joint_on_specified_snps(bfile_prepend, ma_file, out_prepend, gc=1.0)

Does GCTA COJO joint.

Parameters:
  • bfile_prepend – bedfile location
  • ma_file – ma file location
  • out_prepend – where to output
  • snp_file_name – this contains the variant on which it should be conditional
  • p_val – p value threshold for stepwise selection
  • gc – genomic correction factor, default is 1.0. Make sure to check this in your associations
Returns:

CojoCmaFile object with the results.

genome_integration.utils.gcta_cojo_utils.make_gcta_ma_header()

Creates an ma header for GCTA-COJO.

Returns:String with an ma file header.
genome_integration.utils.gcta_cojo_utils.make_gcta_ma_line(genetic_association)

Makes a GCTA line of the genetic variant.

Returns only a string without a trailing newline and does not write to a file; writing is left to the caller.

Parameters:genetic_association – genetic association class object.
Returns:tab separated string that can be part of an ma file.
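
To make the header and line helpers concrete, the sketch below builds both for the eight-column ma layout used by GCTA-COJO (SNP, A1, A2, freq, b, se, p, N). AssociationSketch is a hypothetical stand-in for the GeneticAssociation class, and its attribute names are assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class AssociationSketch:
    """Hypothetical stand-in for a genetic association object."""
    snp_name: str
    effect_allele: str
    other_allele: str
    effect_allele_freq: float
    beta: float
    se: float
    p_value: float
    n_observations: int


MA_COLUMNS = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]


def make_ma_header_sketch():
    """Return the ma header as a tab separated string without a newline."""
    return "\t".join(MA_COLUMNS)


def make_ma_line_sketch(assoc):
    """Return one tab separated ma line without a trailing newline."""
    fields = [assoc.snp_name, assoc.effect_allele, assoc.other_allele,
              assoc.effect_allele_freq, assoc.beta, assoc.se,
              assoc.p_value, assoc.n_observations]
    return "\t".join(str(field) for field in fields)


association = AssociationSketch("rs123", "A", "G", 0.25, 0.1, 0.02, 1e-5, 5000)
ma_header = make_ma_header_sketch()
ma_line = make_ma_line_sketch(association)
```

Because neither string ends in a newline, the caller joins header and lines with "\n" when writing the file.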

class genome_integration.utils.gcta_ma_utils.MaFile(file_loc, name)

Implements a reader and writer to the MA file as described by GCTA COJO

name: str
name of the file.
ma_results: dict
keys are the SNP names and values are of class MaLine
snp_names(self, no_palindromic = False)
return the snp names in the mafile
write_results(self, file_name)
write the results to a file name
add_bim_data(self, bim_data)
add the position and chromosome from a BimFile reference to all the SNPS.
add_bim_data(bim_data)

add the position and chromosome from a BimFile reference to all the SNPS.

Parameters:bim_data
Returns:None
snp_names(no_palindromic=False)
return the snp names in the mafile
Parameters:no_palindromic – bool If also palindromic SNPs should be returned (default is False)
Returns:list of SNP names.
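
For context on the no_palindromic option: a SNP is palindromic when its two alleles are each other's complement (A/T or C/G), so the strand cannot be inferred from the alleles alone. The sketch below shows one way to test for this; is_palindromic_sketch and the example alleles are illustrative and do not describe how MaFile implements the filter.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic_sketch(allele_1, allele_2):
    """True when the two alleles are complementary bases (A/T or C/G)."""
    # Alleles that are not single bases have no entry and are never palindromic.
    return COMPLEMENT.get(allele_1.upper()) == allele_2.upper()


# Illustrative SNP names and alleles.
alleles = {"rs1": ("A", "T"), "rs2": ("A", "G"), "rs3": ("C", "G")}
non_palindromic = [name for name, (a1, a2) in alleles.items()
                   if not is_palindromic_sketch(a1, a2)]
```
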
write_result(file_name)

Write the results to a file.

Parameters:file_name – str file name to write to
Returns:None
class genome_integration.utils.gcta_ma_utils.MaLine(line)

Class that inherits from GeneticAssociation, reading an MA line.

All the attributes of the GeneticAssociation class.

implements a fam file

fam_loc: str
path to the fam file.
sample_names: list
sample names of the fam. Currently using <fid>~__~<iid>
fam_samples: dict
dictionary with sample names as keys and FamSample as the value.

Extends the sample class, will also contain family ID and sample ID.

Attributes specific to this class:

fid:
fid from the plink file
iid:
iid from the plink file
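
The sample naming scheme mentioned above, <fid>~__~<iid>, can be sketched as follows. The parser assumes the standard six-column plink fam layout (family ID, individual ID, father, mother, sex, phenotype); parse_fam_lines_sketch is a hypothetical helper that returns plain dicts instead of FamSample objects.

```python
def parse_fam_lines_sketch(lines):
    """Map '<fid>~__~<iid>' sample names to the fields of each fam line."""
    samples = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            raise ValueError("a fam line should have six columns")
        fid, iid, father, mother, sex, phenotype = fields
        sample_name = f"{fid}~__~{iid}"
        samples[sample_name] = {
            "fid": fid, "iid": iid, "father": father,
            "mother": mother, "sex": sex, "phenotype": phenotype,
        }
    return samples


# Illustrative fam content for two individuals of one family.
fam_lines = ["fam1 ind1 0 0 1 2", "fam1 ind2 0 0 2 1"]
fam_samples = parse_fam_lines_sketch(fam_lines)
sample_names = list(fam_samples)
```

Combining both IDs in the name keeps samples unique even when individual IDs are reused across families.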

A reader for the plink bed file format.

requires a valid location of the bed file.

bed_loc: str
location of the bed file
bim_loc: str
location of the bim file
fam_loc: str
location of the fam file
genotypes: float numpy array with variants on the columns and samples on the rows.
Numpy array of genotypes; the default is the number of minor alleles per variant and position. Not initialized on construction; call read_bed_file_into_numpy_array first.
bim_data: BimFile object.
contains all the variants from the bim file.
fam_data: FamFile object
contains all the samples from the fam file
_decoder: dict
bit dict of the file encoding.
read_bed_file_into_numpy_array(self, allele_2_as_zero=True, missing_encoding=3)
Reads in the bed file and saves it to the class. Takes about 5 seconds for 5,000 individuals and ~14,000 variants.
prune_for_a_region(self, region)
Prunes for a StartEndRegion.
harmonize_genotypes(self, other_plink_file)
Harmonizes another PlinkFile object to this file's alleles: flips alleles and flips genotypes, so MAJOR and minor will be false names.

Harmonizes the other plink file to the alleles of self. Requires that the variants are the same between files and that the variants have the same alleles.

WARNING: Will mean that the other plink file will have flipped major and minor alleles.

Parameters:other_plink_file – Plink File to harmonize
Returns:PlinkFile
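
The allele and genotype flip described above can be sketched on a plain numpy array. Where a variant has its two alleles in the opposite order from the reference, the allele count g becomes 2 - g, and missing values are left alone. The function name, the (allele_1, allele_2) tuples and the missing value of 3 are assumptions for this example; this is not the PlinkFile method.

```python
import numpy as np


def flip_to_reference_sketch(genotypes, alleles, reference_alleles,
                             missing_encoding=3):
    """Return a copy of genotypes (samples x variants) counted on the
    reference allele order."""
    harmonized = np.array(genotypes, dtype=float)  # copy, input is untouched
    for j, (own, ref) in enumerate(zip(alleles, reference_alleles)):
        if own == ref:
            continue  # same allele order, nothing to do
        if own != (ref[1], ref[0]):
            raise ValueError(f"variant {j} does not share alleles")
        column = harmonized[:, j]  # view into the copy
        observed = column != missing_encoding
        column[observed] = 2.0 - column[observed]
    return harmonized


genotypes = [[0, 2], [1, 3], [2, 0]]  # 3 samples, 2 variants
alleles = [("A", "G"), ("T", "C")]
reference_alleles = [("A", "G"), ("C", "T")]  # second variant is swapped
flipped = flip_to_reference_sketch(genotypes, alleles, reference_alleles)
```
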

Writes a bed file to the final list.

Parameters:output_prepend – the output prepend, to which .bed, .bim and .fam are appended.
Returns:Nothing, but the bed, bim and fam files are written

Prunes to a list of variants.

Parameters:
  • snp_list – list of variants that should be at least partially overlapping with the variants in the .bim_data attribute of this class
  • verbose – print things about what is happening
Returns:

self, with only the variants specified in the snp_list.

Prunes for a region in the plink file.

Parameters:region – StartEndRegion region to prune for
Returns:self with the variants outside the region removed.
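
Pruning for a region amounts to selecting the genotype columns whose variants fall inside it. The following sketch does this with a boolean mask. Inclusive start and end positions are an assumption, and the plain chromosome, start and end arguments stand in for a StartEndRegion object.

```python
import numpy as np


def prune_region_sketch(genotypes, chromosomes, positions,
                        chromosome, start, end):
    """Keep the columns of genotypes (samples x variants) inside a region."""
    chromosomes = np.asarray(chromosomes).astype(str)
    positions = np.asarray(positions)
    # Inclusive bounds are an assumption made for this sketch.
    keep = ((chromosomes == str(chromosome))
            & (positions >= start) & (positions <= end))
    return np.asarray(genotypes)[:, keep], keep


region_genotypes = np.array([[0, 1, 2, 0],
                             [2, 1, 0, 1]])
variant_chromosomes = ["1", "1", "1", "2"]
variant_positions = [100, 200, 300, 200]
pruned, kept_mask = prune_region_sketch(region_genotypes, variant_chromosomes,
                                        variant_positions, 1, 150, 300)
```

The same mask would also be applied to the variant annotation so that genotype columns and bim entries stay aligned.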

Reads a bed file into a numpy array

Parameters:
  • allele_2_as_zero – bool allele 2 (often the major allele in plink) is encoded as zero, making an increase in the minor allele an increase in number. This is opposite to the plinkio encoding. But this makes the minor allele often also the effect allele.
  • missing_encoding – int encodes missing values as three, plinkio default.
Returns:

n individuals by m variants numpy array (floats) of genotypes.
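
To illustrate what reading a bed file into a numpy array involves, the sketch below decodes the bytes of a variant-major plink bed file: three magic bytes, then for each variant ceil(n / 4) bytes with four 2-bit genotypes per byte, lowest bits first. The codes 00, 01, 10 and 11 stand for homozygous allele 1, missing, heterozygous and homozygous allele 2. decode_bed_bytes_sketch is a hypothetical function written for this example and is not the PlinkFile reader.

```python
import numpy as np

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # plink bed, variant-major mode


def decode_bed_bytes_sketch(raw, n_samples, n_variants,
                            allele_2_as_zero=True, missing_encoding=3):
    """Decode bed bytes into a samples x variants float array."""
    if raw[:3] != BED_MAGIC:
        raise ValueError("not a variant-major plink bed file")
    bytes_per_variant = (n_samples + 3) // 4
    body = np.frombuffer(raw, dtype=np.uint8, offset=3)
    if body.size != bytes_per_variant * n_variants:
        raise ValueError("file size does not match the given dimensions")
    body = body.reshape(n_variants, bytes_per_variant)

    # Index by 2-bit code: 0 hom allele 1, 1 missing, 2 het, 3 hom allele 2.
    if allele_2_as_zero:
        lookup = np.array([2.0, missing_encoding, 1.0, 0.0])
    else:
        lookup = np.array([0.0, missing_encoding, 1.0, 2.0])

    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (body[:, :, None] >> shifts) & 3       # four codes per byte
    codes = codes.reshape(n_variants, -1)[:, :n_samples]  # drop padding
    return lookup[codes].T                          # samples on the rows


# Two variants, four samples: one byte per variant after the magic bytes.
raw_bed = BED_MAGIC + bytes([0xDC, 0x0F])
decoded = decode_bed_bytes_sketch(raw_bed, n_samples=4, n_variants=2)
```

With allele_2_as_zero=True the values count copies of allele 1, which matches the description above of allele 2 being encoded as zero.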

Isolate snps of interest for a gene, and make a bed file

Parameters:
  • ma_file
  • exposure_name
  • b_file
  • tmp_file_prepend
  • plink_files_out
  • calculate_ld
Returns:

the name of the bed file with only the SNPs

Prunes a list of SNPs for LD, using a bed file location, and outputs the list after pruning.

Parameters:
  • bed_file
  • associations
Returns:

list of snps after prune

Reads a region from a plink file and writes it to an output.

This function can be used if you want to read only a small part of a plink file.

Parameters:
  • bed_file – str prepend file location of a bed file.
  • out_location – prepend file location of the pruned file.
  • region – StartEndRegion Region to look for.
  • variants – iterable of str iterable containing the variant names to keep for analysis.
Returns:

None

Using scoring, we determine the auc

Parameters:
  • genetic_associations
  • bed_file
  • tmp_file
  • p_value_thresh
  • resolution
Returns:

Used to score individuals.

Parameters:
  • genetic_associations
  • bed_file – prepend of a bed file
  • tmp_file – prepend of temporary files.
  • p_value_thresh – p value threshold that the genetic associations should meet to be included.
Returns:

dict with keys corresponding to individuals and values a tuple with the phenotype [0] and score [1] of the individual.
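
A common way to score individuals and derive an AUC from those scores is sketched below: the score is the sum of effect size times allele count over the variants that pass the p value threshold, and the AUC is the fraction of case and control pairs in which the case has the higher score, with ties counted as half. Both functions are hypothetical illustrations on in-memory arrays. Whether the library uses this exact formula, this threshold direction (p <= threshold is assumed here) or an external tool is not stated in this reference, and missing genotypes are not handled.

```python
import numpy as np


def score_individuals_sketch(genotypes, betas, p_values, p_value_thresh):
    """Sum beta * allele count over variants with p <= p_value_thresh."""
    genotypes = np.asarray(genotypes, dtype=float)  # samples x variants
    betas = np.asarray(betas, dtype=float)
    selected = np.asarray(p_values, dtype=float) <= p_value_thresh
    return genotypes[:, selected] @ betas[selected]


def auc_sketch(scores, is_case):
    """Rank-based AUC: P(case score > control score), ties count as half."""
    scores = np.asarray(scores, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    case_scores = scores[is_case]
    control_scores = scores[~is_case]
    if case_scores.size == 0 or control_scores.size == 0:
        raise ValueError("need at least one case and one control")
    greater = (case_scores[:, None] > control_scores[None, :]).sum()
    ties = (case_scores[:, None] == control_scores[None, :]).sum()
    return (greater + 0.5 * ties) / (case_scores.size * control_scores.size)


# Illustrative data: four individuals, three variants.
score_genotypes = [[2, 0, 1], [1, 1, 0], [0, 2, 2], [0, 0, 1]]
scores = score_individuals_sketch(score_genotypes, [0.5, -0.2, 0.3],
                                  [1e-9, 0.2, 1e-6], p_value_thresh=1e-5)
auc = auc_sketch(scores, [True, True, False, False])
```

The pairwise comparison is quadratic in the number of individuals, which is fine for a sketch; a rank-based formula would scale better on large cohorts.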