module: genome_integration.utils¶
These classes and functions provide miscellaneous utilities.
-
genome_integration.utils.file_utils.read_newline_separated_file_into_list(file_name)¶ Reads a file into a list of strings.
Parameters: file_name – str filename to read Returns: list of str newline characters are removed.
-
genome_integration.utils.file_utils.read_newline_separated_gz_file_into_list(filename)¶ Reads a gzipped file into a list of strings.
Parameters: file_name – str filename to read Returns: list of str newline characters are removed.
-
genome_integration.utils.file_utils.write_list_to_newline_separated_file(in_list, file_name)¶ Convenience function: Writes list into newline separated file.
Parameters: - in_list – list of str list to write. Items should not end with newline characters.
- file_name – str out file name
Returns: None
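The three file helpers above can be pictured with the standard library alone. The following is a minimal sketch of the documented behaviour, not the library's implementation; the function names `read_lines`, `read_gz_lines` and `write_lines` are hypothetical stand-ins.

```python
import gzip
import os
import tempfile


def read_lines(file_name):
    """Read a plain-text file into a list of str, newline characters removed."""
    with open(file_name, "r") as handle:
        return [line.rstrip("\n") for line in handle]


def read_gz_lines(file_name):
    """Read a gzipped text file into a list of str, newline characters removed."""
    with gzip.open(file_name, "rt") as handle:
        return [line.rstrip("\n") for line in handle]


def write_lines(in_list, file_name):
    """Write every item of in_list on its own line."""
    with open(file_name, "w") as handle:
        for item in in_list:
            handle.write(item + "\n")


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "snps.txt")
    gz_path = os.path.join(tmp_dir, "snps.txt.gz")
    write_lines(["rs1", "rs2", "rs3"], plain_path)
    with gzip.open(gz_path, "wt") as handle:
        handle.write("rs1\nrs2\n")
    plain_result = read_lines(plain_path)
    gz_result = read_gz_lines(gz_path)
```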
-
class
genome_integration.utils.gcta_cojo_utils.CojoCmaFile(file_loc, name)¶ Contains the GCTA COJO CMA file results. The conditional joint associations are located in the ma results part of the file.
- name: str
- name of the association
- ma_results: dict
- dict with snp_names as keys, and the CojoCmaLine as values.
-
class
genome_integration.utils.gcta_cojo_utils.CojoCmaLine(line, name)¶ This is a helper class around the GeneticAssociation class, but adds the information that GCTA also keeps.
From the PCTG, GCTA documentation: Columns are:
chromosome; SNP; physical position; frequency of the effect allele in the original data; the effect allele; effect size, standard error and p-value from the original GWAS or meta-analysis; estimated effective sample size; frequency of the effect allele in the reference sample; effect size, standard error and p-value from a joint analysis of all the selected SNPs; LD correlation between the SNP i and SNP i + 1 for the SNPs on the list.
Attributes specific to this class, rest is inherited from GeneticAssociation:
- beta_initial: float
- initial beta from input
- se_intitial: float
- initial se from input
- p_intitial: float
- initial p value from input
- freq_geno: float
- frequency of the effect allele in the reference population
- n_estimated:
- estimated number of individuals in the population (the estimated effective sample size)
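To make the column description above concrete, here is a small sketch that maps one tab-separated COJO result line onto named fields by using a header line. The header string, the dictionary keys and the function name `parse_cojo_line` are assumptions made for illustration; check the header against your own GCTA output. This is not how CojoCmaLine itself is implemented.

```python
# Assumed header of a GCTA COJO result file; verify against your own output.
COJO_HEADER = "Chr\tSNP\tbp\trefA\tfreq\tb\tse\tp\tn\tfreq_geno\tbJ\tbJ_se\tpJ\tLD_r"


def parse_cojo_line(line, header=COJO_HEADER):
    """Return a dict with the original and joint estimates of one result line."""
    names = header.split("\t")
    values = line.rstrip("\n").split("\t")
    if len(names) != len(values):
        raise ValueError("line does not match the header")
    raw = dict(zip(names, values))
    return {
        "snp_name": raw["SNP"],
        "chromosome": raw["Chr"],
        "position": int(raw["bp"]),
        "effect_allele": raw["refA"],
        # Estimates from the original GWAS or meta-analysis.
        "beta_initial": float(raw["b"]),
        "se_initial": float(raw["se"]),
        "p_initial": float(raw["p"]),
        # Reference sample information.
        "freq_geno": float(raw["freq_geno"]),
        "n_estimated": float(raw["n"]),
        # Estimates from the joint analysis.
        "beta_joint": float(raw["bJ"]),
        "se_joint": float(raw["bJ_se"]),
        "p_joint": float(raw["pJ"]),
    }


example_line = "\t".join(
    ["1", "rs123", "1000", "A", "0.25", "0.1", "0.02", "1e-6",
     "5000", "0.24", "0.09", "0.021", "2e-5", "0.05"]
)
parsed = parse_cojo_line(example_line)
```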
-
genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_joint(bfile_prepend, ma_file, out_prepend, p_val='1e-8', maf='0.01', gc=1.0)¶ Does GCTA COJO stepwise selection and joint effects
Parameters: - bfile_prepend – bedfile location
- ma_file – ma file location
- out_prepend – where to output
- p_val – p value threshold for stepwise selection
- maf – minor allele frequency threshold for stepwise selection
- gc – genomic correction factor, default is 1.0. Make sure to check this in your associations
Returns: CojoCmaFile object with the results.
-
genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_joint_on_only_snps_genetic_associations(genetic_associations, bfile, tmp_prepend, _keep_ma_files=False)¶ Parameters: - genetic_associations – a dict of genetic associations, keys should be explanatory names
- bfile – plink bed file
- snps_to_condition_on – list like object of SNPs that are in the genetic associations object on which we condition.
- tmp_prepend – temporary name of files where to store.
- _keep_ma_files – This is used for testing. MA files are used to know the exact floating point input.
Returns: COJO results as a CojoCmaFile object, which is an extension of the genetic association file.
-
genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_on_genetic_associations(genetic_associations, bfile, tmp_prepend, p_val_thresh=0.05, maf=0.01, calculate_ld=False, clump=False, create_tmp_subset_of_bed=True, individuals_to_analyze=None)¶ Parameters: - genetic_associations – a dict of genetic associations, keys should be explanatory names
- bfile – plink bed file
- tmp_prepend – temporary name of files where to store.
- p_val_thresh – p value threshold as a float
- maf – minor allele frequency as a float
Returns: COJO results as a CojoCmaFile object, which is an extension of the genetic association file.
-
genome_integration.utils.gcta_cojo_utils.do_gcta_cojo_slct(bfile_prepend, ma_file, out_prepend, p_val=1e-08, maf=0.01, gc=1.0, n_threads=1)¶ Does GCTA COJO stepwise selection (no joint effects)
Parameters: - bfile_prepend – bedfile location
- ma_file – ma file location
- out_prepend – where to output
- p_val – p value threshold for stepwise selection
- maf – minor allele frequency threshold for stepwise selection
- gc – genomic correction factor, default is 1.0. Make sure to check this in your associations
Returns: CojoCmaFile object with the results.
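As an illustration of how the parameters above might translate into a GCTA call, the sketch below only builds an argument list and does not run anything. The executable name `gcta64`, the helper name `build_cojo_slct_command` and the exact set of flags are assumptions; the command that the library actually issues is not documented here.

```python
def build_cojo_slct_command(bfile_prepend, ma_file, out_prepend,
                            p_val=1e-8, maf=0.01, gc=1.0, n_threads=1,
                            executable="gcta64"):
    """Build (but do not run) a GCTA COJO stepwise selection command."""
    return [
        executable,
        "--bfile", bfile_prepend,      # plink bed/bim/fam prefix
        "--cojo-file", ma_file,        # summary statistics in ma format
        "--cojo-slct",                 # stepwise model selection
        "--cojo-p", str(p_val),        # p value threshold for selection
        "--maf", str(maf),             # minor allele frequency threshold
        "--cojo-gc", str(gc),          # genomic control factor
        "--thread-num", str(n_threads),
        "--out", out_prepend,
    ]


command = build_cojo_slct_command("reference_panel", "exposure.ma", "tmp_cojo")
```

A list like this could be passed to `subprocess.run(command, check=True)` once the paths point at real files.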
-
genome_integration.utils.gcta_cojo_utils.do_gcta_joint_on_specified_snps(bfile_prepend, ma_file, out_prepend, gc=1.0)¶ Does GCTA COJO joint.
Parameters: - bfile_prepend – bedfile location
- ma_file – ma file location
- out_prepend – where to output
- snp_file_name – this contains the variant on which it should be conditional
- p_val – p value threshold for stepwise selection
- gc – genomic correction factor, default is 1.0. Make sure to check this in your associations
Returns: CojoCmaFile object with the results.
-
genome_integration.utils.gcta_cojo_utils.make_gcta_ma_header()¶ Creates an ma file header for GCTA-COJO.
Returns: String with an ma file header.
-
genome_integration.utils.gcta_cojo_utils.make_gcta_ma_line(genetic_association)¶ Makes a GCTA ma line for the genetic variant.
Only returns a string without a trailing newline and does not write to a file; the caller is expected to do that.
Parameters: genetic_association – genetic association class object. Returns: tab separated string that can be part of an ma file.
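The header and line helpers above can be sketched as follows, assuming the GCTA COJO ma column layout `SNP A1 A2 freq b se p N`. The `Association` namedtuple and the function names are hypothetical stand-ins, not the library's GeneticAssociation class or its functions.

```python
from collections import namedtuple

# Hypothetical stand-in for a genetic association object.
Association = namedtuple(
    "Association",
    "snp_name effect_allele other_allele freq beta se p_value n",
)

MA_COLUMNS = ("SNP", "A1", "A2", "freq", "b", "se", "p", "N")


def make_ma_header():
    """Return the ma header as a tab separated string without a newline."""
    return "\t".join(MA_COLUMNS)


def make_ma_line(association):
    """Return one ma line as a tab separated string without a newline."""
    fields = (
        association.snp_name,
        association.effect_allele,
        association.other_allele,
        association.freq,
        association.beta,
        association.se,
        association.p_value,
        association.n,
    )
    return "\t".join(str(field) for field in fields)


example = Association("rs1001", "A", "G", 0.8493, 0.0024, 0.0055, 0.6653, 129850)
# The caller adds the newlines when writing the file.
ma_text = make_ma_header() + "\n" + make_ma_line(example) + "\n"
```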
-
class
genome_integration.utils.gcta_ma_utils.MaFile(file_loc, name)¶ Implements a reader and writer to the MA file as described by GCTA COJO
- name: str
- name of the file.
- ma_results: dict
- keys are the SNP names and values are of class MaLine
- snp_names(self, no_palindromic = False)
- return the snp names in the mafile
- write_result(self, file_name)
- write the results to a file name
- add_bim_data(self, bim_data)
- add the position and chromosome from a BimFile reference to all the SNPS.
-
add_bim_data(bim_data)¶ add the position and chromosome from a BimFile reference to all the SNPS.
Parameters: bim_data – Returns: None
-
snp_names(no_palindromic=False)¶ Returns the SNP names in the MA file.
Parameters: no_palindromic – bool If also palindromic SNPs should be returned (default is False) Returns: list of SNP names.
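A palindromic SNP is one whose two alleles are each other's complement (A/T or C/G), so its strand cannot be resolved from the alleles alone. The sketch below shows one way such a filter could work, assuming that `no_palindromic=True` drops those SNPs; the helper names and the input dictionary are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(allele_1, allele_2):
    """True when the two alleles are complementary bases (A/T or C/G)."""
    return COMPLEMENT.get(allele_1.upper()) == allele_2.upper()


def snp_names(alleles_by_snp, no_palindromic=False):
    """Return SNP names, optionally leaving out palindromic SNPs."""
    return [
        name
        for name, (allele_1, allele_2) in alleles_by_snp.items()
        if not (no_palindromic and is_palindromic(allele_1, allele_2))
    ]


alleles = {"rs1": ("A", "G"), "rs2": ("A", "T"), "rs3": ("C", "G")}
all_names = snp_names(alleles)
filtered_names = snp_names(alleles, no_palindromic=True)
```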
-
write_result(file_name)¶ write the results to a file name
Parameters: file_name – str file name to write to Returns: None
-
class
genome_integration.utils.gcta_ma_utils.MaLine(line)¶ Class that inherits from GeneticAssociation, reading an MA line.
All the attributes of the GeneticAssociation class.
-
class
genome_integration.utils.plink_utils.FamFile(fam_loc)¶ implements a fam file
- fam_loc: str
- path to the fam file.
- sample_names: list
- sample names of the fam. Currently using <fid>~__~<iid>
- fam_samples: dict
- dictionary with sample names as keys and FamSample as the value.
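As a small illustration of the `<fid>~__~<iid>` naming scheme, the sketch below parses one line of a plink fam file (family ID, individual ID, father, mother, sex, phenotype) into a dictionary. The function name and the dictionary layout are hypothetical and do not mirror FamSample.

```python
def parse_fam_line(line):
    """Parse one whitespace separated fam line into a dict."""
    fid, iid, father, mother, sex, phenotype = line.split()
    return {
        "fid": fid,
        "iid": iid,
        "sex": sex,
        "phenotype": phenotype,
        # Sample name in the <fid>~__~<iid> form described above.
        "name": f"{fid}~__~{iid}",
    }


fam_lines = ["FAM1 IND1 0 0 1 2", "FAM1 IND2 0 0 2 1"]
fam_samples = {}
for fam_line in fam_lines:
    sample = parse_fam_line(fam_line)
    fam_samples[sample["name"]] = sample
sample_names = list(fam_samples)
```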
-
class
genome_integration.utils.plink_utils.FamSample(fid, iid, sex, name, phenotype=None)¶ Extends the sample class, will also contain family ID and sample ID.
Attributes specific to this class:
- fid: fid from the plink file
- iid: iid from the plink file
-
class
genome_integration.utils.plink_utils.PlinkFile(bfile_location)¶ A reader for the plink bed file format.
requires a valid location of the bed file.
- bed_loc: str
- location of the bed file
- bim_loc: str
- location of the bim file
- fam_loc: str
- location of the fam file
- genotypes: float numpy array with samples on the rows and variants on the columns.
- Numpy array of genotypes; the default is the number of minor alleles per variant and position. Not initialized until read_bed_file_into_numpy_array is called.
- bim_data: BimFile object.
- contains all the variants from the bim file.
- fam_data: FamFile object
- contains all the samples from the fam file
- _decoder: dict
- bit dict of the file encoding.
- read_bed_file_into_numpy_array(self, allele_2_as_zero=True, missing_encoding=3)
- reads in the bed file and saves it to the class. Takes about 5 seconds for 5,000 individuals and ~14,000 variants.
- prune_for_a_region(self, region):
- prunes for a StartEndRegion
- harmonize_genotypes(self, other_plink_file)
- Harmonizes another PlinkFile object to this file’s alleles (flips alleles and flips genotypes). MAJOR and minor will be false names.
-
harmonize_genotypes(other_plink_file)¶ Harmonizes the other plink file to the alleles of self. Requires that the variants are the same between files and that the variants have the same alleles.
WARNING: Will mean that the other plink file will have flipped major and minor alleles.
Parameters: other_plink_file – Plink File to harmonize Returns: PlinkFile
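The flipping step described above can be pictured for a single variant: when the other file lists the same two alleles in the opposite order, the alleles are swapped and each genotype g becomes 2 - g. This is a sketch under the assumption that genotypes are allele counts 0, 1, 2 with a separate missing code; the function name is hypothetical and this is not the library's implementation.

```python
def harmonize_variant(reference_alleles, other_alleles, other_genotypes,
                      missing_encoding=3):
    """Align one variant of the other file to the reference allele order."""
    if other_alleles == reference_alleles:
        # Already in the same orientation, nothing to flip.
        return other_alleles, list(other_genotypes)
    if other_alleles == reference_alleles[::-1]:
        # Same alleles in swapped order: flip the allele counts,
        # leaving missing values untouched.
        flipped = [
            g if g == missing_encoding else 2 - g
            for g in other_genotypes
        ]
        return reference_alleles, flipped
    raise ValueError("variants do not have the same alleles")


alleles, genotypes = harmonize_variant(("A", "G"), ("G", "A"), [0, 1, 2, 3])
```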
-
output_genotypes_to_bed_file(output_prepend)¶ Writes a bed file to the final list.
Parameters: output_prepend – the output prepend, after which .bed, .bim and .fam are appended. Returns: Nothing, but the bed, bim and fam files are written
-
prune_for_a_list_of_snps(snp_list, verbose=False)¶ Prunes the file for a list of variants.
Parameters: - snp_list – list of variants that should be at least partially overlapping with the variants in the .bim_data attribute of this class
- verbose – print things about what is happening
Returns: self, with only the variants specified in the snp_list.
-
prune_for_a_region(region)¶ Prunes for a region in the plink file.
Parameters: region – StartEndRegion region to prune for Returns: self with the variants outside the region removed.
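Region pruning amounts to keeping the variants whose chromosome matches and whose position lies inside the region. The sketch below uses hypothetical `Region` and `Variant` namedtuples in place of StartEndRegion and the bim data, and assumes that both region bounds are inclusive.

```python
from collections import namedtuple

# Hypothetical stand-ins for StartEndRegion and a bim file entry.
Region = namedtuple("Region", "chromosome start end")
Variant = namedtuple("Variant", "name chromosome position")


def variants_in_region(variants, region):
    """Keep the variants on the region's chromosome within [start, end]."""
    return [
        variant
        for variant in variants
        if variant.chromosome == region.chromosome
        and region.start <= variant.position <= region.end
    ]


variants = [
    Variant("rs1", "1", 500),
    Variant("rs2", "1", 1500),
    Variant("rs3", "2", 1500),
]
kept = variants_in_region(variants, Region("1", 1000, 2000))
kept_names = [variant.name for variant in kept]
```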
-
read_bed_file_into_numpy_array(allele_2_as_zero=True, missing_encoding=3, dtype=<class 'float'>)¶ Reads a bed file into a numpy array
Parameters: - allele_2_as_zero – bool allele 2 (often the major allele in plink) is encoded as zero, making an increase in the minor allele an increase in number. This is opposite to the plinkio encoding. But this makes the minor allele often also the effect allele.
- missing_encoding – int encodes missing values as three, plinkio default.
Returns: n individuals by m variants numpy array (floats) of genotypes.
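The effect of `allele_2_as_zero` and `missing_encoding` is easiest to see on raw bytes. A plink 1 bed file in variant-major mode starts with the bytes 0x6c 0x1b 0x01 and then stores two bits per sample, lowest bits first: 00 is homozygous for allele 1, 01 is missing, 10 is heterozygous and 11 is homozygous for allele 2. The sketch below decodes such bytes into nested lists in pure Python; it is not the library's reader, which returns a numpy array, and `decode_bed` is a hypothetical name.

```python
import math

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # plink 1 bed, variant-major mode


def decode_bed(data, n_samples, n_variants,
               allele_2_as_zero=True, missing_encoding=3):
    """Decode bed bytes into a samples by variants list of float genotypes."""
    if data[:3] != BED_MAGIC:
        raise ValueError("not a variant-major plink bed file")
    if allele_2_as_zero:
        # Count copies of allele 1 (often the minor allele in plink).
        lookup = {0b00: 2.0, 0b10: 1.0, 0b11: 0.0}
    else:
        # Count copies of allele 2 instead.
        lookup = {0b00: 0.0, 0b10: 1.0, 0b11: 2.0}
    lookup[0b01] = float(missing_encoding)

    bytes_per_variant = math.ceil(n_samples / 4)
    genotypes = [[0.0] * n_variants for _ in range(n_samples)]
    for variant in range(n_variants):
        offset = 3 + variant * bytes_per_variant
        for sample in range(n_samples):
            byte = data[offset + sample // 4]
            code = (byte >> (2 * (sample % 4))) & 0b11
            genotypes[sample][variant] = lookup[code]
    return genotypes


# One variant, four samples, packed into the single byte 0xDC.
bed_bytes = BED_MAGIC + bytes([0xDC])
minor_counts = decode_bed(bed_bytes, n_samples=4, n_variants=1)
allele_2_counts = decode_bed(bed_bytes, 4, 1, allele_2_as_zero=False)
```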
-
genome_integration.utils.plink_utils.isolate_snps_of_interest_make_bed(ma_file, exposure_name, b_file, tmp_file_prepend, plink_files_out, calculate_ld=False, individuals_to_isolate=None, no_palindromic=False)¶ Isolate snps of interest for a gene, and make a bed file
Parameters: - ma_file –
- exposure_name –
- b_file –
- tmp_file_prepend –
- plink_files_out –
- calculate_ld –
Returns: the name of the bed file with only the SNPs of interest
-
genome_integration.utils.plink_utils.plink_isolate_clump(bed_file, associations, threshold, r_sq=0.5, tmploc='', return_snp_file=False)¶ Prunes a list of SNPs for LD, using a bed file location. Outputs the list of SNPs after pruning.
Parameters: - bed_file –
- associations –
Returns: list of snps after prune
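The idea behind LD pruning can be sketched with a simple greedy pass: walk through the SNPs from the smallest p value upwards and keep a SNP only if its r² with every SNP already kept is below `r_sq`. This is a simplified stand-in, not plink's clumping procedure and not the library's implementation; the function name and the input dictionaries are hypothetical.

```python
def greedy_ld_prune(p_values, r_squared, threshold, r_sq=0.5):
    """Greedily keep significant SNPs that are not in LD with kept SNPs.

    p_values:  dict of SNP name -> p value
    r_squared: dict of frozenset({snp_a, snp_b}) -> r squared;
               pairs that are absent are treated as uncorrelated.
    """
    kept = []
    for snp in sorted(p_values, key=p_values.get):
        if p_values[snp] > threshold:
            break  # all remaining SNPs have larger p values
        in_ld = any(
            r_squared.get(frozenset((snp, other)), 0.0) >= r_sq
            for other in kept
        )
        if not in_ld:
            kept.append(snp)
    return kept


p_values = {"rs1": 1e-10, "rs2": 1e-9, "rs3": 1e-8, "rs4": 0.5}
r_squared = {
    frozenset(("rs1", "rs2")): 0.9,
    frozenset(("rs1", "rs3")): 0.1,
}
pruned = greedy_ld_prune(p_values, r_squared, threshold=1e-5)
```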
-
genome_integration.utils.plink_utils.read_region_from_plink(bed_file, out_location, region, variants=None)¶ Reads a region from a plink file and writes it to an output.
This function can be used if you want to read only a small part of a plink file.
Parameters: - bed_file – str prepend filelocation of a bed file.
- out_location – prepend file location of the pruned file.
- region – StartEndRegion Region to look for.
- variants – iterable of str iterable containing the variant names to keep for analysis.
Returns: None
-
genome_integration.utils.plink_utils.score_and_assess_auc(genetic_associations, bed_file, tmp_file='tmp_score', p_value_thresh=1.0, resolution=500)¶ Using scoring, we determine the auc
Parameters: - genetic_associations –
- bed_file –
- tmp_file –
- p_value_thresh –
- resolution –
Returns:
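One common way to turn scores into an AUC is the rank-based definition: the fraction of case and control pairs in which the case has the higher score, with ties counted as half. The sketch below uses that definition, which may differ from the threshold sweep suggested by the `resolution` parameter above. The phenotype coding (1 for cases, 0 for controls) and the function name are assumptions.

```python
def auc_from_scores(scored):
    """Rank-based AUC from an iterable of (phenotype, score) tuples."""
    scored = list(scored)
    cases = [score for phenotype, score in scored if phenotype == 1]
    controls = [score for phenotype, score in scored if phenotype == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


auc = auc_from_scores([(1, 0.9), (1, 0.4), (0, 0.5), (0, 0.1)])
```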
-
genome_integration.utils.plink_utils.score_individuals(genetic_associations, bed_file, tmp_file='tmp_score', p_value_thresh=1)¶ Used to score individuals.
Parameters: - genetic_associations –
- bed_file – prepend of a bed file
- tmp_file – prepend of temporary files.
- p_value_thresh – p value threshold; only genetic associations that pass this threshold are used.
Returns: dict with individuals as keys and, as values, a tuple with the phenotype [0] and score [1] of the individual.
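The return structure described above can be illustrated with a plain weighted sum of allele counts. This sketch assumes that the score is the sum of beta times dosage over the associations with a p value at or below the threshold; how the library (or plink) actually scales the score is not stated here, and the function name and input dictionaries are hypothetical.

```python
def score_individuals_sketch(associations, dosages, phenotypes,
                             p_value_thresh=1.0):
    """Return a dict of individual -> (phenotype, score).

    associations: dict of SNP name -> (beta, p value)
    dosages:      dict of individual -> dict of SNP name -> allele count
    phenotypes:   dict of individual -> phenotype
    """
    selected = {
        snp: beta
        for snp, (beta, p_value) in associations.items()
        if p_value <= p_value_thresh
    }
    scores = {}
    for individual, genotype in dosages.items():
        score = sum(
            beta * genotype[snp]
            for snp, beta in selected.items()
            if snp in genotype
        )
        scores[individual] = (phenotypes[individual], score)
    return scores


associations = {"rs1": (0.5, 1e-6), "rs2": (-0.25, 0.01), "rs3": (1.0, 0.9)}
dosages = {
    "FAM1~__~IND1": {"rs1": 2, "rs2": 1, "rs3": 2},
    "FAM1~__~IND2": {"rs1": 0, "rs2": 2, "rs3": 0},
}
phenotypes = {"FAM1~__~IND1": 1, "FAM1~__~IND2": 0}
scores = score_individuals_sketch(associations, dosages, phenotypes,
                                  p_value_thresh=0.05)
```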