algorithm pseudo code

Pseudo code for POSTGAP (master branch)

Ed Mountjoy (15 March 2018)

##### Pre-process EFO/disease terms
If EFOs are provided as input (can be EFO, HP, GO, Orphanet),  
  use OLS to convert to IRI terms
Else,  
  search for disease string using zooma and return top hit's EFO  

##### Find child terms
If child_terms flag is used, find all child terms for each IRI

##### Run using disease/efo as starting point
If disease/efo terms used as starting point  
  Scan all disease DBs (GWAS Catalog, etc) to find SNPs associated with query disease  
  Cluster associated GWAS SNPs based on LD

  For each cluster:  
    Find variants in LD with the top GWAS hit  

    Using combined top SNP + all LD SNPS (cluster_SNPs) get SNP-gene evidence from each cis-regulatory source:  

      (1) GTEx
        Get list of genes within 1Mb of min/max SNP position
        If num genes < num cluster_SNPs:
          For each gene:
            For each tissue (specified by user):
              Find all SNPs associated with gene (if in cluster_SNPs)
              Assign association "score" as 1-pval
        Else:
          For each cluster_SNP:
            For each tissue (specified by user):
              Find all genes associated with SNP
              Assign association "score" as 1-pval
        (Note: both branches of this conditional statement produce the same output [SNP->gene scores])

      (2) VEP
        For each SNP in cluster_SNPs:
          Get VEP consequences (from Ensembl API)
          Save result for each consequence
          Assigned "score" depends on VEP "impact"
            'HIGH': 4,
            'MEDIUM': 3,
            'LOW': 2,
            'MODIFIER': 1,
            'MODERATE': 1

      (3) Fantom5
        Extract all Fantom5 intervals that overlap SNPs (using bedtools). Example Fantom5 in Appendix 1 below.
        For each row in extracted Fantom5:
          Calc evidence score linking SNP to gene. This uses a custom "STOPGAP_FDR" scoring function:
            Loads pickle containing Fantom5 specific model
            Calc distance from SNP to gene TSS
            If dist > MAX_DISTANCE (from model), return 0
            Calc bin, = distance / BIN_WIDTH (from model)
            Look up bin in model to get FDR
            Assign score as 1 - FDR
            (Note: The score is basically function of distance from SNP to TSS. It is not based in any way on column 5 [FDR] from the input data. Therefore the input data must have been pre-filtered to only contain "significant" results)

      (4) DHS
        Same as Fantom5, except using a DHS specific FDR model pickle
        Example DHS bed in Appendix 2.
        (Note: This is ENCODE DHS-promoter correlation data. Possibly from this paper https://www.nature.com/articles/nature11232)

      (5) PCHIC
        Extract all PCHIC intervals that overlap SNPs (using bedtools). Example PCHIC in Appendix 3 below.
        For each row in extracted PCHIC:
          Calc evidence score linking SNP to gene = score / (max score in dataset)
        (probably from https://www.cell.com/cell/abstract/S0092-8674(16)31322-8)

      (6) Nearest gene
        For each SNP:
          Find the closest gene TSS (in Ensembl_TSSs.bed)
          Assign score of 1

    Using same set of top SNP + all LD SNPS (cluster_SNPs) get SNP-specific evidence from each regulatory source:

      (1) Regulome DB
        Find overlap between RegulomeDB bed (appendix 4) and cluster_SNP positions
        For each overlap (row):
          Assign score for whether that SNP is regulatory
          Set score = 1 if regulomedb_score starts with 1 or 2, else score = 0.5. See appendix 5 for summary of RegulomeDB scores

      (2) VEP_reg
        (Note: finds colocated SNPS, I'm not sure why)
        For each SNP in cluster_SNPs:
          Queries Ensembl VEP API
          Extracts "colocated_variants"
          Saves with score = 0

    For each SNP and its corresponding cis-reg and reg evidence record (from above):
      For each gene in evidence record, calculate overall SNP-to-gene score:

        For each piece of evidence:
        (Note: reg evidence [not cis-reg] is gene ambiguous, yet is still contributing to SNP-gene score ??)
          If evidence score > scores[gene][evidence source] (finds largest piece of evidence for this type of evidence):
            scores[gene][evidence source] = evidence score
          If evidence is from VEP:
            scores[gene][VEP_count] + 1
            scores[gene][VEP_score] + evidence score
            (This is done as there can be multiple consequences from VEP)

        scores[gene][PCHiC] = min(scores[gene][PCHiC], 1) (bounds PCHiC to max 1)

        If there is any VEP evidence for this SNP-gene:
          Calc mean VEP score across consequences

        Create weighted sum for SNP-gene across evidence sources using specified weights (appendix 6)

    For each SNP-gene association:
      Multiply its weighted score by degree of LD (r2) with the top GWAS SNP in the cluster. Call this the "gene score".

    Find max score

    For each SNP-gene association:
      Query OMIM for that gene.
      If record exists:
        Set evidence score to max score
        (Note: This is called the "OMIM exception", I have no idea why its doing it)

    Precompute dictionary: For each LD SNP in cluster, adjust top hit's P-value using PICS score (see appendix 7 for function). Normalise so that all sum to 1.

    Across all SNP-gene associations in the cluster, calculate a "total score" (representing the most likely gene for this cluster of SNPs):
        A = pics * (pics ** (1/3))
        B = gene_score * (gene_score ** (1/3))
        total score = ((A + B) / 2) ** 3

Note: I don't think tissues are currently being weighted in any way

Appendix 1: Example Fantom5 bed

chrom
chromStart
chromEnd
HGNC
FDR:fdr

  1       858256  858648  SAMD11  0
  1       941791  942135  HES4    6.32949888009368e-07
  1       945769  946034  ATAD3B  2.39613349948099e-06
  1       945769  946034  CCNL2   3.21786973606028e-06
  1       956563  956812  AGRN    0
  1       956563  956812  ATAD3A  8.9714551881892e-07
  1       956563  956812  ATAD3A  9.48658943745863e-07

Appendix 2: Example DHS bed

chrom
chromStart
chromEnd
HGNC
Correlation

1       87640   87790   AL627309.1      0.87171
1       87640   87790   RP11-34P13.10   0.820162
1       87640   87790   RP11-34P13.7    0.820162
1       115600  115750  RP11-34P13.7    0.823767
1       118840  118990  AL627309.1      0.908176

Appendix 3: Example PCHIC bed

chrom
chromStart
chromEnd
score
gene_id

1       83360   84049   5.21321231832259        ENSG00000139117
1       83360   84049   5.21321231832259        ENSG00000257718
1       91320   91586   6.47948309357412        ENSG00000177947
1       91320   91586   6.47948309357412        ENSG00000188076
1       91320   91586   6.47948309357412        ENSG00000232495
1       91320   91586   6.47948309357412        ENSG00000254559

Appendix 4: Example Regulome bed

chrom
chromStart
chromEnd
regulome DB score (appendix 5)

1       10327   10328   3a
1       10440   10441   2b
1       10469   10470   2b
1       16142   16143   3a
1       16243   16244   3a
1       16243   16244   3a
1       29393   29394   2b

Appendix 5: RegulomeDB scores

1a eQTL + TF binding + matched TF motif + matched DNase Footprint + DNase peak 1b eQTL + TF binding + any motif + DNase Footprint + DNase peak 1c eQTL + TF binding + matched TF motif + DNase peak 1d eQTL + TF binding + any motif + DNase peak 1e eQTL + TF binding + matched TF motif 1f eQTL + TF binding / DNase peak 2a TF binding + matched TF motif + matched DNase Footprint + DNase peak 2b TF binding + any motif + DNase Footprint + DNase peak 2c TF binding + matched TF motif + DNase peak 3a TF binding + any motif + DNase peak 3b TF binding + matched TF motif 4 TF binding + DNase peak 5 TF binding or DNase peak 6 other

Appendix 6: POSTGAP evidence weights (master branch 15/3/18)

EVIDENCE_WEIGHTS = { 'Regulome': 1, 'VEP': 1, 'GTEx': 1, 'Fantom5': 1, 'DHS': 1, 'PCHIC': 1, 'nearest_gene': 1 }

Appendix 7: PICS score (adjusted p-value based on LD)

def PICS(ld, pvalue):
    minus_log_pvalue = - math.log(pvalue) / math.log(10);
    SD = dict()
    Mean = dict()
    prob = dict()
    sum = 0

    for snp in ld.keys():
        if snp in ld:
            SD[snp] = math.sqrt(1 - math.sqrt(ld[snp]) ** 6.4) * math.sqrt(minus_log_pvalue) / 2;
            Mean[snp] = ld[snp] * minus_log_pvalue;
        else:
            # Defaults for remote SNPs
            SD[snp] = 0
            Mean[snp] = 1 + minus_log_pvalue

        # Calculate the probability of each SNP
        if SD[snp]:
            prob[snp] = 1 - pnorm(minus_log_pvalue, Mean[snp], SD[snp])
        else:
            prob[snp] = 1

        # Normalisation sum
        sum += prob[snp]

    # Normalize the probabilies so that their sum is 1.
    return dict((snp, prob[snp] / sum) for snp in prob.keys())

Provide feedback

Saved searches

Use saved searches to filter your results more quickly