-
Notifications
You must be signed in to change notification settings - Fork 21
algorithm pseudo code
Ed Mountjoy (15 March 2018)
##### Pre-process EFO/disease terms
If EFOs are provided as input (can be EFO, HP, GO, Orphanet),
use OLS to convert to IRI terms
Else,
search for disease string using zooma and return top hit's EFO
##### Find child terms
If child_terms flag is used, find all child terms for each IRI
##### Run using disease/efo as starting point
If disease/efo terms used as starting point
Scan all disease DBs (GWAS Catalog, etc) to find SNPs associated with query disease
Cluster associated GWAS SNPs based on LD
For each cluster:
Find variants in LD with the top GWAS hit
Using combined top SNP + all LD SNPS (cluster_SNPs) get SNP-gene evidence from each cis-regulatory source:
(1) GTEx
Get list of genes within 1Mb of min/max SNP position
If num genes < num cluster_SNPs:
For each gene:
For each tissue (specified by user):
Find all SNPs associated with gene (if in cluster_SNPs)
Assign association "score" as 1-pval
Else:
For each cluster_SNP:
For each tissue (specified by user):
Find all genes associated with SNP
Assign association "score" as 1-pval
(Note: both branches of this conditional statement produce the same output [SNP->gene scores])
(2) VEP
For each SNP in cluster_SNPs:
Get VEP consequences (from Ensembl API)
Save result for each consequence
Assigned "score" depends on VEP "impact"
'HIGH': 4,
'MEDIUM': 3,
'LOW': 2,
'MODIFIER': 1,
'MODERATE': 1
(3) Fantom5
Extract all Fantom5 intervals that overlap SNPs (using bedtools). Example Fantom5 in Appendix 1 below.
For each row in extracted Fantom5:
Calc evidence score linking SNP to gene. This uses a custom "STOPGAP_FDR" scoring function:
Loads pickle containing Fantom5 specific model
Calc distance from SNP to gene TSS
If dist > MAX_DISTANCE (from model), return 0
Calc bin, = distance / BIN_WIDTH (from model)
Look up bin in model to get FDR
Assign score as 1 - FDR
(Note: The score is basically function of distance from SNP to TSS. It is not based in any way on column 5 [FDR] from the input data. Therefore the input data must have been pre-filtered to only contain "significant" results)
(4) DHS
Same as Fantom5, except using a DHS specific FDR model pickle
Example DHS bed in Appendix 2.
(Note: This is ENCODE DHS-promoter correlation data. Possibly from this paper https://www.nature.com/articles/nature11232)
(5) PCHIC
Extract all PCHIC intervals that overlap SNPs (using bedtools). Example PCHIC in Appendix 3 below.
For each row in extracted PCHIC:
Calc evidence score linking SNP to gene = score / (max score in dataset)
(probably from https://www.cell.com/cell/abstract/S0092-8674(16)31322-8)
(6) Nearest gene
For each SNP:
Find the closest gene TSS (in Ensembl_TSSs.bed)
Assign score of 1
Using same set of top SNP + all LD SNPS (cluster_SNPs) get SNP-specific evidence from each regulatory source:
(1) Regulome DB
Find overlap between RegulomeDB bed (appendix 4) and cluster_SNP positions
For each overlap (row):
Assign score for whether that SNP is regulatory
Set score = 1 if regulomedb_score starts with 1 or 2, else score = 0.5. See appendix 5 for summary of RegulomeDB scores
(2) VEP_reg
(Note: finds colocated SNPS, I'm not sure why)
For each SNP in cluster_SNPs:
Queries Ensembl VEP API
Extracts "colocated_variants"
Saves with score = 0
For each SNP and its corresponding cis-reg and reg evidence record (from above):
For each gene in evidence record, calculate overall SNP-to-gene score:
For each piece of evidence:
(Note: reg evidence [not cis-reg] is gene ambiguous, yet is still contributing to SNP-gene score ??)
If evidence score > scores[gene][evidence source] (finds largest piece of evidence for this type of evidence):
scores[gene][evidence source] = evidence score
If evidence is from VEP:
scores[gene][VEP_count] + 1
scores[gene][VEP_score] + evidence score
(This is done as there can be multiple consequences from VEP)
scores[gene][PCHiC] = min(scores[gene][PCHiC], 1) (bounds PCHiC to max 1)
If there is any VEP evidence for this SNP-gene:
Calc mean VEP score across consequences
Create weighted sum for SNP-gene across evidence sources using specified weights (appendix 6)
For each SNP-gene association:
Multiply its weighted score by degree of LD (r2) with the top GWAS SNP in the cluster. Call this the "gene score".
Find max score
For each SNP-gene association:
Query OMIM for that gene.
If record exists:
Set evidence score to max score
(Note: This is called the "OMIM exception", I have no idea why its doing it)
Precompute dictionary: For each LD SNP in cluster, adjust top hit's P-value using PICS score (see appendix 7 for function). Normalise so that all sum to 1.
Across all SNP-gene associations in the cluster, calculate a "total score" (representing the most likely gene for this cluster of SNPs):
A = pics * (pics ** (1/3))
B = gene_score * (gene_score ** (1/3))
total score = ((A + B) / 2) ** 3
Note: I don't think tissues are currently being weighted in any way
- chrom
- chromStart
- chromEnd
- HGNC
- FDR:fdr
1 858256 858648 SAMD11 0
1 941791 942135 HES4 6.32949888009368e-07
1 945769 946034 ATAD3B 2.39613349948099e-06
1 945769 946034 CCNL2 3.21786973606028e-06
1 956563 956812 AGRN 0
1 956563 956812 ATAD3A 8.9714551881892e-07
1 956563 956812 ATAD3A 9.48658943745863e-07
- chrom
- chromStart
- chromEnd
- HGNC
- Correlation
1 87640 87790 AL627309.1 0.87171
1 87640 87790 RP11-34P13.10 0.820162
1 87640 87790 RP11-34P13.7 0.820162
1 115600 115750 RP11-34P13.7 0.823767
1 118840 118990 AL627309.1 0.908176
- chrom
- chromStart
- chromEnd
- score
- gene_id
1 83360 84049 5.21321231832259 ENSG00000139117
1 83360 84049 5.21321231832259 ENSG00000257718
1 91320 91586 6.47948309357412 ENSG00000177947
1 91320 91586 6.47948309357412 ENSG00000188076
1 91320 91586 6.47948309357412 ENSG00000232495
1 91320 91586 6.47948309357412 ENSG00000254559
- chrom
- chromStart
- chromEnd
- regulome DB score (appendix 5)
1 10327 10328 3a
1 10440 10441 2b
1 10469 10470 2b
1 16142 16143 3a
1 16243 16244 3a
1 16243 16244 3a
1 29393 29394 2b
1a eQTL + TF binding + matched TF motif + matched DNase Footprint + DNase peak 1b eQTL + TF binding + any motif + DNase Footprint + DNase peak 1c eQTL + TF binding + matched TF motif + DNase peak 1d eQTL + TF binding + any motif + DNase peak 1e eQTL + TF binding + matched TF motif 1f eQTL + TF binding / DNase peak 2a TF binding + matched TF motif + matched DNase Footprint + DNase peak 2b TF binding + any motif + DNase Footprint + DNase peak 2c TF binding + matched TF motif + DNase peak 3a TF binding + any motif + DNase peak 3b TF binding + matched TF motif 4 TF binding + DNase peak 5 TF binding or DNase peak 6 other
EVIDENCE_WEIGHTS = { 'Regulome': 1, 'VEP': 1, 'GTEx': 1, 'Fantom5': 1, 'DHS': 1, 'PCHIC': 1, 'nearest_gene': 1 }
def PICS(ld, pvalue):
minus_log_pvalue = - math.log(pvalue) / math.log(10);
SD = dict()
Mean = dict()
prob = dict()
sum = 0
for snp in ld.keys():
if snp in ld:
SD[snp] = math.sqrt(1 - math.sqrt(ld[snp]) ** 6.4) * math.sqrt(minus_log_pvalue) / 2;
Mean[snp] = ld[snp] * minus_log_pvalue;
else:
# Defaults for remote SNPs
SD[snp] = 0
Mean[snp] = 1 + minus_log_pvalue
# Calculate the probability of each SNP
if SD[snp]:
prob[snp] = 1 - pnorm(minus_log_pvalue, Mean[snp], SD[snp])
else:
prob[snp] = 1
# Normalisation sum
sum += prob[snp]
# Normalize the probabilies so that their sum is 1.
return dict((snp, prob[snp] / sum) for snp in prob.keys())