
Genome Data Mining: One Linkage Score Per DNA Letter.

There is only one way to preprocess big genome data for AI tools to work on: assigning a single linkage score to every single DNA letter (out of 3 billion) and disease term combination. The resulting database of hundreds of millions of hot spots will form the backbone of AI approaches. The described algorithm solves the problems of probe specificity and DNA redundancy and is the only way to get a single linkage score for every DNA letter. In essence, marker-based DNA data mining is fine-tuned.

Uploaded by

Korkut Vata

Genome-Data Mining

Basis:

Getting a statistical score for every DNA position and for every disease term, to build a keyword-normalized database for Artificial Intelligence to work on.
Database Structure

n1 = ATGC---------------------------tt / Atherosclerosis, HER2-Positive Breast Cancer, ...
n2 = ATGC---------------------------tt / Hyperlipidemia, Anemia, ...
n3 = ATGC---------------------------tc / Glioma, Polydactyly, ...
n4
...
n = 1.000.000
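A minimal Python sketch of one way the rows above could be held in memory. The `SampleRecord` class, the `normalize_term` rule and the short toy sequences are illustrative assumptions; the slide itself only shows that each of the n samples pairs a sequence with its disease terms.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """One database row: a sample's DNA sequence plus its disease terms."""
    sample_id: str
    sequence: str
    disease_terms: set = field(default_factory=set)


def normalize_term(term: str) -> str:
    """Hypothetical keyword normalization: lowercase, hyphens to spaces."""
    return " ".join(term.lower().replace("-", " ").split())


def make_record(sample_id, sequence, terms):
    return SampleRecord(sample_id, sequence.lower(),
                        {normalize_term(t) for t in terms})


# Toy stand-ins for n1..n3; real rows would hold whole genomes.
records = [
    make_record("n1", "ATGCcttcacaagtacgcgtcgtttt",
                ["Atherosclerosis", "HER2-Positive Breast Cancer"]),
    make_record("n2", "ATGCcttcacaagttcgcgtcgtttt",
                ["Hyperlipidemia", "Anemia"]),
    make_record("n3", "ATGCcttcacaagtgcgcgtcgtttc",
                ["Glioma", "Polydactyly"]),
]

# Inverted index: normalized disease term -> ids of samples carrying it.
index = defaultdict(set)
for rec in records:
    for term in rec.disease_terms:
        index[term].add(rec.sample_id)
```

Looking up `index["anemia"]` then returns the samples annotated with that term, which is the grouping the later disease / normal comparison relies on.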
Methodology

cttcacaagt(CATG)cgcgtcgtt

Search:
A - 345.000   0.007 seconds
T - 250.000   0.006 seconds
G - 255.000   0.011 seconds
C - 350.000   0.009 seconds
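The search step above can be sketched in Python as a count of which base sits between the two probe flanks across all samples. The function name `count_variants` and the four toy sequences are assumptions made for illustration; the slide does not say how the search is implemented or how the timings were obtained.

```python
import re
from collections import Counter


def count_variants(sequences, left_flank, right_flank):
    """Count the base found between the two flanks, over all sequences.

    Uses non-overlapping regex matches, which is enough for this toy data.
    """
    pattern = re.compile(
        re.escape(left_flank) + "([acgt])" + re.escape(right_flank)
    )
    counts = Counter()
    for seq in sequences:
        for match in pattern.finditer(seq.lower()):
            counts[match.group(1).upper()] += 1
    return counts


# Toy samples: the probe flanks with a different central base in each.
toy_samples = [
    "ggcttcacaagtacgcgtcgttaa",
    "ggcttcacaagttcgcgtcgttaa",
    "ggcttcacaagtacgcgtcgttaa",
    "ggcttcacaagtgcgcgtcgttaa",
]
counts = count_variants(toy_samples, "cttcacaagt", "cgcgtcgtt")
```

On the toy data this yields two hits for A and one each for T and G; on a real database the four totals would be the A / T / G / C figures listed for the probe.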
Solution for:

Probe Specificity and DNA Redundancy

Probe Specificity:
cttcacaagt(CATG)cgcgtcgtt
If the sum (ATGC) > 1 million, then extend the probe on the 5' and 3' ends until the sum (ATGC) < 1 million.

DNA Redundancy:
A - 245.000
T - 175.000
G - 252.000
C - 218.000
Sum: 890.000 out of 1 million
Probe is specific for 890.000 samples.
Sample size is 890.000 for the particular probe and particular database.
Correction factor: 88%
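The extension rule can be sketched in Python as a loop that widens both flanks until the probe no longer hits more places than there are samples. The names `total_hits` and `extend_probe`, the one-base-per-side step and the toy sequences are assumptions. The sketch also stops at `<=` rather than the strict `<` on the slide, so that a probe matching every sample exactly once counts as specific; that choice is an assumption too.

```python
import re


def total_hits(sequences, left, right):
    """Places where left + any base + right occurs, summed over sequences.

    A lookahead is used so overlapping occurrences are counted as well.
    """
    pattern = re.compile(
        "(?=" + re.escape(left) + "[acgt]" + re.escape(right) + ")"
    )
    return sum(len(pattern.findall(seq)) for seq in sequences)


def extend_probe(reference, pos, sequences, flank=1, max_flank=50):
    """Grow the 5' and 3' flanks around reference[pos] until specific."""
    while flank <= max_flank:
        start, end = pos - flank, pos + flank + 1
        if start < 0 or end > len(reference):
            break  # ran off the end of the reference
        left, right = reference[start:pos], reference[pos + 1:end]
        hits = total_hits(sequences, left, right)
        if hits <= len(sequences):  # no more hits than samples
            return left, right, hits
        flank += 1
    raise ValueError("no specific probe found within max_flank")


# Toy reference with a repeated "acgt" so a one-base flank is ambiguous.
reference = "ttacgtaggacgtcc"
samples = [reference, "ttacataggacgtcc"]  # second sample: g -> a at index 4
probe = extend_probe(reference, 4, samples)
```

With one-base flanks the probe hits the repeat twice in each sample (4 hits for 2 samples), so the loop widens to two-base flanks, where each sample is hit once.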



Statistics for Every DNA Letter

--------2.345.789.156


1 cctggagcac ggaagattct t gcggacacaaatcgcaact gctaaataaa atttatttat


61 ttgagtgcac agccatgagt cttcacaagt(CATG)cgcgtcgtt atgcttgact tttaaccaaa


121 acacttcgat tgtttcgcgt agcaatagtc gcacaatttt tgaagctttc aaggagttcc


181 tggatttttg ggatatcggc aacgaagttt ctgcagagtc agcagttcgg gtctccagca


241 acggagcttt caacttgccg cagagttttg gcaacgaatc caacgaatat gcccacctgg


301 ctacgcctgt ggatccagcc tacggaggca acaacacgaa caacatgatg cagttcacga


361 acaatctgga aattttggcc aacaataatt ccgatggcaa taacaaaatt aatgcatgca


421 acaaattcgt ctgccacaag ggcactgatt ccgaggatga ctccacggag gtcgatatca


481 aggaggatat tccgaaaacg gtggaggtat cgggatcgga attgaccacg gaacccatgg


541 ccttcttgca gggattaaac tccgggaatc tgatgcagtt cagccagcaa tccgtgctgc


601 gcgaaatgat gctgcaggac attcagatcc aggcgaacac gctgcccaag ctagagaatc


2.346.478 ----------


DNA position 2.345.789.156
Position Identifier: “cttcacaagt(CATG)cgcgtcgtt”

Base   Disease    Normal
A      165.000    143.000
T      225.000    281.000
G      255.000    263.000
C      365.000    382.000

Delta Change 25% > Threshold

After normalization with respect to disease term frequencies.
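A rough Python sketch of the disease / normal comparison, using the counts shown for this position. Reading the first row's second value as the normal-group count for A is an assumption, as are the function names and the 10% threshold. The normalization with respect to disease term frequencies is not specified, so this sketch works on raw counts and is not expected to reproduce the percentages quoted on the slides.

```python
def relative_change(disease, normal):
    """Per-base change of the disease count relative to the normal count."""
    return {base: (disease[base] - normal[base]) / normal[base]
            for base in normal}


def bases_over_threshold(disease, normal, threshold=0.10):
    """Keep only bases whose absolute relative change exceeds the threshold."""
    changes = relative_change(disease, normal)
    return {base: change for base, change in changes.items()
            if abs(change) > threshold}


# Counts from the slide for DNA position 2.345.789.156 (raw, unnormalized).
disease = {"A": 165_000, "T": 225_000, "G": 255_000, "C": 365_000}
normal = {"A": 143_000, "T": 281_000, "G": 263_000, "C": 382_000}

flagged = bases_over_threshold(disease, normal)
```

A negative value means the base is rarer in the disease group, which is the direction a HotSpot card reports as a decrease.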

HotSpot Card:
For HER2 Positive Breast Cancer
Probe: cttcacaagt(CATG)cgcgtcgtt
DNA position # 2.345.789.156
Score: 27% Decrease in T content
Genome-DataMining

DNA position 2.345.789.91
DNA position 2.345.799.913
DNA position 2.945.534.915
DNA position 1.345.789.91
DNA position 2.128.867.985

HER2-Positive Breast Cancer

DNA position 1.345.789.913   A   2x increased risk
DNA position 2.345.799.913   C   3x increased risk
DNA position 945.534.715     BRCA1 (known)   T   5x increased risk
DNA position 1.345.789.91    T   1.5x increased risk
DNA position 2.128.867.985   G   3x decreased risk

N = 17.546 data points for HER-2 Positive Breast Cancer
Risk Factors for HER-2 Positive Breast Cancer

Upload your genome data, search for:

Search: “Her-2 positive Breast Cancer”

17.546 DNA letters are checked for both copies of the genome (maternal, paternal) and the combined risk factor is displayed.

2.3 times higher risk.
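The slides do not say how the per-letter risks are combined into one figure. The Python sketch below assumes a simple multiplicative model, in which every risk letter found on either genome copy multiplies the total by its relative risk. The position keys, the genotypes and the function name `combined_risk` are hypothetical; only the 2x, 3x and 3x-decreased factors echo the HER2 example.

```python
import math

# position key -> (risk letter, relative risk); keys are placeholders.
hotspots = {
    "pos1": ("A", 2.0),
    "pos2": ("C", 3.0),
    "pos3": ("G", 1 / 3),  # "3x decreased risk"
}


def combined_risk(hotspots, maternal, paternal):
    """Multiply the relative risks of all risk letters on both copies.

    Summing logarithms keeps the product stable over thousands of hotspots.
    """
    log_risk = 0.0
    for position, (letter, relative_risk) in hotspots.items():
        for genome_copy in (maternal, paternal):
            if genome_copy.get(position) == letter:
                log_risk += math.log(relative_risk)
    return math.exp(log_risk)


# Hypothetical genotype: letter carried at each hotspot on each copy.
maternal = {"pos1": "A", "pos2": "T", "pos3": "G"}
paternal = {"pos1": "A", "pos2": "C", "pos3": "T"}

risk = combined_risk(hotspots, maternal, paternal)
```

In this toy case the A letter appears on both copies (2 x 2), the C letter once (x 3) and the protective G letter once (x 1/3), giving a combined factor of about 4. Other models, such as additive or weighted scores, would give different figures.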
Extent

For all disease terms and disease variants...

Problems Solved:

Repeat regions are recognized and the correct positions within multiple regions are identified.
Correction factors for database size are determined.
Result

A database for every disease term.


Enter = “HER2 Positive Breast Cancer”


Result = 8 x increased risk for individual XYZ

Based on 17.546 Hotspots on XYZ's Genome


ARTIFICIAL INTELLIGENCE, AI WORKING BASE

HotSpot Card:
For HER2 Positive Breast Cancer
Probe: cttcacaagt(CATG)cgcgtcgtt
DNA position # 2.345.789.156
Score : 27% Decrease in T content in Disease

~15.000 Disease Terms
~20.000 Hotspots / Disease Term
~300.000.000 Data Points Above Threshold Values
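As a sketch of what one entry in this working base might look like, the Python snippet below models a HotSpot card and checks the size estimate. The `HotSpotCard` class, its field names and the `describe` helper are assumptions made for illustration; the field values come from the HER2 card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HotSpotCard:
    """One above-threshold data point for a disease term."""
    disease_term: str
    probe: str
    position: int
    base: str
    change: float  # signed relative change in the disease group


def describe(card: HotSpotCard) -> str:
    """Render the score line in the style used on the card."""
    direction = "Decrease" if card.change < 0 else "Increase"
    return f"{abs(card.change):.0%} {direction} in {card.base} content"


card = HotSpotCard(
    disease_term="HER2 Positive Breast Cancer",
    probe="cttcacaagt(CATG)cgcgtcgtt",
    position=2_345_789_156,
    base="T",
    change=-0.27,
)

# Size estimate from the approximate figures on the slide.
disease_terms = 15_000
hotspots_per_term = 20_000
total_data_points = disease_terms * hotspots_per_term
```

Here `describe(card)` gives the card's score line, and the product of the two approximate counts matches the roughly 300.000.000 data points stated above.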


Structural Mapping: promoter, exon, intron, -5', -3'
Pathway Mapping
Metabolic Mapping
Phenotype Mapping
Literature Mapping
Restructuring Healthcare Globally

Medical education will be reframed with a weight towards data mining.

Surgery will be the main medical profession.

The $100 Whole Genome Test will provide the best genetic and clinical diagnosis for the next century.

All human characteristics will be known, including emotional status and tendencies toward certain behaviours.

The global healthcare system will be run by an international consortium.
LIMITS?
No limit, but: Big-Pharma politics

International Collaboration Against Big-Pharma Politics

By

Korkut Vata, Scientist
Tolunay Gümüş, Actor
