Courrier - Ibrahima Diallo - Outlook

Uploaded by

Ibrahima DIALLO
Soumare et al. BioData Mining (2021) 14:30
https://doi.org/10.1186/s13040-021-00256-7

New neural network classification method for individuals ancestry prediction from SNPs data

H. Soumare, S. Rezgui, N. Gmati and A. Benkahla

Abstract
Artificial Neural Network (ANN) algorithms have been widely used to analyse genomic data. Single Nucleotide Polymorphisms (SNPs) are the most common genetic variations in the human genome. They have been shown to be involved in many genetic diseases and can be used to predict their development. Developing ANNs to handle this type of data can be considered a great success in the medical world. However, the high dimensionality of genomic data and the limited number of available samples can make the learning task very complicated. In this work, we propose a new neural network classification method based on input perturbation. The idea is first to use SVD to reduce the dimensionality of the input data and to train a classification network, whose prediction errors are then reduced by perturbing the SVD projection matrix. The proposed method has been evaluated on data from individuals with different ancestral origins, and the experimental results have shown its effectiveness. Achieving up to 96.23% classification accuracy, this approach surpasses previous deep learning approaches evaluated on the same dataset.

Keywords: Artificial neural network, Dimensionality reduction, Input perturbation, Single nucleotide polymorphism, Singular value decomposition

Introduction
The human genome contains three billion base pairs, with only 0.1% difference between individuals [1]. The most common type of genetic variation between individuals is called a Single Nucleotide Polymorphism (SNP) [2]. An SNP is a change from one base pair to another, which occurs about once every 1000 bases. Most of these SNPs have no impact on human health.
However, many studies have shown that some of these genetic variations have important biological effects and are involved in many human diseases [3, 4]. SNPs are commonly used to detect genes associated with the development of a disease within families [5]. In addition, SNPs can also help to predict a person's response to drugs or their susceptibility to developing one or more particular diseases.

© The Author(s) 2021. Open Access. This article is licensed under a Creative Commons Attribution licence, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as appropriate credit is given to the original author(s) and the source, a link to the Creative Commons licence is provided, and any changes are indicated.

In genetics, Genome-Wide Association Studies (GWAS) are observational studies that use high-throughput genotyping technologies to identify a set of genetic variants that are associated with a given trait or disease [6], by comparing variants in a group of cases with variants in a group of controls. However, this approach is only optimal for populations from the same ancestry group, as it is challenging to dissociate the variations associated with a disease from those that characterize the genetics of human populations. In this context, numerous machine learning algorithms have been used to classify individuals according to genetic differences that affect their population. Support Vector Machine (SVM) methods have been applied to infer the recent genetic ancestry of a subgroup of communities in the USA [7] or coarse ethnicity [8]. However, SVM methods are very sensitive to the choice of kernel and its parameters [9]. Deep learning algorithms, such as neural networks, have been widely used to analyse genomic data as well as gene expression data to classify certain diseases [10-20].
However, the high dimensionality of genomic data (when the number of input features is several times higher than the number of training examples) makes the learning task very difficult. Indeed, when data is composed of a large number of input features m for a small number of samples n (n << m), the problem of overfitting becomes inevitable. In general, overfitting in machine learning occurs when a model fits the training data well but does not fit unseen data. The model learns details and noise in the training data, which negatively impact its performance on new data. One way to avoid the problem of overfitting is to reduce the complexity of the problem by removing features that do not contribute to, or that decrease, the accuracy of the model [21]. Different techniques are used to deal with the problem of overfitting. The best known are L1 and L2 regularization [22]. The idea of these techniques is to penalize the higher weights in the model by adding extra terms to the loss function. Another commonly used regularization technique, called "Dropout", introduced by Hinton et al. [23], consists of dropping neurons at random (in hidden layers) in each learning round. However, such a large gap between the number of features and the number of samples aggravates the problem of overfitting. To overcome this problem, dimensionality reduction techniques need to be combined with unsupervised learning methods or other data preprocessing techniques.

There are many ways to transform high-dimensional data into low-dimensional data. Singular Value Decomposition (SVD), Principal Component Analysis (PCA) and the Autoencoder (AE) are the most common dimensionality reduction techniques. SVD and PCA are the most popular linear dimensionality reduction techniques. Both attempt to find k orthogonal dimensions in an n-dimensional space, so that k < n.
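As a minimal sketch of the SVD-based reduction described here, the NumPy snippet below projects a synthetic genotype matrix onto its top k right singular vectors. The data, the 0/1/2 genotype coding, the matrix sizes and the helper name `svd_reduce` are assumptions made for illustration and are not taken from the paper. The snippet also checks numerically a standard link to PCA: the squared singular values of the column-centered matrix, divided by the number of samples minus one, equal the leading eigenvalues of the covariance matrix.

```python
import numpy as np


def svd_reduce(X, k):
    """Project the rows of X (samples x features) onto the top-k right singular vectors.

    Hypothetical helper for illustration; returns the reduced data, the
    projection matrix and the singular values.
    """
    # Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T            # projection matrix, shape (features, k)
    return X @ V_k, V_k, s    # X @ V_k equals U[:, :k] * s[:k]


# Synthetic stand-in for SNP data: 20 samples, 500 features, genotypes coded 0/1/2
# (an assumption; far more features than samples, as in the text).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(20, 500)).astype(float)

Z, V_k, s = svd_reduce(X, k=5)
print(Z.shape)  # reduced representation: one k-dimensional row per sample

# Link to PCA: centering the columns first makes the SVD spectrum match
# the eigenvalues of the covariance matrix.
Xc = X - X.mean(axis=0)
s_c = np.linalg.svd(Xc, compute_uv=False)
cov_eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order
print(np.allclose(cov_eigvals[:5], s_c[:5] ** 2 / (X.shape[0] - 1)))
```

Working with the thin SVD (`full_matrices=False`) keeps the factor matrices small when there are far more features than samples, which is the regime the text describes.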
SVD and PCA are related to each other, but PCA uses the covariance matrix of the input data, while SVD is performed on the input matrix itself. The Autoencoder is a neural network that tries to reconstruct the input data from its compressed form. Indeed, the Autoencoder is used as a method of non-linear dimensionality reduction: it works by mapping n-dimensional input data to k-dimensional data (with k