Use of The DNAChecker Algorithm For Improving Bioinformatics Re

This document describes a study that developed an algorithm called DNAChecker to analyze DNA sequences before using BLAST (Basic Local Alignment Search Tool) for bioinformatics research. DNAChecker helps identify the quality of DNA sequences by determining the number of non-template nucleotides denoted as "N" present within the sequences. The presence of many "N"s can affect BLAST results. The researchers implemented DNAChecker using Python to automate sequence quality checks, which had previously required manual processing. DNAChecker was tested on DNA sequences from a USAID project conducted in Indonesia and showed potential for improving bioinformatics research, though it requires further development to more accurately differentiate high and low quality sequences.


Makara Journal of Technology

Volume 23 Number 2 Article 4

8-2-2019

Use of the “DNAChecker” Algorithm for Improving Bioinformatics Research

Nausheen Bhat
Department of Bioinformatics, School of Life Sciences, Indonesia International Institute for Life Sciences,
Jakarta Timur 13210, Indonesia

Ezra Bernadus Wijaya


Department of Bioinformatics, School of Life Sciences, Indonesia International Institute for Life Sciences,
Jakarta Timur 13210, Indonesia

Arli Aditya Parikesit


Department of Bioinformatics, School of Life Sciences, Indonesia International Institute for Life Sciences,
Jakarta Timur 13210, Indonesia, [email protected]

Follow this and additional works at: https://scholarhub.ui.ac.id/mjt

Part of the Chemical Engineering Commons, Civil Engineering Commons, Computer Engineering
Commons, Electrical and Electronics Commons, Metallurgy Commons, Ocean Engineering Commons, and
the Structural Engineering Commons

Recommended Citation
Bhat, Nausheen; Wijaya, Ezra Bernadus; and Parikesit, Arli Aditya (2019) "Use of the “DNAChecker”
Algorithm for Improving Bioinformatics Research," Makara Journal of Technology: Vol. 23 : No. 2 , Article
4.
DOI: 10.7454/mst.v23i2.3488
Available at: https://scholarhub.ui.ac.id/mjt/vol23/iss2/4

This Article is brought to you for free and open access by the Universitas Indonesia at UI Scholars Hub. It has been
accepted for inclusion in Makara Journal of Technology by an authorized editor of UI Scholars Hub.
Makara J. Technol. 23/2 (2019), 72-77
doi: 10.7454/mst.v23i2.3488

Use of the “DNAChecker” Algorithm for Improving Bioinformatics Research

Nausheen Bhat1, Ezra Bernadus Wijaya1,2, and Arli Aditya Parikesit1*

1. Department of Bioinformatics, School of Life Sciences, Indonesia International Institute for Life Sciences,
Jakarta Timur 13210, Indonesia
2. Department of Bioinformatics and Medical Engineering, Asia University, Taichung 41354, Taiwan

*
e-mail: [email protected]

Abstract

Basic Local Alignment Search Tool (BLAST) is a bioinformatics tool used for analyzing nucleotide sequences
with regard to their similarity. BLAST can be found online on biological databases such as the National Center for
Biotechnology Information (NCBI) and other such repositories. BLAST compares the target sequence
with other sequences to find regions of local similarity and thus produces a comparability quotient that
expresses the resemblance between the sequences. Due to the open-platform nature of the online databanks,
sequences can be accepted with little to no scrutiny of the quality of the sequence submitted. One
measure of how unclean a nucleotide sequence is, is the number of non-template nucleotides, denoted as “N,”
present within the sequence. Here we develop a self-established nucleotide sequence reading program known as
“DNAChecker,” which helps identify the quality of the target sequence and therefore indicates how reliable the
BLAST result is likely to be. DNAChecker is an in-house program that runs on Python 3.4 and was implemented in the
United States Agency for International Development (USAID) project conducted at the Indonesia International
Institute for Life Sciences. Although DNAChecker has proven to be useful, it has a lot of room for improvement,
such as a more objectively accurate means of differentiating between good and bad sequences.

Abstrak

Penggunaan Algoritma “DNA Checker” untuk Pengembangan Riset Bioinformatika. Basic Sequence Alignment
Tool (BLAST) adalah aplikasi bioinformatika yang digunakan untuk menganalisis sekuens nukleotida sehubungan
dengan pensejajarannya. BLAST dapat ditemukan secara daring di database biologis seperti Pusat Nasional untuk
Informasi Bioteknologi (NCBI) dan repositori lainnya. Mekanisme BLAST memungkinkan sekuens target untuk
dibandingkan dengan sekuens lain untuk menemukan daerah kesamaan lokal, dan dengan demikian, dapat
menghasilkan perbandingan yang menentukan kemiripan antara sekuens. Karena sifat platform terbuka dari bank data
daring, beberapa urutan dapat diterima dengan sedikit atau tanpa interupsi terkait kualitas urutan yang disampaikan.
Contoh urutan nukleotida tidak baik dapat didasarkan pada jumlah nukleotida non-template, dilambangkan sebagai "N,"
yang hadir dalam urutan. Di sini kami mengembangkan program pembacaan urutan nukleotida yang dikenal sebagai
"DNAChecker," yang membantu mengidentifikasi kualitas urutan target dan karenanya meningkatkan keefektifan hasil
pencarian BLAST. DNAChecker adalah program inbuilt, yang berjalan pada Python 3.4 dan diimplementasikan di
proyek Badan Pembangunan Internasional Amerika Serikat (USAID) yang dilaksanakan di Institut Bioscientia
Internasional Indonesia. Meskipun DNAChecker terbukti bermanfaat, ia tetap seyogyanya ditingkatkan fiturnya, seperti
memiliki cara yang lebih akurat secara obyektif untuk membedakan urutan yang baik dan buruk.

Keywords: DNAChecker, Python, NCBI, BLAST, USAID

1. Introduction

DNA amplification is an essential process that manipulates the DNA fragments of a sample such that a sequence is produced. Sequence identification requires the use of the Basic Local Alignment Search Tool (BLAST), a bioinformatics tool that compares nucleotide sequences by aligning a target sequence against a nucleotide databank, where the most similar sequences can be identified [1]. Although the results of BLAST have a defined accuracy, it is not certain how “clean” or precise the target sequence may have been. During amplification, a common error occurs where, for lack of a sufficient concentration of any one nucleotide, the base is left undetermined [2],[3]; consequently, the corresponding region is either left empty or identified as a non-template nucleotide, denoted simply as “N” [4]. These non-template nucleotides may be the reason for variations between the expected and actual results from BLAST. In many cases where the BLAST result is unknown, the presence of non-template nucleotides may introduce changes within the sequence that disrupt the read sequence and, consequently, the overall research information [4],[5].

As experienced with the DNA amplification in the United States Agency for International Development (USAID) project, several non-template nucleotides were present, which had to be dealt with manually, so that not the whole sequence was read. Due to the advancements in computational analysis for biological data, the Python programming language can be used to script a code that reads the sequence prior to BLAST analysis and determines the quality of the sequence based on a uniform criterion [6],[7]. Several computational algorithms for large-scale DNA analysis have been implemented, but not for a specialized task such as the one explained in this research [8]–[10]. In various computer-based nucleotide-reading programs, uncertain nucleotides within a sequence are denoted by the letter “N.” The more occurrences of “N” within a given nucleotide sequence, the “uglier” the sequence. The ideal sequence would be a chain in which every position is the peak call of one of the four nucleotides: guanine (G), thymine (T), cytosine (C), or adenine (A).

The purpose of this study is to create a calculated nucleotide sequence analyzer that can dictate accurate approval measures in order to reduce the chances of running BLAST on unclean nucleotide sequences and to prevent possible unsatisfactory results. The function of the program, DNAChecker, designed by members of the USAID project, is to analyze whole sequences and eventually determine whether a sequence is clean enough to proceed to the next step of sequence identification, which is performed by BLAST.

2. Experimental

Material. DNAChecker was created using a Hewlett Packard Pavilion laptop with the following specifications: Intel® Core™ i7-4510U CPU @ 2.00 GHz 2.60 GHz, RAM 12.0 GB, Windows 10 64-bit Operating System, x64-based processor.

The program used for creating DNAChecker was Python 3.4, with additional plugins, such as Biopython, to further upgrade the features of DNAChecker [6]. In addition to the Python software, a DNA sequence visualizing software, FinchTV, was used to convert the ABI (Applied Biosystems) format chromatogram file (.ab1) into a more basic text format (.txt), which represents only the nucleotides and not the other measurable factors that FinchTV can express, such as the nucleotide concentration [11]. This was done so that the sequence could be input easily into DNAChecker, since accessing and reading text files is easier than accessing and reading ABI format files.

Data Management and Coding. Figure 1 shows the interface of the DNA sequence visualizing software FinchTV. As can be seen in the figure, there are peaks that represent the nucleotides with the highest concentrations. From these peaks, we can conclude which nucleotide occupies each position of the DNA sequence, in order. However, when there is a disruption in a peak, such as the one shown at the beginning, the software expresses it as an “N.” The number of “N” nucleotides therefore serves as an indicator of how well the sequence was expressed and analyzed: the fewer “N”s present within a sequence, the better the sequence.

Figure 1. The Interface of the DNA Sequence Visualizing Software, FinchTV

Ideally, the sequence would have no non-template nucleotide (N), just as none is present in the DNA itself. Finally, after accessing the DNA sequence, we can copy the information and paste it into a text document-based application, such as Notepad. Figure 2 is the representation of the previously shown raw data files that have been converted into text files and are thus readable from a notepad application. This is an essential step that allows the sequence to be read by the DNAChecker program; if the ABI format were employed, the reading, as well as the coding within the Python program, would not be efficient from the perspective of time and memory. It is also important to note the directory of this folder. Since the Python program will read the files from this folder, the files must be kept in the specified directory so that Python can locate them and load them into the PC’s RAM.
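The requirement that the text files sit in one known directory can be illustrated with a minimal sketch. It is not the authors' code: the function name `list_sequence_files` is hypothetical, and the only assumption taken from the text is that the sequences are stored as `.txt` files in a single folder.

```python
import os


def list_sequence_files(folder):
    """Return the sorted names of the .txt files found directly in `folder`."""
    names = []
    for name in os.listdir(folder):
        full_path = os.path.join(folder, name)
        # Keep regular files only, and only the text exports (.txt);
        # chromatogram files such as .ab1 are skipped.
        if os.path.isfile(full_path) and name.lower().endswith(".txt"):
            names.append(name)
    return sorted(names)
```

Passing the folder explicitly, rather than changing the working directory, keeps the sketch independent of where the script is started from.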


Figure 2 shows the target DNA sequence as it is read from a text file. The only information displayed in this format is the nucleotides in sequence.

The import statement makes previously built modules available to the DNAChecker program so that their functions can be used. The “re” (regular expression) module supports string operations and allows various string patterns to be recognized [12]. The statement import os provides the functions that direct Python to a specific directory of choice, while import Bio imports the Biopython module into the main Python 3.4 software (Figure 3). At this point the Biopython module is not actually used, since none of the current functions require anything that Biopython provides. However, in the further development of this project, the Biopython module can be applied to make use of the functionality it offers.

As a part of managing biological data, it is necessary to ensure that all files are readable and uniformly formatted. Therefore, the “file” call uses “open” in read mode, represented by the “r” code. To present all data uniformly, the nucleotides must be capitalized and the spacing must be normalized to avoid unevenly formatted data. For that, “file.upper” converts the input to uppercase, and the line-break character “\n”, which produces the large gaps in the text, is replaced with the empty string “”, so that nothing remains between the lines and the sequence becomes contiguous.

Figure 2. DNA Sequence Shown in a Text Document

Figure 3. The Inputted Coding for the whole DNAChecker Program
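The file-handling steps described above (opening in “r” mode, converting to uppercase, and removing line breaks) can be sketched as follows. This is an illustration under stated assumptions and not a reproduction of Figure 3: the function name `load_sequence` is hypothetical, and the input is assumed to be a plain-text export containing only nucleotide letters and line breaks.

```python
def load_sequence(path):
    """Read a plain-text sequence file and return one contiguous uppercase string."""
    # "r" opens the file in read-only text mode.
    with open(path, "r") as handle:
        raw = handle.read()
    # Uppercase every base, then replace each line break with the empty
    # string so that the sequence has no gaps.
    return raw.upper().replace("\n", "")
```

For example, a file containing the two lines `acgtn` and `nacg` would be returned as the single string `ACGTNNACG`.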


The “re.finditer” function is a convenient tool for identifying and matching string patterns, which are read from left to right, and “m” represents the match whose group holds the information of the sequence, shown previously as m.group(1) [13, 14]. In this case, “re.finditer” helps to identify four sequential “N”s in a sequence, from where the reading of the sequence will start. To end the search, the function has to find another group of four “N”s, as shown in the coding above. Using four “N”s as a marker is a helpful way to start from an ugly point at the beginning and to read the rest of the neat sequence until the next four “N”s. However, if the second group of four “N”s occurs before the necessary amount of sequence has been recorded, an error message is shown, saying that the sequence needs to be at least 500 base pairs long, which is set as the minimum base pair count. If the sequence count is more than 500, the program can move on to the next step.

Finally, the next set of coding acts as a rule for the quantitative measurement of the sequence, which decides the beauty of the extracted sequence. The code counts the number of “N”s in the sequence and the total number of base pairs in the sequence itself.

The sequence is then judged for its beauty according to the proportion of “N”s it contains. As instructed to the program, if the percentage of “N”s in the sequence is less than 5%, the sequence is deemed BEAUTIFUL. If the percentage is between 5% and 20%, the sequence is deemed FINE, whereas if it is between 21% and 39%, the sequence is considered OKAY and the nucleotide sequence should be reconsidered for cleaning up by replacing each N base with the best peak shown in FinchTV. Finally, if the percentage of “N”s is above 40% of the whole sequence, the sequence is deemed unreadable. The result of the input is the determined verdict of the DNA sequence quality, ranging from beautiful to unreadable. Figure 5 shows the result of a sample analyzed with DNAChecker; the total sequence length, total number of “N”s, and percentage of “N”s present within the sequence are all displayed. Finally, DNAChecker determines the quality of the sequence. Figure 6 displays different raw data sequences with different measurements based on the number of non-template nucleotides present in each individual sequence. As observed, the percentage of non-template nucleotides in the first raw data file is less than 5%, and therefore, the sequence receives a BEAUTIFUL rating. However, when the percentage of non-template nucleotides within the sequence is above 5% and below 20%, the sequence is deemed FINE. Finally, the last sequence, from the Raw_12.txt file, has multiple “N”s before its minimum limit, i.e., 500 base pairs. The result prints out as shown.

There are no computational means of measuring DNA sequence quality, as DNA amplicons can only be treated via wet laboratory methods. Hence, there is no developed standard for determining how clean a sequence should be before being processed for any experiment.

Similarly, there is no standard for the percentage measurements used in DNAChecker, but the brackets are based on fine estimates of the number of non-template nucleotides that occur within the middle section of the sequence. The beginning and the ending of the sequence tend to be very “noisy,” and therefore, the sequence as a whole may not qualify as a good sample. However, we utilize the multiple N’s at the beginning and the end of the sequence to provide the starting and ending points for the program to read the sequence. This creates a suitable measure of the significantly more important and neater part of the sequence, which improves on earlier efforts [15-17].

DNAChecker has proven to be useful for the general identification of clean sequences for BLAST. This program can be applied in various projects that deal with sequences from different organisms, since it acts as an efficient tool to filter out the good sequences from the bad ones while providing sequence-based information that can be used in the database for further comparison. Under a research grant known as the USAID grant, the Indonesia International Institute for Life Sciences (i3L) has been working on the microbial diversity of lands, as well as on the identification of biofuel-potent microbes [6-8]. DNAChecker can be employed in the USAID project, as it can analyze the identified sequences from the microbes and, after determining the cleanliness of a sequence, provide decent secondary evidence for the accuracy of the BLAST results, as well as improve the databases with more information. In the research world, where scientific measures for various processes, such as next-generation sequencing and metagenomics, are carried out on computers, DNAChecker holds the important role of pre-analyzing such digital results, which helps in assuring the purity of the sequence.

DNAChecker is intended to be one of three different programs, along with multiBLAST and GenBank Checker. MultiBLAST is the process of analyzing several files together using BLAST to work efficiently on multiple sequences and obtain data faster. Although several multiBLAST codings are provided on the internet [18], this project intends to integrate several personal touches into a newer multiBLAST coding, with the previous coding acting as a backbone. GenBank Checker introduces another level of advanced data collection, which builds on the step after multiBLAST, to identify the microorganism, along with its details and especially its trusted publications.
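Returning to the DNAChecker steps described in this section, the four-“N” delimiter search, the 500-base-pair minimum, and the percentage brackets can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' program. The pattern `N{4,}(.*?)N{4,}` (four or more “N”s on either side) is an assumption chosen so that longer noisy runs are skipped as a whole; the function names are hypothetical; the label UNREADABLE and the handling of the boundaries between the published brackets (20% to 21% and 39% to 40%) are also assumptions, since the text does not fix them. The input is assumed to be already uppercased and free of line breaks.

```python
import re

MIN_LENGTH = 500  # minimum base-pair count stated in the text

# Assumed pattern: a run of four or more "N"s, the region to keep
# (captured lazily as group 1), then the next run of four or more "N"s.
DELIMITED = re.compile(r"N{4,}(.*?)N{4,}")


def extract_region(sequence, min_length=MIN_LENGTH):
    """Return the stretch between the first two delimiting runs of "N"s."""
    m = next(DELIMITED.finditer(sequence), None)
    if m is None:
        raise ValueError("no region delimited by two runs of four N's was found")
    region = m.group(1)
    if len(region) < min_length:
        raise ValueError(
            "sequence needs to be at least {} base pairs long".format(min_length)
        )
    return region


def n_percentage(region):
    """Percentage of positions in `region` that are the letter N."""
    if not region:
        raise ValueError("empty region")
    return 100.0 * region.count("N") / len(region)


def verdict(percent_n):
    """Map a percentage of N's to a quality label (boundary handling assumed)."""
    if percent_n < 5:
        return "BEAUTIFUL"
    if percent_n <= 20:
        return "FINE"
    if percent_n < 40:
        return "OKAY"
    return "UNREADABLE"  # assumed label; the source only says "unreadable"
```

A typical call chain would be `verdict(n_percentage(extract_region(sequence)))`, with the `ValueError` playing the role of the error message shown when the delimited region is shorter than 500 base pairs.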


Figure 5. The Result of the First Sequence Shown

Figure 6. The Results of Various Sequences from the Initial Raw Data

GenBank Checker is intended to access the GenBank accession number of the given sequence, a capability that already exists in the Biopython module [6, 19]. In addition, GenBank Checker is meant to save most of the necessary selected information into the database, allowing the database to be significantly more detailed.

Other than improvements through the development of other programs, DNAChecker itself can be improved by having a more detailed quantitative measurement, with more levels of cleanliness. This program has the potential to be scaled up for genome and proteome annotation filtering methods that currently still use simple scripting applications [20–22]. One shortcoming that must be fixed is the identification of the N base based on the highest peak that can be visualized in FinchTV; coding that efficiently provides this facility needs to be added.

3. Conclusion

Advancements in computational science for biological data have allowed the creation of a simple program that can provide additional assurance of the accuracy of biological data. This report also sheds some light on the fact that there are no significant criteria governing the quality of sequences uploaded onto DNA databases. With the help of DNAChecker, a basis can be adopted to ensure that only sequences of high quality are uploaded onto the online databanks. The use of DNAChecker is not limited to the USAID project, since it can be used to analyze any given sequence as long as the program can read a chain of four N’s as the beginning and the ending of the sequence, so that it can print out the intended results. Considering the above, DNAChecker can offer a lot in the world of research.

Acknowledgement

The authors would like to thank the Institute of Research and Community Empowerment, Indonesia International Institute for Life Sciences, and the I3L-USAID Project for supporting this research. Thanks also go to Faried Irmansyah and his team from the I3L IT Department for providing support and infrastructure. Lastly, thanks also go to Direktorat Riset dan Pengabdian Masyarakat, Direktorat Jenderal Penguatan Riset dan Pengembangan Kementerian Riset, Teknologi dan Pendidikan Tinggi Republik Indonesia for providing Hibah Penelitian Dasar DIKTI/LLDIKTI III 2019 No. 1/AKM/PNT/2019. The authors declare that there is no conflict of interest.

References

[1] M. Johnson, I. Zaretskaya, Y. Raytselis, Y. Merezhuk, S. McGinnis, T.L. Madden, Nucleic Acids Res. 36 (2008) W5-9.
[2] P. Rice, I. Longden, A. Bleasby, Trends Genet. 16 (2000) 276.
[3] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, E.W. Sayers, Nucleic Acids Res. 37 (2009) D26.
[4] B. Steipe, B. Schiller, A. Plückthun, S. Steinbacher, J. Mol. Biol. 240 (1994) 188.
[5] E.H. Akand, K.M. Downard, Mol. Phylogenet. Evol. 112 (2017) 209.
[6] P.J.A. Cock, T. Antao, J.T. Chang, B.A. Chapman, C.J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski, M.J.L. de Hoon, Bioinformatics 25 (2009) 1422.
[7] B. Chapman, J. Chang, ACM SIGBIO Newsl. 20 (2000) 15.
[8] M.I. Khan, C. Sheel, Am. J. Bioinforma. 2 (2013) 15.
[9] D.P. M., R. Prabha, A. Rai, D.K. Arora, Am. J. Bioinforma. 1 (2012) 10.
[10] R.M. Al-Khatib, R. Abdullah, N.A. Rashid, J. Comput. Sci. 5 (2009) 680.
[11] K.N. Mishra, D.A. Aaggarwal, D.E. Abdelhadi, D.P.C. Srivastava, Int. J. Comput. Appl. 3 (2010) 39.
[12] N. Tabuchi, E. Sumii, A. Yonezawa, Electron. Notes Theor. Comput. Sci. 75 (2003) 95.


[13] J.C. Brown, J. Virol. Antivir. Res. 5 (2016) 1.
[14] B. Steele, J. Chandler, S. Reddy, in: Algorithms Data Sci., Springer International Publishing, Cham, 2016, pp. 313–342.
[15] T. Cajka, L.A. Garay, I.R. Sitepu, K.L. Boundy-Mills, O. Fiehn, J. Nat. Prod. 79 (2016) 2580.
[16] L.A. Garay, I.R. Sitepu, T. Cajka, O. Fiehn, E. Cathcart, R.W. Fry, A. Kanti, A. Joko Nugroho, S.A. Faulina, S. Stephanandra, J.B. German, K.L. Boundy-Mills, J. Ind. Microbiol. Biotechnol. 44 (2017) 1.
[17] L.A. Garay, I.R. Sitepu, T. Cajka, I. Chandra, S. Shi, T. Lin, J.B. German, O. Fiehn, K.L. Boundy-Mills, J. Ind. Microbiol. Biotechnol. 43 (2016) 887.
[18] P.J.A. Cock, T. Antao, J.T. Chang, B.A. Chapman, C.J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski, M.J.L. de Hoon, Bioinformatics 25 (2009) 1422.
[19] B. Chapman, Genome Informatics 299 (2003) 298.
[20] A.A. Parikesit, P.F. Stadler, S.J. Prohaska, Open Ser. Informatics 4 (2012) 1.
[21] A.A. Parikesit, S. Prohaska, P. Stadler, N. Biotechnol. 27 (2010) S44.
[22] A.A. Parikesit, P.F. Stadler, S.J. Prohaska, in: Ext. Abstr. Ger. Conf. Bioinforma., 2011, pp. 9–11.
