An Improved Data Structure for Efficient Storage of Multiple BIOsequences

Hasan, Md. Zahidul; Shimul, Anik Islam

Files

29 An Improved Data Structure.pdf (551.13 KB)

Date

2012-11-15

Authors

Hasan, Md. Zahidul

Shimul, Anik Islam

Publisher

Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh

Abstract

Compression of large DNA sequences has been a subject of great interest since genomic databases became available. Although two bits are sufficient to encode the four bases of DNA (namely A, G, T and C), the massive size of DNA sequences makes efficient compression necessary. In this article we propose an improved version of an existing algorithm known as “GtEncseq”, which describes a procedure for storing multiple biological sequences of variable character size, with customizable character transformations, “wildcard” and “separator” support, and a diverse group of internal representations optimized for different arrangements of wildcards and sequence lengths. Our main target is extensive compression of the data by eliminating the wildcard entries from the sequence while keeping them available for reuse. Keeping the time required to encode the desired sequence low is a further consideration.
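The idea of two bits per base with wildcards held outside the packed sequence can be illustrated with a minimal Python sketch. This is not the GtEncseq format and not the authors' data structure; the function names `encode` and `decode`, the run table layout `(start, length, character)`, and the bit order within a byte are all assumptions chosen for illustration.

```python
# Illustrative sketch only: NOT the GtEncseq format or the authors' method.
# Assumption: any character other than A, C, G, T is treated as a wildcard.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode(seq):
    """Pack the four bases at 2 bits each and move wildcards to a side table.

    Returns (packed, runs, n_bases), where runs holds one
    (start, length, character) tuple per wildcard run, with start given
    as a position in the original sequence.
    """
    codes = []
    runs = []
    for pos, ch in enumerate(seq.upper()):
        if ch in BASE_TO_BITS:
            codes.append(BASE_TO_BITS[ch])
        elif runs and runs[-1][2] == ch and runs[-1][0] + runs[-1][1] == pos:
            # Extend the current run of the same wildcard character.
            start, length, _ = runs[-1]
            runs[-1] = (start, length + 1, ch)
        else:
            runs.append((pos, 1, ch))

    # Four bases per byte, lowest-order bits first (an arbitrary choice).
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        packed[i // 4] |= code << (2 * (i % 4))
    return packed, runs, len(codes)


def decode(packed, runs, n_bases):
    """Unpack the bases, then re-insert wildcard runs at their original positions."""
    out = [BITS_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 3]
           for i in range(n_bases)]
    # Runs are stored in ascending order, so each insertion restores
    # the original coordinates for the runs that follow it.
    for start, length, ch in runs:
        out[start:start] = ch * length
    return "".join(out)


packed, runs, n_bases = encode("ACGTNNNNACGT")
restored = decode(packed, runs, n_bases)
```

In this sketch the twelve-character input packs into two bytes plus a single run entry for the four wildcards, and `decode` restores the original string. A real implementation would also need separators between sequences and a compact encoding of the run table, which are left out here.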

Description

Supervised by Prof. Dr. M. A. Mottalib; Co-Supervisor: Tareque Mohmud Chowdhury, Assistant Professor, Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh.

Citation

[1] Sascha Steinbiss and Stefan Kurtz, “A New Efficient Data Structure for Storage and Retrieval of Multiple BIOsequences”.
[2] Shanika Kuruppu, Bryan Beresford-Smith, Thomas Conway, and Justin Zobel, “Iterative Dictionary Construction for Compression of Large DNA Data Sets”.
[3] Hieu Dinh and Sanguthevar Rajasekaran, “A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly”.
[4] Sheng Bao, Shi Chen, Zhi-Qiang Jing, and Ran Ren, “A DNA Sequence Compression Algorithm Based on LUT and LZ77”.
[5] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and E.W. Sayers, “GenBank,” Nucleic Acids Research, vol. 38 (Database Issue), pp. D46-D51, 2010.
[6] A. Morgulis, G. Coulouris, Y. Raytselis, T.L. Madden, R. Agarwala, and A.A. Schaffer, “Database Indexing for Production MegaBLAST Searches,” Bioinformatics, vol. 24, no. 16, pp. 1757-1764, 2008.
[7] Srinivasa K. G., Jagadish M., Venugopal K. R., and L. M. Patnaik, “Efficient Compression of non-repetitive DNA sequences using Dynamic Programming”.
[8] E. Rivals, J-P. Delahaye, M. Dauchet, and O. Delgrange, “A guaranteed compression scheme for repetitive DNA sequences,” LIFL Lille I University technical report, page 285, 1995.
[9] Raffaele Giancarlo, Davide Scaturro, and Filippo Utro, “Textual data compression in computational biology: a synopsis,” Dipartimento di Matematica ed Applicazioni, Università di Palermo, Palermo, Italy.
[10] Marty C. Brandon, Douglas C. Wallace, and Pierre Baldi, “Data structures and compression algorithms for genomic sequence data”.
[11] Gergely Korodi and Ioan Tabus, “Compression of Annotated Nucleotide Sequences”.
[12] “The NCBI C Toolkit,” ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools, 2011.
[13] W.J. Kent, “BLAT-the BLAST-Like Alignment Tool,” Genome Research, vol. 12, no. 4, pp. 656-664, 2002.
[14] A. Döring, D. Weese, T. Rausch, and K. Reinert, “SeqAn an Efficient, Generic C++ Library for Sequence Analysis,” BMC Bioinformatics, vol. 9, article 11, 2008.

URI

http://hdl.handle.net/123456789/1189

Collections

2012
