Inverse Document Frequency Genome Searching

Inverse Document Frequency Weighted
Genomic Sequence Retrieval

Kevin C. O'Kane, Ph.D.
Professor Emeritus
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50614
kc.okane@gmail.com
https://threadsafebooks.com
https://www.cs.uni.edu/~okane
Nov 11, 2023

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being: Page 1, with the Front-Cover Texts being: Page 1, and with the Back-Cover Texts being: no Back-Cover Texts.

Citation
O'Kane, K.C., The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval, Online Journal of Bioinformatics, Volume 6 (2) 162-173, 2005.
Abstract
This software presents a method to identify weighted n-gram sequence fragments in large genomic databases whose indexing characteristics permit the construction of fast, indexed sequence retrieval programs. In such programs, query processing time is determined mainly by the size of the query and the number of sequences retrieved rather than by the size of the database, as is the case in sequential-scan systems such as BLAST, FASTA, and Smith-Waterman. The weighting scheme is based on the inverse document frequency (IDF) method, a weighting formula that calculates the relative importance of indexing terms based on term distribution. In experiments (see the citation above), the relative IDF weights of all segmented, overlapping, fixed-length n-grams of length eleven in the NCBI "nt" and other databases were calculated, and the resulting n-grams were ranked and used to create an inverted index into the sequence file. The system was evaluated on test cases constructed from randomly selected known sequences that were randomly fragmented and mutated, and the results were compared with BLAST and MegaBlast for accuracy and speed. Because query processing is fast, the system is also capable of database sequence clustering, examples of which are given below.
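To make the indexing idea concrete, here is a minimal Python sketch of IDF-weighted n-gram retrieval. It is not the distributed Mumps code. The function names (`ngrams`, `build_index`, `search`), the in-memory dictionary database, and the textbook weight log(N / df) are all assumptions made for illustration; the cited paper may use a different formula and also ranks and selects n-grams before indexing, a step omitted here.

```python
import math
from collections import defaultdict


def ngrams(seq, n=11):
    """Yield every overlapping n-gram of seq (hypothetical helper)."""
    for i in range(len(seq) - n + 1):
        yield seq[i:i + n]


def build_index(db, n=11):
    """Build an inverted index and IDF weights from a dict of id -> sequence.

    Assumption: IDF is the textbook log(N / df), where N is the number of
    sequences and df is the number of sequences containing the n-gram.
    """
    postings = defaultdict(set)
    for seq_id, seq in db.items():
        for gram in ngrams(seq, n):
            postings[gram].add(seq_id)
    total = len(db)
    idf = {gram: math.log(total / len(ids)) for gram, ids in postings.items()}
    return postings, idf


def search(query, postings, idf, n=11):
    """Score each sequence by the summed IDF of the n-grams it shares with query."""
    scores = defaultdict(float)
    for gram in set(ngrams(query, n)):
        for seq_id in postings.get(gram, ()):
            scores[seq_id] += idf[gram]
    # Highest-scoring sequences first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy database; n=4 keeps the example small (the experiments used n=11).
    db = {"a": "ACGTACGT", "b": "ACGTTTTT", "c": "GGGGCCCC"}
    postings, idf = build_index(db, n=4)
    print(search("CGTAC", postings, idf, n=4))
```

In this sketch the search loop touches only the posting lists of the n-grams that occur in the query, so its cost follows the query length and the number of matching sequences rather than the total database size. N-grams that occur in every sequence get a weight of zero and contribute nothing to the score.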
General Comments
This software is mainly aimed at Linux users.
The code was developed under the Linux Mint (MATE) distribution.
Distributions
The full IDF distribution is to be found in:

Note: this distro is a simplified version of the original.

The source code is distributed in the Mumps Language distro.
Example
The example shows a search for a randomly modified known sequence (exact match) in the gbpri* database. The IDF Score column gives the IDF metric of similarity. The FASTA results (W.R. Pearson & D.J. Lipman, PNAS (1988) 85:2444-2448) come from running the same query sequence against the original gbpri input database.
