fuzzy text finding #21

Ephilipz · 2022-12-26T18:12:26Z

I have a strange use case that I was hoping to use your project for. Specifically, I would like to use it for text matching rather than DNA sequencing. I understand that this may not be a typical use case for your project, but I was wondering if you could provide a simple example for how I could use it for this purpose.

Below is an example of a similar library in python for my use case.

import sys
import parasail

# extension_penalty must be <= open_penalty
open_penalty = 1
extension_penalty = 1


def calculate_score(ayah, search_term, profile=None):
    if profile is not None:
        result = parasail.sw_scan_profile_16(profile, ayah, open_penalty, extension_penalty)
    else:
        result = parasail.sw_scan_16(ayah, search_term, open_penalty, extension_penalty, parasail.blosum62)
    return result


query = 'hello world'
search_in = 'large-file.txt'

current_profile = parasail.profile_create_16(query, parasail.blosum62)

results = []
for line in open(search_in):
    line = line[:-1]
    result = calculate_score(line, query, current_profile)
    results.append((line, result.score, result.__dict__))

sorted_results = sorted(results, key=lambda item: item[1], reverse=True)

for i in range(0, 5):
    print(sorted_results[i])

Thank you!

The text was updated successfully, but these errors were encountered:

Daniel-Liu-c0deb0t · 2022-12-27T04:09:46Z

I actually designed this library to support operating on arbitrary bytes! I assume you are doing "global" comparison (each line is compared end-to-end to the query). You can modify the example here, but replace NucMatrix and NW1 with a custom byte scoring matrix. I don't think you should be using blosum62 with parasail because that score is only meaningful for amino acids. For arbitrary bytes, having a simple match bonus and mismatch penalty makes more sense.

Ephilipz · 2022-12-29T15:39:54Z

I see. So is this how I would go about it?
I noticed it's not very fast so I'm sure I must be doing something wrong here.
If there is a more efficient way please let me know. Thank you, Daniel

fn main() {
    let filename = "raw/transliteration.txt";
    let transliteration_bytes: &[u8] = &fs::read(filename).expect("No file found with the name {filename}"); 
    println!("File byte length : {}", transliteration_bytes.len());
    let input = get_user_input();

    // Setup the block alignment
    let block_size = 16;
    let gaps = Gaps {
        open: -2,
        extend: -1,
    };
    let query = PaddedBytes::from_bytes::<ByteMatrix>(input.as_bytes(), block_size);
    let reference = PaddedBytes::from_bytes::<ByteMatrix>(transliteration_bytes, block_size);

    // Align with traceback, but no x drop threshold.
    let mut a = Block::<true, false>::new(query.len(), reference.len(), block_size);
    a.align(&query, &reference, &BYTES1, gaps, block_size..=block_size, 0);
    let res = a.res();
    println!("Result is {:?}", res);
}

fn get_user_input() -> String {
   let mut line = String::new();
   stdin().read_line(&mut line).expect("unable to read input");
   return line;
}

Daniel-Liu-c0deb0t · 2022-12-31T02:39:54Z

What is the file byte length? You seem to be using block aligner correctly, but you aren't comparing each line of the file to the query input, you are comparing the whole file. Is that intentional?

Ephilipz · 2022-12-31T17:08:06Z

Byte size : 609444.
The file has many lines. 6236 to be exact.
I guess searching line by line is the best option here. Also running in release mode significantly sped things up so I'm optimistic about using your great library to accomplish this task.

Thanks

Ephilipz · 2022-12-31T22:00:33Z

Set it up to run line by line but I'm getting identical scores on lines that include the query compared to ones that don't include it.
Am I missing a step here?

// Setup the block alignment
    let block_size = 16;
    let gaps = Gaps {
        open: -2,
        extend: -1 
    };

    let query = PaddedBytes::from_bytes::<ByteMatrix>(input.as_bytes(), block_size);

    for line in &transliteration_lines {
        let reference = PaddedBytes::from_bytes::<ByteMatrix>(line.as_bytes(), block_size);

        // Align with traceback, but no x drop threshold.
        let mut a = Block::<true, false>::new(query.len(), reference.len(), block_size);
        a.align(
            &query,
            &reference,
            &BYTES1,
            gaps,
            block_size..=block_size,
            0,
        );

        let res = a.res();
        // add res score and line number to list
        ...
    }
     // sort result list by score and print the top results
     ...

Daniel-Liu-c0deb0t · 2022-12-31T23:41:28Z

Oh yes release mode is very important in Rust.

I'm a little confused as to what you are trying to do. Do you want to search for some query within each line? Or do you want to match the entire line against the query? It seems like you want to search for some query within the line (eg., searching for "hi" in the line "hello hi world").

Block aligner's scoring method is for global alignment (match the entire line against the query, end-to-end). If you want to search for a query that could be anywhere within the line, you should use triple_accel (my other library) that has a search feature. This search will also return the locations where the query appears in the line.

Ephilipz · 2023-01-07T12:19:51Z

Thank you for the explanation. I tested out your tiple_accel library and it's great: it's very performant. I am using this for arabic text matching which doesn't do very well with levenshtein. Do you know of any libraries that make use of the waterman algorithm for text matching?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fuzzy text finding #21

fuzzy text finding #21

Ephilipz commented Dec 26, 2022

Daniel-Liu-c0deb0t commented Dec 27, 2022

Ephilipz commented Dec 29, 2022

Daniel-Liu-c0deb0t commented Dec 31, 2022

Ephilipz commented Dec 31, 2022 •

edited

Loading

Ephilipz commented Dec 31, 2022

Daniel-Liu-c0deb0t commented Dec 31, 2022 •

edited

Loading

Ephilipz commented Jan 7, 2023

fuzzy text finding #21

fuzzy text finding #21

Comments

Ephilipz commented Dec 26, 2022

Daniel-Liu-c0deb0t commented Dec 27, 2022

Ephilipz commented Dec 29, 2022

Daniel-Liu-c0deb0t commented Dec 31, 2022

Ephilipz commented Dec 31, 2022 • edited Loading

Ephilipz commented Dec 31, 2022

Daniel-Liu-c0deb0t commented Dec 31, 2022 • edited Loading

Ephilipz commented Jan 7, 2023

Ephilipz commented Dec 31, 2022 •

edited

Loading

Daniel-Liu-c0deb0t commented Dec 31, 2022 •

edited

Loading