how to get the virus_genes/XXXX_best_genes.fna file #1

tynot · 2024-12-13T11:20:34Z

I would like to express my sincere appreciation to your research group for generously sharing the analysis code. I have been trying to use a part of the code to analyze my metagenomic sequencing data. However, I have encountered a problem and I hope you can assist me. In the "General bioinformatics workflow", in the "Build a Bowtie2 mapping index of the best genes in each environment" section, I am wondering how I can obtain the virus_genes/XXXX_best_genes.fna file. I have already got the virus_genes/XXXX_virus_genes_combined.ffn and virus_genes/XXX_best_genes.txt by using choose_genes.py. I have spent some time trying to figure it out on my own, but unfortunately, I have not been successful. I believe your expertise and knowledge in this area could provide me with the necessary guidance to overcome this obstacle. Thank you so much for your attention and I look forward to your reply.

tynot · 2024-12-13T13:52:09Z

The following is my Python code; it works:

def extract_sequences():
    best_genes_file = "health_best_genes.txt"
    combined_ffn_file = "health_virus_genes_combined.ffn"
    output_fna_file = "health_best_genes.fna"

    # Read the list of selected gene names, one per line.
    with open(best_genes_file, 'r') as f:
        best_genes = [line.strip() for line in f]

    # Parse the combined FASTA into a {header: sequence} dictionary.
    sequences_to_extract = {}
    with open(combined_ffn_file, 'r') as ffn:
        current_id = ""
        seq_parts = []
        for line in ffn:
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if current_id:
                    sequences_to_extract[current_id] = "".join(seq_parts)
                current_id = line[1:].strip()
                seq_parts = []
            else:
                seq_parts.append(line.strip())
        # Store the last record in the file.
        if current_id:
            sequences_to_extract[current_id] = "".join(seq_parts)

    # Keep only the genes named in the list, in list order.
    extracted_sequences = []
    for gene in best_genes:
        if gene in sequences_to_extract:
            extracted_sequences.append(">" + gene + "\n" + sequences_to_extract[gene] + "\n")

    with open(output_fna_file, 'w') as out_fna:
        out_fna.writelines(extracted_sequences)


if __name__ == "__main__":
    extract_sequences()

jamesck2 · 2024-12-14T19:22:06Z

Hi @tynot. I'm happy to see that you are able to use this code for your own analysis. I think I left out one small step in the workflow document between the running of choose_genes.py and running bowtie2-build, my apologies.

The output from choose_genes.py should be a text file that is a list of the "best" (longest, in this case) genes from the input combined.ffn file. To get the desired best_genes.fna, we need to subset the combined.ffn file. I believe I did this with seqkit grep. In your case, I think you will run the following with seqkit (it can be installed with conda, homebrew, or docker):

seqkit grep -n -f health_best_genes.txt health_virus_genes_combined.ffn -o health_best_genes.fna

Please let me know if this works for you.
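
Whichever route is used (the Python script above or seqkit grep), it may be worth confirming that every gene in the list actually ended up in the .fna file. The following is only a minimal sketch of such a check, not part of the original workflow. It assumes that each line of the best-genes list matches the full FASTA header text after ">", which is how both the script and `seqkit grep -n` match; the function names `fasta_headers`, `missing_ids`, and `check_files` are hypothetical.

```python
def fasta_headers(lines):
    """Return the set of full header names (text after '>') found in FASTA lines."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def missing_ids(wanted_lines, fasta_lines):
    """Return listed gene names that have no matching FASTA header, in list order."""
    found = fasta_headers(fasta_lines)
    # Ignore blank lines in the list file.
    wanted = [line.strip() for line in wanted_lines if line.strip()]
    return [name for name in wanted if name not in found]


def check_files(list_path, fasta_path):
    """Hypothetical helper: open both files and report genes absent from the FASTA."""
    with open(list_path) as ids, open(fasta_path) as fasta:
        return missing_ids(ids, fasta)
```

Calling `check_files("health_best_genes.txt", "health_best_genes.fna")` should return an empty list if the subsetting worked. A non-empty list usually means the names in the list and the headers in the .ffn file differ, for example by a trailing description.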
