Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how to get the virus_genes/XXXX_best_genes.fna file #1

Open
tynot opened this issue Dec 13, 2024 · 2 comments
Open

how to get the virus_genes/XXXX_best_genes.fna file #1

tynot opened this issue Dec 13, 2024 · 2 comments

Comments

@tynot
Copy link

tynot commented Dec 13, 2024

I would like to express my sincere appreciation to your research group for generously sharing the analysis code. I have been trying to use a part of the code to analyze my metagenomic sequencing data. However, I have encountered a problem and I hope you can assist me. In the "General bioinformatics workflow", in the "Build a Bowtie2 mapping index of the best genes in each environment" section, I am wondering how I can obtain the virus_genes/XXXX_best_genes.fna file. I have already got the virus_genes/XXXX_virus_genes_combined.ffn and virus_genes/XXX_best_genes.txt by using choose_genes.py. I have spent some time trying to figure it out on my own, but unfortunately, I have not been successful. I believe your expertise and knowledge in this area could provide me with the necessary guidance to overcome this obstacle. Thank you so much for your attention and I look forward to your reply.

@tynot
Copy link
Author

tynot commented Dec 13, 2024

the following is my python code, it works

def extract_sequences():
best_genes_file = "health_best_genes.txt"
combined_ffn_file = "health_virus_genes_combined.ffn"
output_fna_file = "health_best_genes.fna"

with open(best_genes_file, 'r') as f:
    best_genes = [line.strip() for line in f.readlines()]

sequences_to_extract = {}
with open(combined_ffn_file, 'r') as ffn:
    current_id = ""
    current_seq = ""
    for line in ffn:
        if line.startswith(">"):
            if current_id:
                sequences_to_extract[current_id] = current_seq
            current_id = line[1:].strip()
            current_seq = ""
        else:
            current_seq += line.strip()
    if current_id:
        sequences_to_extract[current_id] = current_seq

extracted_sequences = []
for gene in best_genes:
    if gene in sequences_to_extract:
        extracted_sequences.append(">" + gene + "\n" + sequences_to_extract[gene] + "\n")

with open(output_fna_file, 'w') as out_fna:
    out_fna.writelines(extracted_sequences)

if name == "main":
extract_sequences()

@jamesck2
Copy link
Collaborator

Hi @tynot. I'm happy to see that you are able to use this code for your own analysis. I think I left out one small step in the workflow document between the running of choose_genes.py and running bowtie2-build, my apologies.

The output from choose_genes.py should be a text file that is a list of the "best" (longest, in this case) genes from the input combined.ffn file. To get the desired best_genes.fna, we need to subset the combined.ffn file. I believe I did this with seqkit grep. In your case, I think you will run the following with seqkit (it can be installed with conda, homebrew, or docker):

seqkit grep -n -f health_best_genes.txt health_virus_genes_combined.ffn -o health_best_genes.fna

Please let me know if this works for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants