Split FASTA by x #51

Open
DLBPointon opened this issue Nov 22, 2024 · 6 comments

DLBPointon commented Nov 22, 2024

We currently have split by X number of records and split by Y size (size on disk), but no split into Z chunks (a fixed number of output files).
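As a rough illustration of what "split into Z chunks" could mean, here is a minimal sketch using only the Rust standard library. It is not FasMan's actual implementation: the function names `records_per_file` and `split_into_chunks` are hypothetical, and it assumes the records are already in memory as a slice. It uses ceiling division, which matches the 6632 records per file reported for the 66314-record mouse_test run in this thread.

```rust
/// Records each output file should hold so that `total` records fit into
/// at most `chunks` files (ceiling division). Hypothetical helper.
fn records_per_file(total: usize, chunks: usize) -> usize {
    assert!(chunks > 0, "chunk count must be non-zero");
    (total + chunks - 1) / chunks
}

/// Split a slice of records into at most `chunks` contiguous groups.
/// Hypothetical helper; assumes all records are already in memory.
fn split_into_chunks<T>(records: &[T], chunks: usize) -> Vec<&[T]> {
    // `.max(1)` avoids a zero chunk size when the input is empty.
    let per_file = records_per_file(records.len(), chunks).max(1);
    records.chunks(per_file).collect()
}

fn main() {
    // Figures from the mouse_test run in this thread: 66314 records, 10 files.
    println!("RECORDS PER FILE: {}", records_per_file(66314, 10));

    // Small stand-in data set: 25 record ids split into at most 10 groups.
    let ids: Vec<u32> = (0..25).collect();
    for (i, group) in split_into_chunks(&ids, 10).iter().enumerate() {
        println!("chunk {:02}: {} records", i, group.len());
    }
}
```

One caveat with ceiling division: it can produce fewer than Z files. With 25 records and Z = 10, each file gets 3 records, so only 9 files are written. Spreading the remainder across the first few files would always give exactly Z files when there are at least Z records.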

@DLBPointon DLBPointon self-assigned this Nov 22, 2024

DLBPointon commented Nov 25, 2024

oooft

WELCOME TO Fasta Manipulator
This has been made to help prep data for use in the Treeval and curationpretext pipelines
ONLY THE yamlvalidator IS SPECIFIC TO TREEVAL, THE OTHER COMMANDS CAN BE USED FOR ANY OTHER PURPOSE YOU WANT
RUNNING SUBCOMMAND: |
-- splitbyx
RUNNING ON: |
-- linux
"mouse_test"
COUNT OF RECORDS IN FILE IS: 66314
WE WANT 10 PER FILE
RECORDS PER FILE: 6632
TOTAL RECORDS SORTED: 66314 <-- Double check counter
docker run -v /Users/dp24/Documents/FasMan/test:/app -it  fasman splitbyx -f   0.02s user 0.02s system 0% cpu 34.975 total


dp24@mib119764s FasMan % time docker run -v /Users/dp24/Documents/FasMan/test:/app -it c0f51d916736fccba1c186c69821d268754ebf9ce186159307c2b6174960e84c pyfasta split -n 10 /app/mouse_test.fa
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
creating new files:
/app/mouse_test.00.fa
/app/mouse_test.01.fa
/app/mouse_test.02.fa
/app/mouse_test.03.fa
/app/mouse_test.04.fa
/app/mouse_test.05.fa
/app/mouse_test.06.fa
/app/mouse_test.07.fa
/app/mouse_test.08.fa
/app/mouse_test.09.fa
docker run -v /Users/dp24/Documents/FasMan/test:/app -it  pyfasta split -n 10  0.02s user 0.02s system 0% cpu 5.230 total

That's a roughly 30 second delta (34.975 s total for fasman versus 5.230 s for pyfasta)!

@DLBPointon

mouse_test.fa is a 35 MB FASTA file containing 66314 records.

@DLBPointon

Interestingly, running this outside of the container results in:

./target/release/fasman splitbyx -f ./test/mouse_test.fa -c 10  0.23s user 2.86s system 96% cpu 3.192 total

which is backed up with samply:

samply record ./target/release/fasman splitbyx -f ./test/mouse_test.fa -c 10
[image: samply profile]

This tells me that the container is starving the process, which, to me at least, suggests the Python process is more efficient under the same constraints. I haven't profiled pyfasta or run it locally yet.

@DLBPointon

A significant limitation is that the noodles writer only writes 1 record at a time. A workaround would be converting records to String, joining multiple records together and "bulk" writing them that way, or making a new writer.
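The "join then bulk write" idea could look something like the sketch below. It deliberately does not use noodles: the `Record` struct, `join_records`, and `write_bulk` are hypothetical stand-ins made up for illustration, and the sequence is written on a single line without wrapping. It only shows the shape of the workaround, formatting a batch of records into one String and handing it to the writer in a single `write_all` call.

```rust
use std::io::{self, Write};

/// Minimal stand-in for a FASTA record (hypothetical, not the noodles type).
struct Record {
    name: String,
    sequence: String,
}

/// Format many records into one String so they can be written in one call.
/// Assumption: sequences are emitted unwrapped, one line per record.
fn join_records(records: &[Record]) -> String {
    let mut out = String::new();
    for r in records {
        out.push('>');
        out.push_str(&r.name);
        out.push('\n');
        out.push_str(&r.sequence);
        out.push('\n');
    }
    out
}

/// Write a whole batch with a single `write_all` instead of one write per record.
fn write_bulk<W: Write>(writer: &mut W, records: &[Record]) -> io::Result<()> {
    writer.write_all(join_records(records).as_bytes())
}

fn main() -> io::Result<()> {
    let records = vec![
        Record { name: "seq1".into(), sequence: "ACGT".into() },
        Record { name: "seq2".into(), sequence: "GGCC".into() },
    ];
    // Any `Write` implementor works here, e.g. a `std::fs::File`.
    let mut buffer = Vec::new();
    write_bulk(&mut buffer, &records)?;
    print!("{}", String::from_utf8_lossy(&buffer));
    Ok(())
}
```

Another option that may be worth trying first is wrapping the output file in `std::io::BufWriter`, which batches small writes in memory before they reach the file. Whether either approach helps here would need to be confirmed by profiling.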

@DLBPointon

Worst offenders:
[image]


DLBPointon commented Nov 26, 2024

Another tool to compare against, on the other end of the spectrum, is SeqKit, written in Go, which would be a direct "competitor" to FasMan.

This is an insanely fast tool; execution was almost instant.
