CAPI and Python Bindings #49
Comments
Yes, this would certainly be good to have. I'm not familiar with the binding process, but is it necessary to expose the lower-level functions (e.g. Hash data structures) for the main interface to function? These seem like implementation details that wouldn't matter to the calling code. I guess it depends on the use case you have in mind, but I would envision
Hello @ondovb, It's good to hear that you are interested in the bindings. One of the best things about Mash is that it does its job, and it does it fast, so the Python version has to meet the same expectations. There are two ways to call Mash from Python
A Python module would be overkill. I suggest using ctypes. It is true that there is some overhead when calling C functions from Python via ctypes, but for CPU-intensive functions, like
You are right about exposing the internal implementation. It is not necessary at all. I started doing it for one main reason: to get used to the Mash code. As you said, some changes to the Mash code would be necessary. Let me think a bit about it, and I will show you a brief outline of what the API should look like and how the Python library would be used. The good news is that the bindings and the CLI tools would share the same backend, so much of the functionality can be shared. I will write a specification as soon as possible. Thanks!
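To make the ctypes suggestion concrete, here is a minimal sketch of the calling pattern. It does not use any Mash API (none exists yet in this thread): the C standard library stands in for a hypothetical Mash shared library, and a POSIX system is assumed. The wrapper name `c_strlen` is invented for illustration.

```python
import ctypes
import ctypes.util

# Stand-in library: the C standard library (POSIX assumed).
# A real binding would load a hypothetical Mash shared object here instead.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare the C signature so ctypes converts arguments and results correctly:
#   size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Call the C function through ctypes and return a Python int."""
    return libc.strlen(text.encode("ascii"))


length = c_strlen("ACGT")
```

The per-call conversion cost is what the overhead remark refers to. It matters little when each call does a lot of work on the C side.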
I am going to implement the first iteration using
It is going to look like the final version that uses the C API, but it is going to call Mash as a command and parse the results. If we do it this way, we can profile the functions/objects and come up with a good API. For now, what I have in mind is something like this:
import mash
input_files = ['fileOne.fna', 'fileTwo.fna']
params = mash.sketchParams()
params.kmers(16)
mash_files = mash.sketch(input_files, params)
# mash_files[0] => fileOne.fna.msh
# mash_files[1] => fileTwo.fna.msh
res = mash.dist(mash_files[0], mash_files[1])
# res => { p_value = 0.022, mash_distance = 0.27, matching_hashes = 475, total_hashes = 100 }
...
I hope to find some time for this project before next week. If you have any constructive feedback, I would really appreciate it :)
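A rough sketch of the parsing half of that idea follows. It assumes the tab-delimited `mash dist` output with five columns (reference ID, query ID, Mash distance, p-value, matching hashes written as `shared/total`). The function name `parse_dist_line` and the dictionary keys are illustrative, not an agreed API, and the sample values are made up.

```python
def parse_dist_line(line):
    """Turn one tab-delimited `mash dist` result line into a dictionary."""
    reference, query, distance, p_value, hashes = line.rstrip("\n").split("\t")
    # The last column is assumed to look like "shared/total"
    matching, total = hashes.split("/")
    return {
        "reference": reference,
        "query": query,
        "mash_distance": float(distance),
        "p_value": float(p_value),
        "matching_hashes": int(matching),
        "total_hashes": int(total),
    }


# Illustrative values only, not real Mash output
sample = "fileOne.fna\tfileTwo.fna\t0.27\t0.022\t475/1000"
res = parse_dist_line(sample)
```

Keeping the parser separate from the process call means it can be tested without Mash installed, and it can be dropped once a C API returns structured results directly.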
CTypes seems reasonable to me. My one comment about using a wrapper as a prototype is that you may ultimately want slightly lower-level exposure than is available from the
Hello @ondovb, I already implemented a "mock up" of mashpy using
I am going to start the implementation, and I would like to discuss some details with you to be sure that what I am doing is aligned with your plans. The roadmap that I have in mind is:
I can do it all at once, but I guess it is better to do it in chunks of features (first, for example, write a C++ API only for sketching the genomes and then expose it to C/Python). My questions are related to:
I am aware that there is a lot of information here, but I would like to know your thoughts on these points so that I can proceed accordingly. As always, I am open to discussing or detailing anything I have laid out here, so do not hesitate to ask. Thanks, Jordi.
First of all, congratulations on Mash.
After reading the Mash paper I decided to try it, and I am really impressed with the good results I have obtained.
Right now, sketching and comparing samples with Mash is fast and easy. However, it can only be used as a command-line tool. Although a CLI is an excellent solution for many cases, there are others that could take advantage of Mash as a library.
I would like to know what you think about having an external API (in C) that exposes the main functionality and primitives of the library. Having this C API would be nice because I could then write bindings for other languages. The first language that comes to mind in bioinformatics is Python, so it would be really nice to have a Python library for it.
I already forked the project and started to expose Mash through a C API. This is going to take me some time, so meanwhile, I used the subprocess module in Python to expose the command line, getting the results as Python dictionaries. The branches are python and mashpy respectively.
Do you think it is a good idea? Would you be interested in having this in Mash?
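For readers unfamiliar with the approach, a minimal sketch of such a subprocess wrapper is given below. It is not the code from the branches mentioned above. The helper names `run_command` and `dist_command` are invented for illustration, and the only Mash invocation assumed is `mash dist <reference> <query>`. So that the sketch runs without Mash installed, the Python interpreter stands in for the tool and prints one made-up result line.

```python
import subprocess
import sys


def run_command(argv):
    """Run a command and return its stdout as a list of lines.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout.splitlines()


def dist_command(reference, query, executable="mash"):
    """Build the argument list for a `mash dist` call (not executed here)."""
    return [executable, "dist", reference, query]


# Stand-in for Mash: a child Python process printing one made-up,
# tab-delimited result line.
fake_tool = [sys.executable, "-c", "print('a.msh\\tb.msh\\t0.27\\t0.022\\t475/1000')"]
lines = run_command(fake_tool)
fields = lines[0].split("\t")
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.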