ClumpakRerun Number of runs is not consistent between K's #22

Open
RvV1979 opened this issue Oct 20, 2024 · 6 comments

@RvV1979

RvV1979 commented Oct 20, 2024

I have run admixturePipeline.py, submitClumpak.py, and distructRerun.py in a dedicated directory. All seem to have worked OK.

Now I want to run submitClumpak.py -b but I am getting the following error:
error occurred - Number of runs is not consistent between K's

I checked, and I have 20 replicates for each K (from K=2 to K=13). However, ll_all.txt contains only 19 entries for K=9 and 18 for K=11. See the attached file.

Do you know what may be going wrong?

Thanks, Robin
ll_all.txt

RvV1979 changed the title from "runEvalAdmix JSONDecodeError" to "ClumpakRerun Number of runs is not consistent between K's" on Oct 20, 2024
@stevemussmann
Owner

Did you get any warnings printed to the terminal about -NaN values being found for log likelihood values? If so, that would probably be the cause, since nothing is written to the ll_all.txt file for those replicates. I don't know why Admixture sometimes produces a -NaN for the log likelihood because Admixture is closed source.

Unfortunately, if you want to use the bestK method from clumpak, my best recommendation would be to randomly delete 1-2 values (as appropriate) for the other K values so that every K has the same number of entries.
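A minimal sketch of that workaround is below. The layout of ll_all.txt is not shown in this thread, so the function works on an in-memory mapping of K to log likelihood values; `equalize_runs` and its arguments are hypothetical names, not part of AdmixPipe.

```python
import random

def equalize_runs(ll_by_k, seed=0):
    """Randomly drop values so that every K keeps the same number of runs.

    ll_by_k: dict mapping K -> list of log likelihood values (assumed layout).
    Returns a new dict; the input is left unchanged.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    target = min(len(values) for values in ll_by_k.values())
    equalized = {}
    for k, values in ll_by_k.items():
        # Choose which positions to keep, then restore the original order.
        keep = sorted(rng.sample(range(len(values)), target))
        equalized[k] = [values[i] for i in keep]
    return equalized
```

With 20 values for most K, 19 for K=9 and 18 for K=11, every K would be reduced to 18 values.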

@RvV1979
Author

RvV1979 commented Oct 20, 2024

Many thanks for the quick reply and advice.
I have checked the admixture *.stdout files and they all have Loglikelihood values, so it must be something else.

Next, I checked my distructRerun.py output files. Among the various ClusterRuns files there are only 19 references to *.stdout files for K=9 and 18 for K=11, so it seems the 3 replicates are already missing after distructRerun.py.
Then I checked my clumpakOutput directory and found that the replicates were already missing there:

$ wc -l clumpakOutput/K\=9/*/clusterFiles 
  13 clumpakOutput/K=9/MajorCluster/clusterFiles
   4 clumpakOutput/K=9/MinorCluster1/clusterFiles
   2 clumpakOutput/K=9/MinorCluster2/clusterFiles
  19 total

$ wc -l clumpakOutput/K\=11/*/clusterFiles 
   8 clumpakOutput/K=11/MajorCluster/clusterFiles
   6 clumpakOutput/K=11/MinorCluster1/clusterFiles
   2 clumpakOutput/K=11/MinorCluster2/clusterFiles
   2 clumpakOutput/K=11/MinorCluster3/clusterFiles
  18 total

However, there are corresponding converted Q files in the clumpakOutput/input.files/ and clumpakOutput/converted.input.files/ directories and also in clumpakOutput/K=9/CLUMPP.files/FilesToIndex and clumpakOutput/K=11/CLUMPP.files/FilesToIndex.
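The same check as the `wc -l` calls above can be sketched for every K at once. This assumes the `clumpakOutput/K=<n>/<cluster>/clusterFiles` layout shown above, with one run per line; `runs_per_k` and `inconsistent_ks` are hypothetical helper names.

```python
from pathlib import Path

def runs_per_k(clumpak_dir):
    """Return {K: total non-empty lines across that K's clusterFiles}."""
    counts = {}
    for k_dir in Path(clumpak_dir).glob("K=*"):
        try:
            k = int(k_dir.name.split("=", 1)[1])
        except ValueError:
            continue  # skip anything that is not named K=<integer>
        total = 0
        for cluster_file in k_dir.glob("*/clusterFiles"):
            with cluster_file.open() as handle:
                total += sum(1 for line in handle if line.strip())
        counts[k] = total
    return counts

def inconsistent_ks(counts):
    """Return the K values whose run count is below the largest count."""
    expected = max(counts.values())
    return sorted(k for k, n in counts.items() if n != expected)
```

Calling `inconsistent_ks(runs_per_k("clumpakOutput"))` would list the K values that lost replicates.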

My conclusion is that the three replicates get lost somewhere in the submitClumpak.py step but I have no clue where and why.

Do you have any suggestion?

@stevemussmann
Owner

The whole clumpakOutput directory (with the exception of the clumpakOutput/best_results sub directory) is created by clumpak. So if the Q files are present in the results.zip file when it is provided to submitClumpak.py, which appears to be the case, then something is happening inside of clumpak to exclude some of the files. You might check what is output to the terminal by clumpak to see if it gives any clues about why some files are missing. Unfortunately I don't think I capture the stdout from clumpak in a file, but perhaps I should for cases like this...
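For reference, capturing a child process's output in a file could look roughly like the sketch below. It is not taken from submitClumpak.py; `run_and_log` is a hypothetical helper, and the actual clumpak command line is not shown in this thread, so the caller must supply it.

```python
import subprocess

def run_and_log(command, log_path):
    """Run command and write its combined stdout/stderr to log_path.

    command: list of program arguments, e.g. the clumpak invocation
             (placeholder here; the real arguments are not shown above).
    Returns the exit code of the process.
    """
    with open(log_path, "w") as log:
        result = subprocess.run(
            command,
            stdout=log,
            stderr=subprocess.STDOUT,  # keep warnings next to normal output
            text=True,
        )
    return result.returncode
```

Merging stderr into stdout keeps Perl warnings in the same file and in order with the rest of the output.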

@RvV1979
Author

RvV1979 commented Oct 21, 2024

Q files are indeed present in the results.zip:

$ unzip -l results.zip |grep 9_ |wc -l
20
$ unzip -l results.zip |grep 11_ |wc -l
20

I re-ran submitClumpak.py, redirecting its output to a log file (see clumpak_output.txt), and do not see anything out of the ordinary except "Use of uninitialized value $mean in sprintf at /app/bin//CLUMPAK.pl line 370", which looks like a warning. I also consistently see 20 entries in the MCL.out files:

$ cat clumpakOutput/K={9,11}/MCL.files/MCL.out
# cline: mcl - "-I" "2" "-tf" "gq(0.78), add(-0.78)" "-o" "clumpakOutput/K=9/MCL.files/MCL.out"
(mclheader
mcltype matrix
dimensions 20x4
)
(mclmatrix
begin
0  1 4 6 7 9 11 12 14 15 16 17 18 19 $
1  0 2 8 13 $
2  5 10 $
3  3 $
)
# cline: mcl - "-I" "2" "-tf" "gq(0.81), add(-0.81)" "-o" "clumpakOutput/K=11/MCL.files/MCL.out"
(mclheader
mcltype matrix
dimensions 20x6
)
(mclmatrix
begin
0  1 2 3 5 9 16 17 18 $
1  4 11 13 14 15 19 $
2  0 6 $
3  7 12 $
4  8 $
5  10 $
)

Note that in the MCL clustering output for K=9, cluster 3 is a singleton (member 3, which is replicate 9_4) that is not clustered with any other replicate. In the output for K=11, clusters 4 and 5 are also singletons (members 8 and 10, which are replicates 11_9 and 11_11, respectively). It is exactly these replicates in singleton clusters that are missing from the clumpak output. Therefore, I think Clumpak does not output singletons, which causes the inconsistent number of runs.
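Finding the singletons by eye can be replaced with a small parser. The sketch below assumes the MCL.out layout shown above (rows between `begin` and `)`, each row being a cluster index followed by its members and ending in `$`) and reads the first matrix in the text; the function names are hypothetical.

```python
def parse_mcl_clusters(text):
    """Return a list of clusters, each a list of 0-based run indices."""
    tokens = []
    in_matrix = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "begin":
            in_matrix = True
        elif in_matrix and stripped == ")":
            break  # end of the first matrix
        elif in_matrix:
            tokens.extend(stripped.split())
    clusters, row = [], []
    for token in tokens:
        if token == "$":  # "$" terminates a row, which may span lines
            if row:
                clusters.append([int(t) for t in row[1:]])  # drop cluster index
            row = []
        else:
            row.append(token)
    return clusters

def singleton_runs(text):
    """Return the 0-based run indices that sit alone in a cluster."""
    return [c[0] for c in parse_mcl_clusters(text) if len(c) == 1]
```

Applied to the K=9 matrix above this returns `[3]`, and to the K=11 matrix `[8, 10]`; adding 1 gives the replicate numbers 9_4, 11_9 and 11_11.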

Is there a reason why AdmixPipe does not allow such inconsistent numbers?

Thanks again

@stevemussmann
Owner

To be honest this has never come up before, but that is probably because I rarely run the bestK pipeline from clumpak. I don't find it to be especially informative.

Thanks for bringing this to my attention. I will have to revise the code so that it checks for replicates that are not present in the clumpak output when I do my next revision; probably sometime this winter.

stevemussmann self-assigned this on Oct 21, 2024
@stevemussmann
Owner

stevemussmann commented Dec 13, 2024

Sorry for the delayed response - I've been busy with finishing up some projects and it has taken a while to circle back to these issues.

I think I figured out a solution for this by digging through the clumpak code. There is a setting called mclminclusterfraction in the clumpak code. Its purpose is never explained in the clumpak documentation, but it seems to determine whether a minor cluster will be included among the clumpak output files.

Briefly, the default value is 0.1 (set in the clumpak code itself), and this seems to be the minimum threshold for including a cluster in the output. That is, if the replicates in a cluster do not represent at least a fraction mclminclusterfraction = 0.1 of the total replicates for a particular K, the cluster is excluded from the outputs. In your case, singletons were being excluded because each one represented only 1/20 = 0.05 of the total replicates, which is below the 0.1 threshold.

So if you set this number to a low value (e.g., mclminclusterfraction ≤ 0.05) then singleton clusters should be included in the outputs. This is already implemented in submitClumpak.py as the -d / --DISTRUCT option. However, I now need to update my own documentation to make the purpose of this option clearer, and I will perhaps change the default behavior so that this option is automatically calculated to include singleton clusters among clumpak outputs.
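The arithmetic can be checked against the cluster sizes reported earlier in this thread. This is a sketch of the rule as inferred here (a cluster is kept when its share of the runs is at least the threshold), not a copy of the clumpak code; `kept_runs` is a hypothetical name.

```python
from fractions import Fraction

def kept_runs(cluster_sizes, min_fraction="0.1"):
    """Count the runs that survive the assumed minimum cluster fraction rule.

    cluster_sizes: number of runs in each MCL cluster for one K.
    min_fraction:  threshold as a decimal string, to avoid float rounding.
    """
    total = sum(cluster_sizes)
    threshold = Fraction(min_fraction)
    return sum(s for s in cluster_sizes if Fraction(s, total) >= threshold)

print(kept_runs([13, 4, 2, 1]))             # K=9, default threshold
print(kept_runs([8, 6, 2, 2, 1, 1]))        # K=11, default threshold
print(kept_runs([13, 4, 2, 1], "0.05"))     # K=9, singletons included
```

With the default of 0.1 this gives 19 runs for K=9 and 18 for K=11, matching the clusterFiles counts above, and 20 once the threshold is lowered to 0.05.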

I will leave this issue open until I have a chance to implement these changes.

Edit: the fix is implemented in the GitHub repository version. I still need to update the documentation in README.md to reflect the changes. The fix will be pushed to the Docker container in the next Docker update.
