Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Training data preparation issues #12

Closed
smruti241 opened this issue May 17, 2021 · 60 comments
Closed

Training data preparation issues #12

smruti241 opened this issue May 17, 2021 · 60 comments

Comments

@smruti241
Copy link

I have run the command:
./merge.sh ./train/outdir as mentioned in your paper
A little script format to do this is supplied in merge.sh, to run this for subset a use:

./merge.sh ./train/a
but it is giving me this:
Working on Chr: 1
Usage:
qsub [-a start_time] [-A account] [-b y|n] [-e err_path] [-I] [-l
resource_list] [-m mail_options] [-M user_list] [-N job_name] [-o
out_path] [-p priority] [-pe shm task_cnt] [-P wckey] [-q destination]
[-r y|n] [-v variable_list] [-V] [-wd workdir] [-W
additional_attributes] [-h] [script]
It is asking for some parameters. I also saw your script for merge.sh and it has only qsub command. What shall I do? And your training data section is not understandable. I request you to please help me regarding this.

@rstraver
Copy link
Collaborator

Looks like there was a leftover piece assuming it would be ran on a HPC system, try replacing that script with this:

INDIR=`readlink -f $1`

for i in $(seq 1 22)
do
	echo Working on Chr: $i
	sort -n -m $INDIR/*/*.$i.start.fwd $INDIR/*/*.$i.start.rev > $INDIR/merge.$i
done

I only removed the "echo" and "qsub" pipe at the end here, similar to the other merge* scripts.

@smruti241
Copy link
Author

ok I will try. Please guide me throughout the training process.

@smruti241
Copy link
Author

Looks like there was a leftover piece assuming it would be ran on a HPC system, try replacing that script with this:

INDIR=`readlink -f $1`

for i in $(seq 1 22)
do
	echo Working on Chr: $i
	sort -n -m $INDIR/*/*.$i.start.fwd $INDIR/*/*.$i.start.rev > $INDIR/merge.$i
done

I only removed the "echo" and "qsub" pipe at the end here, similar to the other merge* scripts.

It worked. Thank you so much

@smruti241
Copy link
Author

After using this command, I used this:
for SUBSET in train/merge.*; do sh merge_1.sh ./train/; done
it is giving me this:
Working on Chr: 1
Working on Chr: 2
Working on Chr: 3
Working on Chr: 4
Working on Chr: 5
Working on Chr: 6
Working on Chr: 7
Working on Chr: 8
Working on Chr: 9
Working on Chr: 10
Working on Chr: 11
Working on Chr: 12
Working on Chr: 13
Working on Chr: 14
Working on Chr: 15
Working on Chr: 16
Working on Chr: 17
Working on Chr: 18
Working on Chr: 19
Working on Chr: 20
Working on Chr: 21
Working on Chr: 22
Working on Chr: 1
Working on Chr: 2
Working on Chr: 3
Working on Chr: 4
Working on Chr: 5
Working on Chr: 6
Working on Chr: 7
Working on Chr: 8
Working on Chr: 9
Working on Chr: 10
Working on Chr: 11
Working on Chr: 12
Working on Chr: 13
Working on Chr: 14
Working on Chr: 15
Working on Chr: 16
Working on Chr: 17
Working on Chr: 18
Working on Chr: 19
Working on Chr: 20
Working on Chr: 21
Working on Chr: 22
Working on Chr: 1
Working on Chr: 2
Working on Chr: 3
Working on Chr: 4
Working on Chr: 5
Working on Chr: 6
Working on Chr: 7
Working on Chr: 8
Working on Chr: 9
Working on Chr: 10
Working on Chr: 11
Working on Chr: 12
Working on Chr: 13
Working on Chr: 14
Working on Chr: 15
Working on Chr: 16
Working on Chr: 17
Working on Chr: 18
Working on Chr: 19
Working on Chr: 20
Working on Chr: 21
Working on Chr: 22
Working on Chr: 1
Working on Chr: 2
Working on Chr: 3
Working on Chr: 4
Working on Chr: 5
Working on Chr: 6
Working on Chr: 7
Working on Chr: 8

But it didnt create any file. Is there any error? I changed the command but i couldnt understand why it is not working

@smruti241
Copy link
Author

Even mergeSubs.sh script is also not working for me.

@rstraver
Copy link
Collaborator

In your call:

for SUBSET in train/merge.*; do sh merge_1.sh ./train/; done

You seem to wildcard a set of files: train/merge.* (at least that . insinuates that).
Looking at the call in the script:

sort -n -m $INDIR/*/*.$i.start.fwd $INDIR/*/*.$i.start.rev > $INDIR/merge.$i

It wants a folder with different subfolders, where every subfolder contains files (ending with .start.fwd and .start.rev).
Are you sure your folder structure matches? i.e. folder/subfolder/files?

@smruti241
Copy link
Author

yeah my folder structure matches. I have created .start.fwd and .start.rev in this folder /train/outdir. I used your command for merging:
INDIR=readlink -f $1

for i in $(seq 1 22)
do
echo Working on Chr: $i
sort -n -m $INDIR//.$i.start.fwd $INDIR//.$i.start.rev > $INDIR/merge.$i
done
Then I have used this command for merging subsets by using mergeSubs.sh:
INDIR=readlink -f $1

for i in $(seq 1 22)
do
echo Working on Chr: $i
sort -n -m $INDIR/*/merge.$i > $INDIR/merge.$i
done
When I am running this, it is creating the file but empty; again when I am running it is creating same file which is created in first step.

@smruti241
Copy link
Author

I used this command:
for SUBSET in a b c d e f g h i j k l; do merge.sh ./train/$SUBSET; done
but i think it is used for multiple files. Or it has some other function?

@smruti241
Copy link
Author

I am using this command:
python nuclDetector_1.py ./train//subset//merge.1 ./train/py_dir/
it is giving me this error:
Traceback (most recent call last):
File "/media/mdrcubuntu/C8CE59DBCE59C1FC/new_data/sanefalcon/nuclDetector_1.py", line 132, in
flush(curArea,lastPos)
File "/media/mdrcubuntu/C8CE59DBCE59C1FC/new_data/sanefalcon/nuclDetector_1.py", line 109, in flush
allNucl.extend([[x[0]+startPoint]+x[1:]+extra[x[0]] for x in newCenters])
File "/media/mdrcubuntu/C8CE59DBCE59C1FC/new_data/sanefalcon/nuclDetector_1.py", line 109, in
allNucl.extend([[x[0]+startPoint]+x[1:]+extra[x[0]] for x in newCenters])
TypeError: list indices must be integers or slices, not float

shall I change in script or please tell me some ideas.

@rstraver
Copy link
Collaborator

I'm trying to understand what you are running exactly as it seems it's not looking at the folder structure the way it's intended.

This step:
for SUBSET in a b c d e f g h i j k l; do merge.sh ./train/$SUBSET; done
Should create a file per batch set in the ./train/ directory.

  1. Does that work (and is there something in that file)?
  2. Are you running the scripts as the bash files they are or copy pasting their code into a terminal?
  3. Could you provide a tree output so I can see what your structure exactly looks like?

@smruti241
Copy link
Author

I think it will be used for multiple files. I will solve this later. But please help me in mergeSubs.sh

@smruti241
Copy link
Author

yeah my folder structure matches. I have created .start.fwd and .start.rev in this folder /train/outdir. I used your command for merging:
INDIR=readlink -f $1

for i in $(seq 1 22)
do
echo Working on Chr: $i
sort -n -m $INDIR//.$i.start.fwd $INDIR//.$i.start.rev > $INDIR/merge.$i
done
Then I have used this command for merging subsets by using mergeSubs.sh:
INDIR=readlink -f $1

for i in $(seq 1 22)
do
echo Working on Chr: $i
sort -n -m $INDIR/*/merge.$i > $INDIR/merge.$i
done
When I am running this, it is creating the file but empty; again when I am running it is creating same file which is created in first step.

This one

@rstraver
Copy link
Collaborator

I can't help you on that one if the input data is wrong: you are trying to run a merge step that depends on the previous merge step which apparently didn't work.
Please provide the info I asked previously first.

@smruti241
Copy link
Author

I'm trying to understand what you are running exactly as it seems it's not looking at the folder structure the way it's intended.

This step:
for SUBSET in a b c d e f g h i j k l; do merge.sh ./train/$SUBSET; done
Should create a file per batch set in the ./train/ directory.

1. Does that work (and is there something in that file)?

2. Are you running the scripts as the bash files they are or copy pasting their code into a terminal?

3. Could you provide a `tree` output so I can see what your structure exactly looks like?

I used these file: IonXpress_006.bam
My directory format is: /media/mdrcubuntu/C8CE59DBCE59C1FC/new_data/sanefalcon/train/outdir
As you mentioned /train/a/ should be the folder name and the start files are 44 in number. its format is IonXpress_006.bam.20.start.fwd and IonXpress_006.bam.20.start.rev. Each of them are upto 5 MB in size. Thats the result I got from ./prepSamples.sh (command:./prepSamples.sh ../IonXpress_006.bam outdir). Then I used ./merge_1.sh ./train/ command for merging.
I used this script for merging:
INDIR=readlink -f $1

for i in $(seq 1 22)
do
echo Working on Chr: $i
sort -n -m $INDIR//.$i.start.fwd $INDIR//.$i.start.rev > $INDIR/merge.$i
done
This created files such as merge.1, merge.2, etc and directory is /train/a.
After creating this, I used command ./mergeSubs_1.sh ./train for merging subsets. But I got same files and file size earlier in merging process, no changes at all.

@rstraver
Copy link
Collaborator

So there is no subfolders in /train/a/? The script assumes there is, as per description in the readme:

    ./test -- contains non training samples
    ./train -- contains only subfolders
    ./train/a -- contains only subfolders, 26 samples total
    ./train/a/run1 -- contains 10 samples
    ./train/a/run3 -- contains 04 samples
    ./train/a/run6 -- contains 12 samples
    ./train/b -- contains only subfolders, 23 samples total
    ./train/b/run2 -- contains 16 samples
    ./train/b/run4 -- contains 07 samples
    ./train/c/... -- contains only subfolders,
    ... -- ...

The scripts are targeting explicitly files in paths similar to /train/a/run1/, if the deepest subfolder (here /run1) doesn't exist or contain your samples (and you put them one level higher, directly in a) the scripts will fail.

Can you check if that is correct/send a folder structure tree so I can tell?

@smruti241
Copy link
Author

no I didnt create the folder in /train/a/ or train/outdir/

@smruti241
Copy link
Author

I only created /train/outdir and /train/a/ for merging. Should i create one more folder in a/ and outdir/?

@rstraver
Copy link
Collaborator

rstraver commented May 18, 2021

Create a folder train/a/run1 and put the output from prepSample.sh (looks like $SAMPLENAME.1.start.fwd) in there, then try the merge_1.sh script again. If that produces a file with contents then the structure matches and the scripts can find the files.

@smruti241
Copy link
Author

ok i am trying.

@smruti241
Copy link
Author

yes it worked. It gave me files like merge.1, merge.2 upto merge.22. It didnt give merge.X and merge.Y

@smruti241
Copy link
Author

what should i do next?

@smruti241
Copy link
Author

Create a folder train/a/run1 and put the output from prepSample.sh (looks like $SAMPLENAME.1.start.fwd) in there, then try the merge_1.sh script again. If that produces a file with contents then the structure matches and the scripts can find the files.

After this, what shall i do? how do I run mergeSubs.sh?

@rstraver
Copy link
Collaborator

It indeed doesn't produce information for X and Y, that is by design (for ethical and sample comparability reasons).

Depending on the data you have and your intentions you may have to take a moment to consider whether the merged file you created is what you really want. I don't know if you put everything together or made a bunch of subsets (the whole train/a/run1 train/b/run3 etc. thing). If you put everything together in that train/a/run1 all your data is now combined in the merge files and training as documented in the README for fetal fraction estimation etc will require more data you didn't use here yet.
It might mean you have to redistribute data over different subfolders as described and run again so you get a set of various independent batches of samples which you can train and test using the leave-x-out folding style implementation as described in the README.

@smruti241
Copy link
Author

so shall I put all the merge files from merge.sh in run1 folder or different folder?

@smruti241
Copy link
Author

or shall I run python script directly for getting nucleosome tracks?

@smruti241
Copy link
Author

I am using this command:
python nuclDetector_1.py ./train//subset//merge.1 ./train/py_dir/
it is giving me this error:
Traceback (most recent call last):
File "/media/mdrcubuntu/C8CE59DBCE59C1FC/new_data/sanefalcon/nuclDetector_1.py", line 132, in
flush(curArea,lastPos)
File "/media/mdrcubuntu/C8CE59DBCE59C1FC/new_data/sanefalcon/nuclDetector_1.py", line 109, in flush
allNucl.extend([[x[0]+startPoint]+x[1:]+extra[x[0]] for x in newCenters])
File "/media/mdrcubuntu/C8CE59DBCE59C1FC/new_data/sanefalcon/nuclDetector_1.py", line 109, in
allNucl.extend([[x[0]+startPoint]+x[1:]+extra[x[0]] for x in newCenters])
TypeError: list indices must be integers or slices, not float

shall I change in script or please tell me some ideas.

i am getting this error while running this script

@rstraver
Copy link
Collaborator

The steps you need to take are described in the README as is, I suggest you re read that carefully as I get the impression this new error is because you didn't run it as explained there. Mind the flow diagram that's included on the github page, I believe that should provide the info needed to understand the steps.

@smruti241
Copy link
Author

I ran the commands according to your README file only. First I ran merge.sh script. It gave me results. Second I ran mergeSubs.sh script and it gave the same results as well as same size of files. What I am doing right or wrong, I dont understand, thats why I am asking you. Please help me understanding this README

@smruti241
Copy link
Author

what do you mean by subset here? subset means each start.fwd and start.rev files or each bam files?

@smruti241
Copy link
Author

I am using this command:
nuclDetectorAnti.sh ./train/

and getting this: /home/mdrcubuntu/anaconda3/envs/smruti/bin/python /media/mdrcubuntu/C8CE59DBCE59C1FC/new_data/sanefalcon/nuclDetector_1.py /media/mdrcubuntu/C8CE59DBCE59C1FC/new_data/sanefalcon/train/anti.1 /media/mdrcubuntu/C8CE59DBCE59C1FC/new_data/sanefalcon/train/nucl_ex3.1

I have modified this script according to previous but still getting this, no files are created.

@rstraver
Copy link
Collaborator

rstraver commented May 21, 2021

subset = batch in 3.1 Split data set in README.md.
As shown by the folder structure there:

./train/a/
./train/b/
./train/c/

If you end up with the same files and filesizes you only made one batch/subset.

For nuclDetectorAnti.sh ./train/ it appears you did remove the qsub part at the end but not the echo and " " around the command statement.

@smruti241
Copy link
Author

Thank you @rstraver I am working on this and will let you know if any error occurs. Can you please provide your mail ID? I have to ask about wisecondor-defrag algorithm. I am using that also

@smruti241
Copy link
Author

Also, create a nucleosome track for the whole set of training data:

./nuclDetector.sh ./train

What is whole set of training data here? Is it the anti merged data or merged data?

@smruti241
Copy link
Author

I have completed upto getting nucleosome profiles. Now I am trying this:
Use the python script as follows to see how far we've come and train a simple convertor for the future:

python predictor.py trainNucl trainRef outBaseName

My question is what is trainNucl trainRef here? which file shall i use? Please tell me

@smruti241
Copy link
Author

please @rstraver how to create trainRef file? I have created trainNucl file but I am not clear about trainRef file.

@rstraver
Copy link
Collaborator

rstraver commented May 26, 2021

Could you please refer to the schematic overview and check if you understand where you are and what you are doing?
The left hand side of the schematic overview shows a combination of all data for a nucleosome track based on all data, which is what this bit is referring to:

Also, create a nucleosome track for the whole set of training data:

./nuclDetector.sh ./train

To clarify on that: all that difficult to follow work on the right hand side is to ensure you don't detect nucleosomes on the same data as you train the FF% prediction on to ensure the knowledge it learns on is realistic (i.e. new sample not influencing nucleosome position detection). I might be able to dig up a nucleosome track we made with 300+ low coverage samples to simplify/reduce work a bit although plugging that in and using it may need a bit more of understanding what goes where too.

For the trainRef input: Looking at the way that file is read it contains one line per sample and 3 columns, split by a space:
Samplename FetalFraction SampleType

  • Samplename = used to match nucleosome data to fetal fractions
  • FetalFraction = estimation of fetal fraction for this sample, used to train the ff% estimator
  • SampleType = mark the sample on whether it should be used in training (Male) or not (Female/BAD)

Looks like I didn't document this (guess I assumed people would include their own FF% estimations or drag over from defrag in WISECONDOR) but there's 2 scripts to help you out there:
getRefFF.sh and getRefFF.py
The first digs through a folder and it's subfolders to find filenames that end with .22.start.fwd, then calls the second (getRefFF.py) with the path+base filename (without the .22.start.fwd part) to get a fetal fraction estimation.
You can probably call it with something like this:
./getRefFF.sh $OUTDIR > ff_ref
where $OUTDIR is the same as in 2.2 Extracting read start positions.
Then it dumps the following info per sample into ff_ref:
pathtosample samplename fetal% medianA medianX
I think something like this:
path/to/samplename samplename 0.09232 100 91
If you copy the second and third columns and annotate per file whether it's Male or Female (we assumed you'd know for most of the data or be able to tell form the % this one gives) and store those columns as described above you should have a file it can work with. Alternatively this data can be obtained from the old WISECONDOR defrag.py script, but that's older still and I think mostly written by a colleague of mine at that time.

FYI: I'm not super responsive on questions regarding this set of scripts as it has been over 5 years since I touched it. Meanwhile I changed jobs and I don't have access to the data I used so I find it difficult to figure out what your question really is referring to. I'm really answering just by browsing the scripts I have little memory of by now and I need to find some time to dig in for that.
Please understand I can't answer every question immediately. I know this isn't the most user friendly set of tools, I had hoped someone would pick up on the idea and improve it with an easier interface. I'm not sure what your end goal is here as I would be very hesitant to apply this in any near-diagnostic setting without a really deep understanding of how it works.

@smruti241
Copy link
Author

I got this output after using getRefFF.sh:
./train/a/run1/IonXpress_006.bam IonXpress_006.bam 200.0 206.0 0.0
fetal fraction % is 200. How is it possible? is it some error?

@smruti241
Copy link
Author

Please tell me how to get reference file. I am getting error in this.

@rstraver
Copy link
Collaborator

Seems it didn't count anything on chromosome X.

Looking through the prepSamples.sh script I assume you ran before it appears the X chromosome is commented out:
for ARG_TASKID in `seq 1 22` # or "X"
Which might explain the last 0.0 in the output you got.

This should enable it to run for everything:
for ARG_TASKID in `seq 1 22` or "X"

Or just X:
for ARG_TASKID in "X"

I'd try the latter option, and rerun to get the read start positions per sample for chromosome X. Then retry getting the fetal fractions.

@smruti241
Copy link
Author

will you try or I try?

@rstraver
Copy link
Collaborator

I don't have data so you'll have to try this.

@smruti241
Copy link
Author

ok so I have to change the script actually, right?

@smruti241
Copy link
Author

I have started from prepSamples.sh script by uncommenting or "X" but after that I am facing issues, it is not taking the X and not creating X.start.fwd. What should I add to the merge script?

@rstraver
Copy link
Collaborator

Could you copy and paste your edited code here so I can see what it looks like now?

@smruti241
Copy link
Author

I have edited nothing, just changed X to 23, then I got all the results

@smruti241
Copy link
Author

@rstraver can you please help me in this thread: VUmcCGP/wisecondor#51

its a wisecondor defrag issue

@rstraver
Copy link
Collaborator

rstraver commented May 31, 2021

I already asked the colleague that worked on that script back then to take a look on defrag, just waiting for that now.

I have edited nothing, just changed X to 23, then I got all the results

Does that mean you made it look like this:
for ARG_TASKID in "X"
or not? I can't tell if you don't show your edited code.

@smruti241
Copy link
Author

smruti241 commented May 31, 2021

No i did it in this way:
for ARG_TASKID in 'seq 1 22' "X"
after this merge.sh script was not running, so I changed X chr to 23 and changed the code:
for i in $(seq 1 23)

@rstraver
Copy link
Collaborator

rstraver commented May 31, 2021

You need to pick one of the two options I provided:

This should enable it to run for everything:
for ARG_TASKID in `seq 1 22` or "X"

Or just X:
for ARG_TASKID in "X"

The first you tried (for ARG_TASKID in 'seq 1 22' "X") omitted the or that is required. The second doesn't target chromosome X anymore.

@smruti241
Copy link
Author

when I put 'seq 1 22' or "X" it was giving me this:
IN:../filtered_006.bam, OUT:./train/a_1//filtered_006.bam
[main_samview] region "chror" specifies an invalid region or unknown reference. Continue anyway.
[main_samview] region "chror" specifies an invalid region or unknown reference. Continue anyway.

so, I removed or and ran the script. Now it gave all the .start files of 1-22 and X. after getting X.start files, I changed its name to 23 and then ran the merge script on them.

@rstraver
Copy link
Collaborator

Ah I see, my mistake on the 'or' part there.
Anyway if the X files are created then that works. No need to merge for X as we only needed the read start positions, not nucleosome tracks for chrX.
Could you make sure the files are named X still, got some contents, and try running that getRefFF.sh step again?

@smruti241
Copy link
Author

oh we dont need X for merging, I already did it actually, this will throw error again?

@rstraver
Copy link
Collaborator

I think it should be ignored by other scripts as they are not interested in X anyway, if that's what you ask.

@smruti241
Copy link
Author

ok alright then I will check with getRefFF.sh and let you know. Anyway, I will ask you if any error comes up

@smruti241
Copy link
Author

I ran the command and finally got this:
./train/a/filtered_006.bam filtered_006.bam -0.9708737864077669 206.0 207.0
fetal fraction is -0.97 but why median is 206 and 207?

@rstraver
Copy link
Collaborator

rstraver commented Jun 1, 2021

It's a simple rough estimation, it takes the median of reads per bin on autosomal chromosomes (206) and chromosome X (207). The difference between the two is assumed to be half the fetal fraction for pregnancies where the fetus is male. Depending on sequencing depth and variations it is not super exact, and if you ran this for a pregnancy where the fetus is female (or male but no fetal DNA is in the sample anymore) a small negative fraction estimation is possible (a tad more reads mapped on X than on autosomal chromosomes, relatively speaking).

This value is not what SANEFALCON aims to produce, it's just a simple reference value to train SANEFALCON on (it needs to learn with known fetal fractions (which is much easier to estimate with male fetuses)). Be sure to get these values for pregnancies where the fetus is known to be male and use the resulting fractions as reference training data for SANEFACLON training afterward.

@smruti241
Copy link
Author

after this I ran predictor.py script, but I got this error:

  • Training stage:
    Traceback (most recent call last):
    File "/media/mdrcubuntu/C8CE59DBCE59C1FC/fetal_fraction/sanefalcon/predictor_1.py", line 509, in
    main()
    File "/media/mdrcubuntu/C8CE59DBCE59C1FC/fetal_fraction/sanefalcon/predictor_1.py", line 349, in main
    samples,coverages = loadNuclFile(argv[1])
    File "/media/mdrcubuntu/C8CE59DBCE59C1FC/fetal_fraction/sanefalcon/predictor_1.py", line 47, in loadNuclFile
    values=[float(x) for x in splitLine[int(sys.argv[6]):int(sys.argv[7])]]
    IndexError: list index out of range
    what should I do after this?

@smruti241
Copy link
Author

@rstraver I read your paper comprising of using SANEFALCON with DEFRAG and seqFF. It would be very kind of you if you help me both in SANEFALCON and DEFRAG because I got stuck in the last part of this tools.

@smruti241
Copy link
Author

@rstraver I didnt get any response regarding sanefalcon and defrag. Please tell me whether I can use defrag or not because there is error in defrag script. I am not able to use sanefalcon because of the above error.

@rstraver rstraver closed this as completed Jun 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants