more use case examples
bioinfwithjudith committed Oct 5, 2023
1 parent 7c425a2 commit eed232f
Showing 48 changed files with 364 additions and 0 deletions.
@@ -0,0 +1,7 @@
{
"reference_matrix_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_1/20_genomes_ref_matrix_processed.npz",
"hash_to_idx_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_1/20_genomes_hash_to_col_idx.pkl",
"processed_org_file_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_1/20_genomes_processed_org_idx.csv",
"ksize": 31,
"ani_thresh": 0.95
}
Binary file not shown.
@@ -0,0 +1,21 @@
organism_name,original_index,processed_index,num_unique_kmers_in_genome_sketch,num_total_kmers_in_genome_sketch,genome_scale_factor
"VPFC01000001.1 [Empedobacter] haloabium strain ATCC 31962 contig1, whole genome shotgun sequence",0,0,6253,6571.0,1000
"SSEB01000012.1 Sinobacteraceae bacterium isolate Bin_35_3 c_000000004113, whole genome shotgun sequence",1,1,3170,3174.0,1000
"SSEF01000018.1 Candidatus Moranbacteria bacterium isolate Bin_68_2 c_000000001403, whole genome shotgun sequence",2,2,991,996.0,1000
"SSEL01000090.1 Rhodospirillaceae bacterium isolate Bin_26_3 c_000000001054, whole genome shotgun sequence",3,3,5696,5702.0,1000
"VIKH01000154.1 Gallionellaceae bacterium isolate X1_MetaBAT.29 scaffold_10015, whole genome shotgun sequence",4,4,2870,2878.0,1000
"VIKE01000141.1 Rhodocyclaceae bacterium isolate X1_MetaBAT.22 scaffold_10076, whole genome shotgun sequence",5,5,5129,5131.0,1000
"VIKI01000038.1 Comamonadaceae bacterium isolate X1_MetaBAT.31 scaffold_1017, whole genome shotgun sequence",6,6,5401,5410.0,1000
"VIKJ01000003.1 Chitinophagaceae bacterium isolate X1_MetaBAT.39 scaffold_1008, whole genome shotgun sequence",7,7,1984,1984.0,1000
"VKGY01000191.1 Spirochaetes bacterium isolate X1_MetaBAT.41 scaffold_10187, whole genome shotgun sequence",8,8,2572,2574.0,1000
"SHMW01000001.1 Candidatus Lokiarchaeota archaeon isolate BC3 1189800001, whole genome shotgun sequence",9,9,4203,4207.0,1000
"SHMX01000001.1 Candidatus Thorarchaeota archaeon isolate BC 1189500001, whole genome shotgun sequence",10,10,3115,3121.0,1000
"SHMU01000001.1 Candidatus Lokiarchaeota archaeon isolate BC1 1189600001, whole genome shotgun sequence",11,11,3820,3907.0,1000
"VMDM01000010.1 Nitrosopumilus sp. isolate 32_1 c_000000000023, whole genome shotgun sequence",12,12,1013,1015.0,1000
"VMDK01000027.1 Sphingobacteriia bacterium isolate 28_1 c_000000000062, whole genome shotgun sequence",13,13,2437,2445.0,1000
"VMDI01000049.1 Gammaproteobacteria bacterium isolate 24_3 c_000000000093, whole genome shotgun sequence",14,14,972,975.0,1000
"VMDJ01000165.1 Gammaproteobacteria bacterium isolate 27_1 c_000000000223, whole genome shotgun sequence",15,15,1376,1381.0,1000
"VMDH01000017.1 Gammaproteobacteria bacterium isolate 24_2 c_000000000070, whole genome shotgun sequence",16,16,916,916.0,1000
"VSSA01000053.1 Nocardioides sp. BGMRC 2183 Scaffold102_1, whole genome shotgun sequence",17,17,4575,4591.0,1000
"CP032507.1 Ectothiorhodospiraceae bacterium BW-2 chromosome, complete genome",18,18,3741,4101.0,1000
"WAAQ01000001.1 Microbacterium maritypicum strain DSM 12512 contig00001, whole genome shotgun sequence",19,19,3664,3665.0,1000
Binary file not shown.
@@ -0,0 +1,63 @@
These signatures were obtained from the /data/jzr5814/repositories/YACHT/tests/testdata directory. I don't know what is in these signatures, but we can inspect them with the following commands.


```bash
sourmash signature fileinfo sample.sig
```

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'sample.sig'
path filetype: MultiIndex
location: sample.sig
is database? no
has manifest? yes
num signatures: 1
** examining manifest...
total hashes: 49821
summary of sketches:
1 sketches with DNA, k=31, scaled=1000, abund 49821 total hashes

```bash
sourmash signature fileinfo 20_genomes_sketches.zip
```

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from '20_genomes_sketches.zip'
path filetype: ZipFileLinearIndex
location: /data/jzr5814/repositories/YACHT/use_case_examples/example_1/20_genomes_sketches.zip
is database? yes
has manifest? yes
num signatures: 20
** examining manifest...
total hashes: 63898
summary of sketches:
20 sketches with DNA, k=31, scaled=1000, abund 63898 total hashes
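
As a consistency check, the `total hashes` value reported above for `20_genomes_sketches.zip` can be compared with the per-genome counts in `20_genomes_processed_org_idx.csv` from this example. The sketch below hard-codes the `num_unique_kmers_in_genome_sketch` column of that file. Treating that column as the per-sketch hash count is an assumption, although the totals do agree here.

```python
# Values of num_unique_kmers_in_genome_sketch, one per genome, in file order.
unique_counts = [
    6253, 3170, 991, 5696, 2870, 5129, 5401, 1984, 2572, 4203,
    3115, 3820, 1013, 2437, 972, 1376, 916, 4575, 3741, 3664,
]

# Values copied from the fileinfo output above.
reported_num_signatures = 20
reported_total_hashes = 63898
scaled = 1000

# One count per sketch is expected.
counts_match_signatures = len(unique_counts) == reported_num_signatures

# The per-genome counts should add up to the manifest total.
total = sum(unique_counts)
totals_agree = total == reported_total_hashes
print(counts_match_signatures, total, totals_agree)  # True 63898 True

# With scaled=1000 a sketch keeps roughly 1 in 1000 distinct k-mers, so this
# is only a rough estimate, and k-mers shared between genomes are counted
# once per genome.
approx_kmers_represented = total * scaled
print(approx_kmers_represented)
```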

```bash
python ../../make_training_data_from_sketches.py --ref_file '20_genomes_sketches.zip' --ksize 31 --out_prefix '20_genomes' --ani_thresh 0.95
```

2023-09-27 08:54:59 - INFO - Loading signatures from 20_genomes_sketches.zip
2023-09-27 08:54:59 - INFO - Converting signatures to reference matrix
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 855.31it/s]
2023-09-27 08:54:59 - INFO - Removing 'same' organisms with ANI > ani_thresh
2023-09-27 08:54:59 - INFO - Writing out hash-to-row-indices file
2023-09-27 08:54:59 - INFO - Writing out organism manifest
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 570.85it/s]
2023-09-27 08:54:59 - INFO - Saving k-mer size and ani threshold to json file
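
The config written in this last step stores absolute paths, so it is tied to the machine that produced it. Below is a minimal sketch, assuming the JSON layout of `20_genomes_config.json` in this example, of checking the config before passing it to `run_YACHT.py`. The helper names `load_config` and `missing_files` are hypothetical and are not part of YACHT.

```python
import json
from pathlib import Path

# Keys seen in 20_genomes_config.json for this example.
PATH_KEYS = ("reference_matrix_path", "hash_to_idx_path", "processed_org_file_path")
VALUE_KEYS = ("ksize", "ani_thresh")


def load_config(text):
    """Parse the config JSON and fail early if an expected key is absent."""
    config = json.loads(text)
    absent = [key for key in PATH_KEYS + VALUE_KEYS if key not in config]
    if absent:
        raise KeyError(f"config is missing keys: {absent}")
    return config


def missing_files(config):
    """Return the path keys whose files do not exist on this machine."""
    return [key for key in PATH_KEYS if not Path(config[key]).is_file()]


# Config contents copied from this example.
example = """{
"reference_matrix_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_1/20_genomes_ref_matrix_processed.npz",
"hash_to_idx_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_1/20_genomes_hash_to_col_idx.pkl",
"processed_org_file_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_1/20_genomes_processed_org_idx.csv",
"ksize": 31,
"ani_thresh": 0.95
}"""

config = load_config(example)
print(config["ksize"], config["ani_thresh"])
# Lists every path key that cannot be found locally.
print(missing_files(config))
```

If `missing_files` returns any keys, re-running `make_training_data_from_sketches.py` locally should regenerate the config with paths that exist on the current machine.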

```bash
python ../../run_YACHT.py --json '20_genomes_config.json' --sample_file 'sample.sig' --significance 0.99 --min_coverage 1 --outdir './'
```

2023-10-05 07:53:57 - INFO - Loading reference matrix, hash to index dictionary, and organism data.
2023-10-05 07:53:57 - INFO - Loading sample signature.
2023-10-05 07:53:57 - INFO - Computing sample vector.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 46603.38it/s]
2023-10-05 07:53:57 - INFO - Computing hypothesis recovery.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 762.46it/s]
2023-10-05 07:53:57 - INFO - Saving results to ./.
Binary file not shown.
@@ -0,0 +1,7 @@
{
"reference_matrix_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_2/20_genomes_ref_matrix_processed.npz",
"hash_to_idx_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_2/20_genomes_hash_to_col_idx.pkl",
"processed_org_file_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_2/20_genomes_processed_org_idx.csv",
"ksize": 31,
"ani_thresh": 0.95
}
Binary file not shown.
@@ -0,0 +1,21 @@
organism_name,original_index,processed_index,num_unique_kmers_in_genome_sketch,num_total_kmers_in_genome_sketch,genome_scale_factor
"VPFC01000001.1 [Empedobacter] haloabium strain ATCC 31962 contig1, whole genome shotgun sequence",0,0,6253,6571.0,1000
"SSEB01000012.1 Sinobacteraceae bacterium isolate Bin_35_3 c_000000004113, whole genome shotgun sequence",1,1,3170,3174.0,1000
"SSEF01000018.1 Candidatus Moranbacteria bacterium isolate Bin_68_2 c_000000001403, whole genome shotgun sequence",2,2,991,996.0,1000
"SSEL01000090.1 Rhodospirillaceae bacterium isolate Bin_26_3 c_000000001054, whole genome shotgun sequence",3,3,5696,5702.0,1000
"VIKH01000154.1 Gallionellaceae bacterium isolate X1_MetaBAT.29 scaffold_10015, whole genome shotgun sequence",4,4,2870,2878.0,1000
"VIKE01000141.1 Rhodocyclaceae bacterium isolate X1_MetaBAT.22 scaffold_10076, whole genome shotgun sequence",5,5,5129,5131.0,1000
"VIKI01000038.1 Comamonadaceae bacterium isolate X1_MetaBAT.31 scaffold_1017, whole genome shotgun sequence",6,6,5401,5410.0,1000
"VIKJ01000003.1 Chitinophagaceae bacterium isolate X1_MetaBAT.39 scaffold_1008, whole genome shotgun sequence",7,7,1984,1984.0,1000
"VKGY01000191.1 Spirochaetes bacterium isolate X1_MetaBAT.41 scaffold_10187, whole genome shotgun sequence",8,8,2572,2574.0,1000
"SHMW01000001.1 Candidatus Lokiarchaeota archaeon isolate BC3 1189800001, whole genome shotgun sequence",9,9,4203,4207.0,1000
"SHMX01000001.1 Candidatus Thorarchaeota archaeon isolate BC 1189500001, whole genome shotgun sequence",10,10,3115,3121.0,1000
"SHMU01000001.1 Candidatus Lokiarchaeota archaeon isolate BC1 1189600001, whole genome shotgun sequence",11,11,3820,3907.0,1000
"VMDM01000010.1 Nitrosopumilus sp. isolate 32_1 c_000000000023, whole genome shotgun sequence",12,12,1013,1015.0,1000
"VMDK01000027.1 Sphingobacteriia bacterium isolate 28_1 c_000000000062, whole genome shotgun sequence",13,13,2437,2445.0,1000
"VMDI01000049.1 Gammaproteobacteria bacterium isolate 24_3 c_000000000093, whole genome shotgun sequence",14,14,972,975.0,1000
"VMDJ01000165.1 Gammaproteobacteria bacterium isolate 27_1 c_000000000223, whole genome shotgun sequence",15,15,1376,1381.0,1000
"VMDH01000017.1 Gammaproteobacteria bacterium isolate 24_2 c_000000000070, whole genome shotgun sequence",16,16,916,916.0,1000
"VSSA01000053.1 Nocardioides sp. BGMRC 2183 Scaffold102_1, whole genome shotgun sequence",17,17,4575,4591.0,1000
"CP032507.1 Ectothiorhodospiraceae bacterium BW-2 chromosome, complete genome",18,18,3741,4101.0,1000
"WAAQ01000001.1 Microbacterium maritypicum strain DSM 12512 contig00001, whole genome shotgun sequence",19,19,3664,3665.0,1000
Binary file not shown.
@@ -0,0 +1,25 @@
```bash
python ../../make_training_data_from_sketches.py --ref_file '20_genomes_sketches.zip' --ksize 31 --out_prefix '20_genomes' --ani_thresh 0.95
```

2023-10-05 08:49:26 - INFO - Loading signatures from 20_genomes_sketches.zip
2023-10-05 08:49:26 - INFO - Converting signatures to reference matrix
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 201.20it/s]
2023-10-05 08:49:26 - INFO - Removing 'same' organisms with ANI > ani_thresh
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 7258.46it/s]
2023-10-05 08:49:26 - INFO - Writing out hash-to-row-indices file
2023-10-05 08:49:26 - INFO - Writing out organism manifest
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 566.45it/s]
2023-10-05 08:49:26 - INFO - Saving k-mer size and ani threshold to json file
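
The organism manifest written by this step (`20_genomes_processed_org_idx.csv`) is a CSV whose `organism_name` field contains commas inside quotes, so splitting lines on `,` would misparse it. Below is a minimal sketch of reading it with Python's `csv` module, using the header and the first two rows from this example. The helper name `read_manifest` and the shortened dictionary keys are illustrative only.

```python
import csv
import io

# Header and first two rows of 20_genomes_processed_org_idx.csv.
manifest_text = '''organism_name,original_index,processed_index,num_unique_kmers_in_genome_sketch,num_total_kmers_in_genome_sketch,genome_scale_factor
"VPFC01000001.1 [Empedobacter] haloabium strain ATCC 31962 contig1, whole genome shotgun sequence",0,0,6253,6571.0,1000
"SSEB01000012.1 Sinobacteraceae bacterium isolate Bin_35_3 c_000000004113, whole genome shotgun sequence",1,1,3170,3174.0,1000
'''


def read_manifest(text):
    """Parse manifest rows, converting numeric columns to numbers."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            # The accession is the first whitespace-separated token of the name.
            "accession": row["organism_name"].split()[0],
            "name": row["organism_name"],
            "original_index": int(row["original_index"]),
            "processed_index": int(row["processed_index"]),
            "unique": int(row["num_unique_kmers_in_genome_sketch"]),
            "total": float(row["num_total_kmers_in_genome_sketch"]),
            "scale": int(row["genome_scale_factor"]),
        })
    return rows


rows = read_manifest(manifest_text)
for r in rows:
    print(r["accession"], r["unique"], r["total"], r["scale"])

# In this example every genome keeps its original position, which suggests
# that no organism was removed by the ANI filter (an inference, not a
# documented guarantee).
print(all(r["original_index"] == r["processed_index"] for r in rows))
```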

```bash
python ../../run_YACHT.py --json '20_genomes_config.json' --sample_file 'sample.sig' --significance 0.99 --min_coverage 1 0.5 0.1 0.05 0.01 --outdir './'
```

2023-10-05 08:49:32 - INFO - Loading reference matrix, hash to index dictionary, and organism data.
2023-10-05 08:49:32 - INFO - Loading sample signature.
2023-10-05 08:49:32 - INFO - Computing sample vector.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 42366.71it/s]
2023-10-05 08:49:32 - INFO - Computing hypothesis recovery.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 837.35it/s]
2023-10-05 08:49:32 - INFO - Saving results to ./.
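
The run above passes several values to `--min_coverage` in a single invocation. When scripting such runs, the argument list can be assembled programmatically. Below is a minimal sketch that only builds and prints the command and does not execute it. The helper name `build_run_yacht_cmd` is hypothetical; the flags are the ones used in the command above.

```python
import shlex


def build_run_yacht_cmd(json_path, sample_file, significance, min_coverages, outdir):
    """Assemble the argv list for run_YACHT.py as invoked in this example."""
    cmd = [
        "python", "../../run_YACHT.py",
        "--json", json_path,
        "--sample_file", sample_file,
        "--significance", str(significance),
        "--min_coverage",
    ]
    # Each coverage threshold becomes its own argument after --min_coverage.
    cmd += [str(value) for value in min_coverages]
    cmd += ["--outdir", outdir]
    return cmd


cmd = build_run_yacht_cmd(
    "20_genomes_config.json", "sample.sig", 0.99, [1, 0.5, 0.1, 0.05, 0.01], "./"
)
# shlex.join quotes arguments only when the shell requires it.
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` from the example directory, which avoids shell quoting issues altogether.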
Binary file not shown.
@@ -0,0 +1,7 @@
{
"reference_matrix_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_3/20_genomes_ref_matrix_processed.npz",
"hash_to_idx_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_3/20_genomes_hash_to_col_idx.pkl",
"processed_org_file_path": "/data/jzr5814/repositories/YACHT/use_case_examples/example_3/20_genomes_processed_org_idx.csv",
"ksize": 31,
"ani_thresh": 1.0
}
Binary file not shown.
@@ -0,0 +1,21 @@
organism_name,original_index,processed_index,num_unique_kmers_in_genome_sketch,num_total_kmers_in_genome_sketch,genome_scale_factor
"VPFC01000001.1 [Empedobacter] haloabium strain ATCC 31962 contig1, whole genome shotgun sequence",0,0,6253,6571.0,1000
"SSEB01000012.1 Sinobacteraceae bacterium isolate Bin_35_3 c_000000004113, whole genome shotgun sequence",1,1,3170,3174.0,1000
"SSEF01000018.1 Candidatus Moranbacteria bacterium isolate Bin_68_2 c_000000001403, whole genome shotgun sequence",2,2,991,996.0,1000
"SSEL01000090.1 Rhodospirillaceae bacterium isolate Bin_26_3 c_000000001054, whole genome shotgun sequence",3,3,5696,5702.0,1000
"VIKH01000154.1 Gallionellaceae bacterium isolate X1_MetaBAT.29 scaffold_10015, whole genome shotgun sequence",4,4,2870,2878.0,1000
"VIKE01000141.1 Rhodocyclaceae bacterium isolate X1_MetaBAT.22 scaffold_10076, whole genome shotgun sequence",5,5,5129,5131.0,1000
"VIKI01000038.1 Comamonadaceae bacterium isolate X1_MetaBAT.31 scaffold_1017, whole genome shotgun sequence",6,6,5401,5410.0,1000
"VIKJ01000003.1 Chitinophagaceae bacterium isolate X1_MetaBAT.39 scaffold_1008, whole genome shotgun sequence",7,7,1984,1984.0,1000
"VKGY01000191.1 Spirochaetes bacterium isolate X1_MetaBAT.41 scaffold_10187, whole genome shotgun sequence",8,8,2572,2574.0,1000
"SHMW01000001.1 Candidatus Lokiarchaeota archaeon isolate BC3 1189800001, whole genome shotgun sequence",9,9,4203,4207.0,1000
"SHMX01000001.1 Candidatus Thorarchaeota archaeon isolate BC 1189500001, whole genome shotgun sequence",10,10,3115,3121.0,1000
"SHMU01000001.1 Candidatus Lokiarchaeota archaeon isolate BC1 1189600001, whole genome shotgun sequence",11,11,3820,3907.0,1000
"VMDM01000010.1 Nitrosopumilus sp. isolate 32_1 c_000000000023, whole genome shotgun sequence",12,12,1013,1015.0,1000
"VMDK01000027.1 Sphingobacteriia bacterium isolate 28_1 c_000000000062, whole genome shotgun sequence",13,13,2437,2445.0,1000
"VMDI01000049.1 Gammaproteobacteria bacterium isolate 24_3 c_000000000093, whole genome shotgun sequence",14,14,972,975.0,1000
"VMDJ01000165.1 Gammaproteobacteria bacterium isolate 27_1 c_000000000223, whole genome shotgun sequence",15,15,1376,1381.0,1000
"VMDH01000017.1 Gammaproteobacteria bacterium isolate 24_2 c_000000000070, whole genome shotgun sequence",16,16,916,916.0,1000
"VSSA01000053.1 Nocardioides sp. BGMRC 2183 Scaffold102_1, whole genome shotgun sequence",17,17,4575,4591.0,1000
"CP032507.1 Ectothiorhodospiraceae bacterium BW-2 chromosome, complete genome",18,18,3741,4101.0,1000
"WAAQ01000001.1 Microbacterium maritypicum strain DSM 12512 contig00001, whole genome shotgun sequence",19,19,3664,3665.0,1000
Binary file not shown.