A colleague is working on using PlasFlow to analyze all contigs >1000 bp in her dataset. After filtering with `filter_sequences_by_length.pl`, she has a total of 2,964,210 contigs. We are using `plasflow-1.1`, `python-3.5`, and `sklearn-0.18.1` on CentOS 6.9. PlasFlow was installed via Anaconda.

Running:

```
PlasFlow.py --input all.contigs.1000.fasta --output output.plasflow.all.contigs.csv --threshold 0.7
```

yields:

Stdout:
```
Importing sequences
Imported 2964210 sequences
Calculating kmer frequencies using kmer 5
Due to large number of sequences in the input file, it is splitted to smaller chunks (maximum size: 25000 sequences)
processing chunk: 1
.
.
.
processing chunk: 119
Transforming kmer frequencies
```
Stderr:

```
/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py:1059: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):
Traceback (most recent call last):
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 346, in <module>
    vote_proba = vote_class.predict_proba(inputfile)
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in predict_proba
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in <listcomp>
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 252, in predict_proba_tf
    self.calculate_freq(data)
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 243, in calculate_freq
    test_tfidf = transformer.fit_transform(kmer_count)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/base.py", line 494, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1084, in transform
    X = normalize(X, norm=self.norm, copy=False)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/preprocessing/data.py", line 1352, in normalize
    inplace_csr_row_normalize_l2(X)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 359, in sklearn.utils.sparsefuncs_fast.inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:12648)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 362, in sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:13750)
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
```
This leads me to think that the underlying C function, `sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2`, is being passed too large a matrix. Following a trail of links led me to this commit, which makes me think this may be fixed in a more recent version of scikit-learn. The input data, `all.contigs.1000.fasta`, is 12 GB in size.
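A quick back-of-the-envelope check is consistent with that assessment: scipy stores CSR index arrays as 32-bit integers until the number of stored entries exceeds the `int32` limit, at which point it promotes them to 64-bit, and the old Cython normalizer only accepted 32-bit indices. The numbers below are only an upper bound (not every 5-mer occurs in every contig), so this is a plausibility check, not a proof:

```python
# Rough upper bound on non-zeros in the 5-mer count matrix for this input.
n_seqs = 2_964_210          # contigs after length filtering
n_features = 4 ** 5         # 1024 possible 5-mers
max_nnz = n_seqs * n_features
int32_max = 2 ** 31 - 1     # largest value a 32-bit signed index can hold

print(max_nnz)              # 3035351040
print(max_nnz > int32_max)  # True: scipy would need 64-bit indices here
```

So a matrix over the full dataset can plausibly cross the 32-bit index boundary, which would produce exactly the `Buffer dtype mismatch, expected 'int' but got 'long'` error above.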
Questions:

1. Is my assessment of this issue correct?
2. Is there a workaround for this issue?
3. Is the input data too big?
Thanks.
Hi, thanks for submitting this issue. I will take a closer look at it and think about a fix. However, I think the answer to the 3rd question is yes; for now, limiting the number of input sequences (for example, splitting the file in half) should help.
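For anyone hitting the same error, a minimal sketch of that workaround: split the FASTA into two halves and run PlasFlow on each half separately. `split_fasta` is a hypothetical helper written here for illustration, not part of PlasFlow; it uses only the standard library and detects record boundaries on `>` header lines:

```python
# Split a FASTA file into two halves by record count, so that each
# PlasFlow run builds a smaller k-mer matrix (illustrative sketch).
def split_fasta(path, out1, out2):
    # Collect records; each record is its header line plus sequence lines.
    records, current = [], []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    records.append(current)
                current = [line]
            else:
                current.append(line)
        if current:
            records.append(current)
    # Write the first half of the records to out1, the rest to out2.
    half = len(records) // 2
    for out_path, chunk in ((out1, records[:half]), (out2, records[half:])):
        with open(out_path, "w") as out:
            for rec in chunk:
                out.writelines(rec)
    return half, len(records) - half
```

Each half can then be passed to `PlasFlow.py --input ...` as usual and the output CSVs concatenated afterwards.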