smaegol/PlasFlow

Possible incompatibility with underlying sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2

astrophys opened this issue · 1 comment

A colleague is using PlasFlow to analyze all contigs >1000 bp in her dataset. After filtering with filter_sequences_by_length.pl, she has a total of 2,964,210 contigs. We are using PlasFlow 1.1, Python 3.5, and sklearn 0.18.1 on CentOS 6.9; PlasFlow was installed via Anaconda.

Running :

PlasFlow.py --input all.contigs.1000.fasta --output output.plasflow.all.contigs.csv --threshold 0.7

Yields:
Stdout:

Importing sequences
Imported  2964210  sequences
Calculating kmer frequencies using kmer 5
Due to large number of sequences in the input file, it is splitted to smaller chunks (maximum size: 25000 sequences)
processing chunk: 1
.
.
.
processing chunk: 119
Transforming kmer frequencies

Stderr:

/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py:1059: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):
Traceback (most recent call last):
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 346, in <module>
    vote_proba = vote_class.predict_proba(inputfile)
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in predict_proba
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 300, in <listcomp>
    self.probas_ = [clf.predict_proba_tf(X) for clf in self.clfs]
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 252, in predict_proba_tf
    self.calculate_freq(data)
  File "/opt/plasflow/1.1//envs/plasflow/bin/PlasFlow.py", line 243, in calculate_freq
    test_tfidf = transformer.fit_transform(kmer_count)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/base.py", line 494, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1084, in transform
    X = normalize(X, norm=self.norm, copy=False)
  File "/opt/plasflow/1.1/envs/plasflow/lib/python3.5/site-packages/sklearn/preprocessing/data.py", line 1352, in normalize
    inplace_csr_row_normalize_l2(X)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 359, in sklearn.utils.sparsefuncs_fast.inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:12648)
  File "sklearn/utils/sparsefuncs_fast.pyx", line 362, in sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs_fast.c:13750)
ValueError: Buffer dtype mismatch, expected 'int' but got 'long'

This leads me to think we are passing the underlying C function, sklearn.utils.sparsefuncs_fast._inplace_csr_row_normalize_l2, too large a matrix. Following a trail of links led me to a commit which suggests this may be fixed in a more recent version of scikit-learn. The input data, all.contigs.1000.fasta, is 12 GB in size.
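For what it's worth, this is consistent with how scipy builds sparse matrices: the index arrays (`indices`/`indptr`) are int32 for small matrices but are promoted to int64 once the number of stored elements exceeds the 32-bit range, while the old Cython routine declares the buffer as 32-bit `int`, producing exactly this "expected 'int' but got 'long'" mismatch. A minimal sketch of a check (the helper name `has_64bit_indices` is mine, not part of sklearn or PlasFlow):

```python
import numpy as np
from scipy.sparse import csr_matrix

def has_64bit_indices(X):
    """True if a CSR matrix carries int64 index arrays, which older
    sklearn Cython code (e.g. _inplace_csr_row_normalize_l2) rejects
    with: ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
    """
    return X.indices.dtype == np.int64 or X.indptr.dtype == np.int64

# A small matrix gets 32-bit indices from scipy...
X = csr_matrix(np.eye(3))

# ...but forcing int64 indices mimics what happens once nnz (or a
# dimension) outgrows the int32 range, as with ~3M contigs x 5-mers.
X64 = X.copy()
X64.indices = X64.indices.astype(np.int64)
X64.indptr = X64.indptr.astype(np.int64)
```

So the error is not about the 12 GB file per se, but about the k-mer count matrix built from it growing past the int32 index limit.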

Question:

  1. Is my assessment of this issue correct?
  2. Is there a work-around for this issue?
  3. Is the input data too big?

Thanks.

Hi, thanks for submitting this issue. I will take a closer look and think about a fix. However, I think the answer to the 3rd question is yes; for now, limiting the number of input sequences (for example, splitting the input in half) should help.
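As a stopgap, the input FASTA could be split by record count and each part run through PlasFlow separately. A minimal sketch (the function name and output naming scheme are illustrative, not part of PlasFlow; it assumes the file starts with a `>` header line):

```python
# Illustrative sketch: split a FASTA file into parts of at most
# `records_per_part` sequences, writing part_1.fasta, part_2.fasta, ...
def split_fasta(path, records_per_part, prefix="part"):
    part, count, out = 0, 0, None
    outputs = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                # Start a new output file every `records_per_part` records.
                if count % records_per_part == 0:
                    if out:
                        out.close()
                    part += 1
                    name = "%s_%d.fasta" % (prefix, part)
                    outputs.append(name)
                    out = open(name, "w")
                count += 1
            out.write(line)
    if out:
        out.close()
    return outputs
```

Each resulting part can then be passed to PlasFlow.py with its own --output file, and the result CSVs concatenated afterwards.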