sdv-dev/TGAN

Column names are replaced by index

upura opened this issue · 4 comments

upura commented

Dear all,

I used TGAN from PyPI (version 0.1.0) and found that the column names of the original DataFrame are replaced by integer indices.

Problem

I suppose the problem is derived from the following lines:
https://github.com/DAI-Lab/TGAN/blob/master/tgan/data.py#L321-#L322

In L321, the column names of the original DataFrame are assigned to self.columns. In the next line, the column names of the original DataFrame are replaced by integer indices.
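To illustrate why the caller's DataFrame ends up changed, here is a minimal sketch of the pattern described above. It does not reproduce the TGAN source: `Table` and `preprocess_in_place` are hypothetical stand-ins, with `Table` playing the role of a DataFrame that only carries column names, so the example runs without any dependencies.

```python
class Table:
    """Hypothetical minimal stand-in for a DataFrame: it only has column names."""

    def __init__(self, columns):
        self.columns = list(columns)


def preprocess_in_place(data):
    """Mimic the pattern from the issue: save the names, then overwrite them."""
    saved_columns = data.columns  # analogous to the assignment in L321
    # Analogous to L322: this assignment alters the object the caller passed in.
    data.columns = list(range(len(data.columns)))
    return saved_columns


titanic = Table(['Age', 'Sex', 'Fare'])
saved = preprocess_in_place(titanic)

print(titanic.columns)  # the caller's object now has integer column names
print(saved)            # the original names survive only in the saved copy
```

Because the function receives a reference to the caller's object rather than a copy, the renaming is visible outside the function unless the names are restored afterwards.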

The code seems to try to reassign the column names in L410, but the trouble is that the function fit_transform is called several times. Since the column names of the original DataFrame are replaced by indices in L322, self.columns already holds the indices by the second time fit_transform is called.
https://github.com/DAI-Lab/TGAN/blob/master/tgan/data.py#L410

How to fix

I believe that one solution is to add an if statement around L321, like the following:

        if self.columns is None:
            self.columns = data.columns

I might be wrong because I don't fully understand the TGAN code, but I hope this issue is helpful.

Environment

  • macOS High Sierra 10.13.6
  • Python 3.6.8 | anaconda3-5.0.0

ManuelAlvarezC commented

Hi @upura, and thanks for your question.

"the trouble is that the function fit_transform is called several times"

I haven't been able to find a case in which that can occur. Could you please provide a code snippet that reproduces your issue?

Thanks.

upura commented

Hello @ManuelAlvarezC, thank you for your reply.

Here is the notebook I used. Sorry for the Japanese comments. After fitting TGAN, it looks like the column names are replaced (at cell [16]).
https://github.com/upura/upura.hatenablog/blob/master/books_sites/tgan/tgan-titanic.ipynb

I have now rechecked the code and found that what I said earlier is wrong:

"the trouble is that the function fit_transform is called several times"

However, I still can't see why the column names are replaced.

Best.

csala commented

Thanks for reporting this @upura

The problem seems to be here: https://github.com/DAI-Lab/TGAN/blob/f5b9a9cbd9e4bc2f0755bdcf24daef537594cd72/tgan/data.py#L322

The fix would be to avoid replacing the column names, and rather use an enumerate on the subsequent loop to get the right i value without having to alter the data object.
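As a rough sketch of the suggestion above (not the actual TGAN patch), the loop can take the positional index from `enumerate` and leave the column names alone. Here a plain dict of lists stands in for the DataFrame so the example runs without pandas. `fit_transform_sketch` is a hypothetical name, and only the `'f%02d' % i` output key is taken from the traceback quoted in this thread.

```python
from typing import Any, Dict, List, Sequence, Tuple


def fit_transform_sketch(
    data: Dict[str, List[Any]],
    continuous_columns: Sequence[int],
) -> Dict[str, Tuple[str, List[Any]]]:
    """Hypothetical sketch: index columns by position without renaming them."""
    transformed = {}
    # With pandas this loop would read: for i, name in enumerate(data.columns)
    for i, name in enumerate(data):
        kind = 'continuous' if i in continuous_columns else 'categorical'
        # Output keys are position-based, so the input never needs new names.
        transformed['f%02d' % i] = (kind, list(data[name]))
    return transformed


table = {'Age': [22.0, 38.0], 'Sex': ['male', 'female']}
result = fit_transform_sketch(table, continuous_columns=[0])

print(list(table))   # the input keeps its original column names
print(list(result))  # the transformed output is keyed by position
```

The point of the design is that `i` is only needed to look up `continuous_columns` and to build the output key, so it can come from `enumerate` instead of from renamed columns.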

@csala, could you please elaborate a bit on how exactly we would do this?
"The fix would be to avoid replacing the column names, and rather use an enumerate on the subsequent loop to get the right i value without having to alter the data object."
I am also working on this, and I am getting the following error:
ValueError Traceback (most recent call last)
in
----> 1 tgan.fit(data)

~\Anaconda3\lib\site-packages\tgan\model.py in fit(self, data)
678 """
679 self.preprocessor = Preprocessor(continuous_columns=self.continuous_columns)
--> 680 data = self.preprocessor.fit_transform(data)
681 self.metadata = self.preprocessor.metadata
682 dataflow = TGANDataFlow(data, self.metadata)

~\Anaconda3\lib\site-packages\tgan\data.py in fit_transform(self, data, fitting)
328 if i in self.continuous_columns:
329 column_data = data[i].values.reshape([-1, 1])
--> 330 features, probs, means, stds = self.continous_transformer.transform(column_data)
331 transformed_data['f%02d' % i] = np.concatenate((features, probs), axis=1)
332

~\Anaconda3\lib\site-packages\tgan\data.py in decorated(self, data, *args, **kwargs)
61 raise ValueError('The argument data must be a numpy.ndarray with shape (n, 1).')
62
---> 63 return function(self, data, *args, **kwargs)
64
65 decorated.__doc__ = function.__doc__

~\Anaconda3\lib\site-packages\tgan\data.py in transform(self, data)
238 """
239 model = GaussianMixture(self.num_modes)
--> 240 model.fit(data)
241
242 means = model.means_.reshape((1, self.num_modes))

~\Anaconda3\lib\site-packages\sklearn\mixture\base.py in fit(self, X, y)
192 self
193 """
--> 194 self.fit_predict(X, y)
195 return self
196

~\Anaconda3\lib\site-packages\sklearn\mixture\base.py in fit_predict(self, X, y)
218 Component labels.
219 """
--> 220 X = _check_X(X, self.n_components, ensure_min_samples=2)
221 self._check_initial_parameters(X)
222

~\Anaconda3\lib\site-packages\sklearn\mixture\base.py in _check_X(X, n_components, n_features, ensure_min_samples)
53 """
54 X = check_array(X, dtype=[np.float64, np.float32],
---> 55 ensure_min_samples=ensure_min_samples)
56 if n_components is not None and X.shape[0] < n_components:
57 raise ValueError('Expected n_samples >= n_components '

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
571 if force_all_finite:
572 _assert_all_finite(array,
--> 573 allow_nan=force_all_finite == 'allow-nan')
574
575 shape_repr = _shape_repr(array.shape)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan)
54 not allow_nan and not np.isfinite(X).all()):
55 type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56 raise ValueError(msg_err.format(type_err, X.dtype))
57
58

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
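The final ValueError comes from scikit-learn's input validation, which rejects NaN and infinite values, and the traceback shows it being reached from the continuous-column branch of fit_transform. That suggests at least one continuous column contains missing or non-finite values. Below is a minimal sketch of a check that could be run before calling fit. `non_finite_columns` is a hypothetical helper, not part of TGAN, and a dict of lists again stands in for the DataFrame.

```python
import math
from typing import Any, Dict, List, Sequence


def non_finite_columns(
    data: Dict[str, List[Any]],
    continuous_columns: Sequence[int],
) -> List[str]:
    """Return the names of continuous columns holding None, NaN or infinity."""
    names = list(data)
    bad = []
    for i in continuous_columns:
        values = data[names[i]]
        # None is treated as missing; NaN and +/-inf fail math.isfinite.
        if any(v is None or not math.isfinite(v) for v in values):
            bad.append(names[i])
    return bad


table = {
    'Age': [22.0, float('nan'), 26.0],
    'Fare': [7.25, 71.28, 7.92],
}

print(non_finite_columns(table, continuous_columns=[0, 1]))
```

Any column reported by such a check would need its missing values dropped or imputed before the data is passed to the model.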