pavlin-policar/openTSNE

Problem with data from CSV file

sbembenek18 opened this issue · 7 comments

I'm loading my data from CSV and it has the following format:

string, PC1, .......
There's ~50 PCA (PC1 ...) components in all and several thousand rows.

Here's a sample of the data matrix ==> Data

import csv
import numpy as np
from openTSNE import TSNE

# Read data from CSV
data = []
with open('./data.csv', 'r') as f:  
    reader = csv.reader(f)
    for row in reader:
        data.append(row)

data = np.array(data[1:], dtype=str)
X = data[:, 1:]

tsne = TSNE(
    perplexity=24.33,
    metric="euclidean",
    n_jobs=8,
    random_state=42,
    verbose=True,
)

X_2d = tsne.fit(X) # Problem here

I'm able to get this to work with sklearn TSNE -- is the format for openTSNE different?

thanks!

I haven't tried this code myself, and it's hard to tell since you haven't posted the error message, but it seems to me you're creating a numpy array containing str objects here:

data = np.array(data[1:], dtype=str)

When you subset the matrix here

X = data[:, 1:]

it will still probably have dtype str.

Perhaps just casting it to a float64 will do the trick, like this

X = data[:, 1:].astype(np.float64)

Otherwise, I can't see anything obviously wrong with this code. If this is indeed the problem, it seems strange to me that scikit-learn handles this, as I'd rather have an explicit failure on a string matrix than an implicit conversion.
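To make the suggested cast concrete, here is a minimal sketch. The two-row CSV text is made up for illustration and stands in for the real file; only the `astype(np.float64)` step comes from the suggestion above.

```python
import csv
import io

import numpy as np

# Made-up two-row CSV standing in for the real file (assumption for illustration).
csv_text = "name,PC1,PC2\nrow_a,0.5,-1.25\nrow_b,2.0,3.5\n"

rows = list(csv.reader(io.StringIO(csv_text)))
data = np.array(rows[1:], dtype=str)  # skip the header; every cell is a string

X_str = data[:, 1:]                   # slicing keeps the string dtype
X = X_str.astype(np.float64)          # explicit conversion to a numeric matrix
```

After the cast, `X` is a plain float matrix that can be handed to `tsne.fit(X)`.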

I agree - there's something I am missing.

I've reworked it using pandas and have tried 3 different data sets. I can only get openTSNE to work with 1 of the 3 sets, while sklearn TSNE works with all 3 sets. I suspect I am missing something here.

To be sure, I was able to get the example sets to run with openTSNE. I've attached the code and data sets.

tsne_pandas_version_v0.ipynb.tar.gz

DataSets.tar.gz

Could you please paste the code and the error you're getting here on GH?

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE 

data = pd.read_csv('./data_full_set_10_clusters_35_pca.csv')
#data = pd.read_csv('./data_full_set_10_clusters_35_pca.csv') # This set works.

from openTSNE import TSNE

tsne = TSNE(
    perplexity=24.33,
    metric="euclidean",
    n_jobs=8,
    random_state=42,
    verbose=True,
)

tsne_result = tsne.fit(data.iloc[:,1:]) # Error here

data[['x','y']] = tsne_result

target_names = data['Cluster'].unique()

#Plot
colors = 'r', 'g', 'b', 'c', 'm', 'y', 'k', 'gray', 'orange', 'purple'
fig, ax = plt.subplots(figsize=(8, 6))
for color, label in zip(colors, target_names):
    selected = data[data.Cluster.eq(label)]
    ax.scatter(x='x', y='y', data=selected, c=color, label=label)
ax.legend(bbox_to_anchor=(1, 0.5), loc='center left', frameon=False)
plt.show()

Error ==>

--------------------------------------------------------------------------------
TSNE(n_jobs=8, perplexity=24.33, random_state=42, verbose=True)
--------------------------------------------------------------------------------
===> Finding 72 nearest neighbors using Annoy approximate search using euclidean distance...
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /data/Apps/anaconda3/envs/open-tsne/lib/python3.11/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3801 try:
-> 3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:

File /data/Apps/anaconda3/envs/open-tsne/lib/python3.11/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File /data/Apps/anaconda3/envs/open-tsne/lib/python3.11/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[14], line 14
      4 from openTSNE import TSNE
      6 tsne = TSNE(
      7     perplexity=24.33,
      8     metric="euclidean",
   (...)
     11     verbose=True,
     12 )
---> 14 tsne_result = tsne.fit(data.iloc[:,1:])

What I am finding is that for more than 999 rows, I get the error. Here are two more data sets to check with, one with 999 rows (works) and another with 1000 (error).

data_1000_records_10_clusters_35_pca.csv

data_999_records_10_clusters_35_pca.csv

This seems to be related to pandas. See #182.

If this is indeed the case, you can simply extract the numpy matrix from the pandas dataframe like so

tsne_result = tsne.fit(data.iloc[:,1:].values)

This is technically not a bug, since openTSNE doesn't officially support pandas dataframes. I didn't want to drag that dependency into the requirements, and the solution is simply to use .values instead.
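For anyone landing here later, a small sketch of why a dataframe can produce `KeyError: 0`. It only illustrates general pandas behaviour that is consistent with the traceback above; it does not trace where openTSNE indexes its input. The tiny dataframe is made up, and `.to_numpy()` is used as the equivalent of `.values`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the CSV: one label column plus two numeric columns.
df = pd.DataFrame({
    "Cluster": ["a", "b", "c"],
    "PC1": [0.1, 0.2, 0.3],
    "PC2": [1.0, 2.0, 3.0],
})
features = df.iloc[:, 1:]

# An integer key on a DataFrame is looked up as a column label, not a row position.
try:
    features[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Converting to a NumPy array restores positional indexing.
X = features.to_numpy(dtype=np.float64)
first_row = X[0]

# X can then be passed on, e.g. tsne.fit(X)
```

Code that indexes its input by position works on `X` but not on `features`, which is why extracting the array first sidesteps the error.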

Yes, I can confirm this solves it (and agree this is not a bug in OpenTSNE). Thanks!