Problem with data from CSV file
sbembenek18 opened this issue · 7 comments
I'm loading my data from CSV and it has the following format:
string, PC1, .......
There's ~50 PCA (PC1 ...) components in all and several thousand rows.
Here's a sample of the data matrix ==> Data
import csv
import numpy as np
from openTSNE import TSNE

# Read data from CSV
data = []
with open('./data.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(row)

# Drop the header row; every field is still a string at this point
data = np.array(data[1:], dtype=str)
X = data[:, 1:]

tsne = TSNE(
    perplexity=24.33,
    metric="euclidean",
    n_jobs=8,
    random_state=42,
    verbose=True,
)
X_2d = tsne.fit(X)  # Problem here
I'm able to get this to work with sklearn TSNE -- is the format for OpenTSNE different?
thanks!
I haven't tried this code myself, and it's hard to tell since you haven't posted the error message, but it seems to me you're creating a numpy array containing str objects here
data = np.array(data[1:], dtype=str)
When you subset the matrix here
X = data[:, 1:]
it will still probably have dtype str.
Perhaps just casting it to a float64 will do the trick, like this
X = data[:, 1:].astype(np.float64)
Otherwise, I can't see anything obviously wrong with this code. If this is indeed the problem, it seems strange to me that scikit-learn handles it, as I'd rather have an explicit failure on a string matrix than an implicit conversion.
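To illustrate the cast suggested above, here is a minimal sketch. The rows are made-up stand-ins for what csv.reader would yield (they are not taken from the attached data), and it only shows the dtype before and after the conversion, not an actual openTSNE run.

```python
import numpy as np

# Hypothetical rows as csv.reader would return them: every field is a string.
rows = [
    ["name", "PC1", "PC2"],
    ["a", "0.5", "-1.25"],
    ["b", "2.0", "3.75"],
]

data = np.array(rows[1:], dtype=str)   # drop the header row
X_str = data[:, 1:]                    # slicing keeps the string dtype
X = X_str.astype(np.float64)           # explicit conversion to numbers

print(X_str.dtype.kind)  # string kind, not a numeric kind
print(X.dtype)
```

Checking X.dtype right before calling fit is a cheap way to confirm the matrix is numeric.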
I agree - there's something I am missing.
I've reworked it using pandas and have tried 3 different data sets. I can only get openTSNE to work with 1 of the 3 sets, while sklearn TSNE works with all 3 sets. I suspect I am missing something here.
To be sure, I was able to get the example sets to run with openTSNE. I've attached the code and data sets.
Could you please paste the code and the error you're getting here on GH?
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
data = pd.read_csv('./data_full_set_10_clusters_35_pca.csv')
# data = pd.read_csv('./data_full_set_10_clusters_35_pca.csv')  # This set works.
from openTSNE import TSNE
tsne = TSNE(
perplexity=24.33,
metric="euclidean",
n_jobs=8,
random_state=42,
verbose=True,
)
tsne_result = tsne.fit(data.iloc[:,1:]) # Error here
data[['x','y']] = tsne_result
target_names = data['Cluster'].unique()
# Plot
colors = 'r', 'g', 'b', 'c', 'm', 'y', 'k', 'gray', 'orange', 'purple'
fig, ax = plt.subplots(figsize=(8, 6))
for color, label in zip(colors, target_names):
    selected = data[data.Cluster.eq(label)]
    ax.scatter(x='x', y='y', data=selected, c=color, label=label)
ax.legend(bbox_to_anchor=(1, 0.5), loc='center left', frameon=False)
plt.show()
Error ==>
--------------------------------------------------------------------------------
TSNE(n_jobs=8, perplexity=24.33, random_state=42, verbose=True)
--------------------------------------------------------------------------------
===> Finding 72 nearest neighbors using Annoy approximate search using euclidean distance...
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /data/Apps/anaconda3/envs/open-tsne/lib/python3.11/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
3801 try:
-> 3802 return self._engine.get_loc(casted_key)
3803 except KeyError as err:
File /data/Apps/anaconda3/envs/open-tsne/lib/python3.11/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()
File /data/Apps/anaconda3/envs/open-tsne/lib/python3.11/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[14], line 14
4 from openTSNE import TSNE
6 tsne = TSNE(
7 perplexity=24.33,
8 metric="euclidean",
(...)
11 verbose=True,
12 )
---> 14 tsne_result = tsne.fit(data.iloc[:,1:])
What I am finding is that I get the error with more than 999 rows. Here are two more data sets to check with, one with 999 rows (works) and another with 1000 (error).
This seems to be related to pandas. See #182.
If this is indeed the case, you can simply extract the numpy matrix from the pandas dataframe like so
tsne_result = tsne.fit(data.iloc[:,1:].values)
This is technically not a bug, since openTSNE doesn't officially support pandas dataframes. I didn't want to drag that dependency into the requirements, and the solution is simply to use .values instead.
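As a rough sketch of why a dataframe can trip up code written for numpy arrays: integer indexing on a DataFrame looks up a column label rather than a row, which matches the KeyError: 0 in the traceback above. The dataframe below is a made-up stand-in for the CSV, and the exact code path inside openTSNE is an assumption, since the traceback does not show it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: a label column followed by numeric columns.
df = pd.DataFrame({
    "Cluster": ["a", "b", "c"],
    "PC1": [0.1, 0.2, 0.3],
    "PC2": [1.0, 2.0, 3.0],
})
features = df.iloc[:, 1:]

# On a DataFrame, features[0] means "the column labelled 0", which does not exist.
try:
    features[0]
    raised = False
except KeyError:
    raised = True

# Converting to a plain numpy array restores positional row indexing.
X = features.to_numpy(dtype=np.float64)  # same array as features.values here
first_row = X[0]

print(raised)
print(first_row)
```

Passing X (or features.values) to tsne.fit avoids handing the dataframe to the library at all.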
Yes, I can confirm this solves it (and agree this is not a bug in OpenTSNE). Thanks!