nomic-ai/deepscatter

Visualizing dataset with large attributes

yuanenming opened this issue · 2 comments

Hi, thanks for sharing this awesome project.

I am working on LLMs and am not very familiar with web development. I have a question about visualizing datasets with a large amount of additional information for each data point.

Specifically, I have an instruction fine-tuning dataset, which contains millions of conversations. I have obtained the embeddings using the OpenAI embedding API. It renders very fast when visualizing the data points without the conversation information (feather file ~30MB).

I want to further visualize the dataset along with the conversations, but then it renders very slowly (feather file ~6GB).

I would greatly appreciate your advice on my problem: is it possible to visualize the conversations in my scenario?

If you want to pass all of the conversations alongside the data, the feather files will inevitably get very large. The solution is to find a way to fetch all the data except the conversations first.
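
As a rough sketch of that split (the field names `id_number` and `conversation` and the helper `splitRecords` are hypothetical, not anything deepscatter defines), each record can be divided into a lightweight row that stays in the feather file and a text payload stored elsewhere under the same id. The same idea applies in whatever language you use to build the feather files:

```js
// Hypothetical helper: separate the heavy text field from each record.
// `points` keeps everything except the text; `texts` maps id -> conversation.
function splitRecords(records, textField = "conversation", idField = "id_number") {
  const points = [];
  const texts = new Map();
  for (const record of records) {
    // Pull the text field out and keep the remaining (small) fields.
    const { [textField]: text, ...light } = record;
    points.push(light);
    texts.set(String(record[idField]), text ?? "");
  }
  return { points, texts };
}

// Made-up example rows, only to show the shape of the result.
const { points, texts } = splitRecords([
  { id_number: 1, x: 0.1, y: 0.2, conversation: "User: hi\nAssistant: hello" },
  { id_number: 2, x: 0.3, y: 0.4, conversation: "User: bye" },
]);
```

Only `points` would then go into the data that deepscatter loads, while each entry of `texts` is stored separately and looked up by id when needed.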

By far the easiest option here is Atlas, our online platform. We've put a lot of work into exactly your use case, with many additional features (cross-filtering, regular-expression search across millions of points, downloading extracts of the data...). See, for example, this map of the Huggingface Obelics dataset and our description of analyzing it with Atlas.

The harder way is to serve the conversation data one at a time from a location that isn't deepscatter. For instance, you could upload all the conversations as individual text files into an S3 bucket so they're online at "https://static.mys3bucket.org/conversations/{id_number}.txt", then change the mouseover function so that it fetches the data from the Internet. For example, I might create a <div id="conversation_text"> somewhere and then run the following JavaScript:
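```js
scatterplot.tooltip_html = function (datum) {
  // Assumes each point carries an `id_number` field matching its file name.
  fetch(`https://static.mys3bucket.org/conversations/${datum.id_number}.txt`)
    .then((response) => response.text()) // response.text() returns a Promise
    .then((text) => {
      document.getElementById("conversation_text").innerText = text;
    });
};
```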
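
Mouseover events fire often, so it may help to cache fetched text and to ignore responses that arrive after the pointer has moved on. Here is a minimal sketch of that idea. `makeConversationLoader`, `fetchText`, and `render` are hypothetical names, not part of deepscatter, and the commented-out wiring reuses the example URL and `id_number` field assumed above:

```js
// Hypothetical helper: caches conversation text by id and skips stale responses.
function makeConversationLoader(fetchText, render) {
  const cache = new Map(); // id -> Promise<string>
  let latestId = null; // id of the most recently requested point
  return function show(id) {
    latestId = id;
    if (!cache.has(id)) {
      // Drop failed requests from the cache so they can be retried later.
      const pending = fetchText(id).catch((err) => {
        cache.delete(id);
        throw err;
      });
      cache.set(id, pending);
    }
    return cache.get(id).then((text) => {
      // Only render if this is still the point the user last hovered.
      if (latestId === id) render(text);
    });
  };
}

// Possible browser wiring (assumed URL pattern and element id from the example above):
// const show = makeConversationLoader(
//   (id) => fetch(`https://static.mys3bucket.org/conversations/${id}.txt`).then((r) => r.text()),
//   (text) => { document.getElementById("conversation_text").innerText = text; }
// );
// scatterplot.tooltip_html = (datum) => { show(datum.id_number).catch(console.error); };
```

Passing the fetch and render steps in as functions keeps the helper independent of the page, so it can be tried outside the browser with stand-in functions.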

Thanks for your advice. Atlas is really cool! I will definitely try it.