Feature Request: Stop truncating text in project datasets
Ulipenitz opened this issue · 6 comments
Is your feature request related to a problem? Please describe.
Related to this issue: #653
Describe the solution you'd like
My dataset does not fit on my local disk, so I upload it to the project in a loop, like this:
project[DATA_PATH].append(
    stringify_unsupported(
        {
            "tokens": ["text",...., "text"],
            "ner_tags": ["tag",...,"tag"]
        }
    )
)
Truncation to 1000 characters destroys my dataset.
As far as I know, there is no other way to upload a dataset directly from memory (without saving it to a local file), so this feature would be great!
Describe alternatives you've considered
I am thinking about saving these dicts {"tokens": ["text",...., "text"], "ner_tags": ["tag",...,"tag"]} to a file in each iteration and uploading each one as a file (e.g. data/train/0.pkl, data/train/1.pkl ... data/train/70000.pkl).
My dataset has 70,000 rows, so this is not a nice solution: I would have to create a file, upload it to Neptune, and delete it locally 70,000 times. Downloading the data would get messy as well.
Hey @Ulipenitz
I've passed on this feature request to the product team for consideration and will keep the thread updated.
Meanwhile, as a workaround, can you upload the dataset to Neptune as a serialized object? Given the size of the dataset, I am assuming you wouldn't need it to be in a human-readable format on Neptune (but please correct me if I am wrong).
You can upload the dataset as a pickle directly from memory using neptune.types.File.as_pickle(). It would look like this:
import neptune
from neptune.types import File

DATA_PATH = "data/train"

data = {
    "tokens": ["text",..., "text"],
    "ner_tags": ["tag",...,"tag"]
}

project = neptune.init_project()

# Upload each row as a pickled file under its own index
for i in range(10):
    project[DATA_PATH][i].upload(File.as_pickle(data))
To use the dataset later, download it from the project and load it with pickle:
import pickle as pkl

project[DATA_PATH][i].download()

# DOWNLOADED_FILE_PATH is a placeholder for the local path of the downloaded file
with open(DOWNLOADED_FILE_PATH, "rb") as f:
    downloaded_dataset = pkl.load(f)
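As a quick local check that a pickled row keeps its full content, here is a minimal sketch using only the standard-library pickle module. The row contents (2000 repeated tokens) are made up for illustration, and the sketch assumes that File.as_pickle() serializes the object in a comparable way, which this thread does not spell out.

```python
import pickle

# Hypothetical row, long enough that its string form exceeds 1000 characters
row = {
    "tokens": ["word"] * 2000,
    "ner_tags": ["O"] * 2000,
}
assert len(str(row)) > 1000

# Serialize to bytes in memory, then restore: nothing is truncated
payload = pickle.dumps(row)
restored = pickle.loads(payload)

assert restored == row
```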
Please let me know if this would work for you.
Thank you for the quick reply!
I already tried this, but unfortunately I get an error like this:
FileNotFoundError: [Errno 2] No such file or directory: 'ABSOLUTEPATH\\.neptune\\async\\project__9701b6a4-d310-4f5f-a6e0-7827a05c1e78\\exec-1708349077.259059-2024-02-19_14.24.37.259059-5884\\upload_path\\data_dummy_data-1708349077.32419-2024-02-19_14.24.37.324190.pkl'
I used this code:
project = neptune.init_project()
data = {"a": 0, "b": 1}
project["data/dummy_data"].upload(File.as_pickle(data))
The project folder exists, but exec-1708349077 does not.
This was a bug in neptune<0.19. Could you update neptune to the latest version using pip install -U neptune and try again?
Sorry, I did not realize that I was not running on the newest version. It works now!
Also, your proposed solution works! Thanks for the help! :-)
Perfect!
I'll keep the thread open in case the product team needs further details.
Quick update:
Initially I tested with a subset of the data, but with the big dataset I get this error:
----NeptuneFieldCountLimitExceedException---------------------------------------------------------------------------------------
There are too many fields (more than 9000) in the [PROJECTNAME] project.
We have stopped the synchronization to the Neptune server and stored the data locally.
I will try to chunk the data so that I don't exceed this limit, but this workaround adds complexity to our project.
It would be great to have higher limits for larger datasets.
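One way the chunking idea could look is sketched below. It groups rows into fixed-size chunks and uploads each chunk as a single pickled file, so 70,000 rows at 100 rows per chunk produce 700 fields instead of 70,000. The helper names (chunked, upload_in_chunks), the chunk size, and the "chunk_00000" path naming are assumptions for illustration, not part of the Neptune API; only upload() and File.as_pickle() come from the thread above.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def chunked(rows: Iterable[Any], chunk_size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most chunk_size rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def upload_in_chunks(project: Any, base_path: str, rows: Iterable[Any], chunk_size: int) -> int:
    """Upload each chunk as one pickled file; return the number of chunks uploaded.

    Hypothetical helper: `project` is expected to be the object returned by
    neptune.init_project(), as used earlier in this thread.
    """
    # Imported lazily so the chunking helper can be used without neptune installed
    from neptune.types import File

    count = 0
    for index, chunk in enumerate(chunked(rows, chunk_size)):
        # One field per chunk, e.g. data/train/chunk_00000
        project[f"{base_path}/chunk_{index:05d}"].upload(File.as_pickle(chunk))
        count += 1
    return count
```

Because chunked() consumes the rows lazily, the full dataset never has to sit in memory at once; each downloaded file then unpickles to a list of rows.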