[QST]how can i change int64 to float64
gukejun1 opened this issue · 7 comments
Following the example given on the official website, I ran the conversion script, but it ends with an error:
File "/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/csv.py", line 285, in coerce_dtypes
raise ValueError(msg)
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+--------+---------+----------+
| Column | Found | Expected |
+--------+---------+----------+
| I12 | float64 | int64 |
| I2 | float64 | int64 |
| I7 | float64 | int64 |
+--------+---------+----------+
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'I12': 'float64',
'I2': 'float64',
'I7': 'float64'}
to the call to `read_csv`/`read_table`.
Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.
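The mismatch dask reports here can be reproduced without dask at all: whenever a CSV column that is "supposed" to be integer contains a gap, pandas (which dask uses per partition) must promote it to `float64`, because NumPy integers cannot represent NaN. A minimal sketch with plain pandas and a made-up two-row TSV:

```python
import io
import pandas as pd

# A tiny TSV where the second row is missing its I2 value.
# The gap forces the column to float64 -- an int64 column
# has no way to store NaN.
tsv = "label\tI2\n1\t5\n0\t\n"
df = pd.read_csv(io.StringIO(tsv), sep="\t")
print(df["I2"].dtype)  # float64, not int64
```

This is why dask's suggestion is either to declare those columns `float64` explicitly or to pass `assume_missing=True`, which treats all unspecified integer columns as floats up front.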
Then I modified the dtypes as follows:
```python
dtypes = {}
dtypes["label"] = np.int32
for x in cont_names:
    dtypes[x] = np.int32
for x in ["I12", "I2", "I7"]:
    dtypes[x] = "float64"
for x in cat_names:
    dtypes[x] = "hex"
```
But it still raises the same error. Here is the call:
```python
dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    assume_missing=True,
    client=client,
)
```
A new error occurs:
File "/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py", line 1213, in astype_float_to_int_nansafe
raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
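This second error is the mirror image of the first: once a column contains NaN, pandas refuses to cast it to a plain integer dtype. A minimal reproduction, with the two usual workarounds (fill the gaps first, or use pandas' nullable integer dtype):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
try:
    s.astype(np.int32)  # NaN has no int32 representation
except pd.errors.IntCastingNaNError:
    print("cannot cast NaN to int32")

filled = s.fillna(0).astype(np.int32)  # workaround 1: fill missing values first
nullable = s.astype("Int32")           # workaround 2: nullable integer dtype
print(filled.tolist(), nullable.dtype)
```

So requesting `int32` directly from the CSV reader can only work if the column has no missing values at all.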
The data is the first decompressed data file (/raid/data/criteo/crit_orig/day_0). How do I handle this format exception in the raw input data?
```python
input_path = "/raid/data/criteo/crit_orig"
BASE_DIR = "/raid/data/criteo"
INPUT_PATH = os.environ.get("INPUT_DATA_DIR", input_path)
OUTPUT_PATH = os.environ.get("OUTPUT_DATA_DIR", os.path.join(BASE_DIR, "converted"))
# CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5")
CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "3")

cluster = None  # Connect to existing cluster if desired
if cluster is None:
    cluster = LocalCUDACluster(
        CUDA_VISIBLE_DEVICES=CUDA_VISIBLE_DEVICES,
        rmm_pool_size=get_rmm_size(0.8 * device_mem_size()),
        local_directory=os.path.join(OUTPUT_PATH, "dask-space"),
    )
client = Client(cluster)

# Specify column names
cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
cols = ["label"] + cont_names + cat_names

# Specify column dtypes. Note that "hex" means that
# the values will be hexadecimal strings that should
# be converted to int32
dtypes = {}
dtypes["label"] = np.int32
for x in cont_names:
    dtypes[x] = np.int32
for x in cat_names:
    dtypes[x] = "hex"

file_list = glob.glob(os.path.join(INPUT_PATH, "day_0"))
dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
)
dataset.to_parquet(
    os.path.join(OUTPUT_PATH, "criteo"),
    preserve_files=True,
)
```
This is my original full code, from the notebook (https://nvidia-merlin.github.io/Merlin/main/examples/scaling-criteo/01-Download-Convert.html#conversion-script-for-criteo-dataset-csv-to-parquet).
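As the script's own comment notes, the `"hex"` dtype applies to the categorical Criteo columns (`C1`..`C26`), which are stored as hexadecimal strings and converted to int32. In plain Python the conversion is just a base-16 parse:

```python
# A hypothetical Criteo categorical value: an 8-character hex string.
raw = "68fd1e64"
value = int(raw, 16)    # parse as base-16 integer
assert value < 2**31    # this particular value fits in a signed int32
```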
@rnyak Could you tell me the solution? I use the docker image (nvcr.io/nvidia/merlin/merlin-tensorflow 22.12) from the website (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow).
@rnyak No, I can't fix it by using \Merlin\examples\scaling-criteo\01_download_convert.ipynb. I run this code on GPU. The data is from the Criteo dataset. The error comes from this code:
```python
dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
)
```
@rnyak Later I changed the notebook code to:
```python
for x in cont_names:
    dtypes[x] = np.zeros(0)  # change here
    # dtypes[x] = np.int32
for x in cat_names:
    dtypes[x] = "hex"
# .......
dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
    assume_missing=True,  # here add
)
```
I ran it on 6 GPUs, and with this modification it eventually worked. But I wonder why it fails as written in the notebook. I'm not sure whether my approach deviates from the idea in the notebook, and is there a better solution?
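For reference, the likely cause is that day_0 contains missing values in some of the integer columns (the first error names I2, I7, and I12), so asking the CSV reader for int32 directly runs into the NaN-to-integer cast. A minimal sketch of the underlying fix, shown here with plain pandas and hypothetical sample rows rather than the real nvt.Dataset call: read the affected columns as float, fill the gaps, then downcast.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical mini-sample standing in for criteo day_0: tab-separated,
# headerless, with gaps in the integer feature columns.
raw = "1\t5\t\t68fd1e64\n0\t\t7\t287e684f\n"
cols = ["label", "I1", "I2", "C1"]
df = pd.read_csv(io.StringIO(raw), sep="\t", names=cols,
                 dtype={"I1": "float64", "I2": "float64"})

# Fill the missing values before the integer cast -- skipping this step
# is exactly what triggers IntCastingNaNError in the notebook's version.
for c in ["I1", "I2"]:
    df[c] = df[c].fillna(0).astype(np.int32)
print(df["I1"].tolist(), df["I2"].tolist())
```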