NVIDIA-Merlin/NVTabular

[QST] How can I change int64 to float64?

gukejun1 opened this issue · 7 comments

https://nvidia-merlin.github.io/Merlin/main/examples/scaling-criteo/01-Download-Convert.html#conversion-script-for-criteo-dataset-csv-to-parquet

Following the content given on the official website, I ran the example, but I end up with an error:

File "/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/csv.py", line 285, in coerce_dtypes
   raise ValueError(msg)
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+--------+---------+----------+
| Column | Found   | Expected |
+--------+---------+----------+
| I12    | float64 | int64    |
| I2     | float64 | int64    |
| I7     | float64 | int64    |
+--------+---------+----------+

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'I12': 'float64',
       'I2': 'float64',
       'I7': 'float64'}

to the call to `read_csv`/`read_table`.

Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.
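In plain dask, that suggested fix looks like this (a minimal sketch, assuming the tab-separated Criteo day_0 file from this issue; the column names match the full script further down):

import dask.dataframe as dd

cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
cols = ["label"] + cont_names + cat_names

# Declare the continuous columns as float64 up front so rows with
# missing values cannot contradict dask's int64 inference.
ddf = dd.read_csv(
    "/raid/data/criteo/crit_orig/day_0",
    sep="\t",
    names=cols,
    dtype={c: "float64" for c in cont_names},
)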

Then I modified some of the code as follows:

dtypes = {}
dtypes["label"] = np.int32
for x in cont_names:
    dtypes[x] = np.int32
for x in ["I12", "I2", "I7"]:
    dtypes[x] = "float64"  # override the three mismatched columns
for x in cat_names:
    dtypes[x] = "hex"

But it still fails with the same error, so I also added `assume_missing=True` to the `nvt.Dataset` call:

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    assume_missing=True,  # newly added
    client=client,
)

A new error occurs:

File "/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py", line 1213, in astype_float_to_int_nansafe
   raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
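This error appears to come from the remaining continuous columns: only I2, I7, and I12 were switched to float64, while the other I-columns are still declared np.int32 and also contain missing values, so the float-to-int cast fails. A minimal pandas reproduction (hypothetical values, not from the dataset):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])
s.astype("float64")          # fine: NaN is representable as a float
s.fillna(0).astype("int32")  # fine: fill missing values before casting
s.astype("int32")            # raises IntCastingNaNError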

The data is the first decompressed data file (/raid/data/criteo/crit_orig/day_0). How should I handle this format issue in the initial data?

import os
import glob

import numpy as np
import nvtabular as nvt
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from merlin.core.utils import device_mem_size, get_rmm_size  # module path may differ across Merlin versions

input_path = "/raid/data/criteo/crit_orig"
BASE_DIR = "/raid/data/criteo"
INPUT_PATH = os.environ.get("INPUT_DATA_DIR", input_path)
OUTPUT_PATH = os.environ.get("OUTPUT_DATA_DIR", os.path.join(BASE_DIR, "converted"))
# CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5")
CUDA_VISIBLE_DEVICES = os.environ.get("CUDA_VISIBLE_DEVICES", "3")

cluster = None  # Connect to existing cluster if desired
if cluster is None:
    cluster = LocalCUDACluster(
        CUDA_VISIBLE_DEVICES=CUDA_VISIBLE_DEVICES,
        rmm_pool_size=get_rmm_size(0.8 * device_mem_size()),
        local_directory=os.path.join(OUTPUT_PATH, "dask-space"),
    )

client = Client(cluster)

# Specify column names
cont_names = ["I" + str(x) for x in range(1, 14)]
cat_names = ["C" + str(x) for x in range(1, 27)]
cols = ["label"] + cont_names + cat_names

# Specify column dtypes. Note that "hex" means that
# the values will be hexadecimal strings that should
# be converted to int32
dtypes = {}
dtypes["label"] = np.int32
for x in cont_names:
    dtypes[x] = np.int32
for x in cat_names:
    dtypes[x] = "hex"

file_list = glob.glob(os.path.join(INPUT_PATH, "day_0"))

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
)

dataset.to_parquet(
    os.path.join(OUTPUT_PATH, "criteo"),
    preserve_files=True,
)

This is my original full code, from the notebook (https://nvidia-merlin.github.io/Merlin/main/examples/scaling-criteo/01-Download-Convert.html#conversion-script-for-criteo-dataset-csv-to-parquet).

@rnyak could you tell me the solution? I use the Docker image (nvcr.io/nvidia/merlin/merlin-tensorflow:22.12) from the website (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-tensorflow).

rnyak commented

@gukejun1 hello. I'm not sure I understood, did you fix it or not? You wrote that you fixed it.

Are you running this notebook? Which line is giving you the error?

Are you running this code on GPU, and with multiple GPUs?

@rnyak No, I can't fix it using \Merlin\examples\scaling-criteo\01_download_convert.ipynb. I run this code on GPU. The data is from the Criteo dataset. The error comes from this code:

dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
)

rnyak commented

@gukejun1 thanks. How many GPUs are you using to run this notebook? I am still confused: if you could not make this notebook run properly, how did you generate the parquet files you mention in ticket #1770?

@rnyak Later I changed the notebook code to:

for x in cont_names:
    dtypes[x] = np.zeros(0)  # changed here (was: dtypes[x] = np.int32)
for x in cat_names:
    dtypes[x] = "hex"
# ...
dataset = nvt.Dataset(
    file_list,
    engine="csv",
    names=cols,
    part_mem_fraction=0.10,
    sep="\t",
    dtypes=dtypes,
    client=client,
    assume_missing=True,  # added here
)

I run it on 6 GPUs. My modification eventually worked, but I wonder why it fails as written in the notebook. I'm not sure whether my approach deviates from the idea in the notebook, and is there a better solution?
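For reference, np.zeros(0) is an empty float64 array, which is presumably why this works: the continuous columns end up being read as float64, so missing values can stay as NaN. The same intent can be stated directly (a minimal sketch, equivalent in effect):

# Declare float64 explicitly instead of passing an empty float64 array.
for x in cont_names:
    dtypes[x] = np.float64
for x in cat_names:
    dtypes[x] = "hex"

Downstream, the ETL workflow can then fill those NaNs (e.g. with nvtabular.ops.FillMissing) before anything casts back to an integer dtype.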

I also encountered this exact same problem and resolved it using the solution from @gukejun1. @gukejun1, did you find out whether the modifications you made affect the correctness of the training?
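One point worth noting: float64 represents integers exactly up to 2**53, so reading the Criteo integer features as floats should not change any values; the open question is only how downstream steps treat the NaNs. A quick way to inspect the converted output (a sketch, assuming the output path from the script above):

import dask.dataframe as dd

cont_names = ["I" + str(x) for x in range(1, 14)]

# Read the converted parquet back and check dtypes and missing values.
ddf = dd.read_parquet("/raid/data/criteo/converted/criteo")
print(ddf.dtypes)
print(ddf[cont_names].isna().sum().compute())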