CODAIT/deep-histopath

Save tiled images to disk

luistelmocosta opened this issue · 7 comments

Is it possible to save the transformed images to disk? I see you use the PIL library to save, but you also mention a Hadoop-supported directory. I don't get how your save_jpeg_help saves either to disk or to HDFS.
I have a Hadoop instance running at hdfs://localhost:9000/data, but nothing is saved to the Hadoop-supported directory. If I try a local path, nothing happens either.

Is there any way to save the processed images to a local directory?

@feihugis Can you answer this?

That would be really helpful. I've been struggling with this for days and can't make any progress.

The current repo uses NFS to store data instead of HDFS, so it works with local disk paths but does not support HDFS yet.

For HDFS, Apache Arrow (pyarrow) can be used to connect. The following functions can be used to read images from HDFS:

import numpy as np
from PIL import Image
import pyarrow as pa


def get_hdfs(host, port):
  """
  Connect to HadoopFileSystem
  :param host: HDFS namenode host
  :param port: HDFS namenode port, which can be retrieved by `hdfs getconf -nnRpcAddresses`
  :return: HadoopFileSystem
  """
  fs = pa.hdfs.connect(host, port)
  return fs

def read_image(fs, img_path, mode="rb"):
  """
  Read image file from HDFS
  :param fs: HadoopFileSystem
  :param img_path: image file path
  :param mode: file mode; must be "rb" so the stream is opened in binary for PIL
  :return: image as a numpy uint8 array
  """
  f = fs.open(img_path, mode)
  pil_img = Image.open(f)
  img_array = np.asarray(pil_img, dtype=np.uint8)
  f.close()
  return img_array

Thank you so much for your quick response!
You posted how to read images with read_image(). Anything about saving?
And one of my main questions remains - is it possible to save it to disk after running the preprocessing?

save_jpeg_help should be able to save the data into local disk.

I did not directly save data into HDFS, but it is not difficult. The idea is similar to the function read_image(), but you open the file with mode "wb" and call write() on the returned file handle to save the encoded bytes. More info here may be helpful.
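To make that concrete, here is a minimal sketch of a write counterpart to read_image(). The helper name save_image and its parameters are assumptions, not part of the repo; it encodes the array with PIL into an in-memory buffer, then writes the bytes through the HDFS file handle opened in "wb" mode:

```python
import numpy as np
from io import BytesIO
from PIL import Image


def save_image(fs, img_array, img_path, img_format="jpeg", mode="wb"):
  """
  Save an image array to HDFS (hypothetical counterpart to read_image()).
  :param fs: HadoopFileSystem returned by get_hdfs()
  :param img_array: image as a numpy uint8 array of shape (H, W, 3)
  :param img_path: destination file path on HDFS
  :param img_format: encoding format understood by PIL, e.g. "jpeg" or "png"
  :param mode: file mode; "wb" opens the file for binary writing
  """
  buf = BytesIO()
  # Encode the array in memory first, then write the bytes in one call.
  Image.fromarray(img_array).save(buf, format=img_format)
  f = fs.open(img_path, mode)
  f.write(buf.getvalue())
  f.close()
```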

Sadly, save_jpeg_help is not saving the generated images, and I can't understand why.
My guess is that since the application runs from a different path, something like /usr/local/spark-2.4.0-bin-hadoop2.7/work/app-20181206041146-0000/0, I'm not able to save to that local path.

Do I need to rdd.collect() before start saving the pictures?

Sorry for the late response. I hope you have solved your problem. A local path should be fine; it will just save the image on the local disk. One thing to check is that the file names are all different.
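A sketch of what "different file names" can look like when saving from Spark executors. The helper save_jpeg_unique below is hypothetical (not from the repo); it appends a uuid4 suffix so concurrent tasks never overwrite each other, and it takes an absolute output directory because an executor's working directory (e.g. under spark-.../work/app-.../) differs from the driver's:

```python
import os
import uuid
import numpy as np
from PIL import Image


def save_jpeg_unique(img_array, out_dir, prefix="tile"):
  """
  Save one tile as JPEG under out_dir with a collision-free name.
  Hypothetical helper: the uuid4 suffix keeps names unique even when
  many Spark tasks write to the same directory at once.
  :param img_array: image as a numpy uint8 array of shape (H, W, 3)
  :param out_dir: absolute path to the output directory
  :param prefix: file-name prefix
  :return: full path of the saved file
  """
  os.makedirs(out_dir, exist_ok=True)
  path = os.path.join(out_dir, "%s_%s.jpg" % (prefix, uuid.uuid4().hex))
  Image.fromarray(img_array).save(path, "JPEG")
  return path
```

With a helper like this, something along the lines of rdd.foreach(lambda arr: save_jpeg_unique(arr, "/abs/out")) saves directly on the executors, so no rdd.collect() is needed (collecting would first pull every image into driver memory).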