Fetching of training dataset does not work

Hi,

I am trying to access a training dataset from my local Python environment on Windows 10 and read it into a Pandas DataFrame with the read() function of the TrainingDataset object. However, this leads to the following error:

    dfData = td.read()
      File "...\Python\Python37\site-packages\hsfs\training_dataset.py", line 237, in read
        return self._training_dataset_engine.read(self, split, read_options)
      File "...\Python\Python37\site-packages\hsfs\core\training_dataset_engine.py", line 84, in read
        split,
      File "...\Python\Python37\site-packages\hsfs\engine\hive.py", line 68, in read
        df_list = self._read_hopsfs(location, split, data_format)
      File "...\Python\Python37\site-packages\hsfs\engine\hive.py", line 100, in _read_hopsfs
        ) from err
    ModuleNotFoundError: Reading training dataset from HopsFS requires `pydoop`
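
For reference, here is roughly how I connect and fetch the dataset beforehand (the host, project name, training dataset name, and API key below are placeholders, and the exact connection arguments may differ depending on your hsfs version):

    import hsfs

    # Connect to the Hopsworks feature store (placeholder host / project / API key)
    connection = hsfs.connection(
        host="my-instance.hopsworks.ai",
        project="demo_fs_alexande",
        api_key_value="<API_KEY>",
    )
    fs = connection.get_feature_store()

    # Fetch the training dataset metadata and try to read it into a DataFrame
    td = fs.get_training_dataset("steri_training_data", version=1)
    dfData = td.read()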

Since, to my knowledge, the pydoop module is not available for Windows, I tried the same code in a Linux environment, where I can install pydoop. However, querying the training dataset still does not work. I get the following error messages:

    hopsfs://10.0.0.4:8020/Projects/<myProjectTrainingDataLocation>
    
    2021-07-09 09:38:15,939 INFO ipc.Client: Retrying connect to server: 10.0.0.4/10.0.0.4:8020. Already tried 0 time(s); maxRetries=45
    ...
   
    2021-07-09 09:52:56,616 INFO ipc.Client: Retrying connect to server: 10.0.0.4/10.0.0.4:8020. Already tried 44 time(s); maxRetries=45
    hdfsGetPathInfo(/Projects/demo_fs_alexande/demo_fs_alexande_Training_Datasets/steri_training_data_steri_features_10_false_1): getFileInfo error:
    (unable to get stack trace for org.apache.hadoop.net.ConnectTimeoutException exception: ExceptionUtils::getStackTrace error.)
    ...
        dfData = td.read()
      File "/root/.local/lib/python3.6/site-packages/hsfs/training_dataset.py", line 237, in read
        return self._training_dataset_engine.read(self, split, read_options)
      File "/root/.local/lib/python3.6/site-packages/hsfs/core/training_dataset_engine.py", line 84, in read
        split,
      File "/root/.local/lib/python3.6/site-packages/hsfs/engine/hive.py", line 68, in read
        df_list = self._read_hopsfs(location, split, data_format)
      File "/root/.local/lib/python3.6/site-packages/hsfs/engine/hive.py", line 105, in _read_hopsfs
        path_list = hdfs.ls(location, recursive=True)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/__init__.py", line 307, in ls
        dir_list = lsl(hdfs_path, user, recursive)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/__init__.py", line 291, in lsl
        top = next(treewalk)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/fs.py", line 631, in walk
        top = self.get_path_info(top)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/fs.py", line 406, in get_path_info
        return self.fs.get_path_info(path)
    OSError: [Errno 255] Unknown error 255

Why does loading a training dataset not work in my local Python environment?

Kind regards
Alex

Hi @alex_s,

The read() method on training datasets stored on HopsFS works, but it requires a bit of configuration.
As you have seen, it requires Pydoop, a JVM, and the HopsFS jars. It also requires direct connectivity to the private IPs of your cluster.
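
As a rough sketch, the client environment would need something along these lines before importing hsfs (the paths below are only examples and depend on where you install the JVM and the Hadoop/HopsFS client). On top of that, as the retry log in your trace shows, the client must be able to reach the NameNode IP (10.0.0.4:8020) directly, for example via a VPN or network peering:

    import os

    # Example paths only; adjust to your local installation.
    # pydoop locates the JVM and the HDFS/HopsFS client libraries through these variables.
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["HADOOP_HOME"] = "/opt/hopsfs-client"
    os.environ["HADOOP_CONF_DIR"] = "/opt/hopsfs-client/etc/hadoop"

    import hsfs  # import only after the environment is set up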

If you are using Hopsworks.ai, an easier way of interacting with a training dataset from your laptop is to store it on S3 instead. With that, you’ll have fewer issues with dependencies and connectivity.
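
Something like the following sketch, assuming you have already defined an S3 storage connector in your project settings (the connector and dataset names are placeholders, and the exact calls may vary slightly between hsfs versions):

    import hsfs

    connection = hsfs.connection(
        host="my-instance.hopsworks.ai",
        project="demo_fs_alexande",
        api_key_value="<API_KEY>",
    )
    fs = connection.get_feature_store()

    # Reuse an S3 storage connector defined in the project settings
    s3_connector = fs.get_storage_connector("my_s3_bucket")

    # Create the training dataset on S3 instead of HopsFS
    td = fs.create_training_dataset(
        name="steri_training_data",
        version=1,
        data_format="csv",
        storage_connector=s3_connector,
    )
    td.save(query)  # query built from your feature groups

    # Reading from your laptop then goes through S3 rather than HopsFS
    df = td.read()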


Fabio

Hi Fabio,

Thank you again for your feedback and your tips.
Perhaps I will then switch to a different storage connector, which should make interacting with the training dataset easier.

Kind regards
Alex