Fetching a training dataset does not work

Hi,

I am trying to access a training dataset from my local Python environment on Windows 10 and read it into a Pandas DataFrame with the read() method of the TrainingDataset object. However, this leads to the following error:

    dfData = td.read()
      File "...\Python\Python37\site-packages\hsfs\training_dataset.py", line 237, in read
        return self._training_dataset_engine.read(self, split, read_options)
      File "...\Python\Python37\site-packages\hsfs\core\training_dataset_engine.py", line 84, in read
        split,
      File "...\Python\Python37\site-packages\hsfs\engine\hive.py", line 68, in read
        df_list = self._read_hopsfs(location, split, data_format)
      File "...\Python\Python37\site-packages\hsfs\engine\hive.py", line 100, in _read_hopsfs
        ) from err
    ModuleNotFoundError: Reading training dataset from HopsFS requires `pydoop`
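
For reference, the code that triggers this looks roughly as follows; host, project, API key and dataset name/version are placeholders for my actual values:

    import hsfs

    # Placeholders: host, project, API key and training dataset name/version.
    connection = hsfs.connection(
        host="<my-cluster-host>",
        project="<my-project>",
        api_key_value="<my-api-key>",
        engine="hive",
    )
    fs = connection.get_feature_store()

    td = fs.get_training_dataset("<my_training_dataset>", version=1)
    dfData = td.read()  # fails with the ModuleNotFoundError above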

Since, to my knowledge, the pydoop module is not available for Windows, I tried the same code in a Linux environment, where I was able to install pydoop. However, querying the training dataset still does not work; I get the following error messages:

    hopsfs://10.0.0.4:8020/Projects/<myProjectTrainingDataLocation>
    
    2021-07-09 09:38:15,939 INFO ipc.Client: Retrying connect to server: 10.0.0.4/10.0.0.4:8020. Already tried 0 time(s); maxRetries=45
    ...
   
    2021-07-09 09:52:56,616 INFO ipc.Client: Retrying connect to server: 10.0.0.4/10.0.0.4:8020. Already tried 44 time(s); maxRetries=45
    hdfsGetPathInfo(/Projects/demo_fs_alexande/demo_fs_alexande_Training_Datasets/steri_training_data_steri_features_10_false_1): getFileInfo error:
    (unable to get stack trace for org.apache.hadoop.net.ConnectTimeoutException exception: ExceptionUtils::getStackTrace error.)
    ...
        dfData = td.read()
      File "/root/.local/lib/python3.6/site-packages/hsfs/training_dataset.py", line 237, in read
        return self._training_dataset_engine.read(self, split, read_options)
      File "/root/.local/lib/python3.6/site-packages/hsfs/core/training_dataset_engine.py", line 84, in read
        split,
      File "/root/.local/lib/python3.6/site-packages/hsfs/engine/hive.py", line 68, in read
        df_list = self._read_hopsfs(location, split, data_format)
      File "/root/.local/lib/python3.6/site-packages/hsfs/engine/hive.py", line 105, in _read_hopsfs
        path_list = hdfs.ls(location, recursive=True)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/__init__.py", line 307, in ls
        dir_list = lsl(hdfs_path, user, recursive)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/__init__.py", line 291, in lsl
        top = next(treewalk)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/fs.py", line 631, in walk
        top = self.get_path_info(top)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/fs.py", line 406, in get_path_info
        return self.fs.get_path_info(path)
    OSError: [Errno 255] Unknown error 255

Why does loading a training dataset not work in my local Python environment?

Kind regards
Alex

Hi @alex_s,

The read() method on training datasets stored on HopsFS works, but it requires a bit of configuration.
As you have seen, it requires Pydoop, a JVM and the HopsFS jars, as well as direct connectivity to the private IPs of your cluster.
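
Just to illustrate the kind of setup involved (the paths below are placeholders, adjust them to your local JDK and Hadoop client installation), pydoop/libhdfs typically have to be pointed at a JVM and the Hadoop jars before hsfs is imported:

    import os

    # Placeholder paths: adjust to your local JDK and Hadoop/HopsFS client.
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
    os.environ["HADOOP_HOME"] = "/opt/hadoop"

    # libhdfs (used by pydoop) also needs the Hadoop jars on the CLASSPATH;
    # `hadoop classpath --glob` expands the full list for your distribution.
    hadoop_bin = os.path.join(os.environ["HADOOP_HOME"], "bin", "hadoop")
    os.environ["CLASSPATH"] = os.popen(hadoop_bin + " classpath --glob").read().strip()

    import hsfs  # import only after the environment is configured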

If you are using Hopsworks.ai, an easier way of interacting with a training dataset from your laptop is to store it on S3 instead. That way you'll have fewer issues with dependencies and connectivity.
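
As a rough sketch (the connector, feature group and dataset names are just examples, and the S3 connector has to be configured in your project first), you would create the training dataset against the S3 connector, typically from a notebook or job on Hopsworks, and then read it from your laptop:

    # `fs` is the feature store handle from hsfs.connection(...).get_feature_store().
    # Example names only; the S3 connector must already exist in the project.
    s3_conn = fs.get_storage_connector("my_s3_connector")
    fg = fs.get_feature_group("my_feature_group", version=1)

    td = fs.create_training_dataset(
        name="my_training_dataset",
        version=1,
        data_format="csv",
        storage_connector=s3_conn,
    )
    td.save(fg.select_all())

    # From your laptop, td.read() then fetches the data from S3 instead of HopsFS.
    df = td.read()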


Fabio

Hi Fabio,

thank you again for your feedback and your tips.
Perhaps I will then use a different storage connector, which might make interacting with the data easier.

Kind regards
Alex

@Fabio

I ran into the same issue while calling the read() method with the 'hive' engine from an external Python kernel/env against a community Hopsworks feature store 2.4. Can you provide more details on where to get the HopsFS jars and how to install them? I managed to pip-install Pydoop==2.0.0 with JVM 11 and the Hadoop 3.3.2 jar on my external Python kernel, and then I got a Permission denied error:

    df: DataFrame = td.read(split=config.get("dataset_split", "train"))
      File "/usr/local/lib/python3.8/site-packages/hsfs/training_dataset.py", line 257, in read
        return self._training_dataset_engine.read(self, split, read_options)
      File "/usr/local/lib/python3.8/site-packages/hsfs/core/training_dataset_engine.py", line 107, in read
        return training_dataset.storage_connector.read(
      File "/usr/local/lib/python3.8/site-packages/hsfs/storage_connector.py", line 106, in read
        return engine.get_instance().read(self, data_format, options, path)
      File "/usr/local/lib/python3.8/site-packages/hsfs/engine/hive.py", line 73, in read
        df_list = self._read_hopsfs(location, data_format)
      File "/usr/local/lib/python3.8/site-packages/hsfs/engine/hive.py", line 108, in _read_hopsfs
        path_list = hdfs.ls(location, recursive=True)
      File "/usr/local/lib/python3.8/site-packages/pydoop/hdfs/__init__.py", line 307, in ls
        dir_list = lsl(hdfs_path, user, recursive)
      File "/usr/local/lib/python3.8/site-packages/pydoop/hdfs/__init__.py", line 291, in lsl
        top = next(treewalk)
      File "/usr/local/lib/python3.8/site-packages/pydoop/hdfs/fs.py", line 631, in walk
        top = self.get_path_info(top)
      File "/usr/local/lib/python3.8/site-packages/pydoop/hdfs/fs.py", line 406, in get_path_info
        return self.fs.get_path_info(path)
    PermissionError: [Errno 13] Permission denied
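
To narrow down whether the Permission denied comes from my local Hadoop/pydoop setup or from hsfs itself, I can also call pydoop directly against the dataset location; the path and user below are placeholders, and the user format is my own guess:

    import pydoop.hdfs as hdfs

    # Placeholders: HopsFS location of the training dataset and the project HDFS user.
    location = "hdfs://<namenode-ip>:8020/Projects/<project>/<project>_Training_Datasets/<td>_1"

    # This is the same call hsfs makes internally (see the traceback above).
    print(hdfs.ls(location, recursive=True))

    # Also worth trying with an explicit user (format is a guess on my part).
    print(hdfs.ls(location, user="<project>__<username>", recursive=True))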

Should I use the spark engine for hsfs.connection() instead of hive? I wasn't able to download client.tar.gz from my community edition feature store 2.4 when following the Spark integration guide (Spark - Hopsworks Documentation); there is no Integrations tab in the community 2.4 UI.

I really appreciate your help in advance.

Hi @Yingding, Integrations are part of the enterprise edition. Do you have an on-prem installation?

/Davit

@Davit_Bzhalava thanks, I see. Yes, I have an on-prem community version installed. Is there any chance of also enabling the HopsFS integration for the community version in the future?