Fetching of training dataset does not work

Hi,

I am trying to access a training dataset from my local Python environment on Windows 10 and read it into a Pandas DataFrame with the read() function of the TrainingDataset object. However, this leads to the following error:

    dfData = td.read()
      File "...\Python\Python37\site-packages\hsfs\training_dataset.py", line 237, in read
        return self._training_dataset_engine.read(self, split, read_options)
      File "...\Python\Python37\site-packages\hsfs\core\training_dataset_engine.py", line 84, in read
        split,
      File "...\Python\Python37\site-packages\hsfs\engine\hive.py", line 68, in read
        df_list = self._read_hopsfs(location, split, data_format)
      File "...\Python\Python37\site-packages\hsfs\engine\hive.py", line 100, in _read_hopsfs
        ) from err
    ModuleNotFoundError: Reading training dataset from HopsFS requires `pydoop`
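
For reference, here is roughly how I connect and fetch the dataset beforehand (the host, project name, training dataset name, and API key below are placeholders, and the exact connection arguments may differ depending on your hsfs version):

    import hsfs

    # Connect to the Hopsworks feature store (placeholder host / project / API key)
    connection = hsfs.connection(
        host="my-instance.hopsworks.ai",
        project="demo_fs_alexande",
        api_key_value="<API_KEY>",
    )
    fs = connection.get_feature_store()

    # Fetch the training dataset metadata and try to read it into a DataFrame
    td = fs.get_training_dataset("steri_training_data", version=1)
    dfData = td.read()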

Since, to my knowledge, the pydoop module is not available for Windows, I tried the same code in a Linux environment, where I can install pydoop. However, querying the training dataset still does not work. I get the following error messages:

    hopsfs://10.0.0.4:8020/Projects/<myProjectTrainingDataLocation>
    
    2021-07-09 09:38:15,939 INFO ipc.Client: Retrying connect to server: 10.0.0.4/10.0.0.4:8020. Already tried 0 time(s); maxRetries=45
    ...
   
    2021-07-09 09:52:56,616 INFO ipc.Client: Retrying connect to server: 10.0.0.4/10.0.0.4:8020. Already tried 44 time(s); maxRetries=45
    hdfsGetPathInfo(/Projects/demo_fs_alexande/demo_fs_alexande_Training_Datasets/steri_training_data_steri_features_10_false_1): getFileInfo error:
    (unable to get stack trace for org.apache.hadoop.net.ConnectTimeoutException exception: ExceptionUtils::getStackTrace error.)
    ...
        dfData = td.read()
      File "/root/.local/lib/python3.6/site-packages/hsfs/training_dataset.py", line 237, in read
        return self._training_dataset_engine.read(self, split, read_options)
      File "/root/.local/lib/python3.6/site-packages/hsfs/core/training_dataset_engine.py", line 84, in read
        split,
      File "/root/.local/lib/python3.6/site-packages/hsfs/engine/hive.py", line 68, in read
        df_list = self._read_hopsfs(location, split, data_format)
      File "/root/.local/lib/python3.6/site-packages/hsfs/engine/hive.py", line 105, in _read_hopsfs
        path_list = hdfs.ls(location, recursive=True)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/__init__.py", line 307, in ls
        dir_list = lsl(hdfs_path, user, recursive)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/__init__.py", line 291, in lsl
        top = next(treewalk)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/fs.py", line 631, in walk
        top = self.get_path_info(top)
      File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/fs.py", line 406, in get_path_info
        return self.fs.get_path_info(path)
    OSError: [Errno 255] Unknown error 255

Why does loading a training dataset not work in my local Python environment?

Kind regards
Alex

Hi @alex_s,

The read() method on training datasets stored on HopsFS works, but it requires a bit of configuration.
As you have seen, it requires Pydoop, a JVM, and the HopsFS jars. It also requires direct connectivity to the private IPs of your cluster.
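
As a rough sketch, the client environment would need something along these lines before importing hsfs (the paths below are only examples and depend on where you install the JVM and the Hadoop/HopsFS client). On top of that, as the retry log in your trace shows, the client must be able to reach the NameNode IP (10.0.0.4:8020) directly, for example via a VPN or network peering:

    import os

    # Example paths only; adjust to your local installation.
    # pydoop locates the JVM and the HDFS/HopsFS client libraries through these variables.
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["HADOOP_HOME"] = "/opt/hopsfs-client"
    os.environ["HADOOP_CONF_DIR"] = "/opt/hopsfs-client/etc/hadoop"

    import hsfs  # import only after the environment is set up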

If you are using Hopsworks.ai, an easier way of interacting with a training dataset from your laptop is to store it on S3 instead. With that, you’ll have fewer issues with dependencies and connectivity.
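
Something like the following sketch, assuming you have already defined an S3 storage connector in your project settings (the connector and dataset names are placeholders, and the exact calls may vary slightly between hsfs versions):

    import hsfs

    connection = hsfs.connection(
        host="my-instance.hopsworks.ai",
        project="demo_fs_alexande",
        api_key_value="<API_KEY>",
    )
    fs = connection.get_feature_store()

    # Reuse an S3 storage connector defined in the project settings
    s3_connector = fs.get_storage_connector("my_s3_bucket")

    # Create the training dataset on S3 instead of HopsFS
    td = fs.create_training_dataset(
        name="steri_training_data",
        version=1,
        data_format="csv",
        storage_connector=s3_connector,
    )
    td.save(query)  # query built from your feature groups

    # Reading from your laptop then goes through S3 rather than HopsFS
    df = td.read()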


Fabio

Hi Fabio,

Thank you again for your feedback and your tips.
Perhaps I will then switch to a different storage connector, which should make interacting with the training dataset easier.

Kind regards
Alex