Hi,
I am trying to access a training dataset from my local Python environment on Windows 10 and read it into a Pandas DataFrame with the read() method of the TrainingDataset object. However, this leads to the following error (the full call sequence is shown after the traceback):
    dfData = td.read()
  File "...\Python\Python37\site-packages\hsfs\training_dataset.py", line 237, in read
    return self._training_dataset_engine.read(self, split, read_options)
  File "...\Python\Python37\site-packages\hsfs\core\training_dataset_engine.py", line 84, in read
    split,
  File "...\Python\Python37\site-packages\hsfs\engine\hive.py", line 68, in read
    df_list = self._read_hopsfs(location, split, data_format)
  File "...\Python\Python37\site-packages\hsfs\engine\hive.py", line 100, in _read_hopsfs
    ) from err
ModuleNotFoundError: Reading training dataset from HopsFS requires `pydoop`
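For context, this is roughly how I connect and read the dataset with the hsfs Python client (host, project, API key, and training-dataset name are placeholders here):

import hsfs

# Connect to the Hopsworks feature store from my local environment
# (host, project and API key below are placeholders, not the real values)
connection = hsfs.connection(
    host="<hopsworks-host>",
    project="<my-project>",
    api_key_value="<api-key>",
)
fs = connection.get_feature_store()

# Get the training dataset and try to read it into a Pandas DataFrame
td = fs.get_training_dataset("<training_dataset_name>", version=1)
dfData = td.read()  # this call raises the error shown above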
Since, to my knowledge, the pydoop module is not available for Windows, I tried the same code in a Linux environment, where I can install pydoop. However, querying the training dataset still does not work. I obtain the following error messages:
hopsfs://10.0.0.4:8020/Projects/<myProjectTrainingDataLocation>
2021-07-09 09:38:15,939 INFO ipc.Client: Retrying connect to server: 10.0.0.4/10.0.0.4:8020. Already tried 0 time(s); maxRetries=45
...
2021-07-09 09:52:56,616 INFO ipc.Client: Retrying connect to server: 10.0.0.4/10.0.0.4:8020. Already tried 44 time(s); maxRetries=45
hdfsGetPathInfo(/Projects/demo_fs_alexande/demo_fs_alexande_Training_Datasets/steri_training_data_steri_features_10_false_1): getFileInfo error:
(unable to get stack trace for org.apache.hadoop.net.ConnectTimeoutException exception: ExceptionUtils::getStackTrace error.)
...
    dfData = td.read()
  File "/root/.local/lib/python3.6/site-packages/hsfs/training_dataset.py", line 237, in read
    return self._training_dataset_engine.read(self, split, read_options)
  File "/root/.local/lib/python3.6/site-packages/hsfs/core/training_dataset_engine.py", line 84, in read
    split,
  File "/root/.local/lib/python3.6/site-packages/hsfs/engine/hive.py", line 68, in read
    df_list = self._read_hopsfs(location, split, data_format)
  File "/root/.local/lib/python3.6/site-packages/hsfs/engine/hive.py", line 105, in _read_hopsfs
    path_list = hdfs.ls(location, recursive=True)
  File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/__init__.py", line 307, in ls
    dir_list = lsl(hdfs_path, user, recursive)
  File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/__init__.py", line 291, in lsl
    top = next(treewalk)
  File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/fs.py", line 631, in walk
    top = self.get_path_info(top)
  File "/root/.local/lib/python3.6/site-packages/pydoop/hdfs/fs.py", line 406, in get_path_info
    return self.fs.get_path_info(path)
OSError: [Errno 255] Unknown error 255
Why does reading a training dataset not work from my local Python environment?
Kind regards
Alex