Error when running titanic dataset notebook in demo

I was trying to run the Titanic dataset .ipynb file provided with the demo by creating a Python job. The job execution fails and I keep getting the following error:
Traceback (most recent call last):
  File "job_tmp_titanic_sample.py", line 1, in <module>
    from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'
And when I go to the Python tab and try to install the required package, the following error shows up:
"The project's Python environment failed to initialize, please recreate the environment."

Hi,

The Titanic example uses PySpark, so when creating a job from a PySpark notebook the Spark job type should be selected. Before trying again, you can also remove and recreate the project's Python environment.
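As an illustration (not the exact demo code), a PySpark notebook typically starts with something like the snippet below; since pyspark is available to Spark jobs but not, by default, in the project's Python environment, this import is what fails when the notebook is run as a plain Python job:

from pyspark.sql import SparkSession

# Entry point for PySpark; resolves only when the job is executed as a Spark job
spark = SparkSession.builder.getOrCreate()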

Is it possible to have a Python feature engineering notebook that stores features from a pandas dataframe into the feature store? And can I run that notebook as a Python job?

Yes, that is supported: you can use the same feature group ingestion API with both Spark and pandas dataframes. So if you have a notebook that uses the Python kernel and calls the feature group .save() or .insert() API with a pandas dataframe (see Feature Group - Hopsworks Documentation), then you can run that notebook as a Python job.

For the moment this works with feature groups that do not have time-travel enabled, but from the next release that will be supported as well.
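For example, a minimal sketch of such a notebook could look like the following (the feature group name, columns, and primary key are placeholders, and the exact create_feature_group arguments may differ between hsfs versions):

import hsfs
import pandas as pd

# Connect to the feature store of the current project
connection = hsfs.connection()
fs = connection.get_feature_store()

# Hypothetical engineered features in a pandas dataframe
df = pd.DataFrame({
    "passenger_id": [1, 2, 3],
    "age": [22.0, 38.0, 26.0],
    "survived": [0, 1, 1],
})

# Create the feature group metadata and ingest the dataframe.
# time_travel_format=None creates a feature group without time travel,
# which is what Python jobs currently support.
fg = fs.create_feature_group(
    name="titanic_features",
    version=1,
    primary_key=["passenger_id"],
    time_travel_format=None,
)
fg.save(df)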

I have used a simple Python feature engineering notebook that creates a feature group and stores the features from a pandas dataframe. The code works fine when run from JupyterLab, and the feature group and features get created and stored, but when running it as a Python job, I am getting the following error:
2021-04-14 09:28:01.974758: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "job_tmp_titanic_sample_python.py", line 8, in <module>
    conn = hsfs.connection()
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/connection.py", line 318, in connection
    api_key_value,
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/connection.py", line 140, in __init__
    self.connect()
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/decorators.py", line 25, in if_not_connected
    return fn(inst, *args, **kwargs)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/connection.py", line 214, in connect
    client.init("hopsworks")
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/client/__init__.py", line 39, in init
    _client = hopsworks.Client()
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/client/hopsworks.py", line 60, in __init__
    self._auth = auth.BearerAuth(self._read_jwt())
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/client/base.py", line 82, in _read_jwt
    with open(self.TOKEN_FILE, "r") as jwt:
FileNotFoundError: [Errno 2] No such file or directory: 'token.jwt'

We have managed to reproduce the issue; it is related to the current working directory of the Python job, which is different from the one used in Jupyter.

Until we release a fix, a workaround is to add the following code at the beginning of your notebook, which copies the missing file to where the Python job expects it to be. Let me know if that works for you.

import shutil
import os

# Copy the JWT token from the secrets directory to the job's current
# working directory, where the hsfs client expects to find it.
if os.path.exists("/srv/hops/secrets"):
    token = "token.jwt"
    source = os.path.join("/srv/hops/secrets/", token)
    destination = os.path.join(os.getcwd(), token)
    dest = shutil.copyfile(source, destination)

Now I am getting the following error when running the job:
'io.hops.hopsworks.exceptions.ServiceException'

Is there a more specific error in the stdout/stderr logs? There are two buttons to view the logs under the execution actions.

Does it work if you try to run a simple Python program? For example, you can upload a hello.py that contains only

print("hello")

in a dataset and then create and run the Python job.

Also, which Hopsworks version are you using? Do you have CLI access to the cluster VMs?