Are you trying to use Zeppelin with a Python kernel, like SageMaker, or are you using it to start and interact with a Spark application? If it’s the former, then you should follow the same instructions as for the SageMaker integration. However, from the logs it seems that it’s the latter.
If that’s the case, then I’d suggest you have a look at how to integrate Hopsworks with a Databricks cluster, as the steps are similar: https://hopsworks.readthedocs.io/en/latest/featurestore/integrations/guides/databricks.html#id17
In particular you need:
- API key. If you are running the Spark cluster on EC2 machines with access to Secrets Manager/Parameter Store, then you can use one of them to store it (as with the SageMaker integration). Otherwise you can store it in a file (see the first sketch after this list).
- The client jars. The feature store uses the Hive Metastore to manage the feature groups, and you need our client to be able to access it. If you run the setup_databricks() method of the hops-util-py library, you should be able to download a tar.gz file with both our Metastore client and the client for HopsFS. Here things start to differ from the Databricks guide: you need to add the content of that tar.gz file to your classpath, and how you do that depends on what you are using to run Spark.
- Certificates. The setup_databricks() method also downloads the trustStore.jks and a file called material_passwd. You need to make sure they are on the Spark executors. Again, here the instructions differ from the Databricks documentation, as you probably don’t have
- Configuration. You need to add some configuration properties to your Spark configuration (see the second sketch after this list):
spark.hadoop.hops.ssl.keystores.passwd.name [Path to material_passwd within the Spark executors]
spark.hadoop.hops.ssl.keystore.name [Path to keyStore.jks within the Spark executors]
spark.hadoop.hops.ssl.trustore.name [Path to trustStore.jks within the Spark executors]
spark.sql.hive.metastore.jars [Path to the jar files for the Hive metastore (the ones from the `tar.gz` you downloaded)]
spark.hadoop.hive.metastore.uris thrift://[hopsworks.ai URL]:9083
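On the API key point: here is a minimal sketch of reading it from AWS Secrets Manager, in the same spirit as the SageMaker integration, with a file-based fallback. The secret name, region, file path, and helper names are my own placeholders, not something the integration defines.

```python
# Hypothetical sketch: fetch the Hopsworks API key from AWS Secrets Manager.
# The secret name, region, and file path below are placeholders; adapt them
# to wherever you actually stored the key.
import boto3


def get_hopsworks_api_key(secret_name="hopsworks/api-key", region="eu-west-1"):
    client = boto3.client("secretsmanager", region_name=region)
    secret = client.get_secret_value(SecretId=secret_name)
    return secret["SecretString"]


def get_hopsworks_api_key_from_file(path="/placeholder/hopsworks_api_key"):
    # Fallback if the machines don't have access to Secrets Manager/Parameter Store
    with open(path) as f:
        return f.read().strip()
```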
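To make the configuration step more concrete, here is a minimal PySpark sketch of a session that sets the properties above. All paths and the Hopsworks host are placeholders, and depending on how you run Spark you may prefer to put these in spark-defaults.conf or pass them with --conf instead of setting them in code.

```python
# Hypothetical sketch: a SparkSession wired up against the Hopsworks Feature Store.
# Every path below is a placeholder for wherever you put the files from the
# tar.gz and the certificates on the Spark executors.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hopsworks-feature-store")
    # Certificate material available on the executors
    .config("spark.hadoop.hops.ssl.keystores.passwd.name", "/placeholder/material_passwd")
    .config("spark.hadoop.hops.ssl.keystore.name", "/placeholder/keyStore.jks")
    .config("spark.hadoop.hops.ssl.trustore.name", "/placeholder/trustStore.jks")
    # Hive Metastore client jars extracted from the tar.gz you downloaded
    .config("spark.sql.hive.metastore.jars", "/placeholder/client/*")
    .config("spark.hadoop.hive.metastore.uris", "thrift://<hopsworks-ai-host>:9083")
    .enableHiveSupport()
    .getOrCreate()
)
```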
It’s also important that the Spark cluster is in the same VPC as the Hopsworks.ai cluster. Alternatively, you need to set up VPC peering between the two (https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html).
It’s a bit messy, but if everything is done correctly it works. Let me know if you have issues.