Are you trying to use Zeppelin with a Python kernel, like SageMaker, or are you using it to start and interact with a Spark application? If it’s the former, then you should follow the same instructions as for the SageMaker integration. However, from the logs it seems that it’s the latter.
If that’s the case, then I’d suggest you have a look at how to integrate Hopsworks with a Databricks cluster, as the steps are similar: https://hopsworks.readthedocs.io/en/latest/featurestore/integrations/guides/databricks.html#id17
In particular you need:
- API key. If you are running the Spark cluster on EC2 machines with access to Secrets Manager/Parameter Store, then you can use one of them to store it (as with the SageMaker integration). Otherwise you can store it in a file (see the first sketch after this list).
- The client jars. The feature store uses the Hive Metastore to manage the feature groups, and you need our client to be able to access it. If you run the setup_databricks() method of the hops-util-py library, you should be able to download a tar.gz file with both our Metastore client and the client for HopsFS. Here things start to differ from the Databricks guide: you need to add the content of that tar.gz file to your classpath, and how you do that depends on what you are using to run Spark.
- Certificates. The setup_databricks() method also downloads the trustStore.jks and a file called material_passwd. You need to make sure they are on the Spark executors. Again, here the instructions differ from the Databricks documentation, as you probably don’t have
- Configuration. You need to add some configuration properties to your Spark configuration (see the second sketch after this list):
spark.hadoop.hops.ssl.keystores.passwd.name [Path to material_passwd within the Spark executors]
spark.hadoop.hops.ssl.keystore.name [Path to keyStore.jks within the Spark executors]
spark.hadoop.hops.ssl.trustore.name [Path to trustStore.jks within the Spark executors]
spark.sql.hive.metastore.jars [Path to the jar files for the Hive metastore (the ones from the `tar.gz` you downloaded)]
spark.hadoop.hive.metastore.uris thrift://[hopsworks.ai URL]:9083
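On the API key point: here is a minimal sketch of reading it from AWS Secrets Manager, in the same spirit as the SageMaker integration, with a file-based fallback. The secret name, region, file path, and helper names are my own placeholders, not something the integration defines.

```python
# Hypothetical sketch: fetch the Hopsworks API key from AWS Secrets Manager.
# The secret name, region, and file path below are placeholders; adapt them
# to wherever you actually stored the key.
import boto3


def get_hopsworks_api_key(secret_name="hopsworks/api-key", region="eu-west-1"):
    client = boto3.client("secretsmanager", region_name=region)
    secret = client.get_secret_value(SecretId=secret_name)
    return secret["SecretString"]


def get_hopsworks_api_key_from_file(path="/placeholder/hopsworks_api_key"):
    # Fallback if the machines don't have access to Secrets Manager/Parameter Store
    with open(path) as f:
        return f.read().strip()
```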
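To make the configuration step more concrete, here is a minimal PySpark sketch of a session that sets the properties above. All paths and the Hopsworks host are placeholders, and depending on how you run Spark you may prefer to put these in spark-defaults.conf or pass them with --conf instead of setting them in code.

```python
# Hypothetical sketch: a SparkSession wired up against the Hopsworks Feature Store.
# Every path below is a placeholder for wherever you put the files from the
# tar.gz and the certificates on the Spark executors.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hopsworks-feature-store")
    # Certificate material available on the executors
    .config("spark.hadoop.hops.ssl.keystores.passwd.name", "/placeholder/material_passwd")
    .config("spark.hadoop.hops.ssl.keystore.name", "/placeholder/keyStore.jks")
    .config("spark.hadoop.hops.ssl.trustore.name", "/placeholder/trustStore.jks")
    # Hive Metastore client jars extracted from the tar.gz you downloaded
    .config("spark.sql.hive.metastore.jars", "/placeholder/client/*")
    .config("spark.hadoop.hive.metastore.uris", "thrift://<hopsworks-ai-host>:9083")
    .enableHiveSupport()
    .getOrCreate()
)
```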
It’s also important that the Spark cluster is in the same VPC as the Hopsworks.ai cluster. Alternatively, you need to set up VPC peering between the two (https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html).
It’s a bit messy, but if everything is done correctly it works. Let me know if you have issues.