Problem inserting new features into the Feature Store using the Python API

Hi,

I have a Hopsworks 1.3.0-SNAPSHOT instance with a project and feature store up and running.
As I understand it, the dataframe passed into insert_into_featuregroup(df, featuregroup_name) will be converted to a Spark dataframe and then inserted into the Feature Store, so I should connect to the remote single-node Spark cluster with Hive support enabled.
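For example, I'd expect something along these lines to work (a rough sketch; the per-pixel column naming is just my assumption):

import cv2
import pandas as pd
import hops.featurestore as fs

# wrap the 2D numpy image array in a pandas dataframe with named columns
# (illustrative naming only)
img = cv2.imread("/path/img.jpg", cv2.IMREAD_GRAYSCALE)
df = pd.DataFrame(img, columns=["px_%d" % i for i in range(img.shape[1])])

# the library should convert df to a Spark dataframe before writing
fs.insert_into_featuregroup(df, "test_featuregroup")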

So from a local machine, I’m trying to do the following steps:

import cv2
import hops.featurestore as fs
from pyspark.sql import SparkSession
fs.connect(host="<hopsworks-instance-ip>", project_name="test", api_key_file="/path/to/generated/api.key", secrets_store="local", hostname_verification=False)
# successfully connects to the remote feature store; I'm able to call get_featuregroups() and see the resulting list
spark_session = SparkSession.builder.master("spark://<hopsworks-instance-ip>:7077").appName("test").enableHiveSupport().getOrCreate()
data = cv2.imread("/path/img.jpg", cv2.IMREAD_GRAYSCALE)  # load the image as grayscale to get a 2D numpy array, so it can be converted to a Spark dataframe later
fs.insert_into_featuregroup(data, "test_featuregroup")

The above code fails with "Connection refused: <hopsworks-instance-ip>:7077", even though I'm sure the port is configured correctly.

Do I understand the steps correctly? If not, could you explain how to insert features into the Feature Store from a remote machine?
I checked the listening ports on the Hopsworks instance, and no service is listening on port 7077.
Which port should I use to connect to Spark remotely with a SparkSession?

Best regards,

Hi HiepNguyen,

It's possible to write to the Feature Store from an external Spark or Hadoop cluster, but this is an enterprise feature and is not included in the community version. You could use Hopsworks.ai, which lets you use the Enterprise version of Hopsworks in its free tier. If that's not an option for you, you could approach the problem from the other direction: create a job in your Hopsworks cluster that loads the data from the external source, as in the sketch below.
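A minimal sketch of that approach, assuming the raw data has already been uploaded to the project (the CSV path and feature group name are placeholders):

from hops import featurestore
from pyspark.sql import SparkSession

# inside a Hopsworks job the session comes pre-configured with Hive support,
# and no explicit connect() is needed
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# load the raw data from wherever it lands in the project,
# e.g. an uploaded CSV (path is hypothetical)
df = spark.read.csv("hdfs:///Projects/test/Resources/features.csv", header=True, inferSchema=True)

# write the dataframe into the feature group
featurestore.insert_into_featuregroup(df, "test_featuregroup")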
