Is it possible to join an on demand feature group with a cached feature group?
How do I join two feature groups with colliding column names that are not part of the join key? I get a lot of ambiguous reference errors.
Hi @Tim,
Yes, it's possible. Once you get the metadata handles for the cached (FeatureStore - Hopsworks Documentation) and on-demand (FeatureStore - Hopsworks Documentation) feature groups, both metadata objects offer the same functionality.
They can be joined together using the join method, and you can select features from them with the same select, select_all and select_except methods (FeatureGroup - Hopsworks Documentation).
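For example (a minimal sketch; the feature group names, versions and join key below are hypothetical placeholders):

import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()

# Metadata handle for a cached feature group (hypothetical name/version)
cached_fg = fs.get_feature_group(name="transactions_fg", version=1)
# Metadata handle for an on-demand feature group (hypothetical name/version)
on_demand_fg = fs.get_on_demand_feature_group(name="customers_snowflake", version=1)

# Both handles expose the same query API, so they can be selected from and joined directly
query = on_demand_fg.select_all().join(cached_fg.select(["amount", "ts"]), on="customer_id")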
Regarding the ambiguous reference errors: that's unfortunately still an issue in your version, and my suggestion is to exclude the colliding feature from the select list on one side of the join. In the new version we are releasing, we have also added a prefix option to the join method to handle these name collisions.
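A minimal sketch of how the prefix option is meant to be used once you are on that version (the handles and key name are the hypothetical ones from the sketch above; check the join signature in the docs for your release):

# Colliding columns coming from the joined feature group get the given prefix
query = on_demand_fg.select_all().join(
    cached_fg.select_all(),
    on="customer_id",
    prefix="cached_",
)
# A colliding column such as "amount" from cached_fg would then surface as "cached_amount"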
–
Fabio
I can't join an on-demand feature group with a cached feature group.
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()

payment = fs.get_on_demand_feature_group(name="paymenttype_snowflake", version=1)
account_fg_meta = fs.get_feature_group(name="account_fg", version=1)

query = payment.select_all().join(account_fg_meta.select(['account', 'id']), on='account')

s3_conn = fs.get_storage_connector("S3")
td = fs.create_training_dataset(name="abc_model",
                                description="Dataset to train the payment model",
                                data_format="csv",
                                storage_connector=s3_conn,
                                version=1)
td.save(query)
Here is the error I got:
py4j.protocol.Py4JJavaError: An error occurred while calling o148.select.
: org.apache.spark.sql.AnalysisException: Reference 'account' is ambiguous, could be: fg1.account, fg0.account.;
Based on the join you are performing, the value of account should be the same in both payment and account_fg.
So this code is going to work:
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()

payment = fs.get_on_demand_feature_group(name="paymenttype_snowflake", version=1)
account_fg_meta = fs.get_feature_group(name="account_fg", version=1)

query = payment.select_all().join(account_fg_meta.select(['id']), on='account')

s3_conn = fs.get_storage_connector("S3")
td = fs.create_training_dataset(name="abc_model",
                                description="Dataset to train the payment model",
                                data_format="csv",
                                storage_connector=s3_conn,
                                version=1)
td.save(query)
(I removed the account feature from the account_fg_meta select list.)
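An equivalent way to drop the colliding column is the select_except method mentioned above; this sketch keeps every account_fg feature except account instead of listing the ones to keep:

# Exclude only the colliding join key from the right-hand side of the join
query = payment.select_all().join(
    account_fg_meta.select_except(['account']),
    on='account',
)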