Question about join

Is it possible to join an on demand feature group with a cached feature group?

How do I join two feature groups that have colliding column names which are not part of the join key? I get a lot of ambiguous reference errors.

Hi @Tim,

Yes, it’s possible. Once you get the metadata handles for the cached feature group (FeatureStore - Hopsworks Documentation) and the on-demand feature group (FeatureStore - Hopsworks Documentation), both metadata objects expose the same functionality.

They can be joined together using the join method, and you can select features from them using the same select, select_all and select_except methods (FeatureGroup - Hopsworks Documentation).
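As a rough illustration of what those three selection methods do, here is a plain-Python sketch of the projection semantics (illustration only, using hypothetical helper functions, not the hsfs API itself):

```python
# Plain-Python sketch of the projection semantics behind select,
# select_all and select_except (illustration, not the hsfs API).

def select(row, features):
    """Keep only the named features."""
    return {k: v for k, v in row.items() if k in features}

def select_all(row):
    """Keep every feature."""
    return dict(row)

def select_except(row, excluded):
    """Keep everything except the named features."""
    return {k: v for k, v in row.items() if k not in excluded}

row = {"account": 1, "id": 10, "balance": 99.5}
print(select(row, ["id"]))              # {'id': 10}
print(select_except(row, ["account"]))  # {'id': 10, 'balance': 99.5}
```

The third variant, select_except, is the handy one for the collision problem below: it lets you drop a duplicated column from one side of the join without listing every other column by hand.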

Regarding the ambiguous reference errors: unfortunately that is still an issue in your version, and my suggestion is to exclude the colliding feature from the select on one side of the join.

In the upcoming release we have also added a prefix option to the join method to handle these name collisions.
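Conceptually, the prefix renames the right-hand side’s columns in the joined result so they can no longer clash with the left-hand side. A minimal plain-Python sketch of that behavior (a hypothetical helper illustrating the idea, not the actual hsfs implementation):

```python
# Plain-Python sketch of how a join prefix resolves column-name
# collisions (illustration of the idea, not the hsfs implementation).

def join_with_prefix(left_rows, right_rows, on, prefix):
    """Inner-join two lists of dicts on `on`, renaming every
    right-hand column (except the join key) with `prefix`."""
    right_index = {r[on]: r for r in right_rows}
    joined = []
    for left in left_rows:
        right = right_index.get(left[on])
        if right is None:
            continue  # no match: inner join drops the row
        out = dict(left)
        for k, v in right.items():
            if k != on:
                out[prefix + k] = v  # prefixed, so no ambiguity
        joined.append(out)
    return joined

payment = [{"account": 1, "type": "credit"}]
account_fg = [{"account": 1, "type": "savings", "id": 10}]
print(join_with_prefix(payment, account_fg, on="account", prefix="acc_"))
# [{'account': 1, 'type': 'credit', 'acc_type': 'savings', 'acc_id': 10}]
```

With a prefix, both sides can keep a column like type in the final result instead of one side having to drop it.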


Fabio

I can’t join an on-demand feature group with a cached feature group.

import hsfs
connection = hsfs.connection()
fs = connection.get_feature_store()
payment = fs.get_on_demand_feature_group(name="paymenttype_snowflake", version=1)
account_fg_meta = fs.get_feature_group(name="account_fg",version=1)
query = payment.select_all().join(account_fg_meta.select(['account', 'id']), on='account')
s3_conn = fs.get_storage_connector("S3")
td = fs.create_training_dataset(name="abc_model",
                                description="Dataset to train the payment model",
                                data_format="csv",
                                storage_connector=s3_conn,
                                version=1)
td.save(query)

Here is the error I got:

py4j.protocol.Py4JJavaError: An error occurred while calling o148.select.
: org.apache.spark.sql.AnalysisException: Reference 'account' is ambiguous, could be: fg1.account, fg0.account.;

Based on the join you are performing, the value of account should be the same in both payment and account_fg, so this code is going to work:

import hsfs
connection = hsfs.connection()
fs = connection.get_feature_store()
payment = fs.get_on_demand_feature_group(name="paymenttype_snowflake", version=1)
account_fg_meta = fs.get_feature_group(name="account_fg",version=1)
query = payment.select_all().join(account_fg_meta.select(['id']), on='account')
s3_conn = fs.get_storage_connector("S3")
td = fs.create_training_dataset(name="abc_model",
                                description="Dataset to train the payment model",
                                data_format="csv",
                                storage_connector=s3_conn,
                                version=1)
td.save(query)

(I removed the account feature from the account_fg_meta select list.)