Is it possible to join an on demand feature group with a cached feature group?
How do I join two feature groups with colliding column names that are not part of the join key? I get a lot of ambiguous reference errors.
Hi @Tim,
Yes, it's possible. Once you get the metadata handles for the cached (FeatureStore - Hopsworks Documentation) and on-demand (FeatureStore - Hopsworks Documentation) feature groups, both metadata objects offer the same functionality.
They can be joined together using the join method, and you can select features from them with the same select, select_all and select_except methods (FeatureGroup - Hopsworks Documentation).
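For example (a minimal sketch; the feature group names, versions and join key below are hypothetical placeholders):

import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()

# Metadata handle for a cached feature group (hypothetical name/version)
cached_fg = fs.get_feature_group(name="transactions_fg", version=1)
# Metadata handle for an on-demand feature group (hypothetical name/version)
on_demand_fg = fs.get_on_demand_feature_group(name="customers_snowflake", version=1)

# Both handles expose the same query API, so they can be selected from and joined directly
query = on_demand_fg.select_all().join(cached_fg.select(["amount", "ts"]), on="customer_id")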
Regarding the ambiguous reference errors: that's unfortunately still an issue in your version, and my suggestion is to exclude the colliding feature from the select list on one side of the join. In the new version we are releasing, we have also added a prefix option to the join method to handle these name collisions.
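A minimal sketch of how the prefix option is meant to be used once you are on that version (the handles and key name are the hypothetical ones from the sketch above; check the join signature in the docs for your release):

# Colliding columns coming from the joined feature group get the given prefix
query = on_demand_fg.select_all().join(
    cached_fg.select_all(),
    on="customer_id",
    prefix="cached_",
)
# A colliding column such as "amount" from cached_fg would then surface as "cached_amount"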
–
Fabio
I can't join an on-demand feature group with a cached feature group.
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()

payment = fs.get_on_demand_feature_group(name="paymenttype_snowflake", version=1)
account_fg_meta = fs.get_feature_group(name="account_fg", version=1)

query = payment.select_all().join(account_fg_meta.select(['account', 'id']), on='account')

s3_conn = fs.get_storage_connector("S3")
td = fs.create_training_dataset(name="abc_model",
                                description="Dataset to train the payment model",
                                data_format="csv",
                                storage_connector=s3_conn,
                                version=1)
td.save(query)
Here is the error I got:
py4j.protocol.Py4JJavaError: An error occurred while calling o148.select.
: org.apache.spark.sql.AnalysisException: Reference 'account' is ambiguous, could be: fg1.account, fg0.account.;
Based on the join you are performing, the value of account should be the same in both payment and account_fg.
So this code is going to work:
import hsfs

connection = hsfs.connection()
fs = connection.get_feature_store()

payment = fs.get_on_demand_feature_group(name="paymenttype_snowflake", version=1)
account_fg_meta = fs.get_feature_group(name="account_fg", version=1)

query = payment.select_all().join(account_fg_meta.select(['id']), on='account')

s3_conn = fs.get_storage_connector("S3")
td = fs.create_training_dataset(name="abc_model",
                                description="Dataset to train the payment model",
                                data_format="csv",
                                storage_connector=s3_conn,
                                version=1)
td.save(query)
(I removed the account feature from the account_fg_meta select list.)
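An equivalent way to drop the colliding column is the select_except method mentioned above; this sketch keeps every account_fg feature except account instead of listing the ones to keep:

# Exclude only the colliding join key from the right-hand side of the join
query = payment.select_all().join(
    account_fg_meta.select_except(['account']),
    on='account',
)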