Adding new features (offline Hive) removes all previous data

Hi,

It seems that adding new features (offline Hive) simply removes all existing data. Is there a reason we have to do that? Hive only enforces the schema on read. Would it be possible to just keep the existing data and return empty values for the missing features?

Thanks!

Hi @ericchen,

Currently the APIs don’t provide a way of extending a feature group with more features. We are closing in on a new release of the APIs that will allow it.

Currently, if you call create_featuregroup twice to create a feature group with the same name, the second API call should fail, and you should increment the featuregroup_version parameter instead.
(Here you can find the documentation for the method: http://hops-py.logicalclocks.com/hops.html#hops.featurestore.create_featuregroup)
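
To illustrate, here is a minimal sketch of that flow, assuming the hops-py featurestore module from the docs linked above; the feature group name, dataframe, and columns are hypothetical:

```python
from hops import featurestore

# df: a Spark dataframe holding the feature data (hypothetical example;
# `spark` is the session already available in a Hopsworks notebook).
df = spark.createDataFrame([(1, 0.5), (2, 0.7)], ["game_id", "score"])

# First creation: this registers the feature group as version 1.
featurestore.create_featuregroup(df, "games_features")

# A second create_featuregroup call with the same name and version should
# fail; bump featuregroup_version to create a new version instead.
featurestore.create_featuregroup(df, "games_features", featuregroup_version=2)
```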

We had a bug in an early version where this wasn’t true and the data was removed (as you are experiencing). Are you using https://hopsworks.ai?

–
Fabio

Thanks Fabio!

Yes, I’m on hopsworks.ai and here is the instance URL:
https://e78771d0-848d-11ea-a397-f1d57d468680.aws.hopsworks.ai/hopsworks/#!/

The problem I saw is that when I added a new feature to an existing feature group, it showed that all existing data would be removed, so I had to create a new version instead.

Then I created a new version of the same feature group and added the new feature there, but it seems all existing data was also removed:

No data in version 2

Hi @ericchen,

Yes, creating a new version does not automatically migrate the data; however, you can still read the data from the previous version. That’s by design. The rationale is that a new version means there is a breaking change in the feature group (e.g. you dropped a feature or completely changed the meaning of a feature), so we can’t automatically migrate the feature group data between versions. Users have to write a Spark job that reads from the previous version, applies the transformations, and writes the dataframe to the new version.
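
Here is a minimal sketch of such a migration job, assuming the hops-py get_featuregroup and insert_into_featuregroup helpers; the feature group name and the new column are hypothetical:

```python
from hops import featurestore
from pyspark.sql import functions as F

# Read the data from the previous version of the feature group.
df_v1 = featurestore.get_featuregroup("games_features", featuregroup_version=1)

# Apply the transformation, e.g. backfill the newly added feature with nulls.
df_v2 = df_v1.withColumn("new_feature", F.lit(None).cast("double"))

# Write the transformed dataframe into the (initially empty) new version.
featurestore.insert_into_featuregroup(df_v2, "games_features",
                                      featuregroup_version=2)
```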

From an implementation point of view, a new version means a new Hive table which will be initially empty.

Unfortunately, as I mentioned above, we currently treat “adding a new column” as a breaking change. But that’s going to be fixed with the new APIs.

–
Fabio

But that’s going to be fixed with the new APIs.

That’s exciting!

a new version means a new Hive table which will be initially empty.

Is there a limitation on how many feature sets (Hive tables) can be created in the system? I’m not so familiar with Hive, but I believe all the schemas are stored in some metadata store, and there may be a limitation there.

Thanks!

From a metadata point of view, I don’t think there is a real limitation in terms of the number of feature groups. The metadata for feature groups (the Hive tables) is stored on NDB, which can scale to 100 TB+ in memory and on the order of petabytes for disk storage.

If you work in an AI-hyperscale type of company where you have 100k+ features across several thousand feature groups, then depending on how you structure and partition them and what kind of joins you make, you’ll probably have to spend some time tuning the database and Hive so that the query compilation time is acceptable. But if that’s the case, we are here to help :slight_smile:

–
Fabio

Thanks Fabio! That is really helpful! Our team is currently evaluating the Hopsworks feature store for our needs. We’ll let your team know if we need any help.


We try to keep Slack for customers for now; this is our community support channel. That may change as we grow.
