On-demand vs Cached Feature Group

Arumugaguru_M · March 11, 2021, 10:36am

Hi,

I would like to understand the difference between on-demand and cached feature groups. Please correct me if my understanding is wrong:

On-demand: a query is built to fetch the data from an external data source such as MySQL
Cached: the data is stored directly in the feature group

If my understanding is correct, I have some questions about them.

If I create a training dataset from an on-demand feature group, will the data be persisted in the training dataset, or will it be pulled every time I read the training dataset?

Also, I cannot find any option on a cached feature group to pull data from a MySQL connection, so I assume only on-demand feature groups can be used for external data sources.

Our Actual Test Case:

We are evaluating the Hopsworks Feature Store as part of our FW for the Data Science team to store their derived features. In one of the trainings from the Hopsworks team, we heard that even if the data changes in an external source such as MySQL, it will be automatically updated in the feature group/training dataset.

But when testing this scenario I got confused: the on-demand feature group only fetches data when we call it, so how would the data be automatically fetched and stored?

If someone could shed some light on this, it would be very helpful.

Thanks,
Guru

Fabio · March 12, 2021, 8:50am

Hi @Arumugaguru_M,

Yes, your understanding is correct. With cached feature groups, the feature data is materialized on Hopsworks. You can use Hudi/time travel on them, and you can also make them available online to be served with low latency.
With on-demand feature groups, on the other hand, the data lives in an external data source and Hopsworks tracks only the metadata. When you use an on-demand feature group to build a training dataset, Hopsworks will query the external source to fetch the data.

If you create a training dataset, then the data will be pulled from the external source, joined with features stored in other feature groups (not mandatory but likely) and then saved in a ML framework friendly format (e.g. TFRecords, CSV, …). When you use the training dataset you’ll be reading the data from the training dataset, not the original on-demand data source.
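To make the snapshot behaviour concrete, here is a minimal toy sketch. It does not use the Hopsworks API: an in-memory SQLite database stands in for the external MySQL source, and `read_on_demand` and `create_training_dataset` are hypothetical helper names invented for illustration.

```python
import sqlite3

# Stand-in for the external MySQL database (assumption: SQLite is used only
# so the sketch runs anywhere; it is not part of Hopsworks).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, spend REAL)")
source.executemany("INSERT INTO customers VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# "On-demand feature group": only the query (metadata) is kept, no rows.
ON_DEMAND_QUERY = "SELECT id, spend FROM customers ORDER BY id"


def read_on_demand(conn):
    """Run the stored query against the external source at call time."""
    return conn.execute(ON_DEMAND_QUERY).fetchall()


def create_training_dataset(conn):
    """Materialize the query result once; later reads use this copy."""
    return list(read_on_demand(conn))


training_dataset = create_training_dataset(source)

# The external source changes after the training dataset was created.
source.execute("UPDATE customers SET spend = 99.0 WHERE id = 1")

# A fresh on-demand read sees the change; the training dataset does not.
fresh_read = read_on_demand(source)
```

In this sketch `training_dataset` keeps the values from creation time, while `fresh_read` reflects the update in the source.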

The training dataset itself can be stored in Hopsworks or on S3 or on ADLS.

We don’t have a MySQL connection option for cached feature groups. What I would do is define an on-demand feature group over the original table. Those would be your original features. You then use those to do some more feature engineering/aggregations and create derived features, which you can save as cached feature groups in the feature store.

If you only use on-demand feature groups then yes, every time you use the on-demand feature group to create a training dataset, you’ll be using the most recent values of the data in the external source. If, on the other hand, you do additional data engineering and save the result in cached feature groups (as explained above), then you need to define a periodic job that pulls the data, computes the new features and saves them.
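As a rough illustration of that pattern, the toy sketch below pulls raw rows from the source, aggregates them into derived features and upserts them into a cache. Everything here is an assumption made for the example: SQLite stands in for MySQL, a plain dict stands in for the cached feature group, and `feature_job` is a hypothetical name, not a Hopsworks function.

```python
import sqlite3

# Stand-in for the external source holding the raw (original) features.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 5.0), (1, 7.0), (2, 3.0)]
)

# Hypothetical cached feature group: primary key -> derived feature row.
cached_feature_group = {}


def feature_job(conn, cache):
    """One job run: pull raw rows, aggregate them, upsert into the cache."""
    rows = conn.execute(
        "SELECT customer_id, COUNT(*), SUM(amount) "
        "FROM orders GROUP BY customer_id"
    ).fetchall()
    for customer_id, n_orders, total in rows:
        cache[customer_id] = {"n_orders": n_orders, "total_amount": total}
    return len(rows)


feature_job(source, cached_feature_group)

# New data arrives in the source; the cache is stale until the job reruns.
source.execute("INSERT INTO orders VALUES (2, 4.0)")
stale = dict(cached_feature_group[2])

feature_job(source, cached_feature_group)
refreshed = cached_feature_group[2]
```

The point of the sketch is that `stale` still holds the old aggregate: the cached values only change when the job runs again.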

Let me know if that answers your questions.

–
Fabio

Arumugaguru_M · March 12, 2021, 11:57am

Hi Fabio,

Thanks a lot for the detailed answer. I need one clarification:

As you suggested, we will have an on-demand feature group, apply transformations to it and cache the data into an offline feature group with time travel enabled using Hudi.

As I understand it, the newly created offline feature group will have a Spark job associated with it, and we have seen that there is an option to schedule those Spark jobs in the Jobs UI. Will this pull the records from the on-demand feature group and update them with the time travel option?

If that is not possible, do we have to create an Airflow job to capture the data from the source and do the transformation?

In short: to update the offline feature group automatically, is an Airflow job/script required, or is this built into the offline feature group and its Spark job?

Thanks,
Guru

Fabio · March 14, 2021, 9:50pm

You can set up a Job in Hopsworks and schedule it using the Jobs UI, or by creating an Airflow DAG. We provide an Airflow operator that allows you to launch Jobs in Hopsworks from Airflow. So, if you decide to go the Airflow route, you’ll just need a single stage that triggers a Job in Hopsworks; you don’t have to do any data manipulation in Airflow itself. There is also a UI wizard that makes it easier to generate Airflow DAGs that trigger Hopsworks Jobs.

You can also develop the transformations/feature engineering in a Jupyter notebook and then convert that notebook into a Job to be scheduled as described above.

The incremental pulling comes from scheduling a Job; the feature store doesn’t do that automatically.
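One common way to write such a job is to keep a watermark between runs, so each scheduled run pulls only the rows it has not seen yet. The sketch below is a toy model under stated assumptions, not Hopsworks code: SQLite stands in for the external source, a list stands in for the cached feature group, the auto-incrementing `id` column is assumed to be a usable watermark, and `incremental_job` is a hypothetical name. The three calls simulate three scheduled runs.

```python
import sqlite3

# Stand-in for the external source; ids are assigned in insertion order.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value REAL)")

feature_group = []        # hypothetical cached feature group (append-only)
state = {"last_id": 0}    # watermark that must persist between job runs


def incremental_job(conn, sink, state):
    """Pull only rows newer than the watermark, then advance it."""
    rows = conn.execute(
        "SELECT id, value FROM events WHERE id > ? ORDER BY id",
        (state["last_id"],),
    ).fetchall()
    sink.extend(rows)
    if rows:
        state["last_id"] = rows[-1][0]
    return len(rows)


# First scheduled run: two rows are available.
source.executemany("INSERT INTO events (value) VALUES (?)", [(1.5,), (2.5,)])
first_run = incremental_job(source, feature_group, state)

# One more row arrives before the second run; nothing new before the third.
source.execute("INSERT INTO events (value) VALUES (3.5)")
second_run = incremental_job(source, feature_group, state)
third_run = incremental_job(source, feature_group, state)
```

In a real setup the watermark would have to be stored somewhere durable, and the scheduling itself would come from the Jobs UI or Airflow as described above.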

Let me know if that answers your question.

–
Fabio

Arumugaguru_M · March 16, 2021, 4:44am

Thanks, Fabio. I was able to create a job and load the data from an external source to Feature Store.