Feature engineering for streams of data?

I’m researching feature stores and feature engineering pipelines. The former is clear in most cases (both Hopsworks and Feast have good enough solutions), but I haven’t seen an explicit implementation of the latter, which is the important use case for my team. We are a small team and don’t want to maintain both an ETL pipeline in one system and a list of features in a feature store. Ideally, the two would be coupled.

In this video, https://youtu.be/0wfxWFaDG9Q (Data Engineering Melbourne Meetup, Jim Dowling, 30th April 2020), @Jim_Dowling says “this [feature engineering] can be done on Databricks or Sagemaker”. I’m not familiar with Databricks, but Sagemaker seems more focused on batch processing of features from S3, which is only useful when training a model, not while serving a model in production.

What is the common feature engineering system for live production systems, and how well would it integrate with Hopsworks?

Hi @jcpsantiago,

You can develop feature engineering pipelines in Hopsworks itself; it provides an end-to-end solution. However, if it is a requirement, you can also use only the feature store from Hopsworks and do the feature engineering on Databricks or Sagemaker.

As for common platforms for feature engineering, it depends on the use case: Spark can be used for batch processing, while Spark Streaming and Flink can be used for online feature engineering. As I mentioned above, Hopsworks supports all of these.
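To make the idea of online feature engineering a bit more concrete, here is a minimal plain-Python sketch of the kind of per-key sliding-window aggregate that a Spark Streaming or Flink job would typically compute. It deliberately uses no Spark, Flink, or Hopsworks API; `Event`, `SlidingWindowFeatures`, and the field names are hypothetical, and it assumes events arrive in timestamp order for each key.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    key: str          # entity id, e.g. a customer id (hypothetical field)
    timestamp: float  # event time in seconds
    amount: float     # raw value to aggregate (hypothetical field)


class SlidingWindowFeatures:
    """Per-key count and mean of `amount` over the last `window_s` seconds."""

    def __init__(self, window_s: float):
        if window_s <= 0:
            raise ValueError("window_s must be positive")
        self.window_s = window_s
        self._events = {}  # key -> deque of (timestamp, amount)

    def update(self, event: Event) -> dict:
        window = self._events.setdefault(event.key, deque())
        window.append((event.timestamp, event.amount))

        # Evict events older than the window (assumes in-order arrival per key).
        cutoff = event.timestamp - self.window_s
        while window[0][0] <= cutoff:
            window.popleft()

        # The event just appended is always inside the window, so count >= 1.
        count = len(window)
        total = sum(amount for _, amount in window)
        return {"key": event.key, "count": count, "mean_amount": total / count}


def compute_features(events, window_s: float):
    """Replay a finite list of events; the same logic serves batch backfills."""
    features = SlidingWindowFeatures(window_s)
    return [features.update(e) for e in events]
```

Because the same `update` logic can be replayed over historical events (for training data) and applied to live events (for serving), a single feature definition can feed both the offline and the online side, which is the coupling the question asks about. A real streaming engine would additionally handle out-of-order events, state checkpointing, and scaling across keys.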

Hopsworks provides both an offline feature store and an online feature store, the latter for low-latency access.