I work as an SRE for the Wikimedia Foundation; we are building our own ML pipeline and are very interested in evaluating the Hops Feature Store! First of all, thanks a lot for creating a fully open-source Feature Store, it is really a great effort (and we hope to contribute to it if we choose to use Hops!).
I have some doubts about its integration with existing Hadoop / Hive / Spark services, plus its hardware requirements, so I'd be really grateful for some info or pointers to documentation.
We run an on-premises, bare-metal Hadoop cluster (Apache Bigtop 1.5: Hadoop 2.10.1 + Hive 2.3.6 + Spark 2.4.4), and we are building a Kubeflow cluster to train models on (running on Kubernetes, also on-premises). Ideally we'd leverage our existing tools as much as possible for the offline storage, but after reading Spark - Hopsworks Documentation I have some doubts:
- Does the Hops Feature Store provide its own Hive Metastore service (plus Hudi libs etc…), or can it re-use an existing one? Given the client-libs download step, I suspect the former, but I'd like confirmation.
- Can we use Spark running on our Hadoop cluster as the compute engine when creating, ingesting, etc. new feature datasets into the offline storage? (See the sketch after this list for what we have in mind.)
- Can we use our own HDFS cluster/filesystem as the target storage for Hive external tables? (This of course depends on the answer to the first question.)
- Last but not least, given the above, what are the hardware requirements for running the Feature Store?
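To make the second question concrete, here is a minimal sketch of what we have in mind, assuming the hsfs Python client can be used from a Spark job running on our existing YARN cluster (the host, project, table, and feature-group names below are placeholders, not real values):

```python
from pyspark.sql import SparkSession
import hsfs

# Spark session on our existing Hadoop/YARN cluster
# (assumption: this deployment mode is supported).
spark = SparkSession.builder.getOrCreate()

# Connect to the Hopsworks Feature Store from the Spark driver.
connection = hsfs.connection(
    host="hopsworks.example.org",  # placeholder Hopsworks endpoint
    project="wmf_ml",              # placeholder project name
    api_key_value="<api-key>",     # elided credential
)
fs = connection.get_feature_store()

# A Spark DataFrame computed on our cluster, e.g. from our Hive warehouse
# (placeholder table name).
df = spark.table("our_db.our_source_table")

# Register it as a feature group in the offline store.
fg = fs.create_feature_group(
    name="example_features",  # placeholder feature-group name
    version=1,
    primary_key=["id"],
    description="Example features ingested from our Hadoop cluster",
)
fg.save(df)
```

The open question for us is whether the offline write at the end must go to Hopsworks' own HopsFS/Hive, or whether it could target our existing HDFS and Metastore instead.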
Thanks a lot in advance,