Feature Store hardware requirements

Hi everybody,

I work as an SRE for the Wikimedia Foundation; we are building our own ML pipeline and we are very interested in evaluating the Hops Feature Store! First of all, thanks a lot for creating a fully open source Feature Store - it is really a great effort (and we hope to contribute to it if we choose to use Hops!).

I have some doubts about its integration with existing Hadoop / Hive / Spark services, plus its hardware requirements, so I’d be really grateful if you could give me some info or pointers to documentation :slight_smile:

We run an on-premises bare-metal Hadoop cluster (Apache Bigtop 1.5, Hadoop 2.10.1 + Hive 2.3.6 + Spark 2.4.4) and we are building a Kubeflow cluster (running on Kubernetes on-premises) to train models on. Ideally we'd leverage existing tools as much as possible for the offline storage, but after reading Spark - Hopsworks Documentation I have some doubts:

  1. Does the Hops Feature Store provide its own Hive Metastore service (plus Hudi libs, etc…), or can it re-use an existing one? Given the client-libs download step, I suspect the former is the answer, but I'd like some confirmation.
  2. Can we leverage Spark running on our Hadoop cluster as computational power when creating/ingesting new feature datasets into the offline storage?
  3. Can we leverage our own HDFS cluster/fs as target storage for Hive external tables? (This of course depends on the answer to 1).
  4. Last but not least, given the above, what hardware would be needed to run the Feature Store?

Thanks a lot in advance,

Regards,

Luca


Hi Luca. Thanks for the interest in Hopsworks Feature Store.
Firstly, you can do your feature engineering in your Spark cluster or in any Python client (like Kubeflow Pipelines). We also support “external tables”, where the data is not stored in the Hopsworks Feature Store but in an external object store or in a table in a JDBC-enabled database. Our documentation is not great on that yet. In answer to your questions:

  1. Hopsworks has its own Hive metastore. It’s a fork of Hive 3.0.5 - we updated it to use TLS for access control, not Kerberos. You can write to it from your Spark cluster - Spark - Hopsworks Documentation
  2. Yes. You can use it to write to both offline and online.
  3. Currently we don’t support Kerberized Hive tables as external tables, for a reason Fabio can explain better than I can - I don’t remember what it is.
  4. You could run the whole feature store on a single server if you don’t do any feature engineering on the server itself. If you do feature engineering in Python and ingest Pandas dataframes, they get uploaded first via the REST API to Hopsworks, and then we run a Spark job on Hopsworks to ingest them (with dynamic executors, I think). If you want to run the online feature store in HA mode, you should have 2 VMs for that. The minimum spec we recommend for a single server is 8 CPUs and 32 GB of RAM, but if you’re going to use it heavily, we recommend 16 CPUs. You should also have enough disk capacity for the data you are going to store on it.
    If hardware budget is not a problem, NVMe disks for the NameNode in HopsFS can be configured to store small files, and you get really nice HopsFS performance - NVMe now in HopsFS - Logical Clocks
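To illustrate the Python workflow in point 4, here is a minimal sketch of the client-side step: engineering features into a Pandas dataframe before it gets uploaded to Hopsworks for ingestion. The dataset and column names are made up for illustration, and the commented-out hsfs calls at the end are only an indicative sketch of the client API, not verified against a specific version:

```python
import pandas as pd

def build_customer_features(transactions: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw transactions into one feature row per customer."""
    return (
        transactions.groupby("customer_id")
        .agg(
            txn_count=("amount", "size"),   # number of transactions
            txn_total=("amount", "sum"),    # total amount spent
            txn_mean=("amount", "mean"),    # average transaction size
        )
        .reset_index()
    )

# Toy raw data standing in for a real source table.
raw = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "amount": [10.0, 20.0, 5.0],
    }
)

features = build_customer_features(raw)
print(features)

# The resulting dataframe would then be inserted into a feature group
# with the hsfs client; these calls are a hedged sketch (hypothetical
# host/project names, API shape from memory):
#
#   import hsfs
#   conn = hsfs.connection(host="hopsworks.example.com", project="demo")
#   fs = conn.get_feature_store()
#   fg = fs.create_feature_group(
#       "customer_features", version=1, primary_key=["customer_id"]
#   )
#   fg.insert(features)  # uploads via REST, ingested by a Spark job
```

As described above, the heavy lifting (the actual ingestion into offline/online storage) happens in a Spark job on the Hopsworks side, so a client like this needs no Spark installation of its own.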

Hi Jim,

thanks a lot for the detailed answer! I’d be curious to know more about the Kerberized external table problem - in my mind I had pictured HDFS as the storage for Parquet files, but it seems that’s not an option.

Luca