Hi, I am considering Hopsworks as a feature store solution for an ML pipeline that currently lives in GCP.
I've read through the installation guides for the GCP and on-premise installs, and was wondering if there's a succinct summary of the benefits of each. My understanding is that the managed installation would require less maintenance, but does an on-premise install offer any benefits from a data privacy/security perspective (if the ML pipeline contains very sensitive data)?
The managed version's main features are cluster management (creating, starting, stopping, resizing, and terminating clusters), compute autoscaling, backup/restore, upgrades, and organization management, all available through the web UI or its REST APIs.
In either case, the data is stored and processed entirely inside your own cloud account, and we don't gain access to it. To lock down the cluster further, you can disable our user management and manage cluster access entirely yourself: Cluster Creation - Hopsworks Documentation. You can also limit the permissions as far as possible, though this doesn't change access to the data: Limiting Permissions - Hopsworks Documentation.
Unlike an on-premise install, the managed version reports usage statistics back to us, which are used for cluster sizing and billing. This requires outgoing network connectivity to our API. If required, outgoing traffic can be restricted to a specific endpoint.
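
To illustrate what restricting outgoing traffic to a single endpoint means in practice, here is a minimal Python sketch of a deny-by-default egress allowlist. It is not Hopsworks code and not a GCP firewall configuration: the host `api.example.invalid` and the function `egress_allowed` are hypothetical placeholders, and the real endpoint should be taken from the Hopsworks documentation.

```python
from urllib.parse import urlsplit

# Hypothetical placeholder: replace with the actual API host and port
# named in the Hopsworks documentation.
ALLOWED_EGRESS = {("api.example.invalid", 443)}


def egress_allowed(url: str, allowed=ALLOWED_EGRESS) -> bool:
    """Return True only if the URL targets an allowlisted (host, port) over HTTPS."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        # Anything other than HTTPS is denied outright.
        return False
    # urlsplit leaves the port as None when the URL does not specify one,
    # so fall back to the HTTPS default.
    port = parts.port or 443
    return (parts.hostname, port) in allowed


if __name__ == "__main__":
    for url in (
        "https://api.example.invalid/usage",
        "https://other.example.invalid/",
        "http://api.example.invalid/usage",
    ):
        print(url, egress_allowed(url))
```

In a real deployment the same rule would live in the network layer (for example, a low-priority deny-all egress rule plus a higher-priority allow rule for the one endpoint) rather than in application code.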