Does Hopsworks provide features analogous to a Data Catalogue, or integration with third-party Data Catalogues?

therapon · February 15, 2022, 9:19pm

Hi,

I’ve tried to figure out from the documentation, examples, Google searches and tutorials whether Hopsworks provides features similar to typical Data Catalogue software. The features I have in mind for a Data Catalogue component include:

data discovery
data curation
data governance (lineage, access control, data audit)

I can infer from the Hopsworks documentation that some are covered to an extent (discoverability, maybe curation), but I was not able to easily find out whether lineage, governance, auditing etc. are also covered.

Is something along the lines of a Data Catalogue on the roadmap? Are integrations with third-party Data Catalogue systems possible or available?

Thanks in advance for your help.

therapon · February 15, 2022, 9:23pm

I think I have partially answered my own question:
https://hopsworks.readthedocs.io/en/latest/hopsml/provenance.html

I’d still be interested in answers regarding integration with third-party Data Catalogue systems.

Thanks

Alex · February 16, 2022, 9:05am

Hi @therapon,

Currently we do not have integrations with third-party Data Catalogues. We have an internal CDC (Change Data Capture) mechanism that is pluggable, so you could export/import data from other Data Catalogues/Metastores. However, this mechanism is not currently exposed.

We already have all the mechanisms you mentioned and, in addition, our Metastore is integrated with our file system, providing eventual strong consistency. That is, when files on the file system are created/deleted, the corresponding metadata is automatically created/deleted as well.

Data Discovery:
We provide full-text search over title, description and custom metadata through Elasticsearch (soon to switch to OpenSearch). Our platform is multi-tenant, and this is reflected in the search mechanism as well. You can decide whether your metadata and/or data should be discoverable/accessible. You can thus make metadata public for search while the data remains private and only becomes available on approval from the data owner.
https://hopsworks.readthedocs.io/en/latest/user_guide/hopsworks/search.html?highlight=search

Custom metadata is based on keywords and schematised tags.
https://hopsworks.readthedocs.io/en/latest/user_guide/hopsworks/tags.html?highlight=schematized%20tags
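
To make the idea of a schematised tag concrete, here is a minimal sketch in plain Python. It is not the Hopsworks API: the schema layout, the `validate_tag` helper and all field names are hypothetical. It only illustrates a tag value being checked against a declared schema before it would be attached as searchable metadata.

```python
# Hypothetical sketch: a "schematised tag" is a metadata value that must
# conform to a declared schema. Names below are illustrative only.

# Assumed schema: required field name -> expected Python type.
DATA_OWNER_SCHEMA = {"team": str, "contains_pii": bool, "retention_days": int}


def validate_tag(schema, value):
    """Return a list of problems; an empty list means the tag is valid."""
    problems = []
    for field, expected_type in schema.items():
        if field not in value:
            problems.append(f"missing field: {field}")
        elif type(value[field]) is not expected_type:
            # Exact type check, so that True/False is not accepted as an int.
            problems.append(f"wrong type for field: {field}")
    for field in value:
        if field not in schema:
            problems.append(f"unknown field: {field}")
    return problems


good_tag = {"team": "risk", "contains_pii": False, "retention_days": 90}
bad_tag = {"team": "risk", "retention_days": "ninety"}

print(validate_tag(DATA_OWNER_SCHEMA, good_tag))
print(validate_tag(DATA_OWNER_SCHEMA, bad_tag))
```

A free-form keyword, by contrast, would need no schema at all; the schema is what lets a tag be searched and filtered on reliably.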

Data curation is enabled through Spark data engineering as well as data validation rules.

Data Governance.
Lineage is provided for the main machine learning abstractions: feature groups (on-demand/cached), training datasets, experiments and models. You can thus follow which user/application created each of these and which inputs it used.
https://hopsworks.readthedocs.io/en/latest/hopsml/provenance.html?highlight=provenance
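
As an illustration of what following lineage across these abstractions means, here is a small sketch in plain Python. It does not use the Hopsworks provenance API; the artifact names, the `LINEAGE` mapping and the `upstream` helper are all hypothetical.

```python
# Hypothetical lineage graph: each artifact maps to who created it and
# which inputs it used. All names are made up for illustration.
LINEAGE = {
    "model:churn_v1": {
        "created_by": "alice",
        "inputs": ["experiment:churn_run_7"],
    },
    "experiment:churn_run_7": {
        "created_by": "alice",
        "inputs": ["training_dataset:churn_td_1"],
    },
    "training_dataset:churn_td_1": {
        "created_by": "bob",
        "inputs": ["feature_group:customers_1", "feature_group:payments_1"],
    },
    "feature_group:customers_1": {"created_by": "bob", "inputs": []},
    "feature_group:payments_1": {"created_by": "carol", "inputs": []},
}


def upstream(artifact, lineage):
    """Return every artifact the given one depends on, nearest first."""
    seen = []
    queue = list(lineage[artifact]["inputs"])
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        queue.extend(lineage[current]["inputs"])
    return seen


# Walk back from a model to the feature groups it was ultimately built from.
for name in upstream("model:churn_v1", LINEAGE):
    print(name, "created by", LINEAGE[name]["created_by"])
```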

Access control is enabled by our RBAC (role-based access control), which builds on HopsFS (our file system) ACLs (access control lists).

https://hopsworks.readthedocs.io/en/latest/user_guide/hopsworks/projectMembers.html?highlight=row%20based%20access%20control
https://hopsworks.readthedocs.io/en/latest/user_guide/hopsworks/dataSetShare.html?highlight=row%20based%20access%20control
https://hopsworks.readthedocs.io/en/latest/user_guide/hopsfs/acls.html?highlight=row%20based%20access%20control

Data auditing. We have an audit log providing CRUD (create/read/update/delete) information for the featurestore data (available to users through the UI). We also log REST endpoint access (available as logs).

We are currently migrating and improving the documentation, moving it from the old location to the new one. I will get back to you with another message once the pages relevant to you have been moved to the new documentation.

I hope this was helpful. Let me know if you have further questions.

Regards,
Alex

therapon · February 16, 2022, 5:28pm

Thanks @Alex

Thank you for the links and the information; I’ll take the time to read through them. A quick follow-up:

"We have an internal CDC (Change Data Capture) mechanism that is pluggable, so you could export/import data from other Data Catalogues/Metastores. However this mechanism is not currently exposed."

Are there plans to expose this mechanism in the future?

Thanks,
Theo

Alex · February 17, 2022, 8:48am

Hi Theo,

We are currently looking at exporting Hopsworks metadata to external Metastores, but we don’t have an actual version set for this.

Do you have a particular Data Catalogue/Metastore in mind? And what would the use case look like? Do you want to import metadata from the external Metastore into Hopsworks or the other way around?

Regards,
Alex

therapon · February 18, 2022, 5:17am

Hi Alex,

I am looking into which feature store to use, a colleague is looking into data catalogues, and we want to figure out how to hook the two together. We are at the stage of exploring options.

The use case is to be able, from the Hopsworks feature store, to find, use and generate features from data found in the data catalogue. We are not set on a particular data catalogue, but https://www.amundsen.io/ seems decent for an open-source one.

The overall context is that we work in a closed system with lots of data coming from multiple data storage solutions, and we would like the option of making this data available to the feature store without moving it. Some of the data is already in feature form (signals); the rest is not.

As a general question, for the open source version of Hopsworks, is there a public roadmap and/or a process to add to the roadmap?

Thanks.

–
Theo

Alex · February 21, 2022, 1:56pm

Hi Theo,

You don’t need to move all of your data into our featurestore, even in the current version.
There are a number of existing storage connectors, which you can use together with on-demand (external) feature groups, so that the data is only pulled when you need to process it and otherwise stays on the external storage.
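
The on-demand idea can be sketched without Hopsworks at all. The snippet below is not the Hopsworks API: `OnDemandFeatureGroup` is a hypothetical class, and an in-memory SQLite database stands in for the external storage that a storage connector would reach. It only shows that registering the feature group stores a query, and that data is read from the source when `read()` is called.

```python
import sqlite3

# Stand-in for an external store; in a real setup this would sit behind a
# storage connector. The table and column names are made up.
external_db = sqlite3.connect(":memory:")
external_db.execute("CREATE TABLE signals (customer_id INTEGER, avg_spend REAL)")
external_db.executemany(
    "INSERT INTO signals VALUES (?, ?)", [(1, 12.5), (2, 40.0)]
)


class OnDemandFeatureGroup:
    """Hypothetical class: holds a query, not a copy of the data."""

    def __init__(self, name, connection, query):
        self.name = name
        self.connection = connection
        self.query = query

    def read(self):
        # Data is pulled from the external store only at this point.
        return self.connection.execute(self.query).fetchall()


fg = OnDemandFeatureGroup(
    "customer_signals",
    external_db,
    "SELECT customer_id, avg_spend FROM signals ORDER BY customer_id",
)

# A row added after registration is still visible, because nothing was copied.
external_db.execute("INSERT INTO signals VALUES (3, 7.25)")
print(fg.read())
```

The trade-off of this design is that every read goes to the external system, so read latency and availability depend on that system rather than on the feature store.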

Regards,
Alex

therapon · February 28, 2022, 6:26pm

Hi Alex,

Thanks for the links, I had not seen the on-demand option.