I've noticed that Hopsworks can count the number of files in a Dataset very quickly, and I'd like to know how this is done.
If I use the Datasets API to do the same thing, I have to traverse every folder in the Dataset recursively.
HopsFS supports quotas on directories and files. Each quota-enabled directory knows how many files are in its subtree: the count is stored as metadata and updated asynchronously as file operations are performed in that subtree.
We don’t expose that in the Datasets API. You can get it by querying the HopsFS directory using the Java API. It’s not well documented - here is the equivalent in HDFS: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/QuotaUsage.html#getFileAndDirectoryCount--
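As a rough sketch of what that looks like against the Hadoop `FileSystem` API (the dataset path below is hypothetical, and this assumes your client is configured to talk to HopsFS the same way it would talk to HDFS):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.QuotaUsage;

public class DatasetFileCount {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS etc. from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical dataset path - substitute your own project/dataset
        Path dataset = new Path("/Projects/myproject/mydataset");

        // Reads the precomputed count from the directory's quota metadata,
        // so no recursive traversal of the subtree is needed
        QuotaUsage usage = fs.getQuotaUsage(dataset);
        System.out.println("Files + directories: "
                + usage.getFileAndDirectoryCount());
    }
}
```

Note that `getFileAndDirectoryCount()` returns the combined number of files and directories in the subtree; if you need files only, `fs.getContentSummary(dataset).getFileCount()` gives that separately, though `ContentSummary` may be computed by traversal rather than read from quota metadata.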