I've noticed that Hopsworks can count the number of files in a Dataset very quickly, and I'd like to know how this is done.
If I use the Datasets API to do the same thing, I have to traverse every folder in the Dataset recursively.
HopsFS supports quotas on directories and files. Each quota-enabled directory knows how many files are in its subtree: the count is stored as metadata and updated asynchronously as file operations are performed in that subtree.
We don’t expose that in the Datasets API. You can get it by querying the HopsFS directory using the Java API. It’s not well documented - here is the equivalent in HDFS: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/QuotaUsage.html#getFileAndDirectoryCount--
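As a rough sketch of what that looks like against the Hadoop `FileSystem` API (the dataset path below is hypothetical, and this assumes your client is configured to talk to HopsFS the same way it would talk to HDFS):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.QuotaUsage;

public class DatasetFileCount {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS etc. from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical dataset path - substitute your own project/dataset
        Path dataset = new Path("/Projects/myproject/mydataset");

        // Reads the precomputed count from the directory's quota metadata,
        // so no recursive traversal of the subtree is needed
        QuotaUsage usage = fs.getQuotaUsage(dataset);
        System.out.println("Files + directories: "
                + usage.getFileAndDirectoryCount());
    }
}
```

Note that `getFileAndDirectoryCount()` returns the combined number of files and directories in the subtree; if you need files only, `fs.getContentSummary(dataset).getFileCount()` gives that separately, though `ContentSummary` may be computed by traversal rather than read from quota metadata.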