Python-based HDFS tutorial

riseresearch · January 30, 2020, 8:27am

Hi
I am looking for a tutorial on Python-based access to Hopsworks HDFS. I have a QA application that requires uploading data (mainly images and clear text) while running in the client environment, i.e. not on the Hopsworks platform. I found a note on data uploading in the Hopsworks 1.3-snap docs; however, that page only covers manual uploading of data.

Theo · January 30, 2020, 9:43am

Hi,

There is a Java client available at https://github.com/hopshadoop/hopsworks-cli for uploading files to Hopsworks, but the Python implementation is scheduled for the next release, 1.3.

A Python API for interacting with Hopsworks when running jobs and notebooks is available here: http://hops-py.logicalclocks.com/

and, when running on Databricks or SageMaker, here: http://hopsworks-cloud-sdk.logicalclocks.com/

riseresearch · January 30, 2020, 10:10am

Thank you for your prompt response.
Two questions:

1. Where can I find the release schedule for Hopsworks, and when is 1.3 due?
2. Would you say it is a viable (albeit temporary) solution to call the Java CLI through a Python system call?

jimdowling · January 30, 2020, 10:03pm

Hi. If your data is less than 100 GB in size, I would recommend zipping it up into a single file and then using the UI to upload it to Hopsworks. The UI uses a JavaScript library that uploads the file in checksummed chunks. The Java API can also use that same checksummed-chunk protocol, but it’s not available in Python yet. Inside a notebook, you can always call ‘wget’ to download a file to the notebook’s local disk and then copy that data into HDFS. However, make sure your project has enough disk quota to store the data. By default, I think projects have only 200 GB.

If your data is larger than that, but you can zip it up into files of at most 100 GB each, then keep using the UI.
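A minimal sketch of the zipping step described above, in case it helps: it packs the files under a directory into several zip archives, starting a new archive whenever the next file would push the batch past a size cap. The function name `zip_in_batches` and its parameters are made up for illustration. The cap is checked against the uncompressed input size, so the resulting archives are only approximately bounded, and the output directory is assumed to lie outside the source directory.

```python
import os
import zipfile


def zip_in_batches(src_dir, out_dir, max_bytes, prefix="batch"):
    """Pack files under src_dir into zip archives of roughly max_bytes each.

    The limit applies to the summed uncompressed file sizes. A single file
    larger than max_bytes still ends up in an archive of its own.
    Returns the list of archive paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)

    # Collect (path, size) pairs in a stable, sorted order.
    files = []
    for root, dirs, names in os.walk(src_dir):
        dirs.sort()
        for name in sorted(names):
            path = os.path.join(root, name)
            files.append((path, os.path.getsize(path)))

    # Greedily group files into batches that stay under the cap.
    batches, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)

    # Write one archive per batch, keeping paths relative to src_dir.
    archives = []
    for i, batch in enumerate(batches):
        archive = os.path.join(out_dir, f"{prefix}_{i:04d}.zip")
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in batch:
                zf.write(path, arcname=os.path.relpath(path, src_dir))
        archives.append(archive)
    return archives
```

For the 100 GB limit mentioned above, the call would be something like zip_in_batches("images", "archives", max_bytes=100 * 1024**3), where both directory names are placeholders.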

riseresearch · January 30, 2020, 10:21pm

Hi Jim and thank you for replying.
I am trying to integrate Hopsworks into a live QA process where several cameras image a production line using an Intel NUC. That NUC is then set to send these images to the Hopsworks cloud platform which, right now, is a local data lake. So there is no “data” to manually upload. It is all taken in “real time” and then sent to the cloud for storage and some light post processing. There are also some ML bits thrown into the mix for good measure, however that stuff is not relevant to the storing of image data on Hopsworks.

Your thoughts?

Theo · February 4, 2020, 3:36pm

1. The release cycle is ~6 weeks, so Hopsworks 1.3 should be out by mid to end of March. The road map, in the form of JIRAs, is available here.
2. Calling the Java CLI from Python might work, although I’m not sure of the status of the Java CLI client, as it hasn’t been updated for a while. Let me know if you run into any issues.
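For the "Python system call" route, a minimal sketch could look like the following. It is a generic wrapper around `subprocess.run` and says nothing about the actual command-line syntax of the Java CLI: the helper name `run_cli` is made up, and the jar name and arguments in the commented-out call are placeholders to be replaced with whatever the CLI's README specifies.

```python
import subprocess
import sys


def run_cli(command, timeout_s=300):
    """Run an external command and return its stdout as text.

    Raises RuntimeError if the command exits with a non-zero status,
    and subprocess.TimeoutExpired if it runs longer than timeout_s.
    """
    result = subprocess.run(
        command,              # list of program + arguments, no shell involved
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output to str
        timeout=timeout_s,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"command failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


# Hypothetical invocation; jar name and arguments are placeholders,
# not the real CLI syntax:
# run_cli(["java", "-jar", "hopsworks-cli.jar", "<arguments from the README>"])
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces, which tends to matter on Windows clients.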

riseresearch · February 26, 2020, 9:29am

Hi Theo, and thank you for your reply. I am struggling to get file (.jpg image) uploading to work with the hops.ai system. My client system is Windows 10 with the NetBeans IDE and the latest JDK, 13.0.2.
During the initial compile of the Java client I get the Maven-related error below:

Some problems were encountered while processing the POMs:
[ERROR] Unresolveable build extension: Plugin org.sonatype.plugins:nexus-staging-maven-plugin:1.6.7 or one of its dependencies could not be resolved: The following artifacts could not be resolved: org.sonatype.plugins:nexus-staging-maven-plugin:jar:1.6.7, org.codehaus.plexus:plexus-utils:jar:1.1: Cannot access central (https://repo.maven.apache.org/maven2) in offline mode and the artifact org.sonatype.plugins:nexus-staging-maven-plugin:jar:1.6.7 has not been downloaded from it before.

This was solved by running Maven manually from the command line, outside of NetBeans. That manual compile seemed successful apart from Javadoc generation, which failed.

However, I am now stuck on the final Maven compile, where I get the error “(No POM) 1” even though a POM is present.

Are you able to provide some guidance here? Do you know whether a binary release of the Hopsworks Java CLI exists?

Thank you.

Theo · March 2, 2020, 9:05am

Hi @riseresearch

Looks like a Maven/Java version mismatch. It builds with Maven 3.5.4 and OpenJDK version “1.8.0_242”.
I uploaded it here for the moment so you can access it.

Theo · April 20, 2020, 10:21am

Hi @riseresearch

There is now a Python API for uploading files to a Hopsworks dataset. It will be in the next release of the hops Python library, but you can already use it by installing it from GitHub:

pip install git+https://github.com/logicalclocks/hops-util-py.git@master

You would need to create an API key in the project you plan to upload to: https://hopsworks.readthedocs.io/en/latest/user_guide/hopsworks/apiKeys.html

and then you can use the API like this:

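from hops import project, dataset
# Connect to project "MyProject" on host "hopsworks", port 443; api_key is the path to the API key file
project.connect("MyProject", "hopsworks", 443, api_key="api_key_path")
# Upload the local file into the dataset "MyDataset"
dataset.upload("file.txt", "MyDataset")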
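For the camera use case described earlier in the thread, one way this upload call might be wired into a polling loop is sketched below. This is an untested illustration, not part of the hops library: the helper `upload_new_files`, the `captures` directory and the 5-second interval are all assumptions, and the uploader is passed in as a plain callable so the loop itself does not depend on Hopsworks.

```python
import os


def upload_new_files(watch_dir, upload, seen, suffixes=(".jpg", ".jpeg", ".png")):
    """Upload image files in watch_dir that have not been handled yet.

    upload: any callable taking a local file path.
    seen:   set of file names already uploaded; updated in place.
    Returns the names uploaded during this call. Files whose upload raises
    are left out of `seen`, so the next call retries them.
    """
    uploaded = []
    for name in sorted(os.listdir(watch_dir)):
        if name in seen or not name.lower().endswith(suffixes):
            continue
        path = os.path.join(watch_dir, name)
        try:
            upload(path)
        except Exception as exc:
            print(f"upload failed for {name}, will retry: {exc}")
            continue
        seen.add(name)
        uploaded.append(name)
    return uploaded


# Possible wiring with the API shown above (assumed, not verified):
# import time
# from hops import project, dataset
# project.connect("MyProject", "hopsworks", 443, api_key="api_key_path")
# seen = set()
# while True:
#     upload_new_files("captures", lambda p: dataset.upload(p, "MyDataset"), seen)
#     time.sleep(5)
```

If the cameras write directly into the watched directory, a file may be picked up while it is still being written; writing to a temporary name and renaming once complete is a common way around that.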