Error installing Hopsworks on-premises

smach_h · June 10, 2020, 2:50pm

Hi,

During the installation of Hopsworks, when I ran the script “./hopsworks-installer.sh”, I got an error:

Please enter your email address to continue:
my_email_address
Registering hopsworks instance…
curl: (28) Failed to connect to snurran.sics.se port 8443: Connection timed out

Does this error affect the installation? I pressed Enter to continue, and the log file then showed an error:

$ tail -f installation.log
at javax.validation.Validation$GenericBootstrapImpl.configure(Validation.java:276)
at javax.validation.Validation.buildDefaultValidatorFactory(Validation.java:110)
at io.dropwizard.setup.Bootstrap.<init>(Bootstrap.java:62)
at io.dropwizard.Application.run(Application.java:67)
at se.kth.karamel.webservice.KaramelServiceApplication.main(KaramelServiceApplication.java:233)
Caused by: java.lang.ClassNotFoundException: javax.xml.bind.JAXBException
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
… 13 more

The Karamel interface launched with a connected status, but then it disconnected.
Could you please help me with the installation of Hopsworks? I’m trying to install it on a single machine (on-premises installation).
Thanks in advance.
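
A note on the stack trace above: `ClassNotFoundException: javax.xml.bind.JAXBException` together with `java.base/` frames typically appears when an application built for Java 8 runs on Java 9 or later, where the `javax.xml.bind` classes are no longer available by default (they were removed from the JDK in Java 11). The thread does not confirm that this was the cause here. A minimal sketch for checking which Java major version is on the PATH; `java_major` is a hypothetical helper name, not part of the installer:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce a Java version string to its major version.
# Handles both the legacy scheme (1.8.0_252 -> 8) and the current one (11.0.7 -> 11).
java_major() {
  local v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: drop the leading "1."
  esac
  echo "${v%%[._-]*}"     # keep everything before the first '.', '_' or '-'
}

# Usage (not run here): read the quoted version from `java -version`,
# which prints to stderr, and pass it to the helper.
#   version="$(java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }')"
#   java_major "$version"
```

If this prints 9 or higher, trying the installer with a Java 8 runtime would be one thing to rule out.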

Fabio · June 10, 2020, 10:47pm

Hi @smach_h,

This morning we had some issues with the machines serving the artifacts for the community installation. It’s fixed now.
Can you run the purge:

./hopsworks-installer.sh -i purge -ni

and try again to run the hopsworks-installer.sh script?
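
To tell a server-side outage apart from a local firewall or proxy problem, the timeout can be reproduced outside the installer. The following is a rough sketch, assuming curl is installed. The host and port come from the error message; the `https://` scheme is an assumption, and `explain_curl_exit` and `probe` are hypothetical helper names, not part of hopsworks-installer.sh:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a curl exit code to a short hint.
# Codes 6, 7 and 28 are documented in the EXIT CODES section of `man curl`.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "dns: could not resolve host" ;;
    7)  echo "connect: failed to connect to host" ;;
    28) echo "timeout: check outbound firewall or proxy rules" ;;
    *)  echo "other: see EXIT CODES in man curl" ;;
  esac
}

# Hypothetical helper: try to reach host:port with a 10-second connect timeout.
# The https:// scheme is an assumption; a connect timeout happens before TLS anyway.
probe() {
  local host="$1" port="$2" rc
  curl --silent --output /dev/null --connect-timeout 10 --insecure \
    "https://${host}:${port}/"
  rc=$?
  explain_curl_exit "$rc"
}

# Usage (not run here):
#   probe snurran.sics.se 8443
```

A `timeout` result from the installation machine, while the same probe succeeds from another network, would point at local egress filtering rather than at the server.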

smach_h · June 11, 2020, 11:48am

Hi @Fabio,

Thanks for your answer. I did run the purge first and then re-ran the hopsworks-installer.sh script, but I still get the same error.
To explain in more detail: right after entering my email address I press Enter, and the following message is displayed:
Registering hopsworks instance....

After ~30s I got the same message:

curl: (28) Failed to connect to snurran.sics.se port 8443: Connection timed out

Is it related to yesterday’s issue (the machines serving the artifacts for the community installation)?
I really need to finish the Hopsworks installation so that I can start working with it. I’m available for a quick conference call if you have time. Thanks in advance.

smach_h · June 15, 2020, 12:02pm

Hi @Fabio

I’m back with some good news. After running the purge, many of the errors were resolved and I could continue the installation.
However, I have a new error: in the Karamel interface, the solo.rb step is not installed; it failed:

4. | make solo.rb | FAILED | retry skip log| 7530

In the logs I have:

sudo: no tty present and no askpass program specified

I disabled the password for all users, but the solo.rb step still fails.
Could you please tell me why? Thanks in advance,

Fabio · June 16, 2020, 8:17am

Hi,

There is something wrong with your sudoers configuration. It’s still asking for the password.

Could you please double-check that you can use sudo without having to type the password? Otherwise you can pass your sudo password to the installer script with the --password flag:

./hopsworks-installer.sh [.....] --password [yourpassword]

–
Fabio
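
One way to double-check this from the shell is `sudo -n`, which makes sudo fail instead of prompting when a password would be required. A minimal sketch; `sudo_mode` is a hypothetical helper name, and the sudoers line in the comment is a typical example rather than something taken from this thread:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the current user can run sudo
# without being prompted for a password.
sudo_mode() {
  if ! command -v sudo >/dev/null 2>&1; then
    echo "sudo-missing"
  elif sudo -n true 2>/dev/null; then
    # -n (non-interactive): sudo exits with an error instead of prompting
    echo "passwordless"
  else
    echo "password-required"
  fi
}

sudo_mode

# A typical passwordless entry, added with `sudo visudo` (example only,
# replace <user> with the account that runs the installer):
#   <user> ALL=(ALL) NOPASSWD: ALL
```

If this prints `password-required`, the installer steps that call sudo without a terminal would be expected to fail with the "no tty present" message shown earlier.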

smach_h · June 16, 2020, 8:34am

Hi @Fabio

I already resolved this error but after this I have another one:

clone and vendor hopsworks-chef| FAILED | retry skip log| 34294

The error is the following:

Cloning into 'hopsworks-chef'...
Switched to a new branch '1.3'
The following error occurred while reading the cookbook `kube-hops':
Ridley::Errors::FromFileParserError: Could not parse `/tmp/d20200615-26917-fbhmmv/metadata.rb': wrong number of arguments (given 2, expected 0)
hing 'dela' from https://github.com/logicalclocks/dela-chef.git (at 1.3)

What do you think about this error?
Thanks a lot

smach_h · June 18, 2020, 1:38pm

Hi @Fabio

I know I’m bothering you with my questions, but I’m really blocked at the point where the installation reaches clone and vendor hopsworks-chef. It doesn’t seem to find the URL when cloning, especially the /tree/1.3 folders.
It returns the error that I sent you in my last reply.

Log:

    Cloning into 'hopsworks-chef'...
    Switched to a new branch '1.3'
    The following error occurred while reading the cookbook `kube-hops':
    Ridley::Errors::FromFileParserError: Could not parse `/tmp/d20200615-26917-fbhmmv/metadata.rb': wrong number of arguments (given 2, expected 0)
    hing 'dela' from https://github.com/logicalclocks/dela-chef.git (at 1.3)

Could you please help me ?
Thanks in advance

Fabio · June 19, 2020, 12:57pm

Hi @smach_h,

Cloning and vendoring hopsworks-chef works for us. My suspicion is that you are running a different version of Chef/Berks.

Could you please post the output of the following command:

chef --version

It should look something like this:

vagrant@hopsworks0:~$ chef --version
Chef Development Kit Version: 3.7.23
chef-client version: 14.10.9
delivery version: master (64f556d5ebfd7bac2c0b3cc2c53669688b3ea4b5)
berks version: 7.0.7
kitchen version: 1.24.0
inspec version: 3.4.1

smach_h · June 19, 2020, 2:00pm

Hi @Fabio

Please find attached a screenshot of the output of the command: chef --version. We don’t have the same versions (mine are older).

Thanks

Fabio · June 19, 2020, 6:07pm

I think that’s the reason why it fails to parse the cookbooks.

Could you please try to remove Chef 13 from the system and try again? The hopsworks-installer.sh is going to install the correct version (Chef 14) for our cookbooks.

–
Fabio
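
Before re-running the installer, the Chef major version on the PATH can be checked against the expected one (14, per the message above). A small sketch that parses the `chef-client version:` line of the `chef --version` output shown earlier; `chef_client_major` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `chef --version` output on stdin and print
# the major version taken from the "chef-client version:" line.
chef_client_major() {
  awk -F': ' '/^chef-client version:/ { split($2, parts, "."); print parts[1]; exit }'
}

# Usage (not run here):
#   major="$(chef --version | chef_client_major)"
#   if [ "$major" != "14" ]; then
#     echo "Found Chef major version '${major}', expected 14" >&2
#   fi
```

An empty result means the `chef-client version:` line was not found, for example because the `chef` command prints a different format or is not installed.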

smach_h · June 22, 2020, 8:55am

Thanks @Fabio, the issue is resolved.
However, another error has occurred. Right after resolving the Chef version issue the installation progressed quickly, but it is now blocked at the Tensorflow installation and the platform has switched to paused, although it has already installed many services.

There are many errors, but as far as I understand the most important one is the following:

Detected 8 CPUs online; setting concurrency level to 8.
STDERR: ERROR: An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
---- End output of "bash"  "/tmp/chef-script20200622-18440-1f6hjfy" ----
Ran "bash"  "/tmp/chef-script20200622-18440-1f6hjfy" returned 1

What do you think about this issue?
Thanks

Fabio · June 22, 2020, 9:14am

Can you please send the cluster definition? You can find it in your working directory under cluster-definition/hopsworks-installer-active.yml. Remove the passwords before sending it.

From the error, my guess is that you are installing a GPU node while running the Ubuntu UI at the same time.
During the setup of the GPU node, the Chef recipes install the NVIDIA drivers. To be able to do so, the GPU needs to be idle (no processes running) and the Nouveau drivers need to be blacklisted.

From the error message, it seems that the Nouveau drivers are already blacklisted, but there are some processes running on the GPU(s).
To progress with the installation, I suggest you stop the desktop environment and use the command line to complete the installation.

–
Fabio
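
Whether the GPU is idle can be checked before retrying. The sketch below reads the use count of a kernel module from `lsmod` output; note that the module the NVIDIA installer calls 'nvidia-drm' is listed by `lsmod` as `nvidia_drm`. `module_use_count` is a hypothetical helper name, and the commands in the comments are standard tools rather than part of the Hopsworks installer:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lsmod` output on stdin and print the
# "Used by" count of the given module, or "not-loaded" if it is absent.
module_use_count() {
  awk -v m="$1" '
    $1 == m { print $3; found = 1 }
    END     { if (!found) print "not-loaded" }
  '
}

# Usage (not run here):
#   lsmod | module_use_count nvidia_drm   # non-zero: something still holds the module
#   lsmod | module_use_count nouveau      # "not-loaded" when Nouveau is blacklisted
#
# Listing the processes that hold the NVIDIA device files:
#   sudo lsof /dev/nvidia*
#
# Switching the running system to text mode without changing the boot default:
#   sudo systemctl isolate multi-user.target
```

A use count of 0 for `nvidia_drm` (or `not-loaded`) suggests nothing is holding the driver any more.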

smach_h · June 22, 2020, 9:49am

The cluster definition is:

name: Hops
baremetal:
    username: brooks

cookbooks:
  hopsworks:
    github: logicalclocks/hopsworks-chef
    branch: 1.3

attrs:
  
  install:
    dir: /srv/hops
    cloud: on-premises
    kubernetes: false
    
  cuda:
    accept_nvidia_download_terms: true
  hops:
#    version: "2.8.2.10-EE-RC0"  
    tls:
      enabled: false
    yarn:
      cgroups_strict_resource_usage: 'false'
      vcores: 7
      memory_mbs: 61440
      detect-hardware-capabilities: false
    rmappsecurity:
      actor_class: "org.apache.hadoop.yarn.server.resourcemanager.security.DevHopsworksRMAppSecurityActions"
#  maggy:
#    version: "git+git://github.com/logicalclocks/maggy@master"
  kagent:
    python_conda_versions: 3.6
#  conda:
#    hops-util-py:
#      install-mode: "git"
#      branch: "model_repo_project"
#      repo: "jimdowling"
#  ndb:
#    nvme:
#      disks: "/dev/nvme0n1 /dev/nvme0n1"
#      format: true
#      logfile_size: 100000M
#      undofile_size: 1000M
#    NoOfReplicas: 2
#    DataMemory: 8192
  alertmanager:
    email:
      to: sre@logicalclocks.com
      from: hopsworks@logicalclocks.com
      smtp_host: mail.hello.com
  prometheus:
      retention_time: "8h"
  hopsworks:
#    war_url: http://snurran.sics.se/hops/hopsworks/1.4.0/hopsworks-web-dev.war
#    ca_url: http://snurran.sics.se/hops/hopsworks/1.4.0/hopsworks-ca-dev.war    
#    ear_url: http://snurran.sics.se/hops/hopsworks/1.4.0/hopsworks-ear-dev.ear
    encryption_password: ...........
    master:
      password: ................
    admin:
      user: adminuser
      password: ..............
    https:
      port: 443
    featurestore_online: true
    requests_verify: false
    application_certificate_validity_period: 6d
    kagent_liveness:
      enabled: true
      threshold: 40s
  hive2:
      mysql_password: ..............
  mysql:
      password: ..............      
  elastic:
    opendistro_security:
      jwt:
        exp_ms: 1800000
      audit:
        enable_rest: true
        enable_transport: false
      admin:
        username: admin
        password: ..............
      kibana:
        username: kibana
        password: ..............
      logstash:
        username: logstash
        password: ..............
      epipe:
        username: epipe
        password: ..............
      elastic_exporter:
        username: elasticexporter
        password: ..............
groups:
  metaserver:
    size: 1
    baremetal:
      ip: ..............
    recipes:
      - kagent
      - conda
      - ndb::mgmd
      - ndb::ndbd
      - ndb::mysqld
      - hops::ndb
      - hops::rm
      - hops::nn
      - hops::jhs
      - hadoop_spark::yarn
      - hadoop_spark::historyserver
      - hadoop_spark::certs
      - flink::yarn
      - flink::historyserver
      - elastic
      - livy
      - kzookeeper
      - kkafka
      - epipe
      - hopsworks
      - hopsmonitor
      - hopslog
      - hopslog::_filebeat-spark
      - hopslog::_filebeat-serving
      - hopslog::_filebeat-kagent
      - hopslog::_filebeat-beam
      - hops::dn
      - hops::nm
      - tensorflow
      - hive2
      - hops_airflow
      - hops_airflow::sqoop
      - hopsmonitor::prometheus
      - hopsmonitor::alertmanager
      - hopsmonitor::node_exporter
      - hopsmonitor::purge_telegraf
      - consul::master

What do you mean by "stop the desktop environment and use the CLI to complete the installation"?
I ask because I had already tried this before asking you, by running the command
sudo systemctl set-default multi-user, but it stopped everything. I followed this: (debian - How to unload kernel module 'nvidia-drm'? - Unix & Linux Stack Exchange)

Thanks @Fabio

Fabio · June 22, 2020, 11:11pm

I think the easiest way is for you to connect to the machine you are working with using SSH, stop the Gnome (or Unity) process and do the installation from there.

In this way you don’t have issues related to processes using the GPU.

smach_h · June 23, 2020, 1:35pm

Hi @Fabio

Thanks for your answers and suggestions over this short period.

Regarding your suggestion to connect to the machine over SSH, stop the Gnome (or Unity) process and do the installation from there: I stopped the gdm process, connected to my machine from another one using ssh, and launched the installation. I got the same error when it reached the tensorflow installation.

I then tried skipping the tensorflow installation. The installation continued, but it got blocked at the installation of the Hopsworks service, with a lot of log output. I picked what I think is the most important part:

Installing Cookbook Gems:
Compiling Cookbooks...
/tmp/chef-solo/cookbooks/flink/libraries/inifile.rb:11: warning: already initialized constant IniFile::VERSION
/tmp/chef-solo/cookbooks/kagent/libraries/inifile.rb:11: warning: previous definition of VERSION was here
/tmp/chef-solo/cookbooks/glassfish/resources/archive.rb:86: warning: constant ::Fixnum is deprecated
/tmp/chef-solo/cookbooks/glassfish/resources/connector_connection_pool.rb:39: warning: constant ::Fixnum is deprecated
/tmp/chef-solo/cookbooks/glassfish/resources/jdbc_connection_pool.rb:82: warning: constant ::Fixnum is deprecated
Converging 127 resources

Running handlers:
[2020-06-23T14:43:47+02:00] ERROR: Running exception handlers
Running handlers complete
[2020-06-23T14:43:47+02:00] ERROR: Exception handlers complete
Chef Client failed. 18 resources updated in 20 minutes 12 seconds
[2020-06-23T14:43:47+02:00] FATAL: Stacktrace dumped to /tmp/chef-solo/chef-stacktrace.out
[2020-06-23T14:43:47+02:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2020-06-23T14:43:47+02:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: hopsworks_grants[reload_systemd] (hopsworks::install line 509) had an error: Mixlib::ShellOut::ShellCommandFailed: bash[reload_systemd] (/tmp/chef-solo/cookbooks/hopsworks/providers/grants.rb line 15) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of "bash"  "/tmp/chef-script20200623-1907-1i3w2z3" ----
STDOUT: 
STDERR: Job for glassfish-domain1.service failed because the control process exited with error code.
See "systemctl status glassfish-domain1.service" and "journalctl -xe" for details.
---- End output of "bash"  "/tmp/chef-script20200623-1907-1i3w2z3" ----
Ran "bash"  "/tmp/chef-script20200623-1907-1i3w2z3" returned 1

I think that I will quit the Hopsworks installation.
Thanks
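
The Chef output above only reports that a systemd job failed; the actual reason is in the unit's own logs, as the STDERR line itself suggests. A minimal sketch that pulls the failed unit name out of such a line and shows the follow-up commands; `failed_unit` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read log text on stdin and print the first systemd
# unit named in a "Job for <unit>.service failed" message.
failed_unit() {
  sed -n 's/.*Job for \([^ ]*\.service\) failed.*/\1/p' | head -n 1
}

# Usage (not run here): extract the unit from a saved log, then inspect it.
#   unit="$(failed_unit < installation.log)"
#   systemctl status "$unit"
#   journalctl -u "$unit" --no-pager -n 100
```

For the log in this post, the extracted unit is glassfish-domain1.service, so the last two commands would show why the GlassFish domain failed to start.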