During the installation of Hopsworks, when I ran the script ./hopsworks-installer.sh, I got an error:
Please enter your email address to continue:
my_email_address
Registering hopsworks instance…
curl: (28) Failed to connect to snurran.sics.se port 8443: Connection timed out
Does this error affect the installation? I pressed Enter to continue, and in the log file I got an error:
$ tail -f installation.log
at javax.validation.Validation$GenericBootstrapImpl.configure(Validation.java:276)
at javax.validation.Validation.buildDefaultValidatorFactory(Validation.java:110)
at io.dropwizard.setup.Bootstrap.<init>(Bootstrap.java:62)
at io.dropwizard.Application.run(Application.java:67)
at se.kth.karamel.webservice.KaramelServiceApplication.main(KaramelServiceApplication.java:233)
Caused by: java.lang.ClassNotFoundException: javax.xml.bind.JAXBException
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
… 13 more
The Karamel interface launched with a connected status, but then it disconnected.
Could you please help me with the installation of Hopsworks? I'm trying to install it on a single machine (on-premises installation).
Thanks in advance.
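A note on the stack trace above: the `java.base/` prefix on the class-loader frames indicates that the process is running on Java 9 or newer, where `javax.xml.bind` (JAXB) is not available by default (it was removed from the JDK in Java 11). Whether this is the actual cause of the Karamel failure is an assumption; the thread does not confirm it. A minimal sketch for checking which Java major version is on the PATH, where `java_major` is a hypothetical helper name:

```shell
# Hypothetical helper: reduce a Java version string to its major version.
# Handles the legacy "1.8.0_252" scheme as well as "11.0.7".
java_major() {
  jm_v="$1"
  case "$jm_v" in
    1.*) jm_v="${jm_v#1.}" ;;   # legacy scheme: 1.8.x means Java 8
  esac
  # Cut everything from the first '.', '_' or '-' onwards.
  printf '%s\n' "${jm_v%%[._-]*}"
}

# Usage: `java -version` writes to stderr; the version is the quoted token.
if command -v java >/dev/null 2>&1; then
  raw="$(java -version 2>&1 | head -n 1 | sed -E 's/.*"([^"]+)".*/\1/')"
  echo "Java major version on PATH: $(java_major "$raw")"
else
  echo "java not found on PATH"
fi
```

If this reports 8, the explanation above does not apply; if it reports 9 or higher, running the installer with Java 8 is one thing to try.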
Thanks for your answer. In fact, I ran the purge first and then re-ran the hopsworks-installer.sh script, but I still have the same error.
To explain further: right after entering my email I press Enter, and the following message is displayed: Registering hopsworks instance....
After ~30s I got the same message:
curl: (28) Failed to connect to snurran.sics.se port 8443: Connection timed out
Is it related to yesterday's issue (issues with the machines serving the artifacts for the community installation)?
I really need your help to finish the Hopsworks installation so that I can start working with it. I'm ready for a quick conference call if you have time. Thanks in advance.
I come back to you with good news. In fact, after running the purge, many bugs were resolved and I could continue the installation.
But I have a new bug: in the Karamel interface, the solo.rb step was not installed; it failed:
4. | make solo.rb | FAILED | retry skip log| 7530
In the logs I have:
sudo: no tty present and no askpass program specified
I disabled the password for all users, but the solo.rb step still does not work.
Could you please tell me why? Thanks in advance,
There is something wrong with your sudoers configuration. It's still asking for the password.
Could you please double-check that you can use sudo without having to type the password? Otherwise you can pass your sudo password to the installer script with the --password flag.
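The first suggestion (checking that sudo works without a password) can be tested from the shell. The sketch below relies on sudo's `-n` (non-interactive) option, which makes sudo fail instead of prompting. The function name and its optional argument are hypothetical, added only so the check can be exercised without real sudo; the exact syntax of the installer's --password flag is not shown in this thread and is not covered here.

```shell
# Returns 0 if sudo can run a command without prompting, non-zero otherwise.
# The optional first argument (the sudo binary to use) is an
# illustration-only hook, not part of any installer.
can_sudo_without_password() {
  "${1:-sudo}" -n true >/dev/null 2>&1
}

if can_sudo_without_password; then
  echo "sudo works without a password"
else
  echo "sudo would prompt for a password (or is not installed)"
fi
```

If the check fails, the "no tty present and no askpass program specified" message is what you would expect: sudo needs a password but has no terminal to ask on. A NOPASSWD entry for the installing user, edited with visudo, is the usual way to lift that requirement.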
I already resolved this error, but after this I have another one:
clone and vendor hopsworks-chef| FAILED | retry skip log| 34294
The error is the following:
Cloning into 'hopsworks-chef'...
Switched to a new branch '1.3'
The following error occurred while reading the cookbook `kube-hops':
Ridley::Errors::FromFileParserError: Could not parse `/tmp/d20200615-26917-fbhmmv/metadata.rb': wrong number of arguments (given 2, expected 0)
hing 'dela' from https://github.com/logicalclocks/dela-chef.git (at 1.3)
What do you think about this error, please?
Thanks a lot
I know that I am bothering you with my questions, but I am really blocked when the installation reaches the clone and vendor hopsworks-chef step. It doesn't find the URL when it clones, especially the /tree/1.3 folders.
And it returns the error that I sent you (my last reply).
Log:
Cloning into 'hopsworks-chef'...
Switched to a new branch '1.3'
The following error occurred while reading the cookbook `kube-hops':
Ridley::Errors::FromFileParserError: Could not parse `/tmp/d20200615-26917-fbhmmv/metadata.rb': wrong number of arguments (given 2, expected 0)
hing 'dela' from https://github.com/logicalclocks/dela-chef.git (at 1.3)
I think that’s the reason why it fails to parse the cookbooks.
Could you please try to remove Chef 13 from the system and try again? The hopsworks-installer.sh is going to install the correct version (Chef 14) for our cookbooks.
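Before removing anything, it can help to confirm which Chef is currently on the PATH. This is a minimal sketch that assumes `chef-solo --version` prints a banner of the form "Chef: 13.8.5"; `chef_major` is a hypothetical helper name, and the banner format may differ on other Chef releases.

```shell
# Hypothetical helper: pull the major version out of a version banner
# such as "Chef: 13.8.5" (banner format is an assumption).
chef_major() {
  printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+)\..*$/\1/'
}

if command -v chef-solo >/dev/null 2>&1; then
  banner="$(chef-solo --version 2>/dev/null | head -n 1)"
  major="$(chef_major "$banner")"
  if [ "$major" = "14" ]; then
    echo "Chef $major found: matches the version the cookbooks expect"
  else
    echo "Chef $major found: remove it so the installer can set up Chef 14"
  fi
else
  echo "chef-solo not found: the installer will install its own Chef"
fi
```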
Thanks @Fabio, the issue is resolved.
But another error occurs. Just after resolving the Chef version issue, the installation went fast, but it is now blocked at the TensorFlow installation and the platform switched to paused, although it had already installed many services.
There are many errors, but as I understand it the most important one is the following:
Detected 8 CPUs online; setting concurrency level to 8.
STDERR: ERROR: An NVIDIA kernel module 'nvidia-drm' appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
---- End output of "bash" "/tmp/chef-script20200622-18440-1f6hjfy" ----
Ran "bash" "/tmp/chef-script20200622-18440-1f6hjfy" returned 1
What do you think about this issue, please?
Thanks
Can you please send the cluster definition? You can find it in your working directory under cluster-definition/hopsworks-installer-active.yml. Remove the passwords before sending it.
From the error, my guess is that you are installing a GPU node and at the same time you are running the Ubuntu UI.
During the setup of the GPU node, the Chef recipes install the NVIDIA drivers. To be able to do so, the GPU needs to be idle (no processes running) and the Nouveau drivers need to be blacklisted.
From the error message, it seems that the Nouveau drivers are already blacklisted, but there are some processes running on the GPU(s).
To progress with the installation, I suggest you stop the desktop environment and use the command line to complete the installation.
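To see whether the GPU is actually idle before retrying, one option is to look at the use counts that `lsmod` reports for the NVIDIA kernel modules. The sketch below is only an illustration; `nvidia_modules_in_use` is a hypothetical helper name, and a non-zero count can also come from another module depending on this one, not only from a running process.

```shell
# Reads `lsmod` output on stdin and prints NVIDIA modules whose use count
# (third column) is above zero. lsmod spells module names with underscores,
# so the 'nvidia-drm' from the error shows up as nvidia_drm.
nvidia_modules_in_use() {
  awk 'NR > 1 && $1 ~ /^nvidia/ && $3 > 0 { print $1 }'
}

if command -v lsmod >/dev/null 2>&1; then
  busy="$(lsmod | nvidia_modules_in_use)"
  if [ -n "$busy" ]; then
    echo "NVIDIA modules still in use:"
    echo "$busy"
  else
    echo "No NVIDIA module reports users"
  fi
fi

# To see which processes hold the GPU device files open (needs root):
#   sudo lsof /dev/nvidia*
```

If modules are still listed after the desktop environment has been stopped, something else (for example a CUDA program or the NVIDIA Persistence Daemon, both named in the error text) is likely still holding the GPU.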
What do you mean by "stop the desktop environment and use the CLI to complete the installation"?
I ask because, before asking you the question, I already tried running the command sudo systemctl set-default multi-user, but it stopped everything. I followed this: (debian - How to unload kernel module 'nvidia-drm'? - Unix & Linux Stack Exchange)
I think the easiest way is for you to connect to the machine you are working with using SSH, stop the GNOME (or Unity) process, and do the installation from there.
This way you don't have issues related to processes using the GPU.
Thanks for your answers and suggestions during this short period.
Regarding your suggestion to connect to the machine using SSH, stop the GNOME (or Unity) process, and do the installation from there: I stopped the gdm process, then connected to my machine from another one using ssh and launched the installation. I got the same error when it reached the TensorFlow installation.
I tried to skip the TensorFlow installation, and the installation continued, but it got blocked at the Hopsworks service installation with very long logs. I picked what I think is the most important part:
Installing Cookbook Gems:
Compiling Cookbooks...
/tmp/chef-solo/cookbooks/flink/libraries/inifile.rb:11: warning: already initialized constant IniFile::VERSION
/tmp/chef-solo/cookbooks/kagent/libraries/inifile.rb:11: warning: previous definition of VERSION was here
/tmp/chef-solo/cookbooks/glassfish/resources/archive.rb:86: warning: constant ::Fixnum is deprecated
/tmp/chef-solo/cookbooks/glassfish/resources/connector_connection_pool.rb:39: warning: constant ::Fixnum is deprecated
/tmp/chef-solo/cookbooks/glassfish/resources/jdbc_connection_pool.rb:82: warning: constant ::Fixnum is deprecated
Converging 127 resources
Running handlers:
[2020-06-23T14:43:47+02:00] ERROR: Running exception handlers
Running handlers complete
[2020-06-23T14:43:47+02:00] ERROR: Exception handlers complete
Chef Client failed. 18 resources updated in 20 minutes 12 seconds
[2020-06-23T14:43:47+02:00] FATAL: Stacktrace dumped to /tmp/chef-solo/chef-stacktrace.out
[2020-06-23T14:43:47+02:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2020-06-23T14:43:47+02:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: hopsworks_grants[reload_systemd] (hopsworks::install line 509) had an error: Mixlib::ShellOut::ShellCommandFailed: bash[reload_systemd] (/tmp/chef-solo/cookbooks/hopsworks/providers/grants.rb line 15) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of "bash" "/tmp/chef-script20200623-1907-1i3w2z3" ----
STDOUT:
STDERR: Job for glassfish-domain1.service failed because the control process exited with error code.
See "systemctl status glassfish-domain1.service" and "journalctl -xe" for details.
---- End output of "bash" "/tmp/chef-script20200623-1907-1i3w2z3" ----
Ran "bash" "/tmp/chef-script20200623-1907-1i3w2z3" returned 1
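The STDERR above already names the two commands that show why the service failed to start. A minimal sketch that pulls the unit name out of the message and runs them for that unit only; `failed_unit` is a hypothetical helper name, and reading the full journal may require sudo depending on the system setup.

```shell
# Hypothetical helper: extract the unit name from a systemd
# "Job for <unit> failed" message.
failed_unit() {
  printf '%s\n' "$1" | sed -n -E 's/.*Job for ([^ ]+\.service) failed.*/\1/p'
}

msg='STDERR: Job for glassfish-domain1.service failed because the control process exited with error code.'
unit="$(failed_unit "$msg")"
echo "Failed unit: $unit"

# Show the unit state and its most recent log lines without a pager.
if command -v systemctl >/dev/null 2>&1; then
  systemctl status "$unit" --no-pager || true
fi
if command -v journalctl >/dev/null 2>&1; then
  journalctl -u "$unit" -n 50 --no-pager || true
fi
```

The last lines of the journal output are usually the place to look for the concrete reason the GlassFish domain did not come up.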
I think that I will quit the Hopsworks installation.
Thanks