Bare-metal -1 cluster install failure

Hello,

It seems I may be missing something, but I can't find what it is. I ran the install on a physical server (48 GB RAM, 400 GB HDD, 2 GPUs)
and it does not even start. It seems some classes or modules are missing here.

This is all I have from installation.log:

Exception in thread "main" java.lang.NoClassDefFoundError: javax/xml/bind/JAXBException
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Class.java:415)
at org.jboss.logging.Logger.getMessageLogger(Logger.java:2248)
at org.jboss.logging.Logger.getMessageLogger(Logger.java:2214)
at org.hibernate.validator.internal.util.logging.LoggerFactory.make(LoggerFactory.java:29)
at org.hibernate.validator.internal.util.Version.(Version.java:27)
at org.hibernate.validator.internal.engine.ConfigurationImpl.(ConfigurationImpl.java:65)
at org.hibernate.validator.HibernateValidator.createGenericConfiguration(HibernateValidator.java:41)
at javax.validation.Validation$GenericBootstrapImpl.configure(Validation.java:276)
at javax.validation.Validation.buildDefaultValidatorFactory(Validation.java:110)
at io.dropwizard.setup.Bootstrap.(Bootstrap.java:62)
at io.dropwizard.Application.run(Application.java:67)
at se.kth.karamel.webservice.KaramelServiceApplication.main(KaramelServiceApplication.java:233)
Caused by: java.lang.ClassNotFoundException: javax.xml.bind.JAXBException
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
… 13 more

This is what the server has installed so far:
**java --version**
java 12.0.1 2019-04-16
Java™ SE Runtime Environment (build 12.0.1+12)
Java HotSpot™ 64-Bit Server VM (build 12.0.1+12, mixed mode, sharing)
[root@PER320-2 cluster]#

Hi @Fernando_Marines,

Downgrade java to 1.8 and try again.
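
A note on why this works: the `javax.xml.bind` (JAXB) classes named in the stack trace shipped with the JDK through Java 8, were deprecated in Java 9, and were removed in Java 11, so they are not available on Java 12. The sketch below is one way to check the major version on the PATH before starting the installer. The function name `java_major` is made up for this example, and the parsing assumes the usual `version "x.y.z"` line printed by `java -version`.

```shell
#!/usr/bin/env bash
# Extract the major Java version from a version string (illustrative helper).
# "1.8.0_252" -> 8 (legacy 1.x scheme), "12.0.1" -> 12 (newer scheme).
java_major() {
  local v="$1"
  v="${v#1.}"          # drop the legacy "1." prefix if present
  echo "${v%%[._-]*}"  # keep everything before the first '.', '_' or '-'
}

if command -v java >/dev/null 2>&1; then
  # `java -version` writes a line such as: java version "12.0.1" 2019-04-16
  raw="$(java -version 2>&1 | sed -n -E 's/.*version "([^"]+)".*/\1/p' | head -n 1)"
  major="$(java_major "$raw")"
  if [ "$major" != "8" ]; then
    echo "Found Java $major; the advice in this thread is to use Java 1.8."
  else
    echo "Found Java 8."
  fi
fi
```

If several JDKs are installed, the result only reflects whichever `java` comes first on the PATH of the user running the installer.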

Thanks for that, @Alex. However, the install is now stuck here:

WARN [2020-07-23 02:31:51,948] se.kth.karamel.common.util.Confs: Couldn’t find karamel conf file in ‘/home/fmarines/.karamel’
INFO [2020-07-23 02:31:55,562] se.kth.karamel.client.api.CookbookCacheIml: 0-level cookbooks for Hops is 21
INFO [2020-07-23 02:31:55,564] se.kth.karamel.client.api.CookbookCacheIml: 1-level cookbooks for Hops is 20
INFO [2020-07-23 02:31:55,597] se.kth.karamel.client.api.CookbookCacheIml: 0-level cookbooks for Hops is 1
INFO [2020-07-23 02:31:55,597] se.kth.karamel.client.api.CookbookCacheIml: 1-level cookbooks for Hops is 20
INFO [2020-07-23 02:31:55,599] se.kth.karamel.client.api.CookbookCacheIml: 2-level cookbooks for Hops is 14
INFO [2020-07-23 02:31:55,603] se.kth.karamel.client.api.CookbookCacheIml: 0-level cookbooks for Hops is 1
INFO [2020-07-23 02:31:55,603] se.kth.karamel.client.api.CookbookCacheIml: 1-level cookbooks for Hops is 20
INFO [2020-07-23 02:31:55,604] se.kth.karamel.client.api.CookbookCacheIml: 2-level cookbooks for Hops is 14
INFO [2020-07-23 02:31:58,726] se.kth.karamel.client.api.CookbookCacheIml: 0-level cookbooks for Hops is 21
INFO [2020-07-23 02:31:58,727] se.kth.karamel.client.api.CookbookCacheIml: 1-level cookbooks for Hops is 20
INFO [2020-07-23 02:31:58,742] se.kth.karamel.client.api.CookbookCacheIml: 0-level cookbooks for Hops is 1
INFO [2020-07-23 02:31:58,742] se.kth.karamel.client.api.CookbookCacheIml: 1-level cookbooks for Hops is 20
INFO [2020-07-23 02:31:58,743] se.kth.karamel.client.api.CookbookCacheIml: 2-level cookbooks for Hops is 14
INFO [2020-07-23 02:31:58,748] se.kth.karamel.client.api.CookbookCacheIml: 0-level cookbooks for Hops is 1
INFO [2020-07-23 02:31:58,748] se.kth.karamel.client.api.CookbookCacheIml: 1-level cookbooks for Hops is 20
INFO [2020-07-23 02:31:58,749] se.kth.karamel.client.api.CookbookCacheIml: 2-level cookbooks for Hops is 14
INFO [2020-07-23 02:31:58,767] se.kth.karamel.client.api.CookbookCacheIml: 0-level cookbooks for Hops is 21
INFO [2020-07-23 02:31:58,769] se.kth.karamel.client.api.CookbookCacheIml: 1-level cookbooks for Hops is 20
INFO [2020-07-23 02:31:58,776] se.kth.karamel.backend.launcher.baremetal.BaremetalLauncher: Public-key=’/home/fmarines/.ssh/id_rsa.pub’
INFO [2020-07-23 02:31:58,776] se.kth.karamel.backend.launcher.baremetal.BaremetalLauncher: Private-key=’/home/fmarines/.ssh/id_rsa’
INFO [2020-07-23 02:31:58,777] se.kth.karamel.backend.ClusterManager: Cluster-Manager started for ‘Hops’ d’-’
INFO [2020-07-23 02:31:58,777] se.kth.karamel.backend.machines.MachinesMonitor: Machines-Monitor started for ‘Hops’ d’-’
INFO [2020-07-23 02:31:58,778] se.kth.karamel.backend.ClusterManager: Going to serve ‘LAUNCH_CLUSTER’
INFO [2020-07-23 02:31:58,781] se.kth.karamel.backend.ClusterManager: Prelaunch Cleaning ‘Hops’ …
INFO [2020-07-23 02:31:58,781] se.kth.karamel.backend.ClusterManager: \o/\o/\o/\o/\o/‘Hops’ PRECLEANED \o/\o/\o/\o/\o/
INFO [2020-07-23 02:31:58,782] se.kth.karamel.backend.ClusterManager: Forking groups ‘Hops’ …
INFO [2020-07-23 02:31:58,782] se.kth.karamel.backend.ClusterManager: \o/\o/\o/\o/\o/‘Hops’ GROUPS_FORKED \o/\o/\o/\o/\o/
INFO [2020-07-23 02:31:58,783] se.kth.karamel.backend.ClusterManager: Launching ‘Hops’ …
INFO [2020-07-23 02:31:58,783] se.kth.karamel.backend.ClusterManager: groups ‘[se.kth.karamel.backend.running.model.GroupRuntime@54ea4bf9]’
INFO [2020-07-23 02:31:58,783] se.kth.karamel.backend.ClusterManager: Gogo
INFO [2020-07-23 02:31:58,783] se.kth.karamel.backend.ClusterManager: Using provider ‘se.kth.karamel.common.clusterdef.Baremetal@60d0b720’
INFO [2020-07-23 02:31:58,783] se.kth.karamel.backend.ClusterManager: Using launcher ‘se.kth.karamel.backend.launcher.baremetal.BaremetalLauncher@1396194c’
INFO [2020-07-23 02:31:58,783] se.kth.karamel.backend.ClusterManager: Using launcher ‘se.kth.karamel.backend.launcher.baremetal.BaremetalLauncher@1396194c’
INFO [2020-07-23 02:31:58,943] se.kth.karamel.backend.ClusterManager: \o/\o/\o/\o/\o/‘Hops’ MACHINES_FORKED \o/\o/\o/\o/\o/
INFO [2020-07-23 02:31:58,943] se.kth.karamel.backend.ClusterManager: Going to serve ‘SUBMIT_INSTALL_DAG’
INFO [2020-07-23 02:31:58,943] se.kth.karamel.backend.ClusterManager: Running the DAG for ‘Hops’ …
INFO [2020-07-23 02:31:59,015] se.kth.karamel.client.api.CookbookCacheIml: 0-level cookbooks for Hops is 21
INFO [2020-07-23 02:31:59,017] se.kth.karamel.client.api.CookbookCacheIml: 1-level cookbooks for Hops is 20
INFO [2020-07-23 02:32:08,822] net.schmizz.sshj.transport.random.BouncyCastleRandom: Generating random seed from SecureRandom.
INFO [2020-07-23 02:32:08,926] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: connecting …
INFO [2020-07-23 02:32:08,927] net.schmizz.sshj.transport.TransportImpl: Client identity string: SSH-2.0-SSHJ_0.20.0
INFO [2020-07-23 02:32:08,943] net.schmizz.sshj.transport.TransportImpl: Server identity string: SSH-2.0-OpenSSH_7.4
INFO [2020-07-23 02:32:09,114] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Yey!! connected ^-^
WARN [2020-07-23 02:32:09,795] net.schmizz.sshj.xfer.scp.SCPEngine: SCP exit status: 1
INFO [2020-07-23 02:32:09,795] se.kth.karamel.backend.machines.SshMachine: Succeeded tasklist does not exist on 192.168.0.230
INFO [2020-07-23 02:32:09,798] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: find os-type
INFO [2020-07-23 02:32:10,128] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: apt-get essentials
INFO [2020-07-23 03:12:27,708] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: install chefdk
INFO [2020-07-23 03:12:28,417] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: make solo.rb
INFO [2020-07-23 03:12:34,138] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: make solo.rb
ERROR [2020-07-23 03:12:41,165] se.kth.karamel.backend.dag.DagNode: Failed 'make solo.rb on 192.168.0.230' because '192.168.0.230: Command did not complete: mkdir -p /home/fmarines/.karamel/install ; cd /home/fmarines/.karamel/install; echo $$ > pid; echo '#!/bin/bash
set -eo pipefail
sudo touch solo.rb
sudo chmod 777 solo.rb
cat > solo.rb <<-'END_OF_FILE'
file_cache_path "/tmp/chef-solo"
cookbook_path ["/home/fmarines/.karamel/cookbooks/hopsworks-chef_vendor"]
END_OF_FILE' > make_solo_rb.sh ; chmod +x make_solo_rb.sh ; ./make_solo_rb.sh', DAG is stuck here :(
INFO [2020-07-23 03:12:44,020] se.kth.karamel.backend.machines.MachinesMonitor: Sending pause signal to all machines


Do you know what I should review or redo?

Regards

Hi @Fernando_Marines,

Karamel needs to be able to SSH into each of the machines you are installing to. Did you make sure that the proper SSH key is on all machines? Also, if the SSH key is password-protected, did you provide the password to Karamel in the UI?

Make sure you can SSH into each of the machines as the install user. Check this on the machine running Karamel too: it must be able to SSH into itself.
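
The check described above can be scripted. This is a minimal sketch, not part of the installer: `can_ssh` and `SSH_BIN` are illustrative names, the install user is taken from `$USER`, and the hosts are passed as arguments. `BatchMode=yes` makes `ssh` fail instead of prompting, so a key that needs a passphrase (and is not loaded in an agent) shows up as a failure rather than a hidden prompt.

```shell
#!/usr/bin/env bash
# Non-interactive SSH reachability check (illustrative sketch).
SSH_BIN="${SSH_BIN:-ssh}"   # overridable, e.g. for a dry run

# Succeeds only if key-based login works without any prompt.
can_ssh() {
  local user="$1" host="$2"
  "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" true >/dev/null 2>&1
}

# Usage: ./check_ssh.sh localhost 192.168.0.230
for host in "$@"; do
  if can_ssh "${USER:-$(id -un)}" "$host"; then
    echo "OK   $host"
  else
    echo "FAIL $host (key missing, passphrase needed, or host unreachable)"
  fi
done
```

Run it as the install user on the machine that runs Karamel, and include that machine's own address in the host list.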

Yes, I validated that requirement first. Also, the installer no longer asks for a password in the initial steps of the installation (see below):

Found existing id_rsa.pub
Found existing entry in authorized_keys
Connection to localhost closed.
Connection to 192.168.0.230 closed.
File ‘hopsworks-installer.yml’ already there; not retrieving.
File ‘hopsworks-worker.yml’ already there; not retrieving.
File ‘hopsworks-worker-gpu.yml’ already there; not retrieving.
Press ENTER to continue

It looks like the same error as in the virtual machine installation (Vagrant): it stops at the same step with the same error.

I forgot to mention that I was also asked for the sudo password, and I provided it during the install.

Found karamel
sudo: a password is required

It appears you need a sudo password for this account.
Enter the sudo password for fmarines:

[sudo] password for fmarines:
Running command from /extend1/cluster/karamel-0.6:

setsid ./bin/karamel -headless -launch …/cluster-defns/hopsworks-installer-active.yml -passwd @@@@@@ > …/installation.log 2>&1 &


Installation has started, but may take 1 hour or more…

The Karamel installer UI will soon start at: http://192.168.0.230:9090/index.html
Note: port 9090 must be open for external traffic and Karamel will shutdown when installation finishes.

=====================================================================

You can view the installation logs with this command:

tail -f installation.log


OK, so I think something may not be responding correctly. I restarted the server (just in case)
and removed the password requirement for the sudo command. Now the install produces different output, but it stops again at the same point.

  },
  "cuda": {
    "accept_nvidia_download_terms": "true"
  },
  "alertmanager": {
    "email": {
      "to": "sre@logicalclocks.com",
      "smtp_host": "mail.hello.com",
      "from": "hopsworks@logicalclocks.com"
    }
  },
  "prometheus": {
    "retention_time": "8h"
  },
  "private_ips": [
    "192.168.0.230"
  ],
  "public_ips": [
    "192.168.0.230"
  ],
  "hosts": {
    "192.168.0.230": "192.168.0.230"
  },
  "run_list": [
    "kagent::install"
  ]
}
END_OF_FILE
sudo chef-solo -c /home/fmarines/.karamel/install/solo.rb -j /home/fmarines/.karamel/install/kagent__install.json 2>&1 | tee kagent__install.log
echo 'https://github.com/logicalclocks/kagent-chef/tree/1.3/kagent::install' >> succeed_list
' > kagent__install.sh ; chmod +x kagent__install.sh ; ./kagent__install.sh
', DAG is stuck here :(
INFO [2020-07-23 14:37:23,836] se.kth.karamel.backend.machines.MachinesMonitor: Sending pause signal to all machines

It seems there should be a license acceptance step during the install that is not happening. This is what I got from the .karamel directory and the kagent__install log:

[fmarines@PER320-2 ~]$ cd /home/fmarines/.karamel/install
[fmarines@PER320-2 install]$ ls
aptget.sh clone_hopsworks-chef.sh install-chefdk.sh kagent__install.json kagent__install.log kagent__install.sh make_solo_rb.sh order ostype ostype.sh pid solo.rb succeed_list
[fmarines@PER320-2 install]$ ls -ltr
total 76
-rwxrwxr-x. 1 fmarines fmarines 1094 Jul 23 12:36 ostype.sh
-rw-rw-r--. 1 fmarines fmarines 8 Jul 23 12:36 ostype
-rwxrwxr-x. 1 fmarines fmarines 994 Jul 23 12:36 aptget.sh
-rwxrwxr-x. 1 fmarines fmarines 1130 Jul 23 12:36 install-chefdk.sh
-rwxrwxr-x. 1 fmarines fmarines 219 Jul 23 12:36 make_solo_rb.sh
-rwxrwxrwx. 1 root root 107 Jul 23 12:36 solo.rb
-rwxrwxr-x. 1 fmarines fmarines 954 Jul 23 12:36 clone_hopsworks-chef.sh
-rw-rw-r--. 1 fmarines fmarines 121 Jul 23 12:37 succeed_list
-rw-rw-r--. 1 fmarines fmarines 5 Jul 23 12:37 pid
-rwxrwxr-x. 1 fmarines fmarines 13064 Jul 23 12:37 kagent__install.sh
-rw-rw-r--. 1 fmarines fmarines 52 Jul 23 12:37 order
-rw-rw-r--. 1 fmarines fmarines 12695 Jul 23 12:37 kagent__install.json
-rw-rw-r--. 1 fmarines fmarines 63 Jul 23 12:37 kagent__install.log
[fmarines@PER320-2 install]$ vi kagent__install.log
Chef Infra Client cannot execute without accepting the license
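
The thread does not record how the license prompt was cleared. For reference, Chef's licensing documentation for Chef Infra Client 15 and later describes a `--chef-license accept` command-line option (and an equivalent `CHEF_LICENSE=accept` environment variable). Below is a hedged sketch of running one Karamel-generated recipe by hand with that option. `run_recipe` and `CHEF_SOLO` are names invented here, and whether the acceptance then carries over to Karamel's own `chef-solo` calls is an assumption to verify on your machine.

```shell
#!/usr/bin/env bash
# Run a Karamel-generated recipe manually with the Chef license accepted.
CHEF_SOLO="${CHEF_SOLO:-chef-solo}"   # overridable for a dry run

run_recipe() {
  local install_dir="$1" recipe="$2"
  "$CHEF_SOLO" --chef-license accept \
    -c "$install_dir/solo.rb" \
    -j "$install_dir/$recipe.json"
}

# Example, with the whole script run under sudo:
#   run_recipe /home/fmarines/.karamel/install kagent__install
```

Setting `CHEF_SOLO=echo` before calling `run_recipe` prints the arguments instead of running Chef, which is a cheap way to confirm the paths first.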

Hi @Alex,
Once the Chef license acceptance was solved, conda failed 2 hours later:

[fmarines@PER320-2 cluster]$ tail installation.log
  "run_list": [
    "conda::install"
  ]
}
END_OF_FILE
sudo chef-solo -c /home/fmarines/.karamel/install/solo.rb -j /home/fmarines/.karamel/install/conda__install.json 2>&1 | tee conda__install.log
echo 'https://github.com/logicalclocks/conda-chef/tree/1.3/conda::install' >> succeed_list
' > conda__install.sh ; chmod +x conda__install.sh ; ./conda__install.sh
', DAG is stuck here :(
INFO [2020-07-23 20:45:52,763] se.kth.karamel.backend.machines.MachinesMonitor: Sending pause signal to all machines

This is from conda__install.log:

Recipe: conda::install
  * ulimit_domain[anaconda] action create
  Recipe:
    * ulimit_rule[ulimit_rule[anaconda:nice-hard--10]] action createCreate: {"anaconda"=>{"nice"=>{"hard"=>-10}}}
      (up to date)
    * ulimit_rule[ulimit_rule[anaconda:nice-soft--10]] action createCreate: {"anaconda"=>{"nice"=>{"hard"=>-10, "soft"=>-10}}}
      (up to date)
    * template[/etc/security/limits.d/anaconda.conf] action create (up to date)
      (up to date)
Recipe: conda::install
  * template[/home/anaconda/hops-system-environment.yml] action create (up to date)
  * directory[/home/anaconda/.conda] action create (up to date)
  * directory[/home/anaconda/.conda/pkgs] action create (up to date)
  * file[/home/anaconda/.conda/environments.txt] action create (up to date)
  * bash[update_conda] action run

    ================================================================================
    Error executing action run on resource 'bash[update_conda]'
    ================================================================================

    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of "bash" "/tmp/chef-script20200723-37215-euo5m1" ----
    STDOUT: Collecting package metadata (current_repodata.json): ...working... done
    Solving environment: ...working... done

    # All requested packages already installed.

    Collecting package metadata (current_repodata.json): ...working... done
    Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
    STDERR: ==> WARNING: A newer version of conda exists. <==
      current version: 4.8.2
      latest version: 4.8.3

    Please update conda by running

        $ conda update -n base -c defaults conda



    # >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/exceptions.py", line 1079, in __call__
        return func(*args, **kwargs)
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/cli/main.py", line 84, in _main
        exit_code = do_call(args, p)
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/cli/conda_argparse.py", line 82, in do_call
        return getattr(module, func_name)(args, parser)
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/cli/main_update.py", line 20, in execute
        install(args, parser, 'update')
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/cli/install.py", line 265, in install
        should_retry_solve=(_should_retry_unfrozen or repodata_fn != repodata_fns[-1]),
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/core/solve.py", line 117, in solve_for_transaction
        should_retry_solve)
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/core/solve.py", line 158, in solve_for_diff
        force_remove, should_retry_solve)
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/core/solve.py", line 281, in solve_final_state
        ssc = self._run_sat(ssc)
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/common/io.py", line 88, in decorated
        return f(*args, **kwds)
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/core/solve.py", line 808, in _run_sat
        should_retry_solve=ssc.should_retry_solve
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/common/io.py", line 88, in decorated
        return f(*args, **kwds)
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/resolve.py", line 1412, in solve
        if not is_converged(solution):
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/resolve.py", line 1304, in is_converged
        psolution = clean(solution)
      File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/resolve.py", line 1293, in clean
        return [q for q in (C.from_index(s) for s in sol)
    TypeError: 'NoneType' object is not iterable

    $ /srv/hops/anaconda/anaconda/bin/conda update anaconda -y -q

    environment variables:
      CIO_TEST=
      CONDA_ROOT=/srv/hops/anaconda/anaconda-3-2020.02
      PATH=/sbin:/bin:/usr/sbin:/usr/bin
      PYTHONUNBUFFERED=1
      REQUESTS_CA_BUNDLE=
      SSL_CERT_FILE=
      SUDO_COMMAND=/bin/chef-solo -c /home/fmarines/.karamel/install/solo.rb -j
        /home/fmarines/.karamel/install/conda__install.json
      SUDO_GID=1000
      SUDO_UID=1000
      SUDO_USER=fmarines

         active environment : None
           user config file : /home/anaconda/.condarc
     populated config files : /home/anaconda/.condarc
              conda version : 4.8.2
        conda-build version : 3.18.11
             python version : 3.7.6.final.0
           virtual packages : __cuda=10.2
                              __glibc=2.17
           base environment : /srv/hops/anaconda/anaconda-3-2020.02 (writable)
               channel URLs : https://conda.anaconda.org/pytorch/linux-64
                              https://conda.anaconda.org/pytorch/noarch
                              https://repo.anaconda.com/pkgs/main/linux-64
                              https://repo.anaconda.com/pkgs/main/noarch
                              https://repo.anaconda.com/pkgs/r/linux-64
                              https://repo.anaconda.com/pkgs/r/noarch
              package cache : /srv/hops/anaconda/anaconda/pkgs
           envs directories : /srv/hops/anaconda/anaconda/envs
                              /srv/hops/anaconda/anaconda-3-2020.02/envs
                              /home/anaconda/.conda/envs
                   platform : linux-64
                 user-agent : conda/4.8.2 requests/2.22.0 CPython/3.7.6 Linux/3.10.0-1127.13.1.el7.x86_64 centos/7.8.2003 glibc/2.17
                    UID:GID : 968:1011
                 netrc file : None
               offline mode : False


    An unexpected error has occurred. Conda has prepared the above report.

    Upload successful.
    ---- End output of "bash" "/tmp/chef-script20200723-37215-euo5m1" ----
    Ran "bash" "/tmp/chef-script20200723-37215-euo5m1" returned 1

    Resource Declaration:
    ---------------------
    # In /tmp/chef-solo/cookbooks/conda/recipes/install.rb

    191: bash "update_conda" do
    192:   user node['conda']['user']
    193:   group node['conda']['group']
    194:   environment ({'HOME' => "/home/#{node['conda']['user']}"})
    195:   cwd "/home/#{node['conda']['user']}"
    196:   retries 1
    197:   retry_delay 10
    198:   code <<-EOF
    199:     #{node['conda']['base_dir']}/bin/conda install --no-deps pycryptosat libcryptominisat
    200:     #{node['conda']['base_dir']}/bin/conda config --set sat_solver pycryptosat
    201:     #{node['conda']['base_dir']}/bin/conda update anaconda -y -q
    202:   EOF
    203: end

    Compiled Resource:
    ------------------
    # Declared in /tmp/chef-solo/cookbooks/conda/recipes/install.rb:191:in `from_file'

    bash("update_conda") do
      action [:run]
      default_guard_interpreter :default
      command nil
      backup 5
      interpreter "bash"
      declared_type :bash
      cookbook_name "conda"
      recipe_name "install"
      user "anaconda"
      group "anaconda"
      code " /srv/hops/anaconda/anaconda/bin/conda install --no-deps pycryptosat libcryptominisat\n /srv/hops/anaconda/anaconda/bin/conda config --set sat_solver pycryptosat\n /srv/hops/anaconda/anaconda/bin/conda update anaconda -y -q\n"
      domain nil
      environment {"HOME"=>"/home/anaconda"}
      cwd "/home/anaconda"
      retries 1
      retry_delay 10
    end

    System Info:
    ------------
    chef_version=15.12.22
    platform=centos
    platform_version=7.8.2003
    ruby=ruby 2.6.6p146 (2020-03-31 revision 67876) [x86_64-linux]
    program_name=/bin/chef-solo
    executable=/opt/chefdk/bin/chef-solo

Running handlers:
[2020-07-23T16:45:44-04:00] ERROR: Running exception handlers
Running handlers complete
[2020-07-23T16:45:44-04:00] ERROR: Exception handlers complete
Chef Infra Client failed. 5 resources updated in 01 minutes 06 seconds
[2020-07-23T16:45:44-04:00] FATAL: Stacktrace dumped to /tmp/chef-solo/chef-stacktrace.out
[2020-07-23T16:45:44-04:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2020-07-23T16:45:44-04:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: bash[update_conda] (conda::install line 191) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of "bash" "/tmp/chef-script20200723-37215-euo5m1" ----
STDOUT: Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
STDERR: ==> WARNING: A newer version of conda exists. <==
  current version: 4.8.2
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base -c defaults conda

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

Traceback (most recent call last):
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/exceptions.py", line 1079, in __call__
    return func(*args, **kwargs)
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/cli/main.py", line 84, in _main
    exit_code = do_call(args, p)
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/cli/conda_argparse.py", line 82, in do_call
    return getattr(module, func_name)(args, parser)
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/cli/main_update.py", line 20, in execute
    install(args, parser, 'update')
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/cli/install.py", line 265, in install
    should_retry_solve=(_should_retry_unfrozen or repodata_fn != repodata_fns[-1]),
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/core/solve.py", line 117, in solve_for_transaction
    should_retry_solve)
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/core/solve.py", line 158, in solve_for_diff
    force_remove, should_retry_solve)
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/core/solve.py", line 281, in solve_final_state
    ssc = self._run_sat(ssc)
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/common/io.py", line 88, in decorated
    return f(*args, **kwds)
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/core/solve.py", line 808, in _run_sat
    should_retry_solve=ssc.should_retry_solve
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/common/io.py", line 88, in decorated
    return f(*args, **kwds)
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/resolve.py", line 1412, in solve
    if not is_converged(solution):
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/resolve.py", line 1304, in is_converged
    psolution = clean(solution)
  File "/srv/hops/anaconda/anaconda-3-2020.02/lib/python3.7/site-packages/conda/resolve.py", line 1293, in clean
    return [q for q in (C.from_index(s) for s in sol)
TypeError: 'NoneType' object is not iterable

$ /srv/hops/anaconda/anaconda/bin/conda update anaconda -y -q

environment variables:
CIO_TEST=
CONDA_ROOT=/srv/hops/anaconda/anaconda-3-2020.02
PATH=/sbin:/bin:/usr/sbin:/usr/bin
PYTHONUNBUFFERED=1
REQUESTS_CA_BUNDLE=
SSL_CERT_FILE=
SUDO_COMMAND=/bin/chef-solo -c /home/fmarines/.karamel/install/solo.rb -j
/home/fmarines/.karamel/install/conda__install.json
SUDO_GID=1000
SUDO_UID=1000
SUDO_USER=fmarines

 active environment : None
   user config file : /home/anaconda/.condarc

populated config files : /home/anaconda/.condarc
conda version : 4.8.2
conda-build version : 3.18.11
python version : 3.7.6.final.0
virtual packages : __cuda=10.2
__glibc=2.17
base environment : /srv/hops/anaconda/anaconda-3-2020.02 (writable)
channel URLs : https://conda.anaconda.org/pytorch/linux-64
https://conda.anaconda.org/pytorch/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /srv/hops/anaconda/anaconda/pkgs
envs directories : /srv/hops/anaconda/anaconda/envs
/srv/hops/anaconda/anaconda-3-2020.02/envs
/home/anaconda/.conda/envs
platform : linux-64
user-agent : conda/4.8.2 requests/2.22.0 CPython/3.7.6 Linux/3.10.0-1127.13.1.el7.x86_64 centos/7.8.2003 glibc/2.17
UID:GID : 968:1011
netrc file : None
offline mode : False

An unexpected error has occurred. Conda has prepared the above report.

Upload successful.
---- End output of "bash" "/tmp/chef-script20200723-37215-euo5m1" ----
Ran "bash" "/tmp/chef-script20200723-37215-euo5m1" returned 1

Hi @Fernando_Marines

Can you give me a few more details on the installation? You already wrote that you are installing on bare metal (48 GB RAM, 400 GB HDD, 2 GPUs).
Which OS and version is the machine running, and which version of Hopsworks are you trying to install? And just to make sure: are you using the Karamel GUI or the install script?

@Fernando_Marines

It might be an issue with upstream conda.
Can you go to this file on the machine:
/home/fmarines/.karamel/cookbooks/hopsworks-chef_vendor/conda/recipes/install.rb
and remove or comment out this line:
#{node['conda']['base_dir']}/bin/conda config --set sat_solver pycryptosat
The line should be towards the end of the file. If you are using version 1.3, it should be line 200.
After modifying the file, you can run the recipe manually with:
/home/fmarines/.karamel/install/conda__install.sh
If the recipe runs successfully, you can go to the Karamel GUI and skip the recipe, and the installation should continue.
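
For anyone who prefers not to edit the recipe by hand, the removal can be sketched with `sed`. This is only an illustration of the step above: `remove_sat_solver_line` is a made-up helper name, the path is the one from this thread, and the vendored cookbook may need `sudo` to modify. A `.bak` copy of the recipe is kept next to the original.

```shell
#!/usr/bin/env bash
# Delete the "conda config --set sat_solver pycryptosat" line from a recipe,
# keeping a backup of the original as <file>.bak.
remove_sat_solver_line() {
  local recipe="$1"
  sed -i.bak '/conda config --set sat_solver pycryptosat/d' "$recipe"
}

recipe="/home/fmarines/.karamel/cookbooks/hopsworks-chef_vendor/conda/recipes/install.rb"
if [ -f "$recipe" ]; then
  remove_sat_solver_line "$recipe"
  # Show what is left of the conda commands so the change can be checked by eye.
  grep -n 'bin/conda' "$recipe"
fi
```

Deleting rather than commenting keeps the edit safe to run twice; the backup file is there if the line needs to be restored.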

Hi @Alex
It is a bare-metal installation on CentOS 7.8 with 2 NVIDIA GPUs, 48 GB RAM, and a 400 GB HDD, using the script installation:

wget https://raw.githubusercontent.com/logicalclocks/karamel-chef/1.3/hopsworks-installer.sh
chmod +x hopsworks-installer.sh
./hopsworks-installer.sh

Your note about the conda recipe seems to work now. The question is how to skip that recipe in the script install, because it seems to start over if I run the install script again.

@Fernando_Marines

You can access the Karamel GUI by opening http://ip:9090/index.html in a browser.
Go to the Terminal drop-down menu item, then click the Status link. There you will see a list with the status of all recipes. The failed recipe will offer two options: retry and skip. Since you already ran the recipe manually, you can choose skip for this one.

Thanks for the tip! Now the roadblock seems to be the ndb_mgmd recipe:

  * template[/srv/hops/mysql-cluster/ndb/scripts/cluster-shutdown.sh] action create (up to date)
  * template[/srv/hops/mysql-cluster/ndb/scripts/cluster-init.sh] action create (up to date)
  * template[/srv/hops/mysql-cluster/ndb/scripts/cluster-start-with-recovery.sh] action create (up to date)
  * template[/srv/hops/mysql-cluster/ndb/scripts/exit-singleuser-mode.sh] action create (up to date)
  * service[ndb_mgmd] action nothing (skipped due to action :nothing)
  * template[/usr/lib/systemd/system/ndb_mgmd.service] action create (up to date)
  * template[/srv/hops/mysql-cluster/config.ini] action create_if_missing (up to date)
  * kagent_config[ndb_mgmd] action add
    * bash[restart-kagent-after-update] action run (skipped due to not_if)
  * kagent_config[ndb_mgmd] action systemd_reload
    * bash[start-if-not-running-ndb_mgmd] action run
      - execute "bash" "/tmp/chef-script20200724-30922-iv6cbm"

  * kagent_keys[/home/mysql] action generate
    * bash[generate-ssh-keypair-for-/home/mysql] action run

    ================================================================================
    Error executing action run on resource 'bash[generate-ssh-keypair-for-/home/mysql]'
    ================================================================================

    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of "bash" "/tmp/chef-script20200724-30922-1ckaa0" ----
    STDOUT:
    STDERR: Saving key "/home/mysql/.ssh/id_rsa" failed: No such file or directory
    ---- End output of "bash" "/tmp/chef-script20200724-30922-1ckaa0" ----
    Ran "bash" "/tmp/chef-script20200724-30922-1ckaa0" returned 1

    Resource Declaration:
    ---------------------
    # In /tmp/chef-solo/cookbooks/kagent/providers/keys.rb

    119:   bash "generate-ssh-keypair-for-#{homedir}" do
    120:     user cb_user
    121:     group cb_group
    122:     code <<-EOF
    123:       ssh-keygen -b 2048 -f #{homedir}/.ssh/id_rsa -t rsa -q -N ''
    124:     EOF
    125:     not_if { ::File.exists?( "#{homedir}/.ssh/id_rsa" ) }
    126:   end
    127: end

    Compiled Resource:
    ------------------
    # Declared in /tmp/chef-solo/cookbooks/kagent/providers/keys.rb:119:in `block in class_from_file'

    bash("generate-ssh-keypair-for-/home/mysql") do
      action [:run]
      default_guard_interpreter :default
      command nil
      backup 5
      interpreter "bash"
      declared_type :bash
      cookbook_name "ndb"
      user "mysql"
      code " ssh-keygen -b 2048 -f /home/mysql/.ssh/id_rsa -t rsa -q -N ''\n"
      domain nil
      group "mysql"
      not_if { #code block }
    end

    System Info:
    ------------
    chef_version=15.12.22
    platform=centos
    platform_version=7.8.2003
    ruby=ruby 2.6.6p146 (2020-03-31 revision 67876) [x86_64-linux]
    program_name=/bin/chef-solo
    executable=/opt/chefdk/bin/chef-solo

    ================================================================================
    Error executing action generate on resource 'kagent_keys[/home/mysql]'
    ================================================================================

    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------

Thanks for that!

Now the problem seems to be ndb. I’ve tried running it manually, but I get the same error as here:

Running handlers:
[2020-07-24T18:51:50-04:00] ERROR: Running exception handlers
Running handlers complete
[2020-07-24T18:51:50-04:00] ERROR: Exception handlers complete
Chef Infra Client failed. 3 resources updated in 10 seconds
[2020-07-24T18:51:50-04:00] FATAL: Stacktrace dumped to /tmp/chef-solo/chef-stacktrace.out
[2020-07-24T18:51:50-04:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2020-07-24T18:51:50-04:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: kagent_keys[/home/mysql] (ndb::mgmd line 147) had an error: Mixlib::ShellOut::ShellCommandFailed: bash[generate-ssh-keypair-for-/home/mysql] (/tmp/chef-solo/cookbooks/kagent/providers/keys.rb line 119) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of "bash" "/tmp/chef-script20200724-37844-9su5mt" ----
STDOUT:
STDERR: Saving key "/home/mysql/.ssh/id_rsa" failed: No such file or directory
---- End output of "bash" "/tmp/chef-script20200724-37844-9su5mt" ----
Ran "bash" "/tmp/chef-script20200724-37844-9su5mt" returned 1

Hi @Fernando_Marines,

Can you please show me the cluster definition you are using?
Also, can you check whether the mysql user was created correctly?
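
One way to run that check from a shell is sketched below. It assumes `getent` is available (it is on a stock CentOS 7 host); `user_home` and `check_account` are hypothetical helper names chosen for illustration and are not part of Hopsworks or Karamel.

```shell
#!/usr/bin/env bash
# Print the home directory of an account, or fail if the account does not exist.
# user_home is a hypothetical helper name used only for this sketch.
user_home() {
  local entry
  entry=$(getent passwd "$1") || return 1
  # Field 6 of a passwd entry is the home directory.
  printf '%s\n' "$entry" | cut -d: -f6
}

# Report whether the account exists and whether its ~/.ssh directory is present.
check_account() {
  local home
  if ! home=$(user_home "$1"); then
    echo "user $1: missing"
    return 1
  fi
  if [ -d "$home/.ssh" ]; then
    echo "user $1: home=$home, .ssh present"
  else
    echo "user $1: home=$home, .ssh missing"
  fi
}

# Example (run as root, because the home directories are mode 700):
#   check_account mysql
```

The home directory printed here comes from the passwd database, so it also shows whether the account's registered home matches the /home/mysql path used in the failing resource.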

Hi @Alex, here is the info:

[fmarines@PER320-2 cluster]$ ls -l /home
total 8
drwx------.  3 airflow  airflow    92 Jul 23 15:22 airflow
drwx------.  5 anaconda anaconda  169 Jul 27 10:22 anaconda
drwx------. 53 fmarines fmarines 4096 Jul 26 20:45 fmarines
drwx------.  3 hdfs     hadoop     92 Jul 23 14:29 hdfs
drwx------.  6 hive     hive      156 Jul 26 19:18 hive
drwx------.  3 kafka    kafka      92 Jul 23 14:44 kafka
drwx------.  3 kagent   kagent     92 Jul 23 14:04 kagent
drwx------.  3 livy     hadoop     92 Jul 23 14:05 livy
drwx------.  2 mysql    mysql       6 Jul 24 18:51 mysql
drwx------.  3 yarn     hadoop     92 Jul 23 14:18 yarn
drwx------.  3 yarnapp  hadoop     92 Jul 23 16:12 yarnapp

[fmarines@PER320-2 cluster]$ ./hopsworks-installer.sh

Karamel/Hopsworks Installer, Copyright© 2020 Logical Clocks AB. All rights reserved.

This program can install Karamel/Chef and/or Hopsworks.

To cancel installation at any time, press CONTROL-C

You appear to have following setup on this host:

  • available memory: 46
  • available disk space (on ‘/’ root partition): 16G
  • available disk space (under ‘/mnt’ partition):
  • available CPUs: 20
  • available GPUS: 4
  • your ip is: 192.168.0.230
  • installation user: fmarines
  • linux distro: centos
  • cluster defn branch: https://raw.githubusercontent.com/logicalclocks/karamel-chef/1.3
  • hopsworks-chef branch: logicalclocks/hopsworks-chef/1.3

WARNING: We recommend at least 60GB of disk space on the root partition. Minimum is 50GB of available disk.
You have 16G space on ‘/’, and no space on ‘/mnt’.

./hopsworks-installer.sh: line 213: -1: substring expression < 0
-------------------- Installation Options --------------------

What would you like to do?

(1) Install a single-host Hopsworks cluster.



| Cluster Name| Phase | Failed/Paused| Actions |

  1. | Hops | RUNNING_DAG| true/true | status tdag vdag groups machines tasks terminate services yaml cost|

Yaml

name: Hops
baremetal:
  ips: []
  sudoPassword: ''
  username: fmarines
cookbooks:
  hopsworks:
    branch: '1.3'
    github: logicalclocks/hopsworks-chef
attrs:
  cuda:
    accept_nvidia_download_terms: 'true'
  install:
    cloud: on-premises
    kubernetes: 'false'
    dir: /srv/hops
  kagent:
    python_conda_versions: '3.6'
  elastic:
    opendistro_security:
      epipe:
        password: 082bf372_201
        username: epipe
      logstash:
        password: 082bf372_201
        username: logstash
      audit:
        enable_rest: 'true'
        enable_transport: 'false'
      jwt:
        exp_ms: '1800000'
      elastic_exporter:
        password: 082bf372_201
        username: elasticexporter
      admin:
        password: 082bf372_201
        username: admin
      kibana:
        password: 082bf372_201
        username: kibana
  hive2:
    mysql_password: 082bf372_203
  mysql:
    password: 082bf372_202
  hops:
    tls:
      enabled: 'false'
    yarn:
      detect-hardware-capabilities: 'false'
      gpus: '*'
      memory_mbs: '45056'
      cgroups_strict_resource_usage: 'false'
      vcores: '19'
    rmappsecurity:
      actor_class: org.apache.hadoop.yarn.server.resourcemanager.security.DevHopsworksRMAppSecurityActions
    capacity:
      resource_calculator_class: org.apache.hadoop.yarn.util.resource.DominantResourceCalculatorGPU
  prometheus:
    retention_time: 8h
  alertmanager:
    email:
      smtp_host: mail.hello.com
      from: hopsworks@logicalclocks.com
      to: sre@logicalclocks.com
  hopsworks:
    kagent_liveness:
      threshold: 40s
      enabled: 'true'
    featurestore_online: 'true'
    admin:
      password: 082bf372_201
      user: adminuser
    application_certificate_validity_period: 6d
    requests_verify: 'false'
    encryption_password: 082bf372_001
    https:
      port: '443'
    master:
      password: 082bf372_002
groups:
  metaserver:
    size: 1
    baremetal:
      ips:
      - 192.168.0.230
      sudoPassword: ''
    attrs: {}
    recipes:
    - ndb::mgmd
    - elastic::default
    - hadoop_spark::certs
    - hops::dn
    - flink::historyserver
    - hopslog::default
    - conda::default
    - hopsmonitor::default
    - hadoop_spark::yarn
    - hopslog::_filebeat-serving
    - hops::nm
    - hops::nn
    - hops::ndb
    - hopslog::_filebeat-beam
    - hops::rm
    - ndb::mysqld
    - hive2::default
    - livy::default
    - hops_airflow::default
    - hops::jhs
    - epipe::default
    - kkafka::default
    - tensorflow::default
    - hopslog::_filebeat-spark
    - hopslog::_filebeat-kagent
    - hadoop_spark::historyserver
    - hopsmonitor::node_exporter
    - kzookeeper::default
    - flink::yarn
    - hops_airflow::sqoop
    - hopsmonitor::prometheus
    - kagent::default
    - hopsmonitor::alertmanager
    - hopsworks::default
    - ndb::ndbd
    - consul::master
    - hopsmonitor::purge_telegraf

Hi @Fernando_Marines,

Is your disk only 16GB? The install script reports only 16GB of available disk space, and the installation needs more than that. As the install script points out, at least 60GB is recommended.
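
To confirm the free space before retrying, a check along these lines may help. It is only a sketch: `avail_gb` and `has_min_space` are hypothetical helper names, and the 50GB threshold is the minimum quoted in the installer warning above.

```shell
#!/usr/bin/env bash
# avail_gb PATH: print whole gigabytes available on the filesystem holding PATH.
# avail_gb and has_min_space are hypothetical helper names used only for this sketch.
avail_gb() {
  # -P forces one line per filesystem, -k reports 1K blocks; column 4 is "Available".
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# has_min_space PATH MIN_GB: succeed only if at least MIN_GB gigabytes are free.
has_min_space() {
  [ "$(avail_gb "$1")" -ge "$2" ]
}

# Example, using the 50GB minimum from the installer warning:
#   has_min_space / 50 || echo "not enough free space on /"
```

The same check can be pointed at whichever filesystem holds the install directory (`/srv/hops` in the cluster definition above) if that is on a separate partition.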

Hi @Alex, I noticed that warning when I was posting that message, so I cleaned the /tmp folder and started over, but it halts on the same step.

Hi.

I’m not sure why, but it might be that the directory /home/mysql/.ssh does not exist. If it doesn’t, create it and retry the recipe. ssh-keygen won’t create a directory that does not exist.
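
A minimal sketch of that workaround follows. `ensure_ssh_dir` is a hypothetical helper name, not something the kagent cookbook provides, and the mode 700 is the usual convention for a `.ssh` directory rather than a value taken from the cookbook.

```shell
#!/usr/bin/env bash
# ensure_ssh_dir HOME_DIR [OWNER[:GROUP]]
# Create HOME_DIR/.ssh with mode 700 if it is missing and optionally set its owner.
# ensure_ssh_dir is a hypothetical helper name used only for this sketch.
ensure_ssh_dir() {
  local dir="$1/.ssh"
  mkdir -p "$dir" || return 1
  chmod 700 "$dir" || return 1
  # Changing ownership normally requires root, so it is only done when requested.
  if [ -n "${2:-}" ]; then
    chown "$2" "$dir" || return 1
  fi
}

# Example (as root), matching the user and group of the failing resource:
#   ensure_ssh_dir /home/mysql mysql:mysql
```

The function is safe to run more than once, so it can be repeated before each retry of the recipe.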