Never able to re-deploy after first install (Hopsworks::default)

joakimeriksson · July 6, 2021, 1:42pm

I am trying to redeploy Hopsworks on a machine running Ubuntu 18.04. It works fine until it reaches hopsworks::default, where it hits an error:

[2021-07-06T13:35:28+00:00] ERROR: Running exception handlers
Running handlers complete
[2021-07-06T13:35:28+00:00] ERROR: Exception handlers complete
Chef Client failed. 6 resources updated in 14 seconds
[2021-07-06T13:35:28+00:00] FATAL: Stacktrace dumped to /tmp/chef-solo/chef-stacktrace.out
[2021-07-06T13:35:28+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2021-07-06T13:35:28+00:00] FATAL: RuntimeError: ruby_block[check_db_empty] (hopsworks::default line 197) had an error: RuntimeError: You are trying to initialize the database, but the database is not empty. Either there is a failed migration, or you forgot to set the current_version attribute

We got some hints about deleting the Hopsworks database in MySQL, but that does not seem to make any difference.

Any clues on how to get further without completely wiping the machine?

/Joakim

joakimeriksson · July 6, 2021, 5:53pm

This seems to be the actual issue:

 ...  /srv/hops/glassfish/versions/current/glassfish/bin/asadmin --terse=false --echo=true --user adminuser --passwordfile=/srv/hops/domains/domain1_admin_passwd --port 4848 set resources.managed-executor-service.concurrent\/hopsExecutorService.thread-priority=10  [...]

    STDOUT: asadmin --host localhost --port 4848 --user adminuser --passwordfile /srv/hops/domains/domain1_admin_passwd --interactive=false --echo=true --terse=false set resources.managed-executor-service.concurrent/hopsExecutorService.thread-priority=10
Command set failed.


 Compiled Resource:
      ------------------
      # Declared in /tmp/chef-solo/cookbooks/glassfish/providers/asadmin.rb:22:in `block in class_from_file'
      
      execute("asadmin set resources.managed-executor-service.concurrent\/hopsExecutorService.thread-priority=10") do
        action [:run]
        default_guard_interpreter :execute
        command "/srv/hops/glassfish/versions/current/glassfish/bin/asadmin --terse=false --echo=true --user adminuser --passwordfile=/srv/hops/domains/domain1_admin_passwd --port 4848 set resources.managed-executor-service.concurrent\\/hopsExecutorService.thread-priority=10"
        backup 5
        declared_type :execute
        cookbook_name "hopsworks"
        timeout 405
        user "glassfish"
        group "glassfish"
        domain nil
        returns 0
      end
      
      System Info:
      ------------
      chef_version=14.10.9
      platform=ubuntu
      platform_version=18.04
      ruby=ruby 2.5.3p105 (2018-10-18 revision 65156) [x86_64-linux]
      program_name=/usr/bin/chef-solo
      executable=/opt/chefdk/bin/chef-solo

(After this, hopsworks::default is restarted, this error is overwritten, and the next execution fails with the "database is not empty" error.)

Before this happens, the previous task (hops::ndb) also seems to be executed twice:

INFO  [2021-07-06 17:42:13,679] se.kth.karamel.backend.machines.SshMachine: 192.168.1.167: Running task: hops::ndb
INFO  [2021-07-06 17:42:28,095] se.kth.karamel.backend.machines.SshMachine: 192.168.1.167: Running task: hops::ndb
WARN  [2021-07-06 17:44:11,783] net.schmizz.sshj.xfer.scp.SCPEngine: SCP exit status: 1

Alex · July 12, 2021, 10:09am

Hi Joakim,

Can you please give a few more details on the installation?
Did you previously install Hopsworks on this machine with the Hopsworks installer (Hopsworks Installer, 2.2 documentation)?
Did you try cleaning the machine of Hopsworks artifacts before trying to re-install? For example, did you run the purge step from the documentation above?
./hopsworks-installer.sh -i purge -ni
If you run the command that failed directly on the machine, what output do you get?
/srv/hops/glassfish/versions/current/bin/asadmin --host localhost --port 4848 --user adminuser --passwordfile /srv/hops/domains/domain1_admin_passwd --interactive=false --echo=true --terse=false set resources.managed-executor-service.concurrent/hopsExecutorService.thread-priority=10

Regards,
Alex

joakimeriksson · August 1, 2021, 7:12am

Hi Alex,

Yes, I did the purge before reinstalling. I also tried running the failed command, and it complained about a missing executor service (hopsExecutorService). I am not sure why it was missing, but that seems to be the case each time I deploy on a machine that already had a completed installation (even after a purge).

(I have now done a reinstall and have it up and running, but I will try later to reproduce the error and capture the output.)

Best regards,
/Joakim

joakimeriksson · August 17, 2021, 1:26pm

I finally reproduced the failure. Here is a copy of the log file (hopsworks__default.log). I did not capture the very beginning, but I think I got the first errors, and the last ones too.
hopsworks__default.log.zip (6.7 KB)

joakimeriksson · August 31, 2021, 9:44am

Ping! Anyone looking at the issue?

antonios · September 1, 2021, 1:36pm

Hi @joakimeriksson

To completely purge the installation, first shut down all services using the script /srv/hops/kagent/kagent/bin/shutdown-all-local-services.sh. Then delete the installation directory; unless you changed it, it defaults to /srv/hops, so run sudo rm -rf /srv/hops. Finally, delete the /etc/docker directory. If you have installed Kubernetes, also delete /etc/kubernetes and the files in /home/kubernetes. Then you can retry the installation.
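
For reference, the steps above could be scripted roughly as follows. This is only a dry-run sketch: the paths are the ones named in this post, while the variable names (HOPS_DIR, DRY_RUN, WITH_KUBERNETES) and the helper functions are made up for illustration. It prints the commands by default and only executes them when DRY_RUN=0 is set explicitly.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the manual purge steps. Variable and function names
# are hypothetical; only the paths come from the post above.

HOPS_DIR="${HOPS_DIR:-/srv/hops}"   # default installation directory
DRY_RUN="${DRY_RUN:-1}"             # 1 = print commands only, 0 = execute

run() {
  # Always show the command; execute it only when dry-run is disabled.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

purge_hopsworks() {
  # 1. Stop all services first (the script may need root; that is an assumption).
  run "$HOPS_DIR/kagent/kagent/bin/shutdown-all-local-services.sh"
  # 2. Remove the installation directory.
  run sudo rm -rf "$HOPS_DIR"
  # 3. Remove the Docker configuration directory.
  run sudo rm -rf /etc/docker
  # 4. Only when Kubernetes was installed: remove its config and the
  #    contents of /home/kubernetes (one reading of "the files in").
  if [ "${WITH_KUBERNETES:-0}" = "1" ]; then
    run sudo rm -rf /etc/kubernetes
    run sudo find /home/kubernetes -mindepth 1 -delete
  fi
}

purge_hopsworks
```

Run it once as-is to review the printed commands, then again with DRY_RUN=0 (and WITH_KUBERNETES=1 if applicable) once they look right for your machine.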

joakimeriksson · September 2, 2021, 3:47pm

Ok, will try and get back as soon as I have any results.

joakimeriksson · September 2, 2021, 4:39pm

No, it seems like I get the same issue:
"INFO [2021-09-02 16:13:21,457] se.kth.karamel.backend.machines.SshMachine: 10.10.124.23: Running task: hopsworks::default
INFO [2021-09-02 16:21:02,353] se.kth.karamel.backend.machines.SshMachine: 10.10.124.23: Running task: hopsworks::default
ERROR [2021-09-02 16:21:29,386] se.kth.karamel.backend.dag.DagNode: Failed 'hopsworks::default on 10.10.124.23' because '10.10.124.23: Command did not complete: mkdir -p /home/ubuntu/.karamel/install ; cd /home/ubuntu/.karamel/install; e"

Hopsworks::default still fails - probably at the same place.

Is there anything else installed by the scripts that does not end up under /srv/…? Some Java config or something else?

antonios · September 9, 2021, 7:59am

If you deleted /srv/hops after stopping all services, then there isn’t any other place. What’s in the log /home/ubuntu/.karamel/install/hopsworks__default.log?

joakimeriksson · September 20, 2021, 11:46am

That log contains many entries like this:

---- Begin output of /srv/hops/glassfish/versions/current/glassfish/bin/asadmin --terse=false --echo=true --user adminuser --passwordfile=/srv/hops/domains/domain1_admin_passwd --port 4848 set resources.managed-executor-service.concurrent\/hopsExecutorService.thread-priority=10 ----
STDOUT: asadmin --host localhost --port 4848 --user adminuser --passwordfile /srv/hops/domains/domain1_admin_passwd --interactive=false --echo=true --terse=false set resources.managed-executor-service.concurrent/hopsExecutorService.thread-priority=10
Command set failed.
STDERR: remote failure: No configuration found for resources.managed-executor-service.concurrent/hopsExecutorService
---- End output of /srv/hops/glassfish/versions/current/glassfish/bin/asadmin --terse=false --echo=true --user adminuser --passwordfile=/srv/hops/domains/domain1_admin_passwd --port 4848 set resources.managed-executor-service.concurrent\/hopsExecutorService.thread-priority=10 ----
...
Ran /srv/hops/glassfish/versions/current/glassfish/bin/asadmin --terse=false --echo=true --user adminuser --passwordfile=/srv/hops/domains/domain1_admin_passwd --port 4848 set resources.managed-executor-service.concurrent\/hopsExecutorService.thread-priority=10 returned 1
...

This happens before it re-runs and fails with the "database is not empty" error.

So it seems like something in GlassFish fails for some unknown reason.

The log is in the zip above (or do you mean another log?).
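
For anyone digging through a similar log, a quick way to see which asadmin errors dominate is to filter for the STDERR lines. The sketch below assumes the Chef log is at the path mentioned earlier in this thread and that it may contain ANSI colour escape sequences; the function name asadmin_errors is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: summarise the distinct STDERR messages in a Chef recipe log.
# The default path is the one mentioned in this thread; pass another
# path as the first argument if your log lives elsewhere.

asadmin_errors() {
  # $1: path to the log file
  local esc
  esc="$(printf '\033')"
  # Strip ANSI colour sequences (ESC[...m), keep only STDERR lines,
  # trim leading whitespace, then count identical messages.
  sed "s/${esc}\[[0-9;]*m//g" "$1" \
    | grep 'STDERR:' \
    | sed 's/^[[:space:]]*//' \
    | sort | uniq -c | sort -rn
}

LOG="${1:-/home/ubuntu/.karamel/install/hopsworks__default.log}"
if [ -r "$LOG" ]; then
  asadmin_errors "$LOG"
else
  echo "log not found: $LOG" >&2
fi
```

Each output line is a count followed by the message, so a repeated "No configuration found" error stands out immediately even when the recipe has been retried several times.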