"kagent::default" fails to install in a cluster environment

I’m trying to deploy Hopsworks in a cluster environment (1 master node and 2 worker nodes), but the installation of “kagent::default” fails on the worker nodes.
The relevant log output is as follows:

Section 1:

  * template[/srv/hops/kagent/kagent/bin/edit-config-ini-inplace.py] action create (up to date)
  * template[/srv/hops/kagent/kagent/bin/edit-and-start.sh] action create (up to date)
  * template[/srv/hops/kagent/etc/config.ini] action create (up to date)
  * bash[chown_/srv/hops/kagent/host-certs] action run
    - execute "bash" "/tmp/chef-script20210105-18166-z0wv3p"
  * template[/srv/hops/kagent/host-certs/keystore.sh] action create (up to date)
  * kagent_hopsify[Register Host] action register_hostcerts
    * bash[Register Host with Hopsworks] action run

      ================================================================================
      Error executing action run on resource 'bash[Register Host with Hopsworks]'
      ================================================================================

      Mixlib::ShellOut::ShellCommandFailed
      ------------------------------------
      Expected process to exit with [0], but received '1'
      ---- Begin output of "bash" "/tmp/chef-script20210105-18166-ncisi8" ----
      STDOUT: time="2021-01-05T10:17:02+08:00" level=info msg="Executing host command"
      time="2021-01-05T10:17:02+08:00" level=info msg="Server url https://10.12.9.220:443"
      time="2021-01-05T10:17:02+08:00" level=info msg="Successfully logged in"
      time="2021-01-05T10:17:02+08:00" level=error msg="Failed to perform HTTP operation - status: 404 Retrying… {"type":"restApiJsonResponse","errorCode":100025,"errorMsg":"Host was not found.","usrMsg":"hostname: dwfainode1"}"

Section 2:

Ran "bash" "/tmp/chef-script20210105-18166-ncisi8" returned 1

Resource Declaration:
---------------------
# In /tmp/chef-solo/cookbooks/kagent/providers/hopsify.rb

  6: bash "Register Host with Hopsworks" do
  7: user node['kagent']['certs_user']
  8: group node['kagent']['group']
  9: puts node['kagent']['certs_user']
 10: code <<-EOH
 11: #{node["kagent"]["certs_dir"]}/hopsify --config #{node['kagent']['etc']}/config.ini #{hopsworks_alt_url} host
 12: EOH
 13: end
 14: end

Compiled Resource:
------------------
# Declared in /tmp/chef-solo/cookbooks/kagent/providers/hopsify.rb:6:in `block in class_from_file'

bash("Register Host with Hopsworks") do
  action [:run]
  default_guard_interpreter :default
  command nil
  backup 5
  interpreter "bash"
  declared_type :bash
  cookbook_name "kagent"
  code " /srv/hops/kagent/host-certs/hopsify --config /srv/hops/kagent/etc/config.ini --alt-url https://10.12.9.220:443 host\n"
  domain nil
  user "certs"
  group "kagent"
end

System Info:
------------
chef_version=14.10.9
platform=centos
platform_version=7.9.2009
ruby=ruby 2.5.3p105 (2018-10-18 revision 65156) [x86_64-linux]
program_name=/bin/chef-solo
executable=/opt/chefdk/bin/chef-solo

================================================================================
Error executing action register_host on resource 'kagent_hopsify[Register Host]'
================================================================================

Mixlib::ShellOut::ShellCommandFailed
------------------------------------
bash[Register Host with Hopsworks] (/tmp/chef-solo/cookbooks/kagent/providers/hopsify.rb line 6) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of "bash" "/tmp/chef-script20210105-18166-ncisi8" ----
STDOUT: time="2021-01-05T10:17:02+08:00" level=info msg="Executing host command"
time="2021-01-05T10:17:02+08:00" level=info msg="Server url https://10.12.9.220:443"
time="2021-01-05T10:17:02+08:00" level=info msg="Successfully logged in"
time="2021-01-05T10:17:02+08:00" level=error msg="Failed to perform HTTP operation - status: 404 Retrying… {"type":"restApiJsonResponse","errorCode":100025,"errorMsg":"Host was not found.","usrMsg":"hostname: dwfainode1"}"

Port 443 is used by Glassfish, which is deployed on the master node. It looks as though the worker nodes cannot access the Glassfish services there, or perhaps some Glassfish services did not start.

Has anyone encountered this problem?
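
One way to test the connectivity hypothesis above from a worker node is a plain TCP connection attempt to the master’s HTTPS port. The following is a minimal Python sketch and not part of Hopsworks; the helper name can_connect is hypothetical, and the address 10.12.9.220:443 is taken from the log. Since the log also shows “Successfully logged in”, a successful check here would suggest the 404 is not a network problem.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # on refusal, timeout, or an unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on a worker node, can_connect("10.12.9.220", 443) would report whether the Glassfish port on the master is reachable at the TCP level. It says nothing about whether the application behind the port is healthy.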

If you ssh into the host where the error happened, does it have connectivity to dwfainode1? Can you ping it? What does “dig dwfainode1” return?

Hi Jim,
The results of running the commands are as follows:

SSH:

[appadm@dwfaihead ~]$ ssh dwfainode1
Last login: Tue Jan 5 17:00:43 2021 from dwfaihead
[appadm@dwfainode1 ~]$ exit
logout
Connection to dwfainode1 closed.

Ping:

[appadm@dwfaihead ~]$ ping dwfainode1
PING dwfainode1 (10.12.9.231) 56(84) bytes of data.
64 bytes from dwfainode1 (10.12.9.231): icmp_seq=1 ttl=64 time=0.201 ms
64 bytes from dwfainode1 (10.12.9.231): icmp_seq=2 ttl=64 time=0.206 ms
64 bytes from dwfainode1 (10.12.9.231): icmp_seq=3 ttl=64 time=0.183 ms

Dig dwfainode1:

[appadm@dwfaihead ~]$ dig dwfainode1

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 <<>> dwfainode1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 55802
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;dwfainode1. IN A

;; Query time: 0 msec
;; SERVER: 10.12.7.43#53(10.12.7.43)
;; WHEN: Tue Jan 05 17:11:04 CST 2021
;; MSG SIZE rcvd: 39

The IP address and host name of every node are configured in “/etc/hosts” on each node, and every node in the cluster can ssh into the others without a password.
However, the IP addresses and host names of the nodes have not been added to the DNS server. Do I have to configure the mapping between IP addresses and host names in the DNS server?
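
A note on reading the dig output above: dig sends its query straight to the DNS server and does not read /etc/hosts, so the SERVFAIL does not by itself show that name resolution is broken on the node. ping, like most applications, goes through the system resolver, which is why it found 10.12.9.231. The following minimal Python sketch uses the same system resolver; the helper name resolve is hypothetical, and the assumption that /etc/hosts is consulted before DNS depends on the “hosts” line in /etc/nsswitch.conf.

```python
import socket
from typing import List


def resolve(hostname: str) -> List[str]:
    """Resolve a host name via the system resolver; return [] on failure."""
    try:
        # getaddrinfo uses the OS resolver (typically /etc/hosts, then DNS),
        # unlike dig, which only queries the DNS server.
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})
```

If resolve("dwfainode1") returns the expected address on every node, the nodes can resolve each other without DNS entries. Whether Hopsworks itself additionally requires DNS is a separate question this check does not answer.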

Hi Jim,
By reading the source code (Class: HostsController, Method: findByHostname), I found that the host name is looked up in the MySQL cluster. So I would like to know at what point the host name is written to the database.
I remember that the installation script asked me to enter the information for each worker node, and I entered the IP address. Should I have entered the host name instead?


Additional information:
I have just read hopsworks-installer.sh:

add_worker()
{
    if [ "$WORKER_DEFAULTS" != "true" ] ; then
        printf 'Please enter the IP of the worker you want to add: '
        read WORKER_IP
    fi

    ssh -t -o StrictHostKeyChecking=no $WORKER_IP "whoami" > /dev/null
    if [ $? -ne 0 ] ; then
        echo "Failed to ssh using public into: ${USER}@${WORKER_IP}"
        echo "Cannot add worker node, as you need to be able to ssh into it using your public key"
        echo ""
        echo ""
        echo "You can setup passwordless SSH to setup to ${USER}@${WORKER_IP} by entering the password."
        echo "Running ssh-copy-id.... "
        ssh-copy-id -i ${HOME}/.ssh/id_rsa.pub ${USER}@${WORKER_IP}
        if [ $? -ne 0 ] ; then
            exit_error "Problem setting up passwordless SSH to ${USER}@${WORKER_IP}"
        fi
    fi

I think it’s right to enter an IP address.

It’s correct to enter an IP address.

Hi Jim,
The problem has been solved, and I have installed the cluster successfully.
The cause was that some host information had not been initialized correctly in the “Hosts” table, so I corrected the data manually.
But I still don’t know the root cause of this issue.

BTW, could you please tell me whether Hopsworks provides APIs for redevelopment?

I’m sorry, I don’t understand what you mean by “APIs for redevelopment”.
Do you mean upgrades? Or development APIs?

Hi Jim,
We may build our own applications on top of Hopsworks in the future, so I would like to know whether Hopsworks provides an SDK or a RESTful API that would let us develop them quickly.
I have searched the official website, but I didn’t find the relevant information.

We have the REST API here:

There is a Python API, much of which can be used in external clusters:

There is also a (more limited) Java API:

Hi Jim,
Thank you for the specific answer; these links are really helpful to us.

But when I tried to test these RESTful APIs (version 1.4), I ran into an authorization problem.
For example, when I called the endpoint https://Host_IP/hopsworks-api/api/admin/projects, the returned JSON was:

{
  "type": "restApiJsonResponse",
  "errorCode": 200003,
  "errorMsg": "Authorization header not set."
}

I noticed an “Api keys” menu under “Settings”, so I generated an API key with all scopes and set it as a header parameter. The returned JSON was:

{
  "type": "restApiJsonResponse",
  "errorCode": 200003,
  "errorMsg": "Invalidated Api key."
}

I tested with Postman, and the name of the header parameter was “Authorization”.
Could you tell me how to set the authorization value correctly?

Hi @Freeman

You can issue an API key and use it for authorization: https://hopsworks.readthedocs.io/en/stable/user_guide/hopsworks/apiKeys.html#api-keys
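
As a rough illustration of how the key is typically passed, here is a minimal Python sketch that only builds the request and sends nothing. It assumes the “Authorization” header value has the form “ApiKey <key>”, which is my reading of the linked documentation; please verify it there. The helper name build_request and the key value are placeholders, and Host_IP is the placeholder from the post above.

```python
import urllib.request


def build_request(url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request carrying the API key in the Authorization header."""
    # Assumption: the value is the literal prefix "ApiKey", a space, then the key.
    # Sending the bare key as the header value would not match this form.
    return urllib.request.Request(url, headers={"Authorization": f"ApiKey {api_key}"})


# Placeholder URL and key, for illustration only; nothing is sent here.
request = build_request(
    "https://Host_IP/hopsworks-api/api/admin/projects", "<your-api-key>"
)
print(request.get_header("Authorization"))
```

In Postman the equivalent would be a header named “Authorization” whose value starts with the “ApiKey ” prefix. Actually sending the request with urllib.request.urlopen may also need an SSL context that trusts the cluster’s certificate.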

Hi Theo,
I have read the relevant chapters, and this problem has been solved.
Thanks a lot. :smile: