Kagent not starting on Ubuntu 18.04: /bin/chown error

When starting services with /srv/hops/kagent/start-all-local-services.sh on an Ubuntu 18.04 AWS cluster, journalctl -xe reports the following:

chown[29488]: /bin/chown: cannot access ‘/srv/hops/kagent/kagent.pid’: No such file or directory

Failed to start timeout.

Any insights are appreciated.

Hi Peter,

kagent’s logs are in /srv/hops/kagent/logs. Please check whether anything is printed there.
If you run /srv/hops/kagent/kagent/bin/status-all-local-services.sh, the first dead service it lists is probably the one causing the failure.
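For anyone who wants to automate that check, here is a minimal Python sketch that scans the text printed by status-all-local-services.sh and returns the first service reported as DEAD. It assumes each status line ends with "Service <name> is alive" or "Service <name> is DEAD."; the function name first_dead_service is hypothetical and is not part of kagent.

```python
import re
from typing import Optional

# Assumed line shape: "... [service/alive] Service <name> is DEAD."
DEAD_LINE = re.compile(r"Service (\S+) is DEAD\.?\s*$")


def first_dead_service(status_output: str) -> Optional[str]:
    """Return the name of the first service reported as DEAD, or None."""
    for line in status_output.splitlines():
        match = DEAD_LINE.search(line)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    # Illustrative sample lines in the assumed format.
    sample = "\n".join([
        "2020-09-04 14:51:06 INFO [service/alive] Service consul is alive",
        "2020-09-04 14:51:06 ERROR [service/alive] Service kagent is DEAD.",
        "2020-09-04 14:51:06 ERROR [service/alive] Service kafka is DEAD.",
    ])
    print(first_dead_service(sample))
```

The first DEAD entry is only a starting point for investigation; the order in which the script lists services is not guaranteed to reflect dependency order.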

Also, could you give us some background on how you’re trying to install Hopsworks and on how many nodes?

Kind regards.

Thanks for the prompt response, Antonios.
Here is what I have in those logs:

/srv/hops/kagent/logs# cat conda_commands.log
2020-09-02 20:46:39,982 INFO python36 CLEAN python36 -1 WORKING
2020-09-02 20:46:41,232 INFO python36 CLEAN python36 0 SUCCESS
2020-09-02 20:46:43,247 INFO python36 CLEAN python36 -1 WORKING
2020-09-02 20:46:44,200 INFO python36 CLEAN python36 0 SUCCESS

From journalctl -xe:

journalctl -xe

Sep 03 16:40:17 ip-10-0-1-117 consul[3376]: 2020-09-03T16:40:17.342Z [WARN] agent: Check is now critical: check=nm-health-c

Sep 03 16:40:51 ip-10-0-1-117 consul[3376]: 2020-09-03T16:40:51.437Z [WARN] agent: Check is now critical: check=nm-health-c

Sep 03 16:41:16 ip-10-0-1-117 systemd[1]: kagent.service: Start operation timed out. Terminating.

Sep 03 16:41:16 ip-10-0-1-117 systemd[1]: kagent.service: Failed with result ‘timeout’.

Sep 03 16:41:16 ip-10-0-1-117 systemd[1]: Failed to start Kagent, monitors/controls Hops services.

-- Subject: Unit kagent.service has failed

-- Defined-By: systemd

-- Support: http://www.ubuntu.com/support

-- Unit kagent.service has failed.

-- The result is RESULT.

Sep 03 16:41:22 ip-10-0-1-117 systemd[1]: kagent.service: Service hold-off time over, scheduling restart.

Sep 03 16:41:22 ip-10-0-1-117 systemd[1]: kagent.service: Scheduled restart job, restart counter is at 724.

-- Subject: Automatic restarting of a unit has been scheduled

-- Defined-By: systemd

-- Support: http://www.ubuntu.com/support

-- Automatic restarting of the unit kagent.service has been scheduled, as the result for

-- the configured Restart= setting for the unit.

Sep 03 16:41:22 ip-10-0-1-117 systemd[1]: Stopped Kagent, monitors/controls Hops services.

-- Subject: Unit kagent.service has finished shutting down

-- Defined-By: systemd

-- Support: http://www.ubuntu.com/support

-- Unit kagent.service has finished shutting down.

Sep 03 16:41:22 ip-10-0-1-117 systemd[1]: Starting Kagent, monitors/controls Hops services...

-- Subject: Unit kagent.service has begun start-up

-- Defined-By: systemd

-- Support: http://www.ubuntu.com/support

-- Unit kagent.service has begun starting up.

Sep 03 16:41:22 ip-10-0-1-117 chown[6200]: /bin/chown: cannot access ‘/srv/hops/kagent/kagent.pid’: No such file or directory

Sep 03 16:41:22 ip-10-0-1-117 start-agent.sh[6201]: Checking if the agent is running...

Sep 03 16:41:22 ip-10-0-1-117 start-agent.sh[6201]: PID is

Sep 03 16:41:22 ip-10-0-1-117 start-agent.sh[6201]: Starting the agent...

Sep 03 16:41:23 ip-10-0-1-117 start-agent.sh[6201]: PID is 6209

Sep 03 16:41:23 ip-10-0-1-117 systemd[1]: kagent.service: New main PID 6209 does not exist or is a zombie.

Sep 03 16:41:25 ip-10-0-1-117 consul[3376]: 2020-09-03T16:41:25.534Z [WARN] agent: Check is now critical: check=nm-health-c

Sep 03 16:41:46 ip-10-0-1-117 sshd[6386]: Accepted publickey for ubuntu from 10.0.1.154 port 53876 ssh2: RSA SHA256:VFzhOLrxXSkx

Sep 03 16:41:46 ip-10-0-1-117 sshd[6386]: pam_unix(sshd:session): session opened for user ubuntu by (uid=0)

Sep 03 16:41:46 ip-10-0-1-117 systemd-logind[1062]: New session 61 of user ubuntu.

-- Subject: A new session 61 has been created for user ubuntu

Hello Antonios,

You asked about nodes. Currently we have two nodes: one primary and one secondary. Our goal is to grow the cluster to at least four nodes to demonstrate the distributed capability of Hopsworks.

Hi again.

In /srv/hops/kagent/logs there is the log file of the agent itself, agent.log. That is the log file I’m most interested in.

Also, can you show me the output of /srv/hops/kagent/kagent/bin/status-all-local-services.sh?

Thanks

Here is the output of agent.log, truncated because it seems to only repeat:

/srv/hops/kagent/logs# cat agent.log

2020-09-04 14:35:55,128 INFO [agent/setupLogging] Hops-Kagent started.

2020-09-04 14:35:55,128 INFO [agent/setupLogging] Heartbeat URL: https://hopsworks.glassfish.service.consul:443/hopsworks-api/api/agentresource?action=heartbeat

2020-09-04 14:35:55,128 INFO [agent/setupLogging] Host Id: ip-10-0-1-154

2020-09-04 14:35:55,128 INFO [agent/setupLogging] Hostname: ip-10-0-1-154

2020-09-04 14:35:55,128 INFO [agent/setupLogging] Public IP: 10.0.1.154

2020-09-04 14:35:55,129 INFO [agent/setupLogging] Private IP: 10.0.1.154

2020-09-04 14:35:55,134 INFO [agent/] Hops Kagent PID: 23027

2020-09-04 14:35:55,138 INFO [agent/run] Starting commands handling thread

2020-09-04 14:35:55,142 INFO [agent/send] Logging in to Hopsworks...

2020-09-04 14:35:55,142 INFO [host_services_watcher_action/action] Service Service: Consul/consul - State: INIT started

2020-09-04 14:35:55,163 INFO [host_services_watcher_action/action] Service Service: kafka/zookeeper - State: INIT started

2020-09-04 14:35:55,175 INFO [agent/] RESTful service started.

2020-09-04 14:35:55,642 ERROR [service/alive] Service filebeat-beamjobservercluster is DEAD.

2020-09-04 14:35:55,656 ERROR [service/alive] Service kafka is DEAD.

2020-09-04 14:35:55,676 ERROR [service/alive] Service filebeat-beamjobserverlocal is DEAD.

2020-09-04 14:35:55,703 ERROR [service/alive] Service filebeat-spark is DEAD.

2020-09-04 14:35:55,723 ERROR [service/alive] Service filebeat-kagent is DEAD.

2020-09-04 14:35:55,755 ERROR [service/alive] Service filebeat-tf-serving is DEAD.

2020-09-04 14:35:55,768 ERROR [service/alive] Service kagent is DEAD.

2020-09-04 14:35:55,775 ERROR [service/alive] Service historyserver is DEAD.

2020-09-04 14:35:55,815 ERROR [service/alive] Service nodemanager is DEAD.

2020-09-04 14:35:55,835 ERROR [service/alive] Service filebeat-sklearn-serving is DEAD.

2020-09-04 14:35:55,875 ERROR [service/alive] Service filebeat-beamsdkworker is DEAD.

2020-09-04 14:35:57,908 ERROR [service/alive] Service filebeat-beamjobservercluster is DEAD.

2020-09-04 14:35:57,922 ERROR [service/alive] Service kafka is DEAD.

2020-09-04 14:35:57,940 ERROR [service/alive] Service filebeat-beamjobserverlocal is DEAD.

2020-09-04 14:35:57,965 ERROR [service/alive] Service filebeat-spark is DEAD.

2020-09-04 14:35:57,983 ERROR [service/alive] Service filebeat-kagent is DEAD.

2020-09-04 14:35:58,015 ERROR [service/alive] Service filebeat-tf-serving is DEAD.

2020-09-04 14:35:58,027 ERROR [service/alive] Service kagent is DEAD.

2020-09-04 14:35:58,034 ERROR [service/alive] Service historyserver is DEAD.

2020-09-04 14:35:58,070 ERROR [service/alive] Service nodemanager is DEAD.

2020-09-04 14:35:58,089 ERROR [service/alive] Service filebeat-sklearn-serving is DEAD.

2020-09-04 14:35:58,125 ERROR [service/alive] Service filebeat-beamsdkworker is DEAD.

2020-09-04 14:36:00,150 ERROR [service/alive] Service filebeat-beamjobservercluster is DEAD.

2020-09-04 14:36:00,162 ERROR [service/alive] Service kafka is DEAD.

2020-09-04 14:36:00,180 ERROR [service/alive] Service filebeat-beamjobserverlocal is DEAD.

2020-09-04 14:36:00,205 ERROR [service/alive] Service filebeat-spark is DEAD.

2020-09-04 14:36:00,224 ERROR [service/alive] Service filebeat-kagent is DEAD.

2020-09-04 14:36:00,254 ERROR [service/alive] Service filebeat-tf-serving is DEAD.

2020-09-04 14:36:00,266 ERROR [service/alive] Service kagent is DEAD.

2020-09-04 14:36:00,272 ERROR [service/alive] Service historyserver is DEAD.

2020-09-04 14:36:00,309 ERROR [service/alive] Service nodemanager is DEAD.

2020-09-04 14:36:00,327 ERROR [service/alive] Service filebeat-sklearn-serving is DEAD.

2020-09-04 14:36:00,370 ERROR [service/alive] Service filebeat-beamsdkworker is DEAD.

2020-09-04 14:36:02,399 ERROR [service/alive] Service filebeat-beamjobservercluster is DEAD.

2020-09-04 14:36:02,413 ERROR [service/alive] Service kafka is DEAD.

2020-09-04 14:36:02,436 ERROR [service/alive] Service filebeat-beamjobserverlocal is DEAD.

2020-09-04 14:36:02,466 ERROR [service/alive] Service filebeat-spark is DEAD.

2020-09-04 14:36:02,486 ERROR [service/alive] Service filebeat-kagent is DEAD.

2020-09-04 14:36:02,518 ERROR [service/alive] Service filebeat-tf-serving is DEAD.

2020-09-04 14:36:02,530 ERROR [service/alive] Service kagent is DEAD.

2020-09-04 14:36:02,536 ERROR [service/alive] Service historyserver is DEAD.

2020-09-04 14:36:02,573 ERROR [service/alive] Service nodemanager is DEAD.
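Since the log above repeats the same block every couple of seconds, a small script can condense it. The following is a minimal Python sketch, not part of kagent, that counts how often each service is reported DEAD. It assumes the line format shown above ("ERROR [service/alive] Service <name> is DEAD."); the function names count_dead_reports and summarize are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, taken from the agent.log excerpt in this thread.
DEAD_LINE = re.compile(r"ERROR \[service/alive\] Service (\S+) is DEAD\.")


def count_dead_reports(log_text: str) -> Counter:
    """Count how many times each service is reported DEAD in the log text."""
    return Counter(DEAD_LINE.findall(log_text))


def summarize(log_text: str) -> str:
    """One line per service, sorted by name, with its DEAD report count."""
    counts = count_dead_reports(log_text)
    return "\n".join(
        f"{name}: reported DEAD {n} time(s)" for name, n in sorted(counts.items())
    )


if __name__ == "__main__":
    # Illustrative sample in the assumed format.
    sample = (
        "2020-09-04 14:35:55,656 ERROR [service/alive] Service kafka is DEAD.\n"
        "2020-09-04 14:35:55,768 ERROR [service/alive] Service kagent is DEAD.\n"
        "2020-09-04 14:35:57,922 ERROR [service/alive] Service kafka is DEAD.\n"
    )
    print(summarize(sample))
```

To run it against the real file, read /srv/hops/kagent/logs/agent.log and pass its contents to summarize.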

Here is the output of status-all-local-services.sh, run shortly after start-all-local-services.sh:

/srv/hops/kagent/kagent/bin# ./status-all-local-services.sh

2020-09-04 14:51:05 INFO [agent/setupLogging] Hops-Kagent started.

2020-09-04 14:51:05 INFO [agent/setupLogging] Heartbeat URL: https://hopsworks.glassfish.service.consul:443/hopsworks-api/api/agentresource?action=heartbeat

2020-09-04 14:51:05 INFO [agent/setupLogging] Host Id: ip-10-0-1-154

2020-09-04 14:51:05 INFO [agent/setupLogging] Hostname: ip-10-0-1-154

2020-09-04 14:51:05 INFO [agent/setupLogging] Public IP: 10.0.1.154

2020-09-04 14:51:05 INFO [agent/setupLogging] Private IP: 10.0.1.154

2020-09-04 14:51:06 INFO [service/alive] Service node_exporter is alive

2020-09-04 14:51:06 INFO [service/alive] Service alertmanager is alive

2020-09-04 14:51:06 INFO [service/alive] Service prometheus is alive

2020-09-04 14:51:06 INFO [service/alive] Service ndb_mgmd is alive

2020-09-04 14:51:06 INFO [service/alive] Service ndbmtd is alive

2020-09-04 14:51:06 INFO [service/alive] Service mysqld is alive

2020-09-04 14:51:06 INFO [service/alive] Service mysqld_exporter is alive

2020-09-04 14:51:06 INFO [service/alive] Service airflow-webserver is alive

2020-09-04 14:51:06 INFO [service/alive] Service airflow-scheduler is alive

2020-09-04 14:51:06 INFO [service/alive] Service glassfish-domain1 is alive

2020-09-04 14:51:06 ERROR [service/alive] Service kagent is DEAD.

2020-09-04 14:51:06 INFO [service/alive] Service consul is alive

2020-09-04 14:51:06 INFO [service/alive] Service influxdb is alive

2020-09-04 14:51:06 INFO [service/alive] Service grafana is alive

2020-09-04 14:51:06 INFO [service/alive] Service sqoop is alive

2020-09-04 14:51:06 INFO [service/alive] Service elasticsearch is alive

2020-09-04 14:51:06 INFO [service/alive] Service elastic_exporter is alive

2020-09-04 14:51:06 INFO [service/alive] Service namenode is alive

2020-09-04 14:51:06 INFO [service/alive] Service zookeeper is alive

2020-09-04 14:51:06 INFO [service/alive] Service datanode is alive

2020-09-04 14:51:06 ERROR [service/alive] Service kafka is DEAD.

2020-09-04 14:51:06 INFO [service/alive] Service epipe is alive

2020-09-04 14:51:06 INFO [service/alive] Service hivemetastore is alive

2020-09-04 14:51:06 INFO [service/alive] Service hiveserver2 is alive

2020-09-04 14:51:06 INFO [service/alive] Service logstash is alive

2020-09-04 14:51:06 INFO [service/alive] Service kibana is alive

2020-09-04 14:51:06 ERROR [service/alive] Service historyserver is DEAD.

2020-09-04 14:51:06 INFO [service/alive] Service resourcemanager is alive

2020-09-04 14:51:06 INFO [service/alive] Service sparkhistoryserver is alive

2020-09-04 14:51:06 ERROR [service/alive] Service filebeat-spark is DEAD.

2020-09-04 14:51:06 ERROR [service/alive] Service filebeat-kagent is DEAD.

2020-09-04 14:51:06 ERROR [service/alive] Service filebeat-tf-serving is DEAD.

2020-09-04 14:51:06 ERROR [service/alive] Service filebeat-sklearn-serving is DEAD.

2020-09-04 14:51:06 ERROR [service/alive] Service filebeat-beamjobservercluster is DEAD.

2020-09-04 14:51:06 ERROR [service/alive] Service filebeat-beamjobserverlocal is DEAD.

2020-09-04 14:51:06 ERROR [service/alive] Service filebeat-beamsdkworker is DEAD.

2020-09-04 14:51:06 ERROR [service/alive] Service nodemanager is DEAD.

2020-09-04 14:51:06 INFO [service/alive] Service livy is alive

2020-09-04 14:51:06 INFO [service/alive] Service flinkhistoryserver is alive

Alright. Run sudo systemctl stop kagent, then check with sudo ps aux | grep kagent and kill any remaining kagent process. Then try starting it without systemd: switch to the kagent user with sudo su kagent and run /srv/hops/anaconda/anaconda/envs/hops-system/bin/python /srv/hops/kagent/kagent/agent.py --config /srv/hops/kagent/etc/config.ini

This will give you a bit more information if something is wrong with kagent. To kill it, press Ctrl+C, and confirm with sudo ps aux | grep kagent that no kagent instance is running before you try again with systemd.
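One catch with ps aux | grep kagent is that the grep process itself always shows up in the result. As a rough illustration, here is a minimal Python sketch that filters captured ps aux output down to lines whose command mentions kagent, skipping the grep line. The function name kagent_processes is hypothetical, and the sketch assumes the standard 11-column ps aux layout with the command in the last column.

```python
from typing import List


def kagent_processes(ps_output: str) -> List[str]:
    """Return `ps aux` lines whose command column mentions kagent.

    The grep command itself is skipped, since `ps aux | grep kagent`
    always lists its own grep process.
    """
    matches = []
    for line in ps_output.splitlines():
        # `ps aux` prints 11 columns; the command is the last one.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10]
        if "kagent" not in command or command.startswith("grep "):
            continue
        matches.append(line)
    return matches


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = "\n".join([
        "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND",
        "kagent 23027 0.5 0.3 123456 7890 ? Sl 14:35 0:01 "
        "python /srv/hops/kagent/kagent/agent.py --config /srv/hops/kagent/etc/config.ini",
        "root 23100 0.0 0.0 14856 1100 pts/0 S+ 14:36 0:00 grep --color=auto kagent",
    ])
    for proc in kagent_processes(sample):
        print(proc)
```

An empty result suggests no kagent instance is left, so it should be safe to start the service through systemd again.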

Thanks, Antonios; this provided helpful information. It warns that we have not installed a certificate yet:
/srv/hops/anaconda/anaconda/envs/hops-system/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

InsecureRequestWarning)

2020-09-04 15:45:27 ERROR [service/alive] Service filebeat-beamjobservercluster is DEAD.

2020-09-04 15:45:27 ERROR [service/alive] Service kafka is DEAD.

2020-09-04 15:45:28 ERROR [service/alive] Service filebeat-beamjobserverlocal is DEAD.

2020-09-04 15:45:28 ERROR [service/alive] Service filebeat-spark is DEAD.

2020-09-04 15:45:28 ERROR [service/alive] Service filebeat-kagent is DEAD.

2020-09-04 15:45:28 ERROR [service/alive] Service filebeat-tf-serving is DEAD.

2020-09-04 15:45:28 ERROR [service/alive] Service elastic_exporter is DEAD.

2020-09-04 15:45:28 ERROR [service/alive] Service kagent is DEAD.

2020-09-04 15:45:28 ERROR [service/alive] Service historyserver is DEAD.

2020-09-04 15:45:28 ERROR [service/alive] Service nodemanager is DEAD.

2020-09-04 15:45:28 ERROR [service/alive] Service filebeat-sklearn-serving is DEAD.

2020-09-04 15:45:28 ERROR [service/alive] Service filebeat-beamsdkworker is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service filebeat-beamjobservercluster is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service kafka is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service filebeat-beamjobserverlocal is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service filebeat-spark is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service filebeat-kagent is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service filebeat-tf-serving is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service elastic_exporter is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service kagent is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service historyserver is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service nodemanager is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service filebeat-sklearn-serving is DEAD.

2020-09-04 15:45:30 ERROR [service/alive] Service filebeat-beamsdkworker is DEAD.

Hi.

That’s not the problem here; it’s only a warning. I still don’t see any error in the logs, except that some services are not running.

Have you managed to start kagent?