Bare-metal -1 cluster install failure

Thank you for that. The installation finished that step and now has only two steps left; unfortunately, hops::nm failed:


hops::nm | FAILED | retry skip log | 73455

- change mode from '' to '0774'
- change owner from '' to 'root'
- restore selinux security context
* kagent_config[nodemanager] action systemd_reload
  * bash[start-if-not-running-nodemanager] action run
[2020-08-11T10:58:32-04:00] ERROR: bash[start-if-not-running-nodemanager] (/tmp/chef-solo/cookbooks/kagent/providers/config.rb line 49) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received '1'
---- Begin output of "bash" "/chef-script20200811-26266-1o725ri" ----
STDOUT:
STDERR: Job for nodemanager.service failed because the control process exited with error code. See "systemctl status nodemanager.service" and "journalctl -xe" for details.
---- End output of "bash" "/chef-script20200811-26266-1o725ri" ----
Ran "bash" "/chef-script20200811-26266-1o725ri" returned 1; ignore_failure is set, continuing

================================================================================
Error executing action `run` on resource 'bash[start-if-not-running-nodemanager]'
================================================================================

Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '1'
---- Begin output of "bash" "/chef-script20200811-26266-1o725ri" ----
STDOUT:
STDERR: Job for nodemanager.service failed because the control process exited with error code. See "systemctl status nodemanager.service" and "journalctl -xe" for details.
---- End output of "bash" "/chef-script20200811-26266-1o725ri" ----
Ran "bash" "/chef-script20200811-26266-1o725ri" returned 1

Resource Declaration:


journalctl -xe

Aug 11 12:37:10 per320-2.server airflow_runner.sh[33874]: [2020-08-11 12:37:10,418] {jobs.py:1559} INFO - Harvesting DAG parsing results
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: 2020-08-11 10:58:30,811 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: STARTUP_MSG:
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: /************************************************************
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: Starting NodeManager
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: user = yarn
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: host = per320-2.server/192.168.0.230
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: args = []
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: version = 2.8.2.10-RC1
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: classpath = /srv/hops/hadoop/etc/hadoop:/srv/hops/hadoop/etc/hadoop:/srv/hops/hadoop/et
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: build = git@github.com:hopshadoop/hops.git -r abc3abcdf67a1e49fd6d0b8b381a23a677996a18;
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: java = 1.8.0_262
Aug 11 12:37:11 per320-2.server systemd[1]: nodemanager.service: control process exited, code=exited status=1
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: /srv/hops/hadoop/sbin/start-nm.sh: line 24: kill: (19965) - No such process
Aug 11 12:37:11 per320-2.server systemd[1]: Failed to start NodeManager. The Processing Nodes for YARN…
-- Subject: Unit nodemanager.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit nodemanager.service has failed.

-- The result is failed.

I also verified all the services, and logstash is showing DEAD. I tried to restart just that one service, but it failed as well. I'm not sure whether that is related to the previous error or a consequence of it.


[root@PER320-2 hops]# ./kagent/kagent-1.3.0/bin/status-all-local-services.sh
2020-08-11 13:48:53 INFO [agent/setupLogging] Hops-Kagent started.
2020-08-11 13:48:53 INFO [agent/setupLogging] Heartbeat URL: https://hopsworks.glassfish.service.consul:443/hopsworks-api/api/agentresource?action=heartbeat
2020-08-11 13:48:53 INFO [agent/setupLogging] Host Id: PER320-2
2020-08-11 13:48:53 INFO [agent/setupLogging] Hostname: PER320-2
2020-08-11 13:48:53 INFO [agent/setupLogging] Public IP: 192.168.0.230
2020-08-11 13:48:53 INFO [agent/setupLogging] Private IP: 192.168.0.230
2020-08-11 13:48:53 INFO [service/alive] Service ndb_mgmd is alive
2020-08-11 13:48:53 INFO [service/alive] Service alertmanager is alive
2020-08-11 13:48:53 INFO [service/alive] Service prometheus is alive
2020-08-11 13:48:54 INFO [service/alive] Service node_exporter is alive
2020-08-11 13:48:54 INFO [service/alive] Service nvml_monitor is alive
2020-08-11 13:48:54 INFO [service/alive] Service ndbmtd is alive
2020-08-11 13:48:54 INFO [service/alive] Service mysqld is alive
2020-08-11 13:48:54 INFO [service/alive] Service mysqld_exporter is alive
2020-08-11 13:48:54 INFO [service/alive] Service airflow-webserver is alive
2020-08-11 13:48:54 INFO [service/alive] Service airflow-scheduler is alive
2020-08-11 13:48:54 INFO [service/alive] Service glassfish-domain1 is alive
2020-08-11 13:48:54 INFO [service/alive] Service kagent is alive
2020-08-11 13:48:54 INFO [service/alive] Service consul is alive
2020-08-11 13:48:54 INFO [service/alive] Service influxdb is alive
2020-08-11 13:48:54 INFO [service/alive] Service grafana is alive
2020-08-11 13:48:54 INFO [service/alive] Service elasticsearch is alive
2020-08-11 13:48:54 INFO [service/alive] Service elastic_exporter is alive
2020-08-11 13:48:54 INFO [service/alive] Service sqoop is alive
2020-08-11 13:48:54 INFO [service/alive] Service namenode is alive
2020-08-11 13:48:54 INFO [service/alive] Service zookeeper is alive
2020-08-11 13:48:54 INFO [service/alive] Service datanode is alive
2020-08-11 13:48:54 INFO [service/alive] Service kafka is alive
2020-08-11 13:48:54 INFO [service/alive] Service historyserver is alive
2020-08-11 13:48:54 INFO [service/alive] Service resourcemanager is alive
2020-08-11 13:48:54 INFO [service/alive] Service epipe is alive
2020-08-11 13:48:54 ERROR [service/alive] Service logstash is DEAD.
2020-08-11 13:48:54 INFO [service/alive] Service kibana is alive
2020-08-11 13:48:54 INFO [service/alive] Service hivemetastore is alive
2020-08-11 13:48:54 INFO [service/alive] Service hiveserver2 is alive
2020-08-11 13:48:54 INFO [service/alive] Service livy is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-tf-serving is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-sklearn-serving is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-beamjobservercluster is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-beamjobserverlocal is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-beamsdkworker is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-spark is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-kagent is alive
2020-08-11 13:48:54 INFO [service/alive] Service flinkhistoryserver is alive
2020-08-11 13:48:54 INFO [service/alive] Service sparkhistoryserver is alive
2020-08-11 13:48:54 ERROR [service/alive] Service nodemanager is DEAD.
[root@PER320-2 hops]# systemctl status logstash.service
● logstash.service - logstash Server
Loaded: loaded (/usr/lib/systemd/system/logstash.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Tue 2020-08-11 10:43:11 EDT; 3h 5min ago
Main PID: 36080 (code=exited, status=1/FAILURE)

Aug 11 10:43:04 per320-2.server systemd[1]: Starting logstash Server…
Aug 11 10:43:04 per320-2.server systemd[1]: Started logstash Server.
Aug 11 10:43:11 per320-2.server systemd[1]: logstash.service: main process exited, code=exited, status=1/FAILURE
Aug 11 10:43:11 per320-2.server systemd[1]: Unit logstash.service entered failed state.
Aug 11 10:43:11 per320-2.server systemd[1]: logstash.service failed.
[root@PER320-2 hops]# systemctl restart logstash.service
[root@PER320-2 hops]# systemctl status logstash.service

Can you check the nodemanager log? You can find it in /srv/hops/hadoop/logs/hadoop-yarn-nodemanager-hopsworks0.logicalclocks.com.log and get back to us with the last 50 lines.

Also, can you print the content of /etc/resolve.conf? (This should contain the IP the services use to talk to each other.)
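
For example, something like the following should show the relevant part (the hostname in the log file name will differ on your machine):

tail -n 50 /srv/hops/hadoop/logs/hadoop-yarn-nodemanager-hopsworks0.logicalclocks.com.log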

Hello @ermias, thanks to your instructions I was able to identify the error: it was write access to the main folder. Once resolved, the installation finished. However, logstash remains DEAD… just that one, and I can't find a specific log for it. Could you point me to where to look?


2020-08-12 09:39:11 INFO [service/alive] Service historyserver is alive
2020-08-12 09:39:11 INFO [service/alive] Service resourcemanager is alive
2020-08-12 09:39:11 INFO [service/alive] Service epipe is alive
2020-08-12 09:39:11 ERROR [service/alive] Service logstash is DEAD.
2020-08-12 09:39:11 INFO [service/alive] Service kibana is alive
2020-08-12 09:39:11 INFO [service/alive] Service hivemetastore is alive
2020-08-12 09:39:11 INFO [service/alive] Service hiveserver2 is alive
2020-08-12 09:39:11 INFO [service/alive] Service livy is alive
2020-08-12 09:39:11 INFO [service/alive] Service filebeat-tf-serving is alive
2020-08-12 09:39:11 INFO [service/alive] Service filebeat-sklearn-serving is alive
2020-08-12 09:39:11 INFO [service/alive] Service filebeat-beamjobservercluster is alive


[root@PER320-2 cluster]# systemctl status logstash
● logstash.service - logstash Server
Loaded: loaded (/usr/lib/systemd/system/logstash.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2020-08-12 09:21:23 EDT; 16min ago
Process: 40752 ExecStart=/srv/hops/logstash/bin/start-logstash.sh (code=exited, status=0/SUCCESS)
Main PID: 40753 (code=exited, status=1/FAILURE)

Aug 12 09:21:16 per320-2.server systemd[1]: Starting logstash Server…
Aug 12 09:21:16 per320-2.server systemd[1]: Started logstash Server.
Aug 12 09:21:23 per320-2.server systemd[1]: logstash.service: main process exited, code=exited, status=1/FAILURE
Aug 12 09:21:23 per320-2.server systemd[1]: Unit logstash.service entered failed state.
Aug 12 09:21:23 per320-2.server systemd[1]: logstash.service failed.

Below is the error I had before:


(PrivilegedOperationExecutor.java:151)
... 6 more
2020-08-12 00:04:51,179 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:314)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:708)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:756)
Caused by: java.io.IOException: Linux container executor not configured properly (error=24)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:189)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:312)
... 3 more
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=24: File /extend1 must not be world or group writable, but is 775

    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:177)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:203)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:182)
    ... 4 more

Caused by: ExitCodeException exitCode=24: File /extend1 must not be world or group writable, but is 775

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
    ... 6 more

2020-08-12 00:04:51,183 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at per320-2.server/192.168.0.230
************************************************************/
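
(For reference, this is the permission problem I mentioned above: the LinuxContainerExecutor refuses to start while the directory it uses, here /extend1, is group- or world-writable. Dropping the group/other write bits is what the check wants, e.g. something like:

sudo chmod 755 /extend1

after which the hops::nm recipe can be retried.)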

I noticed that /srv/hops/logstash/bin/start-logstash.sh defines the log folder as /srv/hops/logstash/log, and it is empty:

[root@PER320-2 cluster]# ls -l /srv/hops/logstash/log
total 0
[root@PER320-2 cluster]#

Check whether the permissions on the log dir are correct. They should be:
drwxr-x--- 2 elastic elastic 4096 Aug 11 22:03 log/
If not, fix them and rerun the logstash recipe.
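
If they are off, something along these lines should restore them (assuming the elastic user and group shown above; drwxr-x--- corresponds to mode 750):

sudo chown elastic:elastic /srv/hops/logstash/log
sudo chmod 750 /srv/hops/logstash/log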

Those are correct. I ran that recipe and got this error:

-# When a JVM receives a SIGTERM signal it exits with code 143
-SuccessExitStatus=143
-
 [Install]
-WantedBy=multi-user.target
-
-# Built for distribution-6.0.0 (distribution)
+WantedBy = multi-user.target
- change mode from '0644' to '0754'
- restore selinux security context
  • service[elasticsearch] action enable (up to date)
  • elastic_start[start_install_elastic] action run
    • kagent_config[elasticsearch] action systemd_reload

      • bash[start-if-not-running-elasticsearch] action run
        • execute “bash” “/chef-script20200812-1953-1u7vgdk”
    • elastic_http[poll elasticsearch] action get

    • elastic_http[delete projects index] action delete

      • http_request[delete request] action delete (skipped due to only_if)
        (up to date)
    • elastic_http[elastic-install-projects-index] action put

      • http_request[put request] action put (skipped due to only_if)
        (up to date)
    • elastic_http[elastic-create-logs-template] action put

    • elastic_http[elastic-create-experiments-template] action put

Here is the resolution of that hostname:
[root@PER320-2 install]# ping PER320-2
PING PER320-2 (192.168.0.230) 56(84) bytes of data.
64 bytes from PER320-2 (192.168.0.230): icmp_seq=1 ttl=64 time=0.042 ms
64 bytes from PER320-2 (192.168.0.230): icmp_seq=2 ttl=64 time=0.049 ms
64 bytes from PER320-2 (192.168.0.230): icmp_seq=3 ttl=64 time=0.065 ms
64 bytes from PER320-2 (192.168.0.230): icmp_seq=4 ttl=64 time=0.038 ms
64 bytes from PER320-2 (192.168.0.230): icmp_seq=5 ttl=64 time=0.060 ms
64 bytes from PER320-2 (192.168.0.230): icmp_seq=6 ttl=64 time=0.060 ms
^C64 bytes from PER320-2 (192.168.0.230): icmp_seq=7 ttl=64 time=0.050 ms
^X64 bytes from PER320-2 (192.168.0.230): icmp_seq=8 ttl=64 time=0.038 ms
^Z
[1]+ Stopped ping PER320-2
[root@PER320-2 install]# netstat |grep 9200
[root@PER320-2 install]#

You can try a basic Elasticsearch query to see if Elasticsearch can be reached:
curl -X GET -u admin:adminpw --insecure https://PER320-2:9200/_cat/indices

You might need to replace the default username and password admin:adminpw

Hi @ermias, I just read this after I had cleaned up the install, restarted the server, and tried to install all over again; it has now failed three times on the same step, the hops::nn recipe. This is what I have in the hops__nn.log file:

Thanks a lot.


Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH…
20/08/12 14:06:20 WARN util.NativeCodeLoader: Loaded the native-hadoop library
20/08/12 14:06:20 WARN ha.FailoverProxyHelper: Failed to get list of NN from default NN. Default NN was hdfs://rpc.namenode.service.consul:8020
20/08/12 14:06:20 WARN hdfs.DFSUtil: Could not resolve Service
com.logicalclocks.servicediscoverclient.exceptions.ServiceNotFoundException: Error: host not found Could not find service ServiceQuery(name=rpc.namenode.service.consul, tags=[])
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getSRVRecordsInternal(DnsResolver.java:112)
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getSRVRecords(DnsResolver.java:98)
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getService(DnsResolver.java:71)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesFromServiceDiscovery(DFSUtil.java:822)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesAsURIs(DFSUtil.java:772)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesAsURIs(DFSUtil.java:764)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesAsURIs(DFSUtil.java:757)
at org.apache.hadoop.hdfs.server.namenode.ha.FailoverProxyHelper.getActiveNamenodes(FailoverProxyHelper.java:100)
at org.apache.hadoop.hdfs.server.namenode.ha.HopsRandomStickyFailoverProxyProvider.<init>(HopsRandomStickyFailoverProxyProvider.java:99)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

[fmarines@per320-2 cluster]$ sudo systemctl status namenode
● namenode.service - NameNode server for HDFS.
Loaded: loaded (/usr/lib/systemd/system/namenode.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/namenode.service.d
└─limits.conf
Active: active (running) since Wed 2020-08-12 14:04:12 EDT; 29s ago
Process: 44199 ExecStop=/srv/hops/hadoop/sbin/stop-nn.sh (code=exited, status=0/SUCCESS)
Process: 44222 ExecStart=/srv/hops/hadoop/sbin/start-nn.sh (code=exited, status=0/SUCCESS)
Main PID: 44259 (java)
Tasks: 202
CGroup: /system.slice/namenode.service
└─44259 /usr/lib/jvm/java-1.8.0/bin/java -Dproc_namenode -Xmx1000m -XX:MaxDirectMemorySize=1000m -XX:MaxDirectMemorySize=1000m -XX:MaxDirect…

Aug 12 14:04:06 per320-2.server start-nn.sh[44222]: rsync from /srv/hops/hadoop
Aug 12 14:04:06 per320-2.server start-nn.sh[44222]: starting namenode, logging to /srv/hops/hadoop/logs/hadoop-hdfs-namenode-per320-2.server.out
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: 2020-08-12 13:44:08,317 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: /************************************************************
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: STARTUP_MSG: Starting NameNode
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: STARTUP_MSG: user = hdfs
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: STARTUP_MSG: host = PER320-2/192.168.0.230
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: STARTUP_MSG: args = []
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: STARTUP_MSG: version = 2.8.2.10-RC1
Aug 12 14:04:12 per320-2.server systemd[1]: Started NameNode server for HDFS…

Hi @ermias, below is what @antonios suggested in another thread to another user:


[fmarines@per320-2 cluster]$ dig namenode.service.consul

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> namenode.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 19472
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;namenode.service.consul. IN A

;; AUTHORITY SECTION:
. 10800 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2020081203 1800 900 604800 86400

;; Query time: 11 msec
;; SERVER: 75.75.75.75#53(75.75.75.75)
;; WHEN: Wed Aug 12 22:33:13 EDT 2020
;; MSG SIZE rcvd: 127

Did you clean the machine properly? That is, stop all services, disable them, and remove everything:

systemctl stop airflow-scheduler.service
systemctl stop airflow-webserver.service
systemctl stop dela.service
systemctl stop epipe.service
systemctl stop filebeat-kagent.service
systemctl stop filebeat-serving.service
systemctl stop filebeat-sklearn-serving.service
systemctl stop filebeat-tf-serving.service
systemctl stop glassfish-domain1.service
systemctl stop grafana.service
systemctl stop historyserver.service
systemctl stop hivecleaner.service
systemctl stop hivemetastore.service
systemctl stop hiveserver2.service
systemctl stop influxd.service
systemctl stop influxdb.service
systemctl stop kafka.service
systemctl stop kagent.service
systemctl stop kibana.service
systemctl stop livy.service
systemctl stop logstash.service
systemctl stop mysqld.service
systemctl stop namenode.service
systemctl stop datanode.service
systemctl stop ndb_mgmd.service
systemctl stop ndbmtd.service
systemctl stop resourcemanager.service
systemctl stop nodemanager.service
systemctl stop sparkhistoryserver.service
systemctl stop sqoop.service
systemctl stop telegraf.service
systemctl stop zookeeper.service
systemctl stop elasticsearch.service
systemctl stop docker
systemctl stop kubelet
systemctl stop consul
systemctl disable airflow-scheduler.service
systemctl disable airflow-webserver.service
systemctl disable dela.service
systemctl disable epipe.service
systemctl disable filebeat-kagent.service
systemctl disable filebeat-serving.service
systemctl disable filebeat-sklearn-serving.service
systemctl disable filebeat-tf-serving.service
systemctl disable glassfish-domain1.service
systemctl disable grafana.service
systemctl disable historyserver.service
systemctl disable hivecleaner.service
systemctl disable hivemetastore.service
systemctl disable hiveserver2.service
systemctl disable influxd.service
systemctl disable influxdb.service
systemctl disable kafka.service
systemctl disable kagent.service
systemctl disable kibana.service
systemctl disable livy.service
systemctl disable logstash.service
systemctl disable mysqld.service
systemctl disable namenode.service
systemctl disable datanode.service
systemctl disable ndb_mgmd.service
systemctl disable ndbmtd.service
systemctl disable resourcemanager.service
systemctl disable nodemanager.service
systemctl disable sparkhistoryserver.service
systemctl disable sqoop.service
systemctl disable telegraf.service
systemctl disable zookeeper.service
systemctl disable elasticsearch.service
systemctl disable docker
systemctl disable kubelet
systemctl disable consul
rm -rf /srv/hops
rm -rf /home/anaconda/*
rm -rf /home/kubernetes/*
rm -rf /tmp
rm -rf /home/hdp/.karamel
yum remove -y chefdk
rm -rf /opt/chefdk
rm -rf /etc/pki/ca-trust/source/anchors/*

Also, if you configured the HopsFS data locations differently from the default, delete those as well.

I use the installer.sh method every time. Here is the last try; it fails on the same step again.


Uninstall

[fmarines@per320-2 cluster]$ ./hopsworks-installer.sh

Karamel/Hopsworks Installer, Copyright© 2020 Logical Clocks AB. All rights reserved.

This program can install Karamel/Chef and/or Hopsworks.

To cancel installation at any time, press CONTROL-C

You appear to have following setup on this host:

  • available memory: 46
  • available disk space (on ‘/’ root partition): 18G
  • available disk space (under ‘/mnt’ partition):
  • available CPUs: 20
  • available GPUS: 4
  • your ip is: 192.168.0.230
  • installation user: fmarines
  • linux distro: centos
  • cluster defn branch: https://raw.githubusercontent.com/logicalclocks/karamel-chef/1.3
  • hopsworks-chef branch: logicalclocks/hopsworks-chef/1.3

WARNING: We recommend at least 60GB of disk space on the root partition. Minimum is 50GB of available disk.
You have 18G space on ‘/’, and no space on ‘/mnt’.

./hopsworks-installer.sh: line 213: -1: substring expression < 0
-------------------- Installation Options --------------------

What would you like to do?

(1) Install a single-host Hopsworks cluster.

(2) Install a single-host Hopsworks cluster with TLS enabled.

(3) Install a multi-host Hopsworks cluster with TLS enabled.

(4) Install an Enterprise Hopsworks cluster.

(5) Install an Enterprise Hopsworks cluster with Kubernetes

(6) Install and start Karamel.

(7) Install Nvidia drivers and reboot server.

(8) Purge (uninstall) Hopsworks from this host.

(9) Purge (uninstall) Hopsworks from ALL hosts.

Please enter your choice 1, 2, 3, 4, 5, 6, 7, 8, 9, q (quit), or h (help) : 8


Press ENTER to continue
Shutting down services…
2020-08-13 13:30:23 INFO [agent/setupLogging] Hops-Kagent started.
2020-08-13 13:30:23 INFO [agent/setupLogging] Heartbeat URL: https://hopsworks.glassfish.service.consul:443/hopsworks-api/api/agentresource?action=heartbeat
2020-08-13 13:30:23 INFO [agent/setupLogging] Host Id: PER320-2
2020-08-13 13:30:23 INFO [agent/setupLogging] Hostname: PER320-2
2020-08-13 13:30:23 INFO [agent/setupLogging] Public IP: 192.168.0.230
2020-08-13 13:30:23 INFO [agent/setupLogging] Private IP: 192.168.0.230
2020-08-13 13:30:24 INFO [service/stop] Stopped service: namenode
2020-08-13 13:30:24 INFO [service/stop] Stopped service: sqoop
2020-08-13 13:30:24 INFO [service/stop] Stopped service: elastic_exporter
2020-08-13 13:30:25 INFO [service/stop] Stopped service: elasticsearch
2020-08-13 13:30:25 INFO [service/stop] Stopped service: grafana
2020-08-13 13:30:25 INFO [service/stop] Stopped service: influxdb
2020-08-13 13:30:25 INFO [service/stop] Stopped service: consul
2020-08-13 13:30:25 INFO [service/stop] Stopped service: kagent
2020-08-13 13:30:31 INFO [service/stop] Stopped service: glassfish-domain1
2020-08-13 13:30:32 INFO [service/stop] Stopped service: airflow-scheduler
2020-08-13 13:32:02 INFO [service/stop] Stopped service: airflow-webserver
2020-08-13 13:32:02 INFO [service/stop] Stopped service: mysqld_exporter
2020-08-13 13:32:07 INFO [service/stop] Stopped service: mysqld
2020-08-13 13:32:08 INFO [service/stop] Stopped service: ndbmtd
2020-08-13 13:32:08 INFO [service/stop] Stopped service: nvml_monitor
2020-08-13 13:32:08 INFO [service/stop] Stopped service: node_exporter
2020-08-13 13:32:08 INFO [service/stop] Stopped service: prometheus
2020-08-13 13:32:08 INFO [service/stop] Stopped service: alertmanager
2020-08-13 13:32:09 INFO [service/stop] Stopped service: ndb_mgmd
Killing karamel…
Removing karamel…
Removing cookbooks…
Purging old installation…


[fmarines@per320-2 cluster]$ systemctl |grep failed
● airflow-webserver.service loaded failed failed Airflow webserver daemon
● consul.service loaded failed failed “HashiCorp Consul - A service mesh solution”
● elasticsearch.service loaded failed failed Elasticsearch daemon.
● flinkhistoryserver.service loaded failed failed Flink historyserver
● namenode.service loaded failed failed NameNode server for HDFS.
● sqoop.service loaded failed failed Sqoop server


[fmarines@per320-2 cluster]$ sudo systemctl disable airflow-webserver.service
Removed symlink /etc/systemd/system/multi-user.target.wants/airflow-webserver.service.
[fmarines@per320-2 cluster]$ sudo systemctl disable consul.service
Removed symlink /etc/systemd/system/multi-user.target.wants/consul.service.
[fmarines@per320-2 cluster]$ sudo systemctl disable sqoop.service
Removed symlink /etc/systemd/system/multi-user.target.wants/sqoop.service.
[fmarines@per320-2 cluster]$ sudo systemctl disable namenode.service
Removed symlink /etc/systemd/system/multi-user.target.wants/namenode.service.
[fmarines@per320-2 cluster]$ sudo systemctl disable flinkhistoryserver.service
Removed symlink /etc/systemd/system/multi-user.target.wants/flinkhistoryserver.service.
[fmarines@per320-2 cluster]$ sudo systemctl disable elasticsearch.service
Removed symlink /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
[fmarines@per320-2 cluster]$
[fmarines@per320-2 cluster]$ systectl |grep failed
bash: systectl: command not found…
[fmarines@per320-2 cluster]$ systemctl |grep failed
● airflow-webserver.service loaded failed failed Airflow webserver daemon
● consul.service loaded failed failed “HashiCorp Consul - A service mesh solution”
● elasticsearch.service loaded failed failed Elasticsearch daemon.
● flinkhistoryserver.service loaded failed failed Flink historyserver
● namenode.service loaded failed failed NameNode server for HDFS.
● sqoop.service loaded failed failed Sqoop server


[fmarines@per320-2 cluster]$ sudo systemctl reset-failed

[fmarines@per320-2 cluster]$ more /etc/init.d/
devtoolset-8-stap-server functions netconsole README
devtoolset-8-systemtap jexec network
[fmarines@per320-2 cluster]$ more /etc/init.d/


Re-install

Found karamel
Running command from /extend1/cluster/karamel-0.6:

setsid ./bin/karamel -headless -launch …/cluster-defns/hopsworks-installer-active.yml > …/installation.log 2>&1 &


Installation has started, but may take 1 hour or more…

The Karamel installer UI will soon start at: http://192.168.0.230:9090/index.html
Note: port 9090 must be open for external traffic and Karamel will shutdown when installation finishes.

=====================================================================

You can view the installation logs with this command:

tail -f installation.log


[fmarines@per320-2 cluster]$ tail -f installation.log


Some time later, from the installation.log file:

Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH…
20/08/13 15:10:59 WARN util.NativeCodeLoader: Loaded the native-hadoop library
20/08/13 15:10:59 WARN ha.FailoverProxyHelper: Failed to get list of NN from default NN. Default NN was hdfs://rpc.namenode.service.consul:8020
20/08/13 15:10:59 WARN hdfs.DFSUtil: Could not resolve Service
com.logicalclocks.servicediscoverclient.exceptions.ServiceNotFoundException: Error: host not found Could not find service ServiceQuery(name=rpc.namenode.service.consul, tags=[])
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getSRVRecordsInternal(DnsResolver.java:112)
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getSRVRecords(DnsResolver.java:98)
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getService(DnsResolver.java:71)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesFromServiceDiscovery(DFSUtil.java:822)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesAsURIs(DFSUtil.java:772)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesAsURIs(DFSUtil.java:764)

Then the only problem I see is that you are not getting any answer when you dig the namenode.
You should get something like:
;; ANSWER SECTION: namenode.service.consul. 0 IN A 10.0.2.15
This can be because of a wrong /etc/resolve.conf; can you print what is in that file?
In a VM it looks something like: nameserver 10.0.2.15
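
One way to check whether Consul itself can resolve the name, independently of the OS resolver (assuming Consul is running with its default DNS port 8600), is to query it directly:

dig @127.0.0.1 -p 8600 namenode.service.consul

If that returns an A record while a plain dig does not, the problem is the resolver configuration rather than Consul.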

AND do not ignore this message:
WARNING: We recommend at least 60GB of disk space on the root partition. Minimum is 50GB of available disk.
You have 18G space on ‘/’, and no space on ‘/mnt’.

Strange, that file did not exist, so I created it and entered a record like nameserver 192.168.0.230, then ran dig against the namenode again, and… I don't see the ANSWER SECTION; it is the same response as before.
Then I tried other options:

  • PER320-2 192.X.X.X
  • namenode.service.consul 192.X.X.X

with their respective dig commands, and they all came back with nothing. However, a dig to logicalclocks.com returns a response:

;; ANSWER SECTION:
logicalclocks.com. 1799 IN A 13.248.155.104
logicalclocks.com. 1799 IN A 76.223.27.102

After that I looked into /srv/hops/hadoop/logs/hadoop-hdfs-namenode-per320-2.server.log, and it seems to start and restart 3 to 5 times before giving up. This is the error from all the restarts:


2020-08-14 09:28:58,270 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting
2020-08-14 09:28:58,295 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Leader Node RPC up at: PER320-2/192.168.0.230:8020
2020-08-14 09:28:58,297 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2020-08-14 09:28:58,297 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest edits from old active before taking over writer role in edits logs
2020-08-14 09:28:58,297 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all datandoes as stale
2020-08-14 09:28:58,298 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication and invalidation queues
2020-08-14 09:28:58,298 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: initializing replication queues
2020-08-14 09:28:58,309 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Starting CacheReplicationMonitor with interval 30000 milliseconds
2020-08-14 09:28:58,329 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: processMisReplicated read 0/10000 in the Ids range [0 - 10000] (max inodeId when the process started: 1)
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of blocks = 0
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of invalid blocks = 0
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of under-replicated blocks = 0
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of over-replicated blocks = 0
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of blocks being written = 0
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.StateChange: STATE* Replication Queue initialization scan for invalid, over- and under-replicated blocks completed in 34 msec
2020-08-14 09:28:59,042 INFO org.apache.hadoop.fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 60 minutes.
2020-08-14 09:28:59,043 INFO org.apache.hadoop.fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 60 minutes.
2020-08-14 09:29:07,595 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
2020-08-14 09:29:07,702 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at PER320-2/192.168.0.230
************************************************************/
2020-08-14 09:29:09,958 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************

STARTUP_MSG: Starting NameNode
STARTUP_MSG: user = hdfs
STARTUP_MSG: host = PER320-2/192.168.0.230
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.8.2.10-RC1


Regarding the space: this server has 50GB on the root partition and has other applications installed (but not used), and /mnt is a symbolic link to another disk, so it shouldn't be a problem, should it?


What I did today:

[root@per320-2 logs]# more /etc/resolve.conf
/etc/resolve.conf: No such file or directory
[root@per320-2 logs]# vi /etc/resolve.conf


[root@per320-2 logs]# more /etc/resolve.conf
nameserver 192.168.0.230
[root@per320-2 logs]#


[root@per320-2 logs]# dig namenode.service.consul

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> namenode.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 37216
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;namenode.service.consul. IN A

;; AUTHORITY SECTION:
. 10800 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2020081400 1800 900 604800 86400

;; Query time: 11 msec
;; SERVER: 75.75.75.75#53(75.75.75.75)
;; WHEN: Fri Aug 14 09:19:40 EDT 2020
;; MSG SIZE rcvd: 127

[root@per320-2 logs]# find / -name resolve.conf -print |more
/etc/resolve.conf
[root@per320-2 logs]#

My bad, I typed the filename wrong; that is why it was not found. Here is my actual resolv.conf:

nameserver 75.75.75.75
nameserver 75.75.76.76
nameserver 8.8.8.8

I'm confused… I did the same test from a VM (version 1.2) that is working fine, and it does not return an answer either.

This is from the VM:

Installed:
bind-utils.x86_64 32:9.11.4-16.P2.el7_8.6

Dependency Installed:
bind-export-libs.x86_64 32:9.11.4-16.P2.el7_8.6 bind-libs.x86_64 32:9.11.4-16.P2.el7_8.6

Dependency Updated:
bind-libs-lite.x86_64 32:9.11.4-16.P2.el7_8.6 bind-license.noarch 32:9.11.4-16.P2.el7_8.6 dhclient.x86_64 12:4.2.5-79.el7.centos
dhcp-common.x86_64 12:4.2.5-79.el7.centos dhcp-libs.x86_64 12:4.2.5-79.el7.centos

Complete!
[vagrant@hopsworks0 ~]$ dig namenode.service.consul

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> namenode.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1301
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;namenode.service.consul. IN A

;; Query time: 9 msec
;; SERVER: 10.0.2.3#53(10.0.2.3)
;; WHEN: Fri Aug 14 13:50:02 UTC 2020
;; MSG SIZE rcvd: 41

[vagrant@hopsworks0 ~]$ more /etc/resolv.conf

# Generated by NetworkManager

search logicalclocks.com
nameserver 10.0.2.3
options single-request-reopen

This is from the host:

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.2 <<>> namenode.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 15937
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;namenode.service.consul. IN A

;; AUTHORITY SECTION:
. 10800 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2020081400 1800 900 604800 86400

;; Query time: 15 msec
;; SERVER: 75.75.75.75#53(75.75.75.75)
;; WHEN: Fri Aug 14 10:25:55 EDT 2020
;; MSG SIZE rcvd: 127

[root@PER420 ~]# more /etc/resolv.conf

# Generated by NetworkManager

nameserver 75.75.75.75
nameserver 75.75.76.76
nameserver 8.8.8.8
[root@PER420 ~]#

Are you installing version 1.2 or 1.3? Consul was introduced in 1.3.