Bare-metal cluster install failure

Hi @ermias, below is what @antonios suggested in another thread for a different user:


[fmarines@per320-2 cluster]$ dig namenode.service.consul

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> namenode.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 19472
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;namenode.service.consul. IN A

;; AUTHORITY SECTION:
. 10800 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2020081203 1800 900 604800 86400

;; Query time: 11 msec
;; SERVER: 75.75.75.75#53(75.75.75.75)
;; WHEN: Wed Aug 12 22:33:13 EDT 2020
;; MSG SIZE rcvd: 127

Did you clean the machine properly? That is, stop all services, disable them, and remove everything:

systemctl stop airflow-scheduler.service
systemctl stop airflow-webserver.service
systemctl stop dela.service
systemctl stop epipe.service
systemctl stop filebeat-kagent.service
systemctl stop filebeat-serving.service
systemctl stop filebeat-sklearn-serving.service
systemctl stop filebeat-tf-serving.service
systemctl stop glassfish-domain1.service
systemctl stop grafana.service
systemctl stop historyserver.service
systemctl stop hivecleaner.service
systemctl stop hivemetastore.service
systemctl stop hiveserver2.service
systemctl stop influxd.service
systemctl stop influxdb.service
systemctl stop kafka.service
systemctl stop kagent.service
systemctl stop kibana.service
systemctl stop livy.service
systemctl stop logstash.service
systemctl stop mysqld.service
systemctl stop namenode.service
systemctl stop datanode.service
systemctl stop ndb_mgmd.service
systemctl stop ndbmtd.service
systemctl stop resourcemanager.service
systemctl stop nodemanager.service
systemctl stop sparkhistoryserver.service
systemctl stop sqoop.service
systemctl stop telegraf.service
systemctl stop zookeeper.service
systemctl stop elasticsearch.service
systemctl stop docker
systemctl stop kubelet
systemctl stop consul
systemctl disable airflow-scheduler.service
systemctl disable airflow-webserver.service
systemctl disable dela.service
systemctl disable epipe.service
systemctl disable filebeat-kagent.service
systemctl disable filebeat-serving.service
systemctl disable filebeat-sklearn-serving.service
systemctl disable filebeat-tf-serving.service
systemctl disable glassfish-domain1.service
systemctl disable grafana.service
systemctl disable historyserver.service
systemctl disable hivecleaner.service
systemctl disable hivemetastore.service
systemctl disable hiveserver2.service
systemctl disable influxd.service
systemctl disable influxdb.service
systemctl disable kafka.service
systemctl disable kagent.service
systemctl disable kibana.service
systemctl disable livy.service
systemctl disable logstash.service
systemctl disable mysqld.service
systemctl disable namenode.service
systemctl disable datanode.service
systemctl disable ndb_mgmd.service
systemctl disable ndbmtd.service
systemctl disable resourcemanager.service
systemctl disable nodemanager.service
systemctl disable sparkhistoryserver.service
systemctl disable sqoop.service
systemctl disable telegraf.service
systemctl disable zookeeper.service
systemctl disable elasticsearch.service
systemctl disable docker
systemctl disable kubelet
systemctl disable consul
rm -rf /srv/hops
rm -rf /home/anaconda/*
rm -rf /home/kubernetes/*
rm -rf /tmp
rm -rf /home/hdp/.karamel
yum remove -y chefdk
rm -rf /opt/chefdk
rm -rf /etc/pki/ca-trust/source/anchors/*
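
The stop/disable/remove sequence above can be scripted as a loop. This is a sketch, with the service names and paths taken verbatim from the commands above; it prints the commands by default (set `DRY_RUN=0` to actually execute them, as root, after double-checking the `rm -rf` targets):

```shell
#!/usr/bin/env bash
# Sketch: stop and disable all Hopsworks-related services and remove leftovers.
# Commands are PRINTED by default; set DRY_RUN=0 to execute them for real.
set -u

SERVICES="airflow-scheduler airflow-webserver dela epipe filebeat-kagent \
filebeat-serving filebeat-sklearn-serving filebeat-tf-serving glassfish-domain1 \
grafana historyserver hivecleaner hivemetastore hiveserver2 influxd influxdb \
kafka kagent kibana livy logstash mysqld namenode datanode ndb_mgmd ndbmtd \
resourcemanager nodemanager sparkhistoryserver sqoop telegraf zookeeper \
elasticsearch docker kubelet consul"

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"        # dry run: print the command instead of executing it
  else
    "$@"
  fi
}

for svc in $SERVICES; do
  run systemctl stop "$svc"
  run systemctl disable "$svc"
done

# Remove installation state (same paths as the manual commands above).
for path in /srv/hops /home/anaconda/* /home/kubernetes/* /tmp \
            /home/hdp/.karamel /opt/chefdk /etc/pki/ca-trust/source/anchors/*; do
  run rm -rf "$path"
done
run yum remove -y chefdk
```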

Also, if you configured HopsFS data locations different from the defaults, delete those as well.

I use the installer.sh method every time. Here is the last try; it fails on the same step again.


Uninstall

[fmarines@per320-2 cluster]$ ./hopsworks-installer.sh

Karamel/Hopsworks Installer, Copyright © 2020 Logical Clocks AB. All rights reserved.

This program can install Karamel/Chef and/or Hopsworks.

To cancel installation at any time, press CONTROL-C

You appear to have following setup on this host:

  • available memory: 46
  • available disk space (on '/' root partition): 18G
  • available disk space (under '/mnt' partition):
  • available CPUs: 20
  • available GPUS: 4
  • your ip is: 192.168.0.230
  • installation user: fmarines
  • linux distro: centos
  • cluster defn branch: https://raw.githubusercontent.com/logicalclocks/karamel-chef/1.3
  • hopsworks-chef branch: logicalclocks/hopsworks-chef/1.3

WARNING: We recommend at least 60GB of disk space on the root partition. Minimum is 50GB of available disk.
You have 18G space on '/', and no space on '/mnt'.

./hopsworks-installer.sh: line 213: -1: substring expression < 0
-------------------- Installation Options --------------------

What would you like to do?

(1) Install a single-host Hopsworks cluster.

(2) Install a single-host Hopsworks cluster with TLS enabled.

(3) Install a multi-host Hopsworks cluster with TLS enabled.

(4) Install an Enterprise Hopsworks cluster.

(5) Install an Enterprise Hopsworks cluster with Kubernetes

(6) Install and start Karamel.

(7) Install Nvidia drivers and reboot server.

(8) Purge (uninstall) Hopsworks from this host.

(9) Purge (uninstall) Hopsworks from ALL hosts.

Please enter your choice 1, 2, 3, 4, 5, 6, 7, 8, 9, q (quit), or h (help) : 8


Press ENTER to continue
Shutting down services…
2020-08-13 13:30:23 INFO [agent/setupLogging] Hops-Kagent started.
2020-08-13 13:30:23 INFO [agent/setupLogging] Heartbeat URL: https://hopsworks.glassfish.service.consul:443/hopsworks-api/api/agentresource?action=heartbeat
2020-08-13 13:30:23 INFO [agent/setupLogging] Host Id: PER320-2
2020-08-13 13:30:23 INFO [agent/setupLogging] Hostname: PER320-2
2020-08-13 13:30:23 INFO [agent/setupLogging] Public IP: 192.168.0.230
2020-08-13 13:30:23 INFO [agent/setupLogging] Private IP: 192.168.0.230
2020-08-13 13:30:24 INFO [service/stop] Stopped service: namenode
2020-08-13 13:30:24 INFO [service/stop] Stopped service: sqoop
2020-08-13 13:30:24 INFO [service/stop] Stopped service: elastic_exporter
2020-08-13 13:30:25 INFO [service/stop] Stopped service: elasticsearch
2020-08-13 13:30:25 INFO [service/stop] Stopped service: grafana
2020-08-13 13:30:25 INFO [service/stop] Stopped service: influxdb
2020-08-13 13:30:25 INFO [service/stop] Stopped service: consul
2020-08-13 13:30:25 INFO [service/stop] Stopped service: kagent
2020-08-13 13:30:31 INFO [service/stop] Stopped service: glassfish-domain1
2020-08-13 13:30:32 INFO [service/stop] Stopped service: airflow-scheduler
2020-08-13 13:32:02 INFO [service/stop] Stopped service: airflow-webserver
2020-08-13 13:32:02 INFO [service/stop] Stopped service: mysqld_exporter
2020-08-13 13:32:07 INFO [service/stop] Stopped service: mysqld
2020-08-13 13:32:08 INFO [service/stop] Stopped service: ndbmtd
2020-08-13 13:32:08 INFO [service/stop] Stopped service: nvml_monitor
2020-08-13 13:32:08 INFO [service/stop] Stopped service: node_exporter
2020-08-13 13:32:08 INFO [service/stop] Stopped service: prometheus
2020-08-13 13:32:08 INFO [service/stop] Stopped service: alertmanager
2020-08-13 13:32:09 INFO [service/stop] Stopped service: ndb_mgmd
Killing karamel…
Removing karamel…
Removing cookbooks…
Purging old installation…


[fmarines@per320-2 cluster]$ systemctl |grep failed
● airflow-webserver.service loaded failed failed Airflow webserver daemon
● consul.service loaded failed failed "HashiCorp Consul - A service mesh solution"
● elasticsearch.service loaded failed failed Elasticsearch daemon.
● flinkhistoryserver.service loaded failed failed Flink historyserver
● namenode.service loaded failed failed NameNode server for HDFS.
● sqoop.service loaded failed failed Sqoop server


[fmarines@per320-2 cluster]$ sudo systemctl disable airflow-webserver.service
Removed symlink /etc/systemd/system/multi-user.target.wants/airflow-webserver.service.
[fmarines@per320-2 cluster]$ sudo systemctl disable consul.service
Removed symlink /etc/systemd/system/multi-user.target.wants/consul.service.
[fmarines@per320-2 cluster]$ sudo systemctl disable sqoop.service
Removed symlink /etc/systemd/system/multi-user.target.wants/sqoop.service.
[fmarines@per320-2 cluster]$ sudo systemctl disable namenode.service
Removed symlink /etc/systemd/system/multi-user.target.wants/namenode.service.
[fmarines@per320-2 cluster]$ sudo systemctl disable flinkhistoryserver.service
Removed symlink /etc/systemd/system/multi-user.target.wants/flinkhistoryserver.service.
[fmarines@per320-2 cluster]$ sudo systemctl disable elasticsearch.service
Removed symlink /etc/systemd/system/multi-user.target.wants/elasticsearch.service.
[fmarines@per320-2 cluster]$
[fmarines@per320-2 cluster]$ systectl |grep failed
bash: systectl: command not found…
[fmarines@per320-2 cluster]$ systemctl |grep failed
● airflow-webserver.service loaded failed failed Airflow webserver daemon
● consul.service loaded failed failed "HashiCorp Consul - A service mesh solution"
● elasticsearch.service loaded failed failed Elasticsearch daemon.
● flinkhistoryserver.service loaded failed failed Flink historyserver
● namenode.service loaded failed failed NameNode server for HDFS.
● sqoop.service loaded failed failed Sqoop server


[fmarines@per320-2 cluster]$ sudo systemctl reset-failed

[fmarines@per320-2 cluster]$ more /etc/init.d/
devtoolset-8-stap-server functions netconsole README
devtoolset-8-systemtap jexec network
[fmarines@per320-2 cluster]$ more /etc/init.d/


Re-install

Found karamel
Running command from /extend1/cluster/karamel-0.6:

setsid ./bin/karamel -headless -launch ../cluster-defns/hopsworks-installer-active.yml > ../installation.log 2>&1 &


Installation has started, but may take 1 hour or more…

The Karamel installer UI will soon start at: http://192.168.0.230:9090/index.html
Note: port 9090 must be open for external traffic and Karamel will shutdown when installation finishes.

=====================================================================

You can view the installation logs with this command:

tail -f installation.log


[fmarines@per320-2 cluster]$ tail -f installation.log


Some time later, from the installation.log file:

Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH…
20/08/13 15:10:59 WARN util.NativeCodeLoader: Loaded the native-hadoop library
20/08/13 15:10:59 WARN ha.FailoverProxyHelper: Failed to get list of NN from default NN. Default NN was hdfs://rpc.namenode.service.consul:8020
20/08/13 15:10:59 WARN hdfs.DFSUtil: Could not resolve Service
com.logicalclocks.servicediscoverclient.exceptions.ServiceNotFoundException: Error: host not found Could not find service ServiceQuery(name=rpc.namenode.service.consul, tags=[])
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getSRVRecordsInternal(DnsResolver.java:112)
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getSRVRecords(DnsResolver.java:98)
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getService(DnsResolver.java:71)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesFromServiceDiscovery(DFSUtil.java:822)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesAsURIs(DFSUtil.java:772)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesAsURIs(DFSUtil.java:764)

Then the only problem I see is that you are not getting any answer when you dig the namenode.
You should get something like:

;; ANSWER SECTION:
namenode.service.consul. 0 IN A 10.0.2.15

This can be because of a wrong /etc/resolve.conf. Can you print what is in this file?
In a VM it looks something like: nameserver 10.0.2.15

AND do not ignore this message:
WARNING: We recommend at least 60GB of disk space on the root partition. Minimum is 50GB of available disk.
You have 18G space on '/', and no space on '/mnt'.
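
A quick way to tell whether the problem is Consul itself or the OS resolver is to query the Consul agent directly: Consul's own DNS interface listens on port 8600 by default, so a direct query bypasses /etc/resolv.conf entirely. This is a generic Consul check, not something Hopsworks-specific; a small helper sketch:

```shell
# consul_dns_ok SERVER PORT NAME: succeeds if the resolver at SERVER:PORT
# answers an A record for NAME. Querying the Consul agent directly on its
# default DNS port (8600) tells you whether the service is registered at
# all, independent of what /etc/resolv.conf says.
consul_dns_ok() {
  dig +short "@$1" -p "$2" "$3" | grep -qE '^[0-9]+(\.[0-9]+){3}$'
}

# If this answers, Consul is fine and the problem is resolver forwarding;
# if it does not, the namenode service never registered with Consul:
#   consul_dns_ok 127.0.0.1 8600 namenode.service.consul && echo "registered"
#
# Also check which server a plain `dig` (no @server) actually asks:
#   cat /etc/resolv.conf
```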

Strange, that file does not exist, so I created it and entered a record like nameserver 192.168.0.230, then dug the namenode again, and… I don't see the ANSWER SECTION. It is the same response as before.
Then I tried other options:

  • PER320-2 192.X.X.X
  • namenode.service.consul 192.X.X.X

with their respective dig commands, and all came back with nothing. However, a dig to logicalclocks.com returns a response:

;; ANSWER SECTION:
logicalclocks.com. 1799 IN A 13.248.155.104
logicalclocks.com. 1799 IN A 76.223.27.102

After that I looked into /srv/hops/hadoop/logs/hadoop-hdfs-namenode-per320-2.server.log, and it seems to start and restart 3 to 5 times before giving up. This is the error from all the restarts:


2020-08-14 09:28:58,270 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8020: starting
2020-08-14 09:28:58,295 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Leader Node RPC up at: PER320-2/192.168.0.230:8020
2020-08-14 09:28:58,297 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2020-08-14 09:28:58,297 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Catching up to latest edits from old active before taking over writer role in edits logs
2020-08-14 09:28:58,297 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all datandoes as stale
2020-08-14 09:28:58,298 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication and invalidation queues
2020-08-14 09:28:58,298 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: initializing replication queues
2020-08-14 09:28:58,309 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Starting CacheReplicationMonitor with interval 30000 milliseconds
2020-08-14 09:28:58,329 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: processMisReplicated read 0/10000 in the Ids range [0 - 10000] (max inodeId when the process started: 1)
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of blocks = 0
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of invalid blocks = 0
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of under-replicated blocks = 0
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of over-replicated blocks = 0
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of blocks being written = 0
2020-08-14 09:28:58,337 INFO org.apache.hadoop.hdfs.StateChange: STATE* Replication Queue initialization scan for invalid, over- and under-replicated blocks completed in 34 msec
2020-08-14 09:28:59,042 INFO org.apache.hadoop.fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 60 minutes.
2020-08-14 09:28:59,043 INFO org.apache.hadoop.fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 60 minutes.
2020-08-14 09:29:07,595 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
2020-08-14 09:29:07,702 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at PER320-2/192.168.0.230
************************************************************/
2020-08-14 09:29:09,958 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   user = hdfs
STARTUP_MSG:   host = PER320-2/192.168.0.230
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.8.2.10-RC1
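
For what it's worth, `RECEIVED SIGNAL 15: SIGTERM` means the NameNode was killed from outside (e.g. systemd restarting the unit, or an agent stopping it), not that it crashed on its own. A small sketch that summarizes how many restart cycles a log like the one above went through (log line patterns taken from the excerpt):

```shell
# restart_summary <logfile>: print how many times the NameNode started and
# how many times it received SIGTERM (an external stop, not a crash).
restart_summary() {
  local log="$1"
  local starts sigterms
  starts=$(grep -c 'STARTUP_MSG: Starting NameNode' "$log")
  sigterms=$(grep -c 'RECEIVED SIGNAL 15: SIGTERM' "$log")
  echo "starts=$starts sigterms=$sigterms"
}

# Usage against the log discussed above:
#   restart_summary /srv/hops/hadoop/logs/hadoop-hdfs-namenode-per320-2.server.log
```

If the SIGTERM count tracks the start count, something external (systemd unit limits, kagent, or the installer itself) is cycling the service; `journalctl -u namenode.service` should show who issued the stop.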


Regarding the space: this server has 50GB on the root partition, with other applications installed but not used, and /mnt is a symbolic link to another disk. That shouldn't be a problem, should it?


What I did today:

[root@per320-2 logs]# more /etc/resolve.conf
/etc/resolve.conf: No such file or directory
[root@per320-2 logs]# vi /etc/resolve.conf


[root@per320-2 logs]# more /etc/resolve.conf
nameserver 192.168.0.230
[root@per320-2 logs]#


[root@per320-2 logs]# dig namenode.service.consul

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> namenode.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 37216
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;namenode.service.consul. IN A

;; AUTHORITY SECTION:
. 10800 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2020081400 1800 900 604800 86400

;; Query time: 11 msec
;; SERVER: 75.75.75.75#53(75.75.75.75)
;; WHEN: Fri Aug 14 09:19:40 EDT 2020
;; MSG SIZE rcvd: 127

[root@per320-2 logs]# find / -name resolve.conf -print |more
/etc/resolve.conf
[root@per320-2 logs]#

My bad, I typed the filename wrong; here is my resolv.conf, which is why the other one was not found:

nameserver 75.75.75.75
nameserver 75.75.76.76
nameserver 8.8.8.8
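
For the record, `.consul` names are never resolvable by public DNS; 75.75.75.75 and 8.8.8.8 will always answer NXDOMAIN for them. In a typical Consul deployment the local agent answers that domain on port 8600 and the system resolver forwards it there. With dnsmasq the standard forwarding rule looks like the following (whether the Hopsworks installer actually uses dnsmasq on your host, and the exact file path, are assumptions to verify):

```shell
# /etc/dnsmasq.d/10-consul -- hypothetical path; forward the .consul domain
# to the local Consul agent's DNS interface (default port 8600). All other
# queries still go to the upstream nameservers.
server=/consul/127.0.0.1#8600
```

With that in place, /etc/resolv.conf needs `nameserver 127.0.0.1` ahead of the public resolvers so lookups hit dnsmasq first.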

I'm confused… I did the same test from a VM (version 1.2) that is working fine, and it does not return an answer either.

This is from the VM:

Installed:
bind-utils.x86_64 32:9.11.4-16.P2.el7_8.6

Dependency Installed:
bind-export-libs.x86_64 32:9.11.4-16.P2.el7_8.6 bind-libs.x86_64 32:9.11.4-16.P2.el7_8.6

Dependency Updated:
bind-libs-lite.x86_64 32:9.11.4-16.P2.el7_8.6 bind-license.noarch 32:9.11.4-16.P2.el7_8.6 dhclient.x86_64 12:4.2.5-79.el7.centos
dhcp-common.x86_64 12:4.2.5-79.el7.centos dhcp-libs.x86_64 12:4.2.5-79.el7.centos

Complete!
[vagrant@hopsworks0 ~]$ dig namenode.service.consul

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> namenode.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1301
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;namenode.service.consul. IN A

;; Query time: 9 msec
;; SERVER: 10.0.2.3#53(10.0.2.3)
;; WHEN: Fri Aug 14 13:50:02 UTC 2020
;; MSG SIZE rcvd: 41

[vagrant@hopsworks0 ~]$ more /etc/resolv.conf

# Generated by NetworkManager

search logicalclocks.com
nameserver 10.0.2.3
options single-request-reopen

This is from the host:

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.2 <<>> namenode.service.consul
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 15937
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;namenode.service.consul. IN A

;; AUTHORITY SECTION:
. 10800 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2020081400 1800 900 604800 86400

;; Query time: 15 msec
;; SERVER: 75.75.75.75#53(75.75.75.75)
;; WHEN: Fri Aug 14 10:25:55 EDT 2020
;; MSG SIZE rcvd: 127

[root@PER420 ~]# more /etc/resolv.conf

# Generated by NetworkManager

nameserver 75.75.75.75
nameserver 75.75.76.76
nameserver 8.8.8.8
[root@PER420 ~]#

Are you installing version 1.2 or 1.3? Consul was introduced in 1.3.

Sorry to confuse you. I'm installing 1.3; we have a VM running since April that is 1.2, where I tested the same commands.

You have to use 1.3 for the VM if you want to compare the results. Consul was introduced in 1.3.