Bare-metal -1 cluster install failure

Fernando_Marines · July 29, 2020, 4:03pm

this server had mariadb installed and the .ssh missing ,i’m not sure if the mariadb was causing any problem, so i uninstall it and crate the .ssh manuall also removed all directories and now is doing the rest of recipes, let you know any other issue.

WARN [2020-07-29 15:06:14,569] net.schmizz.sshj.xfer.scp.SCPEngine: SCP exit status: 1
INFO [2020-07-29 15:06:14,571] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: ndb::install
WARN [2020-07-29 15:57:07,006] net.schmizz.sshj.xfer.scp.SCPEngine: SCP exit status: 1
INFO [2020-07-29 15:57:07,008] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: hops_airflow::install
WARN [2020-07-29 15:57:35,121] net.schmizz.sshj.xfer.scp.SCPEngine: SCP exit status: 1
INFO [2020-07-29 15:57:35,123] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: hopsmonitor::install
WARN [2020-07-29 15:57:51,550] net.schmizz.sshj.xfer.scp.SCPEngine: SCP exit status: 1
INFO [2020-07-29 15:57:51,551] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: tensorflow::install
WARN [2020-07-29 15:58:31,031] net.schmizz.sshj.xfer.scp.SCPEngine: SCP exit status: 1
INFO [2020-07-29 15:58:31,033] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: hadoop_spark::install
WARN [2020-07-29 16:02:52,753] net.schmizz.sshj.xfer.scp.SCPEngine: SCP exit status: 1
INFO [2020-07-29 16:02:52,755] se.kth.karamel.backend.machines.SshMachine: 192.168.0.230: Running task: hops::install

Thanks a lot!

Fernando_Marines · July 29, 2020, 10:31pm

Hello @Alex @antonios thought my next reply would be to give you good news but unfortunately is not, now is airflow having issues here

content of hops_airflow__default.log

Requirement already satisfied: python3-openid>=2.0 in /extend1/hops/anaconda/envs/airflow/lib/python3.6/site-packages (from Flask-OpenID<2,>=1.2.5->flask-appbuilder==1.12.1->apache-airflow[mysql]==1.10.2) (3.2.0)
Requirement already satisfied: webencodings in /extend1/hops/anaconda/envs/airflow/lib/python3.6/site-packages (from html5lib!=1.0b1,!=1.0b2,!=1.0b3,!=1.0b4,!=1.0b5,!=1.0b6,!=1.0b7,!=1.0b8,>=0.99999999pre->bleach~=2.1.3->apache-airflow[mysql]==1.10.2) (0.5.1)
Requirement already satisfied: text-unidecode>=1.3 in /extend1/hops/anaconda/envs/airflow/lib/python3.6/site-packages (from python-slugify>=1.2.5->python-nvd3==0.15.0->apache-airflow[mysql]==1.10.2) (1.3)
Requirement already satisfied: defusedxml in /extend1/hops/anaconda/envs/airflow/lib/python3.6/site-packages (from python3-openid>=2.0->Flask-OpenID<2,>=1.2.5->flask-appbuilder==1.12.1->apache-airflow[mysql]==1.10.2) (0.6.0)
Building wheels for collected packages: mysqlclient
Building wheel for mysqlclient (setup.py): started
Building wheel for mysqlclient (setup.py): finished with status ‘error’
Running setup.py clean for mysqlclient
Failed to build mysqlclient
Installing collected packages: mysqlclient
Running setup.py install for mysqlclient: started
Running setup.py install for mysqlclient: finished with status ‘error’
STDERR: ERROR: Command errored out with exit status 1:
command: /srv/hops/anaconda/anaconda/envs/airflow/bin/python -u -c ‘import sys, setuptools, tokenize; sys.argv[0] = ‘"’"’/home/tmp/pip-install-ew4ky9i8/mysqlclient/setup.py’"’"’; file=’"’"’/home/tmp/pip-install-ew4ky9i8/mysqlclient/setup.py’"’"’;f=getattr(tokenize, ‘"’"‘open’"’"’, open)(file);code=f.read().replace(’"’"’\r\n’"’"’, ‘"’"’\n’"’"’);f.close();exec(compile(code, file, ‘"’"‘exec’"’"’))’ bdist_wheel -d /home/tmp/pip-wheel-sia3b05r
cwd: /home/tmp/pip-install-ew4ky9i8/mysqlclient/
Complete output (33 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/MySQLdb
copying MySQLdb/init.py -> build/lib.linux-x86_64-3.6/MySQLdb
copying MySQLdb/_exceptions.py -> build/lib.linux-x86_64-3.6/MySQLdb
copying MySQLdb/connections.py -> build/lib.linux-x86_64-3.6/MySQLdb
copying MySQLdb/converters.py -> build/lib.linux-x86_64-3.6/MySQLdb
copying MySQLdb/cursors.py -> build/lib.linux-x86_64-3.6/MySQLdb
copying MySQLdb/release.py -> build/lib.linux-x86_64-3.6/MySQLdb

Alex · July 30, 2020, 7:14am

Hi @Fernando_Marines,

I think this is related to the previous installation of mariadb as some dependencies might be missing.
I think the issue you encountered is similar to this one:

We can start with the first mention there, a possible missing libssl-dev library. On centos, this should be
yum install openssl-dev. Try manually installing the openssl dependency and then try running manually hops_airflow__default.sh.

Fernando_Marines · July 30, 2020, 4:40pm

Hi @Alex, it seems to have the latest,

[root@PER320-2 ~]# yum install openssl-devel
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile

base: mirror.atlantic.net

centos-sclo-rh: repo1.ash.innoscale.net

centos-sclo-sclo: mirrors.advancedhosters.com

epel: d2lzkl7pfhq30w.cloudfront.net

extras: ewr.edge.kernel.org

nux-dextop: mirror.li.nux.ro

updates: repo1.ash.innoscale.net
Package 1:openssl-devel-1.0.2k-19.el7.x86_64 already installed and latest version

however i uninstall mariadb and run the hops_airflow__default.sh and noticed that installed installs the mariadb again with all dependencies need it, i think.

yum_package[gcc] action install (up to date)

yum_package[gcc-c++] action install (up to date)

yum_package[libjpeg-turbo-devel] action install (up to date)

yum_package[zlib-devel] action install (up to date)

yum_package[python-devel] action install (up to date)

yum_package[epel-release] action install (up to date)

yum_package[libffi-devel] action install (up to date)

yum_package[libffi-devel] action install (up to date)

yum_package[cyrus-sasl-devel] action install (up to date)

yum_package[cyrus-sasl-devel] action install (up to date)

yum_package[mariadb] action install
^[[32m- install version 1:5.5.65-1.el7.x86_64 of package mariadb^[[0m
^[[0m * yum_package[mariadb-devel] action install
^[[32m- install version 1:5.5.65-1.el7.x86_64 of package mariadb-devel^[[0m
^[[0m * yum_package[cyrus-sasl-devel] action install (up to date)

yum_package[libffi-devel] action install (up to date)

bash[remove_airflow_env] action run
^[[32m- execute “bash” “/chef-script20200730-44184-fy8mlt”^[[0m
^[[0m * bash[create_airflow_env] action run
^[[32m- execute “bash” “/chef-script20200730-44184-1n59sfm”^[[0m
^[[0m * bash[install_airflow] action run
^[[32m- execute “bash” “/chef-script20200730-44184-e5akur”^[[0m
^[[0m * bash[install_airflow_hive] action run
^[[32m- execute “bash” “/chef-script20200730-44184-324dfd”^[[0m
^[[0m * bash[install_airflow_mysql] action run
^[[0m
================================================================================^[[0m
^[[31mError executing action run on resource ‘bash[install_airflow_mysql]’^[[0m
================================================================================^[[0m

so, looking thru log lines i noticed these warnings, would make sense with the openssl you mentioned.

[2020-07-30T11:55:14-04:00] WARN: Resource openssl_dhparam from the client is overriding the resource from a cookbook. Please upgrade your cookbook or remove the cookbook from your run_list.
[2020-07-30T11:55:14-04:00] WARN: Resource openssl_rsa_key from the client is overriding the resource from a cookbook. Please upgrade your cookbook or remove the cookbook from your run_list.
[2020-07-30T11:55:14-04:00] WARN: Resource openssl_x509 from the client is overriding the resource from a cookbook. Please upgrade your cookbook or remove the cookbook from your run_list.
[2020-07-30T11:55:14-04:00] WARN: Resource sudo from the client is overriding the resource from a cookbook. Please upgrade your cookbook or remove the cookbook from your run_list.
[2020-07-30T11:55:14-04:00] WARN: Resource sysctl_param from the client is overriding the resource from a cookbook. Please upgrade your cookbook or remove the cookbook from your run_list.

the question becomes how do I solve those?
Regards

Fernando_Marines · July 30, 2020, 9:40pm

this is the content of hops_airflow__default.log

[fmarines@PER320-2 cluster]$ tail -f /home/fmarines/.karamel/install/hops_airflow__default.log
MYSQL_CLIENT_PLUGIN_HEADER
^
gcc -pthread -shared -B /srv/hops/anaconda/anaconda/envs/airflow/compiler_compat -L/srv/hops/anaconda/anaconda/envs/airflow/lib -Wl,-rpath=/srv/hops/anaconda/anaconda/envs/airflow/lib -Wl,–no-as-needed -Wl,–sysroot=/ build/temp.linux-x86_64-3.6/MySQLdb/_mysql.o **

-L/usr/lib64/ -lmariadb

** -o build/lib.linux-x86_64-3.6/MySQLdb/_mysql.cpython-36m-x86_64-linux-gnu.so

**/srv/hops/anaconda/anaconda/envs/airflow/compiler_compat/ld: cannot find -lmariadb**
collect2: error: ld returned 1 exit status
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /srv/hops/anaconda/anaconda/envs/airflow/bin/python -u -c ‘import sys, setuptools, tokenize; sys.argv[0] = ‘"’"’/home/tmp/pip-install-5jan5j_y/mysqlclient/setup.py’"’"’; file=’"’"’/home/tmp/pip-install-5jan5j_y/mysqlclient/setup.py’"’"’;f=getattr(tokenize, ‘"’"‘open’"’"’, open)(file);code=f.read().replace(’"’"’\r\n’"’"’, ‘"’"’\n’"’"’);f.close();exec(compile(code, file, ‘"’"‘exec’"’"’))’ install --record /home/tmp/pip-record-aiw9d27y/install-record.txt --single-version-externally-managed --compile --install-headers /srv/hops/anaconda/anaconda/envs/airflow/include/python3.6m/mysqlclient Check the logs for full command output.
---- End output of “bash” “/chef-script20200730-18879-er48e1” ----
Ran “bash” “/chef-script20200730-18879-er48e1” returned 1

it seems to look for a library does not exist…

[fmarines@PER320-2 cluster]$ sudo find / -name mariadb -print |more
/etc/yum.repos.d/mariadb.repo
/etc/systemd/system/mariadb.service.d
/etc/selinux/targeted/active/modules/400/mariadb
/var/tmp/yum-fmarines-kfKXxK/x86_64/7/mariadb
/var/lib/yum/repos/x86_64/7/mariadb
/var/lib/rpm-state/mariadb
/var/cache/yum/x86_64/7/updates/packages/mariadb-libs-5.5.60-1.el7_5.x86_64.rpm
/var/cache/yum/x86_64/7/mariadb
/usr/bin/mariadb_config
/usr/lib64/pkgconfig/libmariadb.pc
/usr/lib64/pkgconfig/mariadb.pc
/usr/lib64/libmariadbclient.a
/usr/lib64/libmariadbd.a
/usr/include/mysql/mariadb
/usr/include/mysql/mariadb_com.h
/usr/include/mysql/mariadb_ctype.h
/usr/include/mysql/mariadb_dyncol.h
/usr/include/mysql/mariadb_rpl.h
/usr/include/mysql/mariadb_stmt.h
/usr/include/mysql/mariadb_version.h
/usr/include/mysql/server/private/mariadb.h
/extend1/hops/domains/domain1/flyway-5.0.3/drivers/mariadb-java-client-2.2.0.jar

Fernando_Marines · August 3, 2020, 10:04pm

i was able to pass-by this error creating a symbol link as suggested from this post

Because noticed the packages installed are MariaDB not mariadb

sudo ln -s /usr/lib64/libmariadbclient.a /usr/lib64/libmariadb.a

and that pass the previous but got stocked on a different this time, so i continued researching and found notes related to not use maria-devel, instead use mysql libraries because is a problem with mariadb so went a head and installed these:

mysql-community-common-8.0.20-1.el7.x86_64
mysql-community-libs-8.0.20-1.el7.x86_64
mysql-community-devel-8.0.20-1.el7.x86_64

and noticied this library libmariadb.a instead of libmariadbclient.a so i did :
sudo ln -s /usr/lib64/libmariadbd.a /usr/lib64/libmariadb.a
and passed the error, but now this is one i can’t find any note out there that make sense with what i’m doing…some post said is a flag from the compiler (gcc) used instead of g++ but i got both installed.

Running handlers:^[[0m
[2020-08-03T01:30:26-04:00] ERROR: Running exception handlers
Running handlers complete
^[[0m[2020-08-03T01:30:26-04:00] ERROR: Exception handlers complete
Chef Infra Client failed. 54 resources updated in 06 minutes 40 seconds^[[0m
[2020-08-03T01:30:26-04:00] FATAL: Stacktrace dumped to /tmp/chef-solo/chef-stacktrace.out
[2020-08-03T01:30:26-04:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2020-08-03T01:30:26-04:00] FATAL: Mixlib::ShellOut::ShellCommandFailed: bash[init_airflow_db] (hops_airflow::default line 65) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received ‘1’
---- Begin output of “bash” “/chef-script20200803-22186-3jfksd” ----
STDOUT:
STDERR: Traceback (most recent call last):
File “/srv/hops/anaconda/anaconda/envs/airflow/lib/python3.6/site-packages/MySQLdb/init.py”, line 18, in
from . import _mysql
ImportError: /srv/hops/anaconda/anaconda/envs/airflow/lib/python3.6/site-packages/MySQLdb/_mysql.cpython-36m-x86_64-linux-gnu.so: undefined symbol: __cxa_pure_virtual

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/srv/hops/anaconda/anaconda/envs/airflow/bin/airflow”, line 21, in
from airflow import configuration
File “/srv/hops/anaconda/anaconda/envs/airflow/lib/python3.6/site-packages/airflow/init.py”, line 36, in
from airflow import settings, configuration as conf
File “/srv/hops/anaconda/anaconda/envs/airflow/lib/python3.6/site-packages/airflow/settings.py”, line 264, in
configure_adapters()
File “/srv/hops/anaconda/anaconda/envs/airflow/lib/python3.6/site-packages/airflow/settings.py”, line 221, in configure_adapters
import MySQLdb.converters
File “/srv/hops/anaconda/anaconda/envs/airflow/lib/python3.6/site-packages/MySQLdb/init.py”, line 24, in
version_info, _mysql.version_info, _mysql.file
NameError: name ‘_mysql’ is not defined
---- End output of “bash” “/chef-script20200803-22186-3jfksd” ----
Ran “bash” “/chef-script20200803-22186-3jfksd” returned 1

Fernando_Marines · August 11, 2020, 12:37pm

Hello,
i was able to correct the MariaDB installs removing the repository from centos 7 in this server, that was causing the install of those other packages instead of the packages coming from the hopsworks install, now my install is 85% complete but again got an error here, hope you can help me to figure this out.

also noticed the installer.sh has changed asking more questions during the install and registering my email, the previous install it was not that way so I’m assuming things have been improved compared with the install I started 20 days ago.

this is the content from tensorflow__default.log file

^[[0m * bash[witwidget-base_env-python36] action run
^[[0m
================================================================================^[[0m
^[[31mError executing action run on resource ‘bash[witwidget-base_env-python36]’^[[0m
================================================================================^[[0m

^[[0m Resource Declaration:^[[0m
---------------------^[[0m
# In /tmp/chef-solo/cookbooks/tensorflow/recipes/default.rb

content of /tmp/jupyterlab-debug-693399z3.log file
[LabBuildApp] Building in /srv/hops/anaconda/anaconda/envs/python36/share/jupyter/lab
[LabBuildApp] Node v6.17.1

[LabBuildApp] Building jupyterlab assets (build:prod:minimize)
[LabBuildApp] > node /srv/hops/anaconda/anaconda/envs/python36/lib/python3.6/site-packages/jupyterlab/staging/yarn.js install --non-interactive
[LabBuildApp] yarn install v1.15.2
[1/5] Validating package.json…
[2/5] Resolving packages…
[3/5] Fetching packages…
error package-json@6.5.0: The engine “node” is incompatible with this module. Expected version “>=8”. Got “6.17.1”
error Found incompatible module
info Visit https://yarnpkg.com/en/docs/cli/install for documentation about this command.

[LabBuildApp] npm dependencies failed to install
[LabBuildApp] Traceback (most recent call last):

[LabBuildApp] File “/srv/hops/anaconda/anaconda/envs/python36/lib/python3.6/site-packages/jupyterlab/debuglog.py”, line 47, in debug_logging
yield

[LabBuildApp] File “/srv/hops/anaconda/anaconda/envs/python36/lib/python3.6/site-packages/jupyterlab/labapp.py”, line 96, in start
core_config=self.core_config)

[LabBuildApp] File “/srv/hops/anaconda/anaconda/envs/python36/lib/python3.6/site-packages/jupyterlab/commands.py”, line 378, in build
command=command, clean_staging=clean_staging)

[LabBuildApp] File “/srv/hops/anaconda/anaconda/envs/python36/lib/python3.6/site-packages/jupyterlab/commands.py”, line 574, in build
raise RuntimeError(msg)

[LabBuildApp] RuntimeError: npm dependencies failed to install

[LabBuildApp] Exiting application: JupyterLab

Fernando_Marines · August 11, 2020, 12:40pm

missed this lines , node version installed:

[root@PER320-2 fmarines]# node --version
v10.16.0
[root@PER320-2 fmarines]#

ermias · August 11, 2020, 1:25pm

Hello,
There was an upstream problem with installing jupyterlab last week, can you retry it just to rule out that possibility.

Fernando_Marines · August 11, 2020, 5:25pm

thank you for that, the installation finished that step and has now only 2 steps left, unfortunately hops:nm failed

hops::nm | FAILED| retry skip log| 73455

e[32m- change mode from ‘’ to '0774’e[0m
e[32m- change owner from ‘’ to 'root’e[0m
e[32m- restore selinux security contexte[0m
e[0m * kagent_config[nodemanager] action systemd_reload
* bash[start-if-not-running-nodemanager] action run[2020-08-11T10:58:32-04:00] ERROR: bash[start-if-not-running-nodemanager] (/tmp/chef-solo/cookbooks/kagent/providers/config.rb line 49) had an error: Mixlib::ShellOut::ShellCommandFailed: Expected process to exit with [0], but received ‘1’
---- Begin output of “bash” “/chef-script20200811-26266-1o725ri” ----
STDOUT:
STDERR: Job for nodemanager.service failed because the control process exited with error code. See “systemctl status nodemanager.service” and “journalctl -xe” for details.
---- End output of “bash” “/chef-script20200811-26266-1o725ri” ----
Ran “bash” “/chef-script20200811-26266-1o725ri” returned 1; ignore_failure is set, continuing

  e[0m
  ================================================================================e[0m
  e[31mError executing action `run` on resource 'bash[start-if-not-running-nodemanager]'e[0m
  ================================================================================e[0m

e[0m Mixlib::ShellOut::ShellCommandFailede[0m
------------------------------------e[0m
Expected process to exit with [0], but received ‘1’
e[0m ---- Begin output of “bash” “/chef-script20200811-26266-1o725ri” ----
e[0m STDOUT:
e[0m STDERR: Job for nodemanager.service failed because the control process exited with error code. See “systemctl status nodemanager.service” and “journalctl -xe” for details.
e[0m ---- End output of “bash” “/chef-script20200811-26266-1o725ri” ----
e[0m Ran “bash” “/chef-script20200811-26266-1o725ri” returned 1e[0m

e[0m Resource Declaration:e[0m

journalctl -xe

Aug 11 12:37:10 per320-2.server airflow_runner.sh[33874]: [2020-08-11 12:37:10,418] {jobs.py:1559} INFO - Harvesting DAG parsing results
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: 2020-08-11 10:58:30,811 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: STARTUP_MSG:
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: /************************************************************
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: Starting NodeManager
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: user = yarn
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: host = per320-2.server/192.168.0.230
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: args = []
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: version = 2.8.2.10-RC1
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: classpath = /srv/hops/hadoop/etc/hadoop:/srv/hops/hadoop/etc/hadoop:/srv/hops/hadoop/et
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: build = git@github.com:hopshadoop/hops.git -r abc3abcdf67a1e49fd6d0b8b381a23a677996a18;
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: STARTUP_MSG: java = 1.8.0_262
Aug 11 12:37:11 per320-2.server systemd[1]: nodemanager.service: control process exited, code=exited status=1
Aug 11 12:37:11 per320-2.server start-nm.sh[19924]: /srv/hops/hadoop/sbin/start-nm.sh: line 24: kill: (19965) - No such process
Aug 11 12:37:11 per320-2.server systemd[1]: Failed to start NodeManager. The Processing Nodes for YARN…
– Subject: Unit nodemanager.service has failed
– Defined-By: systemd
– Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
–
– Unit nodemanager.service has failed.
–
– The result is failed.

Fernando_Marines · August 11, 2020, 5:55pm

i also verified all the services and loadstash is showing DEAD, tried to restart just that one service but failed as well, not sure if that is related to the previous error or consecuence of that

[root@PER320-2 hops]# ./kagent/kagent-1.3.0/bin/status-all-local-services.sh
2020-08-11 13:48:53 INFO [agent/setupLogging] Hops-Kagent started.
2020-08-11 13:48:53 INFO [agent/setupLogging] Heartbeat URL: https://hopsworks.glassfish.service.consul:443/hopsworks-api/api/agentresource?action=heartbeat
2020-08-11 13:48:53 INFO [agent/setupLogging] Host Id: PER320-2
2020-08-11 13:48:53 INFO [agent/setupLogging] Hostname: PER320-2
2020-08-11 13:48:53 INFO [agent/setupLogging] Public IP: 192.168.0.230
2020-08-11 13:48:53 INFO [agent/setupLogging] Private IP: 192.168.0.230
2020-08-11 13:48:53 INFO [service/alive] Service ndb_mgmd is alive
2020-08-11 13:48:53 INFO [service/alive] Service alertmanager is alive
2020-08-11 13:48:53 INFO [service/alive] Service prometheus is alive
2020-08-11 13:48:54 INFO [service/alive] Service node_exporter is alive
2020-08-11 13:48:54 INFO [service/alive] Service nvml_monitor is alive
2020-08-11 13:48:54 INFO [service/alive] Service ndbmtd is alive
2020-08-11 13:48:54 INFO [service/alive] Service mysqld is alive
2020-08-11 13:48:54 INFO [service/alive] Service mysqld_exporter is alive
2020-08-11 13:48:54 INFO [service/alive] Service airflow-webserver is alive
2020-08-11 13:48:54 INFO [service/alive] Service airflow-scheduler is alive
2020-08-11 13:48:54 INFO [service/alive] Service glassfish-domain1 is alive
2020-08-11 13:48:54 INFO [service/alive] Service kagent is alive
2020-08-11 13:48:54 INFO [service/alive] Service consul is alive
2020-08-11 13:48:54 INFO [service/alive] Service influxdb is alive
2020-08-11 13:48:54 INFO [service/alive] Service grafana is alive
2020-08-11 13:48:54 INFO [service/alive] Service elasticsearch is alive
2020-08-11 13:48:54 INFO [service/alive] Service elastic_exporter is alive
2020-08-11 13:48:54 INFO [service/alive] Service sqoop is alive
2020-08-11 13:48:54 INFO [service/alive] Service namenode is alive
2020-08-11 13:48:54 INFO [service/alive] Service zookeeper is alive
2020-08-11 13:48:54 INFO [service/alive] Service datanode is alive
2020-08-11 13:48:54 INFO [service/alive] Service kafka is alive
2020-08-11 13:48:54 INFO [service/alive] Service historyserver is alive
2020-08-11 13:48:54 INFO [service/alive] Service resourcemanager is alive
2020-08-11 13:48:54 INFO [service/alive] Service epipe is alive
2020-08-11 13:48:54 ERROR [service/alive] Service logstash is DEAD.
2020-08-11 13:48:54 INFO [service/alive] Service kibana is alive
2020-08-11 13:48:54 INFO [service/alive] Service hivemetastore is alive
2020-08-11 13:48:54 INFO [service/alive] Service hiveserver2 is alive
2020-08-11 13:48:54 INFO [service/alive] Service livy is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-tf-serving is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-sklearn-serving is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-beamjobservercluster is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-beamjobserverlocal is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-beamsdkworker is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-spark is alive
2020-08-11 13:48:54 INFO [service/alive] Service filebeat-kagent is alive
2020-08-11 13:48:54 INFO [service/alive] Service flinkhistoryserver is alive
2020-08-11 13:48:54 INFO [service/alive] Service sparkhistoryserver is alive
2020-08-11 13:48:54 ERROR [service/alive] Service nodemanager is DEAD.
[root@PER320-2 hops]# systemctl status logstash.service
● logstash.service - logstash Server
Loaded: loaded (/usr/lib/systemd/system/logstash.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Tue 2020-08-11 10:43:11 EDT; 3h 5min ago
Main PID: 36080 (code=exited, status=1/FAILURE)

Aug 11 10:43:04 per320-2.server systemd[1]: Starting logstash Server…
Aug 11 10:43:04 per320-2.server systemd[1]: Started logstash Server.
Aug 11 10:43:11 per320-2.server systemd[1]: logstash.service: main process exited, code=exited, status=1/FAILURE
Aug 11 10:43:11 per320-2.server systemd[1]: Unit logstash.service entered failed state.
Aug 11 10:43:11 per320-2.server systemd[1]: logstash.service failed.
[root@PER320-2 hops]# systemctl restart logstash.service
[root@PER320-2 hops]# systemctl status logstash.service

ermias · August 12, 2020, 7:45am

Can you check the nodemanager log? You can find it in /srv/hops/hadoop/logs/hadoop-yarn-nodemanager-hopsworks0.logicalclocks.com.log and get back to us with the last 50 lines.

Also can you print the content of /etc/resolve.conf (this should contain the ip the services use to talk to each other)

Fernando_Marines · August 12, 2020, 1:41pm

Hello @ermias thanks to your instructions I was able to identify the error, was write access to the main folder, once resolved the installation finished, however, the logstash remains DEAD… just that one, and can’t find a specific log for that, could you direct me where to look at?

2020-08-12 09:39:11 INFO [service/alive] Service historyserver is alive
2020-08-12 09:39:11 INFO [service/alive] Service resourcemanager is alive
2020-08-12 09:39:11 INFO [service/alive] Service epipe is alive
2020-08-12 09:39:11 ERROR [service/alive] Service logstash is DEAD.
2020-08-12 09:39:11 INFO [service/alive] Service kibana is alive
2020-08-12 09:39:11 INFO [service/alive] Service hivemetastore is alive
2020-08-12 09:39:11 INFO [service/alive] Service hiveserver2 is alive
2020-08-12 09:39:11 INFO [service/alive] Service livy is alive
2020-08-12 09:39:11 INFO [service/alive] Service filebeat-tf-serving is alive
2020-08-12 09:39:11 INFO [service/alive] Service filebeat-sklearn-serving is alive
2020-08-12 09:39:11 INFO [service/alive] Service filebeat-beamjobservercluster is alive

[root@PER320-2 cluster]# systemctl status logstash
● logstash.service - logstash Server
Loaded: loaded (/usr/lib/systemd/system/logstash.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Wed 2020-08-12 09:21:23 EDT; 16min ago
Process: 40752 ExecStart=/srv/hops/logstash/bin/start-logstash.sh (code=exited, status=0/SUCCESS)
Main PID: 40753 (code=exited, status=1/FAILURE)

Aug 12 09:21:16 per320-2.server systemd[1]: Starting logstash Server…
Aug 12 09:21:16 per320-2.server systemd[1]: Started logstash Server.
Aug 12 09:21:23 per320-2.server systemd[1]: logstash.service: main process exited, code=exited, status=1/FAILURE
Aug 12 09:21:23 per320-2.server systemd[1]: Unit logstash.service entered failed state.
Aug 12 09:21:23 per320-2.server systemd[1]: logstash.service failed.

below the error I had before

(PrivilegedOperationExecutor.java:151)
… 6 more
2020-08-12 00:04:51,179 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:314)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:708)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:756)
Caused by: java.io.IOException: Linux container executor not configured properly (error=24)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:189)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:312)
… 3 more
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=24: File /extend1 must not be world or group writable, but is 775

    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:177)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:203)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:182)
    ... 4 more

Caused by: ExitCodeException exitCode=24: File /extend1 must not be world or group writable, but is 775

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
    ... 6 more

2020-08-12 00:04:51,183 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at per320-2.server/192.168.0.230
************************************************************/

Fernando_Marines · August 12, 2020, 1:44pm

noticed the /srv/hops/logstash/bin/start-logstash.sh defines the log folder : /srv/hops/logstash/log
and its empty

[root@PER320-2 cluster]# ls -l /srv/hops/logstash/log
total 0
[root@PER320-2 cluster]#

ermias · August 12, 2020, 1:54pm

Check if the permission to the log dir is correct.
It should be
drwxr-x— 2 elastic elastic 4096 Aug 11 22:03 log/
if not fix it and rerun logstash recipe

Fernando_Marines · August 12, 2020, 2:15pm

those are correct, ran that recipe and got this error:

-# When a JVM receives a SIGTERM signal it exits with code 143
-SuccessExitStatus=143
-
 [Install]
-WantedBy=multi-user.target
-
-# Built for distribution-6.0.0 (distribution)
+WantedBy = multi-user.target
- change mode from '0644' to '0754'
- restore selinux security context

service[elasticsearch] action enable (up to date)
elastic_start[start_install_elastic] action run
- kagent_config[elasticsearch] action systemd_reload
  - bash[start-if-not-running-elasticsearch] action run
    - execute “bash” “/chef-script20200812-1953-1u7vgdk”
- elastic_http[poll elasticsearch] action get
  - http_request[get request] action get[2020-08-12T10:09:08-04:00] ERROR: Connection refused connecting to https://PER320-2:9200/, retry 1/5
    [2020-08-12T10:09:13-04:00] ERROR: Connection refused connecting to https://PER320-2:9200/, retry 2/5
    [2020-08-12T10:09:18-04:00] ERROR: Connection refused connecting to https://PER320-2:9200/, retry 3/5
    - http_request[get request] GET to https://PER320-2:9200/
- elastic_http[delete projects index] action delete
  - http_request[delete request] action delete (skipped due to only_if)
    (up to date)
- elastic_http[elastic-install-projects-index] action put
  - http_request[put request] action put (skipped due to only_if)
    (up to date)
- elastic_http[elastic-create-logs-template] action put
  - http_request[put request] action put
    - http_request[put request] PUT to https://PER320-2:9200/_template/logs
- elastic_http[elastic-create-experiments-template] action put
  - http_request[put request] action put
    - http_request[put request] PUT to https://PER320-2:9200/_template/experiments

Fernando_Marines · August 12, 2020, 2:18pm

here the resolution of that hostname
[root@PER320-2 install]# ping PER320-2
PING PER320-2 (192.168.0.230) 56(84) bytes of data.
64 bytes from PER320-2 (192.168.0.230): icmp_seq=1 ttl=64 time=0.042 ms
64 bytes from PER320-2 (192.168.0.230): icmp_seq=2 ttl=64 time=0.049 ms
64 bytes from PER320-2 (192.168.0.230): icmp_seq=3 ttl=64 time=0.065 ms
64 bytes from PER320-2 (192.168.0.230): icmp_seq=4 ttl=64 time=0.038 ms
64 bytes from PER320-2 (192.168.0.230): icmp_seq=5 ttl=64 time=0.060 ms
64 bytes from PER320-2 (192.168.0.230): icmp_seq=6 ttl=64 time=0.060 ms
^C64 bytes from PER320-2 (192.168.0.230): icmp_seq=7 ttl=64 time=0.050 ms
^X64 bytes from PER320-2 (192.168.0.230): icmp_seq=8 ttl=64 time=0.038 ms
^Z
[1]+ Stopped ping PER320-2
[root@PER320-2 install]# netstat |grep 9200
[root@PER320-2 install]#

ermias · August 12, 2020, 3:00pm

You can try a basic elastic query to see if elastic can be reached
curl -X GET -u admin:adminpw --insecure https://PER320-2:9200/_cat/indices

You might need to replace the default username and password admin:adminpw

Fernando_Marines · August 12, 2020, 6:10pm

Hi @ermias, just read this after i cleaned up the install, restarted the server and try to install all over again 3 times and fails on the same step now
is the hops:nn recipe, having this in the hops__nn.log file

thanks a lot

Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH…
20/08/12 14:06:20 WARN util.NativeCodeLoader: Loaded the native-hadoop library
20/08/12 14:06:20 WARN ha.FailoverProxyHelper: Failed to get list of NN from default NN. Default NN was hdfs://rpc.namenode.service.consul:8020
20/08/12 14:06:20 WARN hdfs.DFSUtil: Could not resolve Service
com.logicalclocks.servicediscoverclient.exceptions.ServiceNotFoundException: Error: host not found Could not find service ServiceQuery(name=rpc.namenode.service.consul, tags=[])
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getSRVRecordsInternal(DnsResolver.java:112)
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getSRVRecords(DnsResolver.java:98)
at com.logicalclocks.servicediscoverclient.resolvers.DnsResolver.getService(DnsResolver.java:71)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesFromServiceDiscovery(DFSUtil.java:822)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesAsURIs(DFSUtil.java:772)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesAsURIs(DFSUtil.java:764)
at org.apache.hadoop.hdfs.DFSUtil.getNameNodesRPCAddressesAsURIs(DFSUtil.java:757)
at org.apache.hadoop.hdfs.server.namenode.ha.FailoverProxyHelper.getActiveNamenodes(FailoverProxyHelper.java:100)
at org.apache.hadoop.hdfs.server.namenode.ha.HopsRandomStickyFailoverProxyProvider.(HopsRandomStickyFailoverProxyProvider.java:99)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

Fernando_Marines · August 12, 2020, 6:10pm

[fmarines@per320-2 cluster]$ sudo systemctl status namenode
● namenode.service - NameNode server for HDFS.
Loaded: loaded (/usr/lib/systemd/system/namenode.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/namenode.service.d
└─limits.conf
Active: active (running) since Wed 2020-08-12 14:04:12 EDT; 29s ago
Process: 44199 ExecStop=/srv/hops/hadoop/sbin/stop-nn.sh (code=exited, status=0/SUCCESS)
Process: 44222 ExecStart=/srv/hops/hadoop/sbin/start-nn.sh (code=exited, status=0/SUCCESS)
Main PID: 44259 (java)
Tasks: 202
CGroup: /system.slice/namenode.service
└─44259 /usr/lib/jvm/java-1.8.0/bin/java -Dproc_namenode -Xmx1000m -XX:MaxDirectMemorySize=1000m -XX:MaxDirectMemorySize=1000m -XX:MaxDirect…

Aug 12 14:04:06 per320-2.server start-nn.sh[44222]: rsync from /srv/hops/hadoop
Aug 12 14:04:06 per320-2.server start-nn.sh[44222]: starting namenode, logging to /srv/hops/hadoop/logs/hadoop-hdfs-namenode-per320-2.server.out
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: 2020-08-12 13:44:08,317 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: /************************************************************
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: STARTUP_MSG: Starting NameNode
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: STARTUP_MSG: user = hdfs
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: STARTUP_MSG: host = PER320-2/192.168.0.230
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: STARTUP_MSG: args = []
Aug 12 14:04:12 per320-2.server start-nn.sh[44222]: STARTUP_MSG: version = 2.8.2.10-RC1
Aug 12 14:04:12 per320-2.server systemd[1]: Started NameNode server for HDFS…