On-prem deployment, Spark issues

Greetings! I'm trying to set up an on-prem deployment of Hopsworks at a local datacenter for some ML experiments on earth observation data, and I'm having trouble with any operation that involves Spark. They all fail because Java cannot resolve hops-test-deploy as a name or service (see the stderr log output below). The available hardware resources are well above the requested amounts, so I'm ruling that out as a cause.

I have no idea how to proceed with this error, so I'm reaching out here after failing to make progress on my own.

The output sample below comes from running the first step of the official “Fraud Tutorial” material, on the line that I expect starts the Spark process:

window_aggs_fg.insert(window_aggs_df)

This does indeed start a job that registers under “executions”, but fails with the output pasted below.

Overall, the documentation was a nice read and there seems to have been a lot of progress in the 3.0 version. Please tell me if there is any additional information I can provide to help.

stderr.txt:

Container: container_e01_1666354074621_0013_01_000001 on hops-test-deploy_9000_1666600690220
============================================================================================== 
Log Type: prelaunch.err
Log Length: 0
Container: container_e01_1666354074621_0013_01_000001 on hops-test-deploy_9000_1666600690220
============================================================================================== 
Log Type: stderr
Log Length: 2744
Log Contents: 
2022-10-24 08:38:08,393 INFO model_serving,transactions_4h_aggs_fraud_batch_fg_1_offline_fg_backfill,41,application_1666354074621_0013 SignalUtils: Registering signal handler for TERM
2022-10-24 08:38:08,396 INFO model_serving,transactions_4h_aggs_fraud_batch_fg_1_offline_fg_backfill,41,application_1666354074621_0013 SignalUtils: Registering signal handler for HUP
2022-10-24 08:38:08,396 INFO model_serving,transactions_4h_aggs_fraud_batch_fg_1_offline_fg_backfill,41,application_1666354074621_0013 SignalUtils: Registering signal handler for INT
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.spark.SparkConf$.<init>(SparkConf.scala:654)
	at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
	at org.apache.spark.SparkConf.set(SparkConf.scala:94)
	at org.apache.spark.SparkConf.$anonfun$loadFromSystemProperties$3(SparkConf.scala:76)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
	at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
	at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:75)
	at org.apache.spark.SparkConf.<init>(SparkConf.scala:70)
	at org.apache.spark.SparkConf.<init>(SparkConf.scala:59)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:852)
	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.net.UnknownHostException: hops-test-deploy: hops-test-deploy: Name or service not known
	at java.net.InetAddress.getLocalHost(InetAddress.java:1512)
	at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:998)
	at org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:991)
	at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:991)
	at org.apache.spark.util.Utils$.$anonfun$localCanonicalHostName$1(Utils.scala:1048)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.util.Utils$.localCanonicalHostName(Utils.scala:1048)
	at org.apache.spark.internal.config.package$.<init>(package.scala:936)
	at org.apache.spark.internal.config.package$.<clinit>(package.scala)
	... 14 more
Caused by: java.net.UnknownHostException: hops-test-deploy: Name or service not known
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1330)
	at java.net.InetAddress.getLocalHost(InetAddress.java:1507)
	... 22 more

Log Type: stderr.txt
Log Length: 0

Hi @ekallman ,

The problem is that the hostname can’t be resolved. Do you have a single node or multiple nodes? What does your /etc/hosts look like?


Fabio

Thanks for reaching out, Fabio.

My hardware is under maintenance at the moment. I will check this as soon as possible.

Hello again,

I'm running a single node here, and my /etc/hosts contains:

127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

So there is no definition for hops-test-deploy.

Hi @ekallman ,

Adding the line:

[IP OF THE MACHINE] hops-test-deploy

to /etc/hosts should make the hostname resolvable by Hopsworks services.
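If it helps to confirm that the entry was saved as intended, here is a minimal sketch in plain Python (not part of Hopsworks) that looks up a hostname in hosts-file text. The helper name find_hosts_entry and the address 192.0.2.10 are made up for illustration; use the real IP of your machine and the contents of your own /etc/hosts.

```python
from typing import Optional


def find_hosts_entry(hosts_text: str, hostname: str) -> Optional[str]:
    """Return the first address mapped to `hostname` in hosts-file text, or None."""
    for line in hosts_text.splitlines():
        # Drop anything after '#', then split the rest on whitespace.
        fields = line.split("#", 1)[0].split()
        # A usable entry is an address followed by one or more names.
        if len(fields) >= 2 and hostname in fields[1:]:
            return fields[0]
    return None


# Hypothetical file contents; 192.0.2.10 is a placeholder, not your real IP.
sample = """127.0.0.1 localhost
192.0.2.10 hops-test-deploy  # placeholder address
"""

print(find_hosts_entry(sample, "hops-test-deploy"))
```

With the /etc/hosts posted earlier in this thread, the same lookup would return None, which matches the UnknownHostException in the log.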

After you have done that, my suggestion is to stop all services with:

/srv/hops/kagent/kagent/bin/shutdown-all-local-services.sh

and then restart them all with:

/srv/hops/kagent/kagent/bin/start-all-local-services.sh
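Before rerunning the job, it may be worth checking that the machine can resolve its own hostname, since the stack trace shows the failure in java.net.InetAddress.getLocalHost. The sketch below is only an approximation of that lookup using Python's standard socket module; the helper name resolve is my own, and the JVM may behave slightly differently (for example with IPv6).

```python
import socket
from typing import Optional


def resolve(hostname: str) -> Optional[str]:
    """Return an IPv4 address for `hostname`, or None if the lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised for unknown hosts.
        return None


# Look up this machine's own hostname, roughly what getLocalHost does.
local_name = socket.gethostname()
address = resolve(local_name)
if address is None:
    print(f"{local_name} does not resolve; Spark would likely fail the same way")
else:
    print(f"{local_name} resolves to {address}")
```

If the name resolves to the IP you put in /etc/hosts, the Spark job should get past the SparkConf initialization that failed in your log.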


Fabio