Which Spark timeout parameter needs to be modified?

Fernando_Marines · July 16, 2020, 11:18pm

Hello, we have some features coming from JDBC sources, and some jobs failed with this error:

ExecutorLostFailure (executor 14 exited caused by one of the running tasks) Reason: Container from a bad node: container_e06_1594496268635_0006_01_000015 on host: hopsworks0.logicalclocks.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

Do we have to modify a parameter, or what could be causing these errors?
Regards!

Theo · July 17, 2020, 9:35am

Hi @Fernando_Marines,

Is this error coming from a Hopsworks job? Can you check the out and err logs for more details?

How many resources (memory, CPU) are allocated to the job?

Fernando_Marines · July 17, 2020, 2:12pm

Hi @Theo,
Yes, it's from a Hopsworks job that reads the features from a JDBC source, using the standard parameters
(2048m driver, 4080m executor). The cluster is the default (12g), all in one VM server with 125 GB of RAM.

That feature job reads from 10 different tables in the same location and seems to reach the maximum at a certain step.
What I did was reset the kernel and the application and split the job in 2, and that seems to work fine. But I wonder which parameter needs to be increased…either the cluster, YARN …because increasing the driver and executor memory beyond the cluster maximum fails.

Theo · July 17, 2020, 2:42pm

Maybe there is too much data being stored on the driver and it runs out of memory. Can you check if the container failing is the driver or one of the executors?

You can also try setting spark.yarn.am.memoryOverhead to a value higher than the default, in the Spark properties textbox of the Hopsworks Jupyter dashboard. See http://spark.apache.org/docs/2.4.3/running-on-yarn.html
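As a rough illustration of what "higher than the default" means: the Spark 2.4 running-on-YARN page documents the default memory overhead as 10% of the requested memory, with a minimum of 384 MB. The sketch below applies that formula to the memory sizes mentioned in this thread. The helper names are hypothetical, and the rounding to a 1024 MB increment is an assumption about the YARN scheduler configuration (it matches the stock default of yarn.scheduler.minimum-allocation-mb, but your cluster may differ).

```python
import math

MIN_OVERHEAD_MB = 384  # documented floor for the memoryOverhead defaults


def default_overhead_mb(memory_mb):
    """Default overhead per the Spark 2.4 docs: 10% of memory, at least 384 MB."""
    return max(MIN_OVERHEAD_MB, memory_mb // 10)


def container_request_mb(memory_mb, overhead_mb=None, yarn_increment_mb=1024):
    """Approximate size of the YARN container Spark asks for.

    Heap plus overhead, rounded up to the scheduler increment.
    The 1024 MB increment is an assumption, not read from any cluster.
    """
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(memory_mb)
    raw_mb = memory_mb + overhead_mb
    return math.ceil(raw_mb / yarn_increment_mb) * yarn_increment_mb


# Values from this thread: 4080m executor, 2048m driver.
print(default_overhead_mb(4080))                      # 408
print(container_request_mb(4080))                     # 5120
print(container_request_mb(2048))                     # 3072
print(container_request_mb(4080, overhead_mb=2048))   # 6144
```

So a 4080m executor with the default overhead already occupies roughly a 5 GB container, and raising the overhead increases the container size that YARN must be able to grant.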

Fernando_Marines · July 17, 2020, 2:54pm

Those were the executors. I was watching the job from the app GUI and noticed it tries 4 times to execute the same thing before failing.

I ran it again, refreshing the whole time: it tries once, then that executor fails after 10 minutes and a new executor shows up.

I’m still getting familiar with all the technologies around Hopsworks, so I may need to read what you sent me and increase the parameters here.
Thanks a lot!

Fernando_Marines · July 17, 2020, 7:16pm

@Theo
I can’t find what I need. I looked through the Spark docs and related websites but could not find how to give the cluster more than 16 GB of RAM, as you can on a cloud solution. Would you know how I can modify the config?
Regards

Fernando_Marines · July 19, 2020, 5:10pm

I was able to modify the cluster default definition in YARN and restart the environment to pick up the changes, and it now has enough resources.
Regards
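For anyone hitting the same limit, here is a minimal sketch of the kind of check that explains why requests above the cluster maximum fail. The thread does not say which YARN settings were changed; the two property names used below (yarn.scheduler.maximum-allocation-mb for the largest single container, yarn.nodemanager.resource.memory-mb for the memory YARN may hand out on a node) are standard YARN properties but are only an assumption about what applies here. The function name and the example numbers are hypothetical, and a single-node cluster is assumed.

```python
def check_yarn_fit(executor_container_mb, num_executors, driver_container_mb,
                   max_allocation_mb, node_memory_mb):
    """Return a list of problems; an empty list means the job fits.

    Assumes a single-node cluster, so the node's memory is the whole budget.
    """
    problems = []
    # Each container must individually fit under the per-container maximum.
    for name, size_mb in (("executor", executor_container_mb),
                          ("driver", driver_container_mb)):
        if size_mb > max_allocation_mb:
            problems.append(
                f"{name} container of {size_mb} MB exceeds "
                f"yarn.scheduler.maximum-allocation-mb={max_allocation_mb}")
    # All containers together must fit in the memory YARN manages on the node.
    total_mb = driver_container_mb + num_executors * executor_container_mb
    if total_mb > node_memory_mb:
        problems.append(
            f"job needs {total_mb} MB in total but "
            f"yarn.nodemanager.resource.memory-mb={node_memory_mb}")
    return problems


# A 12 GB budget with two 5 GB executors and a 3 GB driver does not fit.
print(check_yarn_fit(5120, 2, 3072, max_allocation_mb=12288, node_memory_mb=12288))
# With a larger (hypothetical) budget the same job fits.
print(check_yarn_fit(5120, 2, 3072, max_allocation_mb=16384, node_memory_mb=65536))
```

If the first call reports a problem, either the job has to shrink (fewer or smaller executors, or splitting the job as described earlier in the thread) or the YARN limits have to be raised and the services restarted.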