Hello there,
I am backfilling an offline feature store using materialization jobs, and everything was working fine until this exception first appeared in the stderr logs:
2024-08-31 17:12:13,565 WARN fsrigolearning,ohlc_feature_group_1_offline_fg_materialization,1151014,application_1723634500689_19958 TaskSetManager: Lost task 0.0 in stage 24.0 (TID 42) (ip-172-16-4-64.us-east-2.compute.internal executor 1): org.apache.hudi.exception.HoodieIOException: Failed to read from Parquet file hopsfs://172.16.4.90:8020/apps/hive/warehouse/fsrigolearning_featurestore.db/ohlc_feature_group_1/dae9225c-0bd6-4953-9230-177a5015a965-0_0-17-46_20240831171027609.parquet
at org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:181)
at org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:196)
at org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:147)
at org.apache.hudi.io.HoodieKeyLocationFetchHandle.locations(HoodieKeyLocationFetchHandle.java:62)
at org.apache.hudi.index.simple.HoodieSimpleIndex.lambda$fetchRecordLocations$33972fb4$1(HoodieSimpleIndex.java:155)
at org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:122)
at org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: File does not exist: hopsfs://172.16.4.90:8020/apps/hive/warehouse/fsrigolearning_featurestore.db/ohlc_feature_group_1/dae9225c-0bd6-4953-9230-177a5015a965-0_0-17-46_20240831171027609.parquet
at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1338)
at org.apache.hadoop.hdfs.DistributedFileSystem$21.doCall(DistributedFileSystem.java:1330)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1346)
at org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:320)
at org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:178)
… 20 more
Since then, every materialization job has failed with the same error.
While searching for a solution, I found a Hudi support issue that is pretty similar to what I am facing: [SUPPORT] File does not exisit(parquet) while reading Hudi Table from Spark · Issue #2098 · apache/hudi · GitHub
Nothing else is ingesting data into this feature store, and I was submitting one job to Hopsworks roughly every 1.5 minutes. I can't find any configuration file I could use to manage the ingestion process.
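For context, my backfill loop is roughly the following minimal sketch (the feature group name/version, the `wait_for_job` option, and `load_next_batch` are simplified placeholders, not my exact code):

```python
import time
import hopsworks

def load_next_batch():
    # Placeholder for my own batching logic; yields one pandas DataFrame per batch.
    ...

project = hopsworks.login()
fs = project.get_feature_store()

# Assumption: the offline table ohlc_feature_group_1 in the logs corresponds to
# feature group "ohlc_feature_group", version 1.
fg = fs.get_feature_group("ohlc_feature_group", version=1)

for df in load_next_batch():
    # Each insert triggers an offline materialization job
    # (the ..._offline_fg_materialization job name seen in the stack trace).
    fg.insert(df, write_options={"wait_for_job": False})
    time.sleep(90)  # roughly one submission every 1.5 minutes
```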
Since I am almost done with the initial load for my project, I wonder if there is a way to make the feature store simply forget about this file and let me proceed (if the data in it is lost, so be it; I can backfill the information it used to hold later on). I would like to avoid deleting the feature store altogether and starting over.
Does anyone have an idea of what to do in this case?
Thank you in advance