Issue Description
I am trying to build and run Sparkler from source, following the example given in the README. I have injected a URL and it is visible in Solr.
I face a problem while crawling and see the error below:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (ip-172-31-39-218.ap-southeast-1.compute.internal executor driver): org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
Relevant crawldb configuration:
crawldb.backend: solr # "solr" is default until "elasticsearch" becomes usable.
solr.uri: http://localhost:8983/solr/crawldb
How to reproduce it
Command used to inject:
java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main inject -id sjob-1 -su https://news.bbc.co.uk
Command used to crawl:
java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main crawl -id sjob-1 -tn 10 -i 1
Additional changes: I have modified Crawler.scala and added the following code at line 171:
conf.set("spark.io.compression.codec", "snappy")
Please let me know how to pass spark-conf in the runtime configuration so that I can avoid doing this.
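One idea I have not verified: SparkConf loads any JVM system property starting with spark. when it is constructed with defaults enabled, so if Crawler.scala builds its SparkConf that way, perhaps the codec can be passed as a -D flag instead of a source change:
java -Xms1g -Dspark.io.compression.codec=snappy -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main crawl -id sjob-1 -tn 10 -i 1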
Environment and Version Information
Java Version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
Spark Version - 3.0.3, Scala version 2.12.10
Operating System name and version - AWS Instance based on 20.04.1-Ubuntu
Solr - 8.5.0 (in local mode)
I see the ContentHash object in the sparkler-core code, but I do not see it being injected into Solr, so why is it expected while fetching? I see the same error in solr.log:
2022-01-31 04:44:54.871 ERROR (qtp1984990929-17) [ x:crawldb] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:226)
at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:109)
at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:977)
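To check whether the deployed crawldb schema actually defines the field, the Schema API can be queried (assuming the core exposes a managed schema; this should return the field definition, or an error if the field is absent):
curl "http://localhost:8983/solr/crawldb/schema/fields/contenthash"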
Stack trace from the Sparkler crawl:
04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManagerMaster - Updated info of block rdd_7_0
04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManager - Told master about block rdd_7_0
04:44:54.880 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 3.0 (TID 3)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:665)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:156)
at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResource(SolrProxy.scala:121)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:158)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:37)
at scala.collection.Iterator.toStream(Iterator.scala:1417)
at scala.collection.Iterator.toStream$(Iterator.scala:1416)
at edu.usc.irds.sparkler.pipeline.FairFetcher.toStream(FairFetcher.scala:37)
at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:336)
at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:336)
at edu.usc.irds.sparkler.pipeline.FairFetcher.toSeq(FairFetcher.scala:37)
at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$3(Crawler.scala:258)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1418)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1345)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1409)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1230)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
04:44:54.908 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.executor.ExecutorMetricsPoller - removing (3, 0) from stageTCMP
I tried a workaround by removing (commenting out) this line in StatusUpdateSolrTransformer:
//Constants.storage.CONTENTHASH -> ContentHash.fetchHash(data.fetchedData.getContent)
And it works for me for now, but my hunch is that there is a better solution and maybe I am missing something in the configuration.
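A less invasive fix might be to add the missing field to the crawldb schema instead of dropping the hash. Here is a sketch via Solr's Schema API, assuming the core uses a managed schema (the Solr 8 default); the string type and the stored/indexed flags are my assumptions, and the authoritative definition should come from Sparkler's shipped configset:
curl -X POST -H 'Content-type:application/json' http://localhost:8983/solr/crawldb/schema -d '{"add-field": {"name": "contenthash", "type": "string", "stored": true, "indexed": true}}'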
Thanks for replying. Yes, I could see the webpage content was fetched correctly. I injected two URLs in total (additionally: edition.cnn.com) and both were fetched and stored correctly in Solr; there were 300+ docs across the two sources (websites).