Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=<>] unknown field 'contenthash' #247

Open
ravindrabajpai opened this issue Jan 31, 2022 · 3 comments

ravindrabajpai commented Jan 31, 2022

Issue Description

I am trying to build and run Sparkler from source, following the example in the README. I injected a URL, and it is visible in Solr.
However, crawling fails with the following error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (ip-172-31-39-218.ap-southeast-1.compute.internal executor driver): org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'

How to reproduce it

  1. git clone the main branch.
  2. Build sparkler-core.
  3. Modify /home/ubuntu/sparkler/sparkler-core/build/conf/sparkler-default.yaml:
    crawldb.backend: solr  # "solr" is default until "elasticsearch" becomes usable.
    solr.uri: http://localhost:8983/solr/crawldb
  4. Run the following command to inject:
    java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main inject -id sjob-1 -su https://news.bbc.co.uk
  5. Run the following command to crawl:
    java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main crawl -id sjob-1 -tn 10 -i 1

Additional changes: I modified Crawler.scala and added the following code at line 171:
conf.set("spark.io.compression.codec", "snappy")
Please let me know how to pass Spark configuration at runtime so that I can avoid this source change.
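
One possible way to avoid the source change, sketched under an assumption: Spark's default `SparkConf` constructor picks up any `spark.*` Java system property, so a `-D` flag on the `java` command line may be enough. This only works if Sparkler builds its `SparkConf` with defaults loaded, which the thread does not confirm. The `spark_props` helper and the `BUILD` variable below are hypothetical names used for illustration.

```shell
#!/bin/sh
# Assumption: the build directory from the report; override BUILD if yours differs.
BUILD="${BUILD:-/home/ubuntu/sparkler/sparkler-core/build}"

# Hypothetical helper: turn "key=value" arguments into -Dkey=value JVM flags.
spark_props() {
  out=""
  for kv in "$@"; do
    out="$out -D$kv"
  done
  # Drop the leading space before printing.
  printf '%s\n' "${out# }"
}

# Launch the crawl only when explicitly asked, so the helper can be sourced safely.
if [ "${1:-}" = "--crawl" ]; then
  CP="$BUILD/conf:$(echo "$BUILD"/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':')"
  # $(spark_props ...) is left unquoted on purpose so each flag is a separate argument.
  java -Xms1g $(spark_props spark.io.compression.codec=snappy) \
    -cp "$CP" -Dpf4j.pluginsDir="$BUILD/plugins" \
    edu.usc.irds.sparkler.Main crawl -id sjob-1 -tn 10 -i 1
fi
```

If the property is not picked up, the `conf.set(...)` change in Crawler.scala remains the known-working route.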

Environment and Version Information

Please indicate relevant versions, including, if relevant:

  • Java Version
    openjdk version "1.8.0_312"
    OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07)
    OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

  • Spark Version - 3.0.3, Scala version 2.12.10

  • Operating System name and version - AWS Instance based on 20.04.1-Ubuntu

  • Solr - 8.5.0 (in local mode)

I see the ContentHash object in the sparkler-core code, but do not see it getting injected into Solr, so why is it expected while fetching? I see the same error in solr.log:

2022-01-31 04:44:54.871 ERROR (qtp1984990929-17) [   x:crawldb] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:226)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:109)
        at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:977)
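
The `unknown field` message is raised by Solr when a document contains a field that the core's schema does not define. A minimal sketch for checking this, assuming the Solr Schema API is reachable at the `solr.uri` from the report (`field_url` is a hypothetical helper name):

```shell
#!/bin/sh
# Assumption: same value as solr.uri in sparkler-default.yaml.
SOLR_URI="${SOLR_URI:-http://localhost:8983/solr/crawldb}"

# Hypothetical helper: build the Schema API URL for a single field.
field_url() {
  # $1 = field name, e.g. contenthash
  printf '%s/schema/fields/%s\n' "$SOLR_URI" "$1"
}

# Query Solr only when explicitly asked, so the helper can be sourced safely.
if [ "${1:-}" = "--check" ]; then
  # -f makes curl exit non-zero when Solr answers with an HTTP error status.
  if curl -sf "$(field_url contenthash)" >/dev/null; then
    echo "contenthash is defined in the schema"
  else
    echo "contenthash is not defined in the schema (or Solr is unreachable)"
  fi
fi
```

Running the script with `--check` against the local Solr instance shows whether the running core knows the field at all.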

Stack trace from the Sparkler crawl:

04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManagerMaster - Updated info of block rdd_7_0
04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManager - Told master about block rdd_7_0
04:44:54.880 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 3.0 (TID 3)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:665)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
	at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177)
	at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
	at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:156)
	at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResource(SolrProxy.scala:121)
	at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:158)
	at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:37)
	at scala.collection.Iterator.toStream(Iterator.scala:1417)
	at scala.collection.Iterator.toStream$(Iterator.scala:1416)
	at edu.usc.irds.sparkler.pipeline.FairFetcher.toStream(FairFetcher.scala:37)
	at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:336)
	at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:336)
	at edu.usc.irds.sparkler.pipeline.FairFetcher.toSeq(FairFetcher.scala:37)
	at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$3(Crawler.scala:258)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1418)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1345)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1409)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1230)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
04:44:54.908 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.executor.ExecutorMetricsPoller - removing (3, 0) from stageTCMP

ravindrabajpai (Author) commented

As a workaround, I commented out this line in StatusUpdateSolrTransformer:
//Constants.storage.CONTENTHASH -> ContentHash.fetchHash(data.fetchedData.getContent)

With that change, the crawl works for me for now.

But my hunch is that there is a better solution and maybe I am missing something in the configurations.
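
An alternative to removing the line could be to add the missing field to the core's schema instead. This is only a sketch under stated assumptions: that the core uses a managed (mutable) schema, and that `string` is a suitable type for the hash. Neither is confirmed in this thread, and `add_field_payload` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumption: same value as solr.uri in sparkler-default.yaml.
SOLR_URI="${SOLR_URI:-http://localhost:8983/solr/crawldb}"

# Hypothetical helper: build a Schema API "add-field" request body.
add_field_payload() {
  # $1 = field name, $2 = field type (must already exist in the schema)
  printf '{"add-field":{"name":"%s","type":"%s","indexed":true,"stored":true}}\n' "$1" "$2"
}

# Send the request only when explicitly asked, so the helper can be sourced safely.
if [ "${1:-}" = "--apply" ]; then
  # The "string" type is an assumption; pick the type the project's schema intends.
  curl -s -X POST -H 'Content-type:application/json' \
    --data-binary "$(add_field_payload contenthash string)" \
    "$SOLR_URI/schema"
fi
```

If the project ships its own Solr configuration for crawldb, comparing the running core's schema with that configuration is probably a better first step than editing the schema by hand.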

lewismc (Member) commented Feb 2, 2022

Hi @ravindrabajpai, thanks for reporting the bug!

> I see the ContentHash object in the sparkler-core code, but do not see it getting injected into Solr,

The content signature cannot be calculated during the inject phase because it is based on the web page content rather than the URL.

> so why is it expected while fetching?

I suspect it is expected after fetching but before indexing.

> But my hunch is that there is a better solution and maybe I am missing something in the configurations.

Can you check that the web page content was actually fetched?
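
One way to run that check is to count the fetched documents for the job directly in Solr. The sketch below assumes the crawldb documents carry `crawl_id` and `status` fields and that fetched pages are marked `FETCHED`; these field names and values are assumptions, not confirmed in this thread. `fetched_count_url` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumption: same value as solr.uri in sparkler-default.yaml.
SOLR_URI="${SOLR_URI:-http://localhost:8983/solr/crawldb}"

# Hypothetical helper: build a select URL that matches fetched docs of one job.
fetched_count_url() {
  # $1 = job id passed to -id, e.g. sjob-1
  printf '%s/select?q=crawl_id:%s&fq=status:FETCHED&rows=0\n' "$SOLR_URI" "$1"
}

# Query Solr only when explicitly asked, so the helper can be sourced safely.
if [ "${1:-}" = "--run" ]; then
  # rows=0 returns no documents; numFound in the response is the match count.
  curl -s "$(fetched_count_url sjob-1)"
fi
```

A `numFound` of zero would suggest that nothing was fetched for that job, or that the assumed field names do not match the schema.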

ravindrabajpai (Author) commented

Hi @lewismc,

Thanks for replying. Yes, I could see that the web page content was fetched correctly. I injected two URLs in total (the additional one being edition.cnn.com), and both were fetched and stored correctly in Solr. About 300 or more documents were stored for the two sites.

All the steps I followed are listed here: https://github.com/ravindrabajpai/ana/blob/main/ground_zero

Thanks.
