product-crawler

JavaScript-based web crawler to crawl products from e-commerce sites

Goal

The goal of this project is to develop an Apify-based crawler that crawls all the product pages of Edeka, a German supermarket.

Requirements

  • The crawler should run locally on your machine as a Node.js-based CLI application
  • The crawler should not use a browser (please base it on Apify's CheerioCrawler)
  • The site to crawl is: https://www.edeka24.de
  • The crawler should find and download every product page available on the site (here is an example of one: https://www.edeka24.de/Wein/Sekt-Prosecco/Deutschland/Kessler-Sekt-Hochgewaechs-Chardonnay-Brut-0-75L.html)
  • All product pages found by your crawler should be saved locally to disk in a dedicated Apify KeyValueStore named "product-pages"
  • Resources that are not product pages (e.g. the home page) may be downloaded, but if so, they should be stored in a separate Apify KeyValueStore
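
A minimal sketch of what this setup could look like with Apify's CheerioCrawler is shown below; the start URL and the isProductPage check are illustrative assumptions, not the project's final logic:

const Apify = require('apify');

Apify.main(async () => {
    // Dedicated stores, as required: one for product pages, one for everything else.
    const productStore = await Apify.openKeyValueStore('product-pages');
    const otherStore = await Apify.openKeyValueStore('other-pages');

    const requestList = await Apify.openRequestList('start-urls', [
        { url: 'https://www.edeka24.de' },
    ]);

    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ request, body }) => {
            // Illustrative check only: the real logic would inspect the URL or markup.
            const isProductPage = request.url.endsWith('.html');
            const store = isProductPage ? productStore : otherStore;
            // Store keys may only contain [a-zA-Z0-9!-_.'()], so sanitize the URL.
            const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '_');
            await store.setValue(key, body, { contentType: 'text/html' });
        },
    });

    await crawler.run();
});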

Datapoints to extract from the product page

An e-commerce crawler is mainly built to track the price of a product. With this in mind, we are going to extract the following datapoints from the product page (a sketch of the extraction follows the list):

  • Entire HTML page of the product
  • Title of the product
  • Price of the product
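
The CSS selectors and the makeKey() helper in this sketch are assumptions made for illustration and would need to be verified against Edeka's actual markup:

// Inside handlePageFunction({ request, $, body }) of the CheerioCrawler.
// NOTE: '.detail-header h1' and '.price' are assumed selectors, not verified.
const title = $('.detail-header h1').text().trim();
const price = $('.price').first().text().trim();

// makeKey() is a hypothetical helper that turns a URL into a valid store key.
await productStore.setValue(makeKey(request.url), {
    title,
    price,
    webpage: body, // the entire HTML page of the product
});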

How Does the Crawler Work?

This crawler uses the sitemap to discover the product pages on the website. The sitemap.xml file is a jackpot for every web scraper, and of course for the Edeka scraper too.

The following are the steps involved in the crawler (a sketch of the sitemap step follows the list):

  • Locate the sitemap URL
  • Add all the product links to a RequestList
  • Call the handlePageFunction for each request
  • Extract the datapoints (title, price, HTML)
  • Save the datapoints in the key-value store product-store
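
A sketch of the first two steps, assuming the product sitemap is a flat list of <loc> entries (see the sitemap discussion below):

const Apify = require('apify');
const cheerio = require('cheerio');

// Download the product sitemap and collect every <loc> URL from it.
const fetchProductUrls = async () => {
    const sitemapUrl = 'https://www.edeka24.de/sitemaps/sitemap_0-products-0.xml';
    const { body } = await Apify.utils.requestAsBrowser({ url: sitemapUrl });
    const $ = cheerio.load(body, { xmlMode: true });
    return $('loc').map((i, el) => $(el).text().trim()).get();
};

// Inside Apify.main(): feed the product URLs into a RequestList for the crawler.
const urls = await fetchProductUrls();
const requestList = await Apify.openRequestList('product-urls', urls.map((url) => ({ url })));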

What is a sitemap?

A sitemap is a file where a website provides information about the pages on the site in an organised way. Search engines like Google read this file to crawl the site more efficiently.
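
For illustration, a sitemap is a plain XML file of URL entries. A schematic example following the sitemaps.org protocol (not Edeka's actual file):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/some-product.html</loc>
    <lastmod>2021-01-01</lastmod>
  </url>
</urlset>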

How to locate the sitemap?

The sitemap can usually be found at the path /sitemap.xml, or referenced from /robots.txt (the file that defines rules for crawlers on how to crawl the site). It is always worth trying those URLs first.

For Edeka, the sitemap URL is https://www.edeka24.de/sitemaps/sitemap-index.xml, which is referenced in the /robots.txt file: https://www.edeka24.de/robots.txt
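
In robots.txt, the reference appears as a standard Sitemap directive, along these lines (illustrative excerpt, not a verbatim copy of Edeka's file):

Sitemap: https://www.edeka24.de/sitemaps/sitemap-index.xml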

In that sitemap index you can clearly see that the sitemap for products is https://www.edeka24.de/sitemaps/sitemap_0-products-0.xml, which lists all the product pages. A jackpot for us, right?

Why use the sitemap for crawling?

  • Focused crawling: crawl only what you need (the product pages)
  • Fewer web requests, since there is no need to fetch every page, extract its hrefs, and filter for product links
  • An efficient way of crawling overall

How to set up the project locally?

Pre-requisites

  • Node 10.17 or higher
  • NPM
npm install -g apify-cli   # install the Apify CLI

git clone https://github.com/Jebaseelanravi/product-crawler.git
cd product-crawler/edeka-crawler

apify run -p   # run the crawler

Once you run the crawler, the results will be stored in the path ./apify_storage/key_value_stores/product_store/id.json

Sample result

{
  "title": "Bortzmeyer Traditionell Elsass Quetsch 0,7 ltr",
  "price": "10,99€",
  "webpage": "<!DOCTYPE HTML>\n\n<html lang=\"de\">
            <body>HTML content truncated for example</body>\n</html>"
}

TODO

Set the crawl delay to 5 seconds (the time between subsequent requests, to make sure our crawler does not get banned).

The crawl delay is specified in /robots.txt.
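
CheerioCrawler does not expose a crawl-delay option directly (at least in the Apify SDK versions this project targets), so one possible approach is to limit the crawler to one request at a time and sleep inside the handler; a minimal sketch:

const crawler = new Apify.CheerioCrawler({
    requestList,
    maxConcurrency: 1, // one request at a time, so the sleep acts as a real delay
    handlePageFunction: async ({ request, $, body }) => {
        // ... extract and store the datapoints as above ...
        await Apify.utils.sleep(5000); // wait 5 seconds before the next request
    },
});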