This repository contains a Python-based web scraper for extracting detailed specifications of mobile phones from GSMArena. It utilizes Playwright and BeautifulSoup for robust data extraction and supports multi-threaded execution for efficient scraping.
- Progress Saving: Ensures data is not lost and scraping can resume from the last saved point in case of interruptions.
- Concurrent Scraping: Uses ThreadPoolExecutor to scrape multiple pages concurrently.
- Comprehensive Data Extraction: Extracts various phone specifications including model name, release date, OS details, CPU/GPU information, and more.
- Custom Logging: Provides detailed logs of the scraping process for monitoring and debugging.
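The progress-saving, concurrency, and logging features above can be pictured with a minimal sketch. It is not the code in scraper.py: `fetch_specs`, `scrape_all`, `load_progress`, `save_progress`, and the progress-file layout are hypothetical names chosen for illustration, and the stand-in `fetch_specs` does no network access. The sketch assumes progress is a pickled set of finished URLs, which is only one plausible way to combine Pickle with ThreadPoolExecutor.

```python
import logging
import pickle
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("scraper_sketch")  # hypothetical logger name


def load_progress(path):
    """Return the set of URLs already scraped, or an empty set on a first run."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    return set()


def save_progress(path, done):
    """Pickle the finished-URL set to a temp file, then swap it into place."""
    path = Path(path)
    tmp = path.with_suffix(".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(done, fh)
    tmp.replace(path)


def fetch_specs(url):
    """Hypothetical stand-in for the real Playwright/BeautifulSoup page scrape."""
    return {"url": url, "model": url.rsplit("/", 1)[-1]}


def scrape_all(urls, progress_path, max_workers=4):
    """Scrape URLs not yet recorded as done, saving progress after each one."""
    done = load_progress(progress_path)
    pending = [u for u in urls if u not in done]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_specs, u): u for u in pending}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results.append(future.result())
            except Exception:
                # Log the traceback and leave the URL unmarked so a rerun retries it.
                logger.exception("Failed to scrape %s", url)
                continue
            done.add(url)
            # Progress is written from the main thread only, so no lock is needed.
            save_progress(progress_path, done)
            logger.info("Scraped %s", url)
    return results
```

Calling `scrape_all(urls, "progress.pkl")` a second time with the same list returns an empty result, because every URL is already in the saved set. An interrupted run therefore repeats only the pages that had not finished.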
- Python 3.7+
- Playwright
- BeautifulSoup4
- Requests
- Logging (Python standard library)
- Pickle (Python standard library)
- Clone the repository:
    git clone https://github.com/ahthserhsluk/GSMARENA-Mobile-Data-Scapper.git
    cd GSMARENA-Mobile-Data-Scapper
- Install the required packages:
    pip install -r requirements.txt
- Install Playwright browsers:
    playwright install
- Update the main block in scraper.py with the desired manufacturer and start URL:
    if __name__ == "__main__":
        manufacturer = "Nokia"  # Replace with the desired manufacturer
        start_url = "https://www.gsmarena.com/nokia-phones-1.php"
        end_page = 5  # Set an end page, or None to scrape all pages
        main(manufacturer, start_url, end_page)
- Run the scraper:
    python scraper.py
- The scraped data will be saved to a CSV file in the manufacturer's directory.
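Once a run finishes, the CSV can be inspected with the standard library alone. The sketch below is illustrative: the file path and the `os` column name in the usage note are assumptions, since the exact output path and column headers depend on what scraper.py writes. `load_phones` and `count_by_column` are hypothetical helper names.

```python
import csv
from collections import Counter
from pathlib import Path


def load_phones(csv_path):
    """Read the scraped CSV into a list of dicts keyed by the header row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def count_by_column(rows, column):
    """Count rows per value of one column; empty or missing values become 'unknown'."""
    return Counter(row.get(column) or "unknown" for row in rows)
```

For example, `count_by_column(load_phones("data/Nokia/Nokia.csv"), "os")` would tally phones per operating system, assuming the file lives at that path and has a column with that name.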
- scraper.py: The main script containing the scraping logic.
- requirements.txt: The dependencies required to run the scraper.
- logs/: Directory where logs are saved.
- data/: Directory where the scraped CSV files are saved.
- Fork the repository.
- Create your feature branch (git checkout -b feature/your-feature).
- Commit your changes (git commit -m 'Add some feature').
- Push to the branch (git push origin feature/your-feature).
- Open a Pull Request.
This project is licensed under the MIT License.