Download FREE eBook every day from www.packtpub.com
This crawler automates the following steps:
- log in to your private account
- claim the daily free eBook
- parse title, description and useful information
- download favorite format .pdf .epub .mobi
- download source code and book cover
- upload files to Google Drive or via scp
- store data on Firebase
- notify via email, IFTTT or Join (on success and errors)
- schedule daily job on Heroku or with Docker
# upload pdf to drive, store data and notify via email
python script/spider.py -c config/prod.cfg -u drive -s firebase -n gmail
# download all formats
python script/spider.py --config config/prod.cfg --all
# download only one format: pdf|epub|mobi
python script/spider.py --config config/prod.cfg --type pdf
# also download additional material: source code (if it exists) and book cover
python script/spider.py --config config/prod.cfg -t pdf --extras
# equivalent (default is pdf)
python script/spider.py -c config/prod.cfg -e
# download and then upload to Drive (anyone with the download URL can download the file)
python script/spider.py -c config/prod.cfg -t epub --upload drive
python script/spider.py --config config/prod.cfg --all --extras --upload drive
# download and notify: gmail|ifttt|join
python script/spider.py -c config/prod.cfg --notify gmail
# only claim book (no downloads):
python script/spider.py -c config/prod.cfg --notify gmail --claimOnly
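The flags used in the commands above could be modeled with argparse roughly as follows. This is only a sketch of the documented options, not the project's actual parser: the function name `build_parser`, the help texts, and the choice to make `--config` required are assumptions.

```python
import argparse


def build_parser():
    # Hypothetical parser that mirrors only the flags shown in the usage examples.
    parser = argparse.ArgumentParser(description="Claim and download the daily free eBook")
    # Assumption: --config is treated as required here.
    parser.add_argument("-c", "--config", required=True, help="path to the config file")
    parser.add_argument("-t", "--type", choices=["pdf", "epub", "mobi"], default="pdf")
    parser.add_argument("--all", action="store_true", help="download all formats")
    parser.add_argument("-e", "--extras", action="store_true", help="source code and book cover")
    parser.add_argument("-u", "--upload", choices=["drive", "scp"])
    parser.add_argument("-s", "--store", choices=["firebase"])
    parser.add_argument("-n", "--notify", choices=["gmail", "ifttt", "join"])
    parser.add_argument("--claimOnly", action="store_true", help="claim without downloading")
    return parser


args = build_parser().parse_args(["-c", "config/prod.cfg", "-e", "--upload", "drive"])
print(args.type, args.extras, args.upload)  # pdf True drive
```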
Before you start, you should:
- Verify that your currently installed version of Python is 2.x with
python --version
- Clone the repository
git clone https://github.com/niqdev/packtpub-crawler.git
- Install all the dependencies (you might need sudo privileges)
pip install -r requirements.txt
- Create a config file
cp config/prod_example.cfg config/prod.cfg
- Change your Packtpub credentials in the config file
[credential]
credential.email=PACKTPUB_EMAIL
credential.password=PACKTPUB_PASSWORD
Now you should be able to claim and download your first eBook
python script/spider.py --config config/prod.cfg
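Before the first run it can help to check that the placeholders in the `[credential]` section were actually replaced. The sketch below is one way to do that, assuming Python 3's `configparser` (on Python 2.x the module is named `ConfigParser` and has no `read_string`); the helper name `unset_credentials` is hypothetical and not part of the crawler.

```python
import configparser

# Inline copy of the template section, used only to keep the example self-contained.
SAMPLE = """
[credential]
credential.email=PACKTPUB_EMAIL
credential.password=PACKTPUB_PASSWORD
"""

PLACEHOLDERS = {"PACKTPUB_EMAIL", "PACKTPUB_PASSWORD"}


def unset_credentials(parser):
    # Return the option names that still hold the template placeholders.
    return sorted(
        name for name, value in parser.items("credential")
        if value in PLACEHOLDERS
    )


# RawConfigParser avoids interpolation, so a "%" in a password is read literally.
parser = configparser.RawConfigParser()
parser.read_string(SAMPLE)
print(unset_credentials(parser))  # ['credential.email', 'credential.password']
```

For a real file you would call `parser.read("config/prod.cfg")` instead of `read_string`.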
From the documentation, Drive API requires OAuth2.0 for authentication, so to upload files you should:
- Go to Google APIs Console and create a new Drive project named PacktpubDrive
- On API manager > Overview menu
- Enable Google Drive API
- On API manager > Credentials menu
- In OAuth consent screen tab set PacktpubDrive as the product name shown to users
- In Credentials tab create credentials of type OAuth client ID and choose Application type Other named PacktpubDriveCredentials
- Click Download JSON and save the file
config/client_secrets.json
- Change your Drive credentials in the config file
[drive]
...
drive.client_secrets=config/client_secrets.json
[email protected]
Now you should be able to upload your eBook to Drive
python script/spider.py --config config/prod.cfg --upload drive
Only the first time, you will be prompted to log in through a browser with JavaScript enabled (text-based browsers will not work) in order to generate config/auth_token.json.
You should also copy the FOLDER_ID into the config; otherwise a new folder with the same name will be created on every run.
[drive]
...
drive.default_folder=packtpub
drive.upload_folder=FOLDER_ID
Documentation: OAuth, Quickstart, example and permissions
To upload your eBook via scp to a remote server, update the configs
[scp]
scp.host=SCP_HOST
scp.user=SCP_USER
scp.password=SCP_PASSWORD
scp.path=SCP_UPLOAD_PATH
Now you should be able to upload your eBook
python script/spider.py --config config/prod.cfg --upload scp
Note:
- the destination folder scp.path on the remote server must exist in advance
- the option --upload scp is incompatible with --store and --notify
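The incompatibility in the note above can be caught before any download starts. A minimal sketch of such a pre-flight check follows; the function `check_upload_options` is hypothetical and only encodes the rule stated in the note, not how the crawler itself handles it.

```python
def check_upload_options(upload=None, store=None, notify=None):
    # Hypothetical pre-flight check: reject the combination the note rules out.
    if upload == "scp" and (store or notify):
        raise ValueError("--upload scp cannot be combined with --store or --notify")
    return True


print(check_upload_options(upload="drive", store="firebase", notify="gmail"))  # True
```

Whether scp.path exists on the remote server cannot be verified locally, so that part of the note still has to be checked by hand.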
Create a new Firebase project, copy the database secret from your settings
https://console.firebase.google.com/project/PROJECT_NAME/settings/database
and update the configs
[firebase]
firebase.database_secret=DATABASE_SECRET
firebase.url=https://PROJECT_NAME.firebaseio.com
Now you should be able to store your eBook details on Firebase
python script/spider.py --config config/prod.cfg --upload drive --store firebase
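To illustrate how the two Firebase settings fit together: the Realtime Database REST API addresses a node by appending `.json` to its path and accepts a database secret in the `auth` query parameter. The sketch below only builds the request and sends nothing. Whether the crawler talks to the REST API directly or through a client library is not stated here, and both the `books` path and the `firebase_request` helper are made-up examples.

```python
import json
from urllib.parse import urlencode


def firebase_request(base_url, secret, path, data):
    # Build (url, body) for a POST to the Realtime Database REST API.
    url = "{}/{}.json?{}".format(
        base_url.rstrip("/"), path.strip("/"), urlencode({"auth": secret})
    )
    body = json.dumps(data).encode("utf-8")
    return url, body


url, body = firebase_request(
    "https://PROJECT_NAME.firebaseio.com", "DATABASE_SECRET", "books", {"title": "Example"}
)
print(url)  # https://PROJECT_NAME.firebaseio.com/books.json?auth=DATABASE_SECRET
```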
To send a notification via email using Gmail you should:
- Allow "less secure apps" and "DisplayUnlockCaptcha" on your account
- Troubleshoot sign-in problems and examples
- Change your Gmail credentials in the config file
[gmail]
...
[email protected]
gmail.password=EMAIL_PASSWORD
[email protected]
[email protected],[email protected]
Now you should be able to notify your accounts
python script/spider.py --config config/prod.cfg --notify gmail
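For reference, a Gmail notification boils down to an authenticated SMTP session over STARTTLS. The following is a minimal sketch with the standard library, not the crawler's own notifier: the subject and body wording and the function names are assumptions, and `send` is defined but not called because it needs real credentials and network access.

```python
import smtplib
from email.mime.text import MIMEText


def build_message(sender, recipients, title):
    # Compose a plain-text notification; the wording is illustrative only.
    msg = MIMEText("Claimed today's free eBook: {}".format(title))
    msg["Subject"] = "[packtpub-crawler] {}".format(title)
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    return msg


def send(msg, username, password, recipients):
    # Gmail's SMTP endpoint on port 587 expects STARTTLS before login.
    server = smtplib.SMTP("smtp.gmail.com", 587)
    try:
        server.starttls()
        server.login(username, password)
        server.sendmail(msg["From"], recipients, msg.as_string())
    finally:
        server.quit()


msg = build_message("from@example.com", ["to1@example.com", "to2@example.com"], "Some Title")
print(msg["To"])  # to1@example.com, to2@example.com
```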
- Get an account on IFTTT
- Go to your Maker settings and activate the channel
- Create a new applet using the Maker service with the trigger "Receive a web request" and the event name "packtpub-crawler"
- Change your IFTTT key in the config file
[ifttt]
ifttt.event_name=packtpub-crawler
ifttt.key=IFTTT_MAKER_KEY
Now you should be able to trigger the applet
python script/spider.py --config config/prod.cfg --notify ifttt
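The "Receive a web request" trigger is fired by a POST to the Maker endpoint, which is built from the event name and key and accepts up to three values (`value1`, `value2`, `value3`) in a JSON body. The sketch below shows how the two config values map onto that request; which book fields go into which value is an assumption, and `ifttt_request` is a hypothetical helper.

```python
import json


def ifttt_request(event_name, key, title, description, url):
    # Build the webhook endpoint and JSON payload without sending anything.
    endpoint = "https://maker.ifttt.com/trigger/{}/with/key/{}".format(event_name, key)
    # Assumption: title, description and URL are mapped to value1..value3.
    payload = json.dumps({"value1": title, "value2": description, "value3": url})
    return endpoint, payload


endpoint, payload = ifttt_request(
    "packtpub-crawler", "IFTTT_MAKER_KEY", "Some Title", "Some description", "https://example.com"
)
print(endpoint)  # https://maker.ifttt.com/trigger/packtpub-crawler/with/key/IFTTT_MAKER_KEY
```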
- Get the Join Chrome extension and/or App
- You can find your device ids here
- (Optional) You can use multiple devices or groups (group.all, group.android, group.chrome, group.windows10, group.phone, group.tablet, group.pc) separated by comma
- Change your Join credentials in the config file
[join]
join.device_ids=DEVICE_IDS_COMMA_SEPARATED_OR_GROUP_NAME
join.api_key=API_KEY
Now you should be able to trigger the event
python script/spider.py --config config/prod.cfg --notify join
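Since join.device_ids may mix individual device ids with group names, here is a small sketch of how such a comma-separated value could be split. The helper `split_targets` is hypothetical and only relies on the `group.` prefix listed above; it says nothing about how the crawler or the Join API treats the value.

```python
def split_targets(device_ids):
    # Separate a comma-separated setting into device ids and group names.
    targets = [t.strip() for t in device_ids.split(",") if t.strip()]
    groups = [t for t in targets if t.startswith("group.")]
    devices = [t for t in targets if not t.startswith("group.")]
    return devices, groups


print(split_targets("abc123, group.android,def456"))
# (['abc123', 'def456'], ['group.android'])
```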
Create a new branch
git checkout -b heroku-scheduler
Update the .gitignore and commit your changes
# remove
config/prod.cfg
config/client_secrets.json
config/auth_token.json
# add
dev/
config/dev.cfg
config/prod_example.cfg
Create, config and deploy the scheduler
heroku login
# create a new app
heroku create APP_NAME
# or if you already have an existing app
heroku git:remote -a APP_NAME
# deploy your app
git push -u heroku heroku-scheduler:master
heroku ps:scale clock=1
# useful commands
heroku ps
heroku logs --ps clock.1
heroku logs --tail
heroku run bash
Update script/scheduler.py with your own preferences.
More info about Heroku Scheduler, Clock Processes, Add-on and APScheduler
Build your image
docker build -t niqdev/packtpub-crawler:2.2.0 .
Run manually
docker run \
--rm \
--name my-packtpub-crawler \
niqdev/packtpub-crawler:2.2.0 \
python script/spider.py --config config/prod.cfg
Run scheduled crawler in background
docker run \
--detach \
--name my-packtpub-crawler \
niqdev/packtpub-crawler:2.2.0
# useful commands
docker exec -i -t my-packtpub-crawler bash
docker logs -f my-packtpub-crawler
Alternatively, you can pull this fork from Docker Hub
docker pull kuchy/packtpub-crawler
Add this to your crontab to run the job daily at 9 AM:
crontab -e
00 09 * * * cd PATH_TO_PROJECT/packtpub-crawler && /usr/bin/python script/spider.py --config config/prod.cfg >> /tmp/packtpub.log 2>&1
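The `00 09 * * *` expression means "minute 0 of hour 9, every day". If you want to double-check when the next run will happen, the following sketch computes the delay until the next occurrence of a daily time. It is an illustration of the schedule only, with a hypothetical helper name, and is unrelated to script/scheduler.py.

```python
from datetime import datetime, timedelta


def seconds_until_next_run(now, hour=9, minute=0):
    # Next occurrence of hour:minute, today if still ahead, otherwise tomorrow.
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)
    return (target - now).total_seconds()


print(seconds_until_next_run(datetime(2017, 1, 1, 8, 30)))  # 1800.0
```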
The script also downloads the free eBooks from the weekly Packtpub newsletter.
The URL is generated by a Google Apps Script which parses all the mails.
You can get the code here. If you want to see the actual script, clone the spreadsheet and go to Tools > Script editor... in the menu.
To use your own source, change this setting in the config
url.bookFromNewsletter=https://goo.gl/kUciut
The URL should point to a file containing only the URL (no semicolons, HTML, JSON, etc).
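If you host your own source, you can check that it serves a bare URL with something like the sketch below. The function `parse_newsletter_url` is hypothetical and stricter than the crawler may be: it accepts the content only if it is a single http(s) URL with nothing else around it.

```python
from urllib.parse import urlparse


def parse_newsletter_url(content):
    # Accept the file content only if it is a single bare http(s) URL.
    text = content.strip()
    if not text or len(text.split()) != 1:
        raise ValueError("expected exactly one URL")
    if text.endswith(";"):
        raise ValueError("trailing semicolon is not allowed")
    parsed = urlparse(text)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("not an http(s) URL: {!r}".format(text))
    return text


print(parse_newsletter_url("https://www.packtpub.com/some/book\n"))
# https://www.packtpub.com/some/book
```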
You can also clone the spreadsheet to use your own Gmail account. Subscribe to the newsletter (on the bottom of the page) and create a filter to tag your mails accordingly.
Run a simple static server with
node dev/server.js
and test the crawler with
python script/spider.py --dev --config config/dev.cfg --all
This project is just a Proof of Concept and not intended for any illegal usage. I'm not responsible for any damage or abuse, use it at your own risk.