🐕 ScrapPY: PDF Scraping Made Easy

ScrapPY is a Python utility for scraping manuals, documents, and other sensitive PDFs to generate targeted wordlists that can be utilized by offensive security tools to perform brute force, forced browsing, and dictionary attacks. ScrapPY performs word frequency, entropy, and metadata analysis, and can run in full output modes to craft custom wordlists for targeted attacks. The tool dives deep to discover keywords and phrases leading to potential passwords or hidden directories, outputting to a text file that is readable by tools such as Hydra, Dirb, and Nmap. Expedite initial access, vulnerability discovery, and lateral movement with ScrapPY!

Demo:

ScrapPY.mov

Install:

Download Repository:

$ mkdir ScrapPY
$ cd ScrapPY/
$ sudo git clone https://github.com/RoseSecurity/ScrapPY.git

Install Dependencies:

$ python3 -m venv venv
$ source .venv/bin/activate
$ pip3 install -r requirements.txt

ScrapPY Usage:

usage: ScrapPY.py [-h] [-f FILE] [-m {word-frequency,full,metadata,entropy}] [-o OUTPUT]

Output metadata of document:

$ python3 ScrapPY.py -f example.pdf -m metadata

Output top 100 frequently used keywords to a file name Top_100_Keywords.txt:

$ python3 ScrapPY.py -f example.pdf -m word-frequency -o Top_100_Keywords.txt

Output all keywords to default ScrapPY.txt file:

$ python3 ScrapPY.py -f example.pdf

Output top 100 keywords with highest entropy rating:

$ python3 ScrapPY.py -f example.pdf -m entropy

ScrapPY Output:

# ScrapPY outputs the ScrapPY.txt file or specified name file to the directory in which the tool was ran. To view the first fifty lines of the file, run this command:

$ head -50 ScrapPY.txt

# To see how many words were generated, run this command:

$ wc -l ScrapPY.txt

Integration with Offensive Security Tools:

Easily integrate with tools such as Dirb to expedite the process of discovering hidden subdirectories:

root@RoseSecurity:~# dirb http://192.168.1.123/ /root/ScrapPY/ScrapPY.txt

-----------------
DIRB v2.21
By The Dark Raver
-----------------

START_TIME: Fri May 16 13:41:45 2014
URL_BASE: http://192.168.1.123/
WORDLIST_FILES: /root/ScrapPY/ScrapPY.txt

-----------------

GENERATED WORDS: 4592

---- Scanning URL: http://192.168.1.123/ ----
==> DIRECTORY: http://192.168.1.123/vi/
+ http://192.168.1.123/programming (CODE:200|SIZE:2726)
+ http://192.168.1.123/s7-logic/ (CODE:403|SIZE:1122)
==> DIRECTORY: http://192.168.1.123/config/
==> DIRECTORY: http://192.168.1.123/docs/
==> DIRECTORY: http://192.168.1.123/external/

Utilize ScrapPY with Hydra for advanced brute force attacks:

root@RoseSecurity:~# hydra -l root -P /root/ScrapPY/ScrapPY.txt -t 6 ssh://192.168.1.123
Hydra v7.6 (c)2013 by van Hauser/THC & David Maciejak - for legal purposes only

Hydra (http://www.thc.org/thc-hydra) starting at 2014-05-19 07:53:33
[DATA] 6 tasks, 1 server, 1003 login tries (l:1/p:1003), ~167 tries per task
[DATA] attacking service ssh on port 22

Enhance Nmap scripts with ScrapPY wordlists:

nmap -p445 --script smb-brute.nse --script-args userdb=users.txt,passdb=ScrapPY.txt 192.168.1.123

Future Development:

Allow for custom output file naming and increased verbosity
Integrate different modes of operation including word frequency analysis
Allow for metadata analysis
Search for high-entropy data
Prepare packaging for homebrew installation
Search for path-like data
Implement image OCR to enumerate data from images in PDFs
Allow for processing of multiple PDFs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

🐕 ScrapPY: PDF Scraping Made Easy

Demo:

Install:

ScrapPY Usage:

Integration with Offensive Security Tools:

Future Development:

Files

README.md

Latest commit

History

README.md

File metadata and controls

🐕 ScrapPY: PDF Scraping Made Easy

Demo:

Install:

ScrapPY Usage:

Integration with Offensive Security Tools:

Future Development: