Still searching link-by-link and reading line-by-line to compose a review in 5202?
A Python pipeline for collecting, downloading, and analyzing academic papers, with a focus on LLM and security research.
- Paper Collection: Scrapes Google Scholar for relevant papers
- Automated Downloads: Downloads papers from various academic sources
- Text Extraction: Supports multiple PDF extraction methods
- AI-Powered Summaries: Generates research paper summaries using Claude 3
- Scrapes Google Scholar using the scholarly library
- Streams results to markdown in real-time
- Async implementation for better performance
- Outputs structured paper metadata
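For the paper collection step, streaming results to markdown could look roughly like the sketch below. It is not the project's actual code: the dict keys (`title`, `authors`, `year`, `citations`, `url`) and the helper names are assumptions chosen for illustration, not the scholarly library's schema.

```python
from pathlib import Path

HEADER = (
    "| Title | Authors | Year | Citations | Link |\n"
    "| ----- | ------- | ---- | --------- | ---- |\n"
)


def format_row(paper: dict) -> str:
    """Render one paper as a markdown table row (hypothetical dict keys)."""

    def cell(value) -> str:
        # Escape pipes and flatten newlines so a cell cannot break the table.
        return str(value).replace("|", "\\|").replace("\n", " ").strip()

    authors = ", ".join(paper.get("authors", []))
    return "| {} | {} | {} | {} | {} |\n".format(
        cell(paper.get("title", "")),
        cell(authors),
        cell(paper.get("year", "")),
        cell(paper.get("citations", 0)),
        cell(paper.get("url", "")),
    )


def append_paper(path: Path, paper: dict) -> None:
    """Append one row, writing the table header first if the file is new or empty."""
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", encoding="utf-8") as fh:
        if is_new:
            fh.write(HEADER)
        fh.write(format_row(paper))
        fh.flush()  # make the row visible to readers right away
```

Appending and flushing one row per result means a partially completed run still leaves a usable papers.md.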
- Async download of papers from URLs
- Smart rate limiting
- Progress tracking
- Filename sanitization
- Handles various academic sources
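For the download step, filename sanitization and rate limiting can be sketched with the standard library alone. This is an illustrative sketch, not the project's implementation: `sanitize_filename` and `run_limited` are hypothetical names, and the concurrency and delay defaults are arbitrary.

```python
import asyncio
import re


def sanitize_filename(title: str, max_length: int = 100) -> str:
    """Turn a paper title into a filesystem-safe PDF filename."""
    # Replace characters that are invalid on common filesystems.
    name = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", title)
    # Collapse whitespace runs and trim spaces/dots from both ends.
    name = re.sub(r"\s+", " ", name).strip(" .")
    name = name[:max_length].rstrip(" .")
    return (name or "untitled") + ".pdf"


async def run_limited(coros, max_concurrent: int = 3, delay: float = 1.0):
    """Run coroutines with bounded concurrency and a pause after each one."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(coro):
        async with semaphore:
            try:
                return await coro
            finally:
                # Hold the slot briefly so requests are spaced out.
                await asyncio.sleep(delay)

    # return_exceptions=True keeps one failed download from cancelling the rest.
    return await asyncio.gather(*(guarded(c) for c in coros), return_exceptions=True)
```

Each coroutine passed to `run_limited` would wrap one download; results come back in input order, with exceptions returned in place of failed items.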
- Dual extraction methods:
  - PyPDF2 (fast, basic extraction)
  - Nougat (ML-based, better accuracy)
- Comparison reporting
- Error handling and validation
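For the text extraction step, a comparison report between the two extractor outputs might be computed as in this sketch. The function name and the chosen metrics are assumptions; the project's actual report may contain different fields.

```python
import difflib


def compare_extractions(pypdf_text: str, nougat_text: str) -> dict:
    """Summarize how two extractions of the same PDF differ."""
    a_words = pypdf_text.split()
    b_words = nougat_text.split()
    # Word-level similarity in [0, 1]; autojunk is disabled so frequent
    # words in long papers are not silently ignored.
    matcher = difflib.SequenceMatcher(None, a_words, b_words, autojunk=False)
    return {
        "pypdf_chars": len(pypdf_text),
        "nougat_chars": len(nougat_text),
        "pypdf_words": len(a_words),
        "nougat_words": len(b_words),
        "word_similarity": round(matcher.ratio(), 3),
    }
```

`SequenceMatcher` can be slow on very long inputs, so a real report might compare a sample of each document instead of the full text.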
- Uses Claude 3 API for intelligent summarization
- Structured summary format
- Index generation
- Async processing
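For the summarization step, index generation over a directory of summaries could be as small as the sketch below. The `build_index` helper, the `index.md` name, and the "first non-empty line is the title" rule are assumptions for illustration.

```python
from pathlib import Path


def build_index(summary_dir: Path) -> str:
    """Return a markdown index linking to every summary file in a directory."""
    lines = ["# Summary Index", ""]
    for path in sorted(summary_dir.glob("*.md")):
        if path.name == "index.md":
            continue  # do not list the index itself
        title = path.stem
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    # Use the first non-empty line, minus heading markers.
                    title = line.strip().lstrip("#").strip() or path.stem
                    break
        lines.append(f"- [{title}]({path.name})")
    return "\n".join(lines) + "\n"
```

The returned string can be written to an index file next to the summaries, for example with `Path("summaries/index.md").write_text(...)`.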
pip install -r requirements.txt
Required packages:
scholarly
aiohttp
aiofiles
PyPDF2
nougat-ocr
anthropic
tqdm
Required API keys:
- Anthropic API key for Claude (for summarization)
Set up your API key:
export ANTHROPIC_API_KEY='your-api-key-here'
Alternatively (recommended), create a .env file and put the keys in it:
ANTHROPIC_API_KEY=sk-this-is-your-api-key
# base_url defaults to the official API
# ANTHROPIC_BASE_URL=https://example.com
Then run the script with dotenvx:
dotenvx run -- python summarize_papers.py
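Inside a script, the two variables above could be read and validated as in this sketch. The `Settings` class and `load_settings` function are hypothetical and not necessarily what src/config/settings.py contains.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: str
    base_url: Optional[str] = None  # None means "use the official API"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the API settings from the environment, failing early if the key is missing."""
    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set; export it or add it to .env")
    base_url = env.get("ANTHROPIC_BASE_URL", "").strip() or None
    return Settings(api_key=api_key, base_url=base_url)
```

Because dotenvx injects the .env values into the process environment, the same code works whether the key came from `export` or from the .env file.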
python generate_papers.py
Outputs: papers.md
python download_papers.py
Outputs: papers/*.pdf
python extract_markdown.py
Outputs: markdown/pypdf/*.md and markdown/nougat/*.md
python summarize_papers.py
Outputs: summaries/*.md
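The four steps above run in a fixed order, so a small wrapper can chain them. This is a convenience sketch, not part of the repository: `build_commands` and `run_pipeline` are hypothetical names, and the scripts are assumed to live in the current directory.

```python
import subprocess
import sys

# Script names in the order the steps are listed above.
STEPS = [
    "generate_papers.py",
    "download_papers.py",
    "extract_markdown.py",
    "summarize_papers.py",
]


def build_commands(python: str = sys.executable, start_at: str = STEPS[0]):
    """Return the command lines to run, optionally resuming from a later step."""
    first = STEPS.index(start_at)  # raises ValueError for an unknown script
    return [[python, script] for script in STEPS[first:]]


def run_pipeline(start_at: str = STEPS[0]) -> None:
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands(start_at=start_at):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("extract_markdown.py")` would skip collection and download, which is handy when the PDFs are already on disk.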
src/
├── config/
│ └── settings.py
├── core/
│ ├── __init__.py
│ ├── scholarly_client.py
│ ├── pdf_processor.py
│ └── llm_client.py
├── extractors/
│ ├── __init__.py
│ ├── base.py
│ ├── pypdf.py
│ └── nougat.py
├── chains/
│ ├── __init__.py
│ ├── paper_collection.py
│ ├── text_extraction.py
│ └── summarization.py
├── utils/
│ ├── __init__.py
│ ├── file_utils.py
│ └── async_utils.py
└── main.py
| Title | Authors | Year | Citations | Link |
| ----- | ------- | ---- | --------- | ---- |
| ... | ... | ... | ... | ... |
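A table in this format can be read back into dictionaries with a few lines of Python. The parser below is a sketch under the assumption that each row sits on one line and that literal pipes inside cells are escaped as `\|`; `parse_papers_table` is a hypothetical helper.

```python
import re


def parse_papers_table(markdown: str) -> list[dict]:
    """Parse a markdown table into dicts keyed by lower-cased column names."""
    rows = []
    columns = None
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table line
        # Split on unescaped pipes only, then restore escaped ones.
        cells = [
            c.strip().replace("\\|", "|")
            for c in re.split(r"(?<!\\)\|", line[1:-1])
        ]
        if columns is None:
            columns = [c.lower() for c in cells]  # first table line is the header
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row such as | ----- | ---- |
        if len(cells) == len(columns):
            rows.append(dict(zip(columns, cells)))
    return rows
```

All values come back as strings, so callers that need numeric years or citation counts have to convert them.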
1. Title
2. Key Points
3. Main Contributions
4. Methodology
5. Results and Conclusions
6. Future Work
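The six sections above can drive both the prompt and a quick sanity check on the model's reply. This is a sketch only: the prompt wording, the `max_chars` truncation, and both function names are assumptions, not the project's actual prompt.

```python
SECTIONS = [
    "Title",
    "Key Points",
    "Main Contributions",
    "Methodology",
    "Results and Conclusions",
    "Future Work",
]


def build_summary_prompt(paper_text: str, max_chars: int = 50_000) -> str:
    """Build a prompt asking for a summary with the six numbered sections."""
    outline = "\n".join(f"{i}. {name}" for i, name in enumerate(SECTIONS, 1))
    body = paper_text[:max_chars]  # crude truncation to bound API cost
    return (
        "Summarize the following research paper in markdown, "
        "using exactly these numbered sections:\n"
        f"{outline}\n\n"
        "Paper text:\n"
        f"{body}"
    )


def missing_sections(summary: str) -> list[str]:
    """Return the section names that do not appear in a generated summary."""
    lowered = summary.lower()
    return [name for name in SECTIONS if name.lower() not in lowered]
```

A non-empty result from `missing_sections` is a cheap signal that a summary should be regenerated or flagged for review.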
- Some papers may be behind paywalls
- Download rates may be limited by sources
- PDF extraction quality varies
- Summarization incurs API costs
- Collection, download, and summarization depend on network access
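Rate limits and flaky networks are commonly softened with retries and exponential backoff. The sketch below shows the general technique; it is not taken from this project, and `retry_async` and its defaults are hypothetical.

```python
import asyncio
import random


async def retry_async(make_call, attempts: int = 4,
                      base_delay: float = 1.0, max_delay: float = 30.0):
    """Await make_call(), retrying with exponential backoff and jitter on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            # Broad catch for brevity; real code would retry only on
            # network or rate-limit errors.
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps parallel tasks from retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

`make_call` is a zero-argument function returning a fresh coroutine on every attempt, for example `lambda: download(url)` with a hypothetical `download` coroutine.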
Feel free to submit issues and enhancement requests!
MIT License