How to scrape PDF from CNKI #116
Comments
@lullabymia, does the Medium article work? It seems to include a complete example of extracting text from PDFs.
@lullabymia How many files do you have? Maybe you can send them to me and I will help you do the OCR with tools like
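As a minimal sketch, assuming pdf2image plus pytesseract (with the poppler and Tesseract binaries installed) are acceptable tools for the OCR step, it could look like this:
from pdf2image import convert_from_path   # renders each PDF page as an image (needs poppler)
import pytesseract                         # Python wrapper around the Tesseract OCR engine
pages = convert_from_path('sample.pdf')    # 'sample.pdf' is a placeholder path
text = ''
for image in pages:
    # lang='chi_sim' assumes the simplified-Chinese traineddata is installed
    text += pytesseract.image_to_string(image, lang='chi_sim')
print(text)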
We might need to scrape thousands of PDF files from this website: http://navi.cnki.net/KNavi/NPaperDetail?pcode=CCND&bzpym=RMRB (at least from Jan 1, 2008 to Jun 30, 2008).
@lullabymia Yes, you first need to get the PDFs from the website; an example of extracting the words from a PDF is the PyPDF2 snippet further down in this thread.
I also tried the above method mentioned in the Medium article, but cannot proceed for now because of a problem installing the module. Will try later: https://github.com/ChicoXYC/exercise/blob/master/extract-words-pdf/failure-extract-pdf-textract.ipynb
Don't dwell on it.
But there is an error called
Besides the above reading error, I cannot click the "download pdf" link on the webpage:
import selenium
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)
html=browser.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
links = soup.find('div',attrs={'class':"dllink"})
link = links.find('a',attrs={'class':"icon icon-dlpdf"})
link.click()  # this is where it fails
@ChicoXYC Can you also help me with that?
@lullabymia You need to find the element with selenium itself instead of BeautifulSoup, so that you can click it. The following will help:
import selenium
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
url="http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB201812110011&dbname=CCNDCOMMIT_DAY&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1ZzF6aU1NNVIrL0tTbXlSS3lURW1FWT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!"
browser.get(url)
link = browser.find_element_by_css_selector('.dllink a.icon.icon-dlpdf')
link.click()
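If clicking the link opens the PDF in Chrome's built-in viewer instead of saving it, a minimal sketch (the download directory is a placeholder, and CNKI's login/session handling is not covered) of telling Chrome to download PDFs to disk:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    'download.default_directory': '/path/to/pdfs',   # placeholder folder for the downloads
    'plugins.always_open_pdf_externally': True,      # download PDFs instead of previewing them
})
browser = webdriver.Chrome(options=options)           # then browser.get(url) and click the link as above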
# Example of extracting the words from a PDF with PyPDF2:
import PyPDF2
import os
path = 'pdfs/'  # the folder where your PDF files are; easiest is to put it next to your Jupyter notebooks
for file in os.listdir(path):
    pdfFileObject = open(os.path.join(path, file), 'rb')  # open each PDF in binary mode
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    count = pdfReader.numPages                             # number of pages in this PDF
    for i in range(count):
        page = pdfReader.getPage(i)
        print(page.extractText())                          # print the text of every page
    pdfFileObject.close()
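If the text should be saved rather than printed, a small variation of the loop above (the output file naming is an assumption) could write one .txt file per PDF:
import os
import PyPDF2
path = 'pdfs/'
for file in os.listdir(path):
    with open(os.path.join(path, file), 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        text = ''
        for i in range(reader.numPages):
            text += reader.getPage(i).extractText()   # same PyPDF2 calls as above
    # write the collected text next to the notebook, e.g. RMRB200801010026.txt
    with open(file.replace('.pdf', '.txt'), 'w', encoding='utf-8') as out:
        out.write(text)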
Redirect to #135.
Troubleshooting
Describe your question
I want to scrape the full PDF text of People's Daily from CNKI but have no idea how to do it. Do I need to download all the articles?
One article: http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CCND&filename=RMRB200801010026&dbname=CCND2008&uid=WEEvREcwSlJHSldRa1FhdXNXa0d1YXREREQva29YUjBMb0hPUG15bXpFaz0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!
The newspaper's index page: http://navi.cnki.net/KNavi/NPaperDetail?pcode=CCND&bzpym=RMRB
Describe the efforts you have spent on this issue
I found this article about PDF scraping: https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f