Commit
Finished crawling the URLs of all records on gs.amac.org.cn
copie committed Jan 4, 2017
1 parent a66c114 commit 95bda80
Showing 2 changed files with 21 additions and 32 deletions.
2 changes: 2 additions & 0 deletions 6.爬虫项目源码/9.gs.amac.org.cn/README.md
@@ -0,0 +1,2 @@
Finished crawling the URLs of all records on gs.amac.org.cn; the collected URLs are kept in urllist. Nothing is saved to disk and the detail pages are not scraped: there are too many detail fields, and that would be too much work for me.
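Since the README notes that nothing is persisted, here is a minimal sketch of how the urllist set built by shimu.py could be dumped to disk at the end of a run. The save_urllist name and the urls.txt filename are assumptions, not part of this commit:

# Minimal sketch, not part of the commit: write the crawled URL set to disk.
# The function name and the 'urls.txt' filename are assumptions.
def save_urllist(urllist, path='urls.txt'):
    with open(path, 'w') as f:
        for url in sorted(urllist):
            f.write(url + '\n')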
51 changes: 19 additions & 32 deletions 6.爬虫项目源码/9.gs.amac.org.cn/shimu.py
@@ -1,41 +1,28 @@
 #!/usr/bin/env python
 # -*- coding: UTF-8 -*-
 #-------------------------------------------------------------------------
 # Program:  shimu.py
 # Version:  0.1
 # Author:   copie
 # Date:     written 2016/12/15
 # Language: Python 3.5.x
 # System:   archlinux
 # Usage:    python shimu.py
 # Function:
 #-------------------------------------------------------------------------
 from selenium import webdriver
 import time
-
-browser = webdriver.PhantomJS()
+import bs4
+
+urllist = set()
+browser = webdriver.Chrome()
 browser.get('http://gs.amac.org.cn/amac-infodisc/res/pof/manager/index.html')
-time.sleep(10)
+time.sleep(5)
+browser.get('http://gs.amac.org.cn/amac-infodisc/res/pof/manager/index.html')
-files = open('ziliao.txt','w')
-body=browser.find_element_by_xpath('//*[@id="managerList"]/tbody')
-nextButton=browser.find_element_by_xpath('//*[@id="managerList_paginate"]/a[3]')
-t=browser.find_element_by_xpath('//*[@id="managerList_length"]/label/select/option[4]')
+
+nextButton = browser.find_element_by_xpath(
+    '//*[@id="managerList_paginate"]/a[3]')
+t = browser.find_element_by_xpath(
+    '//*[@id="managerList_length"]/label/select/option[4]')
 t.click()
-i=1
-while 1:
-    str=body.text
-    strs=str.split('\n')
-    for s in strs:
-        files.writelines(s)
-        files.writelines('\n')
-    print(len(strs))
-    if len(strs) < 100:
+time.sleep(3)
+i = 1
+while True:
+    soup = bs4.BeautifulSoup(browser.page_source, 'lxml')
+    # BUG: the loop breaks before the last (short) page is scraped
+    if len(soup.findAll('tbody')[4].findAll('tr')) < 100:
         break
+    for tmp in soup.findAll('tbody')[4].findAll('tr'):
+        urllist.add(tmp.findAll('td')[1].find('a').get('href'))
     nextButton.click()
-    time.sleep(2)
     print(i)
-    i=i+1
-files.close()
+    print(len(urllist))
+    i = i + 1
+    time.sleep(5)
 browser.close()
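The translated BUG comment in the diff flags a real pagination flaw: the loop breaks as soon as a page shows fewer than 100 rows, so the rows of the final short page are never collected. Here is a hedged sketch of one possible fix, scraping the current page before testing whether it is the last one. The scrape_remaining_pages helper and its page_size parameter are assumptions; the tbody/td selectors are taken from shimu.py above:

import time
import bs4

# Sketch of a possible fix: harvest the rows on the current page *before*
# testing whether it is the final, short page. The helper name and page_size
# are assumptions; the tbody/td selectors come from shimu.py above.
def scrape_remaining_pages(browser, next_button, page_size=100):
    urllist = set()
    while True:
        soup = bs4.BeautifulSoup(browser.page_source, 'lxml')
        rows = soup.findAll('tbody')[4].findAll('tr')
        for row in rows:
            urllist.add(row.findAll('td')[1].find('a').get('href'))
        if len(rows) < page_size:  # the short last page is already scraped
            break
        next_button.click()
        time.sleep(5)  # crude wait for the next page to render
    return urllist

The fixed time.sleep calls could also be swapped for selenium's WebDriverWait with an expected condition on the table body, which waits only as long as the page actually needs.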
