forked from luyishisi/Anti-Anti-Spider
-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
32 changed files
with
1,459 additions
and
416 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
|
||
本项目主要内容: | ||
|
||
1:通过模拟用户在百度搜索定站关键词来搜集足够多的百家作者id | ||
|
||
2:通过百度作者id获取百家号数以百万的文章URL及其标题阅读量标签等 | ||
|
||
|
||
操作1: | ||
|
||
get_id 该目录下运行 | ||
请确保keylist.txt在同一个目录下 | ||
|
||
python baijiahao.py | ||
|
||
便在同目录下产生urllist.txt文件,,便是通过百度搜索得出的作者id | ||
运行时间越久数量越多。 | ||
|
||
|
||
|
||
操作2: | ||
在id_to_excel目录下运行 | ||
请确保urllist.txt在同一个目录下(如果是从上部代码获取的urllist需要单独抽取出其id,请看好格式) | ||
|
||
python spider_list_to_excel | ||
之后运算以及耗时半小时以上,根据其id的数量, | ||
|
||
将获得到的url会例如2017_2_6.xslx |
File renamed without changes.
Oops, something went wrong.