L_SPIDER

A small spider with a handful of methods for crawling data such as news, titles or pictures, plus anything else that is not encoded with a complex algorithm (well, the truth is I haven't learned about those yet; if someday I get it, a new spider will be created, I promise).
At first I wrote this just to help my friends finish their course project, and to train my programming skills along the way.
But the more I tried to improve and perfect it, the more astounded I was by the potential of a SPIDER.
An excellent data killer: you can analyse anything by using a magic algorithm and your favourite programming language on the huge pile of data you crawled with the spider, if the web server allows it. 😈
Why don't I use Scrapy or another framework? Very simple: an easy task needs no rocket launcher, does it?
In other words, I wanted to build a wheel of my own to play with.
I hope my code saves you some worry, just as it helped my friends. Good luck, guy. 😇
LOVEMOSTISBUG

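One thing I have to tell you before the tutorial starts

High-intensity script access without permission is something no server and no site administrator wants to receive. Think of it as "From Getting Started to Getting Jailed".
For learning and reference only.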
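One way to respect that warning is to check a site's robots.txt rules and pause between requests. The following is only a minimal sketch using the standard-library `urllib.robotparser`; the helper names `make_robot_checker` and `polite_urls`, the sample rules and the one-second default delay are illustrative assumptions and are not part of L_SPIDER.

```python
import time
import urllib.robotparser


def make_robot_checker(robots_txt_lines):
    """Build a parser from the lines of a robots.txt file (already downloaded)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


def polite_urls(urls, parser, user_agent="*", delay=1.0, sleep=time.sleep):
    """Yield only the URLs robots.txt allows, sleeping `delay` seconds between them."""
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed by the site's rules, skip it
        if not first:
            sleep(delay)  # throttle so the server is not hammered
        first = False
        yield url


if __name__ == "__main__":
    # Hypothetical rules and URLs, for illustration only.
    rules = ["User-agent: *", "Disallow: /private/"]
    checker = make_robot_checker(rules)
    candidates = ["https://example.com/news", "https://example.com/private/x"]
    for allowed in polite_urls(candidates, checker, delay=0):
        print(allowed)
```

Each URL that comes out of `polite_urls` can then be handed to whatever fetch function you use.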

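How to crawl: game start

Let me explain the functions in the SPIDER class.
The constructor takes two arguments: the target URL and the regex for the targets.
url_open: opens a URL with a forged request header and a proxy IP and returns the response (raw bytes by default).
get_html: fetches the spider's own URL, then saves and returns the page HTML.
show_html: prints the HTML of the spider's own URL.
get_aim_list: fetches the spider's own URL, then saves and returns the list of target URLs.
show_aim_list: prints the target URLs one by one.
Usually you call crawl first to initialise the basic HTML and the target URL list, but that depends on your needs; I kept the code as short as I could without giving up flexibility, at least in my opinion.
L_print: a print-like function, written because some characters cannot be printed and I did not want an error raised.
download: downloads, as the name says. Arguments are the target URL and the folder to save into (the root directory by default).
keep_data_one_page: saves the data from a single page. Arguments are the URL and the separator used when a match comes back as a tuple.
keep_data_by_pages: arguments are the number of pages to crawl, the front part and back part of the URL (the page number goes in between), the separator, and the page-number step.
deep_crawl: takes the regex for the deeper crawl, then the front part and back part of the URL.
deep_crawl_and_save: takes the regex for the deeper crawl, the folder to save into, then the front part and back part of the URL.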
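To make the "front URL + page number + back URL" and "separator" parameters more concrete, here is a rough sketch of the idea. `build_page_urls` and `join_match` are hypothetical helpers written for this illustration, not the actual L_SPIDER code, and starting the page number at 0 is an assumption.

```python
import re


def build_page_urls(front_url, back_url, pages, step=1):
    """Return one URL per page: front + page number + back.

    The page number starts at 0 and grows by `step` (assumed behaviour).
    """
    return [f"{front_url}{n * step}{back_url}" for n in range(pages)]


def join_match(match, sep="\t"):
    """re.findall returns tuples when the pattern has several groups;
    join such a tuple into one line using the separator."""
    if isinstance(match, tuple):
        return sep.join(match)
    return match


if __name__ == "__main__":
    # Hypothetical URL pieces and text, for illustration only.
    for url in build_page_urls("https://example.com/f?pn=", "&ie=utf-8", 3, step=50):
        print(url)
    pattern = re.compile(r"(\w+)=(\d+)")
    for m in pattern.findall("a=1 b=2"):
        print(join_match(m, sep=","))
```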

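TASK 0 If you only need lightweight crawling
For example, the snippet that follows crawls all the "Chinese"-related news titles from a news site.
You don't even need the class I wrote.
Doing it directly like this is enough.
Of course my class also handles the job nicely ~ and saves you some time for other things.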

import random
import re
import urllib.request


def read_unique_lines(path):
    # Unique, non-empty lines of a text file (the handle is closed on exit).
    with open(path, 'r') as f:
        return list({line.strip() for line in f if line.strip()})


def url_open(url):
    # Pick a random proxy and a random User-Agent for this request.
    my_ip = random.choice(read_unique_lines('ip.txt'))
    my_head = random.choice(read_unique_lines('user_agent.txt'))
    print(my_ip + '\n' + my_head + '\n')
    # Only 'http' is mapped here, so https URLs are not sent through the proxy.
    proxy_support = urllib.request.ProxyHandler({'http': my_ip})
    opener = urllib.request.build_opener(proxy_support)
    opener.addheaders = [('User-Agent', my_head)]
    # Return the raw response body as bytes.
    with opener.open(url) as response:
        return response.read()


def gkd(url, k):
    # Return the unique matches of pattern k in the page at url.
    return list(set(re.findall(k, url_open(url).decode('utf-8'))))


k1 = re.compile(r'class="story-txt">\r\n\t\t\t\t\t\t((?:.).*?)\t\t\t\t\t</div>')
with open('T.txt', 'a', encoding='utf-8') as f:
    for page in range(1, 200):
        url2 = 'https://globalnews.ca/gnca-ajax/search-results/%7B%22term%22:%22china%22,%22type%22:%22news%22,%22page%22:' + str(page) + '%7D/'
        for title in gkd(url2, k1):
            # Undo the two HTML entities that show up in the titles.
            b = str(title).replace('&#039;', '\'').replace('&quot;', ' ')
            print(b)
            f.write(b + '\n')

print('ok done')
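If you want to try the title regex without touching the network, a small offline sketch like the one below can help. The sample HTML string is made up for illustration, and `extract_titles` is a hypothetical helper. It also swaps two details compared with the script above: it uses the standard `html.unescape` instead of the hand-written replacements (so `&quot;` becomes a real quote rather than a space), and it removes duplicates while keeping the original order.

```python
import html
import re

# Same pattern as k1 in the script above.
STORY_RE = re.compile(r'class="story-txt">\r\n\t\t\t\t\t\t((?:.).*?)\t\t\t\t\t</div>')


def extract_titles(page_text):
    """Return the unique story titles in page_text, entities decoded, order kept."""
    seen = set()
    titles = []
    for raw in STORY_RE.findall(page_text):
        title = html.unescape(raw)  # e.g. &#039; becomes '
        if title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


if __name__ == "__main__":
    # Made-up sample that mimics the whitespace layout the pattern expects.
    block = 'class="story-txt">\r\n\t\t\t\t\t\tChina&#039;s exports rise\t\t\t\t\t</div>'
    print(extract_titles(block + block))
```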

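TASK 1 Baidu Tieba: crawl the post titles and replies from the first few dozen pages of a given tieba. Each post is saved as title.txt, one reply per line, in the data folder under the working directory. Code size: 23 lines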

from L_SPIDER import SPIDER
import re
import urllib.parse
import multiprocessing as mp

# Regex for the post links on a listing page.
k_aim = re.compile('''errer" href="((?:(?:.).*?))" title="(?:(?:.).*?)"''')
# Regex for the reply text inside a post.
k_aim_deep = r'''j_d_post_content " style="display:;">((?:(?:.).*?))<'''
# Regex for the page title, which becomes the file name.
k_aim_deep_file_name = re.compile(r'''<title>((?:(?:.).*?))</title>''')
tieba = urllib.parse.quote('抗压背锅')  # URL-encode the tieba name
page = 50
b = SPIDER(f'https://tieba.baidu.com/f?kw={tieba}&ie=utf-8&pn={page}', k_aim)
b.get_html()
ls = b.get_aim_list()
b.show_aim_list()


def run(urls):
    # Note: the mapped item (urls) is not used; every call passes the same arguments.
    b.deep_crawl_and_save(k_aim_deep, k_aim_deep_file_name, f_url='https://tieba.baidu.com')


if __name__ == '__main__':
    p = mp.Pool(10)  # 10 worker processes
    rel = p.map(run, ls)
    p.close()
    p.join()

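With multiple processes the crawl is very fast: 3000 records within 5 s should be no big problem, which is basically enough to get started on a course project or on building some model.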
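The general pattern behind this kind of speed-up is "one worker call per URL". The sketch below shows it with a thread pool from `concurrent.futures` rather than the `multiprocessing.Pool` used in the script, since downloading is mostly waiting on the network. `crawl_all` and the fake fetch function are assumptions for illustration only; in practice you would pass a real fetch function.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(urls, fetch, workers=10):
    """Call fetch(url) for every URL using a pool of threads.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    # Stand-in for a real download function, so the sketch runs offline.
    def fake_fetch(url):
        return f"<html>{url}</html>"

    pages = crawl_all(["/p/1", "/p/2", "/p/3"], fake_fetch, workers=2)
    print(len(pages))
```

Keep the worker count modest; more workers also means more load on the server.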
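If you don't know how to write regular expressions, you can use this trick: replace the leading marker and the trailing marker with whatever text comes right before and right after the content you want.

k_aim = re.compile('''leading_marker((?:(?:.).*?))trailing_marker''')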
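The same trick can be wrapped in a tiny helper. This is just a sketch: `between` is a hypothetical name, and it adds `re.escape` so that special characters in your markers (such as `?` or `.`) are matched literally, which the bare template does not do.

```python
import re


def between(prefix, suffix):
    """Compile a pattern that captures the shortest non-empty text
    found between prefix and suffix."""
    return re.compile(re.escape(prefix) + r"(.+?)" + re.escape(suffix))


if __name__ == "__main__":
    title_re = between("<title>", "</title>")
    print(title_re.findall("<title>Hello</title><title>World</title>"))
```

Like the template, `.` does not match a newline here, so the captured text has to sit on one line.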

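Emmm, I figure that if you have come as far as writing crawlers, you already know how to handle text, so I won't say more about it.
The result looks like this:

[screenshot of the output]

Some things you may want to know

You got banned. Respect~

The web server went down. Respect~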
