CFP crawler #2
base: master

Conversation
This looks good. You can just use a list instead of key-value pairs for each proposal. I will have a look at the code more carefully after some time.
I think it becomes easier to use when key-value pairs are used. If one has to know the name of the speaker of the 10th proposal, they can simply do `proposals[10]["author"]`.
The same applies if you store it in a list, right?
Right, got you!
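A minimal sketch of the two access patterns under discussion (the data is hypothetical; a plain list of dicts supports the same lookup as an index-keyed dict):

```python
# Index-keyed dict (original approach) -- hypothetical data
proposals_by_index = {10: {"author": "Jane Doe", "title": "Intro to Scrapy"}}
print(proposals_by_index[10]["author"])  # -> Jane Doe

# List of dicts (suggested approach) -- indexing works the same way
proposals = [{"author": "Jane Doe", "title": "Intro to Scrapy"}]
print(proposals[0]["author"])            # -> Jane Doe
```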
@ananyo2012 Changed it to a list instead of a dict.
```
}
```
Please add a newline here.
It was just a test file, not of any importance.
```python
# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
```
Why do we have so many comments?
These are added by default by Scrapy.
@aaqaishtyaq please review this PR, as you are also working with Scrapy these days.
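For reference, a trimmed settings.py keeping only the active settings might look like this (a sketch; `BOT_NAME` is taken from the Scrapy log line later in this thread, and the module path is assumed):

```python
# settings.py -- Scrapy's template generates many commented-out options;
# only the settings actually in use need to stay.
BOT_NAME = 'proposal'                     # from the log: "(bot: proposal)"

SPIDER_MODULES = ['cfp_crawler.spiders']  # assumed project layout
NEWSPIDER_MODULE = 'cfp_crawler.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
```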
```python
import scrapy
import json
from scrapy import signals
```
There should be two empty lines after the import statements.
Sure, that can be done
@Niveshkrishna can you fix the things I mentioned in the comments? Write good documentation for usage, remove the compiled .pyc binaries from the changes, and add them to .gitignore.
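A minimal .gitignore covering the generated files mentioned in this thread (a sketch):

```gitignore
# compiled Python files
*.pyc
__pycache__/

# generated at crawl time
logs.log
```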
```
### Basic Usage

scrapy crawl crawler
```
Can you add some documentation for the project?
Sure, will do it
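For instance, the README could grow a setup-and-run section along these lines (a sketch; the `cfp_crawler` directory name is assumed from the file paths in this thread):

```sh
# set up and run the crawler (sketch)
pip install scrapy        # or: pip install -r requirements.txt
cd cfp_crawler            # project root, assumed
scrapy crawl crawler      # output is written to proposals.json
```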
cfp_crawler/logs.log (outdated; 2,947 new lines)

```
2018-06-27 12:25:46 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: proposal)
```
Is this a generated file? If so, please remove it and add it to .gitignore.
Done!
```
@@ -0,0 +1,4910 @@
[
  {
```
This is the data dump, right? Just keep one sample record and remove the rest.
Done!
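For reference, a single kept sample record could look like this (field names and values are illustrative, assumed from the crawler code reviewed below):

```json
[
  {
    "author": "Jane Doe",
    "created_on": "27 Jun, 2018",
    "Description": "An illustrative proposal description.",
    "Section": "Core Python"
  }
]
```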
```python
created_on = response.xpath("//p[@class='text-muted']/small/b/time/text()").extract()[0].strip()

section = response.xpath("//section[@class='col-sm-8 proposal-writeup']/div")
some_dic = {}
```
Rename this to `proposal`; it is hard to understand what `some_dic` is.
Done!
```python
some_dic = {}
for div in section:
    heading = div.xpath(".//h4[@class='heading']/b/text()").extract()[0]
    data = self.format_data(div.xpath(".//text()").extract(), heading)
```
Which data is this?
OK, got it. This is the main proposal content. How about we rename the variables as follows: `heading` => `section_heading`, `data` => `section_content`.
Also, just being curious: does `div.xpath()` return two values in a tuple? I ask since `format_data()` takes two arguments.
Which `div.xpath()` are you referring to?
@Niveshkrishna never mind, I got it: `div.xpath(".//text()").extract()` is the first argument and `heading` is the second. Rename the variables as I suggested. Also, to be clearer, make a separate variable for `div.xpath(".//text()").extract()`, say `raw_section_content`.
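Putting those renames together, the section loop might read like this (a sketch of the suggestion, with `heading[:-1]` assumed to strip a trailing colon):

```python
proposal = {}
for div in section:
    section_heading = div.xpath(".//h4[@class='heading']/b/text()").extract()[0]
    raw_section_content = div.xpath(".//text()").extract()
    section_content = self.format_data(raw_section_content, section_heading)
    proposal[section_heading[:-1]] = section_content  # drop trailing ":" (assumed)
```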
```python
some_dic[heading[:-1]] = data

table = response.xpath("//table/tr")
for col in table:
```
And this should be `for row in table_rows`.
Done!
```python
data = data[2:-2]
some_dic[heading[:-1]] = data

table = response.xpath("//table/tr")
```
This should be `table_rows`.
Done!
```python
table = response.xpath("//table/tr")
for col in table:
    heading = col.xpath(".//td/small/text()").extract()[0].strip()
    data = col.xpath(".//td/text()").extract()[0].strip()
```
How about we rename these variables as follows: `heading` => `extra_info_heading`, `data` => `extra_info_content`.
Done!
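With the renames in this part, the table loop might read like this (sketch):

```python
table_rows = response.xpath("//table/tr")
for row in table_rows:
    extra_info_heading = row.xpath(".//td/small/text()").extract()[0].strip()
    extra_info_content = row.xpath(".//td/text()").extract()[0].strip()
    proposal[extra_info_heading] = extra_info_content  # same dict as above (assumed)
```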
```python
allowed_domains = ['in.pycon.org']
url = "https://in.pycon.org"
proposals = []
file = open("proposals.json", "w")
```
The file is opened but never closed. You should close the file after use. The best way to do this is a `with` statement. Something like:

```python
with open("filename.json", "w") as f:
    f.write("data")
```

Move it inside the `spider_closed` method, and if you want, make the filename a class member.
Oops! Forgot to close it.
Done
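A sketch of the suggested fix, assuming the spider collects its results in `self.proposals`:

```python
import json

import scrapy


class CrawlerSpider(scrapy.Spider):
    name = 'crawler'
    proposals = []  # filled while parsing

    def spider_closed(self, spider):
        # the with-statement closes the file even if json.dump raises
        with open("proposals.json", "w") as f:
            json.dump(self.proposals, f, indent=2, sort_keys=True)
```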
@Niveshkrishna it would also be good to add a requirements.txt (`pip freeze > requirements.txt`) and, as @ananyo2012 suggested, documentation on how to set up the project using virtualenv/pipenv.
```python
import json
from scrapy import signals


class CrawlerSpider(scrapy.Spider):
    name = 'crawler'
```
Also, can you change the name of the spider to something like `cfpcrawler` or `cfp`?
Can you confirm this, @ananyo2012?
```python
def spider_closed(self, spider):
    print("Closing spider")
    json.dump(self.proposals, self.file, indent=2, sort_keys=True)
```
It's a common practice to use pipelines to save Scrapy items to a file. Something like this in pipelines.py:

```python
import json


class JsonWritePipeline(object):
    def open_spider(self, spider):
        self.file = open('proposal.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
```
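For the pipeline to run, it also has to be enabled in settings.py; the module path here is assumed from the project layout:

```python
# settings.py
ITEM_PIPELINES = {
    'cfp_crawler.pipelines.JsonWritePipeline': 300,  # lower number runs earlier
}
```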
True, I could have used middlewares.py as well. But since this site is not so complicated to crawl, I just hardcoded everything.
There's always the possibility of extending any project; don't hardcode anything unless that's the only way out.
cc: @realslimshanky
Okay, I'll keep that in mind.
```python
class ProposalItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
```
@Niveshkrishna, you are not declaring any item in items.py. Why?
Unless there are multiple crawlers using the same item, there's not much difference between using items and normal variables. Since there is only one crawler, it is okay to have them declared directly in the crawler.py file. Or is it not? Is there any advantage to using items in this case?
You are right. But it would help to reuse code in the future and to keep everything tidy, clean, and understandable.
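For illustration, a `ProposalItem` declaring the scraped fields might look like this (field names assumed from the crawler):

```python
import scrapy


class ProposalItem(scrapy.Item):
    author = scrapy.Field()
    created_on = scrapy.Field()
    content = scrapy.Field()     # per-section proposal content
    extra_info = scrapy.Field()  # table metadata (assumed field split)
```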
I think everything except …
@Niveshkrishna Can you complete the documentation so that we can merge this PR?
This crawler crawls through all the proposals on the CFP page and saves them into a proposals.json file.