
CFP crawler #2

Open
wants to merge 18 commits into master

Conversation

Niveshkrishna

This crawler crawls through all the proposals on the CFP page and saves them into a proposals.json file.

@ananyo2012
Contributor

This looks good. You can just use a list instead of key-value pairs for each proposal. I will have a look at the code more carefully after some time.

@Niveshkrishna
Author

I think it becomes easier to use when key-value pairs are used. If one has to know the name of the speaker of the 10th proposal, they can simply do proposals[10]["author"].

@ananyo2012
Contributor

ananyo2012 commented Jul 6, 2018 via email

@Niveshkrishna
Author

Right, got you!

@Niveshkrishna
Author

@ananyo2012 Changed it to a list instead of a dict.
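(For reference, a minimal sketch of the structure being discussed: a top-level list whose entries are per-proposal dicts, so indexed lookups still work. The field names here are illustrative, not taken from the actual output.)

# proposals.json loaded back in: a list of per-proposal dicts
proposals = [
    {"author": "Speaker A", "created_on": "27 Jun, 2018"},
    {"author": "Speaker B", "created_on": "28 Jun, 2018"},
]

# The speaker of the 11th proposal is still a simple lookup:
# proposals[10]["author"]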


}

}
Member

Please add a newline here.

Author

It was just a test file, not of any importance.

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
Member

Why do we have so many comments?

Author

These are added by default by Scrapy.
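(For context, the settings.py that Scrapy scaffolds is mostly commented-out defaults; a trimmed version could keep only what the project actually relies on. A rough sketch, assuming the standard layout with the bot name proposal seen in the crawl log:)

# settings.py, trimmed to the settings this project actually uses
BOT_NAME = 'proposal'

SPIDER_MODULES = ['proposal.spiders']
NEWSPIDER_MODULE = 'proposal.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True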

Member

@aaqaishtyaq please review this PR as you are also working with scrapy these days.

import scrapy
import json
from scrapy import signals

Member

There should be 2 empty lines after import statements.

Author

Sure, that can be done

Contributor

@ananyo2012 left a comment

@Niveshkrishna can you fix the things I mentioned in the comments? Write good documentation for the usage, remove the binary .pyc files from the changes, and add them to .gitignore.

@@ -0,0 +1,3 @@
### Basic Usage

scrapy crawl crawler
Contributor

Can you add some documentation for the project?

Author

Sure, will do it

@@ -0,0 +1,2947 @@
2018-06-27 12:25:46 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: proposal)
Contributor

Is this a generated file? Then please remove it and add it to .gitignore.

Author

Done!

@@ -0,0 +1,4910 @@
[
{
Contributor

This is the data dump, right? Just keep one sample entry and remove the rest.

Author

Done!

created_on = response.xpath("//p[@class='text-muted']/small/b/time/text()").extract()[0].strip()

section = response.xpath("//section[@class='col-sm-8 proposal-writeup']/div")
some_dic = {}
Contributor

Rename this to proposal. It is hard to understand what some_dic is.

Author

Done!

some_dic = {}
for div in section:
heading = div.xpath(".//h4[@class='heading']/b/text()").extract()[0]
data = self.format_data(div.xpath(".//text()").extract(), heading)
Contributor

Which data is this?

Contributor

OK, got it. This is the main proposal content. How about we name the variables as follows:

heading => section_heading
data => section_content

Contributor

Also, just being curious, does div.xpath() return 2 values in a tuple? I see that format_data() takes in 2 arguments.

Author

Which div.xpath() are you referring to?

Contributor

@Niveshkrishna Never mind, I got it. div.xpath(".//text()").extract() is the first argument and heading is the second. Rename the variables as I suggested. Also, to be more clear, make a separate variable for div.xpath(".//text()").extract(), say raw_section_content.
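(Putting the renaming suggestions together, the section loop could read roughly as below; the XPath expressions are the ones from the diff, and the variable names are the ones proposed in this thread:)

proposal = {}
for div in section:
    section_heading = div.xpath(".//h4[@class='heading']/b/text()").extract()[0]
    raw_section_content = div.xpath(".//text()").extract()
    section_content = self.format_data(raw_section_content, section_heading)
    proposal[section_heading[:-1]] = section_content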

some_dic[heading[:-1]] = data

table = response.xpath("//table/tr")
for col in table:
Contributor

And this should be for row in table_rows

Author

Done!

data = data[2:-2]
some_dic[heading[:-1]] = data

table = response.xpath("//table/tr")
Contributor

This should be table_rows

Author

Done!

table = response.xpath("//table/tr")
for col in table:
heading = col.xpath(".//td/small/text()").extract()[0].strip()
data = col.xpath(".//td/text()").extract()[0].strip()
Contributor

How about we name these variables as follows:

heading => extra_info_heading
data => extra_info_content

Author

Done!
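(Applying the same renaming to the table loop from the diff above:)

table_rows = response.xpath("//table/tr")
for row in table_rows:
    extra_info_heading = row.xpath(".//td/small/text()").extract()[0].strip()
    extra_info_content = row.xpath(".//td/text()").extract()[0].strip()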

allowed_domains = ['in.pycon.org']
url = "https://in.pycon.org"
proposals = []
file = open("proposals.json", "w")
Contributor

@ananyo2012 commented Jul 22, 2018

The file is opened but never closed. You should close the file after use. The best way to do this would be to use with. Something like

with open("filename.json", "w") as f:
  f.write("data")

Move it inside the spider_closed method and, if you want, make the filename a member.
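(A rough sketch of what this could look like with the file handled inside spider_closed via with, assuming the filename is kept as a class attribute; the signal wiring for spider_closed is omitted:)

import json

import scrapy


class CrawlerSpider(scrapy.Spider):
    name = 'crawler'
    filename = "proposals.json"
    proposals = []

    def spider_closed(self, spider):
        # Open the file only when the spider finishes and let the
        # context manager close it.
        with open(self.filename, "w") as f:
            json.dump(self.proposals, f, indent=2, sort_keys=True)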

Author

oops! forgot to close it.

Author

Done

@aaqaishtyaq commented Jul 22, 2018

@Niveshkrishna it would also be good to add a requirements.txt:

  pip freeze > requirements.txt

and, as @ananyo2012 suggested, documentation on how to set up the project using virtualenv/pipenv.

@@ -3,6 +3,7 @@
import json
from scrapy import signals


class CrawlerSpider(scrapy.Spider):
name = 'crawler'
@aaqaishtyaq commented Jul 22, 2018

Also, can you change the name of the spider to something like cfpcrawler or cfp?

Author

Can you confirm this, @ananyo2012?


def spider_closed(self, spider):
print("Closing spider")
json.dump(self.proposals, self.file, indent = 2, sort_keys = True)
@aaqaishtyaq commented Jul 22, 2018

It's a common practice to use pipelines to save Scrapy items to a file. Something like this in pipelines.py:

import json


class JsonWritePipeline(object):

    def open_spider(self, spider):
        self.file = open('proposal.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
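(If the pipeline route is taken, it also needs to be enabled in settings.py; the module path below assumes a package named proposal, so adjust it to the actual project layout:)

# settings.py -- enable the JSON-writing pipeline
ITEM_PIPELINES = {
    'proposal.pipelines.JsonWritePipeline': 300,
}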

Author

True, could have used middlewares.py as well. But since this is not so complicated to crawl, I just hardcoded everything.

@aaqaishtyaq commented Jul 22, 2018

There's always a possibility of extending any project; don't hardcode anything unless that's the only way out.

cc: @realslimshanky

Author

Okay, will keep in mind


class ProposalItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()


@Niveshkrishna, you are not declaring any items in items.py. Why?

Author

Unless there are multiple crawlers using the same item, there's not much difference between using items and normal variables. Since there is only one crawler, it is okay to have them declared directly in the crawler.py file. Or is it not? Is there any advantage to using items in this case?


You are right, but it would help to reuse code in the future and to keep everything tidy, clean, and understandable.
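(For reference, a declared item could look roughly like the sketch below; the field names mirror the data collected per proposal and are assumptions about the schema, not the PR's actual code:)

import scrapy


class ProposalItem(scrapy.Item):
    # Illustrative fields for one proposal
    author = scrapy.Field()
    created_on = scrapy.Field()
    content = scrapy.Field()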

@Niveshkrishna
Author

I think everything except the documentation is taken care of. Let me know if there's anything else.

@ananyo2012
Contributor

@Niveshkrishna Can you complete the documentation so that we can merge this PR?
