
CFP crawler #2

Open
wants to merge 18 commits into master

Conversation

Niveshkrishna

This crawler crawls through all the proposals on the CFP page and saves them into a proposals.json file.

@ananyo2012
Contributor

This looks good. You can just use a list instead of key-value pairs for each proposal. I will have a look at the code more carefully after some time.

@Niveshkrishna
Author

I think it becomes easier to use when key-value pairs are used. If one has to know the name of the speaker of the 10th proposal, they can simply do proposals[10]["author"].

@ananyo2012
Contributor

ananyo2012 commented Jul 6, 2018 via email

@Niveshkrishna
Author

Right, got you!

@Niveshkrishna
Author

@ananyo2012 Changed it to a list instead of a dict.
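(For reference, a minimal sketch of the structure being discussed: a top-level list whose entries are per-proposal dicts, so indexed lookups still work. The field names here are illustrative, not taken from the actual output.)

# proposals.json loaded back in: a list of per-proposal dicts
proposals = [
    {"author": "Speaker A", "created_on": "27 Jun, 2018"},
    {"author": "Speaker B", "created_on": "28 Jun, 2018"},
]

# The speaker of the 11th proposal is still a simple lookup:
# proposals[10]["author"]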


}

}
Member

Please add a newline here.

Author

It was just a test file, not of any importance.

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
Member

Why do we have so many comments?

Author

These are added by default by Scrapy.
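(For context, the settings.py that Scrapy scaffolds is mostly commented-out defaults; a trimmed version could keep only what the project actually relies on. A rough sketch, assuming the standard layout with the bot name proposal seen in the crawl log:)

# settings.py, trimmed to the settings this project actually uses
BOT_NAME = 'proposal'

SPIDER_MODULES = ['proposal.spiders']
NEWSPIDER_MODULE = 'proposal.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True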

Member

@aaqaishtyaq please review this PR as you are also working with scrapy these days.

import scrapy
import json
from scrapy import signals

Member

There should be 2 empty lines after import statements.

Author

Sure, that can be done

Contributor

@ananyo2012 left a comment

@Niveshkrishna can you fix the things I mentioned in the comments? Write good documentation for the usage, remove the binary .pyc files from the changes, and add them to .gitignore.

@@ -0,0 +1,3 @@
### Basic Usage

scrapy crawl crawler
Contributor

Can you add some documentation for the project?

Author

Sure, will do it

@@ -0,0 +1,2947 @@
2018-06-27 12:25:46 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: proposal)
Contributor

Is this a generated file? Then please remove it and add it to .gitignore.

Author

Done!

@@ -0,0 +1,4910 @@
[
{
Contributor

This is the data dump, right? Just keep one sample entry and remove the rest.

Author

Done!

created_on = response.xpath("//p[@class='text-muted']/small/b/time/text()").extract()[0].strip()

section = response.xpath("//section[@class='col-sm-8 proposal-writeup']/div")
some_dic = {}
Contributor

Rename this to proposal. It is hard to understand what some_dic is.

Author

Done!

some_dic = {}
for div in section:
heading = div.xpath(".//h4[@class='heading']/b/text()").extract()[0]
data = self.format_data(div.xpath(".//text()").extract(), heading)
Contributor

Which data is this?

Contributor

OK, got it. This is the main proposal content. How about we name the variables as follows:

heading => section_heading
data => section_content

Contributor

Also, just being curious, does div.xpath() return 2 values in a tuple? I see that format_data() takes in 2 arguments.

Author

Which div.xpath() are you referring to?

Contributor

@Niveshkrishna Never mind, I got it. div.xpath(".//text()").extract() is the first argument and heading is the second. Rename the variables as I suggested. Also, to be more clear, make a separate variable for div.xpath(".//text()").extract(), say raw_section_content.
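(Putting the renaming suggestions together, the section loop could read roughly as below; the XPath expressions are the ones from the diff, and the variable names are the ones proposed in this thread:)

proposal = {}
for div in section:
    section_heading = div.xpath(".//h4[@class='heading']/b/text()").extract()[0]
    raw_section_content = div.xpath(".//text()").extract()
    section_content = self.format_data(raw_section_content, section_heading)
    proposal[section_heading[:-1]] = section_content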

some_dic[heading[:-1]] = data

table = response.xpath("//table/tr")
for col in table:
Contributor

And this should be for row in table_rows

Author

Done!

data = data[2:-2]
some_dic[heading[:-1]] = data

table = response.xpath("//table/tr")
Contributor

This should be table_rows

Author

Done!

table = response.xpath("//table/tr")
for col in table:
heading = col.xpath(".//td/small/text()").extract()[0].strip()
data = col.xpath(".//td/text()").extract()[0].strip()
Contributor

How about we name these variables as follows:

heading => extra_info_heading
data => extra_info_content

Author

Done!
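(Applying the same renaming to the table loop from the diff above:)

table_rows = response.xpath("//table/tr")
for row in table_rows:
    extra_info_heading = row.xpath(".//td/small/text()").extract()[0].strip()
    extra_info_content = row.xpath(".//td/text()").extract()[0].strip()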

allowed_domains = ['in.pycon.org']
url = "https://in.pycon.org"
proposals = []
file = open("proposals.json", "w")
Contributor

@ananyo2012 commented Jul 22, 2018

The file is opened but never closed. You should close the file after use. The best way to do this would be to use with. Something like

with open("filename.json", "w") as f:
  f.write("data")

Move it inside the spider_closed method and, if you want, make the filename a member.
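(A rough sketch of what this could look like with the file handled inside spider_closed via with, assuming the filename is kept as a class attribute; the signal wiring for spider_closed is omitted:)

import json

import scrapy


class CrawlerSpider(scrapy.Spider):
    name = 'crawler'
    filename = "proposals.json"
    proposals = []

    def spider_closed(self, spider):
        # Open the file only when the spider finishes and let the
        # context manager close it.
        with open(self.filename, "w") as f:
            json.dump(self.proposals, f, indent=2, sort_keys=True)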

Author

oops! forgot to close it.

Author

Done

@aaqaishtyaq commented Jul 22, 2018

@Niveshkrishna it would also be good to add a requirements.txt:

  pip freeze > requirements.txt

and, as @ananyo2012 suggested, documentation on how to set up the project using virtualenv/pipenv.

@@ -3,6 +3,7 @@
import json
from scrapy import signals


class CrawlerSpider(scrapy.Spider):
name = 'crawler'
@aaqaishtyaq commented Jul 22, 2018

Also, can you change the name of the spider to something like cfpcrawler or cfp?

Author

Can you confirm this, @ananyo2012?


def spider_closed(self, spider):
print("Closing spider")
json.dump(self.proposals, self.file, indent = 2, sort_keys = True)
@aaqaishtyaq commented Jul 22, 2018

It's a common practice to use pipelines to save Scrapy items to a file. Something like this in pipelines.py:

import json


class JsonWritePipeline(object):

    def open_spider(self, spider):
        self.file = open('proposal.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
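(If the pipeline route is taken, it also needs to be enabled in settings.py; the module path below assumes a package named proposal, so adjust it to the actual project layout:)

# settings.py -- enable the JSON-writing pipeline
ITEM_PIPELINES = {
    'proposal.pipelines.JsonWritePipeline': 300,
}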

Author

True, could have used middlewares.py as well. But since this is not so complicated to crawl, I just hardcoded everything.

@aaqaishtyaq commented Jul 22, 2018

There's always a possibility of extending any project; don't hardcode anything unless that's the only way out.

cc: @realslimshanky

Author

Okay, will keep in mind


class ProposalItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()


@Niveshkrishna, you are not declaring any items in items.py. Why?

Author

Unless there are multiple crawlers using the same item, there's not much difference between using items and normal variables. Since there is only one crawler, it is okay to have them declared directly in the crawler.py file. Or is it not? Is there any advantage to using items in this case?


You are right, but it would help to reuse code in the future and to keep everything tidy, clean, and understandable.
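(For reference, a declared item could look roughly like the sketch below; the field names mirror the data collected per proposal and are assumptions about the schema, not the PR's actual code:)

import scrapy


class ProposalItem(scrapy.Item):
    # Illustrative fields for one proposal
    author = scrapy.Field()
    created_on = scrapy.Field()
    content = scrapy.Field()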

@Niveshkrishna
Author

I think everything except the documentation is taken care of. Let me know if there's anything else.

@ananyo2012
Contributor

@Niveshkrishna Can you complete the documentation so that we can merge this PR?
