4c API Crawler

Request commands and example responses are taken from https://github.com/4chan/4chan-API.

Use of Rules (as outlined in the link above):

1. The code adheres to the limit of 1 request per second.
2. Not implemented, since I did not know how to apply this rule.
3. The first function creates the specified header, but it did not appear to have any effect.
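
The one-request-per-second limit and the request header mentioned above could be handled roughly as follows. This is a minimal sketch, not the code in 4cAPICrawler.py: the names `RateLimiter` and `if_modified_since_header` are hypothetical, and it assumes the header in question is the standard HTTP `If-Modified-Since` header.

```python
import time
from email.utils import formatdate


class RateLimiter:
    """Keeps successive requests at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Call right before each request; sleeps only if the last one was too recent."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def if_modified_since_header(last_fetch_epoch):
    """Build the header from the Unix time (seconds) of the previous fetch.

    Assumption: the header meant by the rules is the standard HTTP
    If-Modified-Since header, which takes an HTTP-date in GMT.
    """
    return {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}
```

Calling `limiter.wait()` before every request keeps the crawler within the limit no matter how long each response takes. With a conditional header, a server that honors it normally signals "nothing changed" with a 304 status and an empty body, so the header can look like it has no effect unless the status code is checked.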

Terms of Service:

This project is purposely not called "4chan API Crawler" because use of the name "4chan" is not permitted.

Functions in 4c API Crawler:

1. See "Use of Rules" #3.
2. getBoardList() returns a list of board IDs, which are required in the requests made by the following functions.
3. getArchivedThreadList(board) returns a list of threads (ints) that have been archived for the given board. These threads may or may not fall within the desired date range. I did not check this; I included them all just in case, and to reduce run time by skipping the check of each thread's last-updated date. You can check whether all of a board's archived threads fall within the desired date range; if they are all outside the range, you can stop using this function (assuming the program worked for the runs that covered the earliest date range).
4. getCurrentThreadList(board,deltaSeconds,deltaMinutes,deltaHours,deltaDays,deltaWeeks) returns a comprehensive list of threads for the given board that fall within the specified range of time, measured from the function's start time.
5. getThreadData(board,threadNum) returns the JSON for a thread (stream of comments) on the given board that threadNum belongs to.
6. getCompleteThreadList(board,deltaSeconds,deltaMinutes,deltaHours,deltaDays,deltaWeeks) returns the combined list of current and archived threads returned by the functions in #3 and #4.
7. getEarliestDateDeltas(csvFile) returns the earliest date in a CSV file formatted the same way as the CSV file in this folder. For more generality, you can add a parameter "valueName" and replace instances of "published_time" with "valueName" in the function. Dates must be in month/day/year hour:minute format for this to work. This ensures that the earliest-dated threads needed are retrieved when the program is first run to collect data.
8. updateCSV(fileName) updates the CSV file with the number of threads each link appears in and the number of replies those threads have. You must read the comments in the code to edit this function accordingly, and check them to see what needs to be done after a successful first run.
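
To illustrate how the earliest CSV date can drive the time window used for the thread lists, here is a small sketch. It is not the code in 4cAPICrawler.py. The names `earliest_time`, `deltas_since`, and `filter_recent_threads` are hypothetical, the date format string assumes a 4-digit year, and the thread entries are assumed to be dicts with `no` and `last_modified` (Unix seconds) keys grouped into pages, as in the example responses of the API documentation.

```python
import csv
from datetime import datetime, timedelta

# Assumed format: month/day/year hour:minute with a 4-digit year.
DATE_FORMAT = "%m/%d/%Y %H:%M"


def earliest_time(csv_lines, value_name="published_time"):
    """Return the earliest datetime found in column `value_name`.

    `csv_lines` is an open file or any iterable of CSV lines with a header row.
    """
    reader = csv.DictReader(csv_lines)
    return min(datetime.strptime(row[value_name].strip(), DATE_FORMAT) for row in reader)


def deltas_since(earliest, now):
    """Split the time elapsed between `earliest` and `now` into delta components."""
    total = int((now - earliest).total_seconds())
    weeks, rest = divmod(total, 7 * 24 * 3600)
    days, rest = divmod(rest, 24 * 3600)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return {"seconds": seconds, "minutes": minutes, "hours": hours,
            "days": days, "weeks": weeks}


def filter_recent_threads(pages, now_epoch, seconds=0, minutes=0, hours=0, days=0, weeks=0):
    """Return thread numbers whose last modification falls inside the window.

    `pages` is assumed to look like [{"page": 1, "threads": [{"no": ..., "last_modified": ...}]}].
    """
    window = timedelta(seconds=seconds, minutes=minutes, hours=hours, days=days, weeks=weeks)
    cutoff = now_epoch - window.total_seconds()
    return [thread["no"]
            for page in pages
            for thread in page["threads"]
            if thread["last_modified"] >= cutoff]
```

A first run might compute `deltas_since(earliest_time(open_csv), datetime.now())` and pass the result to `filter_recent_threads(pages, time.time(), **deltas)`, so that only threads modified since the earliest article date are kept. Later runs could pass a shorter window covering only the time since the previous run.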

To Do List:

Find out how and where the organization/Temple wants to extract comments from threads that contain the article URLs
Run updateCSV successfully for the first time
Run the updateCSV function with the start date of the last run as the parameter

Things to consider:

The data collected may only be an estimate of the actual values: the program takes about 3 days to run (possibly less, since it has been edited for efficiency), and users can post new comments during that time, so some data may be missed.

ereizas/Social-Media-Web-Crawlers
