This repository holds a MongoDB dump consisting of the metadata of ~1,000,000 YouTube videos. The collection process was conducted during November and December 2016.
There are two possibilities to use the dump
- Import to a MongoDB instance
- Parse the .json files
The dump in this repository (mongodb.dump.tar.gz
) was exported from a MongoDB database (v2.6.10) via the mongodump command. To import the dump, extract the .tar.gz
file in an arbitrary folder, e.g., to home/user
-- the dump files will be extracted in a subfolder called dump
. To import the dumped collections, issue the following command in the directory that contains the dump
folder:
$ mongorestore --db youtube
After importing, the database should be accessible via mongo youtube
.
For convinience, the dump was additionaly exported via mongoexport in JSON. Each collection (jobs
and yt-metadata
) was exported in a separate file (ytmetadata.noid.json.gz
and jobs.noid.json.gz
) and each line in a file represents a single document in the collection. The *.sample.json
files contain the first 500 documents in an uncompressed format.
The database consists of two important collections:
- jobs
- yt-metadata
The jobs collection was used in the pre-collection phase of the video metadata collection process. In this phase, random video ID prefixes consisting of 5 characters ([a-z]{5}
) were generated and saved as a "job". These jobs were processed by a worker who used the YouTube search engine to submit queries in the form inurl:"watch?v=[A-Z]{5}
. The results of these queries were then attached to the job.
Example document:
"prefix" : "wxmdw"
This was the video ID prefix that was used to generate the search query. Any videos found by the query are expected to have a video ID that starts with this prefix (case-insensitive)."status" : 0
- Status 0: Job was processed, at least 1 video was found
- Status 10: Job is pending, await processing
- Status 20: Too many results. This status would have been used if a search yields more than 20 results for a given prefix. However, this never happend during the collection process
- Status 21: The search yielded no result for the submitted query.
"results" : [ { "request_language" : "zh-TW", "content_language" : "TW", "via" : "tcp://60.199.198.24:8080", "videoId" : "wXMdW-Mb9k0", "inserted_at" : 1479411573 }, { "request_language" : "zh-TW", "content_language" : "TW", "via" : "tcp://60.199.198.24:8080", "videoId" : "wxmDw-q4Bjk", "inserted_at" : 1479411573 } ]
Array of search results. In addition to the video ID of a particular entry in the list of results, it contains the proxy address via which the search query was performed, the request and content language that YouTube responded with and the unix timestamp at which this entry was inserted to the database (which corresponds to the time at which the search was performed)
Because prefixes were generated in a random fashion, some of them were generated multiple times but saved as a separate job. These jobs were later merged (i.e. the results of those jobs with a common "prefix" were aggregated into a single job). Therefore it is not reliable to count the elements in the "results" array to determine the number of search results! Furthermore, it is possible that the amount of search results for a given prefix vary, depending on the proxy that was used to submit the query. This is because some videos are geo-blocked by YouTube in specific countries. To reliably count the search results for a given prefix, only count those elements in the array that share the same inserted_at timestamp."created_at" : 1479382240
The time at which this document was inserted to the database"updated_at" : 1479411573
The last time that this document was updated"generated_at" : 1479411509
This was used internaly
Once enough jobs were generated and processed, the collection-phase was initiated. In this phase, for each prefix in the job collection for which at least a single video could be found (status 0), a random video ID was selected from the list of search results. For these videos (1,043,429), the metadata was observed over a period of 50 days on a daily basis. For this task, the YouTube API was used.
Example document:
"videoid" : "yoftU-xA4Fw"
The video ID"parts" : "id,statistics"
This was populated by a scheduler to indicate which parts the worker should crawl, the next time it requests the video metadata from the YouTube API"status" : 0
This is the status of the worker that was replied after fetching the metadata of this video.- Status 0: OK. The YouTube API successfuly replied with the metadata for this video.
- Status 10: PENDING. This status was used by the scheduler to indicate that the last processing time of the video was ~24 hours ago and the worker needs to recrawl the statistics of this video.
- Status 20: NOT FOUND. The YouTube API did not replied with any metadata for this video. Once in this status, the worker did not tried any further attempts to collect metadata for this video.
"results" : [ { "contentDetails" : { "duration" : "PT42S", "dimension" : "2d", "definition" : "hd", "caption" : "false", "licensedContent" : "0", "projection" : "rectangular" }, "id" : "yoftU-xA4Fw", "snippet" : { "publishedAt" : "2015-08-26T23:00:56.000Z", "channelId" : "UCJ6u6QrV4O4aJP7JYF19ejw", "title" : "Rew. Orange Circles (Real)", "channelTitle" : "HappierChunk7", "categoryId" : "20", "liveBroadcastContent" : "none" }, "statistics" : { "viewCount" : "10", "likeCount" : "1", "dislikeCount" : "0", "favoriteCount" : "0", "commentCount" : "0" }, "status" : { "uploadStatus" : "processed", "privacyStatus" : "public", "license" : "youtube", "embeddable" : "1", "publicStatsViewable" : "1" }, "inserted_at" : 1482086167 }, ... ]
This contains the reported data from the YouTube API. For the first crawl, both static and dynamic video metadata was captured. For successive crawls only the dynamic (= statistics) video metadata was crawled."updated_at" : 1486318262
The last time this document was updated"created_at" : 1481217193
The time this document was created"reserved_at" : 1486318262
Used internaly. The last time that the worker reserved this document