Major refactorings and adaptations for new SOTorrent release
sbaltes committed Apr 14, 2020
1 parent 8e9f18d commit bf3f5be
Showing 61 changed files with 346 additions and 133 deletions.
14 changes: 10 additions & 4 deletions CHANGELOG.md
@@ -5,11 +5,17 @@ All notable changes to the SOTorrent dataset project will be documented in this

## [Upcoming]

* Extract language information from Stack Snippets and link individual snippets to their predecessors, add `MostRecentVersion` flag
* Add table [PostTags](https://github.com/sotorrent/db-scripts/issues/4)
* Add user reference to table `PostVersion`
* Extract language information from Stack Snippets and link individual snippets to their predecessors
* Update database schema on website
* Add historical user reputation?
* Add historical user reputation
* Remove foreign key constraints, switch to SQLite, make it possible to only partially import SOTorrent
* Replace XML by CSV files

## [2020-03-15] - First release based on SO data dump 2020-03-02

* Update to Stack Overflow data dump 2020-03-02
* Update GitHub references to 2020-03-13 (according to BigQuery table info, retrieved 2020-03-15)
* Add table [PostTags](https://github.com/sotorrent/db-scripts/issues/4)

## [2020-01-24] - Second release based on SO data dump 2019-12-02

2 changes: 1 addition & 1 deletion sotorrent/LICENSE.md
@@ -19,7 +19,7 @@ The tables `PostReferenceGH`, `GHMatches`, and `GHCommits` were retrieved from t

The following tables are based on the tables from the official Stack Exchange data dump listed above. We license them under *Creative Commons Attribution-ShareAlike 4.0 International* ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)):

`CommentUrl`, `PostBlockDiff`, `PostBlockVersion`, `PostVersion`, `PostVersionUrl`, `StackSnippetVersion`, `Threads`, `TitleVersion`, `PostViews`
`CommentUrl`, `PostBlockDiff`, `PostBlockVersion`, `PostVersion`, `PostVersionUrl`, `StackSnippetVersion`, `Threads`, `TitleVersion`, `PostViews`, `PostTags`

Legal code can be found below ([source](https://github.com/creativecommons/legalcode/blob/master/by-sa_4.0.txt)).

4 changes: 2 additions & 2 deletions sotorrent/README.md
@@ -20,9 +20,9 @@

## Data

The Stack Overflow data has been extracted from the official [Stack Exchange data dump](https://archive.org/details/stackexchange) released 2019-12-02.
The Stack Overflow data has been extracted from the official [Stack Exchange data dump](https://archive.org/details/stackexchange) released 2020-03-02.

The GitHub references have been retrieved from the [Google BigQuery GitHub data set](https://cloud.google.com/bigquery/public-data/github) on 2020-01-24 (last updated 2020-01-24 according to table info).
The GitHub references have been retrieved from the [Google BigQuery GitHub data set](https://cloud.google.com/bigquery/public-data/github) on 2020-03-15 (last updated 2020-03-13 according to table info).

## MySQL Troubleshooting

2 changes: 1 addition & 1 deletion sotorrent/export/1_export-to-csv.sh
@@ -2,7 +2,7 @@

sotorrent_password="4ar7JKS2mfgGHiDA"
log_file="sotorrent.log"
sotorrent_db="sotorrent19_12"
sotorrent_db="sotorrent20_03"

# absolute path to XML and CSV files (consider MySQL's secure-file-priv option)
# escape slashes in path because the string is used in a sed command
22 changes: 22 additions & 0 deletions sotorrent/export/2_compress-csv.sh
@@ -0,0 +1,22 @@
#!/bin/bash

if [ "$1" = "so-dump" ]; then
7za a Badges.csv.7z Badges.csv && rm Badges.csv
7za a Comments.csv.7z Comments.csv && rm Comments.csv
7za a PostHistory.csv.7z PostHistory.csv && rm PostHistory.csv
7za a PostLinks.csv.7z PostLinks.csv && rm PostLinks.csv
7za a Posts.csv.7z Posts.csv && rm Posts.csv
7za a Tags.csv.7z Tags.csv && rm Tags.csv
7za a Users.csv.7z Users.csv && rm Users.csv
7za a Votes.csv.7z Votes.csv && rm Votes.csv
elif [ "$1" = "sotorrent" ]; then
7za a PostBlockDiff.csv.7z PostBlockDiff.csv && rm PostBlockDiff.csv
7za a PostVersion.csv.7z PostVersion.csv && rm PostVersion.csv
7za a PostBlockVersion.csv.7z PostBlockVersion.csv && rm PostBlockVersion.csv
7za a PostVersionUrl.csv.7z PostVersionUrl.csv && rm PostVersionUrl.csv
7za a CommentUrl.csv.7z CommentUrl.csv && rm CommentUrl.csv
7za a TitleVersion.csv.7z TitleVersion.csv && rm TitleVersion.csv
7za a StackSnippetVersion.csv.7z StackSnippetVersion.csv && rm StackSnippetVersion.csv
7za a PostViews.csv.7z PostViews.csv && rm PostViews.csv
7za a PostTags.csv.7z PostTags.csv && rm PostTags.csv
fi
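
The per-table commands in `2_compress-csv.sh` all follow the same pattern (compress, then delete the CSV only if `7za` succeeded), so they could be expressed as a loop. The following is a minimal sketch, not the script from this commit: the helper names `tables_for_mode` and `compress_tables` and the `DRY_RUN` variable are assumptions introduced here for illustration.

```shell
#!/bin/bash

# Hypothetical helper: print the table names that belong to a mode.
tables_for_mode() {
  case "$1" in
    so-dump)
      echo "Badges Comments PostHistory PostLinks Posts Tags Users Votes" ;;
    sotorrent)
      echo "PostBlockDiff PostVersion PostBlockVersion PostVersionUrl CommentUrl TitleVersion StackSnippetVersion PostViews PostTags" ;;
    *)
      return 1 ;;
  esac
}

# Hypothetical helper: compress each CSV and remove it only if 7za succeeded.
# With DRY_RUN=1 (an assumption, not part of the commit) the commands are
# printed instead of executed.
compress_tables() {
  local mode="$1" tables table
  tables=$(tables_for_mode "$mode") || { echo "Unknown mode: $mode" >&2; return 1; }
  for table in $tables; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "7za a $table.csv.7z $table.csv && rm $table.csv"
    else
      7za a "$table.csv.7z" "$table.csv" && rm "$table.csv"
    fi
  done
}
```

Calling `DRY_RUN=1 compress_tables sotorrent` prints the nine commands of the `sotorrent` branch without touching any files, which makes it easy to check the table list before running the real compression.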
10 changes: 0 additions & 10 deletions sotorrent/export/2_compress_so-dump.sh

This file was deleted.

10 changes: 0 additions & 10 deletions sotorrent/export/2_compress_sotorrent.sh

This file was deleted.

9 changes: 9 additions & 0 deletions sotorrent/export/sql/export_sotorrent.sql
@@ -82,3 +82,12 @@ OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\"'
LINES TERMINATED BY '\n'
FROM `PostViews`;

SELECT PostId, TagId
INTO OUTFILE '<PATH>PostTags.csv'
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\"'
LINES TERMINATED BY '\n'
FROM `PostTags`;
12 changes: 6 additions & 6 deletions sotorrent/gh-references/retrieve-gh-references.sh
@@ -1,22 +1,22 @@
#!/bin/bash

project="sotorrent-org"
dataset="gh_so_references_2020_01_24"
sotorrent="2019_12_25"
dataset="gh_so_references_2020_03_15"
sotorrent="2020_03_15"
bucket="sotorrent"
logfile="bigquery.log"

# "Table Info" of table "bigquery-public-data:github_repos.contents"
# Last Modified: Jan 24, 2020, 6:19:07 AM
# Number of Rows: 264,153,976
# Last Modified: Mar 13, 2020, 6:29:52 AM
# Number of Rows: 263,975,088
# Table Size: 2.25 TB
#
# Unique file contents of text files under 1 MiB on the HEAD branch.
# Can be joined to [bigquery-public-data:github_repos.files] table using the id columns to identify the repository and file path.

# "Table Info" of table "bigquery-public-data:github_repos.commits"
# Last Modified: Jan 24, 2020, 5:55:03 AM
# Number of Rows: 237,651,394
# Last Modified: Mar 13, 2020, 5:57:01 AM
# Number of Rows: 237,161,297
# Table Size: 774 GB
#
# Unique Git commits from open source repositories on GitHub, pre-grouped by repositories they appear in.
36 changes: 0 additions & 36 deletions sotorrent/import/biguery/import-tables.sh

This file was deleted.

23 changes: 0 additions & 23 deletions sotorrent/import/biguery/upload-to-bigquery.sh

This file was deleted.

35 changes: 21 additions & 14 deletions sotorrent/load_sotorrent.sh
@@ -3,11 +3,11 @@
root_password="_AqUjvtv68E\$N!r]"
sotorrent_password="4ar7JKS2mfgGHiDA"
log_file="sotorrent.log"
sotorrent_db="sotorrent19_12"
db_init=true
load_so=true
load_gh=true
load_sotorrent=true
sotorrent_db="sotorrent20_03"
db_init=false
load_so=false
load_gh=false
load_sotorrent=false

# absolute path to XML and CSV files (consider MySQL's secure-file-priv option)
# escape slashes in path because the string is used in a sed command
@@ -16,21 +16,25 @@ data_path="F:\/Temp\/" # Cygwin

rm -f $log_file

if [ "$1" = "so-only" ]; then
if [ "$1" = "so-dump" ]; then
echo "Will only load SO tables." | tee -a "$log_file"
db_init=true
load_so=true
load_gh=false
load_sotorrent=false
elif [ "$1" = "gh-only" ]; then
elif [ "$1" = "gh-references" ]; then
echo "Will only load GH tables." | tee -a "$log_file"
db_init=false
load_so=false
load_gh=true
load_sotorrent=false
elif [ "$1" = "complete" ]; then
echo "Will load all tables." | tee -a "$log_file"
load_so=true
load_gh=true
load_sotorrent=true
fi

if [ "$db_init" = true ] ; then
if [ "$2" = "db-init" ] ; then
db_init=true
echo "Creating database..." | tee -a "$log_file"
mysql -u root --password="$root_password" -e "DROP DATABASE IF EXISTS $sotorrent_db;
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
@@ -68,16 +72,19 @@ if [ "$load_gh" = true ] ; then
sed -e"s/<PATH>/$data_path/g" ./sql/5_load_gh-references.sql > ./sql/5_load_gh-references_paths.sql
mysql $sotorrent_db -u root --password="$root_password" < ./sql/5_load_gh-references_paths.sql >> $log_file 2>&1
rm ./sql/5_load_gh-references_paths.sql

echo "Creating indices for GH References tables..." | tee -a "$log_file"
mysql $sotorrent_db -u root --password="$root_password" < ./sql/6_create_gh-references_indices.sql >> $log_file 2>&1
fi

if [ "$load_sotorrent" = true ] ; then
echo "Loading SOTorrent tables..." | tee -a "$log_file"
sed -e"s/<PATH>/$data_path/g" ./sql/6_load_sotorrent.sql > ./sql/6_load_sotorrent_paths.sql
mysql $sotorrent_db -u root --password="$root_password" < ./sql/6_load_sotorrent_paths.sql >> $log_file 2>&1
rm ./sql/6_load_sotorrent_paths.sql
sed -e"s/<PATH>/$data_path/g" ./sql/7_load_sotorrent.sql > ./sql/7_load_sotorrent_paths.sql
mysql $sotorrent_db -u root --password="$root_password" < ./sql/7_load_sotorrent_paths.sql >> $log_file 2>&1
rm ./sql/7_load_sotorrent_paths.sql

echo "Creating indices for SOTorrent tables..." | tee -a "$log_file"
mysql $sotorrent_db -u root --password="$root_password" < ./sql/7_create_sotorrent_indices.sql >> $log_file 2>&1
mysql $sotorrent_db -u root --password="$root_password" < ./sql/8_create_sotorrent_indices.sql >> $log_file 2>&1
fi

echo "Finished." | tee -a "$log_file"
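
The mode selection in `load_sotorrent.sh` (first argument `so-dump`, `gh-references`, or `complete`) could also be written as a `case` statement. This is only a sketch under assumptions: the function name `select_load_flags` and the usage message for unknown modes are not part of the commit, where an unrecognized argument simply leaves all flags at `false`.

```shell
#!/bin/bash

# Hypothetical helper: map the first script argument to the load flags.
# All flags default to false, as in the new version of the script.
select_load_flags() {
  load_so=false
  load_gh=false
  load_sotorrent=false
  case "$1" in
    so-dump)       load_so=true ;;
    gh-references) load_gh=true ;;
    complete)      load_so=true; load_gh=true; load_sotorrent=true ;;
    *)
      # Assumption: report unknown modes instead of silently doing nothing.
      echo "Usage: load_sotorrent.sh {so-dump|gh-references|complete} [db-init]" >&2
      return 1 ;;
  esac
}
```

After `select_load_flags "$1"`, the existing `if [ "$load_so" = true ]` style checks would work unchanged, since the function sets the same variable names the script already uses.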
6 changes: 6 additions & 0 deletions sotorrent/posttags/bigquery/PostTags.sql
@@ -0,0 +1,6 @@
SELECT temp.PostId AS PostId, tags.Id AS TagId
FROM `sotorrent-org.2020_03_15.Tags` tags
JOIN `sotorrent-org.2020_03_15.PostTagsTemp` temp
ON tags.TagName = temp.Tag;

=> `sotorrent-org.2020_03_15.PostTags`
29 changes: 29 additions & 0 deletions sotorrent/posttags/export_posttags.sh
@@ -0,0 +1,29 @@
#!/bin/sh

root_password="_AqUjvtv68E\$N!r]"
sotorrent_password="4ar7JKS2mfgGHiDA"
log_file="sotorrent.log"
sotorrent_db="sotorrent20_03"

# absolute path to XML and CSV files (consider MySQL's secure-file-priv option)
# escape slashes in path because the string is used in a sed command
data_path="F:\/Temp\/" # Cygwin
#data_path="\/tmp\/" # Linux

rm -f $log_file

echo "Creating temporary PostTags table..." | tee -a "$log_file"
mysql $sotorrent_db -u root --password="$root_password" < ./sql/create_posttags_temp.sql >> $log_file 2>&1

echo "Loading temporary PostTags table..." | tee -a "$log_file"
sed -e"s/<PATH>/$data_path/g" ./sql/load_posttags_temp.sql > ./sql/load_posttags_temp_absolute_paths.sql
echo "Reading PostTagsTemp.csv from $data_path..."
mysql $sotorrent_db -u root --password="$root_password" < ./sql/load_posttags_temp_absolute_paths.sql >> $log_file 2>&1
rm ./sql/load_posttags_temp_absolute_paths.sql

echo "Deleting temporary PostTags table..." | tee -a "$log_file"
mysql $sotorrent_db -u root --password="$root_password" < ./sql/delete_posttags_temp.sql >> $log_file 2>&1

echo "Finished." | tee -a "$log_file"

# Next step: Upload table to BigQuery and replace tags by tag ids
29 changes: 29 additions & 0 deletions sotorrent/posttags/load_posttags.sh
@@ -0,0 +1,29 @@
#!/bin/sh

root_password="_AqUjvtv68E\$N!r]"
sotorrent_password="4ar7JKS2mfgGHiDA"
log_file="sotorrent.log"
sotorrent_db="sotorrent20_03"

# absolute path to XML and CSV files (consider MySQL's secure-file-priv option)
# escape slashes in path because the string is used in a sed command
data_path="F:\/Temp\/" # Cygwin
#data_path="\/tmp\/" # Linux

rm -f $log_file

echo "Loading PostTags table..." | tee -a "$log_file"
dir=`pwd`
cd "$data_path"
echo "Extracting PostTags.csv.7z in $data_path..."
7za e "PostTags.csv.7z"
cd "$dir"
echo "Reading PostTags.csv from $data_path..."
sed -e"s/<PATH>/$data_path/g" ./sql/load_posttags.sql > ./sql/load_posttags_absolute_paths.sql
mysql $sotorrent_db -u root --password="$root_password" < ./sql/load_posttags_absolute_paths.sql >> $log_file 2>&1
rm ./sql/load_posttags_absolute_paths.sql
cd "$data_path"
rm "PostTags.csv"
cd "$dir"

echo "Finished." | tee -a "$log_file"
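
The scripts in this commit escape the slashes in `data_path` because the value is used inside a `sed` expression delimited by `/`. A possible alternative, sketched below under assumptions, is to use a different `sed` delimiter so that an unescaped path can be passed in. The function name `render_sql_template` is hypothetical, and the sketch assumes the path contains no `|`, `&`, or backslash characters, which are still special in a `sed` replacement.

```shell
#!/bin/bash

# Hypothetical helper: replace the <PATH> placeholder in an SQL template
# read from stdin with the directory given as the first argument.
# Using '|' as the sed delimiter means slashes in the path need no escaping.
render_sql_template() {
  sed -e "s|<PATH>|$1|g"
}
```

For example, `render_sql_template /tmp/ < ./sql/load_posttags.sql` would write the SQL with `/tmp/PostTags.csv` as the input file to standard output, from where it could be piped to `mysql` without creating a temporary `*_absolute_paths.sql` file.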
12 changes: 12 additions & 0 deletions sotorrent/posttags/schema/PostTagsTemp.json
@@ -0,0 +1,12 @@
[
{
"mode": "REQUIRED",
"name": "PostId",
"type": "INTEGER"
},
{
"mode": "REQUIRED",
"name": "Tag",
"type": "STRING"
}
]
4 changes: 4 additions & 0 deletions sotorrent/posttags/sql/create_posttags_temp.sql
@@ -0,0 +1,4 @@
CREATE TABLE `PostTagsTemp` (
PostId INT NOT NULL,
Tag VARCHAR(40) NOT NULL
);
1 change: 1 addition & 0 deletions sotorrent/posttags/sql/delete_posttags_temp.sql
@@ -0,0 +1 @@
DROP TABLE IF EXISTS `PostTagsTemp`;
10 changes: 10 additions & 0 deletions sotorrent/posttags/sql/load_posttags.sql
@@ -0,0 +1,10 @@
SET foreign_key_checks = 0;
LOAD DATA INFILE '<PATH>PostTags.csv' INTO TABLE `PostTags`
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(PostId, TagId);
SET foreign_key_checks = 1;
10 changes: 10 additions & 0 deletions sotorrent/posttags/sql/load_posttags_temp.sql
@@ -0,0 +1,10 @@
SET foreign_key_checks = 0;
LOAD DATA INFILE '<PATH>PostTagsTemp.csv' INTO TABLE `PostTagsTemp`
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(PostId, Tag);
SET foreign_key_checks = 1;