Update scripts and readme
sbaltes committed Nov 10, 2020
1 parent aabdce7 commit ea62c85
Showing 5 changed files with 22 additions and 29 deletions.
6 changes: 3 additions & 3 deletions sotorrent/README.md
@@ -14,15 +14,15 @@

`7za e sql.7z -osql`

- 3. Edit the SQL script `load_sotorrent.sh` to change the passwords for the `root` and `sotorrent` MySQL users and the path where the CSV and XML files are located.
+ 3. Edit the SQL script `load_sotorrent.sh` to change the passwords for the `root` and `sotorrent` MySQL users and the path where the MySQL dump files are located.

4. Run the `load_sotorrent.sh` script.

## Data

- The Stack Overflow data has been extracted from the official [Stack Exchange data dump](https://archive.org/details/stackexchange) released 2020-03-02.
+ The Stack Overflow data has been extracted from the official [Stack Exchange data dump](https://archive.org/details/stackexchange) released 2020-06-02.

- The GitHub references have been retrieved from the [Google BigQuery GitHub data set](https://cloud.google.com/bigquery/public-data/github) on 2020-03-15 (last updated 2020-03-13 according to table info).
+ The GitHub references have been retrieved from the [Google BigQuery GitHub data set](https://cloud.google.com/bigquery/public-data/github) on 2020-11-02 (last updated 2020-10-29 according to table info).

## MySQL Troubleshooting

File renamed without changes.
7 changes: 3 additions & 4 deletions sotorrent/export/1_export.sh → sotorrent/export/export.sh
@@ -5,17 +5,16 @@ log_file="sotorrent.log"
sotorrent_db="sotorrent20_06"

# absolute path to XML and CSV files (consider MySQL's secure-file-priv option)
- # escape slashes in path because the string is used in a sed command
- data_path="E:\/Temp\/" # Cygwin
- #data_path="\/tmp\/" # Linux
+ data_path="E:/Temp/" # Cygwin
+ #data_path="/tmp/" # Linux

rm -f $log_file

if [ "$1" = "so-dump" ]; then
echo "Exporting $1 tables..." | tee -a "$log_file"
mysqldump -usotorrent -p$sotorrent_password --default-character-set=utf8mb4 $sotorrent_db Users -r $data_path/so-dump/Users.sql
mysqldump -usotorrent -p$sotorrent_password --default-character-set=utf8mb4 $sotorrent_db Badges -r $data_path/so-dump/Badges.sql
- mysqldump -usotorrent -p$sotorrent_password --default-character-set=utf8mb4 $sotorrent_db PostLinks -$data_path/so-dump/PostLinks.sql
+ mysqldump -usotorrent -p$sotorrent_password --default-character-set=utf8mb4 $sotorrent_db PostLinks -r $data_path/so-dump/PostLinks.sql
mysqldump -usotorrent -p$sotorrent_password --default-character-set=utf8mb4 $sotorrent_db Tags -r $data_path/so-dump/Tags.sql
mysqldump -usotorrent -p$sotorrent_password --default-character-set=utf8mb4 $sotorrent_db Votes -r $data_path/so-dump/Votes.sql
mysqldump -usotorrent -p$sotorrent_password --default-character-set=utf8mb4 $sotorrent_db Comments -r $data_path/so-dump/Comments.sql
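
The export lines in this hunk differ only in the table name, so they could in principle be generated by a loop. The following is a minimal sketch, not part of the commit: the function name `build_dump_commands` and the `/tmp` path are assumptions for illustration, the password option is left out, and the commands are printed instead of executed. Only the tables visible in this hunk are listed.

```shell
#!/bin/sh
# Placeholder values for illustration; the script itself defines its own.
sotorrent_db="sotorrent20_06"
data_path="/tmp"

# Print (do not run) one mysqldump command per table name.
# $1 is the target subdirectory, the remaining arguments are table names.
# The original commands additionally pass -p$sotorrent_password.
build_dump_commands() {
  dump_subdir="$1"
  shift
  for dump_table in "$@"; do
    echo "mysqldump -usotorrent --default-character-set=utf8mb4 $sotorrent_db ${dump_table} -r $data_path/$dump_subdir/${dump_table}.sql"
  done
}

build_dump_commands so-dump Users Badges PostLinks Tags Votes Comments
```

Generating the commands from a list would make a slip like the missing `-r` in the old `PostLinks` line harder to introduce, since the option is written only once.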
2 changes: 1 addition & 1 deletion sotorrent/upload/zenodo/1_get-zenodo-bucket-id.sh
@@ -3,5 +3,5 @@
# retrieve Zenodo bucket id

ZENODO_TOKEN="" # update this
- DEPOSIT_ID="3746061" # update this
+ DEPOSIT_ID="4264652" # update this
curl "https://zenodo.org/api/deposit/depositions/$DEPOSIT_ID?access_token=$ZENODO_TOKEN" | grep -Eo '"links":{"download":"https://zenodo\.org/api/files/[^/]+'
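
To see what the `grep -Eo` filter above keeps, the sketch below runs it against a made-up, heavily abbreviated response. The JSON fragment and the all-zero bucket id are assumptions for illustration, not real Zenodo output, and the opening brace is escaped here for portability across `grep` implementations. The final parameter expansion, which strips the prefix to leave only the bucket id, is also an addition and not part of the script.

```shell
#!/bin/sh
# Hypothetical, abbreviated response; a real deposition record has many more fields.
sample_response='{"id":1234567,"links":{"download":"https://zenodo.org/api/files/00000000-0000-0000-0000-000000000000/example.sql.7z"}}'

# Same idea as the script's filter: match up to, but not including,
# the slash that follows the bucket id.
bucket_url=$(printf '%s\n' "$sample_response" | grep -Eo '"links":\{"download":"https://zenodo\.org/api/files/[^/]+')

# Drop everything up to the last slash so only the bucket id remains.
bucket_id="${bucket_url##*/}"
echo "$bucket_id"
```

The extracted value is what would be copied into `ZENODO_BUCKET` in `2_upload-to-zenodo.sh`.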
36 changes: 15 additions & 21 deletions sotorrent/upload/zenodo/2_upload-to-zenodo.sh
@@ -3,7 +3,12 @@
# Before executing this script, create a new dataset version and delete the old files on the Zenodo website

ZENODO_TOKEN="" # update this
- ZENODO_BUCKET="ac77dee7-b26a-476f-8464-475e8f7c5715" # update this (see get-zenodo-bucket-id.sh)
+ ZENODO_BUCKET="113a8155-8bf0-43c2-9dad-46840be68019" # update this (see get-zenodo-bucket-id.sh)

+ # absolute path to SQL dump files (consider MySQL's secure-file-priv option)
+ # escape slashes in path because the string is used in a sed command
+ data_path="E:\/Temp\/" # Cygwin
+ #data_path="\/tmp\/" # Linux

upload_file() {
FILE_PATH="$1"
@@ -13,27 +18,16 @@ upload_file() {
}

echo "Uploading so-dump..."
- upload_file "so-dump/Badges.xml.7z"
- upload_file "so-dump/Comments.xml.7z"
- upload_file "so-dump/PostHistory.xml.7z"
- upload_file "so-dump/PostLinks.xml.7z"
- upload_file "so-dump/Posts.xml.7z"
- upload_file "so-dump/Tags.xml.7z"
- upload_file "so-dump/Users.xml.7z"
- upload_file "so-dump/Votes.xml.7z"
+ for file in $data_path/so-dump/*.sql.7z; do
+ upload_file "$file";
+ done

echo "Uploading sotorrent..."
- upload_file "sotorrent/CommentUrl.sql.7z"
- upload_file "sotorrent/PostBlockDiff.sql.7z"
- upload_file "sotorrent/PostBlockVersion.sql.7z"
- upload_file "sotorrent/PostVersion.sql.7z"
- upload_file "sotorrent/PostVersionUrl.sql.7z"
- upload_file "sotorrent/TitleVersion.sql.7z"
- upload_file "sotorrent/StackSnippetVersion.sql.7z"
- upload_file "sotorrent/PostViews.sql.7z"
- upload_file "sotorrent/PostTags.sql.7z"
+ for file in $data_path/sotorrent/*.sql.7z; do
+ upload_file "$file";
+ done

echo "Uploading gh-references..."
- upload_file "gh-references/GHMatches.sql.7z"
- upload_file "gh-references/PostReferenceGH.sql.7z"
- upload_file "gh-references/GHCommits.sql.7z"
+ for file in $data_path/gh-references/*.sql.7z; do
+ upload_file "$file";
+ done
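
The glob-based loops introduced here replace the hard-coded file lists. One behavior worth keeping in mind is that, in a POSIX shell, a pattern with no matches is passed to the loop body unchanged. The following sketch shows one way to guard against that; it is not part of the commit. The `upload_file` stub, the `upload_dir` helper, and the file names in the temporary directory are all stand-ins for illustration, and nothing is sent to Zenodo.

```shell
#!/bin/sh
# Stand-in for the script's upload_file: it only counts the calls.
uploaded_count=0
upload_file() {
  uploaded_count=$((uploaded_count + 1))
}

# Call upload_file for every *.sql.7z file in the given directory.
upload_dir() {
  upload_dir_path="$1"
  for upload_candidate in "$upload_dir_path"/*.sql.7z; do
    # An unmatched pattern is passed through literally, so skip non-files.
    [ -f "$upload_candidate" ] || continue
    upload_file "$upload_candidate"
  done
}

# Demo on a throwaway directory with hypothetical file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/GHMatches.sql.7z" "$demo_dir/GHCommits.sql.7z" "$demo_dir/notes.txt"
upload_dir "$demo_dir"
echo "$uploaded_count files queued"
rm -rf "$demo_dir"
```

Quoting the directory part of the pattern, as in `"$upload_dir_path"/*.sql.7z`, keeps paths with spaces intact while still letting the shell expand the wildcard.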
