Commit: Prep for initial release.

ruebot committed Dec 10, 2019
1 parent 3c37dfc commit d130853

Showing 2 changed files with 118 additions and 21 deletions.
README.md: 28 additions & 20 deletions

@@ -2,6 +2,7 @@

[![Build Status](https://travis-ci.org/archivesunleashed/twut.svg?branch=master)](https://travis-ci.org/archivesunleashed/twut)
[![codecov](https://codecov.io/gh/archivesunleashed/twut/branch/master/graph/badge.svg)](https://codecov.io/gh/archivesunleashed/twut)
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/io.archivesunleashed/twut/badge.svg)](https://maven-badges.herokuapp.com/maven-central/io.archivesunleashed/twut)
[![LICENSE](https://img.shields.io/badge/license-Apache-blue.svg?style=flat)](https://www.apache.org/licenses/LICENSE-2.0)
[![Contribution Guidelines](http://img.shields.io/badge/CONTRIBUTING-Guidelines-blue.svg)](./CONTRIBUTING.md)

@@ -10,39 +11,46 @@

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

## Dependencies

- Java 8 or 11
- Python 3
- [Apache Spark](https://spark.apache.org/downloads.html)

## Getting Started

### Packages

#### Spark Shell

```
$ spark-shell --packages "io.archivesunleashed:twut:0.0.2"
```

#### PySpark

```
$ pyspark --py-files /path/to/twut.zip --packages "io.archivesunleashed:twut:0.0.2"
```

You will need the `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables set.
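
For example, on Linux or macOS (assuming a `python3` interpreter on your `PATH`; adjust for your environment):

```
$ export PYSPARK_PYTHON=python3
$ export PYSPARK_DRIVER_PYTHON=python3
```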

### Jars

You can download the [latest release files here](https://github.com/archivesunleashed/twut/releases) and include them like so:

#### Spark Shell

```
$ spark-shell --jars /path/to/twut-0.0.2-fatjar.jar
```

#### PySpark

```
$ pyspark --py-files /path/to/twut.zip --driver-class-path /path/to/twut-0.0.2-fatjar.jar --jars /path/to/twut-0.0.2-fatjar.jar
```

## Documentation! Or, how do I use this?

Once built or downloaded, you can follow the basic set of recipes and tutorials [here](https://github.com/archivesunleashed/twut/tree/master/docs/usage.md).

# License

docs/usage.md: 90 additions & 1 deletion

@@ -32,14 +32,16 @@

```
val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

ids(tweetsDF).show(2, false)
```

**Output**:
```
+-------------------+
|id_str |
+-------------------+
|1201505319257403392|
|1201505319282565121|
+-------------------+
only showing top 2 rows
```
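
Since `ids` returns an ordinary DataFrame, you can persist the result with the standard Spark writer rather than just previewing it; a minimal sketch, assuming the hypothetical output path `ids-output`:

```
// Collapse to one partition so the ids land in a single file,
// then write them out as plain text ("ids-output" is a made-up path).
ids(tweetsDF)
  .coalesce(1)
  .write
  .text("ids-output")
```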

### Python DF

@@ -51,7 +53,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.ids(df).show(2, False)
```

**Output**:
```
+-------------------+
|id_str |
+-------------------+
```

@@ -72,7 +77,10 @@

```
val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

userInfo(tweetsDF).show(2, false)
```

**Output**:
```
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
|favourites_count|followers_count|friends_count|id_str |location|name |screen_name |statuses_count|verified|
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
```

@@ -91,7 +99,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.userInfo(df).show(2, False)
```

**Output**:
```
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
|favourites_count|followers_count|friends_count|id_str |location|name |screen_name |statuses_count|verified|
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
```

@@ -112,7 +123,10 @@

```
val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

text(tweetsDF).show(2, false)
```

**Output**:
```
+---------------------------------+
|text |
+---------------------------------+
```

@@ -130,7 +144,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.text(df).show(2, False)
```

**Output**:
```
+---------------------------------+
|text |
+---------------------------------+
```

@@ -152,7 +169,10 @@

```
val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

times(tweetsDF).show(2, false)
```

**Output**:
```
+------------------------------+
|created_at |
+------------------------------+
```

@@ -171,7 +191,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.times(df).show(2, False)
```

**Output**:
```
+------------------------------+
|created_at |
+------------------------------+
```

@@ -193,7 +216,10 @@

```
val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

sources(tweetsDF).show(10, false)
```

**Output**:
```
+------------------------------------------------------------------------------------+
|source |
+------------------------------------------------------------------------------------+
```

@@ -219,7 +245,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.sources(df).show(10, False)
```

**Output**:
```
+------------------------------------------------------------------------------------+
|source |
+------------------------------------------------------------------------------------+
```

@@ -249,7 +278,10 @@

```
val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

hashtags(tweetsDF).show
```

**Output**:
```
+------------------+
| hashtags|
+------------------+
```

@@ -266,7 +298,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.hashtags(df).show()
```

**Output**:
```
+------------------+
| hashtags|
+------------------+
```

@@ -287,7 +322,10 @@

```
val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

urls(tweetsDF).show(10, false)
```

**Output**:
```
+-----------------------------------------------------------+
|url |
+-----------------------------------------------------------+
```

@@ -305,7 +343,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.urls(df).show(10, False)
```

**Output**:
```
+-----------------------+
|url |
+-----------------------+
```

@@ -335,7 +376,10 @@

```
val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

animatedGifUrls(tweetsDF).show(10, false)
```

**Output**:
```
+-----------------------------------------------------------+
|animated_gif_url |
+-----------------------------------------------------------+
```

@@ -354,7 +398,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.animatedGifUrls(df).show(10, False)
```

**Output**:
```
+-----------------------------------------------------------+
|animated_gif_url |
+-----------------------------------------------------------+
```

@@ -377,7 +424,10 @@

```
val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

imageUrls(tweetsDF).show(5, false)
```

**Output**:
```
+-----------------------------------------------+
|image_url |
+-----------------------------------------------+
```

@@ -398,7 +448,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.imageUrls(df).show(5, False)
```

**Output**:
```
+-----------------------------------------------+
|image_url |
+-----------------------------------------------+
```

@@ -421,7 +474,10 @@

```
val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

mediaUrls(tweetsDF).show(5, false)
```

**Output**:
```
+-----------------------------------------------+
|image_url |
+-----------------------------------------------+
```

@@ -442,7 +498,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.mediaUrls(df).show(5, False)
```

**Output**:
```
+-----------------------------------------------+
|image_url |
+-----------------------------------------------+
```

@@ -467,7 +526,10 @@

```
val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

videoUrls(tweetsDF).show(5, false)
```

**Output**:
```
+---------------------------------------------------------------------------------------------------+
|video_url |
+---------------------------------------------------------------------------------------------------+
```

@@ -488,7 +550,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.videoUrls(df).show(5, False)
```

**Output**:
```
+---------------------------------------------------------------------------------------------------+
|video_url |
+---------------------------------------------------------------------------------------------------+
```

@@ -513,6 +578,10 @@

```
val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

removeSensitive(tweetsDF).count
```

**Output**:
```
res0: Long = 246
```
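
Since `removeSensitive` returns a regular DataFrame, it chains directly into the extraction functions shown earlier; a small sketch reusing `tweetsDF` from this example:

```
// Drop tweets flagged as sensitive, then preview the text
// of the remaining tweets.
text(removeSensitive(tweetsDF)).show(2, false)
```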

@@ -525,6 +594,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

FilterTweet.removeSensitive(df).count()
```

**Output**:
```
246
```

@@ -541,6 +614,10 @@

```
val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

removeRetweets(tweetsDF).count
```

**Output**:
```
res0: Long = 230
```
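
The filters also compose with one another, so you can stack them before counting; for example, a sketch combining this filter with `removeSensitive` from the previous section:

```
// Count tweets that are neither retweets nor flagged as sensitive.
removeRetweets(removeSensitive(tweetsDF)).count
```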

@@ -553,6 +630,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

FilterTweet.removeRetweets(df).count()
```

**Output**:
```
230
```

@@ -569,6 +650,10 @@

```
val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

removeNonVerified(tweetsDF).count
```

**Output**:
```
res0: Long = 5
```
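
To put that count in context, compare it against the size of the unfiltered DataFrame; a quick sketch:

```
// Fraction of the sample coming from verified accounts.
val verified = removeNonVerified(tweetsDF).count
val total = tweetsDF.count
println(f"${verified.toDouble / total}%.3f") // 5 / 500 = 0.010
```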

@@ -581,6 +666,10 @@

```
path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

FilterTweet.removeNonVerified(df).count()
```

**Output**:
```
5
```

