index.Rmd

---
title: "Workshop de Topic Modeling"
subtitle: "Slides -- [.white[storopoli.io/topic-modeling-workshop]](https://storopoli.io/topic-modeling-workshop)"
author: 
  - "Jose Storopoli, PhD"
  - '`r knitr::include_graphics(c("images/UNINOVE_CIS.png", "images/UNINOVE_PPGA.png"), dpi = 400)`'
  - "[![CC BY-SA 4.0](https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg)](http://creativecommons.org/licenses/by-sa/4.0/)"
date: "04/05/2021"
output:
  xaringan::moon_reader:
    lib_dir: libs
    css: css/xaringan-themer.css
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---

class: animated, fadeIn
layout: true

---
```{r setup, include=FALSE}
library(magrittr)
options(htmltools.dir.version = FALSE,
        htmltools.preserve.raw = FALSE,
        scipen = 0, digits=3)
knitr::opts_chunk$set(fig.retina = 4,
                      warning = FALSE,
                      message = FALSE,
                      echo = FALSE)
set.seed(123)
```

```{r xaringan-themer, include=FALSE, warning=FALSE}
library(xaringanthemer)

extra_css <- list(
  ".tiny" = list("font-size" = "40%"),
  ".small" = list("font-size" = "70%"),
  ".large" = list("font-size" = "130%"),
  ".xlarge" = list("font-size" = "200%"),
  ".full-width" = list(
    display = "flex",
    width   = "100%",
    flex    = "1 1 auto"
  ),
  "white" = list("color" = "white !important")
)

# UNINOVE Colors
style_mono_accent(
  base_color = "#29427A",
  header_font_google = google_font("Josefin Sans"),
  text_font_google   = google_font("Montserrat", "300", "300i"),
  code_font_google   = google_font("Fira Mono"),
  code_font_size     = "0.8rem",
  text_font_size     = "1.5em",
  footnote_font_size = "0.4em",
  extra_css          = extra_css,
  outfile            = "css/xaringan-themer.css"
)
```

```{r xaringan-logo, echo=FALSE}
# xaringanExtra tile view press key "O"
xaringanExtra::use_tile_view()

xaringanExtra::use_logo(
  image_url = "https://raw.githubusercontent.com/storopoli/UNINOVE-xaringan-theme/master/resources/uninove.png",
  link_url = "https://www.uninove.br",
  width = "110px",
  height = "55px")

xaringanExtra::use_fit_screen()
#xaringanExtra::use_animate_css()
xaringanExtra::use_tachyons()

# xaringanExtra webcam press key "W"
xaringanExtra::use_webcam()
```

# O que é modelagem de tópicos?

```{r topic-modeling, out.width='100%'}
knitr::include_graphics("images/topic-modeling.jpg")
```

.footnote[
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–260. https://doi.org/10.1126/science.aaa8415
]

???

Modelo Probabilístico

---
class: inverse, middle, center

# Iramuteq vs Topic Modeling

```{r fight}
knitr::include_graphics("images/fight.jpg")
```


---

.footnote[
Reinert, M. (1990). Alceste une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval. Bulletin of Sociological Methodology/Bulletin de méthodologie sociologique, 26(1), 24-54.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. the Journal of machine Learning research, 3, 993-1022.
]

.pull-left[
## Iramuteq
* Clusterização Hierárquica
* Estimativa Pontual
* Erro
* Muitos "graus de liberdade" do pesquisador
* Reinert (1990) - 787 citações
]

--

.pull-right[
## Topic Modeling
* Modelo Probabilístico
* Densidade Posterior
* Incerteza 
* Apenas um "grau de liberdade" do pesquisador
* Blei et al. (2003) - 37.287 citações
* Publicado na Nature, PLoS, PNAS etc.
* Usado pela Amazon
]

???

Iramuteq é clusterização -> Point Estimate e um único pertencimento

Enquanto TM é um modelo generativo probabilístico (bayesiano) -> Densidade posterior completa e uma probabilidade de pertencimento $\sum_p =1 e \mathbf{p} = (p_1, \dots, p_k)$.

---
class: inverse, middle, center

# Mas ainda temos que processar texto

```{r corpus, out.width='75%'}
knitr::include_graphics("images/corpus.jpg")
```

---
## Pré-processamento de Texto

```{r graph-text-preprocessing, out.width='80%'}
library(DiagrammeR)
grViz("
 digraph text_preprocessing {
  graph [overlap = false,
         fontsize = 12,
         rankdir = LR]
  node [shape = oval,
        fontname = Helvetica]
  A [label = 'Pré\nProcessamento']
  node [shape = box,
        fontname = Helvetica]
  B [label = 'Tokenização']; C [label = 'Lematização']; D [label = 'Stop Words'];
  A -> {B C D } [dir = forward,
                    tailport = 'e',
                    headport = 'w']
} 
")
```

.footnote[
Denny, M. J., & Spirling, A. (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis, 26(2), 168–189. https://doi.org/10.1017/pan.2017.44

Storopoli, J. E. (2019). Topic Modeling: How and why to use in management research. Iberoamerican Journal of Strategic Management (IJSM), 18(3), 8–20.
]

---
# Structural Topic Modeling (STM)

.small[
* Topic Modeling com Esteroides

* Metadados dos documentos para gerar inferências sobre a prevalência e conteúdo de cada tópico

* Além de descobrir tópicos

* Analisa a relação das informações dos documentos com os tópicos
* Farrell (2016) analisou mais de 40 mil documentos sobre mudança climática de 120 organizações
* Kuhn (2018) analisou mais de 25 mil relatórios de acidentes de aviação
]

.footnote[
Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064–1082. https://doi.org/10.1111/ajps.12103

Farrell, J. (2016). Corporate funding and ideological polarization about climate change. Proceedings of the National Academy of Sciences, 113(1), 92–97. https://doi.org/10.1073/PNAS.1509433112

Kuhn, K. D. (2018). Using structural topic modeling to identify latent topics and trends in aviation incident reports. Transportation Research Part C: Emerging Technologies, 87, 105-122.
]

---
class: inverse, middle, center

# Ferramentas

`r icons::icon_style(icons::fontawesome("tools"), scale = 6, fill = "white")`

---
class: middle

# R
### [`{stm}`](https://www.structuraltopicmodel.com/) e [`{quanteda}`](http://quanteda.io/)
# Python
### [`gensim`](https://radimrehurek.com/gensim/) e [`scikit-learn`](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
# Julia
### [`TextAnalysis.jl`](https://juliahub.com/docs/TextAnalysis)

---
class: inverse, middle

```{r case-study, out.width='100%'}
knitr::include_graphics("images/case-study.jpg")
```

---
# Senhor dos Anéis - [Kaggle](https://www.kaggle.com/paultimothymooney/lord-of-the-rings-data?select=lotr_scripts.csv)

.pull-left[
### `lotr_scripts.csv`
* 2,389 falas

* `char`: Personagem

* `dialog`: Fala

* `movie`: Filme
]
.pull-right[
### `lotr_characters.csv`
* 911 personagens

* `race`: Elfo, Orc, Humano etc.

* `gender`: Male, Female
]

.footnote[
Kaggle - https://www.kaggle.com/paultimothymooney/lord-of-the-rings-data
]

---
# Preparação dos dados
.pull-left[
```{r data-prep, echo=TRUE, eval=FALSE}
library(readtext)

df <- readtext(
  "data/lotr_scripts.csv",
  text_field = "dialog") #<<
```

```{r data-prep-true}
df <- readRDS("processed-data/df.rds")
```
]

.pull-right[
* FRODO
* SAM
* GANDALF
* ARAGORN
* PIPPIN
* MERRY
* GOLLUM
* GIMLI
* LEGOLAS
]

---
# Corpus e Tokens

```{r corpus-r, echo=TRUE, eval=FALSE}
library(quanteda)
corpus <- corpus(df) #<<
summary(corpus)
```

```{r corpus-real}
library(quanteda)
corpus <- readRDS("processed-data/corpus.rds")
summary(corpus, n = 3)
```

```{r tokens, echo=TRUE, eval=FALSE}
toks <- tokens(corpus,
               remove_punct = TRUE, #<<
               remove_symbols = TRUE, #<<
               remove_numbers = TRUE, #<<
               remove_separators = TRUE, #<<
               split_hyphens = TRUE #<<
)
```

```{r tokens-real}
toks <- readRDS("processed-data/toks.rds")
```


---
# Document-Term Matrix (`dtm`)

```{r dfm, echo=TRUE, eval=FALSE}
library(stopwords)
dfm_mat <- dfm(toks,
  tolower = TRUE) #<<

dfm_mat <- dfm_remove(dfm_mat,
  pattern = stopwords(language = "en", source = "snowball")) #<<

dfm_mat <- dfm_wordstem(dfm_mat,
  language = "en") #<<
dfm_mat
```

```{r dfm-real}
dfm_mat <- readRDS("processed-data/dfm_mat.rds")
dfm_mat[, 1:8]
```

---
# Falas x Personagem x Filme
```{r ggplot2, fig.align='center'}
library(ggplot2)
library(dplyr)
library(forcats)
library(stringr)
df %>% 
  mutate(char = fct_infreq(char)) %>% 
  filter(char != "Other") %>%
  ggplot(aes(x = char, fill = movie)) +
  geom_bar() +
  theme(
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45),
    text = element_text(size = 16)
  ) +
  ylab(NULL) +
  xlab(NULL) +
  guides(fill = guide_legend(nrow = 2))
```

---
# Quantos tópicos?
.small[
Antes precisamos converter a `dtm` do `{quanteda}` para o `{stm}`:
]
```{r dtm-convert, echo=TRUE, eval=FALSE}
dtm_stm <- convert(dfm_mat, to = "stm")
```

```{r dtm-convert-real}
dtm_stm <- readRDS("processed-data/dtm_stm.rds")
how_many_k <- readRDS("processed-data/how_many_k.rds")
```

```{r how_many_k, out.width='50%', fig.align='center'}
library(tidyr)
how_many_k$results %>% 
  unnest() %>% 
  ggplot(aes(K, semcoh, label = K)) +
  geom_line(color = "#377eb8") +
  geom_point() +
  geom_label() +
  labs(x = "Número de Tópicos",
       y = "Coerência Semântica",
       label = "") +
  scale_x_continuous(breaks = 3:10) +
  scale_y_continuous(labels = scales::number, breaks = scales::breaks_extended(10)) +
  theme(text = element_text(size = 18))
```

---
# Topic Modeling

```{r tm, echo=TRUE, eval=FALSE}
topic_model <- stm(dtm_stm$documents,
                   vocab = dtm_stm$vocab,
                   data = dtm_stm$meta,
                   K = 3, #<<
                   prevalence =~ movie + char, #<<
                   seed = 123) #<<
```

.large[
```{r tm-real}
library(stm)
topic_model <- readRDS("processed-data/topic_model.rds")
labelTopics(topic_model, topics = 1:3, n = 6)
```
]

---
```{r plot-tm, fig.align='center'}
plot(topic_model, type = "hist")
```

---
# STM

.large[
```{r stm, echo=TRUE, eval=FALSE}
regression <- estimateEffect(
  1:3 ~ movie + char, #<<
  topic_model,
  meta = dtm_stm$meta)
summary(regression, topics = 1)
```
]

```{r stm-real}
regression <- readRDS("processed-data/regression.rds")
```

---
**Tópico 1**: king, lord, gondor, smeagol, rohan, citi, saruman 
```{r stm-1}
summary(regression, topics = 1)
```

---
**Tópico 2**: gandalf, sam, day, death, friend, war, tree 
```{r stm-2}
summary(regression, topics = 2)
```
---
**Tópico 3**: frodo, master, hobbit, time, dead, merri, aragorn 
```{r stm-3}
summary(regression, topics = 3)
```

---
class: inverse, middle

```{r closing-thought, out.width='100%'}
knitr::include_graphics("images/meme-final.jpg")
```

---
# Créditos!

Slides criado pelo pacote R [`xaringan`](https://github.com/yihui/xaringan).

Código Fonte dos Slides disponível no GitHub [storopoli/topic-modeling-workshop](https://github.com/storopoli/topic-modeling-workshop).

.pull-left[
```{r profile-pic, out.width='70%', fig.align='left'}
knitr::include_graphics("images/Profile Pic.png")
```

[![CC BY-SA 4.0][cc-by-sa-image]][cc-by-sa]
]

.pull-right[
[`r icons::fontawesome("globe")` storopoli.io](https://storopoli.io)

[`r icons::fontawesome("linkedin")` @storopoli](https://www.linkedin.com/in/storopoli/) 

[`r icons::fontawesome("twitter")` @JoseStoropoli](https://www.twitter.com/JoseStoropoli)

[`r icons::fontawesome("github")` @storopoli](http://github.com/storopoli)  

[`r icons::fontawesome("paper-plane")` josees@uni9.pro.br](mailto:josees@uni9.pro.br)

]

[cc-by-sa]: http://creativecommons.org/licenses/by-sa/4.0/
[cc-by-sa-image]: https://licensebuttons.net/l/by-sa/4.0/88x31.png
[cc-by-sa-shield]: https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg