---
title: "Topic modeling  - very simple techniques"
output: html_notebook
---

The Czech Press Agency (ČTK) uses a set of labels to classify the topics of its
press releases. Let us list the words most typical of each label (i.e. topic) and
compare the topics by their most typical words, based on a publicly available
corpus of press releases.


# The data resource
A corpus of Czech press releases, labeled with categories. Each text comes in
five file formats. The category labels are embedded in the file names; a file can
have several labels at the same time. Downloaded from:

[https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2884](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2884)

A paper about this lexical resource is here:

[http://www.lrec-conf.org/proceedings/lrec2018/pdf/671.pdf](http://www.lrec-conf.org/proceedings/lrec2018/pdf/671.pdf).
See the paper for details.


```{r message = FALSE}
library(tidyverse)
library(tidytext)
library(magrittr)
```

We download the file. (The original URL contained curly brackets `{}`, which had
to be removed for the `download.file` function to work.)

```{r}
download.file(url = "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2884/czech_text_document_corpus_v20.tgz",
              destfile = "czechcorpus.tgz")
```

It is a gzipped tar archive (a tarball). We use the `untar` function to extract it. NB it is large! It contains almost 60,000 files, a folder with an additional 15,000 files, a README file, and a license agreement file. All the files are small text files, however.

```{r}
untar(tarfile = "czechcorpus.tgz")
```
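
If you just want to peek at the archive before (or instead of) extracting it, `untar` can also list the contents without unpacking. A minimal sketch; note that listing an archive with tens of thousands of entries can itself take a moment:

```{r}
# List the archive contents without extracting anything
untar(tarfile = "czechcorpus.tgz", list = TRUE) %>% head()
```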

NB: Do not click on the folder in the Files pane of RStudio. The folder takes a long
time to display and it may paralyze your RStudio session. If you are interested in
the details, read the README file (uncomment the code snippet below).

```{r}
#readLines("czech_text_document_corpus_v20/README")
```
# File names - topic labels
Now let's inspect the folder.
```{r}
list.files("czech_text_document_corpus_v20/", recursive = TRUE) %>% 
  length() %>% str_c("This is the total number of files, recursively: ", .) %>% 
  writeLines()
  
```

A sample of file names

```{r}
files_with_suffixes <- list.files("czech_text_document_corpus_v20/", 
                                  recursive = FALSE, 
                                  pattern = "\\..+")
sample(files_with_suffixes, 10)
```


```{r}
head(files_with_suffixes, 10)
```
We have to strip the format suffixes and deduplicate the filenames.

```{r}
files_with_suffixes %<>% str_remove(pattern = "\\..+") %>% unique()
files_with_suffixes %>% head(10)
```

The file ID is obviously a number. The part between the numeric ID and the suffix
holds the labels, and one text can have several of them.
How do we get a vector of labels and their counts?

Extract everything that is not a number, split the results on underscores, and
tabulate them to get the frequencies.

```{r}
labels <- files_with_suffixes %>% str_remove(pattern = "[0-9]*") %>%
  str_split(pattern = "_") %>% unlist() %>%
  table() %>% sort(decreasing = TRUE) %>%
  .[str_length(names(.)) > 1] # the last step removes the (most frequent) empty string
print(labels)
```

To interpret the labels, we would obviously have to look them up in the paper. 


# Co-occurrence of labels
Without a selection, we would end up comparing the n most
typical words across 60 labels. That is too much for eyeballing.
Let us select the potentially most interesting labels: those that do not
co-occur too much. We will get a plot showing their mutual co-occurrences.



```{r}
label_corpus <- files_with_suffixes %>% str_remove_all(pattern = "[0-9]") %>% 
  str_replace_all(pattern = "_", replacement = " ")
head(label_corpus)
all_labels <- tibble(document = files_with_suffixes, content = label_corpus)
head(all_labels)
```


```{r}
all_labels %<>% tidytext::unnest_tokens(output = content, input = content, format = "text")
head(all_labels)
```

Compute phi, the pairwise correlation coefficient for binary variables (here: is a label present on a document or not).
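
For the record, the standard definition (not anything specific to `widyr`): writing $n_{11}$ for the number of documents carrying both labels, $n_{10}$, $n_{01}$, $n_{00}$ for the other cells of the 2x2 contingency table, and $n_{1\cdot}$, $n_{0\cdot}$, $n_{\cdot 1}$, $n_{\cdot 0}$ for its margins,

$$
\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}{\sqrt{n_{1\cdot}\; n_{0\cdot}\; n_{\cdot 1}\; n_{\cdot 0}}}.
$$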

```{r}
library(widyr)
all_labels %<>% group_by(content) %>%
  widyr::pairwise_cor(item = content, feature = document) %>%
  arrange(desc(correlation)) %>%
  mutate(correlation = round(correlation, 2))
head(all_labels)
```

We need to deduplicate the rows: `pairwise_cor` returns each pair twice, once in either order.

```{r}
alphabetize <- function(x) {sort(x) %>% str_c(collapse = " ")}
all_labels %<>%  rowwise() %>% mutate(ab = alphabetize(x = c(item1,item2)))
head(all_labels)
```

```{r}
all_labels %<>% distinct(.keep_all = TRUE, ab) %>% select(-ab) 
all_labels %>% head()
```

Now the label pairs are deduplicated. 

```{r}
library(ggraph)
library(igraph)
set.seed(222)
all_labels %<>% ungroup()

bigram_graph <- all_labels %>%
  filter(correlation > 0.05) %>% # if the threshold is too low, the plot gets illegible;
  # if too high, labels go missing. Check later with as_data_frame(what = "vertices")
  graph_from_data_frame(directed = FALSE)

bigram_graph
  
```
```{r}
as_data_frame(bigram_graph, what = "vertices") %>% nrow()
```

The value 60 means that all labels made it into the graph.
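
As a quick sanity check, the vertex count should match the number of distinct labels in the frequency table we built earlier:

```{r}
# Number of distinct labels tabulated earlier; should equal the vertex count above
length(labels)
```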

```{r}
bigram_graph %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation)) + 
  geom_node_point() +
    geom_node_text(aes(label = name), repel = TRUE) + 
  scale_edge_alpha_continuous(breaks = seq(0,1, 0.1)) +  
  theme_void() 
#ggsave("bigram_graph.png")  
```

Clusters: 
__bua, bur, bup, buk, sur, che, ene__: stocks; chemicals & pharma, energy, food industry

__pol, pod, for__: politics, domestic politics, parliaments and governments

__str, aut__: machinery, automobile industry

__slz, tur__: services, tourism

__fot, hok, spo__: football, hockey, sports reporting

__spl, mag, met, kul__: society, magazine mix, weather, culture

__den, reg, zah, cen, eko__: daily news, regional, abroad, - , ecology

__prm, tok, bsk__: light industry, textile, glass industry

__ptr, zdr__: food industry, health service

Not clustered:
__eko, ekl__: ecology, environment


# Vocabulary of the texts
We will be looking at 1) the most common and 2) the most typical words (lemmatized).

## Formats
Each text comes in five different formats. What are the format suffixes?
```{r}
# NB a new variable name, so that the deduplicated `files_with_suffixes` from above stays intact
format_files <- list.files("czech_text_document_corpus_v20/", recursive = FALSE,
                           pattern = "\\..+")
str_extract(format_files, pattern = "\\..+") %>% unique()
```
Let's inspect the formats. 

#### `txt`
```{r}
readLines("czech_text_document_corpus_v20/00001_zak_pol_pod.txt", warn = FALSE)
```
Just plain text.

#### `tok`

```{r}
readLines("czech_text_document_corpus_v20/00001_zak_pol_pod.tok", warn = FALSE)
```
Tokenized text: recognized tokens are separated by spaces. Look at the punctuation!

#### `pos`
Parts of speech
```{r}
readLines("czech_text_document_corpus_v20/00001_zak_pol_pod.pos", warn = FALSE)
```
... but only those!

#### `lemma`
```{r}
readLines("czech_text_document_corpus_v20/00001_zak_pol_pod.lemma", warn = FALSE)
```
all tokens are automatically lemmatized (i.e. in their dictionary form)

#### `conll`
A common format in NLP tasks. Explore it e.g. here: https://universaldependencies.org/format.html

```{r}
readLines("czech_text_document_corpus_v20/00001_zak_pol_pod.conll", n = 10)
```

The `.conll` file is a tab-separated file (hence `read_tsv`!) with comment rows
storing the paragraph and sentence IDs. It follows the CoNLL-U format: a tabular
format with one token on each row. The columns contain the token ID, the word
form, the lemma, the coarse part of speech (*universal part of speech*), the
national-tagset part of speech, more fine-grained morphosyntactic categories
(*universal features*), the token ID of the token's syntactic parent, and the
syntactic relation between the given token and its parent.
The tags come from the cross-lingual Universal Dependencies annotation scheme.

Let's have a closer look at a single row, e.g. the 5th line of the file:

```{r}
readLines("czech_text_document_corpus_v20/00001_zak_pol_pod.conll", n = 6) %>%
  "[["(5) %>% writeLines(sep = "\t", )
```
The token with ID `1` has the form *Schwarzenberg*,
lemmatized as *Schwarzenberg*. It is a *proper noun* (`PROPN`), expressed in the
traditional Czech tagset as `NNMS1-----A----`. It is also *animate*, *masculine*,
in the *nominative case*, *a surname*, in the *singular*, and not negated
(*polarity positive*). Syntactically, it depends on the token with ID *3*
and is its *nominal subject* (`nsubj`). There are two more columns, unused,
just filled with underscores. The `\t` are tab separators.
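
To work with these columns programmatically, we can read a single file into a tibble. The column names below are just our own labels for the ten CoNLL fields (the same names we will use later):

```{r message=FALSE}
# Read one CoNLL file; the comment rows (paragraph and sentence ids) are skipped
one_conll <- read_tsv("czech_text_document_corpus_v20/00001_zak_pol_pod.conll",
                      comment = "#",
                      col_names = c("id", "form", "lemma", "upos", "xpos",
                                    "ufeat", "parent_id", "rel", "misc1", "misc2"))
head(one_conll)
```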

## Which format to use?
We compare vocabularies, so we need the lemmas. Parts of speech could be nice, too:

1) we can reduce the data size by deleting stop words (mostly function words),
2) we can also look at the most typical words for a topic within a given part
of speech (e.g. *the 20 adjectives most typical of politics*).

Let's use the `.conll` formatted files then!


## Data frame of token frequencies in texts
We want a data frame with one token on each row. We will only consider
content words. These are: proper nouns (PROPN), nouns (NOUN), adjectives
(ADJ), adverbs (ADV), and verbs (VERB). We filter the rows accordingly.

We then keep just the `lemma` column, group the data frame by lemma, and count
the frequency (column `n`) of each lemma across all texts carrying a given label.

We read in each file and add a `label` column that contains the name of the label
that the given texts represent.
 
A sandbox first: we prototype the pipeline on a sample of file names and a single label.
```{r}
sample_filenames <- sample(files_with_suffixes, 200)
```
 
```{r}
names(labels)
```

```{r} 
pol <- sample_filenames[str_which(sample_filenames, pattern = "pol")]
pol <- str_c("czech_text_document_corpus_v20/", pol, ".conll")
head(pol)
```


```{r message=FALSE, warning = FALSE}
# NB the paragraph and sentence ids are stored in the conll files as comment rows
pod_df <- purrr::map(pol, read_tsv, comment = "#",
                     col_names = c("id", "form", "lemma", "upos", "xpos", "ufeat",
                                   "parent_id", "rel", "misc1", "misc2"))
cat(class(pod_df),  # each tsv file has become an element of the pod_df list; each element is a tibble
    length(pod_df)) # this many tsv files were read in
```
 
```{r}
pod_df[[1]] %>% head()
```
 
We need to do two things: keep only the content words and count them.

Here we keep just nouns, proper nouns, adjectives, adverbs, and verbs.
```{r}
# NB the universal POS tag for adverbs is ADV
pod_df[[1]] %>% filter(upos %in% c("NOUN", "PROPN", "ADJ", "ADV", "VERB")) %>% head()
```
Here we delete the superfluous columns and summarize the table into a frequency count.

```{r}
pod_df[[1]] %>% 
  filter(upos %in% c("NOUN", "PROPN", "ADJ", "ADV", "VERB")) %>%
  select(lemma) %>%
  group_by(lemma) %>% 
  summarize(n = n()) %>% arrange(desc(n))
```

Now we filter content words and select just the lemma column for all elements of
the list at once.

```{r}
pod_df %<>% map(filter, upos %in% c("NOUN", "PROPN", "ADJ", "ADV", "VERB")) %>%
  map(select, lemma)
```

```{r}
pod_df[[1]] %>% head()
```
```{r}
pod_df[[2]] %>% head()
```
 
We row-bind all elements of the list into one single tibble:
 
```{r}
pod_df %<>% bind_rows()
```

```{r}
glimpse(pod_df)
```
 
```{r}
pod_df %<>% group_by(lemma) %>% count() %>% arrange(desc(n)) %>% mutate(label = 'pod')
```

```{r}
head(pod_df)
```

We write a function `lemma_count_labeller` that takes a label and does all this.

```{r}
lemma_count_labeller <- function(label, filenames) {
  if (exists("pb")) pb$tick()$print() # progress bar, if one has been set up (see below)
  label_files <- filenames[str_which(filenames, pattern = label)]
  label_files <- str_c("czech_text_document_corpus_v20/", label_files, ".conll")
  map(label_files, read_tsv, comment = "#",
      col_names = c("id", "form", "lemma", "upos", "xpos",
                    "ufeat", "parent_id", "rel", "misc1",
                    "misc2")) %>%
    map(filter, upos %in% c("NOUN", "PROPN", "ADJ", "ADV", "VERB")) %>%
    map(select, lemma) %>%
    bind_rows() %>%
    group_by(lemma) %>% count() %>% arrange(desc(n)) %>% mutate(label = label)
}
```

Try the function out on a single label. Mind that the function still eats just one label at a time.
```{r message=FALSE, warning=FALSE, error=FALSE}
wholecorpus <- lemma_count_labeller(label = names(labels)[46], filenames = files_with_suffixes) 
```

Now we will run the function on all label names at once, using the `map_df`
function, which row-binds the results of applying a function to each element of
a vector or list into a single data frame.
It takes approx. 10 minutes to run! If you are in a hurry, use the rds file
instead of running it in real time. The first approx. 10 labels go horrendously
slowly, because they occur in many file names; after that it goes very fast.

```{r warning=FALSE,error=FALSE, message=FALSE}
# pb <- progress_estimated(length(names(labels))) # NB deprecated in recent dplyr
# wholecorpus <- map_df(names(labels), lemma_count_labeller,
#                       filenames = files_with_suffixes)
# write_rds(wholecorpus, "wholecorpus.rds")
```

```{r}
#read_rds("wholecorpus.rds") #run if you have not run the preceding chunk
```

Now we know, for each word in the press corpus, how many times it occurs in
documents with a given label. From this, we can even compute how typical it is
of the given label compared to the other labels in the text collection.
The metric is called tf-idf (term frequency times inverse document frequency)
and is implemented in several packages. We take the one from `tidytext`. Read
more about this metric. The implementation in `tidytext` uses the
relative frequency of the word with the given label
(`absolute frequency of the word with the given label / frequency of all words
with the given label`).
The inverse document frequency of a word is implemented as
`log(total number of documents / number of documents containing the given word)`;
here, the labels play the role of documents.
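
In symbols (standard definitions, with labels standing in for documents): for a lemma $t$ and a label $\ell$ from the label set $L$, with $n_{t,\ell}$ the count of $t$ under $\ell$,

$$
\mathrm{tf}(t,\ell) = \frac{n_{t,\ell}}{\sum_{t'} n_{t',\ell}}, \qquad
\mathrm{idf}(t) = \ln\frac{|L|}{\lvert\{\ell \in L : n_{t,\ell} > 0\}\rvert}, \qquad
\text{tf-idf}(t,\ell) = \mathrm{tf}(t,\ell)\cdot\mathrm{idf}(t).
$$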


```{r}
wholecorpus2 <- wholecorpus %>% ungroup() %>% 
  bind_tf_idf(term = lemma, document = label, n = n)
```



```{r}
wholecorpus2 %>% filter(label %in% c("pol", "pod", "for"), lemma == "rok")
```


```{r}
head(wholecorpus2)
```

Now we easily filter the 50 most typical words for each label (50 is our ad-hoc
decision).

```{r}
typical_50 <- wholecorpus2 %>% group_by(label) %>% top_n(n = 50, wt = tf_idf) 
```
Remember this means that we got *at least* 50 words per label, since the `top_n`
function takes all ties of the `n = 50` top values of `wt = tf_idf`! Indeed,
the tibble has 3020 rows, slightly more than 50 words * 60 labels.
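
If you would rather have exactly 50 rows per label, a minimal sketch with dplyr's `slice_max` (available in dplyr >= 1.0) drops the ties:

```{r}
# Exactly 50 rows per label; with_ties = FALSE breaks ties arbitrarily
typical_50_exact <- wholecorpus2 %>% group_by(label) %>%
  slice_max(order_by = tf_idf, n = 50, with_ties = FALSE) %>%
  ungroup()
nrow(typical_50_exact) # = 50 * number of labels
```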


```{r}
library(ggwordcloud)
typical_50 %>% filter(label %in% c("pol", "pod", "for")) %>% 
  ggplot() + geom_text_wordcloud(aes(label = lemma, size = tf_idf)) +
  facet_wrap(~label)
```

```{r}
ggsave("politics50_ggwordcloud_plot.png")
```


The ecological labels `eko` and `ekl` show no correlation. Are the texts themselves
also different from each other?


```{r}
typical_50 %>% filter(label %in% c("eko", "ekl")) %>% 
  ggplot() + geom_text_wordcloud(aes(label = lemma, size = tf_idf)) +
  facet_wrap(~label)
```
```{r}
ggsave("ekolabels50_ggwordcloud.png")
```

How are arms (`zbr`), disasters (`kat`), and metallurgy (`hut`) connected to the
machinery (`str`) and automobile (`aut`) industries?
```{r}
typical_50 %>% filter(label %in% c("zbr", "kat", "str", "aut", "hut")) %>% 
  ggplot() + geom_text_wordcloud(aes(label = lemma, size = tf_idf)) +
  facet_wrap(~label)
```

```{r}
ggsave("machinery50_ggwordcloud.png")
```

