Chris Culy

Work in Progress: Key Words

This page is a continuation of some of my current work in progess, in particular exploring what letters are about. The first part looked at word frequences, but simple word frequencies don’t tell the whole story. We used variance to get one idea of the variability of word frequencies. Another idea is that of key words, words which “clump” in certain letters. A typical measure of key words is the ratio of the relative frequency of the word in the letter to the number of documents it occurs in (called tfidf). The higher the ratio, the more special the word is for that letter.

EBB and RB key words

As with word frequencies, and with many of the other examples in the demo videos, I am using the letters between Elizabeth Barrett (EBB, or just E) and Robert Browning (RB, or just R). And as with word frequencies, we will be looking at nouns, verbs, etc individually, and we will be using lemmas (the dictionary forms).

Comparing key words

Nouns

First up are nouns as key words. We’ll use a slope graph to compare the key words for EBB and RB. As usual, I’m using just the top 20 nouns for each. Lines connect the same word (lemma) for E and R; where there is no line, that means that word does not occur in the other person’s top 20 noun key words. We can see that EBB and RB have few key words in common — just book and happiness.

We can also see a glitch in the automatic processing: &c (“etc”) is considered to be a noun, even though we probably wouldn’t consider it to be one.

We can also compare key words versus frequent words. One thing that is interesting is that Kenyon (a relative of EBB and a friend of RB) is a frequent word for EBB, but not a key word. This is because she mentions Mr. Kenyon in lots of letters, so he is not a special mention: key words are words that are concentrated in just some letters.

Finally, we can also look at the key words across time. Here the size of the dots indicates the TFIDF measure. Only book and happiness have dots for both EBB and RB, since those are the only two key words they have in common. One interesting word to note is Flush, EBB’s dog. Flush was kidnapped and RB helped recover him. This was important to EBB, but apparently not as important to RB (Flush actually bit RB at one point early on).

Verbs

When we look at verbs, we see that EBB and RB have no overlap at all in their key word verbs.

And looking at them over time … (Note R’s use of kiss)

TIFIDF for individual words

We can also look at the TFIDF scores for individual words, not just the top key words. Here we look at love, poem, and poetry, since EBB and RB were romantic poets. Since these are not key words, there is much more overlap in their usage. There is actually even more overlap, but I am not showing the letters with low TFIDF scores (anything less than 0.1).

---
title: "Work in Progress: Key Words"
output: html_notebook
---

## Chris Culy
#### info@chrisculy.net 
#### [chrisculy.net](http://chrisculy.net)
#### [Back to the demo videos](../index.html)
#### [Back to Work in Progress: Word Frequencies](wip.html)
##### Revised 27 August 2017

# Work in Progress: Key Words

This page is a continuation of some of my current work in progess, in particular exploring what letters are about. The [first part](wip.html) looked at word frequences, but simple word frequencies don't tell the whole story. We used _variance_ to get one idea of the variability of word frequencies. Another idea is that of _key words_, words which "clump" in certain letters. A typical measure of key words is the ratio of the relative frequency of the word in the letter to the number of documents it occurs in (called _tfidf_). The higher the ratio, the more special the word is for that letter.

# EBB and RB key words

As with word frequencies, and with many of the other examples in the demo videos, I am using the letters between Elizabeth Barrett (EBB, or just E) and Robert Browning (RB, or just R). And as with word frequencies, we will be looking at nouns, verbs, etc individually, and we will be using _lemmas_ (the dictionary forms).


```{r setup, echo=FALSE, message=FALSE}

library(readr)
library(tidyverse)
library(reshape2)


info <- read_csv("~/Dropbox/python_text/gensim_topics/ebbrb_lemmas_pos.csv", 
    col_types = cols(date = col_date(format = "%Y-%m-%d"), 
        from = col_factor(levels = c("E", "R"))))

#tfidfs

tfidfs_All_E <- read_csv("~/Dropbox/python_text/gensim_topics/tfidfs_All_E.csv", col_types = cols(from = col_factor(levels = c("E", "R"))))
tfidfs_All_R <- read_csv("~/Dropbox/python_text/gensim_topics/tfidfs_All_R.csv", col_types = cols(from = col_factor(levels = c("E", "R"))))
tfidfs_Ns_E <- read_csv("~/Dropbox/python_text/gensim_topics/tfidfs_Ns_E.csv", col_types = cols(from = col_factor(levels = c("E", "R"))))
tfidfs_Ns_R <- read_csv("~/Dropbox/python_text/gensim_topics/tfidfs_Ns_R.csv", col_types = cols(from = col_factor(levels = c("E", "R"))))
tfidfs_Vs_E <- read_csv("~/Dropbox/python_text/gensim_topics/tfidfs_Vs_E.csv", col_types = cols(from = col_factor(levels = c("E", "R"))))
tfidfs_Vs_R <- read_csv("~/Dropbox/python_text/gensim_topics/tfidfs_Vs_R.csv", col_types = cols(from = col_factor(levels = c("E", "R"))))
```

```{r echo=FALSE, message=FALSE}
nouns <- c("NN","NNS","NP","NPS")
verbs <- c('VBZ', 'VBP', 'VB', 'VBG', 'VBN', 'VBD')
modals <- c("MD")
pronouns <- c("PP","PP$")
adjectives <- c("JJ") #no comparatives, superlatives ?
adverbs <- c("RB") #no comparatives, superlatives ?
```

```{r echo=FALSE, message=FALSE}

filterLemmas <- function(infoF,removeLemmas=NULL, includeLemmas=NULL) {
   infoX <- infoF
   if (! is.null(removeLemmas)) {
      infoX <- filter(infoX, ! (lemma %in% removeLemmas))
    } else if (! is.null(includeLemmas)) {
      infoX <- filter(infoX, lemma %in% includeLemmas)
    } 
    return(infoX)
}

makeVars <- function(partsOfSpeech, topn, removeLemmas=NULL, includeLemmas=NULL) {
    infoX <- filter(info, pos %in% partsOfSpeech)
    
    XByDoc <- infoX %>% count(id,lemma)
    XCounts <- left_join(group_by(infoX,from),XByDoc)
    XByPerson <- infoX %>% count(from) %>% select(from=from,total=n)
    XCounts <- left_join(XCounts,XByPerson) %>%
      filterLemmas(removeLemmas=removeLemmas,includeLemmas=includeLemmas)


    XRanks <- infoX %>% 
      filterLemmas(removeLemmas=removeLemmas,includeLemmas=includeLemmas) %>%
      count(from,lemma, sort=TRUE) %>% 
      group_by(from) %>% 
      mutate(rank=row_number())
    
    XRanks <- left_join(XRanks, XByPerson) %>% mutate(tf=n/total)
    
    XRanksTopN <- filter(XRanks, rank<=topn)
    
    XCountsTopN <- filter(XCounts, lemma %in% unique(XRanksTopN$lemma))
    
    return( list("counts" = XCountsTopN, "ranks" = XRanksTopN) )
}


ggSideBySide <- function(counts, title, maxY=NA) {
  counts %>% ggplot() + 
    geom_line(aes(date,n/total,color=from)) + 
    xlim(range(counts$date)) +
    ylim(0,maxY) +
    facet_wrap(lemma~from,ncol=2, scales="free_x", labeller = label_wrap_gen(multi_line=FALSE)) + 
    theme_classic() + 
    labs(title=title, y="Relative Frequency") +
    theme(plot.title = element_text(hjust = 0.5))
}

ggSlope <- function(ranks, title) {
  ranks %>% ggplot() + 
    geom_line(aes(from,rank,group=lemma, color=lemma), show.legend=FALSE) + 
    geom_text(data=filter(ranks,from=="E"),aes(from,rank,label=lemma,color=lemma), hjust="right", nudge_x=-0.05, show.legend = FALSE) + 
    geom_text(data=filter(ranks,from=="R"),aes(from,rank,label=lemma,color=lemma), hjust="left", nudge_x=0.05, show.legend = FALSE) + 
    scale_x_discrete(position = "top") +
    scale_y_continuous(trans = "reverse", breaks = unique(ranks$rank), sec.axis = dup_axis()) + 
    theme_classic() + 
    theme(axis.line=element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = 0.5)) + 
    labs(title=title, x="")
}

#for comparing ranks of frequencies with ranks to tfidfs
#xRanks need lemma,from,rank
ggSlopeFT <- function(freqRanks, tfidfRanks, title) {
  zFR <- select(freqRanks, lemma,from,rank) %>% mutate(which="Frequency")
  zTR <- select(tfidfRanks, lemma,from,rank) %>% mutate(which="TFIDF")
  ranks <- bind_rows(zFR,zTR)
  
  ranks %>% ggplot() + 
    geom_line(aes(which,rank,group=lemma, color=lemma), show.legend=FALSE) + 
    geom_text(data=filter(ranks,which=="Frequency"),aes(which,rank,label=lemma,color=lemma), hjust="right", nudge_x=-0.05, show.legend = FALSE) + 
    geom_text(data=filter(ranks,which=="TFIDF"),aes(which,rank,label=lemma,color=lemma), hjust="left", nudge_x=0.05, show.legend = FALSE) + 
    scale_x_discrete(position = "top") +
    scale_y_continuous(trans = "reverse", breaks = unique(ranks$rank), sec.axis = dup_axis()) + 
    theme_classic() + 
    theme(axis.line=element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = 0.5)) + 
    labs(title=title, x="") + facet_wrap(~from,ncol=2)
}
```

```{r echo=FALSE, message=FALSE}
makeVariance <- function(partsOfSpeech, removeLemmas=NULL, includeLemmas=NULL) {
    infoX <- filter(info, pos %in% partsOfSpeech)
    
    XCounts <- infoX %>% count(id,from,lemma)
    XByPerson <- infoX %>% count(from) %>% select(from=from,total=n)
    XCounts <- left_join(XCounts,XByPerson) 
    
    XByLemma <- infoX %>% count(from,lemma)
    XRelFreq <- left_join(XByLemma,XByPerson) %>% mutate(rf=n/total)
    
    XVar <- XCounts %>% group_by(lemma,from) %>% summarize(var=var(n/total)) %>% filter(! is.na(var)) #NAs are hapax
    
    left_join(XVar, XRelFreq) %>% 
      filterLemmas(removeLemmas = removeLemmas, includeLemmas = includeLemmas)

}
```

```{r echo=FALSE, message=FALSE}
rank_tfidfs <- function(tfidfs,dlens,topn,doc_percent) {
  ndocs <- tfidfs %>% select(docid) %>% unique() %>% nrow()
  thresh <- doc_percent * ndocs
  
  rtfidfs <- tfidfs %>% arrange(docid,desc(tfidf)) %>% group_by(docid) %>% mutate(rank=row_number())
  rtfidfs <- left_join(rtfidfs,dlens, by=c("docid"="id")) %>% filter(rank<=keywds) %>% group_by(lemma) %>% summarise(ndocs=n()) %>% filter(ndocs >= thresh) %>% arrange(desc(ndocs))
  rtfidfs %>% mutate(from=unique(tfidfs$from), rank=row_number()) %>% filter(rank <= topn) 
}

top_tfidfs2 <- function(which,n,words_per_topic=50,doc_percent=0.02) {
    doc_lens <- info %>% group_by(id) %>% summarise(dlen=n(),keywds=round(dlen/words_per_topic))

    if (which == "All") {
      tfidfs_ranked_E <- rank_tfidfs(tfidfs_All_E,doc_lens,n,doc_percent)
      tfidfs_ranked_R <- rank_tfidfs(tfidfs_All_R,doc_lens,n,doc_percent)
    } else if (which == "N") {
      nopes <- c("�") #"&c"   #really should filter before final ranking
      tfidfs_ranked_E <- rank_tfidfs(tfidfs_Ns_E,doc_lens,n,doc_percent) %>% filter(!(lemma %in% nopes))
      tfidfs_ranked_R <- rank_tfidfs(tfidfs_Ns_R,doc_lens,n,doc_percent) %>% filter(!(lemma %in% nopes))
    } else if (which == "V") {
      tfidfs_ranked_E <- rank_tfidfs(tfidfs_Vs_E,doc_lens,n,doc_percent)
      tfidfs_ranked_R <- rank_tfidfs(tfidfs_Vs_R,doc_lens,n,doc_percent)
    }
    
    
    bind_rows(tfidfs_ranked_E, tfidfs_ranked_R)
}
```

```{r echo=FALSE, message=FALSE}

show_topKeyWds <- function(which,n,words_per_topic=50,doc_percent=0.02, threshold=0.01, title=" ") {
  top_keywords <- top_tfidfs2(which,n,words_per_topic=50,doc_percent=0.02)
  
  if (which == "All") {
      tfidfs_E = tfidfs_All_E
      tfidfs_R <- tfidfs_All_R
    } else if (which == "N") {
      tfidfs_E <- tfidfs_Ns_E
      tfidfs_R <- tfidfs_Ns_R
    } else if (which == "V") {
      tfidfs_E <- tfidfs_Vs_E
      tfidfs_R <- tfidfs_Vs_R
    }
  
  tfidfs <- bind_rows(tfidfs_E, tfidfs_R)
  
  combo <- semi_join(tfidfs,top_keywords) %>% filter(tfidf >= threshold)
  
  (ggScatterTFIDF(combo,threshold = threshold, title=title))
}

show_tfidfs <- function(lemmas,threshold=0.0,title="") {
  tfidfs_E <- tfidfs_All_E %>% filter(lemma %in% lemmas,tfidf>=threshold)
  tfidfs_R <- tfidfs_All_R %>% filter(lemma %in% lemmas,tfidf>=threshold)
  tfidfs <- bind_rows(tfidfs_E, tfidfs_R)
  ggScatterTFIDF(tfidfs,threshold=threshold,title=title)
}

ggScatterTFIDF <- function(tfidfs,threshold=0.0, title="") {
  tfidfs %>% 
    ggplot() + geom_point(aes(date,lemma,size=tfidf,color=from),alpha=0.3) + 
    scale_y_discrete(limits = rev(levels(factor(tfidfs$lemma)))) +
    #scale_y_discrete(limits = levels(tfidfs$lemma)) +
    theme_classic() + 
    theme(plot.title = element_text(hjust = 0.5)) + 
    labs(title=title, x="", y="",size="tfidf") 
}

```
# Comparing key words

## Nouns

First up are nouns as key words. We'll use a _slope graph_ to compare the key words for EBB and RB. As usual, I'm using just the top 20 nouns for _each_. Lines connect the same word (lemma) for E and R; where there is no line, that means that word does not occur in the other person's top 20 noun key words. We can see that EBB and RB have few key words in common &mdash; just _book_ and _happiness_.

We can also see a glitch in the automatic processing: _&amp;c_ ("etc") is considered to be a noun, even though we probably wouldn't consider it to be one.

```{r echo=FALSE, message=FALSE, fig.height=6, fig.width=4 }

topn <- 20
topNs <- top_tfidfs2("N",topn)
t <- paste("EBB & RB Top",topn,"key word nouns", sep=" ")
(ggSlope(topNs, t))

```

We can also compare key words versus frequent words. One thing that is interesting is that _Kenyon_ (a relative of EBB and a friend of RB) is a frequent word for EBB, but not a key word. This is because she mentions Mr. Kenyon in lots of letters, so he is not a special mention: key words are words that are concentrated in just some letters.

```{r echo=FALSE, message=FALSE, fig.height=8, fig.width=10 }

topn <- 20
topNs_key <- top_tfidfs2("N",topn)
topNs_freq <- makeVars(nouns,topn)$rank
t = paste("EBB & RB Top",topn,"nouns: frequency vs key word (tfidf score)", sep=" ")
(ggSlopeFT(topNs_freq,topNs_key, t))

```

Finally, we can also look at the key words across time. Here the size of the dots indicates the TFIDF measure. Only _book_ and _happiness_ have dots for both EBB and RB, since those are the only two key words they have in common. One interesting word to note is _Flush_, EBB's dog. Flush was kidnapped and RB helped recover him. This was important to EBB, but apparently not as important to RB (Flush actually bit RB at one point early on).


```{r echo=FALSE, message=FALSE, fig.width=10, fig.height=10}
topn = 10
t = t = paste("EBB & RB Top",topn,"key word nouns", sep=" ")
show_topKeyWds("N",topn,words_per_topic=50,doc_percent=0.02, threshold=0.175, title=t)
```


## Verbs

When we look at verbs, we see that EBB and RB have no overlap at all in their key word verbs.

```{r echo=FALSE, message=FALSE, fig.height=6, fig.width=4 }

topn <- 20
topVs <- top_tfidfs2("V",topn)
t <- paste("EBB & RB Top",topn,"key word verbs", sep=" ")
(ggSlope(topVs, t))

```

And looking at them over time ... (Note R's use of _kiss_)

```{r echo=FALSE, message=FALSE, fig.width=10, fig.height=10}
n = 10
t <- paste("EBB & RB Top",topn,"key word verbs", sep=" ")
show_topKeyWds("V",n,words_per_topic=50,doc_percent=0.02, threshold=0.25, title=t)
```

## TIFIDF for individual words

We can also look at the TFIDF scores for individual words, not just the top key words. Here we look at _love_, _poem_, and _poetry_, since EBB and RB were romantic poets. Since these are not key words, there is much more overlap in their usage. There is actually even more overlap, but I am not showing the letters with low TFIDF scores (anything less than 0.1).

```{r echo=FALSE, message=FALSE, fig.width=10, fig.height=3}
lemmas <- c("love","poem","poetry")
show_tfidfs(lemmas,title="The Romantic Poets",threshold=0.1)
```

### [Back to Work in Progress: Word Frequencies](wip.html)
### [Back to the demos](../index.html)
