Work in Progress: Word frequencies
This page is about some of my current work in progress on word frequencies. The visualizations are some experiments concerning the general types of visualizations that I might use. I have included here only the ones that have made the first cut. As I elaborate these static visualizations, I also keep in mind what kinds of interactions would be relevant.
EBB and RB word frequencies
One of the main things we’re interested in when we have a collection of letters is exploring what they are talking about. Word frequencies are a first step on the path in that exploration.
Of course, “small” words like the, a, etc. are common. But with natural language processing, we can identify nouns, pronouns, verbs, etc., and look at them separately. We can also group together the different forms of words (singulars and plurals, past and present tense, …) under their dictionary form (the lemma). That’s what I’ve done here.
As with many of the other examples in the demo videos, I am using the letters between Elizabeth Barrett (EBB, or just E) and Robert Browning (RB, or just R).
Comparing word rankings
The first type of visualization is a slope graph popularized by Edward Tufte. It lets us easily compare the rankings (by frequency) between EBB and RB.
Nouns
First up are nouns. I’m using just the top 20 nouns for each. Lines connect the same word (lemma) for E and R; where there is no line, that means that word does not occur in the other person’s top 20 nouns. For example, Ba (a nickname for EBB) is R’s top noun, but it is not one of EBB’s top 20, even though she signs many of her letters “Ba”. So looking for missing lines, we can see that there is far from complete overlap in EBB and RB’s top 20 nouns.
When we do have a line between EBB’s and RB’s sides, the slope of the line indicates how different their rankings are. For example Mr. is the most frequent noun for EBB, but only 12th for RB. On the other hand, they both have letter in second place.
Main Verbs
Main verbs show a slightly different pattern: here there is much more overlap between the top 20 verbs between them, as seen by the relatively few missing lines. However, their rankings do differ.
Modal Verbs
The modal verbs ( can, could, may, might, …) show yet another pattern. There aren’t very many of them, so it’s not surpising that there is complete overlap. More surprising is that their rankings are very similar, with 4 out the 10 having identical rankings (compared to 1 verb and no nouns with identical rankings).
Pronouns
Pronouns show a mixed pattern, with the top 8 pronouns being very similar, and the rest more different.
Pronouns across time
In addition to looking at the overall rankings of words, we can also look at how the frequency in letters varies over time. In particular, we might want to find the words that vary the most, or the words that vary the least, in terms of their frequency. To do this, we can use the statistical notion of variance to measure variabilty.
When we plot relative frequency vs variance for the pronouns as below, we see that I and you both have relatively high variance, in addition to being frequent. However, we can see that RB uses I much more variably than EBB does: his I is higher in the chart than hers. (We can also see that for both of them, I is more variable than you.)
Of course, we can also see the details of the variability in EBB and RB’s use of I and you over the course of their letters:
In the chart of variance above he, his, and your were the least variable. Here’s what they look like over time.
At first glance, it looks they are just as variable as I and you, but that’s because the scale of the vertical axis covers a narrower range, effectively magnifying the differences. However, if we use the same vertical range as we had for I and you, we see that the variability is in indeed much smaller, though of course it still exists.
In the next part of Work in Progess, I look at another aspect of figuring out what the letters are about: key words.
---
title: "Work in Progress: Word frequencies"
output: html_notebook
---

## Chris Culy
#### info@chrisculy.net 
#### [chrisculy.net](http://chrisculy.net)
#### [Back to the demo videos](../index.html)
##### Revised 27 August 2017

# Work in Progress: Word frequencies

This page is about some of my current work in progress on word frequencies. The visualizations are some experiments concerning the general types of visualizations that I might use. I have included here only the ones that have made the first cut. As I elaborate these *static* visualizations, I also keep in mind what kinds of interactions would be relevant.

# EBB and RB word frequencies

One of the main things we're interested in when we have a collection of letters is exploring what they are talking about. Word frequencies are a first step on the path in that exploration.

Of course, "small" words like _the_, _a_, etc. are common. But with natural language processing, we can identify nouns, pronouns, verbs, etc., and look at them separately. We can also group together the different forms of words (singulars and plurals, past and present tense, ...) under their dictionary form (the *lemma*). That's what I've done here.

As with many of the other examples in the demo videos, I am using the letters between Elizabeth Barrett (EBB, or just E) and Robert Browning (RB, or just R).


```{r setup, echo=FALSE, message=FALSE}

library(readr)
library(tidyverse)
library(reshape2)


info <- read_csv("~/Dropbox/python_text/gensim_topics/ebbrb_lemmas_pos.csv", 
    col_types = cols(date = col_date(format = "%Y-%m-%d"), 
        from = col_factor(levels = c("E", "R"))))

```

```{r echo=FALSE, message=FALSE}
nouns <- c("NN","NNS","NP","NPS")
verbs <- c('VBZ', 'VBP', 'VB', 'VBG', 'VBN', 'VBD')
modals <- c("MD")
pronouns <- c("PP","PP$")
adjectives <- c("JJ") #no comparatives, superlatives ?
adverbs <- c("RB") #no comparatives, superlatives ?



```

```{r echo=FALSE, message=FALSE}

filterLemmas <- function(infoF,removeLemmas=NULL, includeLemmas=NULL) {
   infoX <- infoF
   if (! is.null(removeLemmas)) {
      infoX <- filter(infoX, ! (lemma %in% removeLemmas))
    } else if (! is.null(includeLemmas)) {
      infoX <- filter(infoX, lemma %in% includeLemmas)
    } 
    return(infoX)
}

makeVars <- function(partsOfSpeech, topn, removeLemmas=NULL, includeLemmas=NULL) {
    infoX <- filter(info, pos %in% partsOfSpeech)
    
    XByDoc <- infoX %>% count(id,lemma)
    XCounts <- left_join(group_by(infoX,from),XByDoc)
    XByPerson <- infoX %>% count(from) %>% select(from=from,total=n)
    XCounts <- left_join(XCounts,XByPerson) %>%
      filterLemmas(removeLemmas=removeLemmas,includeLemmas=includeLemmas)


    XRanks <- infoX %>% 
      filterLemmas(removeLemmas=removeLemmas,includeLemmas=includeLemmas) %>%
      count(from,lemma, sort=TRUE) %>% 
      group_by(from) %>% 
      mutate(rank=row_number())
    
    XRanks <- left_join(XRanks, XByPerson) %>% mutate(tf=n/total)
    
    XRanksTopN <- filter(XRanks, rank<=topn)
    
    XCountsTopN <- filter(XCounts, lemma %in% unique(XRanksTopN$lemma))
    
    return( list("counts" = XCountsTopN, "ranks" = XRanksTopN) )
}

makeVariance <- function(partsOfSpeech, removeLemmas=NULL, includeLemmas=NULL) {
    infoX <- filter(info, pos %in% partsOfSpeech)
    
    XCounts <- infoX %>% count(id,from,lemma)
    XByPerson <- infoX %>% count(from) %>% select(from=from,total=n)
    XCounts <- left_join(XCounts,XByPerson) 
    
    XByLemma <- infoX %>% count(from,lemma)
    XRelFreq <- left_join(XByLemma,XByPerson) %>% mutate(rf=n/total)
    
    XVar <- XCounts %>% group_by(lemma,from) %>% summarize(var=var(n/total)) %>% filter(! is.na(var)) #NAs are hapax
    
    left_join(XVar, XRelFreq) %>% 
      filterLemmas(removeLemmas = removeLemmas, includeLemmas = includeLemmas)
}

ggSideBySide <- function(counts, title, maxY=NA) {
  counts %>% ggplot() + 
    geom_line(aes(date,n/total,color=from)) + 
    xlim(range(counts$date)) +
    ylim(0,maxY) +
    facet_wrap(lemma~from,ncol=2, scales="free_x", labeller = label_wrap_gen(multi_line=FALSE)) + 
    theme_classic() + 
    labs(title=title, y="Relative Frequency") +
    theme(plot.title = element_text(hjust = 0.5))
}

ggSlope <- function(ranks, title) {
  ranks %>% ggplot() + 
    geom_line(aes(from,rank,group=lemma, color=lemma), show.legend=FALSE) + 
    geom_text(data=filter(ranks,from=="E"),aes(from,rank,label=lemma,color=lemma), hjust="right", nudge_x=-0.05, show.legend = FALSE) + 
    geom_text(data=filter(ranks,from=="R"),aes(from,rank,label=lemma,color=lemma), hjust="left", nudge_x=0.05, show.legend = FALSE) + 
    scale_x_discrete(position = "top") +
    scale_y_continuous(trans = "reverse", breaks = unique(ranks$rank), sec.axis = dup_axis()) + 
    theme_classic() + 
    theme(axis.line=element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = 0.5)) + 
    labs(title=title, x="")
}
```

# Comparing word rankings

The first type of visualization is a _slope graph_ popularized by Edward Tufte. It lets us easily compare the rankings (by frequency) between EBB and RB.

## Nouns

First up are nouns. I'm using just the top 20 nouns for _each_. Lines connect the same word (lemma) for E and R; where there is no line, that means that word does not occur in the other person's top 20 nouns. For example, _Ba_ (a nickname for EBB) is R's top noun, but it is not one of EBB's top 20, even though she signs many of her letters "Ba". So looking for missing lines, we can see that there is far from complete overlap in EBB and RB's top 20 nouns.

When we _do_ have a line between EBB's and RB's sides, the slope of the line indicates how different their rankings are. For example _Mr._ is the most frequent noun for EBB, but only 12th for RB. On the other hand, they both have _letter_ in second place.


```{r, echo=FALSE, message=FALSE}
topn = 20
v <- makeVars(nouns,topn=topn, removeLemmas=c("—","–","&c"))
```

```{r echo=FALSE, message=FALSE, fig.height=6, fig.width=4 }
(ggSlope(v$ranks,paste("EBB & RB Top",topn,"nouns", sep=" ")))
```

## Main Verbs

Main verbs show a slightly different pattern: here there is much more overlap between the top 20 verbs between them, as seen by the relatively few missing lines. However, their rankings do differ.

```{r, echo=FALSE, message=FALSE}
topn = 20
v <- makeVars(verbs,topn=topn, removeLemmas=c("be","do","have")) #can't tell do,have apart from auxiliaries
```

```{r echo=FALSE, message=FALSE, fig.height=6, fig.width=4}
(ggSlope(v$ranks,paste("EBB & RB Top",topn,"main verbs", sep=" ")))
```


## Modal Verbs

The modal verbs ( _can_, _could_, _may_, _might_, ...) show yet another pattern. There aren't very many of them, so it's not surpising that there is complete overlap. More surprising is that their rankings are very similar, with 4 out the 10 having identical rankings (compared to 1 verb and no nouns with identical rankings).

```{r, echo=FALSE, message=FALSE}
topn = 10
v <- makeVars(modals,topn=topn, removeLemmas = c("ca","wo"))
```


```{r echo=FALSE, message=FALSE, fig.height=6, fig.width=4}
(ggSlope(v$ranks,paste("EBB & RB Top",topn,"Modal Verbs", sep=" ")))
```

## Pronouns

Pronouns show a mixed pattern, with the top 8 pronouns being very similar, and the rest more different.

```{r, echo=FALSE, message=FALSE}
topn = 20
v <- makeVars(pronouns,topn=topn)
```

```{r echo=FALSE, message=FALSE, fig.height=6, fig.width=4}
(ggSlope(v$ranks,paste("EBB & RB Top",topn,"Pronouns", sep=" ")))
```

# Pronouns across time

In addition to looking at the overall rankings of words, we can also look at how the frequency in letters varies over time. In particular, we might want to find the words that vary the most, or the words that vary the least, in terms of their frequency. To do this, we can use the statistical notion of _variance_ to measure variabilty. 

When we plot relative frequency vs variance for the pronouns as below, we see that _I_ and _you_ both have relatively high variance, in addition to being frequent. However, we can see that RB uses _I_ much more variably than EBB does: his _I_ is higher in the chart than hers. (We can also see that for both of them, _I_ is more variable than _you_.)

```{r echo=FALSE, message=FALSE}
pToUse <- c("I","you","it","me","my","your","he","his")
pVar <- makeVariance(pronouns, includeLemmas = pToUse)
(pVar %>%  ggplot() + geom_text(aes(rf,var,label=lemma,color=from)) + theme_classic() + labs(title="EBB & RB: variability of top pronouns", x="Overall Relative Frequency", y="Variance") + theme(plot.title = element_text(hjust = 0.5)))
```

Of course, we can also see the details of the variability in EBB and RB's use of _I_ and _you_ over the course of their letters:

```{r, echo=FALSE, message=FALSE}
topn = 20
v <- makeVars(pronouns,topn=topn)
```


```{r, fig.width = 8.5, fig.height = 6, echo=FALSE, message=FALSE}
pToUse = c("I","you")
(ggSideBySide(filter(v$counts, lemma %in% pToUse), "EBB & RB: I and you"))

```

In the chart of variance above _he_, _his_, and _your_ were the least variable. Here's what they look like over time. 

```{r, echo=FALSE, message=FALSE}
topn = 10
v <- makeVars(pronouns,topn=topn, includeLemmas=c("his","he","your"))
```


```{r, fig.width = 8.5, fig.height = 8, echo=FALSE, message=FALSE}
(ggSideBySide(v$counts, "EBB & RB: he, his, your"))

```

At first glance, it looks they are just as variable as _I_ and _you_, but that's because the _scale_ of the vertical axis covers a narrower range, effectively magnifying the differences. However, if we use the same vertical range as we had for _I_ and _you_, we see that the variability is in indeed much smaller, though of course it still exists.


```{r, fig.width = 8.5, fig.height = 8, echo=FALSE, message=FALSE}
(ggSideBySide(v$counts, "EBB & RB: he, his, your", maxY=0.004))

```

In the [next part](wip_tfidf.html) of Work in Progess, I look at another aspect of figuring out what the letters are about: key words. 

### [Back to the demos](../index.html)
