In our previous post, we saw how 1- and 2-grams from a query mission can be matched to donor mission statements to find an appropriate donor organization. We performed some text cleaning and used tf-idf as a metric for weighting the more important words. The final results showed some issues with this method – like the word ‘heart’ in “Israel at Heart” being matched to “heart conditions”. Given that most words carry a context with their usage, and that a word’s meaning changes according to that context, how do we still match missions by matching words?

What if, instead of computing simple word-matching statistics, we could encode a lot more information about each word? Rather than a few numerical values that we compute, we store each word as an n-dimensional vector encoding various properties of its meaning and usage. Word embeddings are a way of representing a word along with the meaning and context it is found in. In this representation a word becomes a vector – a series of numbers – that captures the word’s meaning. Now if one word vector ‘matches’ another word vector, we can say that the two words match in meaning and context.

In this phase we will use prediction-based word embeddings built with word2vec, and to find ‘matches’ between word vectors we will use a measure called cosine similarity. Let’s see how this works through our Donor Matching service:
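
Before we dive in, here is the intuition behind cosine similarity as a ‘match’ measure: it looks at the angle between two vectors, so words pointing in a similar direction score close to 1 and unrelated words score much lower. A minimal sketch in base R, using made-up three-dimensional vectors rather than the embeddings we train below:

# toy word vectors (hypothetical numbers, purely for illustration)
heart   <- c(0.9, 0.1, 0.3)
cardiac <- c(0.8, 0.2, 0.4)
israel  <- c(0.1, 0.9, 0.2)

# cosine similarity = dot product divided by the product of the vector lengths
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine_sim(heart, cardiac)  # ~0.98, similar direction
cosine_sim(heart, israel)   # ~0.27, quite different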

# Loading Libraries
library(tidyverse)
library(tidytext)
library(wordVectors)
library(SnowballC)
# read the reference missions
guidestar <- read_csv("data/guidestar_full.csv")

# tokenizing and removing stopwords
guidestar_words <- guidestar %>%
    mutate(Mission = paste(Organization, Mission)) %>%
    unnest_tokens(word, Mission, drop = FALSE) %>%
    anti_join(stop_words) %>%
    mutate(word = wordStem(word, language = "english")) 
# get in text format for training
guidestar_clean <- guidestar_words %>%
    select(EIN, Organization, word) %>%
    group_by(Organization) %>%
    summarise(Mission = paste(unique(word), collapse = " ")) 

full_text <- tolower(paste(guidestar_clean$Mission, collapse = "\n"))

writeLines(full_text, "training_temp.txt")

tmp_file_txt <- "training_temp.txt"
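
One note on the cleaning step above: wordStem collapses inflected forms of a word onto a common stem, which is why tokens like ‘diabet’ (from ‘diabetic’) show up in the plots below rather than the full words. A quick check, using the SnowballC function loaded above:

# stemming maps related word forms to a single token
wordStem(c("diabetic", "diseases", "treatments"), language = "english")
# "diabetic" becomes "diabet", the token we query against the model later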

We create the model with largely default parameters. The model learns to predict the probability that a word appears alongside other words in the text, and the weights it learns in the process become our word representation (or word embedding). To check whether the model has learnt which words are similar and which are different, we can plot these word vectors using a dimension-reduction technique, projecting them into two dimensions to see which words end up close to each other.

# prepare the training text (bundles common bigrams into single tokens)
prep_word2vec(tmp_file_txt, destination = "training_complete.txt", 
              lowercase = TRUE, bundle_ngrams = 2)
word2vec_model1 <- train_word2vec("training_complete.txt", "word2vec_vectors1.bin", 
                                vectors = 200, min_count = 2, threads = 4,
                                 window = 12, iter = 5,
                                 negative_samples = 0,
                                 force = TRUE)

# find the 10 words closest to the stemmed token "diabet" in model 1
close_words <- closest_to(word2vec_model1,
                          word2vec_model1[[c("diabet")]],
                          n = 10)
# pull out their vectors and plot them in two dimensions
close_words_vec <- word2vec_model1[[close_words$word, average = F]]
plot(close_words_vec, method = "pca")
Words close to “diabetic” (stemmed to “diabet”)
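
For reference, the two-dimensional plot above is just a dimension reduction of the word vectors. A rough equivalent using base R’s prcomp (an approximation of the package’s PCA plotting, not its exact internals):

# project the word vectors onto their first two principal components
pca_fit <- prcomp(as.matrix(close_words_vec))
coords  <- pca_fit$x[, 1:2]
plot(coords, type = "n", xlab = "PC1", ylab = "PC2")
text(coords, labels = rownames(coords))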

In the plot we see that some of the words coming up near ‘diabet’ are ‘mitig’ (mitigate) and ‘predat’ (predator), which don’t make a lot of sense. The model does get some words right, like ‘afflict’, but without enough mission text it isn’t perfect. Word2vec works better when we have a lot of text data relevant to the domain. For now, we can try reducing the size of the embedding via the vectors parameter (so each word is represented by fewer dimensions) and increasing the number of passes through the dataset via the iter parameter.

Let us also try reducing the window size and see how that affects the results. In theory, a smaller window tends to give us more closely related words.

word2vec_model2 <- train_word2vec("training_complete.txt", "word2vec_vectors2.bin", 
                                 vectors = 30, threads = 4,
                                 window = 7, min_count = 2, 
                                 negative_samples = 4, iter = 10,
                                 force = TRUE)
close_words <- closest_to(word2vec_model2,
                        word2vec_model2[[c("diabet")]],
                        n = 10)
close_words_vec <- word2vec_model2[[close_words$word, average=F]]
plot(close_words_vec, method="pca")

These words look much better! We see “obes” for obesity, “afflict”, and “eye”. So let’s choose model 2 and build the word representation for words in the query mission.

query <- "We aim to promote awareness of serious heart conditions and work to provide treatment for those with heart disease, high blood pressure, diabetes, and other cardiovascular-related diseases who are unable to afford it"

query_words <- tibble(Query = query) %>%
    mutate(Query = tolower(Query)) %>%
    unnest_tokens(word, Query) %>%
    anti_join(stop_words) %>%
    mutate(word = wordStem(word, language = "english"))
 
# vector representations for the query and references
mat1 <- word2vec_model2[[unique(query_words$word), average = FALSE]] 
mat2 <- word2vec_model2[[guidestar_words$word, average = FALSE]]

similarities <- cosineSimilarity(mat1, mat2)

# aggregate: for each reference word, sum its similarity to every query word
highest_matching_words <- colSums(similarities)

matching_df <- data.frame(word = names(highest_matching_words),
                          sim = as.numeric(highest_matching_words),
                          stringsAsFactors = FALSE)
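
To make the aggregation step concrete: cosineSimilarity returns a matrix with one row per query word and one column per reference word, and colSums collapses it into a single score per reference word. A toy sketch with made-up numbers:

# toy similarity matrix: rows = query words, columns = reference words
toy_sim <- matrix(c(0.8, 0.2,
                    0.6, 0.1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("heart", "diseas"), c("cardiac", "israel")))
colSums(toy_sim)  # returns cardiac = 1.4, israel = 0.3, one score per reference word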

# viewing results
res <- guidestar_words %>%
    select(EIN, Mission, word) %>%
    left_join(matching_df, by = "word") %>%
    group_by(EIN) %>%
    summarise(Mission = Mission[!is.na(Mission)][1],
              Score = mean(sim, na.rm = TRUE)) %>%
    arrange(desc(Score))

res %>%
    slice(1:5) %>%
    knitr::kable()

Our results are much better! We see that “Israel at Heart” is no longer a match, as the model realized this was a unique use of the word and not generalizable. Some of our earlier results from direct matching (blog here) were still good, like the Pulse3 Foundation, and this brings us to the next part of this series, where we will build our final app with a combination of the two techniques, along with another POC where we test out syntactic meaning using a technique called BERT!