Using NLP to find the perfect donor match - 1 - Omni Analytics Innovative Technologies Initiative

One of the difficulties of starting a non-profit is finding donors and funding agencies. We’ve seen the challenges firsthand! With so many foundations, each with its own mission statement, how do we find the one most closely aligned with our goals?

As statisticians and data scientists, we decided to tackle this problem using data. We collected a set of private foundation mission statements from GuideStar and pre-processed the data into the following format:

library(tidyverse)
library(stringdist)
library(tidytext)

guidestar <- read_csv("guidestar_full.csv")

guidestar

# A tibble: 1,056 x 3
   EIN      Organization                  Mission                                      
   <chr>    <chr>                         <chr>                                        
 1 91-1663… Bill & Melinda Gates Foundat… The foundation conducts all operations and g…
 2 56-2618… BILL & MELINDA GATES FOUNDAT… Guided by the belief that every life has equ…
 3 23-7093… John D. and Catherine T. Mac… The John D. and Catherine T. MacArthur Found…
 4 91-0793… Casey Family Programs         Casey Family Programs is the nation's larges…
 5 65-0464… John S. and James L. Knight … Knight Foundation is a national foundation w…

For this post, we’ll try to match the following (fake) non-profit mission statement, our query, with possible donors.

We aim to promote awareness of serious heart conditions and work to provide treatment for those with heart disease, high blood pressure, diabetes, and other cardiovascular-related diseases who are unable to afford it.

query <- "We aim to promote awareness of serious heart conditions and work to provide treatment for those with heart disease, high blood pressure, diabetes, and other cardiovascular-related diseases who are unable to afford it."

We will now explore two methods for matching a non-profit mission statement to one of the mission statements from this dataset.

1. We look for exact matches between the words in the donor’s mission and the words in the recipient’s mission, using tidytext text manipulation routines.
2. We use a word2vec model, in which each word is represented by a vector that embeds some information about its meaning.

Let’s start with approach one. We’ll use tidytext to extract the words from the mission statements, then use an anti_join operation to remove stop words. These are very common words such as “the” and “of”, which carry little relevant information for statistical matching.

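guidestar_words <- guidestar %>%
     # Prepend the organization name so that its words can also be matched
     mutate(Mission = paste(Organization, Mission)) %>%
     # One row per word; drop = FALSE keeps the full Mission text in each row
     unnest_tokens(word, Mission, drop = FALSE) %>%
     # Remove stop words
     anti_join(get_stopwords(), by = "word") %>%
     # n = number of times the word occurs for that organization
     add_count(Organization, word, sort = TRUE) %>%
     bind_tf_idf(word, Organization, n)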

guidestar_words %>% select(Organization, word, n, tf_idf) %>% sample_n(5)

# A tibble: 5 x 4
  Organization                                   word           n tf_idf
  <chr>                                          <chr>      <int>  <dbl>
1 BELLWETHER FOUNDATION                          bellwether     2 0.228 
2 Center for Cultural Judaism Inc                can            1 0.0219
3 The Moore Foundation                           necessary      2 0.0186
4 Farm-to-Consumer Legal Defense Fund Foundation state          1 0.0439
5 JAYDEN LAMB MEMORIAL FOUNDATION                jayden         2 0.522

Note that we’ve also bound the tf–idf (Term Frequency-Inverse Document Frequency) value to each word. This measures how distinctive a word is for a particular organization: it grows with how often the word appears in that organization’s text and shrinks with the number of organizations that use the word at all.
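To make the tf–idf idea concrete, here is a minimal base R sketch that computes it by hand on a toy corpus. The three "documents" and the helper names (tf, idf, tf_idf) are made up for illustration, and the formula used is the common tf times ln(N / document frequency) definition, which is what we understand bind_tf_idf to compute.

```r
# Toy corpus: three hypothetical mission statements, already tokenized.
docs <- list(
  heart = c("heart", "disease", "research", "heart"),
  arts  = c("arts", "education", "research"),
  youth = c("youth", "education", "sports")
)

# Term frequency: share of a document's tokens that are the given word.
tf <- function(word, doc) sum(doc == word) / length(doc)

# Inverse document frequency: ln(number of docs / docs containing the word).
idf <- function(word, docs) {
  n_containing <- sum(vapply(docs, function(d) word %in% d, logical(1)))
  log(length(docs) / n_containing)
}

# tf-idf of a word within one named document of the corpus.
tf_idf <- function(word, doc_name, docs) {
  tf(word, docs[[doc_name]]) * idf(word, docs)
}

heart_score    <- tf_idf("heart", "heart", docs)     # word used by one document only
research_score <- tf_idf("research", "heart", docs)  # word shared by two documents
c(heart = heart_score, research = research_score)
```

In this toy corpus "heart" scores higher than "research" for the same document, because it is both more frequent there and rarer across the corpus.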

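We’ll also include 2-grams, using the built-in functionality of tidytext to tokenize them from each mission statement. Note that the 2-grams are built from each organization’s unique non-stop words, so a pair can span words that were not adjacent in the original text.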

guidestar_twograms <- guidestar_words %>%
    select(EIN, Organization, word) %>%
    group_by(Organization) %>%
    summarise(Mission = paste(unique(word), collapse = " ")) %>%
    unnest_tokens(twogram, Mission, token = "ngrams", n = 2)

guidestar_twograms %>% sample_n(5)

# A tibble: 5 x 2
  Organization                 twogram          
  <chr>                        <chr>            
1 DEANGELO WILLIAMS FOUNDATION focus children's 
2 CHOSEN AND LOVED             actively advocate
3 VOICE OF ELIJAH INC          12 nasb          
4 Kimbell Art Museum           contained follows
5 The Moore Foundation         heart conditions
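As a rough illustration of how these 2-grams come about, the base R sketch below rebuilds them by hand for a single sentence. The sentence and the tiny stop-word list are stand-ins invented for this example, and the whitespace tokenizer is simpler than what unnest_tokens does (it does not strip punctuation).

```r
# Hypothetical mission text and a small stand-in stop-word list.
mission <- "We work to provide treatment for those with heart disease"
stops <- c("we", "to", "for", "those", "with")

# Tokenize on whitespace, lower-case, drop stop words, keep unique words in order.
words <- tolower(strsplit(mission, "\\s+")[[1]])
kept <- unique(words[!words %in% stops])

# Pair each remaining word with its successor to form the 2-grams.
twograms <- paste(head(kept, -1), tail(kept, -1))
twograms
```

The result includes "heart disease", which appears in the sentence, but also "treatment heart", which does not: removing stop words first brings words together that were originally separated.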

We now perform the same routine on our query object from above, extracting words and 2-grams.

query_words <- tibble(Query = query) %>%
    unnest_tokens(word, Query) %>%
    # Use the same stop-word list as for the foundation missions
    anti_join(get_stopwords(), by = "word")

query_twograms <- query_words %>%
    summarise(Mission = paste(unique(word), collapse = " ")) %>%
    unnest_tokens(twogram, Mission, token = "ngrams", n = 2)

Now that we have a similar table for the query mission, we can compare this table to the ones for all the donor missions and create a score to find the best matching donor mission!
The score is calculated as:

 Score = MatchingWords + MatchingTwoGrams - LengthDiff + AverageTFIDF

where
MatchingWords is the number of exact word matches between the query and the foundation’s mission
MatchingTwoGrams is the number of exact 2-gram matches
LengthDiff is the absolute difference in character length between the two mission statements (penalizes foundations whose statement is much longer or much shorter than the query)
AverageTFIDF is the average tf–idf value of the matched words (rewards matches on words that are distinctive for the foundation; e.g., matching on a common word like “Empower” is less meaningful than matching on “Statistics”)

In the code below, each component is rescaled to the range [0, 1] before the terms are combined, which puts them on a comparable scale.
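The code that follows rescales every numeric column to the range [0, 1] before combining the terms. As a small sketch of that step, the base R example below applies the same min-max rescaling and score formula to made-up component values. The three organizations, their numbers, and the helper name rescale01 are hypothetical and not taken from the GuideStar data.

```r
# Hypothetical per-organization components, for illustration only.
toy <- data.frame(
  Organization     = c("A", "B", "C"),
  MatchingWords    = c(4, 1, 2),
  MatchingTwoGrams = c(2, 0, 1),
  LengthDiff       = c(10, 200, 50),
  AverageTFIDF     = c(0.05, 0.01, 0.03)
)

# Min-max rescaling: the smallest value maps to 0 and the largest to 1.
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

num_cols <- c("MatchingWords", "MatchingTwoGrams", "LengthDiff", "AverageTFIDF")
toy[num_cols] <- lapply(toy[num_cols], rescale01)

# Combine the rescaled components; LengthDiff enters as a penalty.
toy$Score <- with(toy, MatchingWords + MatchingTwoGrams - LengthDiff + AverageTFIDF)

# Organizations ordered from best to worst score.
toy[order(-toy$Score), c("Organization", "Score")]
```

Because each term lies in [0, 1] after rescaling, the score ranges from -1 (worst on every component) to 3 (best on every component).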

full_list <- query_twograms %>%
    inner_join(guidestar_twograms) %>%
    group_by(Organization, twogram) %>%
    summarise(MatchingTwoGrams = n()) %>%
    group_by(Organization) %>%
    summarise(MatchingTwoGrams = sum(MatchingTwoGrams),
              UniqueTwoGrams = length(unique(twogram))) %>%
    full_join(
        query_words %>%
            inner_join(guidestar_words) %>%
            group_by(Organization, word) %>%
            summarise(MatchingWords = n(),
                      tf_idf = tf_idf[1],
                      Mission = Mission[1]) %>%
            group_by(Organization) %>%
            summarise(MatchingWords = sum(MatchingWords),
                      UniqueWords = length(unique(word)),
                      Mission = Mission[1],
                      LengthDiff = abs(nchar(query) - nchar(Mission[1])),
                      AverageTFIDF = mean(tf_idf),
                      MaxTFIDF = max(tf_idf))
    )
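
The MatchingWords count above comes from an inner join followed by n(). A minimal base R sketch of that pattern, on hypothetical token tables, shows how repeated words are counted: merge() stands in here for inner_join(), and all names and values are invented for illustration.

```r
# Hypothetical token tables with one row per word occurrence.
query_tokens <- data.frame(
  word = c("heart", "disease", "heart"),
  stringsAsFactors = FALSE
)
donor_tokens <- data.frame(
  Organization = c("A", "A", "A", "B"),
  word = c("heart", "heart", "research", "disease"),
  stringsAsFactors = FALSE
)

# merge() with by = "word" acts like an inner join: every query row is
# paired with every donor row that has the same word.
matches <- merge(query_tokens, donor_tokens, by = "word")

# Joined rows per organization, analogous to MatchingWords.
matching_words <- table(matches$Organization)
matching_words
```

Organization A ends up with four matches rather than two, since each of the two "heart" rows in the query pairs with each of the two "heart" rows for A. Words that repeat on both sides therefore weigh more heavily in the count.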

full_ranks <- full_list %>%
    mutate_if(is.numeric, function(.) (. - min(., na.rm = TRUE)) / (max(., na.rm = TRUE) - min(., na.rm = TRUE))) %>%
    mutate(MatchingTwoGrams = ifelse(is.na(MatchingTwoGrams), 0, MatchingTwoGrams)) %>%
    mutate(UniqueTwoGrams = ifelse(is.na(UniqueTwoGrams), 0, UniqueTwoGrams)) %>%
    mutate(MatchingWords = ifelse(is.na(MatchingWords), 0, MatchingWords)) %>%
    mutate(UniqueWords = ifelse(is.na(UniqueWords), 0, UniqueWords)) %>%
    mutate(Score = MatchingWords + MatchingTwoGrams - LengthDiff + AverageTFIDF) %>%
    arrange(desc(Score)) %>%
    slice(1:5)

full_ranks %>%
    select(Organization, Mission) %>%
    as.data.frame

Using this ranking, our top 5 matching organizations are:

                    Organization
1 Rosenfeld Heart Foundation Inc
2      Blood Brothers Foundation
3              PULSE3 FOUNDATION
4                Israel at Heart
5         Robey Charitable Trust

And their respective missions:

1. To Support Research In, And The Disseminatin And Publication Of Knowledge Concerning The Causes, Treatment And Prevention Of Diseases Of The Heart And Circulation.

2. Blood Brothers is dedicated to transforming the lives of individuals living with spinal cord injuries by restoring physical freedom, mobility, and independence.

3. To create a community that is healthy and vibrant by championing causes related to the education, prevention and treatment of cardiovascular-related disease.

4. To Promote Better Understanding Of Israel

5. The overall objective of our Trust is to help improve the conditions of all people.

Thus we have seen how a simple matching of words, weighted by their uniqueness, can help us find donor organizations with similar missions. There are some obvious problems with this approach: the 2nd and 4th results, for instance, have little to do with heart disease and appear to have matched on the words “blood” and “heart” in the organizations’ names. We will see why this happens and how to overcome it using word2vec in the next part of this blog series. Stay tuned!
