2018 - Omni Analytics Innovative Technologies Initiative

Introduction To Web Scraping and Data Cleaning

There are many instances where we find a list of useful items or a table on a web page that can enhance our analysis or even form the data for our projects. Most often, copying and pasting from the web page does not work well and can take hours to complete. This is the situation we faced when trying to obtain a list of organizations for one of our clients’ projects.

We are working with Dave Grace, who is the Director of Christian education at a prominent church in Washington DC, in order to create a program to assess the energy efficiency of churches in DC, Maryland and Virginia. In the process of planning this program, one task was to get a list of churches in these areas from the National Capital Presbytery website. The site has the list of churches in the following format:

The churches are listed alphabetically on the website as links, with the address and phone number as text below each link. In this post we describe the steps for building a routine that scrapes such data from a website.

Calling libraries for web scraping

The first thing we do is load the packages required for web scraping in R. We use the rvest package for scraping and stringr to clean the data.

library(rvest)

library(stringr)

Accessing relevant data

We then define the URL containing the information we need, and the read_html() function reads the URL and returns the page as an XML document.

First we want to retrieve the church names from the page. To do this we need to pull out the tags (also called elements or nodes) containing the church names in the XML. The function html_nodes() helps us do this. We pass it a CSS selector for the element, which can be found using a tool called SelectorGadget. It can easily be installed as an extension for the Chrome browser. To use it, we go to the URL, click on the SelectorGadget icon and then click on the element that we want to pull out. The box at the bottom right of the browser shows the selector. Clicking the element again deselects it. We find that the selector for the church names is "p :nth-child(1)", and this is passed into the function as html_nodes("p :nth-child(1)").

Once the nodes are accessed using the html_nodes() function, the actual content is retrieved using html_text().

url <- "https://www.thepresbytery.org/about-us/churches/alphabetical-church-list"
church_names_temp <- url %>% 
  read_html() %>%
  html_nodes("p :nth-child(1)") %>%
  html_text()
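To see what html_nodes() and html_text() return without depending on the live site, the same pipeline can be tried on a small inline page. This is only a minimal sketch: the markup below is a hypothetical stand-in for the church list, not a copy of the real page, which may be structured differently.

```r
library(rvest)

# Hypothetical markup: each church is a paragraph whose first child is a link
sample_html <- '
<div class="church-info">
  <p><a href="#">Adelphi Presbyterian Church</a><br>
     9401 Riggs Road, Adelphi, MD 20783<br>Phone: 301-434-6337</p>
</div>
<div class="church-info">
  <p><a href="#">Aldie Presbyterian Church</a><br>
     32260 Meeting House Lane, Aldie, VA 22001<br>Phone: 703-327-3090</p>
</div>'

# read_html() also accepts a literal HTML string, so no network is needed
sample_page <- read_html(sample_html)

# Select every element inside a <p> that is the first child of its parent,
# then keep only the text content of those elements
sample_names <- sample_page %>%
  html_nodes("p :nth-child(1)") %>%
  html_text()

sample_names
```

In this stand-in page the link is the first child of each paragraph, so the selector picks up the link text and skips the address lines that follow it.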

Now we move on to scrape the contact information for each church. The contact paragraph is selected using SelectorGadget again, which gives us the selector ".church-info p".

church_info_temp <- url %>%
  read_html() %>%
  html_nodes(".church-info p") %>%
  html_text()
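Both pipelines above call read_html(url), so the page is downloaded twice. One possible alternative is to parse the page once and reuse the parsed document for each selector. The helper name extract_text() below is hypothetical, and the inline markup stands in for the real page so that the sketch runs offline.

```r
library(rvest)

# Hypothetical helper: return the text of every node matching a CSS selector
extract_text <- function(page, selector) {
  page %>%
    html_nodes(selector) %>%
    html_text()
}

# Stand-in document; with the live site this would be read_html(url)
demo_page <- read_html('
<div class="church-info">
  <p><a href="#">Ashburn Presbyterian</a> 20962 Ashburn Road</p>
</div>')

# The same parsed document serves both selectors
demo_names <- extract_text(demo_page, "p :nth-child(1)")
demo_info  <- extract_text(demo_page, ".church-info p")
```

Parsing once also guarantees that the names and the contact details come from the same snapshot of the page.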

Data Cleaning

This part is specific to the text your scraping retrieves. In our case the results contained blank entries and duplicated entries for each church, both of which need to be removed.

# Remove blanks
church_names <- church_names_temp[church_names_temp != ""]

# We notice all names are duplicated 
names_table <- table(church_names)

# A couple of Churches have the same name (seen as 4 duplicates) which we do not want to remove, save these
dont_remove <- names(names_table)[names_table == 4]

church_names <- church_names[!duplicated(church_names)]
church_names <- sort(c(church_names, dont_remove))
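The de-duplication logic above can be checked on a toy vector. The sketch below uses a slightly more general rule than the code above: assuming every church was scraped exactly twice, a name seen 2k times is kept k times. The input values are made up for illustration, using names that appear in the final dataset.

```r
# Toy input: every name scraped twice, plus blanks. Two distinct churches
# share the name "Grace Presbyterian", so that name appears four times.
toy_names <- c("Adelphi Presbyterian Church", "Adelphi Presbyterian Church", "",
               "Grace Presbyterian", "Grace Presbyterian",
               "Grace Presbyterian", "Grace Presbyterian", "",
               "Knox Presbyterian", "Knox Presbyterian")

# Remove blanks, then count how often each name occurs
toy_names <- toy_names[toy_names != ""]
toy_table <- table(toy_names)

# Keep half of the copies of each name (assumes each church was scraped twice)
toy_clean <- rep(names(toy_table), times = as.vector(toy_table) / 2)

toy_clean
```

Because table() orders its names alphabetically, the result is already sorted, and the shared name survives twice rather than being collapsed into one entry.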

Splitting the paragraphs of content

Regular expressions are useful for pulling the appropriate parts out of messy text containing escape sequences. In this case, we needed to clean up the text and extract the phone numbers, addresses, states and zip codes into separate columns.

# split on the newline characters
church_info_split <- strsplit(church_info_temp, "\n")

# some phone numbers had ".", some had "-" separators
phone_numbers <- sapply(church_info_split, function(x) {
  gsub("\\.", "-", gsub(".*Phone: ([0-9]+[.-][0-9]+[.-][0-9]+).*", "\\1", paste(x, collapse = " ")))
})

fax_numbers <- sapply(church_info_split, function(x) {
  gsub("\\.", "-", gsub(".*Fax:([0-9]+[.-][0-9]+[.-][0-9]+).*", "\\1", paste(x, collapse = " ")))
})
fax_numbers[nchar(fax_numbers) > 12] <- NA

addresses <- sapply(church_info_split, function(x) {
  gsub("  ", " ", str_trim(gsub("\t", "", gsub("(.*)Phone:.*", "\\1", paste(x, collapse = " ")))))
})

addresses_split <- strsplit(addresses, ", ")
state_zip <- sapply(addresses_split, function(x) { x[length(x)] })
state_zip_split <- strsplit(state_zip, " ")


final_zips <- sapply(state_zip_split, `[`, 2)
final_states <- sapply(state_zip_split, `[`, 1)

city <- sapply(addresses_split, function(x) { x[length(x) - 1] })

addresses_temp <- sapply(addresses_split, function(x) {
  return(str_trim(paste(x[1:(length(x) - 2)], collapse = " ")))
})
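When a record has no number after "Phone:", the nested gsub() call above finds no match and returns the whole input string unchanged, which is why a few rows of the full dataset further down show address text in the Phone column. A possible alternative, sketched here, is stringr's str_match(), which returns NA when the pattern does not match. The helper name extract_number() is hypothetical, and the sample strings are illustrative guesses at what the collapsed text looks like rather than text copied from the site.

```r
library(stringr)

# Hypothetical helper: extract the number following a label such as "Phone:"
# and normalise "." separators to "-"; returns NA when no number is present
extract_number <- function(text, label) {
  pattern <- paste0(label, "\\s*([0-9]+[.-][0-9]+[.-][0-9]+)")
  number <- str_match(text, pattern)[, 2]  # column 2 holds the capture group
  str_replace_all(number, fixed("."), "-")
}

# Illustrative records: dotted separators, a fax number, and a missing phone
sample_info <- c(
  "9401 Riggs Road, Adelphi, MD 20783 Phone: 301.434.6337",
  "P. O. Box 41810, Arlington, VA 22204 Phone: 703-920-5660 Fax:703-920-8474",
  "12946 James Monroe Hwy, Leesburg, VA 20176 Phone:"
)

sample_phones <- extract_number(sample_info, "Phone:")
sample_faxes  <- extract_number(sample_info, "Fax:")
```

With this approach the length check used for the fax numbers is no longer needed, since a missing number comes back as NA directly.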

Final Dataset!

The individual vectors of data are finally combined as columns of a data frame. This can be written to disk with write.csv().

final_df <- data.frame(
  Name = church_names,
  Address = addresses_temp,
  City = city,
  State = final_states,
  Zip = final_zips,
  Phone = phone_numbers,
  Fax = fax_numbers
)


head(final_df)

##                            Name                  Address         City
## 1   Adelphi Presbyterian Church          9401 Riggs Road      Adelphi
## 2     Aldie Presbyterian Church 32260 Meeting House Lane        Aldie
## 3 Arlington Presbyterian Church          P. O. Box 41810    Arlington
## 4          Ashburn Presbyterian       20962 Ashburn Road      Ashburn
## 5         Bealeton Presbyterian    6415 Schoolhouse Road     Bealeton
## 6           Berwyn Presbyterian      6301 Greenbelt Road College Park
##   State        Zip        Phone          Fax
## 1    MD      20783 301-434-6337         <NA>
## 2    VA      22001 703-327-3090         <NA>
## 3    VA      22204 703-920-5660 703-920-8474
## 4    VA      20147 703-729-2021 703-729-0051
## 5    VA 22712-0166 540-439-2375         <NA>
## 6    MD      20740 301-474-7573         <NA>
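The write.csv() step mentioned above might look like the following minimal sketch. It uses a two-row stand-in for final_df and a temporary file, and both the data frame name demo_df and the output location are assumptions made for illustration.

```r
# Two rows copied from the output above, as a stand-in for final_df
demo_df <- data.frame(
  Name = c("Adelphi Presbyterian Church", "Bealeton Presbyterian"),
  Zip = c("20783", "22712-0166"),
  stringsAsFactors = FALSE
)

# Hypothetical output location; replace with a real path such as "churches.csv"
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves the 1, 2, 3, ... index column out of the file
write.csv(demo_df, out_file, row.names = FALSE)

# Read the file back, keeping zip codes as text rather than numbers
demo_back <- read.csv(out_file, colClasses = "character")
```

Reading the zip codes back as character keeps values such as "22712-0166" intact and avoids dropping any leading zeros.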

Here is the full dataset for you to explore:

	Name	Address	City	State	Zip	Phone	Fax
1	Adelphi Presbyterian Church	9401 Riggs Road	Adelphi	MD	20783	301-434-6337	NA
2	Aldie Presbyterian Church	32260 Meeting House Lane	Aldie	VA	22001	703-327-3090	NA
3	Arlington Presbyterian Church	P. O. Box 41810	Arlington	VA	22204	703-920-5660	703-920-8474
4	Ashburn Presbyterian	20962 Ashburn Road	Ashburn	VA	20147	703-729-2021	703-729-0051
5	Bealeton Presbyterian	6415 Schoolhouse Road	Bealeton	VA	22712-0166	540-439-2375	NA
6	Berwyn Presbyterian	6301 Greenbelt Road	College Park	MD	20740	301-474-7573	NA
7	Bethesda Presbyterian	7611 Clarendon Road	Bethesda	MD	20814	301-986-1137	301-986-1230
8	Boyds Presbyterian	19901 White Ground Road	Boyds	MD	20841	301-540-2544	301-540-4975
9	Bradley Hills Presbyterian	6601 Bradley Boulevard	Bethesda	MD	20817	301-365-2850	301-365-6218
10	Brambleton Presbyterian	42395 Ryan Road Suite 112B #633	Brambleton	VA	20148-4858	703-542-8530	NA
11	Brazilian Bible Church	20701 Frederick Road	Germantown	MD	20876	301-802-1743	NA
12	Brentsville Presbyterian	12305 Bristow Road	Bristow	VA	20136	703-368-2546	NA
13	Burke Presbyterian	5690 Oak Leather Drive	Burke	VA	22015	703-764-0456	703-764-1853
14	Bush Hill Presbyterian	4916 Franconia Road	Alexandria	VA	22310	703-971-1171	703-971-9007
15	Calvary Presbyterian	6120 North Kings Highway	Alexandria	VA	22303	703-768-8510	703-768-7690
16	Capitol Hill Presbyterian Church	201 Fourth Street SE	Washington	DC	20003	202-547-8676	202-547-2182
17	Catoctin (The) Presbyterian Church	15565 High Street	Waterford	VA	20197	540-882-3058	540-882-4683
18	Centreville Presbyterian	15450 Lee Highway	Centreville	VA	20120	703-830-0098	703-830-8375
19	Chesterbrook Taiwanese Presbyterian	2036 Westmoreland Street	Falls Church	VA	22043	703-241-2433	NA
20	Chevy Chase Presbyterian	One Chevy Chase Circle NW	Washington	DC	20015	202-363-2202	202-537-2916
21	Christ Presbyterian	12410 Lee-Jackson Highway	Fairfax	VA	22033	703-278-8365	NA
22	Christ the King Presbyterian Church	6301 Greenbelt Road	Berwyn Heights	MD	20787	240-217-9960	NA
23	Christian Community Presbyterian	3120 Belair Drive	Bowie	MD	20715	301-262-6008	NA
24	Church of the Covenant	2666 Military Road	Arlington	VA	22207	703-524-4115	703-524-4248
25	Church of the Pilgrims	2201 P Street NW	Washington	DC	20037	202-387-6612	202-387-6614
26	Church of the Redeemer Presbyterian	1423 Girard Street NE	Washington	DC	20017	202-832-0095	NA
27	Clarendon Presbyterian	1305 North Jackson Street	Arlington	VA	22201	703-527-9513	703-524-4511
28	Clifton Presbyterian	12748 Richards Lane	Clifton	VA	20124	703-830-3175	703-830-6618
29	Colesville Presbyterian	12800 New Hampshire Avenue	Silver Spring	MD	20904	301-622-4555	301-625-3095
30	Community Presbyterian	1122 Oronoco Street	Alexandria	VA	22313	703-683-4164	NA
31	Covenant Presbyterian	12700 Black Forest Lane #204	Woodbridge	VA	22192	703-583-4090	NA
32	Darnestown Presbyterian	15120 Turkey Foot Road	Darnestown	MD	20878	301-948-9127	301-948-9135
33	Eastminster Presbyterian	5601 Randolph Street	Hyattsville	MD	20784	301-864-1149	NA
34	Ebenezer Presbyterian	14508 Telegraph Road	Woodbridge	VA	22182	703-492-7172	703-492-7174
35	Emmanuel Indonesian Presbyterian Church	215 Montgomery Avenue	Rockville	MD	20850	301-500-4018	NA
36	Ewe Church of America	1700 Spencerville Road	Spencerville	MD	20914	240-669-9286	NA
37	Fairfax Presbyterian	10723 Main Street	Fairfax	VA	22030	703-273-5300	703-591-4246
38	Fairlington Presbyterian	3846 King Street	Alexandria	VA	22302	703-931-7344	703-931-6062
39	Faith Presbyterian	4161 South Capitol St SW	Washington	DC	20023	202-562-2035	NA
40	Falls Church Presbyterian	225 East Broad Street	Falls Church	VA	22046	703-532-6518	703-532-6594
41	Fifteenth Street Presbyterian	1701 15th Street NW	Washington	DC	20009	202-234-0300	NA
42	First Korean Presbyterian	7610 Newcastle Drive	Annandale	VA	22003	703-354-9223	NA
43	First Presbyterian	7610 Newcastle Drive	Annandale	VA	22003	703-941-3300	703-941-0845
44	First Presbyterian	601 North Vermont Street	Arlington	VA	22203	703-527-4766	703-527-2262
45	First United of Dale City	14391 Minnieville Road	Woodbridge	VA	22193	703-670-7834	703-670-7834
46	Furance Mountain Presbyterian Church	12946 James Monroe Hwy	Leesburg	VA	20176	12946 James Monroe Hwy, Leesburg, VA 20176 Phone:	NA
47	Gaithersburg Presbyterian	610 South Frederick Avenue	Gaithersburg	MD	20877	301-948-9418	301-869-3043
48	Garden Memorial Presbyterian	1720 Minnesota Avenue SE	Washington	DC	20020	202-678-0772	NA
49	Geneva Presbyterian	11931 Seven Locks Road	Rockville	MD	20854	301-424-4346	301-340-0265
50	Georgetown Presbyterian	3115 P Street NW	Washington	DC	20007	202-338-1644	202-338-4797
51	Good Samaritan Presbyterian	PO Box 925	Waldorf	MD	20604-0925	301-843-1335	301-645-4134
52	Grace Presbyterian	5924 Princess Garden Parkway	Lanham Seabrook	MD	20706	301-577-1092	301-577-7483
53	Grace Presbyterian	7434 Bath Street	Springfield	VA	22150	703-451-2900	703-451-3313
54	Greenwich Presbyterian	15305 Vint Hill Road	Nokesville	VA	20181	703-754-7933	703-753-3683
55	Heritage Presbyterian	8503 Fort Hunt Road	Alexandria	VA	22308	703-360-9546	703-360-7389
56	Hermon Presbyterian Church	7801 Persimmon Tree Lane	Bethesda	MD	20817	301-365-4454	NA
57	Hope Presbyterian	1100 Enterprise Road	Mitchellville	MD	20721	301-249-7774	301-249-9606
58	Idylwood Presbyterian	7617 Idylwood Road	Falls Church	VA	22043	703-573-3027	NA
59	Immanuel Presbyterian	1125 Savile Lane	McLean	VA	22101	703-356-3042	703-790-0756
60	Indo Pak Presbyterian	641 Dranesville Road	Herndon	VA	20170	703-787-0275	NA
61	Indonesian-American Presbyterian	3211 Paul Drive	Silver Spring	MD	20902	240-505-5446	NA
62	John Calvin Presbyterian	6531 Columbia Pike	Annandale	VA	22003	703-256-3644	703-941-3341
63	Kirkwood Presbyterian	8336 Carrleigh Parkway	Springfield	VA	22152	703-451-5320	703-451-1959
64	Knox Presbyterian	7416 Arlington Boulevard	Falls Church	VA	22042	703-560-5288	703-560-6603
65	Korean Presbyterian	800 Hurley Avenue	Rockville	MD	20850	301-838-0766	301-838-3060
66	Laurel Presbyterian	7610 Sandy Spring Road	Laurel	MD	20707	301-776-6665	301-776-6665
67	Leesburg Presbyterian	207 West Market Street	Leesburg	VA	20176	703-777-4163	703-777-4666
68	Lewinsville Presbyterian	1724 Chain Bridge Road	McLean	VA	22101	703-356-7200	703-356-7334
69	Litchfield Presbyterian	135 West Bowen Street	Remington	VA	22734	135 West Bowen Street, Remington, VA 22734 Phone:	NA
70	Little Falls Presbyterian	6025 Little Falls Road	Arlington	VA	22207	703-538-5230	703-538-6725
71	Manassas Presbyterian	8201 Ashton Avenue	Manassas	VA	20109	703-369-2058	703-330-8827
72	Mizo Presbyterian	610 South Frederick Avenue	Gaithersburg	MD	20877	610 South Frederick Avenue, Gaithersburg, MD 20877 Phone:	NA
73	Mount Vernon Presbyterian	2001 Sherwood Hall Lane	Alexandria	VA	22306	703-765-6118	NA
74	National Presbyterian	4101 Nebraska Avenue NW	Washington	DC	20016	202-537-0800	202-686-0031
75	Neelsville Presbyterian	20701 Frederick Road	Germantown	MD	20876	301-972-3916	301-972-5563
76	New Hope Presbyterian	17930 Bowie Mill Road	Olney	MD	20855	301-987-8989	301-987-9010
77	New York Avenue Presbyterian	1313 New York Avenue NW	Washington	DC	20005	202-393-3700	202-393-3705
78	Northeastern Presbyterian	2112 Varnum Street NE	Washington	DC	20018	202-526-1730	202-526-5900
79	Northern Virginia Korean Presbyterian	4211 Evergreen Lane	Annandale	VA	22003	703-941-3338	NA
80	Northminster Presbyterian	7720 Alaska Avenue NW	Washington	DC	20012	202-829-5311	NA
81	Northwood Presbyterian	1200 University Blvd West	Silver Spring	MD	20902	301-593-1180	301-649-1155
82	Oaklands Presbyterian	14301 Laurel Bowie Road	Laurel	MD	20708	301-776-5833	NA
83	Old Presbyterian Meeting House	323 South Fairfax Street	Alexandria	VA	22314	703-549-6670	703-549-9425
84	Patuxent Presbyterian	23421 Kingston Creek Road	California	MD	20619	301-863-2033	301-863-8004
85	Poolesville Presbyterian	17800 Elgin Road	Poolesville	MD	20837	301-972-7452	NA
86	Potomac Presbyterian	10301 River Road	Potomac	MD	20854	301-299-6007	301-299-9438
87	Prince Georges Community Church	10111 Martin Luther King Jr. Highway Suite 200A	Bowie	MD	20720	301-218-4802	NA
88	Providence Presbyterian	9019 Little River Turnpike	Fairfax	VA	22031	703-978-3934	703-978-4306
89	Riverdale Presbyterian	6513 Queens Chapel Road	University Park	MD	20782	301-927-0477	301-699-2156
90	Riverside Presbyterian	20 Pidgeon Hill Drive Suite 109	Sterling	VA	20165	703-444-3528	703-444-8660
91	Rock (The) Presbyterian Church	800 Hurley Ave	Rockville	MD	20850	301-838-0766	NA
92	Rockville Presbyterian	215 W. Montgomery Avenue	Rockville	MD	20850	301-762-3363	301-762-5823
93	Rockville United Church	355 Linthicum Street	Rockville	MD	20851	301-424-6733	301-738-7695
94	Saint Mark Presbyterian	10701 Old Georgetown Road	Rockville	MD	20852	301-530-0600	301-530-2613
95	Sargent Memorial Presbyterian	5109 N.H. Burroughs Ave NE	Washington	DC	20019	202-396-1710	202-396-0708
96	Silver Spring Presbyterian	580 University Blvd. East	Silver Spring	MD	20901	301-439-4646	301-439-4647
97	Sixth Presbyterian	5413 16th Street NW	Washington	DC	20011	202-723-5377	202-723-8416
98	Southminster Presbyterian Church	7801 Livingston Road	Oxon Hill	MD	20745	301-567-1510	NA
99	St. Andrew Presbyterian	711 West Main Street	Purcellville	VA	20132	540-338-4332	540-338-4333
100	St. Matthew Presbyterian	4001 Bel Pre Road	Silver Spring	MD	20906	301-598-4400	301-598-4401
101	Taiwanese Presbyterian	7410 Needwood Road	Derwood	MD	20855	301-942-1133	NA
102	Takoma Park Presbyterian	310 Tulip Avenue	Takoma Park	MD	20912	301-270-5550	301-270-8405
103	Trinity Presbyterian	651 Dranesville Road	Herndon	VA	20170	703-437-5500	703-437-4861
104	Trinity Presbyterian Church	5533 North 16th Street	Arlington	VA	22205	703-536-5600	703-536-2815
105	United Christian Parish of Reston	11508 North Shore Drive	Reston	VA	20190	703-620-3065	703-707-0622
106	United Korean Presbyterian	7009 Wilson Lane	Bethesda	MD	20817	301-229-0000	301-229-0200
107	United Parish of Bowie	PO Box 1571	Bowie	MD	20717-0171	301-249-6411	301-249-6411
108	Unity Presbyterian	4401 Brinkley Road	Temple Hills	MD	20748	301-449-7686	NA
109	Universal Evangelical Church	1523 Forest Glen Road	Silver Spring	MD	20910	301-593-0861	NA
110	Vienna Presbyterian	124 Park Street NE	Vienna	VA	22180	703-938-9050	703-938-8264
111	Warner Memorial Presbyterian	10123 Connecticut Avenue	Kensington	MD	20895	301-949-2900	301-933-7704
112	Western Presbyterian	2401 Virginia Ave. NW	Washington	DC	20037	202-835-8383	202-835-8376
113	Westminster Presbyterian	400 Eye Street SW	Washington	DC	20024	202-484-7700	202-484-8544
114	Westminster Presbyterian Church	2701 Cameron Mills Road	Alexandria	VA	22302	703-549-4766	703-548-1505
115	Wheaton Community Church	3211 Paul Drive	Silver Spring	MD	20902	301-949-2742	NA

August 29, 2018
