The goal of RefineGeo is to facilitate the comparison between geocoding services coordinates. It uses the methodology from the study of the Geocoding process from the Single Registry (Cadastro Único) of Campinas.
You can install the development version of RefineGeo from GitHub with:
# install.packages("devtools")
devtools::install_github("PedroTL/RefineGeo")
At present, the toolkit comprises five primary functions:
clean_address
: Cleanses address strings by removing
punctuation, accents, double white spaces, and converting all characters
to uppercase.extr_cep
: Extracts the CEP (equivalent to Zip Code)
from an address string, accommodating both 8 and 5-digit patterns.compare_cep
: Verifies CEP matches between two
strings.points_out
: Identifies coordinates located outside a
municipality’s polygon boundaries.get_best_coords
:
Consider a scenario where you have an address string that contains pronunciation variations, double white spaces, and other typical input address issues.
library(RefineGeo)
library(ggplot2)
library(sf)
library(geobr)
# Sample address
address <- c("Rua JosÉ DôIS - - 12345678", "R. DR, JÔSÉ DÓIS AV: 1 -- (12345)")
clean_address(address)
#> [1] "RUA JOSE DOIS 12345678" "R DR JOSE DOIS AV 1 12345"
A noteworthy aspect when analyzing addresses and Geocoding is the
CEP, known in Brazil as the Código de Endereçamento Postal
,
equivalent to the Zip Code. It encapsulates address information into a
standardized code.
The CEP consists of two parts. The first five digits represent:
The five-digit segment provides polygonal information, indicating a larger area that encompasses the full eight-digit CEP.
The final three suffixes following the “-” symbol identify individual streets, specific codes, and Correios units. For further details, refer to the official Correios website
CEP serves as a useful tool for evaluating the quality metrics of
various Geocoding services. Each service offers a quality metric for the
Geocoded address and output coordinate. For example, the
ArcGIS
service provides a score ranging from 0 to 100,
while Google
and Bing
categorize their metrics
into ROOFTOP
, Low
, Medium
, and
High.
When attempting to compare coordinates across different services, working with individual metrics might pose challenges. In such instances, CEP can establish a standardized method that remains consistent across services.
The fundamental premise here is: if the provided CEP in the input address matches the CEP in the output address provided by the Geocoding service, we can assume a high level of accuracy for that coordinate.
For example, given two strings—one representing the input address and
the other the Geocoding service’s output—extr_cep()
can be
utilized to isolate and extract the CEP patterns from these strings.
# Sample address
input_address <- c("RUA JOSE DOIS 12345678", "R DR JOSE DOIS AV 1 12345")
output_address <- c("RUA JOSE DOIS 12345678", "R DR JOSE DOIS AV 1 12345")
# Extracting CEP
extr_cep(input_address) # Input address CEP
#> [1] "12345678" NA
extr_cep(output_address, subsector_as_zip = TRUE) # Output address CEP
#> [1] "12345678" "12345"
The extr_cep()
function can explicitly consider 5-digit
(subsector) patterns as CEP if indicated by
subsector_as_zip = TRUE
. Otherwise, it will return NA for
5-digit patterns and focus solely on locating 8-digit CEPs.
Meanwhile, the compare_cep()
function operates within a
data frame structure and generates a column with binary values (1
indicating a match between two columns’ CEPs and 0 denoting no
match).
# Sample data frame
df <- data.frame(cep1 = c("12345", "12345678", "12345", "12345", "12345"),
cep2 = c("12345", "12345678", "54321", "123", "12345"))
# Comparing CEPs
compare_cep(df, "cep1", "cep2")
#> [1] NA 1 NA NA NA
compare_cep(df, "cep1", "cep2", strict_check = TRUE)
#> [1] 1 1 0 NA 1
The subsector_as_zip = TRUE
option functions effectively
when used alongside strict_check = TRUE
. With this setup,
if CEPs are extracted in both 5 and 8 digits, the
compare_cep()
function enables specification via
strict_check = TRUE
to include 5-digit CEPs in the equality
comparison.
Another valuable function is points_out()
. Typically,
when handling Geocoded addresses, you might want to visualize and work
with spatial features. This function operates at the municipality level.
Essentially, points_out()
determines whether a provided
coordinate (point) from a Geocoding service, for a given address, lies
inside or outside a municipality’s boundary (polygon).
For example, this function takes several latitude and longitude pairs
from the Campinas-SP municipality and verifies if they fall within or
outside the municipality’s boundary. The resulting column in the data
frame can be specified; here, I used out_camp
. Using an out
prefix
is considered a good practice for easy
interpretation. If the value is TRUE
, the coordinate indeed
lies outside the polygon.
Under the hood, the function utilizes
read_municipality()
from geobr
. Therefore, a
municipality code, along with the year and a
CRS projection
, must be provided.
# Sample data
df <- data.frame(lat = c(-22.71704, -22.71258, -22.84277, -22.73391, -22.77165),
lon = c(-46.91200, -46.90435, -47.07650, -47.00500, -46.98793))
# Getting the Campinas SHP.
shp_campinas <- read_municipality(code_muni = 3509502, year = 2020) %>%
st_transform(4326)
# Transforming Latitude and Longitude in a geometry column
df_sf_point <- st_as_sf(df,
coords = c("lon", "lat"),
na.fail = FALSE,
crs = st_crs(shp_campinas)) # Make the CRS of the points the same as the polygon
# Plot the polygon
ggplot() +
geom_sf(data = shp_campinas, fill = "blue", color = "black") +
# Add points from df_sf_point
geom_point(data = df, aes(x = lon, y = lat), color = "red", size = 3) +
# Customize plot labels and theme (if needed)
labs(x = "Longitude", y = "Latitude", title = "Polygon and Points Map") +
theme_minimal()
In the example, we notice a problem where some points are found outside Campinas-SP. To identify these outside points, this function proves to be quite useful.
# Finding if the coordinates are inside or not the polygon
df <- points_out(df, "lat", "lon", "out_camp", code_muni = 3509502, year = 2020, polygon_crs = 4326)
knitr::kable(head(df), align = "c")
lat | lon | out_camp |
---|---|---|
NA | NA | TRUE |
NA | NA | TRUE |
-22.84277 | -47.07650 | FALSE |
NA | NA | TRUE |
-22.77165 | -46.98793 | FALSE |
In this scenario, three points are found outside Campinas-SP. It’s worth noting that currently, the function marks outside points as NA while retaining the observation in the dataset. This feature could be beneficial when aiming for more accuracy in analysis.
Finally, we can utilize get_best_coords()
. This function
aims to determine the best coordinates among the three Geocoding
services. The optimal structure of a data frame for these functions is
as follows:
# Sample data frame
df <- data.frame(
input_addr = c("Rua X 12345-123", "Rua Y 54321-321", "Rua Y 99999", "Rua Z 11111-888", "Rua Z 22211-888", "Rua Z 11111-832", "Rua Z 11111-328"),
output_addr_1 = c("Rua X 12345-123", "Rua Y 52321-321", "Rua Y 39999", "Rua Z 11111-888", NA, NA, NA),
output_addr_2 = c("Rua X 12345-113", "Rua YY 54321-321", "Rua 99999", "Rua G 11111", NA, NA, NA),
output_addr_3 = c("Rua X 12345-123", "Rua Y 54321-321", "Rua YY 19999", "Rua K 11111", NA, NA, NA),
lat1 = c(-22.71704, -22.71258, -22.77704, -22.74704, NA, NA, NA),
lon1 = c(-46.91200, -46.90435, -46.97200, -46.91200, NA, NA, NA),
lat2 = c(-22.72704, -22.71268, -22.72304, NA, -22.72704, NA, NA),
lon2 = c(-46.93200, -46.90435, -46.99200, NA, -46.92435, NA, NA),
lat3 = c(-22.75704, -22.71258, NA, NA, NA, -22.77704, NA),
lon3 = c(-46.92430, -46.90435, NA, NA, NA, -46.92439, NA)
)
knitr::kable(head(df), align = "c")
input_addr | output_addr_1 | output_addr_2 | output_addr_3 | lat1 | lon1 | lat2 | lon2 | lat3 | lon3 |
---|---|---|---|---|---|---|---|---|---|
Rua X 12345-123 | Rua X 12345-123 | Rua X 12345-113 | Rua X 12345-123 | -22.71704 | -46.91200 | -22.72704 | -46.93200 | -22.75704 | -46.92430 |
Rua Y 54321-321 | Rua Y 52321-321 | Rua YY 54321-321 | Rua Y 54321-321 | -22.71258 | -46.90435 | -22.71268 | -46.90435 | -22.71258 | -46.90435 |
Rua Y 99999 | Rua Y 39999 | Rua 99999 | Rua YY 19999 | -22.77704 | -46.97200 | -22.72304 | -46.99200 | NA | NA |
Rua Z 11111-888 | Rua Z 11111-888 | Rua G 11111 | Rua K 11111 | -22.74704 | -46.91200 | NA | NA | NA | NA |
Rua Z 22211-888 | NA | NA | NA | NA | NA | -22.72704 | -46.92435 | NA | NA |
Rua Z 11111-832 | NA | NA | NA | NA | NA | NA | NA | -22.77704 | -46.92439 |
To obtain coordinates from three distinct services, the
tidygeocoder
package proves to be valuable. The
combine_geocode()
function accepts a list of arguments,
allowing users to input addresses and obtain coordinates for each
address from three services. It retrieves the output addresses found by
each service. Depending on the Geocoding service, an API key might be
required, and fees may apply. An illustrative example demonstrating
Geocoding of Campinas-SP addresses in Arcgis
,
Bing
, and Here
using
tidygeocoder::combine_geocode()
can be found Here. This
example showcases how to structure the sample data frame.
The function workflow is grounded in the previously discussed methodology and operates in two primary parts.
Firstly, it can consider columns representing three different sets of
latitude and longitude coordinates for each service. To facilitate this,
a dist prefix should be provided, such as dist
.
The goal is to calculate the distance in kilometers between each pair
of points across services. For instance, the resulting data frame would
contain three additional columns: dist_prefix_1_2
,
dist_prefix_1_3
, and dist_prefix_2_3.
Here,
1_2
denotes the distance between lat1
lon1
and lat2
lon2
.
The short_distance
parameter assesses the three new
distance columns dist_prefix_*_*
and selects the shortest
value. This parameter corresponds to the name of the column with the
shortest distance. For instance, if dist_prefix_1_2
has a
lower value compared to dist_prefix_1_3
and
dist_prefix_2_3
, the short_distance
for that
observation will be dist_prefix_1_2
. This step also
addresses scenarios where insufficient latitude and longitude
coordinates are available for a given address. This happens when only
one service found a point for an address, or no coordinates were found
by any service. In such cases, the short_distance
identifies the only available latitude and longitude. If none exist, it
assigns a standard text. For instance, if only Geocoding service 1
provided a coordinate, it will appear in the short_distance
observation as just lat1
and lon1
.
The short_distance
is needed to calculate the
mdc
. This process verifies if a service’s identified
coordinate aligns with those confirmed by other services. It looks for
validation based on the shortest distance. If three services provide
coordinates for an address, the two closest points each receive a
score point
. When only two services offer coordinates, both
receive a score point
, regardless of distance. No score
points are awarded if only one service provides coordinates.
The summarize_mdc
option tallies the score points
awarded in the mdc
process, grouped by Geocoding service.
This creates a quality ranking for each region. Literature suggests that
Geocoding services exhibit varying performance across regions. Higher
values indicate a service has a higher frequency of coordinate matches
with other services.
This ranking acts as a tiebreaker in the final selection of
coordinates for an address. When summarize_mdc = TRUE
, the
return is a list that includes the original data frame with appended
columns and a summarized data frame displaying the mdc
frequencies.
result <- get_best_coords(df, "lat1", "lon1", "lat2", "lon2", "lat3", "lon3", "dis",
short_distance = TRUE,
mdc = TRUE,
summarize_mdc = TRUE)
mdc_summary <- result$mdc_summary
knitr::kable(head(mdc_summary), align = "c")
API_dis_NAME1 | API_dis_NAME2 | API_dis_NAME3 |
---|---|---|
3 | 2 | 1 |
df <- result$original_data
knitr::kable(head(df), align = "c")
input_addr | output_addr_1 | output_addr_2 | output_addr_3 | lat1 | lon1 | lat2 | lon2 | lat3 | lon3 | dis_1_2 | dis_1_3 | dis_2_3 | shortest_distance | dis_1 | dis_2 | dis_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Rua X 12345-123 | Rua X 12345-123 | Rua X 12345-113 | Rua X 12345-123 | -22.71704 | -46.91200 | -22.72704 | -46.93200 | -22.75704 | -46.92430 | 2.3359090 | 4.628388 | 3.4318723 | dis_1_2 | 1 | 1 | 0 |
Rua Y 54321-321 | Rua Y 52321-321 | Rua YY 54321-321 | Rua Y 54321-321 | -22.71258 | -46.90435 | -22.71268 | -46.90435 | -22.71258 | -46.90435 | 0.0111319 | 0.000000 | 0.0111319 | dis_1_3 | 1 | 0 | 1 |
Rua Y 99999 | Rua Y 39999 | Rua 99999 | Rua YY 19999 | -22.77704 | -46.97200 | -22.72304 | -46.99200 | NA | NA | 6.3522199 | NA | NA | dis_1_2 | 1 | 1 | 0 |
Rua Z 11111-888 | Rua Z 11111-888 | Rua G 11111 | Rua K 11111 | -22.74704 | -46.91200 | NA | NA | NA | NA | NA | NA | NA | just lat1 and lon1 | 0 | 0 | 0 |
Rua Z 22211-888 | NA | NA | NA | NA | NA | -22.72704 | -46.92435 | NA | NA | NA | NA | NA | just lat2 and lon2 | 0 | 0 | 0 |
Rua Z 11111-832 | NA | NA | NA | NA | NA | NA | NA | -22.77704 | -46.92439 | NA | NA | NA | just lat3 and lon3 | 0 | 0 | 0 |
The second segment of the function focuses on determining the final coordinate based on two key measures of quality:
The cep_confirmation
functionality utilizes
extr_cep()
internally for the provided address columns.
It’s important to note that when cep_confimation = TRUE
,
four address strings are required. One corresponds to the
input_addr
utilized in Geocoding, which must contain CEP
information. The other three represent the output_addr_*
,
each corresponding to a different Geocoding service. Additionally, the
subsector_as_zip
parameter can be specified to consider
five-digit patterns as CEP during extraction. The outcome generates four
new columns containing the CEPs for each provided address.
On the other hand, cep_comparison
relies on
compare_cep()
and necessitates
cep_confirmation = TRUE
, along with all address columns. It
compares the CEP extracted from the input_addr
in the
cep_confirmation.
Subsequently, it generates three new
columns to verify if the CEP from the input_addr
matches
the CEP from the output_addr
for each of the three
services. Setting strict_check
to TRUE
enables
the evaluation of equality between CEPs with five digits. It’s advisable
to match subsector_as_zip = TRUE
with
strict_check = TRUE
for consistency.
result <- get_best_coords(df, "lat1", "lon1", "lat2", "lon2", "lat3", "lon3", "dis",
short_distance = TRUE,
mdc = TRUE,
summarize_mdc = TRUE,
cep_confirmation = TRUE,
subsector_as_zip = TRUE,
cep_comparison = TRUE,
strict_check = TRUE,
input_addr = "input_addr",
output_addr_1 = "output_addr_1",
output_addr_2 = "output_addr_2",
output_addr_3 = "output_addr_3")
df <- result$original_data
knitr::kable(head(df), align = "c")
input_addr | output_addr_1 | output_addr_2 | output_addr_3 | lat1 | lon1 | lat2 | lon2 | lat3 | lon3 | dis_1_2 | dis_1_3 | dis_2_3 | shortest_distance | dis_1 | dis_2 | dis_3 | input_addr_cep | output_addr_cep_1 | output_addr_cep_2 | output_addr_cep_3 | comparison_cep_input_output_1 | comparison_cep_input_output_2 | comparison_cep_input_output_3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Rua X 12345-123 | Rua X 12345-123 | Rua X 12345-113 | Rua X 12345-123 | -22.71704 | -46.91200 | -22.72704 | -46.93200 | -22.75704 | -46.92430 | 2.3359090 | 4.628388 | 3.4318723 | dis_1_2 | 1 | 1 | 0 | 12345123 | 12345123 | 12345113 | 12345123 | 1 | 0 | 1 |
Rua Y 54321-321 | Rua Y 52321-321 | Rua YY 54321-321 | Rua Y 54321-321 | -22.71258 | -46.90435 | -22.71268 | -46.90435 | -22.71258 | -46.90435 | 0.0111319 | 0.000000 | 0.0111319 | dis_1_3 | 1 | 0 | 1 | 54321321 | 52321321 | 54321321 | 54321321 | 0 | 1 | 1 |
Rua Y 99999 | Rua Y 39999 | Rua 99999 | Rua YY 19999 | -22.77704 | -46.97200 | -22.72304 | -46.99200 | NA | NA | 6.3522199 | NA | NA | dis_1_2 | 1 | 1 | 0 | 99999 | 39999 | 99999 | 19999 | 0 | 1 | 0 |
Rua Z 11111-888 | Rua Z 11111-888 | Rua G 11111 | Rua K 11111 | -22.74704 | -46.91200 | NA | NA | NA | NA | NA | NA | NA | just lat1 and lon1 | 0 | 0 | 0 | 11111888 | 11111888 | 11111 | 11111 | 1 | NA | NA |
Rua Z 22211-888 | NA | NA | NA | NA | NA | -22.72704 | -46.92435 | NA | NA | NA | NA | NA | just lat2 and lon2 | 0 | 0 | 0 | 22211888 | NA | NA | NA | NA | NA | NA |
Rua Z 11111-832 | NA | NA | NA | NA | NA | NA | NA | -22.77704 | -46.92439 | NA | NA | NA | just lat3 and lon3 | 0 | 0 | 0 | 11111832 | NA | NA | NA | NA | NA | NA |
The entire process aims to identify the best coordinate for each
address, following the workflow previously outlined. Firstly, it
examines the equality between the CEPs of input_addr
and
output_addr_*
. If all three match (1), it proceeds to
assess the short_distance. Then, it utilizes the mdc ranking generated
by summarize_mdc as a tiebreaker.
When two output_addr_*
instances share CEP equality with
the input_addr
, the decision primarily relies on the
mdc
rank, omitting distance comparison.
If only one output_addr_*
shares CEP equality with the
input_addr
, it is selected as the final coordinate.
The process maintains a consistent logic even when no CEP equality
exists between any output_addr_*
and the
input_addr.
In such cases, it examines the short_distance
when coordinates are available for three output_addr_*
. It
selects the closest one, followed by a tiebreaker based on the mdc. When
only two coordinates are available with no CEP equality, it relies
solely on the mdc
rank to pick the better coordinate. If
there’s only one output_addr_*
available without CEP
equality, that coordinate is chosen.
The flowchart succinctly illustrates the entire process of selecting the final coordinate.
Note: Final coordinate selection is still in development
These methods are valuable in establishing metrics and measures for evaluating Geocoding quality and coverage across different municipalities. Additionally, they serve when precision in coordinate selection is essential for the task at hand.