| Title: | Fast Fuzzy String Joins for Data Frames |
|---|---|
| Description: | Perform fuzzy joins on data frames using approximate string matching. Implements inner, left, right, full, semi, and anti joins with string distance metrics from the 'stringdist' package, including Optimal String Alignment, Levenshtein, Damerau-Levenshtein, Jaro-Winkler, q-gram, cosine, Jaccard, and Soundex. Uses a 'data.table' backend plus compiled 'C++' result assembly to reduce overhead in large joins, while adaptive candidate planning avoids unnecessary distance evaluations in single-column string joins. Suitable for reconciling misspellings, inconsistent labels, and other near-match identifiers while optionally returning the computed distance for each match. Bibliographic references include Van der Loo, M. P. J. (2014) <https://CRAN.R-project.org/package=stringdist> and Robinson, D. (2015) <https://github.com/dgrtwo/fuzzyjoin>. |
| Authors: | Paul E. Santos Andrade [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-6635-0375>), David Robinson [ctb] (aut of fuzzyjoin) |
| Maintainer: | Paul E. Santos Andrade <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.5 |
| Built: | 2026-05-25 03:05:04 UTC |
| Source: | https://github.com/PaulESantos/fuzzystring |
fuzzystring provides fuzzy inner, left, right, full, semi, and anti joins
for data.frame and data.table objects using approximate string matching.
It combines stringdist metrics with a data.table backend and compiled 'C++'
result assembly to reduce overhead in large joins while preserving familiar
join semantics.
Main entry points are fuzzystring_join() and the convenience wrappers
fuzzystring_inner_join(), fuzzystring_left_join(),
fuzzystring_right_join(), fuzzystring_full_join(),
fuzzystring_semi_join(), and fuzzystring_anti_join().
The package also includes the example dataset misspellings.
Maintainer: Paul E. Santos Andrade [email protected] (ORCID) [copyright holder]
Authors:
Paul E. Santos Andrade [email protected] (ORCID) [copyright holder]
Other contributors:
David Robinson [email protected] (aut of fuzzyjoin) [contributor]
Useful links:
Report bugs at https://github.com/PaulESantos/fuzzystring/issues
fuzzystring_join() uses stringdist::stringdist() to compute distances and a
data.table-orchestrated backend with compiled 'C++' assembly to produce
the final result. It is the main user-facing entry point for fuzzy joins on
strings.
    fuzzystring_join(
      x,
      y,
      by = NULL,
      max_dist = 2,
      method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
      mode = "inner",
      ignore_case = FALSE,
      distance_col = NULL,
      ...
    )

    fuzzystring_inner_join(x, y, by = NULL, distance_col = NULL, ...)

    fuzzystring_left_join(x, y, by = NULL, distance_col = NULL, ...)

    fuzzystring_right_join(x, y, by = NULL, distance_col = NULL, ...)

    fuzzystring_full_join(x, y, by = NULL, distance_col = NULL, ...)

    fuzzystring_semi_join(x, y, by = NULL, distance_col = NULL, ...)

    fuzzystring_anti_join(x, y, by = NULL, distance_col = NULL, ...)
| Argument | Description |
|---|---|
| x | A |
| y | A |
| by | Columns by which to join the two tables. You can supply a character vector of common names (e.g. |
| max_dist | Maximum distance to use for joining. Smaller values are stricter. |
| method | Method for computing string distance, see |
| mode | One of |
| ignore_case | Logical; if |
| distance_col | If not |
| ... | Additional arguments passed to |
If method = "soundex", max_dist is automatically set to 0.5,
since Soundex distance is 0 (match) or 1 (no match).
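The reasoning behind the 0.5 threshold can be checked directly with stringdist. The following is a minimal sketch, assuming the 'stringdist' package is installed; the example names and the variable names are illustrative only and do not come from fuzzystring itself.

```r
if (requireNamespace("stringdist", quietly = TRUE)) {
  # "Robert" and "Rupert" share the Soundex code R163, so their distance is 0
  same_code <- stringdist::stringdist("Robert", "Rupert", method = "soundex")
  # "Robert" (R163) and "Smith" (S530) have different codes, so their distance is 1
  diff_code <- stringdist::stringdist("Robert", "Smith", method = "soundex")

  # Any threshold strictly between 0 and 1 separates the two cases;
  # 0.5 is the value described above
  threshold <- 0.5
  print(c(same_code = same_code <= threshold, diff_code = diff_code <= threshold))
}
```

Because the distance only takes the values 0 and 1, a user-supplied max_dist of 1 or more would accept every pair, which is presumably why the value is overridden.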
When by maps multiple columns, the same method,
max_dist, and any additional stringdist arguments are applied
independently to each mapped column, and a row pair is kept only when all
mapped columns satisfy the distance threshold.
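The "all mapped columns must pass" rule can be illustrated without the package. This is a sketch of the rule only, not of fuzzystring's internals: it uses base R's utils::adist() (plain Levenshtein distance) as a stand-in for stringdist, and the data frames and column names are made up for the example.

```r
# Hypothetical inputs with two mapped columns: first and last
x <- data.frame(first = c("Jon", "Maria"), last = c("Smith", "Lopez"),
                stringsAsFactors = FALSE)
y <- data.frame(first = c("John", "Mario"), last = c("Smyth", "Gomez"),
                stringsAsFactors = FALSE)
max_dist <- 1

# Every (row of x, row of y) pair
pairs <- expand.grid(i = seq_len(nrow(x)), j = seq_len(nrow(y)))

# Distance per mapped column, computed independently with the same threshold
d_first <- utils::adist(x$first, y$first)[cbind(pairs$i, pairs$j)]
d_last  <- utils::adist(x$last,  y$last)[cbind(pairs$i, pairs$j)]

# A pair is kept only when every mapped column is within max_dist
keep <- d_first <= max_dist & d_last <= max_dist
print(pairs[keep, ])
```

Here "Maria"/"Mario" is within distance 1 on the first column, but "Lopez"/"Gomez" is at distance 2 on the second, so that pair is dropped; only the "Jon Smith"/"John Smyth" pair survives.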
For single-column joins, fuzzystring uses adaptive candidate planning before
calling stringdist::stringdist(). For Levenshtein-like methods
("osa", "lv", "dl"), a fast prefilter is applied: if
abs(nchar(v1) - nchar(v2)) > max_dist, the pair cannot match, so
distance is not computed for that pair. For low-duplication workloads, the
planner can also evaluate larger dense blocks of unique values to reduce
orchestration overhead while preserving the same matching semantics.
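The length prefilter follows from the fact that each insertion or deletion changes the string length by one, so a length gap larger than max_dist cannot be closed within max_dist edits. The sketch below shows the idea in base R, using utils::adist() (Levenshtein) in place of stringdist; the vectors and variable names are illustrative and this is not the package's actual planner.

```r
v1 <- c("apple", "banana", "kiwi")
v2 <- c("aple", "bananas", "watermelon")
max_dist <- 1

# All candidate pairs before filtering
pairs <- expand.grid(i = seq_along(v1), j = seq_along(v2))

# Cheap prefilter: pairs whose lengths differ by more than max_dist cannot match
len_gap <- abs(nchar(v1[pairs$i]) - nchar(v2[pairs$j]))
candidates <- pairs[len_gap <= max_dist, ]

# The expensive distance is computed only for the surviving pairs
d <- mapply(
  function(a, b) drop(utils::adist(a, b)),
  v1[candidates$i], v2[candidates$j],
  USE.NAMES = FALSE
)
matches <- candidates[d <= max_dist, ]

print(nrow(pairs))       # pairs before the prefilter
print(nrow(candidates))  # pairs that still need a distance computation
print(matches)
```

The filter only removes pairs that could never satisfy the threshold, so the set of matches is the same as computing every distance and filtering afterwards.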
A joined table that preserves the container class of x:
data.table inputs return data.table, tibble inputs return
tibble, and plain data.frame inputs return plain
data.frame. See fuzzystring_join_backend for details
on output structure.
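A quick way to inspect the returned container class is to join two plain data.frame inputs and look at class(). This is a minimal sketch, assuming the 'fuzzystring' package is installed; the toy data and the distance column name "dist" are arbitrary choices for illustration, and no particular output is asserted here.

```r
if (requireNamespace("fuzzystring", quietly = TRUE)) {
  # Plain data.frame inputs (illustrative data)
  x <- data.frame(name = c("Premium", "Ideal"), stringsAsFactors = FALSE)
  y <- data.frame(approximate_name = c("Premiom", "Idea"), stringsAsFactors = FALSE)

  res <- fuzzystring::fuzzystring_inner_join(
    x, y,
    by = c(name = "approximate_name"),
    max_dist = 1,
    distance_col = "dist"  # arbitrary name chosen for this example
  )

  # Per the description above, the class of x should be carried over
  print(class(res))
  print(res)
}
```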
    if (requireNamespace("ggplot2", quietly = TRUE)) {
      d <- data.table::data.table(approximate_name = c("Idea", "Premiom"))

      # Match diamonds$cut to d$approximate_name
      res <- fuzzystring_inner_join(ggplot2::diamonds, d,
        by = c(cut = "approximate_name"),
        max_dist = 1
      )
      head(res)
    }
This is a tbl_df mapping common misspellings to their correct spellings,
compiled by Wikipedia, where it is licensed under the CC-BY SA license. Three
words with non-ASCII characters were filtered out. If you'd like to reproduce
this dataset from Wikipedia, see the example code below.
misspellings
An object of class tbl_df (inherits from tbl, data.frame) with 4505 rows and 2 columns.
https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines
    if (interactive()) {
      library(rvest)
      library(readr)
      library(dplyr)
      library(stringr)
      library(tidyr)

      u <- "https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines"
      h <- read_html(u)

      misspellings <- h %>%
        html_nodes("pre") %>%
        html_text() %>%
        read_delim(col_names = c("misspelling", "correct"), delim = ">", skip = 1) %>%
        mutate(misspelling = str_sub(misspelling, 1, -2)) |>
        separate_rows(correct, sep = ", ") |>
        filter(Encoding(correct) != "UTF-8")
    }