| Title: | Taxonomic Name Reconciliation Against the 'WCVP' Backbone |
|---|---|
| Description: | Standardizes and reconciles scientific plant names against a World Checklist of Vascular Plants ('WCVP')-style taxonomic backbone. The package parses names into taxonomic components and applies staged exact and fuzzy matching for binomial and trinomial inputs, including infraspecific rank-aware checks. It also returns accepted-name context and row-level matching flags to support reproducible, auditable preprocessing for downstream biodiversity, spatial, and trait analyses. A user-supplied backbone can be passed through 'target_df'; when the optional companion package 'wcvpdata' is installed, its default checklist can also be used. |
| Authors: | Paul Efren Santos Andrade [aut, cre, cph] |
| Maintainer: | Paul Efren Santos Andrade <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.1 |
| Built: | 2026-05-31 10:39:31 UTC |
| Source: | https://github.com/PaulESantos/wcvpmatch |
Creates a compact genus-level index from the target backbone. The index stores
one row per genus and a list-column with candidate plant_name_id values
associated with each genus.
If plant_name_id is not present in target_df, a surrogate integer ID is
created to keep the index usable with custom backbones.
build_genus_index(target_df = NULL)
target_df |
Optional custom target table. If |
A tibble with columns:
Genus name (character).
List-column of unique IDs per genus.
Number of IDs per genus.
Number of characters in the genus name.
    library(wcvpmatch)
    build_genus_index()
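The index structure described above can be approximated in base R. This is a minimal sketch, not the package's implementation: the toy `target` table and the output column names (`genus`, `ids`, `n_ids`, `genus_nchar`) are assumptions chosen for illustration; only `plant_name_id` and the surrogate-ID behaviour come from the description.

```r
# Toy backbone without plant_name_id; all column names here are assumed.
target <- data.frame(
  genus = c("Opuntia", "Opuntia", "Opuntia", "Rosa"),
  species = c("ficus-indica", "ficus-indica", "stricta", "canina"),
  stringsAsFactors = FALSE
)

# Create a surrogate integer ID when plant_name_id is absent.
if (!"plant_name_id" %in% names(target)) {
  target$plant_name_id <- seq_len(nrow(target))
}

# Collect the unique IDs belonging to each genus.
ids_by_genus <- lapply(split(target$plant_name_id, target$genus), unique)

# One row per genus, with a list-column of candidate IDs.
genus_index <- data.frame(genus = names(ids_by_genus), stringsAsFactors = FALSE)
genus_index$ids <- unname(ids_by_genus)
genus_index$n_ids <- lengths(genus_index$ids)
genus_index$genus_nchar <- nchar(genus_index$genus)

genus_index
```

Storing the name length alongside each genus is what makes a cheap length-based prefilter possible before any string distance is computed.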
Parse and classify scientific plant names into taxonomic components: genus, specific epithet, infraspecific rank, infraspecific epithet, and author.
Output is aligned to a backbone convention:
Orig.Genus in Title Case (first letter uppercase, rest lowercase).
Orig.Species and Orig.Infraspecies epithets in lowercase.
Infra.Rank in lowercase (subsp., var., subvar., f., subf.).
Author is recovered from the input and preserved in its original
casing/punctuation (no forced uppercasing).
Orig.Name is reconstructed as: genus + species + (rank + infra) + author.
Robustness rules:
cf. / aff. are removed from parsing but preserved as flags (has_cf, has_aff).
Standalone hybrid markers (x or ×) are removed and had_hybrid is set to TRUE.
sp. / spp. triggers genus-only classification (Rank = 1, Orig.Species = NA)
and sets is_sp/is_spp.
If an infraspecific rank is present but the infraspecific epithet is missing,
sets rank_missing_infra = TRUE and keeps Infra.Rank while Orig.Infraspecies = NA.
If rank appears "late" (after author-like tokens), parsing is best-effort and
rank_late = TRUE.
If there is no explicit rank and a third token exists, the function can infer an
unranked infraspecific epithet when the third token looks epithet-like (all lowercase),
and does not look like the start of an author. In that case implied_infra = TRUE,
Orig.Infraspecies is filled, Infra.Rank = NA, and Rank = 3.
classify_spnames(splist)
splist |
Character vector. Scientific plant names. |
A tibble with one row per input name and standardized columns/flags:
Numeric index of original order.
Original input string as provided by user.
Reconstructed standardized name aligned to backbone + original-cased author.
Genus in Title Case.
Specific epithet in lowercase, or NA for genus-only (sp./spp.).
Recovered author string (original casing/punctuation) or "".
Infraspecific epithet in lowercase (ranked or implied), or NA.
Infraspecific rank in lowercase (subsp., var., subvar., f., subf.), or NA.
Numeric level: 1 genus-only, 2 genus+species, 3 includes infraspecific epithet.
Logical flags.
    library(wcvpmatch)
    classify_spnames(c("Opuntia sp.", "Rosa canina subsp. coriifolia (Fr.) Leffler"))
    classify_spnames(c("Cydonia japonica tricolor")) # implied unranked infra epithet
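Several of the parsing rules above can be illustrated with a small base-R tokenizer. This is a simplified sketch under stated assumptions, not the logic of `classify_spnames()`: the helper `parse_name_sketch()` is hypothetical, it does not recover the author string, and it ignores the late-rank and missing-epithet flags.

```r
# Hypothetical helper: a reduced version of the documented parsing rules.
parse_name_sketch <- function(x) {
  tokens <- strsplit(trimws(x), "\\s+")[[1]]
  low <- tolower(tokens)

  # cf. / aff. and standalone hybrid markers are dropped but recorded as flags.
  has_cf <- "cf." %in% low
  has_aff <- "aff." %in% low
  had_hybrid <- any(tokens %in% c("x", "\u00D7"))
  keep <- !(low %in% c("cf.", "aff.")) & !(tokens %in% c("x", "\u00D7"))
  tokens <- tokens[keep]

  # Genus in Title Case.
  genus <- paste0(toupper(substr(tokens[1], 1, 1)), tolower(substring(tokens[1], 2)))

  ranks <- c("subsp.", "var.", "subvar.", "f.", "subf.")
  species <- NA_character_
  infra_rank <- NA_character_
  infra <- NA_character_
  rank <- 1L
  implied_infra <- FALSE

  # sp. / spp. keeps the name at genus level (Rank = 1).
  if (length(tokens) >= 2 && !(tolower(tokens[2]) %in% c("sp.", "spp."))) {
    species <- tolower(tokens[2])
    rank <- 2L
    third <- if (length(tokens) >= 3) tokens[3] else NA_character_
    if (!is.na(third) && tolower(third) %in% ranks) {
      infra_rank <- tolower(third)
      if (length(tokens) >= 4) {
        infra <- tolower(tokens[4])
        rank <- 3L
      }
    } else if (!is.na(third) && grepl("^[a-z-]+$", third)) {
      # All-lowercase third token: treat as an implied, unranked epithet.
      infra <- third
      rank <- 3L
      implied_infra <- TRUE
    }
  }

  list(Orig.Genus = genus, Orig.Species = species, Infra.Rank = infra_rank,
       Orig.Infraspecies = infra, Rank = rank, has_cf = has_cf,
       has_aff = has_aff, had_hybrid = had_hybrid, implied_infra = implied_infra)
}

ranked <- parse_name_sketch("Rosa canina subsp. coriifolia (Fr.) Leffler")
genus_only <- parse_name_sketch("Opuntia sp.")
implied <- parse_name_sketch("Cydonia japonica tricolor")
```

A real parser also has to decide whether a token such as `f.` is a rank or part of an author abbreviation, which this sketch does not attempt.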
A cleaned dataset containing tree species recorded by the
Forest Inventory and Analysis (FIA) program of the U.S. Forest Service.
This dataset is used in the examples and README of the wcvpmatch
package. The data was downloaded in November 2022 from the official
webpage of the Forest Inventory and Analysis National Program, available
at the following link,
and was originally used during the development of the treemendous
package. For wcvpmatch, the variable names have been standardized
to Orig.Genus and Orig.Species.
fia
A data frame with 2169 rows and 2 variables:
Genus name of the species binomial
Specific epithet of the species binomial
Reduces the target backbone to genera relevant for the current input names.
This is designed as a pre-step before wcvp_matching() to reduce search space.
Strategy:
Exact genus candidates are always included.
Optional fuzzy genus candidates are included when include_fuzzy = TRUE.
Returned object preserves the standard target schema used by the package.
prefilter_target_by_genus(
  df,
  target_df = NULL,
  genus_index = NULL,
  include_fuzzy = TRUE,
  max_dist = 1,
  method = "osa"
)
df |
Input tibble/data.frame with either |
target_df |
Optional custom target table. If |
genus_index |
Optional pre-built index from |
include_fuzzy |
Logical. If |
max_dist |
Maximum fuzzy distance for genus matching (used when |
method |
String distance method passed to |
A prefiltered target_df tibble compatible with wcvp_matching(target_df = ...).
Attributes:
Character vector of selected genera.
Character vector of exact matched genera.
Character vector of fuzzy matched genera.
    library(wcvpmatch)
    df <- data.frame(Genus = "Opuntia", Species = "yanganucensis")
    prefilter_target_by_genus(df)
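The exact-plus-fuzzy strategy can be sketched in base R as follows. The helper `prefilter_sketch()`, the toy `target` table, and the attribute names are hypothetical. `utils::adist()` computes Levenshtein distance and stands in here for the `"osa"` method, which additionally counts an adjacent transposition as a single edit.

```r
# Hypothetical helper: keep only target rows whose genus is an exact or
# near match for one of the input genera.
prefilter_sketch <- function(input_genera, target, include_fuzzy = TRUE, max_dist = 1) {
  target_genera <- unique(target$genus)

  # Exact genus candidates are always included.
  exact <- intersect(input_genera, target_genera)

  # Fuzzy candidates are searched only for genera without an exact hit.
  fuzzy <- character(0)
  unmatched <- setdiff(input_genera, exact)
  if (include_fuzzy && length(unmatched) > 0) {
    d <- utils::adist(unmatched, target_genera)
    fuzzy <- target_genera[colSums(d <= max_dist) > 0]
  }

  out <- target[target$genus %in% c(exact, fuzzy), , drop = FALSE]
  attr(out, "exact_genera") <- exact  # assumed attribute name
  attr(out, "fuzzy_genera") <- fuzzy  # assumed attribute name
  out
}

target <- data.frame(
  genus = c("Opuntia", "Rosa", "Rubus", "Quercus"),
  species = c("stricta", "canina", "idaeus", "robur"),
  stringsAsFactors = FALSE
)

reduced <- prefilter_sketch(c("Opuntia", "Rossa"), target)
```

Here `"Rossa"` has no exact match, so `"Rosa"` enters through the fuzzy step while `"Rubus"` and `"Quercus"` are dropped from the search space.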
Tries to directly match either Genus + Species (binomials) or Genus + Species + Rank + Infraspecies (trinomials) against WCVP data.
wcvp_direct_match(df, target_df = NULL)
df |
|
target_df |
Optional custom target table. If |
Returns a tibble with the additional logical column direct_match, indicating whether the binomial was successfully matched (TRUE) or not (FALSE).
Returns original columns plus Matched.Genus, Matched.Species, Matched.Infra.Rank, and Matched.Infraspecies.
    library(wcvpmatch)
    # Simple binomial match
    df_parsed <- classify_spnames("Opuntia yanganucensis")
    wcvp_direct_match(df_parsed)
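Conceptually, a direct match is an exact lookup on the combined name components. The base-R sketch below shows the binomial case only; the toy `target` table and its column names (`genus`, `species`) are assumptions, while the output column names follow the description above.

```r
input <- data.frame(
  Orig.Genus = c("Opuntia", "Opuntia"),
  Orig.Species = c("yanganucensis", "inexistens"),
  stringsAsFactors = FALSE
)

# Toy backbone; column names are assumed.
target <- data.frame(
  genus = c("Opuntia", "Rosa"),
  species = c("yanganucensis", "canina"),
  stringsAsFactors = FALSE
)

# Exact lookup on the pasted genus + species key.
hit <- match(
  paste(input$Orig.Genus, input$Orig.Species),
  paste(target$genus, target$species)
)

input$direct_match <- !is.na(hit)
input$Matched.Genus <- target$genus[hit]
input$Matched.Species <- target$species[hit]

input
```

Rows without a hit keep `NA` in the matched columns, which is what lets later stages pick them up.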
Tries to directly match the specific epithet within an already matched genus in 'WCVP'.
wcvp_direct_match_species_within_genus(df, target_df = NULL)
df |
|
target_df |
Optional custom target table. If |
Returns a tibble with the additional logical column direct_match_species_within_genus, indicating whether the specific epithet was successfully matched within the matched genus (TRUE) or not (FALSE).
wcvp_distribution(
  taxon,
  taxon_rank = c("species", "genus", "family"),
  native = TRUE,
  introduced = TRUE,
  extinct = TRUE,
  location_doubtful = TRUE,
  wcvp_names = NULL,
  wcvp_distributions = NULL,
  prefilter_genus = TRUE,
  fallback_to_genus = TRUE,
  summarise_by_input = FALSE,
  max_dist = NULL,
  method = "osa"
)
taxon |
Character vector of taxa to query. |
taxon_rank |
Character scalar. One of |
native |
Logical. Include native occurrences? Defaults to |
introduced |
Logical. Include introduced occurrences? Defaults to
|
extinct |
Logical. Include extinct occurrences? Defaults to |
location_doubtful |
Logical. Include doubtful occurrences? Defaults to
|
wcvp_names |
Optional WCVP names table. If |
wcvp_distributions |
Optional WCVP distribution table. If |
prefilter_genus |
Logical. Forwarded to |
fallback_to_genus |
Logical. If |
summarise_by_input |
Logical. If |
max_dist |
Maximum string distance. If |
method |
String distance method passed to |
Queries distribution records by matching a taxon name against the WCVP names
table and then resolving the corresponding rows in the WCVP distribution
table. The function is designed around wcvpdata::wcvp_checklist_names and
wcvpdata::wcvp_checklist_distribution, but custom tables with the same
schema can also be supplied.
Matching is performed with fozziejoin, using compact lookup tables and
length-based prefiltering to keep the candidate set small. Species queries are
resolved in two stages: genus candidates are matched first, then species
names are searched only within those candidate genera.
If species-level matches resolve to synonyms and the names table contains
accepted_plant_name_id, distribution is recovered from the accepted taxon.
For genus- and family-level queries, accepted names are preferred to avoid
double counting synonym records.
By default, a tibble with one row per matched query-area combination.
If summarise_by_input = TRUE, returns one row per input taxon with
collapsed text fields such as distribution, areas, area_codes,
regions, continents, and n_areas.
    library(wcvpmatch)
    wcvp_distribution("Opuntia ficus-indica", taxon_rank = "species")
    wcvp_distribution("Opuntia", taxon_rank = "genus")
    wcvp_distribution("Cactaceae", taxon_rank = "family")
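The synonym-to-accepted resolution described above can be sketched with two toy tables in base R. The taxon names, area labels, and the columns `taxon_name` and `area` are fictional placeholders; only `plant_name_id`, `accepted_plant_name_id`, and the summary field `n_areas` are taken from the text.

```r
# Toy names table: row 2 is a synonym pointing at the accepted taxon in row 1.
names_tbl <- data.frame(
  plant_name_id = c(1L, 2L),
  taxon_name = c("Exemplum acceptum", "Exemplum synonymum"),
  accepted_plant_name_id = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Toy distribution table: records are attached to the accepted taxon only.
dist_tbl <- data.frame(
  plant_name_id = c(1L, 1L),
  area = c("Area A", "Area B"),
  stringsAsFactors = FALSE
)

query <- data.frame(taxon = "Exemplum synonymum", stringsAsFactors = FALSE)

# Resolve the query to the accepted ID, then join the distribution rows.
hit <- match(query$taxon, names_tbl$taxon_name)
query$plant_name_id <- names_tbl$accepted_plant_name_id[hit]
long <- merge(query, dist_tbl, by = "plant_name_id")

# Optional collapse to one row per input taxon.
summary_tbl <- aggregate(
  area ~ taxon,
  data = long,
  FUN = function(a) paste(sort(a), collapse = ", ")
)
names(summary_tbl)[2] <- "areas"
summary_tbl$n_areas <- as.integer(table(long$taxon)[summary_tbl$taxon])
```

The `long` table corresponds to the default one-row-per-query-area shape, and `summary_tbl` to the collapsed shape returned when summarising by input.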
Tries to fuzzy match the genus name to the 'WCVP' table (using the optional wcvpdata checklist by default when available).
wcvp_fuzzy_match_genus(df, target_df = NULL, max_dist = 1, method = "osa")
df |
|
target_df |
Optional custom target table. If |
max_dist |
Maximum edit distance used for fuzzy genus matching. |
method |
String distance method passed to |
Returns a tibble with the additional logical column fuzzy_match_genus, indicating whether the genus was successfully matched (TRUE) or not (FALSE).
Further, the additional column fuzzy_genus_dist returns the distance for every match.
    library(wcvpmatch)
    df <- data.frame(Orig.Genus = "Opuntiaa", Orig.Species = "yanganucensis")
    wcvp_fuzzy_match_genus(df)
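To make the default `method = "osa"` concrete, the sketch below implements optimal string alignment distance in base R and uses it to pick the closest genus. Both `osa_dist()` and `fuzzy_genus_sketch()` are hypothetical helpers written for illustration, and the tie-breaking rule (first minimum wins) is an assumption, not documented package behaviour.

```r
# Optimal string alignment: Levenshtein edits plus adjacent transpositions.
osa_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(x[i] != y[j])
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution
      )
      # Adjacent transposition counts as a single edit.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1L)
      }
    }
  }
  d[n + 1, m + 1]
}

target_genera <- c("Opuntia", "Rosa", "Quercus")

fuzzy_genus_sketch <- function(genus, max_dist = 1) {
  d <- vapply(target_genera, function(g) osa_dist(genus, g), integer(1))
  best <- which.min(d)  # assumption: ties resolved by first minimum
  if (d[[best]] <= max_dist) {
    list(Matched.Genus = target_genera[best], fuzzy_match_genus = TRUE,
         fuzzy_genus_dist = d[[best]])
  } else {
    list(Matched.Genus = NA_character_, fuzzy_match_genus = FALSE,
         fuzzy_genus_dist = NA_integer_)
  }
}

res <- fuzzy_genus_sketch("Opuntiaa")
```

With `max_dist = 1`, a swapped pair of letters such as `"Opnutia"` is still within reach under OSA, whereas plain Levenshtein distance would score it as 2.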
Tries to fuzzy match the species epithet within a matched genus against 'WCVP' (using the optional wcvpdata checklist by default when available).
wcvp_fuzzy_match_species_within_genus(
  df,
  target_df = NULL,
  max_dist = 1,
  method = "osa"
)
df |
|
target_df |
Optional custom target table. If |
max_dist |
Maximum edit distance used for fuzzy species matching within genus. |
method |
String distance method passed to |
Returns a tibble with the additional logical column fuzzy_match_species_within_genus, indicating whether the specific epithet was successfully fuzzy matched within the matched genus (TRUE) or not (FALSE).
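The key idea is that fuzzy candidates are restricted to epithets of the already matched genus. A minimal base-R sketch, assuming a toy `target` table with `genus` and `species` columns and a hypothetical helper `match_one()`; `utils::adist()` (Levenshtein) is used as a stand-in for the `"osa"` method.

```r
# Toy backbone; column names are assumed.
target <- data.frame(
  genus = c("Opuntia", "Opuntia", "Rosa"),
  species = c("yanganucensis", "stricta", "canina"),
  stringsAsFactors = FALSE
)

input <- data.frame(
  Matched.Genus = c("Opuntia", "Rosa", "Rosa"),
  Orig.Species = c("yanganucensiss", "kanina", "strikta"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: search only among epithets of the given genus.
match_one <- function(genus, epithet, max_dist = 1) {
  candidates <- unique(target$species[target$genus == genus])
  if (length(candidates) == 0) {
    return(NA_character_)
  }
  d <- utils::adist(epithet, candidates)[1, ]
  if (min(d) <= max_dist) candidates[which.min(d)] else NA_character_
}

input$Matched.Species <- mapply(
  match_one, input$Matched.Genus, input$Orig.Species,
  USE.NAMES = FALSE
)
input$fuzzy_match_species_within_genus <- !is.na(input$Matched.Species)

input
```

The third row stays unmatched: `"strikta"` is one edit away from `"stricta"`, but that epithet belongs to a different genus and is never considered.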
Tries to match the genus name to the 'WCVP' table (using the optional wcvpdata checklist by default when available).
wcvp_genus_match(df, target_df = NULL)
df |
|
target_df |
Optional custom target table. If |
Returns a tibble with the additional logical column genus_match, indicating whether the genus was successfully matched (TRUE) or not (FALSE).
Runs a matching pipeline with exact and partial matching for binomial and trinomial names, including infraspecific rank validation.
wcvp_matching(
  df,
  target_df = NULL,
  prefilter_genus = TRUE,
  allow_duplicates = FALSE,
  max_dist = 1,
  method = "osa",
  add_name_distance = FALSE,
  name_distance_method = "osa",
  profile = FALSE,
  output_name_style = c("snake_case", "legacy")
)
df |
Input tibble/data.frame with either |
target_df |
Optional custom target table. If |
prefilter_genus |
Logical. If |
allow_duplicates |
Logical. If |
max_dist |
Maximum distance used in all fuzzy matching stages (genus, species, infraspecies). |
method |
A string indicating the fuzzy matching method (passed to
|
add_name_distance |
Logical. If |
name_distance_method |
Method passed to |
profile |
Logical. If |
output_name_style |
Naming style for output columns:
|
Tibble with matched names, process flags, and taxonomic context
columns: matched_plant_name_id, matched_taxon_name, taxon_status,
accepted_plant_name_id, accepted_taxon_name, is_accepted_name.
    library(wcvpmatch)
    # Match a single name
    wcvp_matching(data.frame(Genus = "Opuntia", Species = "yanganucensis"))
    # Match multiple names with snake_case output
    names <- c("Aniba heterotepala", "Anthurium quipuscoae")
    df <- classify_spnames(names)
    wcvp_matching(df, output_name_style = "snake_case")
    # Attach per-stage timings for profiling
    out <- wcvp_matching(df, output_name_style = "snake_case", profile = TRUE)
    attr(out, "timings")
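The staged idea behind the pipeline, where each stage only sees names that earlier stages left unmatched, can be sketched in base R. This is a deliberate simplification: it matches whole name strings rather than genus, species, and infraspecific components, and `run_stages()`, `stage_exact()`, and `stage_fuzzy()` are hypothetical helpers, not package functions.

```r
target_names <- c("Opuntia yanganucensis", "Rosa canina")

# Stage 1: exact lookup.
stage_exact <- function(x) target_names[match(x, target_names)]

# Stage 2: closest name within max_dist (Levenshtein via utils::adist).
stage_fuzzy <- function(x, max_dist = 1) {
  vapply(x, function(nm) {
    d <- utils::adist(nm, target_names)[1, ]
    if (min(d) <= max_dist) target_names[which.min(d)] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

# Run stages in order; later stages receive only the still-unmatched rows.
run_stages <- function(x, stages) {
  matched <- rep(NA_character_, length(x))
  stage_used <- rep(NA_character_, length(x))
  for (nm in names(stages)) {
    todo <- which(is.na(matched))
    if (length(todo) == 0) break
    res <- stages[[nm]](x[todo])
    matched[todo] <- res
    stage_used[todo[!is.na(res)]] <- nm
  }
  data.frame(input = x, matched = matched, stage = stage_used,
             stringsAsFactors = FALSE)
}

out <- run_stages(
  c("Rosa canina", "Opuntia yanganucensiss", "Nomen ignotum"),
  list(exact = stage_exact, fuzzy = stage_fuzzy)
)
```

Recording which stage produced each match gives the kind of row-level flag that keeps the result auditable.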
Reports whether the optional companion package wcvpdata is available for
use as the default WCVP backbone and, if not, explains how to install it
from r-universe.
wcvp_setup_info(inform = TRUE)
inform |
Logical. If |
Invisibly returns a named list with setup status fields:
default_backbone_available, wcvpdata_installed,
wcvpdata_has_backbone, wcvpdata_version, repository, and
install_command.
    library(wcvpmatch)
    wcvp_setup_info()
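A check of this kind is typically built on `requireNamespace()`. The sketch below is an assumption about the general approach rather than the function's actual code, and it fills only two of the documented status fields.

```r
# Detect the optional companion package without attaching it.
installed <- requireNamespace("wcvpdata", quietly = TRUE)

status <- list(
  wcvpdata_installed = installed,
  wcvpdata_version = if (installed) {
    as.character(utils::packageVersion("wcvpdata"))
  } else {
    NA_character_
  }
)

if (!installed) {
  message("wcvpdata is not installed; supply a custom backbone via target_df.")
}

invisible(status)
```

Because the package is optional, code that depends on the default backbone can branch on `status$wcvpdata_installed` instead of failing at load time.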
Tries to match the specific epithet by exchanging common suffixes within an already matched genus in 'WCVP'.
The following suffixes are captured: c("a", "i", "is", "um", "us", "ae").
wcvp_suffix_match_species_within_genus(df, target_df = NULL)
df |
|
target_df |
Optional custom target table. If |
Returns a tibble with the additional logical column suffix_match_species_within_genus, indicating whether the specific epithet was successfully matched within the matched genus (TRUE) or not (FALSE).
    library(wcvpmatch)
    df <- data.frame(Orig.Genus = "Opuntia", Orig.Species = "yanganucensa", Matched.Genus = "Opuntia")
    wcvp_suffix_match_species_within_genus(df)
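Suffix exchange can be sketched as generating alternative spellings of an epithet and looking them up among the species of the matched genus. The helper `suffix_variants()` and the toy candidate list are hypothetical; only the suffix set is taken from the description above, and the package may apply the exchange differently.

```r
suffixes <- c("a", "i", "is", "um", "us", "ae")

# Hypothetical helper: strip a known suffix and try every other one.
suffix_variants <- function(epithet) {
  out <- character(0)
  for (s in suffixes) {
    if (endsWith(epithet, s) && nchar(epithet) > nchar(s)) {
      stem <- substr(epithet, 1, nchar(epithet) - nchar(s))
      out <- c(out, paste0(stem, setdiff(suffixes, s)))
    }
  }
  unique(setdiff(out, epithet))
}

# Toy list of epithets known within the matched genus.
species_in_genus <- c("yanganucensis", "stricta")

variants <- suffix_variants("yanganucensa")
matched <- intersect(variants, species_in_genus)
suffix_match_species_within_genus <- length(matched) > 0
```

This catches gender-agreement and ending errors that can sit more than one edit away, beyond the reach of a fuzzy stage capped at `max_dist = 1`.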