Changelog • dbGaPCheckup

dbGaP version 1.2.0

New function duplicated_id_check() checks for duplicated subject IDs in the data set (returns a warning, as this is allowed in longitudinal studies)
New function ascii_check() scans both the data dictionary and data set for (1) non-ASCII characters (e.g., é, ñ) and (2) newline (\n) and carriage return (\r) characters
New helper ascii_cleaner() cleans a data frame by (1) converting smart quotes to straight quotes, (2) replacing accented characters with ASCII equivalents, and (3) removing newline and carriage return characters
complete_check(): updated to include both duplicated_id_check() and ascii_check()
values_check(): updated Check 1 to require that each VALUES cell contain exactly one equals sign (=), in alignment with dbGaP formatting requirements (e.g., a cell holding 1=Yes passes, whereas a single cell holding 1=Yes; 0=No is flagged); added new Check 5 to detect duplicated MEANINGs in VALUES=MEANING entries
misc_format_check(): no longer errors when the VALUES column is the last column in the data dictionary (i.e., no columns follow it); a WARNING is still returned, as this structure is valid but expected to be uncommon
integer_check(): resolved a rare error when the TYPE column contains malformed or unexpected values
Documentation: updated to emphasize the importance of reading CSVs using readr::read_csv(…, na = c("", "NA")) or read.csv(…, na.strings = c("", "NA")) to correctly interpret missing cells, particularly in the VALUES column (see GitHub Issue #16 for discussion)
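The effect of the na.strings setting mentioned in the documentation note can be seen in a minimal base R sketch. The two-row data dictionary below is made up for illustration and is not one of the package's example files; only read.csv() and its na.strings argument come from the note itself.

```r
# Hypothetical two-row data dictionary; the AGE row has an empty VALUES cell.
csv_text <- paste0(
  "VARNAME,VARDESC,TYPE,VALUES\n",
  "SEX,Sex of participant,integer,0=Male\n",
  "AGE,Age in years,integer,\n"
)

# Default behaviour: only the literal string "NA" is treated as missing,
# so the blank cell in a character column is kept as an empty string.
dd_default <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Recommended behaviour: blank cells are also read as NA.
dd_na <- read.csv(text = csv_text, na.strings = c("", "NA"),
                  stringsAsFactors = FALSE)

# Compare how the blank VALUES cell was interpreted in each case.
is.na(dd_default$VALUES)
is.na(dd_na$VALUES)
```

With the default settings the empty cell is indistinguishable from a deliberately empty string, which is why the note recommends setting the missing-value strings explicitly.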

minmax_check: adjusted to return a sorted list of out-of-range values and polished documentation to be more informative
values_check: corrected bug in code that was not detecting leading/trailing zeros in VALUES columns
name_correct: when run on data that need no correction, now prints a message informing the user that no discrepancies were detected

CRAN release: 2023-09-27

added an informative error message when the required VALUES column is missing
adjusted values_check to temporarily create dummy names for blank-named columns beyond VALUES, preventing the function from failing when column names after VALUES are blank strings
corrected minmax_check and integer_check bugs that occurred when SUBJECT_ID was a character vector
adjusted misc_format_check to return a WARNING that alerts users if they read in a data set and R automatically fills in column names after VALUES (which is allowed by the package, but not dbGaP itself)
adjusted NA_check to correctly capture NA=N/A VALUES
corrected bug in type_check that was allowing some non-allowable TYPE entries to pass
corrected bug in missing_value_check that was flagging some variables even when they had properly encoded NA=N/A VALUES
made complete_check more robust to errors by wrapping functions in tryCatch
used seealso to link utility functions to relevant check functions
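The tryCatch entry above describes a general pattern that can be sketched in base R. The check functions and the run_checks() helper below are hypothetical stand-ins, not the package's own code; the sketch only shows how wrapping each call lets later checks run after an earlier one fails.

```r
# Hypothetical stand-ins for individual checks (not dbGaPCheckup functions).
check_ok <- function(x) "passed"
check_broken <- function(x) stop("unexpected input")

# Run every check, converting any error into a message string so that
# one failing check does not stop the remaining ones.
run_checks <- function(x, checks) {
  lapply(checks, function(f) {
    tryCatch(
      f(x),
      error = function(e) paste("ERROR:", conditionMessage(e))
    )
  })
}

results <- run_checks(1, list(broken = check_broken, ok = check_ok))
results
```

Here the failing check is listed first on purpose: its error is captured as a string and the second check still runs.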

CRAN release: 2023-02-22

removed row numbers from data set files
renamed data dictionary files by removing “SSM” acronym (done to avoid confusion as this means “subject sample mapping” and is intended for use with other dbGaP data files)
updated id_check() to include a check for missing SUBJECT_IDs (not allowed by dbGaP)
updated row_check() to check for duplicate and empty rows in the data dictionary (and not just the data set)
updated misc_format_check() to check that there are no missing VARNAME cells
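As a rough illustration of what checks for duplicate rows, empty rows, and missing VARNAME cells might look for, here is a base R sketch. It assumes a small made-up data dictionary and is not the logic used by row_check() or misc_format_check().

```r
# Hypothetical data dictionary: row 3 duplicates row 2, row 4 is empty.
dd <- data.frame(
  VARNAME = c("SEX", "AGE", "AGE", NA),
  VARDESC = c("Sex", "Age in years", "Age in years", NA),
  stringsAsFactors = FALSE
)

# Rows that exactly repeat an earlier row.
dup_rows <- which(duplicated(dd))

# Rows in which every cell is NA or an empty string.
filled <- !is.na(dd) & dd != ""
empty_rows <- unname(which(rowSums(filled) == 0))

# Rows with a missing VARNAME cell.
missing_varname <- which(is.na(dd$VARNAME) | dd$VARNAME == "")
```

An empty row is also reported as having a missing VARNAME, so a real check would likely decide which of the two messages to show.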

CRAN release: 2022-12-22

apply na_if() to one column at a time (vs. entire data frame at once) to maintain compatibility with next version of dplyr

CRAN release: 2022-11-14