Title: | Interface to the Penn Machine Learning Benchmarks Data Repository |
---|---|
Description: | Check available classification and regression data sets from the PMLB repository and download them. The PMLB repository (<https://github.com/EpistasisLab/pmlbr>) contains a curated collection of data sets for evaluating and comparing machine learning algorithms. These data sets cover a range of applications, and include binary/multi-class classification problems and regression problems, as well as combinations of categorical, ordinal, and continuous features. There are currently over 150 datasets included in the PMLB repository. |
Authors: | Trang Le [aut, cre] (https://trang.page/), makeyourownmaker [aut] (https://github.com/makeyourownmaker), Jason Moore [aut] (http://www.epistasisblog.org/), University of Pennsylvania [cph] |
Maintainer: | Trang Le |
License: | GPL-2 | file LICENSE |
Version: | 0.3.0.9000 |
Built: | 2025-03-11 15:30:14 UTC |
Source: | https://github.com/epistasislab/pmlbr |
Classification datasets
classification_datasets()
A character vector of classification dataset names.
if (interactive()) { sample(classification_datasets(), 10) }
Computes imbalance value for a given dataset.
compute_imbalance(target_col)
target_col |
Factor or character vector of target column. |
The value of the imbalance metric. Zero means the dataset is perfectly balanced; higher values indicate greater imbalance.
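The exact formula behind compute_imbalance is not stated here. As a minimal sketch of one metric with the described behavior (zero for a balanced target, larger for a skewed one), the hypothetical helper imbalance_sketch below sums the squared deviations of the class proportions from a uniform split and scales by the worst case. Both the helper name and the choice of formula are assumptions for illustration, not the package's implementation.

```r
# Hypothetical helper, not part of pmlbr: one possible imbalance metric.
imbalance_sketch <- function(target_col) {
  counts <- table(as.character(target_col))
  k <- length(counts)
  # Assumption: a target with fewer than two classes is reported as balanced.
  if (k < 2) return(0)
  p <- as.numeric(counts) / sum(counts)
  # Squared deviation of class proportions from the uniform proportion 1/k.
  observed <- sum((p - 1 / k)^2)
  # Largest possible deviation: every row belongs to a single class.
  worst <- (k - 1) * (1 / k)^2 + (1 - 1 / k)^2
  observed / worst
}

balanced <- imbalance_sketch(c("a", "b", "a", "b"))
skewed <- imbalance_sketch(c(rep("a", 9), "b"))
```

With this scaling the result stays between 0 and 1, so values can be compared across targets with different numbers of classes.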
All available datasets
dataset_names()
A character vector of all dataset names.
if (interactive()) { sample(dataset_names(), 10) }
Download a data set from the PMLB repository, (optionally) store it locally, and return the data set. You must be connected to the internet if you are fetching a data set that is not cached locally.
fetch_data(dataset_name, return_X_y = FALSE, local_cache_dir = NA, dropna = TRUE)
dataset_name |
The name of the data set to load from PMLB |
return_X_y |
Boolean. Whether to return the data with the features and labels stored in separate data structures or a single structure (can be TRUE or FALSE, defaults to FALSE) |
local_cache_dir |
The directory on your local machine to store the data files in (defaults to NA, indicating cache will not be used) |
dropna |
Boolean. Whether rows with NAs should be automatically dropped. Defaults to TRUE.
if (interactive()) {
  # Features and labels in single data frame
  penguins <- fetch_data("penguins")
  head(penguins)

  # Features and labels stored in separate data structures
  penguins <- fetch_data("penguins", return_X_y = TRUE)
  penguins$x # data frame
  penguins$y # vector
}
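To illustrate what the dropna and return_X_y arguments mean without needing a network connection, the sketch below applies the same two steps to a made-up data frame. The helpers drop_incomplete and split_x_y are hypothetical, the toy values are invented, and the label column name "target" is an assumption borrowed from the target_name default of nearest_datasets; fetch_data itself may be implemented differently.

```r
# Toy data standing in for a downloaded data set (values are made up).
toy <- data.frame(
  feature_a = c(1.2, NA, 3.4, 4.1),
  feature_b = c("x", "y", "y", "x"),
  target = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring dropna = TRUE: keep only rows without NAs.
drop_incomplete <- function(df) {
  df[stats::complete.cases(df), , drop = FALSE]
}

# Hypothetical helper mirroring return_X_y = TRUE: features and labels apart.
split_x_y <- function(df, target_name = "target") {
  feature_cols <- setdiff(names(df), target_name)
  list(x = df[, feature_cols, drop = FALSE], y = df[[target_name]])
}

clean <- drop_incomplete(toy)
parts <- split_x_y(clean)
parts$x # data frame of features
parts$y # vector of labels
```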
Get type/class of given vector.
get_type(x, include_binary = FALSE)
x |
Input vector. |
include_binary |
Boolean. Whether binary should be counted separately from categorical. |
Type/class of 'x'.
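The labels that get_type returns are not listed here. As a rough sketch of how include_binary could change the outcome, the hypothetical helper type_sketch below uses the labels "continuous", "categorical", and "binary"; those labels and the rules for choosing between them are assumptions for illustration only.

```r
# Hypothetical classifier; labels and rules are assumptions, not pmlbr's.
type_sketch <- function(x, include_binary = FALSE) {
  x <- x[!is.na(x)]
  n_unique <- length(unique(x))
  # Two distinct values count as "binary" only when requested.
  if (include_binary && n_unique == 2) return("binary")
  # Double-precision vectors are treated as continuous.
  if (is.double(x)) return("continuous")
  # Everything else (factor, character, integer, logical) is categorical.
  "categorical"
}

type_sketch(c(0.1, 2.5, 3.3))
type_sketch(factor(c("a", "b", "c")))
type_sketch(c("yes", "no", "yes"), include_binary = TRUE)
```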
Attempts to download a file from a specified URL, retrying a set number of times if the download fails. This function meets CRAN's requirement for gracefully handling the use of internet resources by catching errors and returning a warning message if the download ultimately fails.
graceful_download(url, destfile, retries = 3)
url |
Character. The URL of the file to download. |
destfile |
Character. The path to the destination file where the downloaded content will be saved. |
retries |
Integer. The maximum number of download attempts (default is 3). |
Logical. Returns 'TRUE' if the download succeeds, 'FALSE' otherwise.
## Not run:
dataset_url <- "https://example.com/dataset.csv"
tmp <- tempfile(fileext = ".csv")
success <- graceful_download(dataset_url, tmp)
if (!success) {
  message("Continuing gracefully without the dataset.")
}
## End(Not run)
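The retry-and-warn behavior described above can be sketched in base R with tryCatch. The function download_with_retries and its downloader argument are hypothetical names introduced for this illustration; the downloader argument exists only so the sketch can be exercised without a network connection, and graceful_download itself may work differently.

```r
# Hypothetical sketch of a retrying download; not the package's own code.
download_with_retries <- function(url, destfile, retries = 3,
                                  downloader = utils::download.file) {
  for (attempt in seq_len(retries)) {
    ok <- tryCatch(
      {
        status <- downloader(url, destfile, quiet = TRUE)
        # utils::download.file returns 0 on success.
        identical(as.integer(status), 0L)
      },
      error = function(e) FALSE,
      warning = function(w) FALSE
    )
    if (ok) return(TRUE)
  }
  # All attempts failed: warn instead of raising an error.
  warning("Download failed after ", retries, " attempt(s): ", url,
          call. = FALSE)
  FALSE
}

# Simulated downloader that fails twice and then succeeds (no network needed).
state <- new.env()
state$calls <- 0
flaky_downloader <- function(url, destfile, quiet = TRUE) {
  state$calls <- state$calls + 1
  if (state$calls < 3) stop("simulated network error")
  writeLines("a,b", destfile)
  0L
}

tmp <- tempfile(fileext = ".csv")
success <- download_with_retries("https://example.com/dataset.csv", tmp,
                                 retries = 3, downloader = flaky_downloader)
```

Returning FALSE with a warning, rather than stopping, lets calling code decide how to proceed when the resource is unavailable.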
Find the PMLB datasets most similar to 'x'. If 'x' is a data.frame object, its dataset characteristics are computed first. If 'x' is a character string naming a PMLB dataset, the precomputed dataset statistics/characteristics in 'summary_stats' are used instead.
nearest_datasets(x, ...)

## Default S3 method:
nearest_datasets(x, ...)

## S3 method for class 'character'
nearest_datasets(
  x,
  n_neighbors = 5,
  dimensions = c("n_instances", "n_features"),
  target_name = "target",
  ...
)

## S3 method for class 'data.frame'
nearest_datasets(
  x,
  y = NULL,
  n_neighbors = 5,
  dimensions = c("n_instances", "n_features"),
  task = c("classification", "regression"),
  target_name = "target",
  ...
)
x |
Character string of dataset name from PMLB, or data.frame of n_samples x n_features (or n_features + 1 with a target column) |
... |
Further arguments passed to each method. |
n_neighbors |
Integer. The number of dataset names to return as neighbors. |
dimensions |
Character vector specifying dataset characteristics to include in similarity calculation. Dimensions must correspond to numeric columns of [all_summary_stats.tsv](https://github.com/EpistasisLab/pmlb/blob/master/pmlb/all_summarystats.tsv). If 'all', uses all numeric columns. The usage shown for this function gives c("n_instances", "n_features") as the default. |
target_name |
Character string specifying column of target/dependent variable. |
y |
Vector of target column. Required when 'x' does not contain the target column. |
task |
Character string specifying classification or regression for summary stat generation. |
Character vector of the names of the datasets most similar to 'x', most similar dataset first.
if (interactive()) {
  nearest_datasets('penguins')
  nearest_datasets(fetch_data('penguins'))
}
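The similarity measure used by nearest_datasets is not spelled out here. The sketch below shows one plausible approach: standardize the chosen dimensions and rank datasets by Euclidean distance to the query. The helper nearest_sketch, the toy table, its dataset names, and the choice of distance are all assumptions made for illustration; only the column names n_instances and n_features come from the documented defaults.

```r
# Made-up summary table; real values come from summary_stats().
toy_stats <- data.frame(
  dataset = c("alpha", "beta", "gamma", "delta"),
  n_instances = c(100, 120, 5000, 90),
  n_features = c(4, 5, 40, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rank datasets by standardized Euclidean distance.
nearest_sketch <- function(query, summary_tbl, n_neighbors = 2,
                           dimensions = c("n_instances", "n_features")) {
  m <- as.matrix(summary_tbl[, dimensions, drop = FALSE])
  centers <- colMeans(m)
  spreads <- apply(m, 2, stats::sd)
  # Put every dimension on a comparable scale before measuring distance.
  scaled <- sweep(sweep(m, 2, centers), 2, spreads, "/")
  q <- (unlist(query[dimensions]) - centers) / spreads
  distances <- sqrt(rowSums(sweep(scaled, 2, q)^2))
  ranked <- summary_tbl$dataset[order(distances)]
  ranked[seq_len(min(n_neighbors, length(ranked)))]
}

query <- c(n_instances = 110, n_features = 4)
neighbors <- nearest_sketch(query, toy_stats, n_neighbors = 2)
```

Standardizing first keeps a dimension with a large range, such as n_instances, from dominating the distance.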
The PMLB repository contains a curated collection of data sets for evaluating and comparing machine learning algorithms. These data sets cover a range of applications, and include binary/multi-class classification problems and regression problems, as well as combinations of categorical, ordinal, and continuous features. There are approximately 290 data sets included in the PMLB repository and there are no missing values in these data sets.
This R library includes summaries of the classification and regression data sets but does NOT include any of the PMLB data sets. The data sets can be downloaded using the fetch_data function, which is similar to the corresponding PMLB Python function. See fetch_data and pmlb_metadata for usage examples and further information.
If you use PMLB in a scientific publication, please consider citing the following paper:
Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017). PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, page 36. https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0154-4
Maintainer: Trang Le (https://trang.page/)
Authors:
makeyourownmaker (https://github.com/makeyourownmaker)
Jason Moore (http://www.epistasisblog.org/)
Other contributors:
University of Pennsylvania [copyright holder]
Metadata like summary statistics and names of available datasets on the PMLB repository.
pmlb_metadata()
A list containing summary_stats, dataset_names, classification_datasets, and regression_datasets.
if (interactive()) { sample(pmlb_metadata()$dataset_names, 10) }
Regression datasets
regression_datasets()
A character vector of regression dataset names.
if (interactive()) { sample(regression_datasets(), 10) }
Summary statistics
summary_stats()
A data frame of summary statistics for all available datasets, including number of instances/rows, number of columns/features, task, etc.
if (interactive()) { head(summary_stats()) }