Title: | Interface to the Penn Machine Learning Benchmarks Data Repository |
---|---|
Description: | Check available classification and regression data sets from the PMLB repository and download them. The PMLB repository (<https://github.com/EpistasisLab/pmlb>) contains a curated collection of data sets for evaluating and comparing machine learning algorithms. These data sets cover a range of applications, and include binary/multi-class classification problems and regression problems, as well as combinations of categorical, ordinal, and continuous features. There are currently over 150 data sets included in the PMLB repository. |
Authors: | Trang Le [aut, cre] (https://trang.page/), makeyourownmaker [aut] (https://github.com/makeyourownmaker), Jason Moore [aut] (http://www.epistasisblog.org/), University of Pennsylvania [cph] |
Maintainer: | Trang Le |
License: | GPL-2 | file LICENSE |
Version: | 0.2.2 |
Built: | 2024-11-04 14:59:22 UTC |
Source: | https://github.com/epistasislab/pmlbr |
A list of the names of available classification datasets
classification_dataset_names
An object of class character of length 162.
https://github.com/EpistasisLab/pmlb
Computes imbalance value for a given dataset.
compute_imbalance(target_col)
target_col | Factor or character vector of target column. |
A value of imbalance metric, where zero means that the dataset is perfectly balanced and the higher the value, the more imbalanced the dataset.
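To make the "zero means perfectly balanced" behaviour concrete, here is a minimal base R sketch of one metric with that property. The helper name `imbalance_sketch` is hypothetical, and the formula (squared deviation of class proportions from a uniform split, scaled by an upper bound) is an assumption chosen for illustration; it is not necessarily what compute_imbalance uses internally.

```r
# Hypothetical helper for illustration only; not the package's implementation.
imbalance_sketch <- function(target_col) {
  # Drop missing values and unused factor levels before counting classes.
  x <- as.character(target_col[!is.na(target_col)])
  p <- as.numeric(table(x)) / length(x)  # observed class proportions
  k <- length(p)                         # number of observed classes
  if (k < 2) return(0)
  # Squared deviation of each class proportion from the uniform share 1/k.
  observed <- sum((p - 1 / k)^2)
  # Upper bound: the value if every observation fell into a single class.
  worst <- (1 - 1 / k)^2 + (k - 1) * (1 / k)^2
  observed / worst
}

balanced <- imbalance_sketch(c("a", "a", "b", "b"))
skewed <- imbalance_sketch(c("a", "a", "a", "b"))
print(c(balanced = balanced, skewed = skewed))
```

With pmlbr installed, the documented call would be compute_imbalance(target_col) on the same kind of factor or character vector.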
A list of the names of available datasets
dataset_names
An object of class character of length 284.
https://github.com/EpistasisLab/pmlb
Download a data set from the PMLB repository, (optionally) store it locally, and return the data set. You must be connected to the internet if you are fetching a data set that is not cached locally.
fetch_data(dataset_name, return_X_y = FALSE, local_cache_dir = NA, dropna = TRUE)
dataset_name | The name of the data set to load from PMLB |
return_X_y | Boolean. Whether to return the data with the features and labels stored in separate data structures or a single structure (can be TRUE or FALSE, defaults to FALSE) |
local_cache_dir | The directory on your local machine to store the data files in (defaults to NA, indicating cache will not be used) |
dropna | Boolean. Whether rows with NAs should be automatically dropped. Defaults to TRUE. |
# Features and labels in single data frame
penguins <- fetch_data("penguins")
head(penguins)

# Features and labels stored in separate data structures
penguins <- fetch_data("penguins", return_X_y = TRUE)
penguins$x # data frame
penguins$y # vector
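The examples above do not exercise local_cache_dir. The following is a minimal sketch of caching, assuming the pmlbr package is installed and an internet connection is available on the first call. The cache path is an arbitrary choice for illustration, and the calls are guarded so the snippet is skipped quietly when either assumption fails.

```r
# Hypothetical cache location; any writable directory would do.
cache_dir <- file.path(tempdir(), "pmlb_cache")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

penguins <- NULL
if (requireNamespace("pmlbr", quietly = TRUE)) {
  # The first call needs internet access; per the description above,
  # a data set that is already cached locally does not.
  penguins <- tryCatch(
    pmlbr::fetch_data("penguins", local_cache_dir = cache_dir),
    error = function(e) NULL
  )
}

if (is.data.frame(penguins)) {
  print(dim(penguins))          # rows and columns of the fetched data
  print(list.files(cache_dir))  # files written to the cache directory
}
```

Pointing local_cache_dir at a persistent directory instead of tempdir() would keep the files across R sessions.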
Get type/class of given vector.
get_type(x, include_binary = FALSE)
x | Input vector. |
include_binary | Boolean. Whether binary should be counted separately from categorical. |
Type/class of 'x'.
Attempts to download a file from a specified URL, retrying a set number of times if the download fails. This function meets CRAN's requirement for gracefully handling the use of internet resources by catching errors and returning a warning message if the download ultimately fails.
graceful_download(url, destfile, retries = 3)
url | Character. The URL of the file to download. |
destfile | Character. The path to the destination file where the downloaded content will be saved. |
retries | Integer. The maximum number of download attempts (default is 3). |
Logical. Returns 'TRUE' if the download succeeds, 'FALSE' otherwise.
## Not run:
dataset_url <- "https://example.com/dataset.csv"
tmp <- tempfile(fileext = ".csv")
success <- graceful_download(dataset_url, tmp)
if (!success) {
  message("Continuing gracefully without the dataset.")
}
## End(Not run)
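The retry-then-warn behaviour described for graceful_download can be sketched in base R as follows. The helper name `download_with_retries` is hypothetical and this is not the package's code; it only illustrates the pattern of catching errors, retrying up to `retries` times, and returning a logical instead of stopping. The demonstration uses a local file:// URL built from a Unix-style absolute path so that it runs offline.

```r
# Hypothetical helper for illustration only; not the package's implementation.
download_with_retries <- function(url, destfile, retries = 3) {
  for (attempt in seq_len(retries)) {
    ok <- tryCatch({
      status <- utils::download.file(url, destfile, quiet = TRUE, mode = "wb")
      status == 0  # download.file() returns 0 on success
    },
    error = function(e) FALSE,
    warning = function(w) FALSE)
    if (ok) return(TRUE)
  }
  # All attempts failed: warn rather than stop, then report failure.
  warning("Download failed after ", retries, " attempts: ", url, call. = FALSE)
  FALSE
}

# Offline demonstration: "download" a small local CSV via a file:// URL.
src <- tempfile(fileext = ".csv")
writeLines(c("x,target", "1,0", "2,1"), src)
dest <- tempfile(fileext = ".csv")
ok <- download_with_retries(paste0("file://", src), dest)
if (!ok) message("Continuing gracefully without the dataset.")
```

Treating warnings as failures is a conservative choice in this sketch; a caller who only cares about hard errors could drop the `warning` handler.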
Finds the PMLB datasets most similar to 'x'. If 'x' is a data.frame, its dataset characteristics are computed first. If 'x' is a character string naming a PMLB dataset, the precomputed statistics in 'summary_stats' are used instead.
nearest_datasets(x, ...)

## Default S3 method:
nearest_datasets(x, ...)

## S3 method for class 'character'
nearest_datasets(
  x,
  n_neighbors = 5,
  dimensions = c("n_instances", "n_features"),
  target_name = "target",
  ...
)

## S3 method for class 'data.frame'
nearest_datasets(
  x,
  y = NULL,
  n_neighbors = 5,
  dimensions = c("n_instances", "n_features"),
  task = c("classification", "regression"),
  target_name = "target",
  ...
)
x | Character string of dataset name from PMLB, or data.frame of n_samples x n_features (or n_features+1 with a target column) |
... | Further arguments passed to each method. |
n_neighbors | Integer. The number of dataset names to return as neighbors. |
dimensions | Character vector specifying dataset characteristics to include in similarity calculation. Dimensions must correspond to numeric columns of [all_summary_stats.tsv](https://github.com/EpistasisLab/pmlb/blob/master/pmlb/all_summary_stats.tsv). Defaults to c("n_instances", "n_features"), as shown in the usage above. If 'all', uses all numeric columns. |
target_name | Character string specifying column of target/dependent variable. |
y | Vector of target column. Required when 'x' does not contain the target column. |
task | Character string specifying classification or regression for summary stat generation. |
Character vector of the names of the datasets most similar to 'x', most similar dataset first.
nearest_datasets('penguins')
nearest_datasets(fetch_data('penguins'))
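As a rough illustration of what a similarity lookup over dataset characteristics can look like, here is a base R sketch that ranks rows of a summary table by standardized Euclidean distance on the chosen dimensions. The helper `nearest_sketch`, the toy table, and the choice of distance are all assumptions made for illustration; the package may compute neighbors differently.

```r
# Toy summary table: dataset names and values are made up for illustration.
toy_stats <- data.frame(
  dataset = c("toy_a", "toy_b", "toy_c", "toy_d"),
  n_instances = c(100, 120, 5000, 90),
  n_features = c(4, 5, 50, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper for illustration only; not the package's implementation.
nearest_sketch <- function(query, stats, dimensions, n_neighbors = 2) {
  m <- as.matrix(stats[, dimensions, drop = FALSE])
  q <- unlist(query[dimensions])
  # Standardize each dimension so that no single scale dominates the distance.
  centers <- colMeans(m)
  scales <- apply(m, 2, sd)
  m_std <- sweep(sweep(m, 2, centers, "-"), 2, scales, "/")
  q_std <- (q - centers) / scales
  # Euclidean distance from the query to every row of the table.
  d <- sqrt(rowSums(sweep(m_std, 2, q_std, "-")^2))
  stats$dataset[order(d)][seq_len(min(n_neighbors, nrow(stats)))]
}

query <- list(n_instances = 110, n_features = 4)
neighbors <- nearest_sketch(query, toy_stats, c("n_instances", "n_features"))
print(neighbors)
```

Standardizing first matters here because n_instances spans a much wider range than n_features; without it the instance count alone would decide the ranking.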
The PMLB repository contains a curated collection of data sets for evaluating and comparing machine learning algorithms. These data sets cover a range of applications, and include binary/multi-class classification problems and regression problems, as well as combinations of categorical, ordinal, and continuous features. There are approximately 290 data sets included in the PMLB repository and there are no missing values in these data sets.
This R library includes summaries of the classification and regression data sets but does NOT include any of the PMLB data sets. The data sets can be downloaded using the fetch_data function, which is similar to the corresponding PMLB Python function.
See fetch_data and summary_stats for usage examples and further information.
If you use PMLB in a scientific publication, please consider citing the following paper:
Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017). PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining 10, page 36. https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0154-4
Maintainer: Trang Le (https://trang.page/)
Authors:
makeyourownmaker (https://github.com/makeyourownmaker)
Jason Moore (http://www.epistasisblog.org/)
Other contributors:
University of Pennsylvania [copyright holder]
A list of the names of available regression datasets
regression_dataset_names
An object of class character of length 122.
https://github.com/EpistasisLab/pmlb
Summary statistics for all datasets
summary_stats
A data frame with 10 variables:
Dataset name
Number of data observations (equal to number of rows)
Total number of features (number of columns - 1)
Number of binary features
Number of categorical features
Number of continuous features
Number of classes in target variable
Value type of endpoint/target (can be binary, categorical or continuous)
Imbalance metric, where zero means that the dataset is perfectly balanced and the higher the value, the more imbalanced the dataset
Type of problem/task. Can be classification or regression.
https://github.com/EpistasisLab/pmlb
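A common use of summary_stats is to shortlist datasets by size before fetching any of them. The sketch below assumes the pmlbr package is installed and that the data frame has columns named n_instances and n_features (the names used as default dimensions in nearest_datasets); the thresholds are arbitrary. Everything is guarded so the snippet does nothing when those assumptions do not hold.

```r
small <- NULL
if (requireNamespace("pmlbr", quietly = TRUE)) {
  stats <- tryCatch(pmlbr::summary_stats, error = function(e) NULL)
  # Assumed column names; skip the filter if they are not present.
  has_cols <- is.data.frame(stats) &&
    all(c("n_instances", "n_features") %in% names(stats))
  if (has_cols) {
    # Keep datasets with fewer than 500 rows and at most 10 features.
    small <- stats[stats$n_instances < 500 & stats$n_features <= 10, ]
    print(head(small))
  }
}
```

The rows of the filtered data frame identify candidate datasets whose names can then be passed to fetch_data.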