Models
- class dnamite.models.DNAMiteRegressor(n_embed=32, n_hidden=32, n_layers=2, max_bins=32, min_samples_per_bin=None, validation_size=0.2, n_val_splits=5, learning_rate=0.0005, max_epochs=100, batch_size=128, device='cpu', kernel_size=5, kernel_weight=3, pair_kernel_size=3, pair_kernel_weight=3, monotone_constraints=None, num_pairs=0, verbosity=0, random_state=None)
DNAMiteRegressor is a model for regression using the DNAMite architecture.
- Parameters:
n_embed (int, optional (default=32)) – The size of the embedding layer.
n_hidden (int, optional (default=32)) – The number of hidden units in the hidden layers.
n_layers (int, default=2) – Number of hidden layers in the model.
max_bins (int, optional (default=32)) – The maximum number of bins for discretizing continuous features.
min_samples_per_bin (int, default=None) – Minimum number of samples required in each bin. Default is None which sets it to min(n_train_samples / 100, 50).
validation_size (float, optional (default=0.2)) – The proportion of the dataset to include in the validation split.
n_val_splits (int, optional (default=5)) – The number of validation splits for cross-validation.
learning_rate (float, optional (default=5e-4)) – The learning rate for the optimizer.
max_epochs (int, optional (default=100)) – The maximum number of epochs for training.
batch_size (int, optional (default=128)) – The batch size for training.
device (str, optional (default="cpu")) – The device to run the model on (“cpu” or “cuda”).
kernel_size (int, optional (default=5)) – The size of the kernel in convolutional layers for single features.
kernel_weight (float, optional (default=3)) – The weight of the kernel for single feature convolutional layers.
pair_kernel_size (int, optional (default=3)) – The size of the kernel for pairwise convolutional layers.
pair_kernel_weight (float, optional (default=3)) – The weight of the kernel for pairwise convolutional layers.
monotone_constraints (list or None, optional (default=None)) – The monotonic constraints for the features. 0 indicates no constraint, 1 indicates increasing, -1 indicates decreasing. None means no constraints.
num_pairs (int, default=0) – Number of pairwise interactions to use in the model.
verbosity (int, default=0) – Level of verbosity for logging. 0: Warning, 1: Info, 2: Debug
random_state (int, optional) – Random seed for reproducibility.
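Two of the constructor arguments above are easier to read with a small sketch. The helper names below are hypothetical and not part of dnamite. The first reproduces the documented default for min_samples_per_bin literally (whether the library rounds the result is not stated here). The second builds a monotone_constraints list under the assumption that the list holds one entry per feature, in the column order of X.

```python
def default_min_samples_per_bin(n_train_samples):
    """Documented default: min(n_train_samples / 100, 50)."""
    return min(n_train_samples / 100, 50)


def build_monotone_constraints(feature_names, increasing=(), decreasing=()):
    """Return one constraint per feature: 1 increasing, -1 decreasing, 0 none.

    Assumption: constraints are positional and follow the column order of X.
    """
    increasing, decreasing = set(increasing), set(decreasing)
    overlap = increasing & decreasing
    if overlap:
        raise ValueError(f"features marked both ways: {sorted(overlap)}")
    unknown = (increasing | decreasing) - set(feature_names)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return [
        1 if name in increasing else -1 if name in decreasing else 0
        for name in feature_names
    ]


print(default_min_samples_per_bin(2000))  # 20.0
print(build_monotone_constraints(
    ["age", "income", "bmi"], increasing=["age"], decreasing=["bmi"]
))  # [1, 0, -1]
```

The resulting list could then be passed as monotone_constraints=... when constructing the model.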
- fit(X, y, pairs_list=None, partialed_feats=None)
Train model.
- Parameters:
X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for training. Missing values should be encoded as np.nan. Categorical features will automatically be detected as all columns with dtype “object” or “category”.
y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The labels, should be floats in (-inf, inf).
pairs_list (list of tuple[str, str] or None, optional) – List of feature interactions to include; if None, no specific pairs are used.
partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.
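Since pairs_list is a list of (str, str) tuples naming columns of X, it can help to validate the pairs before calling fit. The helper below is a hypothetical sketch, not a dnamite function. Treating (a, b) and (b, a) as the same interaction is an assumption made here for de-duplication.

```python
def normalize_pairs(pairs, columns):
    """Validate feature pairs against the column names and drop duplicates."""
    columns = set(columns)
    seen = set()
    result = []
    for pair in pairs:
        if len(pair) != 2:
            raise ValueError(f"expected a pair of names, got {pair!r}")
        first, second = pair
        if first == second:
            raise ValueError(f"a feature cannot interact with itself: {first!r}")
        for name in (first, second):
            if name not in columns:
                raise ValueError(f"unknown feature: {name!r}")
        key = frozenset((first, second))  # order-insensitive identity
        if key not in seen:
            seen.add(key)
            result.append((first, second))
    return result


pairs = normalize_pairs(
    [("age", "bmi"), ("bmi", "age"), ("age", "income")],
    columns=["age", "bmi", "income"],
)
print(pairs)  # [('age', 'bmi'), ('age', 'income')]
```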
- get_feature_importances(missing_bin='include')
Get the feature importance scores for all features in the model.
- Parameters:
missing_bin (str, default="include") –
How to handle missing bin when calculating feature importances:
“include” - include the missing bin.
“ignore” - ignore the missing bin.
“stratify” - calculate separate importances for missing and non-missing bins.
- Returns:
A DataFrame containing the feature importance scores for each feature.
- Return type:
pandas.DataFrame
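The three missing_bin options can be illustrated on a toy feature. The sketch below assumes an importance defined as the count-weighted mean absolute bin score and assumes the missing bin is stored first. Both are assumptions for illustration only; the reference above does not state how dnamite computes importances.

```python
def toy_importance(bin_scores, bin_counts, missing_bin="include"):
    """Count-weighted mean |score|; index 0 is taken to be the missing bin."""

    def weighted_abs_mean(scores, counts):
        total = sum(counts)
        if total == 0:
            return 0.0
        return sum(abs(s) * c for s, c in zip(scores, counts)) / total

    if missing_bin == "include":
        return weighted_abs_mean(bin_scores, bin_counts)
    if missing_bin == "ignore":
        return weighted_abs_mean(bin_scores[1:], bin_counts[1:])
    if missing_bin == "stratify":
        return {
            "missing": weighted_abs_mean(bin_scores[:1], bin_counts[:1]),
            "observed": weighted_abs_mean(bin_scores[1:], bin_counts[1:]),
        }
    raise ValueError(f"unknown missing_bin option: {missing_bin!r}")


scores = [2.0, -1.0, 1.0]  # missing bin first, then two observed bins
counts = [10, 45, 45]
for option in ("include", "ignore", "stratify"):
    print(option, toy_importance(scores, counts, option))
```

In this toy case a strongly scored missing bin raises the "include" value above the "ignore" value, and "stratify" reports the two parts separately.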
- get_pair_shape_function(feat1_name, feat2_name)
Get the shape function data for an interaction effect.
- Parameters:
feat1_name (str) – The name of the first feature in the pair/interaction.
feat2_name (str) – The name of the second feature in the pair/interaction.
- Returns:
A DataFrame containing the shape function data for the interaction effect.
- Return type:
pandas.DataFrame
- get_regularization_path(X, y, init_reg_param, partialed_feats=None)
Get the regularization path for the model.
- Parameters:
X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.
y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.
init_reg_param (float) – Initial regularization parameter.
partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.
- Returns:
A DataFrame containing the regularization path.
- Return type:
pandas.DataFrame
- get_shape_function(feature_name)
Get the shape function data, i.e. the bin scores, for a given feature.
- Parameters:
feature_name (str) – The name of the feature.
- Returns:
A DataFrame containing the bin scores for the feature.
- Return type:
pandas.DataFrame
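Bin scores are what make an additive model readable: each feature value falls into a bin, and the prediction is the sum of the scores of those bins. The sketch below shows that lookup with its own data layout (cut points, per-bin scores, and a separate missing-bin score). The layout, the intercept term, and the function names are assumptions, not the column structure of the DataFrame that get_shape_function returns.

```python
import math
from bisect import bisect_right


def feature_contribution(value, cut_points, bin_scores, missing_score):
    """Look up the score of the bin that contains value."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return missing_score
    # len(bin_scores) == len(cut_points) + 1
    return bin_scores[bisect_right(cut_points, value)]


def additive_prediction(row, shape_functions, intercept=0.0):
    """Sum the per-feature bin scores for one sample."""
    return intercept + sum(
        feature_contribution(row[name], *shape)
        for name, shape in shape_functions.items()
    )


shapes = {
    "age": ([30, 50], [-1.0, 0.0, 2.0], 0.5),  # cut points, scores, missing
    "bmi": ([25], [-0.5, 0.5], 0.0),
}
print(additive_prediction({"age": 55, "bmi": 22}, shapes, intercept=1.0))  # 2.5
```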
- plot_feature_importances(n_features=10, missing_bin='include')
Plot a bar plot with the importance scores of the top n_features features.
- Parameters:
n_features (int, default=10) – Number of features to plot.
missing_bin (str, default="include") –
How to handle missing bin when calculating feature importances:
“include” - include the missing bin.
“ignore” - ignore the missing bin.
“stratify” - calculate separate importances for missing and non-missing bins.
- plot_pair_shape_function(feat1_name, feat2_name)
Plot a heatmap for an interaction shape function.
- Parameters:
feat1_name (str) – The name of the first feature in the pair/interaction.
feat2_name (str) – The name of the second feature in the pair/interaction.
- plot_shape_function(feature_names, plot_missing_bin=False, axes=None)
Plot the shape function for given feature(s).
- Parameters:
feature_names (str or list of str) – The name of the feature(s) to plot.
plot_missing_bin (bool, default=False) – Whether to plot the missing bin. Only applicable for continuous features.
- predict(X_test)
Predict labels using the trained model.
- Parameters:
X_test (pandas.DataFrame, shape (n_samples, n_features)) – The input features for prediction.
- select_features(X, y, reg_param, select_pairs=False, partialed_feats=None, gamma=None, pair_gamma=None, pair_reg_param=0, entropy_param=0)
Perform feature selection. Selected features and pairs will be stored in self.selected_feats_ and self.selected_pairs_, respectively. Should be called before fit if feature selection is desired.
- Parameters:
X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.
y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.
reg_param (float) – Regularization parameter for feature-level regularization.
select_pairs (bool, default=False) – Whether to select feature pairs in addition to individual features.
partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.
gamma (float, default=1) – Regularization or scaling parameter; purpose depends on specific use in the model.
pair_gamma (float or None, default=None) – Regularization or scaling parameter for feature pairs. If None, defaults to gamma.
pair_reg_param (float, default=0) – Regularization parameter for feature pairs.
entropy_param (float, default=0) – Entropy parameter to control the diversity or uncertainty.
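The call order described for select_features (selection first, then fit) can be sketched as a small wrapper. The wrapper name is hypothetical. It accepts any object exposing the select_features and fit signatures documented here, and it reads selected_pairs_ defensively because this reference does not say whether that attribute is set when select_pairs=False.

```python
def select_then_fit(model, X, y, reg_param, select_pairs=False):
    """Run feature selection, then fit, and return what was selected."""
    model.select_features(X, y, reg_param=reg_param, select_pairs=select_pairs)
    model.fit(X, y)
    # Assumption: selected_pairs_ may be absent when no pairs were selected.
    return model.selected_feats_, getattr(model, "selected_pairs_", None)
```

With dnamite installed, a call might look like select_then_fit(DNAMiteRegressor(random_state=0), X, y, reg_param=0.1), where the reg_param value is only a placeholder and would need tuning for real data.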
- class dnamite.models.DNAMiteBinaryClassifier(n_embed=32, n_hidden=32, n_layers=2, max_bins=32, min_samples_per_bin=None, validation_size=0.2, n_val_splits=5, learning_rate=0.0005, max_epochs=100, batch_size=128, device='cpu', kernel_size=5, kernel_weight=3, pair_kernel_size=3, pair_kernel_weight=3, monotone_constraints=None, num_pairs=0, verbosity=0, random_state=None)
DNAMiteBinaryClassifier is a model for binary classification using the DNAMite architecture.
- Parameters:
n_embed (int, optional (default=32)) – The size of the embedding layer.
n_hidden (int, optional (default=32)) – The number of hidden units in the hidden layers.
n_layers (int, default=2) – Number of hidden layers in the model.
max_bins (int, optional (default=32)) – The maximum number of bins for discretizing continuous features.
min_samples_per_bin (int, default=None) – Minimum number of samples required in each bin. Default is None which sets it to min(n_train_samples / 100, 50).
validation_size (float, optional (default=0.2)) – The proportion of the dataset to include in the validation split.
n_val_splits (int, optional (default=5)) – The number of validation splits for cross-validation.
learning_rate (float, optional (default=5e-4)) – The learning rate for the optimizer.
max_epochs (int, optional (default=100)) – The maximum number of epochs for training.
batch_size (int, optional (default=128)) – The batch size for training.
device (str, optional (default="cpu")) – The device to run the model on (“cpu” or “cuda”).
kernel_size (int, optional (default=5)) – The size of the kernel in convolutional layers for single features.
kernel_weight (float, optional (default=3)) – The weight of the kernel for single feature convolutional layers.
pair_kernel_size (int, optional (default=3)) – The size of the kernel for pairwise convolutional layers.
pair_kernel_weight (float, optional (default=3)) – The weight of the kernel for pairwise convolutional layers.
monotone_constraints (list or None, optional (default=None)) – The monotonic constraints for the features. 0 indicates no constraint, 1 indicates increasing, -1 indicates decreasing. None means no constraints.
num_pairs (int, default=0) – Number of pairwise interactions to use in the model.
verbosity (int, default=0) – Level of verbosity for logging. 0: Warning, 1: Info, 2: Debug
random_state (int, optional) – Random seed for reproducibility.
- fit(X, y, pairs_list=None, partialed_feats=None)
Train model.
- Parameters:
X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for training. Missing values should be encoded as np.nan. Categorical features will automatically be detected as all columns with dtype “object” or “category”.
y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The labels. Should have two unique values.
pairs_list (list of tuple[str, str] or None, optional) – List of feature interactions to include; if None, no specific pairs are used.
partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.
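Because y should contain exactly two unique values, a quick pre-check can surface label problems before training starts. The function below is a hypothetical helper, not part of dnamite, and it makes no assumption about which two values are used.

```python
def check_binary_labels(y):
    """Return the two distinct label values, or raise if there are not two."""
    values = sorted(set(y))
    if len(values) != 2:
        raise ValueError(
            f"expected exactly two unique label values, got {values}"
        )
    return values


print(check_binary_labels([0, 1, 1, 0]))  # [0, 1]
```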
- get_feature_importances(missing_bin='include')
Get the feature importance scores for all features in the model.
- Parameters:
missing_bin (str, default="include") –
How to handle missing bin when calculating feature importances:
“include” - include the missing bin.
“ignore” - ignore the missing bin.
“stratify” - calculate separate importances for missing and non-missing bins.
- Returns:
A DataFrame containing the feature importance scores for each feature.
- Return type:
pandas.DataFrame
- get_pair_shape_function(feat1_name, feat2_name)
Get the shape function data for an interaction effect.
- Parameters:
feat1_name (str) – The name of the first feature in the pair/interaction.
feat2_name (str) – The name of the second feature in the pair/interaction.
- Returns:
A DataFrame containing the shape function data for the interaction effect.
- Return type:
pandas.DataFrame
- get_regularization_path(X, y, init_reg_param, partialed_feats=None)
Get the regularization path for the model.
- Parameters:
X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.
y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.
init_reg_param (float) – Initial regularization parameter.
partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.
- Returns:
A DataFrame containing the regularization path.
- Return type:
pandas.DataFrame
- get_shape_function(feature_name)
Get the shape function data, i.e. the bin scores, for a given feature.
- Parameters:
feature_name (str) – The name of the feature.
- Returns:
A DataFrame containing the bin scores for the feature.
- Return type:
pandas.DataFrame
- plot_feature_importances(n_features=10, missing_bin='include')
Plot a bar plot with the importance scores of the top n_features features.
- Parameters:
n_features (int, default=10) – Number of features to plot.
missing_bin (str, default="include") –
How to handle missing bin when calculating feature importances:
“include” - include the missing bin.
“ignore” - ignore the missing bin.
“stratify” - calculate separate importances for missing and non-missing bins.
- plot_pair_shape_function(feat1_name, feat2_name)
Plot a heatmap for an interaction shape function.
- Parameters:
feat1_name (str) – The name of the first feature in the pair/interaction.
feat2_name (str) – The name of the second feature in the pair/interaction.
- plot_shape_function(feature_names, plot_missing_bin=False, axes=None)
Plot the shape function for given feature(s).
- Parameters:
feature_names (str or list of str) – The name of the feature(s) to plot.
plot_missing_bin (bool, default=False) – Whether to plot the missing bin. Only applicable for continuous features.
- predict(X_test)
Predict labels using the trained model.
- Parameters:
X_test (pandas.DataFrame, shape (n_samples, n_features)) – The input features for prediction.
- predict_proba(X_test)
Predict probabilities using the trained model.
- Parameters:
X_test (pandas.DataFrame, shape (n_samples, n_features)) – The input features for prediction.
- score(X, y)
Return the AUC score on the given test data and labels.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,)) – True labels for X.
- Returns:
score – The AUC score.
- Return type:
float
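The Returns entry describes the score as an AUC. For readers unfamiliar with the metric, the sketch below computes ROC AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. It assumes labels coded as 0 and 1 and is an illustration only, not dnamite's implementation.

```python
def roc_auc(y_true, y_score):
    """Pairwise ROC AUC for labels in {0, 1}."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise form is quadratic in the number of samples, so it is suited to small checks rather than large test sets.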
- select_features(X, y, reg_param, select_pairs=False, partialed_feats=None, gamma=None, pair_gamma=None, pair_reg_param=0, entropy_param=0)
Perform feature selection. Selected features and pairs will be stored in self.selected_feats_ and self.selected_pairs_, respectively. Should be called before fit if feature selection is desired.
- Parameters:
X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.
y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.
reg_param (float) – Regularization parameter for feature-level regularization.
select_pairs (bool, default=False) – Whether to select feature pairs in addition to individual features.
partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.
gamma (float, default=1) – Regularization or scaling parameter; purpose depends on specific use in the model.
pair_gamma (float or None, default=None) – Regularization or scaling parameter for feature pairs. If None, defaults to gamma.
pair_reg_param (float, default=0) – Regularization parameter for feature pairs.
entropy_param (float, default=0) – Entropy parameter to control the diversity or uncertainty.
- class dnamite.models.DNAMiteSurvival(n_eval_times=100, n_embed=32, n_hidden=32, n_layers=2, max_bins=32, min_samples_per_bin=None, validation_size=0.2, n_val_splits=5, learning_rate=0.0005, max_epochs=100, batch_size=128, device='cpu', kernel_size=5, kernel_weight=3, pair_kernel_size=3, pair_kernel_weight=3, monotone_constraints=None, num_pairs=0, verbosity=0, random_state=None, censor_estimator='km')
DNAMiteSurvival is a model for survival analysis using the DNAMite architecture.
- Parameters:
n_eval_times (int, optional (default=100)) – The number of evaluation times for survival analysis.
n_embed (int, optional (default=32)) – The size of the embedding layer.
n_hidden (int, optional (default=32)) – The number of hidden units in the hidden layers.
n_layers (int, optional (default=2)) – The number of layers in the model.
max_bins (int, optional (default=32)) – The maximum number of bins for discretizing continuous features.
validation_size (float, optional (default=0.2)) – The proportion of the dataset to include in the validation split.
n_val_splits (int, optional (default=5)) – The number of validation splits for cross-validation.
learning_rate (float, optional (default=5e-4)) – The learning rate for the optimizer.
max_epochs (int, optional (default=100)) – The maximum number of epochs for training.
min_samples_per_bin (int, default=None) – Minimum number of samples required in each bin. Default is None which sets it to min(n_train_samples / 100, 50).
batch_size (int, optional (default=128)) – The batch size for training.
device (str, optional (default="cpu")) – The device to run the model on (“cpu” or “cuda”).
kernel_size (int, optional (default=5)) – The size of the kernel in convolutional layers.
kernel_weight (float, optional (default=3)) – The weight of the kernel in convolutional layers.
pair_kernel_size (int, optional (default=3)) – The size of the kernel for pairwise convolutional layers.
pair_kernel_weight (float, optional (default=3)) – The weight of the kernel for pairwise convolutional layers.
monotone_constraints (list or None, optional (default=None)) – The monotonic constraints for the features. 0 indicates no constraint, 1 indicates increasing, -1 indicates decreasing. None means no constraints.
num_pairs (int, default=0) – Number of pairwise interactions to use in the model.
verbosity (int, default=0) – Level of verbosity for logging. 0: Warning, 1: Info, 2: Debug
random_state (int, optional) – Random seed for reproducibility.
censor_estimator (str, optional (default="km")) – The estimator to use for estimating the censoring distribution. “km” for Kaplan-Meier, “cox” for Cox proportional hazards.
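The "km" option refers to the Kaplan-Meier estimator. The sketch below is a textbook product-limit estimator in plain Python, not dnamite's code. To estimate a censoring distribution rather than an event distribution, the usual convention is to flip the event indicator so that censored samples count as events; the second call shows that.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with an event."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all samples sharing the same time.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_removed
    return curve


times = [1, 2, 3, 4]
events = [True, False, True, True]
print(kaplan_meier(times, events))  # [(1, 0.75), (3, 0.375), (4, 0.0)]
# Censoring distribution: treat censored samples as the "events".
print(kaplan_meier(times, [not e for e in events]))
```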
- fit(X, y, pairs_list=None, partialed_feats=None)
Train model.
- Parameters:
X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for training.
y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable or labels.
pairs_list (list of tuple[str, str] or None, optional) – List of feature interactions to include; if None, no specific pairs are used.
partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.
- get_calibration_data(X, y, eval_time, n_bins=10, binning_method='quantile')
Get calibration data to assess the calibration of the model at a given evaluation time.
- Parameters:
X (array-like) – Input data used to generate prediction. Should usually be a held-out test set.
y (structured np.array of shape (n_samples,) with dtype [("event", bool), ("time", float)]) – Survival labels corresponding to X.
eval_time (float) – Evaluation time to assess calibration at.
n_bins (int, optional (default=10)) – Number of bins to use for binned Kaplan-Meier estimate.
binning_method (str, optional (default="quantile")) – Method for binning predictions. Options are “quantile” or “uniform”.
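The structured label array described for y can be built with NumPy from separate event and time sequences. The helper name below is hypothetical; only the dtype comes from the parameter description above.

```python
import numpy as np


def make_survival_labels(events, times):
    """Build a structured array with dtype [("event", bool), ("time", float)]."""
    events = np.asarray(events, dtype=bool)
    times = np.asarray(times, dtype=float)
    if events.ndim != 1 or events.shape != times.shape:
        raise ValueError("events and times must be 1-D and the same length")
    y = np.empty(events.shape[0], dtype=[("event", bool), ("time", float)])
    y["event"] = events
    y["time"] = times
    return y


y = make_survival_labels([1, 0, 1], [5.0, 12.5, 3.2])
print(y.dtype.names)  # ('event', 'time')
print(y["time"])
```

Fields are then read by name, for example y["event"] for the event indicators.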
- get_feature_importances(eval_time=None, missing_bin='include')
Get the feature importance scores for all features in the model.
- Parameters:
eval_time (float or None, default=None) – The evaluation time for which to compute the feature importance.
missing_bin (str, default="include") –
How to handle missing bin when calculating feature importances:
“include” - include the missing bin.
“ignore” - ignore the missing bin.
“stratify” - calculate separate importances for missing and non-missing bins.
- Returns:
A DataFrame containing the feature importance scores for each feature.
- Return type:
pandas.DataFrame
- get_pair_shape_function(feat1_name, feat2_name, eval_time)
Get the shape function data for an interaction effect.
- Parameters:
feat1_name (str) – The name of the first feature in the pair/interaction.
feat2_name (str) – The name of the second feature in the pair/interaction.
eval_time (float) – The evaluation time to compute the interaction shape function at.
- Returns:
A DataFrame containing the shape function data for the interaction effect.
- Return type:
pandas.DataFrame
- get_regularization_path(X, y, init_reg_param, partialed_feats=None)
Get the regularization path for the model.
- Parameters:
X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.
y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.
init_reg_param (float) – Initial regularization parameter.
partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.
- Returns:
A DataFrame containing the regularization path.
- Return type:
pandas.DataFrame
- get_shape_function(feature_name, eval_time)
Get the shape function data, i.e. the bin scores, for a given feature.
- Parameters:
feature_name (str) – The name of the feature.
eval_time (float) – The evaluation time for which to compute the shape function.
- Returns:
A DataFrame containing the bin scores for the feature.
- Return type:
pandas.DataFrame
- make_calibration_plot(X, y, eval_times, n_bins=10, binning_method='quantile')
Make a calibration plot to assess the calibration of the model at the given evaluation time(s).
- Parameters:
X (array-like) – Input data used to generate prediction. Should usually be a held-out test set.
y (structured np.array of shape (n_samples,) with dtype [("event", bool), ("time", float)]) – Survival labels corresponding to X.
eval_times (float or list of float) – Evaluation time(s) to assess calibration at.
n_bins (int, optional (default=10)) – Number of bins to use for binned Kaplan-Meier estimate.
binning_method (str, optional (default="quantile")) – Method for binning predictions. Options are “quantile” or “uniform”.
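The two binning_method options differ in how predictions are grouped. The sketch below shows the general meaning of each: "quantile" gives bins with near-equal sample counts, while "uniform" gives bins of equal width over the range of predictions. Both functions are hypothetical illustrations, and dnamite's exact handling of ties and bin edges may differ.

```python
def quantile_bins(preds, n_bins):
    """Assign bin indices so each bin holds a near-equal number of samples."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    bins = [0] * len(preds)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(preds)
    return bins


def uniform_bins(preds, n_bins):
    """Assign bin indices using equal-width intervals over [min, max]."""
    lo, hi = min(preds), max(preds)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(preds)
    # The maximum value is clamped into the last bin.
    return [min(int((p - lo) / width), n_bins - 1) for p in preds]


preds = [0.05, 0.10, 0.15, 0.90]
print(quantile_bins(preds, 2))  # [0, 0, 1, 1]
print(uniform_bins(preds, 2))   # [0, 0, 0, 1]
```

With skewed predictions like these, uniform bins can end up nearly empty, which is one reason a quantile default is common for calibration curves.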
- plot_feature_importances(n_features=10, eval_times=None, missing_bin='include')
Plot a bar plot with the importance scores of the top n_features features.
- Parameters:
n_features (int, default=10) – Number of features to plot.
eval_times (float, list or None, default=None) – The evaluation time(s) for which to compute the feature importance. None means to compute the importance over all evaluation times.
missing_bin (str, default="include") –
How to handle missing bin when calculating feature importances:
“include” - include the missing bin.
“ignore” - ignore the missing bin.
“stratify” - calculate separate importances for missing and non-missing bins.
- plot_pair_shape_function(feat1_name, feat2_name, eval_times)
Plot a heatmap for an interaction shape function.
- Parameters:
feat1_name (str) – The name of the first feature in the pair/interaction.
feat2_name (str) – The name of the second feature in the pair/interaction.
eval_times (float or list of float) – The evaluation time(s) to plot the interaction shape function at.
- plot_shape_function(feature_names, eval_times, plot_missing_bin=False)
Plot the shape function for given feature(s).
- Parameters:
feature_names (str or list of str) – The name of the feature(s) to plot.
eval_times (float or list of float) – The evaluation time(s) for which to compute the shape function.
plot_missing_bin (bool, default=False) – Whether to plot the missing bin. Only applicable for continuous features.
- predict(X_test)
Predict labels using the trained model.
- Parameters:
X_test (pandas.DataFrame, shape (n_samples, n_features)) – The input features for prediction.
- predict_survival(X_test, test_times=None)
Predict the survival probability for a set of evaluation times.
- Parameters:
X_test (pandas.DataFrame, shape (n_samples, n_features)) – The input features for prediction.
test_times (array-like, optional) – The evaluation times to predict the survival probability at. If None, the evaluation times used for training will be used.
- Returns:
The predicted survival probabilities for each sample at each evaluation time.
- Return type:
np.ndarray of shape (n_samples, n_eval_times)
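Each row of the returned array is one sample's survival curve over the evaluation times. A common summary is the median survival time: the first evaluation time at which the predicted survival probability drops to 0.5 or below. The helper below is a hypothetical sketch that assumes the times are sorted ascending and aligned with the row's columns. Passing test_times explicitly to predict_survival is one way to be sure which times the columns refer to.

```python
def median_survival_time(surv_row, eval_times):
    """First evaluation time with survival <= 0.5, or None if never reached."""
    for t, s in zip(eval_times, surv_row):
        if s <= 0.5:
            return t
    return None


eval_times = [1.0, 2.0, 3.0, 4.0]
print(median_survival_time([0.9, 0.7, 0.4, 0.2], eval_times))  # 3.0
print(median_survival_time([0.9, 0.8, 0.7, 0.6], eval_times))  # None
```

A None result means the curve stays above 0.5 across all supplied times, so the median lies beyond the last evaluation time.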
- select_features(X, y, reg_param, select_pairs=False, partialed_feats=None, gamma=None, pair_gamma=None, pair_reg_param=0, entropy_param=0)
Perform feature selection. Selected features and pairs will be stored in self.selected_feats_ and self.selected_pairs_, respectively. Should be called before fit if feature selection is desired.
- Parameters:
X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.
y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.
reg_param (float) – Regularization parameter for feature-level regularization.
select_pairs (bool, default=False) – Whether to select feature pairs in addition to individual features.
partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.
gamma (float, default=1) – Regularization or scaling parameter; purpose depends on specific use in the model.
pair_gamma (float or None, default=None) – Regularization or scaling parameter for feature pairs. If None, defaults to gamma.
pair_reg_param (float, default=0) – Regularization parameter for feature pairs.
entropy_param (float, default=0) – Entropy parameter to control the diversity or uncertainty.