Models

class dnamite.models.DNAMiteRegressor(n_embed=32, n_hidden=32, n_layers=2, max_bins=32, min_samples_per_bin=None, validation_size=0.2, n_val_splits=5, learning_rate=0.0005, max_epochs=100, batch_size=128, device='cpu', kernel_size=5, kernel_weight=3, pair_kernel_size=3, pair_kernel_weight=3, monotone_constraints=None, num_pairs=0, verbosity=0, random_state=None)

DNAMiteRegressor is a model for regression using the DNAMite architecture.

Parameters:
  • n_embed (int, optional (default=32)) – The size of the embedding layer.

  • n_hidden (int, optional (default=32)) – The number of hidden units in the hidden layers.

  • n_layers (int, default=2) – Number of hidden layers in the model.

  • max_bins (int, optional (default=32)) – The maximum number of bins for discretizing continuous features.

  • min_samples_per_bin (int or None, default=None) – Minimum number of samples required in each bin. If None, it is set to min(n_train_samples / 100, 50).

  • validation_size (float, optional (default=0.2)) – The proportion of the dataset to include in the validation split.

  • n_val_splits (int, optional (default=5)) – The number of validation splits for cross-validation.

  • learning_rate (float, optional (default=5e-4)) – The learning rate for the optimizer.

  • max_epochs (int, optional (default=100)) – The maximum number of epochs for training.

  • batch_size (int, optional (default=128)) – The batch size for training.

  • device (str, optional (default="cpu")) – The device to run the model on (“cpu” or “cuda”).

  • kernel_size (int, optional (default=5)) – The size of the kernel in convolutional layers for single features.

  • kernel_weight (float, optional (default=3)) – The weight of the kernel for single feature convolutional layers.

  • pair_kernel_size (int, optional (default=3)) – The size of the kernel for pairwise convolutional layers.

  • pair_kernel_weight (float, optional (default=3)) – The weight of the kernel for pairwise convolutional layers.

  • monotone_constraints (list or None, optional (default=None)) – The monotonic constraints for the features. 0 indicates no constraint, 1 indicates increasing, -1 indicates decreasing. None means no constraints.

  • num_pairs (int, default=0) – Number of pairwise interactions to use in the model.

  • verbosity (int, default=0) – Level of verbosity for logging. 0: Warning, 1: Info, 2: Debug

  • random_state (int, optional) – Random seed for reproducibility.
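The monotone_constraints list can be tedious to write by hand when there are many features. A minimal sketch of a helper that builds it from a name-to-direction mapping follows. It assumes the list holds one entry per feature in the column order of X, which the description above does not state explicitly; the helper name and the feature names are hypothetical and not part of dnamite.

```python
def build_monotone_constraints(feature_names, directions):
    """Return one constraint per feature: 1, -1, or 0 (no constraint).

    Assumption: entries follow the column order of the training DataFrame.
    """
    unknown = set(directions) - set(feature_names)
    if unknown:
        raise KeyError(f"unknown features: {sorted(unknown)}")
    for name, direction in directions.items():
        if direction not in (-1, 0, 1):
            raise ValueError(f"{name}: constraint must be -1, 0, or 1")
    # Features that are not mentioned stay unconstrained (0).
    return [directions.get(name, 0) for name in feature_names]


# Hypothetical feature names, for illustration only.
constraints = build_monotone_constraints(
    ["age", "income", "zip_code"],
    {"age": 1, "zip_code": -1},
)
print(constraints)  # [1, 0, -1]
```

The resulting list could then be passed as monotone_constraints when constructing the model.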

fit(X, y, pairs_list=None, partialed_feats=None)

Train model.

Parameters:
  • X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for training. Missing values should be encoded as np.nan. Categorical features will automatically be detected as all columns with dtype “object” or “category”.

  • y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The labels, should be floats in (-inf, inf).

  • pairs_list (list of tuple[str, str] or None, optional) – List of feature interactions to include; if None, no specific pairs are used.

  • partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.
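As an illustration of the pairs_list format, the sketch below builds a list of (str, str) tuples from feature names. Enumerating every pair is only one possible choice, and the feature names are hypothetical; in practice the list would usually contain a hand-picked subset.

```python
from itertools import combinations


def all_feature_pairs(feature_names):
    """Return every unordered pair of feature names as (str, str) tuples."""
    return list(combinations(feature_names, 2))


# Hypothetical feature names, for illustration only.
pairs_list = all_feature_pairs(["age", "bmi", "glucose"])
print(pairs_list)
# [('age', 'bmi'), ('age', 'glucose'), ('bmi', 'glucose')]
```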

get_feature_importances(missing_bin='include')

Get the feature importance scores for all features in the model.

Parameters:

missing_bin (str, default="include") –

How to handle missing bin when calculating feature importances:

  • “include” - include the missing bin.

  • “ignore” - ignore the missing bin.

  • “stratify” - calculate separate importances for missing and non-missing bins.

Returns:

A DataFrame containing the feature importance scores for each feature.

Return type:

pandas.DataFrame

get_pair_shape_function(feat1_name, feat2_name)

Get the shape function data for an interaction effect.

Parameters:
  • feat1_name (str) – The name of the first feature in the pair/interaction.

  • feat2_name (str) – The name of the second feature in the pair/interaction.

Returns:

A DataFrame containing the shape function data for the interaction effect.

Return type:

pandas.DataFrame

get_regularization_path(X, y, init_reg_param, partialed_feats=None)

Get the regularization path for the model.

Parameters:
  • X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.

  • y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.

  • init_reg_param (float) – Initial regularization parameter.

  • partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.

Returns:

A DataFrame containing the regularization path.

Return type:

pandas.DataFrame

get_shape_function(feature_name)

Get the shape function data, i.e. the bin scores, for a given feature.

Parameters:

feature_name (str) – The name of the feature.

Returns:

A DataFrame containing the bin scores for the feature.

Return type:

pandas.DataFrame

plot_feature_importances(n_features=10, missing_bin='include')

Plot a bar plot of the importance scores for the top n_features features.

Parameters:
  • n_features (int, default=10) – Number of features to plot.

  • missing_bin (str, default="include") –

    How to handle missing bin when calculating feature importances:

    • “include” - include the missing bin.

    • “ignore” - ignore the missing bin.

    • “stratify” - calculate separate importances for missing and non-missing bins.

plot_pair_shape_function(feat1_name, feat2_name)

Plot a heatmap for an interaction shape function.

Parameters:
  • feat1_name (str) – The name of the first feature in the pair/interaction.

  • feat2_name (str) – The name of the second feature in the pair/interaction.

plot_shape_function(feature_names, plot_missing_bin=False, axes=None)

Plot the shape function for given feature(s).

Parameters:
  • feature_names (str or list of str) – The name of the feature(s) to plot.

  • plot_missing_bin (bool, default=False) – Whether to plot the missing bin. Only applicable for continuous features.

predict(X_test)

Predict labels using the trained model.

Parameters:

X_test (pandas.DataFrame, shape (n_samples, n_features)) – The input features for prediction.

select_features(X, y, reg_param, select_pairs=False, partialed_feats=None, gamma=None, pair_gamma=None, pair_reg_param=0, entropy_param=0)

Perform feature selection. Selected features and pairs will be stored in self.selected_feats_ and self.selected_pairs_, respectively. Should be called before fit if feature selection is desired.

Parameters:
  • X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.

  • y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.

  • reg_param (float) – Regularization parameter for feature-level regularization.

  • select_pairs (bool, default=False) – Whether to select feature pairs in addition to individual features.

  • partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.

  • gamma (float or None, default=None) – Regularization or scaling parameter; purpose depends on specific use in the model.

  • pair_gamma (float or None, default=None) – Regularization or scaling parameter for feature pairs. If None, defaults to gamma.

  • pair_reg_param (float, default=0) – Regularization parameter for feature pairs.

  • entropy_param (float, default=0) – Entropy parameter to control the diversity or uncertainty.
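The order of calls described above (select_features first, then fit) can be sketched as a small helper. It relies only on the method names and attributes documented here. RecordingModel is a hypothetical stand-in that records calls so the sketch runs without training anything; it is not a dnamite class, and the reg_param value is arbitrary.

```python
def select_then_fit(model, X, y, reg_param, select_pairs=False):
    """Run feature selection, then fit, and return what was selected."""
    model.select_features(X, y, reg_param=reg_param, select_pairs=select_pairs)
    model.fit(X, y)
    return model.selected_feats_, model.selected_pairs_


class RecordingModel:
    """Hypothetical stand-in that only records call order (not dnamite)."""

    def __init__(self):
        self.calls = []
        self.selected_feats_ = None
        self.selected_pairs_ = None

    def select_features(self, X, y, reg_param, select_pairs=False):
        self.calls.append("select_features")
        self.selected_feats_ = list(X)  # pretend every column was kept
        self.selected_pairs_ = []

    def fit(self, X, y):
        self.calls.append("fit")


# Toy inputs: a dict of columns stands in for a DataFrame here.
X = {"age": [30, 41], "bmi": [22.5, 27.1]}
y = [0.3, 1.2]
model = RecordingModel()
feats, pairs = select_then_fit(model, X, y, reg_param=0.01)
print(model.calls)  # ['select_features', 'fit']
print(feats)        # ['age', 'bmi']
```

With a real model, the same helper would be called with a pandas.DataFrame for X.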

class dnamite.models.DNAMiteBinaryClassifier(n_embed=32, n_hidden=32, n_layers=2, max_bins=32, min_samples_per_bin=None, validation_size=0.2, n_val_splits=5, learning_rate=0.0005, max_epochs=100, batch_size=128, device='cpu', kernel_size=5, kernel_weight=3, pair_kernel_size=3, pair_kernel_weight=3, monotone_constraints=None, num_pairs=0, verbosity=0, random_state=None)

DNAMiteBinaryClassifier is a model for binary classification using the DNAMite architecture.

Parameters:
  • n_embed (int, optional (default=32)) – The size of the embedding layer.

  • n_hidden (int, optional (default=32)) – The number of hidden units in the hidden layers.

  • n_layers (int, default=2) – Number of hidden layers in the model.

  • max_bins (int, optional (default=32)) – The maximum number of bins for discretizing continuous features.

  • min_samples_per_bin (int or None, default=None) – Minimum number of samples required in each bin. If None, it is set to min(n_train_samples / 100, 50).

  • validation_size (float, optional (default=0.2)) – The proportion of the dataset to include in the validation split.

  • n_val_splits (int, optional (default=5)) – The number of validation splits for cross-validation.

  • learning_rate (float, optional (default=5e-4)) – The learning rate for the optimizer.

  • max_epochs (int, optional (default=100)) – The maximum number of epochs for training.

  • batch_size (int, optional (default=128)) – The batch size for training.

  • device (str, optional (default="cpu")) – The device to run the model on (“cpu” or “cuda”).

  • kernel_size (int, optional (default=5)) – The size of the kernel in convolutional layers for single features.

  • kernel_weight (float, optional (default=3)) – The weight of the kernel for single feature convolutional layers.

  • pair_kernel_size (int, optional (default=3)) – The size of the kernel for pairwise convolutional layers.

  • pair_kernel_weight (float, optional (default=3)) – The weight of the kernel for pairwise convolutional layers.

  • monotone_constraints (list or None, optional (default=None)) – The monotonic constraints for the features. 0 indicates no constraint, 1 indicates increasing, -1 indicates decreasing. None means no constraints.

  • num_pairs (int, default=0) – Number of pairwise interactions to use in the model.

  • verbosity (int, default=0) – Level of verbosity for logging. 0: Warning, 1: Info, 2: Debug

  • random_state (int, optional) – Random seed for reproducibility.
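To make the default for min_samples_per_bin concrete, the formula stated above can be evaluated directly. This is a sketch of the documented expression only; how the library rounds the result, if at all, is not specified here, and the function name is hypothetical.

```python
def default_min_samples_per_bin(n_train_samples):
    """Evaluate the documented default: min(n_train_samples / 100, 50)."""
    return min(n_train_samples / 100, 50)


print(default_min_samples_per_bin(2_000))    # 20.0
print(default_min_samples_per_bin(100_000))  # 50
```

The cap of 50 takes effect once the training set has more than 5,000 samples.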

fit(X, y, pairs_list=None, partialed_feats=None)

Train model.

Parameters:
  • X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for training. Missing values should be encoded as np.nan. Categorical features will automatically be detected as all columns with dtype “object” or “category”.

  • y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The labels. Should have two unique values.

  • pairs_list (list of tuple[str, str] or None, optional) – List of feature interactions to include; if None, no specific pairs are used.

  • partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.

get_feature_importances(missing_bin='include')

Get the feature importance scores for all features in the model.

Parameters:

missing_bin (str, default="include") –

How to handle missing bin when calculating feature importances:

  • “include” - include the missing bin.

  • “ignore” - ignore the missing bin.

  • “stratify” - calculate separate importances for missing and non-missing bins.

Returns:

A DataFrame containing the feature importance scores for each feature.

Return type:

pandas.DataFrame

get_pair_shape_function(feat1_name, feat2_name)

Get the shape function data for an interaction effect.

Parameters:
  • feat1_name (str) – The name of the first feature in the pair/interaction.

  • feat2_name (str) – The name of the second feature in the pair/interaction.

Returns:

A DataFrame containing the shape function data for the interaction effect.

Return type:

pandas.DataFrame

get_regularization_path(X, y, init_reg_param, partialed_feats=None)

Get the regularization path for the model.

Parameters:
  • X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.

  • y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.

  • init_reg_param (float) – Initial regularization parameter.

  • partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.

Returns:

A DataFrame containing the regularization path.

Return type:

pandas.DataFrame

get_shape_function(feature_name)

Get the shape function data, i.e. the bin scores, for a given feature.

Parameters:

feature_name (str) – The name of the feature.

Returns:

A DataFrame containing the bin scores for the feature.

Return type:

pandas.DataFrame

plot_feature_importances(n_features=10, missing_bin='include')

Plot a bar plot of the importance scores for the top n_features features.

Parameters:
  • n_features (int, default=10) – Number of features to plot.

  • missing_bin (str, default="include") –

    How to handle missing bin when calculating feature importances:

    • “include” - include the missing bin.

    • “ignore” - ignore the missing bin.

    • “stratify” - calculate separate importances for missing and non-missing bins.

plot_pair_shape_function(feat1_name, feat2_name)

Plot a heatmap for an interaction shape function.

Parameters:
  • feat1_name (str) – The name of the first feature in the pair/interaction.

  • feat2_name (str) – The name of the second feature in the pair/interaction.

plot_shape_function(feature_names, plot_missing_bin=False, axes=None)

Plot the shape function for given feature(s).

Parameters:
  • feature_names (str or list of str) – The name of the feature(s) to plot.

  • plot_missing_bin (bool, default=False) – Whether to plot the missing bin. Only applicable for continuous features.

predict(X_test)

Predict labels using the trained model.

Parameters:

X_test (pandas.DataFrame, shape (n_samples, n_features)) – The input features for prediction.

predict_proba(X_test)

Predict probabilities using the trained model.

Parameters:

X_test (pandas.DataFrame, shape (n_samples, n_features)) – The input features for prediction.

score(X, y)

Evaluate the model on the given test data and labels.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,)) – True labels for X.

Returns:

score – The AUC score.

Return type:

float
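For readers unfamiliar with the AUC, the sketch below computes it from binary labels and predicted scores by counting correctly ordered positive/negative pairs. It is an illustration of the metric, not the library's implementation, and the function name and data are hypothetical.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels and scores, for illustration only.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This quadratic-time version is meant for clarity; rank-based implementations give the same value more efficiently.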

select_features(X, y, reg_param, select_pairs=False, partialed_feats=None, gamma=None, pair_gamma=None, pair_reg_param=0, entropy_param=0)

Perform feature selection. Selected features and pairs will be stored in self.selected_feats_ and self.selected_pairs_, respectively. Should be called before fit if feature selection is desired.

Parameters:
  • X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.

  • y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.

  • reg_param (float) – Regularization parameter for feature-level regularization.

  • select_pairs (bool, default=False) – Whether to select feature pairs in addition to individual features.

  • partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.

  • gamma (float or None, default=None) – Regularization or scaling parameter; purpose depends on specific use in the model.

  • pair_gamma (float or None, default=None) – Regularization or scaling parameter for feature pairs. If None, defaults to gamma.

  • pair_reg_param (float, default=0) – Regularization parameter for feature pairs.

  • entropy_param (float, default=0) – Entropy parameter to control the diversity or uncertainty.

class dnamite.models.DNAMiteSurvival(n_eval_times=100, n_embed=32, n_hidden=32, n_layers=2, max_bins=32, min_samples_per_bin=None, validation_size=0.2, n_val_splits=5, learning_rate=0.0005, max_epochs=100, batch_size=128, device='cpu', kernel_size=5, kernel_weight=3, pair_kernel_size=3, pair_kernel_weight=3, monotone_constraints=None, num_pairs=0, verbosity=0, random_state=None, censor_estimator='km')

DNAMiteSurvival is a model for survival analysis using the DNAMite architecture.

Parameters:
  • n_eval_times (int, optional (default=100)) – The number of evaluation times for survival analysis.

  • n_embed (int, optional (default=32)) – The size of the embedding layer.

  • n_hidden (int, optional (default=32)) – The number of hidden units in the hidden layers.

  • n_layers (int, optional (default=2)) – The number of layers in the model.

  • max_bins (int, optional (default=32)) – The maximum number of bins for discretizing continuous features.

  • validation_size (float, optional (default=0.2)) – The proportion of the dataset to include in the validation split.

  • n_val_splits (int, optional (default=5)) – The number of validation splits for cross-validation.

  • learning_rate (float, optional (default=5e-4)) – The learning rate for the optimizer.

  • max_epochs (int, optional (default=100)) – The maximum number of epochs for training.

  • min_samples_per_bin (int or None, default=None) – Minimum number of samples required in each bin. If None, it is set to min(n_train_samples / 100, 50).

  • batch_size (int, optional (default=128)) – The batch size for training.

  • device (str, optional (default="cpu")) – The device to run the model on (“cpu” or “cuda”).

  • kernel_size (int, optional (default=5)) – The size of the kernel in convolutional layers.

  • kernel_weight (float, optional (default=3)) – The weight of the kernel in convolutional layers.

  • pair_kernel_size (int, optional (default=3)) – The size of the kernel for pairwise convolutional layers.

  • pair_kernel_weight (float, optional (default=3)) – The weight of the kernel for pairwise convolutional layers.

  • monotone_constraints (list or None, optional (default=None)) – The monotonic constraints for the features. 0 indicates no constraint, 1 indicates increasing, -1 indicates decreasing. None means no constraints.

  • num_pairs (int, default=0) – Number of pairwise interactions to use in the model.

  • verbosity (int, default=0) – Level of verbosity for logging. 0: Warning, 1: Info, 2: Debug

  • random_state (int, optional) – Random seed for reproducibility.

  • censor_estimator (str, optional (default="km")) – The estimator to use for estimating the censoring distribution. “km” for Kaplan-Meier, “cox” for Cox proportional hazards.

fit(X, y, pairs_list=None, partialed_feats=None)

Train model.

Parameters:
  • X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for training.

  • y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable or labels.

  • pairs_list (list of tuple[str, str] or None, optional) – List of feature interactions to include; if None, no specific pairs are used.

  • partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.

get_calibration_data(X, y, eval_time, n_bins=10, binning_method='quantile')

Get calibration data to assess the calibration of the model at a given evaluation time.

Parameters:
  • X (array-like) – Input data used to generate prediction. Should usually be a held-out test set.

  • y (structured np.array of shape (n_samples,) with dtype [("event", bool), ("time", float)]) – Survival labels corresponding to X.

  • eval_time (float) – Evaluation time to assess calibration at.

  • n_bins (int, optional (default=10)) – Number of bins to use for binned Kaplan-Meier estimate.

  • binning_method (str, optional (default="quantile")) – Method for binning predictions. Options are “quantile” or “uniform”.
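The structured label array described for y can be built with plain NumPy. The following is a minimal sketch using the dtype given above; the helper name and the toy values are hypothetical.

```python
import numpy as np


def make_survival_labels(events, times):
    """Pack event indicators and times into a structured array."""
    if len(events) != len(times):
        raise ValueError("events and times must have the same length")
    y = np.empty(len(events), dtype=[("event", bool), ("time", float)])
    y["event"] = events  # True if the event was observed, False if censored
    y["time"] = times    # time of the event or of censoring
    return y


# Toy data: two observed events and one censored sample.
y = make_survival_labels([True, False, True], [5.0, 12.5, 3.2])
print(y.dtype.names)  # ('event', 'time')
print(y["time"])
```

Fields are then accessed by name, for example y["event"] for the indicators.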

get_feature_importances(eval_time=None, missing_bin='include')

Get the feature importance scores for all features in the model.

Parameters:
  • eval_time (float or None, default=None) – The evaluation time for which to compute the feature importance.

  • missing_bin (str, default="include") –

    How to handle missing bin when calculating feature importances:

    • “include” - include the missing bin.

    • “ignore” - ignore the missing bin.

    • “stratify” - calculate separate importances for missing and non-missing bins.

Returns:

A DataFrame containing the feature importance scores for each feature.

Return type:

pandas.DataFrame

get_pair_shape_function(feat1_name, feat2_name, eval_time)

Get the shape function data for an interaction effect.

Parameters:
  • feat1_name (str) – The name of the first feature in the pair/interaction.

  • feat2_name (str) – The name of the second feature in the pair/interaction.

  • eval_time (float) – The evaluation time to compute the interaction shape function at.

Returns:

A DataFrame containing the shape function data for the interaction effect.

Return type:

pandas.DataFrame

get_regularization_path(X, y, init_reg_param, partialed_feats=None)

Get the regularization path for the model.

Parameters:
  • X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.

  • y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.

  • init_reg_param (float) – Initial regularization parameter.

  • partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.

Returns:

A DataFrame containing the regularization path.

Return type:

pandas.DataFrame

get_shape_function(feature_name, eval_time)

Get the shape function data, i.e. the bin scores, for a given feature.

Parameters:
  • feature_name (str) – The name of the feature.

  • eval_time (float) – The evaluation time for which to compute the shape function.

Returns:

A DataFrame containing the bin scores for the feature.

Return type:

pandas.DataFrame

make_calibration_plot(X, y, eval_times, n_bins=10, binning_method='quantile')

Make a calibration plot to assess the calibration of the model at one or more evaluation times.

Parameters:
  • X (array-like) – Input data used to generate prediction. Should usually be a held-out test set.

  • y (structured np.array of shape (n_samples,) with dtype [("event", bool), ("time", float)]) – Survival labels corresponding to X.

  • eval_times (float or list of float) – Evaluation time(s) to assess calibration at.

  • n_bins (int, optional (default=10)) – Number of bins to use for binned Kaplan-Meier estimate.

  • binning_method (str, optional (default="quantile")) – Method for binning predictions. Options are “quantile” or “uniform”.

plot_feature_importances(n_features=10, eval_times=None, missing_bin='include')

Plot a bar plot of the importance scores for the top n_features features.

Parameters:
  • n_features (int, default=10) – Number of features to plot.

  • eval_times (float, list or None, default=None) – The evaluation time(s) for which to compute the feature importance. None means to compute the importance over all evaluation times.

  • missing_bin (str, default="include") –

    How to handle missing bin when calculating feature importances:

    • “include” - include the missing bin.

    • “ignore” - ignore the missing bin.

    • “stratify” - calculate separate importances for missing and non-missing bins.

plot_pair_shape_function(feat1_name, feat2_name, eval_times)

Plot a heatmap for an interaction shape function.

Parameters:
  • feat1_name (str) – The name of the first feature in the pair/interaction.

  • feat2_name (str) – The name of the second feature in the pair/interaction.

  • eval_times (float or list of float) – The evaluation time(s) to plot the interaction shape function at.

plot_shape_function(feature_names, eval_times, plot_missing_bin=False)

Plot the shape function for given feature(s).

Parameters:
  • feature_names (str or list of str) – The name of the feature(s) to plot.

  • eval_times (float or list of float) – The evaluation time(s) for which to compute the shape function.

  • plot_missing_bin (bool, default=False) – Whether to plot the missing bin. Only applicable for continuous features.

predict(X_test)

Predict labels using the trained model.

Parameters:

X_test (pandas.DataFrame, shape (n_samples, n_features)) – The input features for prediction.

predict_survival(X_test, test_times=None)

Predict the survival probability for a set of evaluation times.

Parameters:
  • X_test (pandas.DataFrame, shape (n_samples, n_features)) – The input features for prediction.

  • test_times (array-like, optional) – The evaluation times to predict the survival probability at. If None, the evaluation times used for training will be used.

Returns:

The predicted survival probabilities for each sample at each evaluation time.

Return type:

np.ndarray of shape (n_samples, n_eval_times)
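One possible way to use a row of the returned matrix is to read off a median survival time, i.e. the first evaluation time at which the predicted survival probability drops to 0.5 or below. The sketch below works on a single row and assumes the probabilities do not increase over time; the helper name and the numbers are hypothetical, not model output.

```python
def median_survival_time(times, survival_probs, threshold=0.5):
    """Return the first time whose survival probability is <= threshold.

    Returns None if the curve never reaches the threshold.
    """
    if len(times) != len(survival_probs):
        raise ValueError("times and survival_probs must have the same length")
    for t, s in zip(times, survival_probs):
        if s <= threshold:
            return t
    return None


# Hypothetical evaluation times and one row of survival probabilities.
times = [1.0, 2.0, 3.0, 4.0]
row = [0.9, 0.7, 0.4, 0.2]
print(median_survival_time(times, row))  # 3.0
```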

select_features(X, y, reg_param, select_pairs=False, partialed_feats=None, gamma=None, pair_gamma=None, pair_reg_param=0, entropy_param=0)

Perform feature selection. Selected features and pairs will be stored in self.selected_feats_ and self.selected_pairs_, respectively. Should be called before fit if feature selection is desired.

Parameters:
  • X (pandas.DataFrame, shape (n_samples, n_features)) – The input features for the model.

  • y (pandas.Series or numpy.ndarray, shape (n_samples,)) – The target variable.

  • reg_param (float) – Regularization parameter for feature-level regularization.

  • select_pairs (bool, default=False) – Whether to select feature pairs in addition to individual features.

  • partialed_feats (list or None, optional) – A list of features that should be fit completely before fitting all other features.

  • gamma (float or None, default=None) – Regularization or scaling parameter; purpose depends on specific use in the model.

  • pair_gamma (float or None, default=None) – Regularization or scaling parameter for feature pairs. If None, defaults to gamma.

  • pair_reg_param (float, default=0) – Regularization parameter for feature pairs.

  • entropy_param (float, default=0) – Entropy parameter to control the diversity or uncertainty.