Inspecting internals

This example assumes you’ve read advanced.ipynb, and covers:

  • Inspecting useful internal TrainGenerator & DataGenerator attributes
  • Inspecting train / validation interruptions
[1]:
import deeptrain
deeptrain.util.misc.append_examples_dir_to_sys_path()  # for `from utils import`

from utils import make_autoencoder, init_session
from utils import AE_CONFIGS as C

Configure & train

[2]:
C['traingen']['epochs'] = 1  # don't need more
C['traingen']['iter_verbosity'] = 0  # don't need progress printing here
tg = init_session(C, make_autoencoder)
dg = tg.datagen
vdg = tg.val_datagen

tg.train()
WARNING: multiple file extensions found in `path`; only .npy will be used
Discovered 48 files with matching format
48 set nums inferred; if more are expected, ensure file names contain a common substring w/ a number (e.g. 'train1.npy', 'train2.npy', etc)
DataGenerator initiated

WARNING: multiple file extensions found in `path`; only .npy will be used
Discovered 36 files with matching format
36 set nums inferred; if more are expected, ensure file names contain a common substring w/ a number (e.g. 'train1.npy', 'train2.npy', etc)
DataGenerator initiated

NOTE: will exclude `labels` from saving when `input_as_labels=True`; to keep 'labels', add '{labels}' to `saveskip_list` instead
Preloading superbatch ... WARNING: multiple file extensions found in `path`; only .npy will be used
Discovered 48 files with matching format
................................................ finished, w/ 6144 total samples
Train initial data prepared
Preloading superbatch ... WARNING: multiple file extensions found in `path`; only .npy will be used
Discovered 36 files with matching format
.................................... finished, w/ 4608 total samples
Val initial data prepared
Logging ON; directory (new): C:\deeptrain\examples\dir\logs\M5__model-nadam__min999.000

Data set_nums shuffled


_____________________
 EPOCH 1 -- COMPLETE 



Validating...
TrainGenerator state saved
Model report generated and saved
Best model saved to C:\deeptrain\examples\dir\models\M5__model-nadam__min.144
TrainGenerator state saved
Model report generated and saved
[image: training history plot]
Training has concluded.

Arguments passed to TrainGenerator

We can see the arguments passed at __init__; these are saved in the state file, which is useful for seeing exactly how training was instantiated. Some objects are stored as strings to allow pickling.

[3]:
from pprint import pprint
pprint(tg._passed_args)
{'best_models_dir': 'C:\\deeptrain\\examples\\dir\\models',
 'datagen': 'DataGenerator',
 'epochs': 1,
 'eval_fn': 'predict',
 'input_as_labels': True,
 'iter_verbosity': 0,
 'logs_dir': 'C:\\deeptrain\\examples\\dir\\logs',
 'max_is_best': False,
 'model': 'Functional',
 'model_configs': {'activation': ['relu', 'relu', 'relu', 'relu', 'relu'],
                   'batch_shape': (128, 28, 28, 1),
                   'filters': [6, 12, 2, 6, 12],
                   'input_dropout': 0.5,
                   'kernel_size': [(3, 3), (3, 3), (3, 3), (3, 3), (3, 3)],
                   'loss': 'mse',
                   'metrics': None,
                   'optimizer': 'nadam',
                   'preout_dropout': 0.4,
                   'strides': [(2, 2), (2, 2), 1, 1, 1],
                   'up_sampling_2d': [None, None, None, (2, 2), (2, 2)]},
 'plot_configs': {'0': {'legend_kw': {'fontsize': 11}}},
 'val_datagen': 'DataGenerator'}

Code used in training & initial attributes

  • TrainGenerator’s attributes, as they stand at the end of TrainGenerator.__init__, are logged automatically
    • savepath: logdir/misc/init_state.json
  • The source code used to run training (__main__) is also logged, assuming training was run as a .py file (not an IPython excerpt or Jupyter notebook); the sketch after the next cell shows reading it back
    • savepath: logdir/misc/init_script.txt
[4]:
import json
with open(tg.get_last_log('init_state'), 'r') as f:
    j = json.load(f)
    pprint(j)
{'_batches_fit': '0',
 '_batches_validated': '0',
 '_class_labels_cache': '[]',
 '_epoch': '0',
 '_eval_fn': 'fn',
 '_eval_fn_name': 'predict',
 '_fit_fn': 'fn',
 '_fit_fn_name': 'train_on_batch',
 '_fit_iters': '0',
 '_hist_vlines': '[]',
 '_history_fig': 'None',
 '_imports': "{'PIL': 1, 'LZ4F': 1}",
 '_inferred_batch_size': 'None',
 '_init_callbacks_called': 'True',
 '_labels': '[]',
 '_labels_cache': '[]',
 '_max_set_name_chars': '3',
 '_passed_args': 'dict',
 '_preds_cache': '[]',
 '_save_from_on_val_end': 'False',
 '_set_name': '1',
 '_set_name_cache': '[]',
 '_set_num': '1',
 '_sw_cache': '[]',
 '_temp_history_empty': "{'loss': []}",
 '_times_validated': '0',
 '_train_loop_done': 'False',
 '_train_new_batch_notified': 'False',
 '_train_postiter_processed': 'True',
 '_train_val_x_ticks': '[]',
 '_train_x_ticks': '[]',
 '_val_epoch': '0',
 '_val_hist_vlines': '[]',
 '_val_iters': '0',
 '_val_loop_done': 'False',
 '_val_max_set_name_chars': '2',
 '_val_new_batch_notified': 'False',
 '_val_postiter_processed': 'True',
 '_val_set_name': '1',
 '_val_set_name_cache': '[]',
 '_val_set_num': '1',
 '_val_temp_history_empty': "{'loss': []}",
 '_val_train_x_ticks': '[]',
 '_val_x_ticks': '[]',
 'alias_to_metric': "{'acc': 'accuracy', 'mae': 'mean_absolute_error', 'mse': "
                    "'mean_squared_error', 'mape': "
                    "'mean_absolute_percentage_error', 'msle': "
                    "'mean_squared_logarithmic_error', 'kld': "
                    "'kullback_leibler_divergence', 'cosine': "
                    "'cosine_similarity', 'f1': 'f1_score', 'f1-score': "
                    "'f1_score'}",
 'batch_size': '128',
 'best_key_metric': '999',
 'best_models_dir': 'C:\\deeptrain\\examples\\dir\\models',
 'best_subset_nums': '[]',
 'best_subset_size': 'None',
 'callbacks': '[]',
 'check_model_health': 'True',
 'checkpoints_overwrite_duplicates': 'True',
 'class_weights': 'None',
 'custom_metrics': '{}',
 'datagen': 'deeptrain.data_generator.DataGenerator',
 'dynamic_predict_threshold': '0.5',
 'dynamic_predict_threshold_min_max': 'None',
 'epochs': '1',
 'final_fig_dir': 'None',
 'history': "{'loss': []}",
 'input_as_labels': 'True',
 'iter_verbosity': '0',
 'key_metric': 'loss',
 'key_metric_fn': 'mean_squared_error',
 'key_metric_history': '[]',
 'loadpath': 'None',
 'loadskip_list': "['{auto}', 'model_name', 'model_base_name', 'model_num', "
                  "'use_passed_dirs_over_loaded', 'logdir', "
                  "'_init_callbacks_called']",
 'logdir': 'C:\\deeptrain\\examples\\dir\\logs\\M5__model-nadam__min999.000',
 'logs_dir': 'C:\\deeptrain\\examples\\dir\\logs',
 'logs_use_full_model_name': 'True',
 'loss_weighted_slices_range': 'None',
 'max_checkpoints': '5',
 'max_is_best': 'False',
 'max_one_best_save': 'True',
 'metric_printskip_configs': "{'train': [], 'val': []}",
 'metric_to_alias': "{'loss': 'Loss', 'accuracy': 'Acc', 'f1_score': 'F1', "
                    "'tnr': '0-Acc', 'tpr': '1-Acc', 'mean_absolute_error': "
                    "'MAE', 'mean_squared_error': 'MSE'}",
 'model': 'tensorflow.python.keras.engine.functional.Functional',
 'model_base_name': 'model',
 'model_configs': "{'batch_shape': (128, 28, 28, 1), 'loss': 'mse', 'metrics': "
                  "None, 'optimizer': 'nadam', 'activation': ['relu', 'relu', "
                  "'relu', 'relu', 'relu'], 'filters': [6, 12, 2, 6, 12], "
                  "'kernel_size': [(3, 3), (3, 3), (3, 3), (3, 3), (3, 3)], "
                  "'strides': [(2, 2), (2, 2), 1, 1, 1], 'up_sampling_2d': "
                  "[None, None, None, (2, 2), (2, 2)], 'input_dropout': 0.5, "
                  "'preout_dropout': 0.4}",
 'model_name': 'M5__model-nadam__min999.000',
 'model_name_configs': "{'optimizer': '', 'lr': '', 'best_key_metric': "
                       "'__min'}",
 'model_num': '5',
 'model_save_kw': "{'include_optimizer': True, 'save_format': 'h5'}",
 'model_save_weights_kw': "{'save_format': 'h5'}",
 'name_process_key_fn': 'NAME_PROCESS_KEY_FN',
 'new_model_num': 'True',
 'optimizer_load_configs': 'None',
 'optimizer_save_configs': 'None',
 'plot_configs': "{'0': {'legend_kw': {'fontsize': 11}, 'metrics': {'train': "
                 "['loss'], 'val': ['loss']}, 'x_ticks': {'train': "
                 "['_train_x_ticks'], 'val': ['_val_train_x_ticks']}, "
                 "'vhlines': {'v': '_hist_vlines', 'h': 1}, 'mark_best_cfg': "
                 "{'val': 'loss', 'max_is_best': False}, 'ylims': (0, 2), "
                 "'linewidth': [1.5, 1.5], 'linestyle': ['-', '-'], 'color': "
                 "['#1f77b4', 'orange']}, 'fig_kw': {'figsize': (12, 7)}}",
 'plot_first_pane_max_vals': '2',
 'plot_history_freq': "{'epoch': 1}",
 'pred_weighted_slices_range': 'None',
 'predict_threshold': '0.5',
 'report_configs': 'dict',
 'report_fontpath': 'C:\\deeptrain\\deeptrain\\util\\fonts\\consola.ttf',
 'reset_statefuls': 'False',
 'saveskip_list': "['model', 'optimizer_state', 'callbacks', 'key_metric_fn', "
                  "'custom_metrics', 'metric_to_alias', 'alias_to_metric', "
                  "'name_process_key_fn', '_fit_fn', '_eval_fn', '_labels', "
                  "'_preds', '_y_true', '_y_preds', '_labels_cache', "
                  "'_preds_cache', '_sw_cache', '_imports', '_history_fig', "
                  "'_val_max_set_name_chars', '_max_set_name_chars', "
                  "'_inferred_batch_size', '_class_labels_cache', "
                  "'_temp_history_empty', '_val_temp_history_empty', "
                  "'_val_sw', '_set_num', '_val_set_num', 'labels']",
 'temp_checkpoint_freq': 'None',
 'temp_history': "{'loss': []}",
 'train_metrics': "['loss']",
 'unique_checkpoint_freq': "{'epoch': 1}",
 'val_class_weights': 'None',
 'val_datagen': 'deeptrain.data_generator.DataGenerator',
 'val_freq': "{'epoch': 1}",
 'val_history': "{'loss': []}",
 'val_metrics': "['loss']",
 'val_temp_history': "{'loss': []}"}
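
The logged source code can be read back similarly; a minimal sketch, assuming training was run from a .py file so init_script.txt exists at the savepath noted above:

import os

script_path = os.path.join(tg.logdir, 'misc', 'init_script.txt')  # savepath per above
if os.path.isfile(script_path):
    with open(script_path, 'r') as f:
        print(f.read())  # the __main__ source that launched training
else:
    print("init_script.txt not found (e.g. training was run from a notebook)")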

Save directories

[5]:
print("Best model directory:", tg.best_models_dir)
print("Checkpoint directory:", tg.logdir)
print("Model full name:", tg.model_name)
Best model directory: C:\deeptrain\examples\dir\models
Checkpoint directory: C:\deeptrain\examples\dir\logs\M5__model-nadam__min999.000
Model full name: M5__model-nadam__min.144

Interrupts

Interrupts can be inspected by checking the pertinent attributes manually (_train_loop_done, _train_postiter_processed, _val_loop_done, _val_postiter_processed), or by calling interrupt_status(), which checks these and prints an appropriate message.
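
The manual route is simply reading those flags off the instance; a minimal sketch:

# read the four interrupt flags directly off the TrainGenerator
print("train:", tg._train_loop_done, tg._train_postiter_processed)
print("val:  ", tg._val_loop_done, tg._val_postiter_processed)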

[6]:
tg.interrupt_status()
No interrupts detected.

Flags checked:
        _train_loop_done          = False
        _train_postiter_processed = True
        _val_loop_done            = False
        _val_postiter_processed   = True
[6]:
(False, False)

Interrupts can be manual (a KeyboardInterrupt) or due to a raised Exception; either interrupts the flow of train/validation, so knowing at which point the fault occurred lets us correct course manually (e.g. execute the portion of code that would have run after the exception).
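
For instance, a manual interrupt can be caught without losing the session, then inspected; a minimal sketch:

try:
    tg.train()
except KeyboardInterrupt:
    tg.interrupt_status()  # see where the flow was interrupted before resuming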

Interrupt example

[7]:
tg._train_loop_done = True
tg._val_loop_done = True
tg._val_postiter_processed = True

At this point _on_val_end() is called automatically, so if you can observe such a state, it means the call didn’t finish or was never initiated.

[8]:
tg.interrupt_status()
Incomplete or not called `_on_val_end()` within `validate()`.
Interrupted: train[no], validation[yes].

Flags checked:
        _train_loop_done          = True
        _train_postiter_processed = True
        _val_loop_done            = True
        _val_postiter_processed   = True
[8]:
(False, True)

Example 2

[9]:
tg._val_loop_done = False
tg._val_postiter_processed = False
tg.interrupt_status()
Interrupted during validation loop within `validate()`; incomplete or not called `_val_postiter_processing()`.
Interrupted: train[no], validation[yes].

Flags checked:
        _train_loop_done          = True
        _train_postiter_processed = True
        _val_loop_done            = False
        _val_postiter_processed   = False
[9]:
(False, True)
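
interrupt_status() also returns the two booleans seen above, (train_interrupted, val_interrupted), so recovery logic can branch on them; a minimal sketch:

train_interrupted, val_interrupted = tg.interrupt_status()
if val_interrupted:
    # start validation afresh; `restart` is documented in help(tg.validate) below
    tg.validate(restart=True)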
[10]:
help(tg.train)
Help on method train in module deeptrain.train_generator:

train() method of deeptrain.train_generator.TrainGenerator instance
    The train loop.

        - Fetches data from `get_data`
    - Fits data via `fit_fn`
        - Processes fit metrics in `_train_postiter_processing`
        - Stores metrics in `history`
        - Applies `'train:iter'`, `'train:batch'`, and `'train:epoch'`
          callbacks
        - Calls `validate` when appropriate

    **Interruption**:

        - *Safe*: during `get_data`, which can be called indefinitely
          without changing any attributes.
        - *Avoid*: during `_train_postiter_processing`, where `fit_fn` is
          applied and weights are updated - but metrics aren't stored, and
          `_train_postiter_processed=False`, restarting the loop without
          recording progress.
        - Best bet is during :meth:`validate`, as `get_data` may be too brief.

[11]:
help(tg.validate)
Help on method validate in module deeptrain.train_generator:

validate(record_progress=True, clear_cache=True, restart=False, use_callbacks=True) method of deeptrain.train_generator.TrainGenerator instance
    Validation loop.

        - Fetches data from `get_data`
        - Applies function based on `_eval_fn_name`
        - Processes and caches metrics/predictions in
          `_val_postiter_processing`
        - Applies `'val:iter'`, `'val:batch'`, and `'val:epoch'` callbacks
        - Calls `_on_val_end` at end of validation to compute metrics
          and store them in `val_history`
        - Applies `'val_end'` and maybe `('val_end': 'train:epoch')` callbacks
        - If `restart`, calls :meth:`reset_validation`.

    **Arguments**:
        record_progress: bool
            If False, won't update `val_history`, `_val_iters`,
            `_batches_validated`.
        clear_cache: bool
            If False, won't call :meth:`clear_cache`; useful for keeping
            preds & labels acquired during validation.
        restart: bool
            If True, will call :meth:`reset_validation` before validation loop
            to reset validation attributes; useful for starting afresh (e.g.
            if interrupted).
        use_callbacks: bool
            If False, won't call :meth:`apply_callbacks`
            or :meth:`plot_history`.

    **Interruption:**

        - *Safe*: during `get_data`, which can be called indefinitely
          without changing any attributes.
        - *Avoid*: during `_val_postiter_processing`. Model remains
          unaffected*, but caches are updated; a restart may yield duplicate
          appending, which will error or yield inaccuracies.
          (* forward pass may consume random seed if random ops are used)
        - *In practice*: prefer interrupting immediately after
          `_print_iter_progress` executes.

Interrupts can also be inspected by checking temp_history, val_temp_history, and cache attributes (e.g. _preds_cache); by default, cache attributes are cleared when validate() finishes. See help(tg.train) and help(tg.validate) above for further interruption guidelines; a quick look at those attributes follows.
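
A minimal sketch (values reflect whatever state the generator is in at the time):

print(tg.temp_history)       # per-iteration train metrics for the current epoch
print(tg.val_temp_history)   # per-iteration val metrics for the current validation
print(len(tg._preds_cache))  # predictions cached during validation; cleared by
                             # default once validate() completes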

DataGenerator attributes

set_nums_to_process holds the set nums remaining until the end of the epoch, at which point it is reset to set_nums_original. A “set” refers to a data file to load.

[12]:
# We can check which set numbers remain to be processed in epoch or validation:
print(dg.set_nums_to_process)
print(vdg.set_nums_to_process)
# We can arbitrarily append to or pop from the list to skip or repeat a batch
['42', '38', '34', '20', '25', '41', '14', '33', '30', '5', '19', '32', '11', '28', '46', '40', '27', '24', '2', '21', '9', '17', '1', '29', '43', '26', '23', '36', '7', '6', '48', '4', '39', '13', '12', '37', '45', '18', '44', '35', '10', '31', '22', '47', '8', '16', '15', '3']
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36']
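
Per the comment in the cell above, the list can be manipulated to skip or repeat sets; a hedged sketch:

# pop the next set num to skip it, or append one to process it (again)
skipped = dg.set_nums_to_process.pop(0)  # next set won't be loaded this epoch...
dg.set_nums_to_process.append(skipped)   # ...unless re-queued at the end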

Info function

Lastly, we can access most of the above via info():

[13]:
tg.info()
Epochs: 1/1
Train batches fit: 0/48 (in current epoch)
Val   batches fit: 0/36 (in current validation)
--------------------------------------------------------------------------------
Best model directory: C:\deeptrain\examples\dir\models
Checkpoint directory: C:\deeptrain\examples\dir\logs\M5__model-nadam__min999.000
Load path: None
Model full name: M5__model-nadam__min.144
--------------------------------------------------------------------------------
Interrupted during validation loop within `validate()`; incomplete or not called `_val_postiter_processing()`.
Interrupted: train[no], validation[yes].

Flags checked:
        _train_loop_done          = True
        _train_postiter_processed = True
        _val_loop_done            = False
        _val_postiter_processed   = False