Inspecting internals¶
This example assumes you’ve read advanced.ipynb, and covers:
- Inspecting useful internal TrainGenerator & DataGenerator attributes
- Inspecting train / validation interruptions
[1]:
import deeptrain
deeptrain.util.misc.append_examples_dir_to_sys_path() # for `from utils import`
from utils import make_autoencoder, init_session
from utils import AE_CONFIGS as C
Configure & train¶
[2]:
C['traingen']['epochs'] = 1 # don't need more
C['traingen']['iter_verbosity'] = 0 # don't need progress printing here
tg = init_session(C, make_autoencoder)
dg = tg.datagen
vdg = tg.val_datagen
tg.train()
WARNING: multiple file extensions found in `path`; only .npy will be used
Discovered 48 files with matching format
48 set nums inferred; if more are expected, ensure file names contain a common substring w/ a number (e.g. 'train1.npy', 'train2.npy', etc)
DataGenerator initiated
WARNING: multiple file extensions found in `path`; only .npy will be used
Discovered 36 files with matching format
36 set nums inferred; if more are expected, ensure file names contain a common substring w/ a number (e.g. 'train1.npy', 'train2.npy', etc)
DataGenerator initiated
NOTE: will exclude `labels` from saving when `input_as_labels=True`; to keep 'labels', add '{labels}'to `saveskip_list` instead
Preloading superbatch ... WARNING: multiple file extensions found in `path`; only .npy will be used
Discovered 48 files with matching format
................................................ finished, w/ 6144 total samples
Train initial data prepared
Preloading superbatch ... WARNING: multiple file extensions found in `path`; only .npy will be used
Discovered 36 files with matching format
.................................... finished, w/ 4608 total samples
Val initial data prepared
Logging ON; directory (new): C:\deeptrain\examples\dir\logs\M5__model-nadam__min999.000
Data set_nums shuffled
_____________________
EPOCH 1 -- COMPLETE
Validating...
TrainGenerator state saved
Model report generated and saved
Best model saved to C:\deeptrain\examples\dir\models\M5__model-nadam__min.144
TrainGenerator state saved
Model report generated and saved
Training has concluded.
Arguments passed to TrainGenerator¶
We can see the arguments passed to __init__; these are saved in the state file, which is useful for seeing exactly how training was instantiated. Some objects are stored as strings to allow pickling.
[3]:
from pprint import pprint
pprint(tg._passed_args)
{'best_models_dir': 'C:\\deeptrain\\examples\\dir\\models',
'datagen': 'DataGenerator',
'epochs': 1,
'eval_fn': 'predict',
'input_as_labels': True,
'iter_verbosity': 0,
'logs_dir': 'C:\\deeptrain\\examples\\dir\\logs',
'max_is_best': False,
'model': 'Functional',
'model_configs': {'activation': ['relu', 'relu', 'relu', 'relu', 'relu'],
'batch_shape': (128, 28, 28, 1),
'filters': [6, 12, 2, 6, 12],
'input_dropout': 0.5,
'kernel_size': [(3, 3), (3, 3), (3, 3), (3, 3), (3, 3)],
'loss': 'mse',
'metrics': None,
'optimizer': 'nadam',
'preout_dropout': 0.4,
'strides': [(2, 2), (2, 2), 1, 1, 1],
'up_sampling_2d': [None, None, None, (2, 2), (2, 2)]},
'plot_configs': {'0': {'legend_kw': {'fontsize': 11}}},
'val_datagen': 'DataGenerator'}
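The note that some objects are stored as strings to allow pickling can be illustrated with a minimal sketch. It does not use DeepTrain's own code: `ToyModel` and `stringify_args` are hypothetical names made up for this illustration, and the rule used here (replace any non-primitive value with its class name) is an assumption about the general idea, not the library's implementation.

```python
import pickle


class ToyModel:
    """Hypothetical stand-in for an object that is awkward to pickle."""

    def __init__(self):
        # Lambdas generally cannot be pickled, so neither can this instance.
        self.activation = lambda x: x


def stringify_args(args):
    """Replace non-primitive values with their class name (hypothetical helper)."""
    primitives = (str, int, float, bool, type(None), list, tuple, dict)
    return {key: value if isinstance(value, primitives) else type(value).__name__
            for key, value in args.items()}


passed = {'epochs': 1, 'max_is_best': False, 'model': ToyModel()}
safe = stringify_args(passed)

# The stringified dict round-trips through pickle without issue.
restored = pickle.loads(pickle.dumps(safe))
print(restored)
```

The trade-off is that the stored value only identifies the kind of object that was passed, as with 'Functional' and 'DataGenerator' in the output above; the object itself cannot be rebuilt from it.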
Code used in training & initial attributes¶
- TrainGenerator’s attributes are logged at the end of TrainGenerator.__init__
  - savepath: logdir/misc/init_state.json
- Source code used to run training (__main__) is also logged, assuming it was run as a .py file (not an IPython excerpt or Jupyter notebook)
  - savepath: logdir/misc/init_script.txt
[4]:
import json
with open(tg.get_last_log('init_state'), 'r') as f:
    j = json.load(f)
pprint(j)
{'_batches_fit': '0',
'_batches_validated': '0',
'_class_labels_cache': '[]',
'_epoch': '0',
'_eval_fn': 'fn',
'_eval_fn_name': 'predict',
'_fit_fn': 'fn',
'_fit_fn_name': 'train_on_batch',
'_fit_iters': '0',
'_hist_vlines': '[]',
'_history_fig': 'None',
'_imports': "{'PIL': 1, 'LZ4F': 1}",
'_inferred_batch_size': 'None',
'_init_callbacks_called': 'True',
'_labels': '[]',
'_labels_cache': '[]',
'_max_set_name_chars': '3',
'_passed_args': 'dict',
'_preds_cache': '[]',
'_save_from_on_val_end': 'False',
'_set_name': '1',
'_set_name_cache': '[]',
'_set_num': '1',
'_sw_cache': '[]',
'_temp_history_empty': "{'loss': []}",
'_times_validated': '0',
'_train_loop_done': 'False',
'_train_new_batch_notified': 'False',
'_train_postiter_processed': 'True',
'_train_val_x_ticks': '[]',
'_train_x_ticks': '[]',
'_val_epoch': '0',
'_val_hist_vlines': '[]',
'_val_iters': '0',
'_val_loop_done': 'False',
'_val_max_set_name_chars': '2',
'_val_new_batch_notified': 'False',
'_val_postiter_processed': 'True',
'_val_set_name': '1',
'_val_set_name_cache': '[]',
'_val_set_num': '1',
'_val_temp_history_empty': "{'loss': []}",
'_val_train_x_ticks': '[]',
'_val_x_ticks': '[]',
'alias_to_metric': "{'acc': 'accuracy', 'mae': 'mean_absolute_error', 'mse': "
"'mean_squared_error', 'mape': "
"'mean_absolute_percentage_error', 'msle': "
"'mean_squared_logarithmic_error', 'kld': "
"'kullback_leibler_divergence', 'cosine': "
"'cosine_similarity', 'f1': 'f1_score', 'f1-score': "
"'f1_score'}",
'batch_size': '128',
'best_key_metric': '999',
'best_models_dir': 'C:\\deeptrain\\examples\\dir\\models',
'best_subset_nums': '[]',
'best_subset_size': 'None',
'callbacks': '[]',
'check_model_health': 'True',
'checkpoints_overwrite_duplicates': 'True',
'class_weights': 'None',
'custom_metrics': '{}',
'datagen': 'deeptrain.data_generator.DataGenerator',
'dynamic_predict_threshold': '0.5',
'dynamic_predict_threshold_min_max': 'None',
'epochs': '1',
'final_fig_dir': 'None',
'history': "{'loss': []}",
'input_as_labels': 'True',
'iter_verbosity': '0',
'key_metric': 'loss',
'key_metric_fn': 'mean_squared_error',
'key_metric_history': '[]',
'loadpath': 'None',
'loadskip_list': "['{auto}', 'model_name', 'model_base_name', 'model_num', "
"'use_passed_dirs_over_loaded', 'logdir', "
"'_init_callbacks_called']",
'logdir': 'C:\\deeptrain\\examples\\dir\\logs\\M5__model-nadam__min999.000',
'logs_dir': 'C:\\deeptrain\\examples\\dir\\logs',
'logs_use_full_model_name': 'True',
'loss_weighted_slices_range': 'None',
'max_checkpoints': '5',
'max_is_best': 'False',
'max_one_best_save': 'True',
'metric_printskip_configs': "{'train': [], 'val': []}",
'metric_to_alias': "{'loss': 'Loss', 'accuracy': 'Acc', 'f1_score': 'F1', "
"'tnr': '0-Acc', 'tpr': '1-Acc', 'mean_absolute_error': "
"'MAE', 'mean_squared_error': 'MSE'}",
'model': 'tensorflow.python.keras.engine.functional.Functional',
'model_base_name': 'model',
'model_configs': "{'batch_shape': (128, 28, 28, 1), 'loss': 'mse', 'metrics': "
"None, 'optimizer': 'nadam', 'activation': ['relu', 'relu', "
"'relu', 'relu', 'relu'], 'filters': [6, 12, 2, 6, 12], "
"'kernel_size': [(3, 3), (3, 3), (3, 3), (3, 3), (3, 3)], "
"'strides': [(2, 2), (2, 2), 1, 1, 1], 'up_sampling_2d': "
"[None, None, None, (2, 2), (2, 2)], 'input_dropout': 0.5, "
"'preout_dropout': 0.4}",
'model_name': 'M5__model-nadam__min999.000',
'model_name_configs': "{'optimizer': '', 'lr': '', 'best_key_metric': "
"'__min'}",
'model_num': '5',
'model_save_kw': "{'include_optimizer': True, 'save_format': 'h5'}",
'model_save_weights_kw': "{'save_format': 'h5'}",
'name_process_key_fn': 'NAME_PROCESS_KEY_FN',
'new_model_num': 'True',
'optimizer_load_configs': 'None',
'optimizer_save_configs': 'None',
'plot_configs': "{'0': {'legend_kw': {'fontsize': 11}, 'metrics': {'train': "
"['loss'], 'val': ['loss']}, 'x_ticks': {'train': "
"['_train_x_ticks'], 'val': ['_val_train_x_ticks']}, "
"'vhlines': {'v': '_hist_vlines', 'h': 1}, 'mark_best_cfg': "
"{'val': 'loss', 'max_is_best': False}, 'ylims': (0, 2), "
"'linewidth': [1.5, 1.5], 'linestyle': ['-', '-'], 'color': "
"['#1f77b4', 'orange']}, 'fig_kw': {'figsize': (12, 7)}}",
'plot_first_pane_max_vals': '2',
'plot_history_freq': "{'epoch': 1}",
'pred_weighted_slices_range': 'None',
'predict_threshold': '0.5',
'report_configs': 'dict',
'report_fontpath': 'C:\\deeptrain\\deeptrain\\util\\fonts\\consola.ttf',
'reset_statefuls': 'False',
'saveskip_list': "['model', 'optimizer_state', 'callbacks', 'key_metric_fn', "
"'custom_metrics', 'metric_to_alias', 'alias_to_metric', "
"'name_process_key_fn', '_fit_fn', '_eval_fn', '_labels', "
"'_preds', '_y_true', '_y_preds', '_labels_cache', "
"'_preds_cache', '_sw_cache', '_imports', '_history_fig', "
"'_val_max_set_name_chars', '_max_set_name_chars', "
"'_inferred_batch_size', '_class_labels_cache', "
"'_temp_history_empty', '_val_temp_history_empty', "
"'_val_sw', '_set_num', '_val_set_num', 'labels']",
'temp_checkpoint_freq': 'None',
'temp_history': "{'loss': []}",
'train_metrics': "['loss']",
'unique_checkpoint_freq': "{'epoch': 1}",
'val_class_weights': 'None',
'val_datagen': 'deeptrain.data_generator.DataGenerator',
'val_freq': "{'epoch': 1}",
'val_history': "{'loss': []}",
'val_metrics': "['loss']",
'val_temp_history': "{'loss': []}"}
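Every value in the logged state above is a string, including numbers, booleans and containers. If typed values are needed for comparison, one possible approach is sketched below with the standard library's `ast.literal_eval`. The helper `parse_logged_value` is a hypothetical name, not part of DeepTrain, and the sample dict copies a few entries from the output above rather than reading the real file.

```python
import ast


def parse_logged_value(text):
    """Return the Python literal encoded by `text`, or `text` itself if it is not one."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Plain names, dotted class paths and file paths are not literals; keep them as strings.
        return text


# A few entries copied from the logged state shown above.
sample = {
    '_epoch': '0',
    'batch_size': '128',
    'history': "{'loss': []}",
    'loadpath': 'None',
    'max_is_best': 'False',
    'key_metric': 'loss',
    'datagen': 'deeptrain.data_generator.DataGenerator',
}

parsed = {key: parse_logged_value(value) for key, value in sample.items()}
print(parsed)
```

Note that this conversion is lossy in one direction: a value that was originally the string '128' would also come back as an integer, so the result is best treated as a convenience for inspection.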
Save directories¶
[5]:
print("Best model directory:", tg.best_models_dir)
print("Checkpoint directory:", tg.logdir)
print("Model full name:", tg.model_name)
Best model directory: C:\deeptrain\examples\dir\models
Checkpoint directory: C:\deeptrain\examples\dir\logs\M5__model-nadam__min999.000
Model full name: M5__model-nadam__min.144
Interrupts¶
Interrupts can be inspected by checking the pertinent attributes manually (_train_loop_done, _train_postiter_processed, _val_loop_done, _val_postiter_processed), or by calling interrupt_status(), which checks these and prints an appropriate message.
[6]:
tg.interrupt_status()
No interrupts detected.
Flags checked:
_train_loop_done = False
_train_postiter_processed = True
_val_loop_done = False
_val_postiter_processed = True
[6]:
(False, False)
Interrupts can be manual (KeyboardInterrupt) or due to a raised Exception; either interrupts the flow of train/validation, so knowing at which point the fault occurred allows us to correct it manually (e.g. execute the portion of code that follows the exception).
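The general idea of locating a fault from a status flag can be shown with a toy loop. This is only a sketch: `ToyLoop` and its attributes are hypothetical and do not reproduce TrainGenerator's logic. A flag is cleared when an iteration starts and set again once its bookkeeping is complete, so a flag left cleared indicates that the iteration was cut short.

```python
class ToyLoop:
    """Toy stand-in for a train loop; not DeepTrain's implementation."""

    def __init__(self):
        self.postiter_processed = True  # True whenever no iteration is half-finished
        self.history = []

    def step(self, value, fail=False):
        self.postiter_processed = False  # work started, results not yet recorded
        result = value * 2               # stand-in for fitting a batch
        if fail:
            raise RuntimeError("simulated fault between fit and bookkeeping")
        self.history.append(result)      # stand-in for post-iteration processing
        self.postiter_processed = True


loop = ToyLoop()
loop.step(1)  # completes normally

try:
    loop.step(2, fail=True)  # interrupted before the result is recorded
except RuntimeError:
    pass

interrupted = not loop.postiter_processed
print("interrupted:", interrupted, "| history:", loop.history)
```

Here the second result never reaches `history`, and the cleared flag is what reveals that the remaining bookkeeping still has to be run or the iteration repeated.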
Interrupt example¶
[7]:
tg._train_loop_done = True
tg._val_loop_done = True
tg._val_postiter_processed = True
At this point _on_val_end() is called automatically, so if you’re able to access such a state, it means the call didn’t finish or was never initiated.
[8]:
tg.interrupt_status()
Incomplete or not called `_on_val_end()` within `validate()`.
Interrupted: train[no], validation[yes].
Flags checked:
_train_loop_done = True
_train_postiter_processed = True
_val_loop_done = True
_val_postiter_processed = True
[8]:
(False, True)
Example 2¶
[9]:
tg._val_loop_done = False
tg._val_postiter_processed = False
tg.interrupt_status()
Interrupted during validation loop within `validate()`; incomplete or not called `_val_postiter_processing()`.
Interrupted: train[no], validation[yes].
Flags checked:
_train_loop_done = True
_train_postiter_processed = True
_val_loop_done = False
_val_postiter_processed = False
[9]:
(False, True)
[10]:
help(tg.train)
Help on method train in module deeptrain.train_generator:
train() method of deeptrain.train_generator.TrainGenerator instance
The train loop.
- Fetches data from `get_data`
- Fits data via `fin_fn`
- Processes fit metrics in `_train_postiter_processing`
- Stores metrics in `history`
- Applies `'train:iter'`, `'train:batch'`, and `'train:epoch'`
callbacks
- Calls `validate` when appropriate
**Interruption**:
- *Safe*: during `get_data`, which can be called indefinitely
without changing any attributes.
- *Avoid*: during `_train_postiter_processing`, where `fit_fn` is
applied and weights are updated - but metrics aren't stored, and
`_train_postiter_processed=False`, restarting the loop without
recording progress.
- Best bet is during :meth:`validate`, as `get_data` may be too brief.
[11]:
help(tg.validate)
Help on method validate in module deeptrain.train_generator:
validate(record_progress=True, clear_cache=True, restart=False, use_callbacks=True) method of deeptrain.train_generator.TrainGenerator instance
Validation loop.
- Fetches data from `get_data`
- Applies function based on `_eval_fn_name`
- Processes and caches metrics/predictions in
`_val_postiter_processing`
- Applies `'val:iter'`, `'val:batch'`, and `'val:epoch'` callbacks
- Calls `_on_val_end` at end of validation to compute metrics
and store them in `val_history`
- Applies `'val_end'` and maybe `('val_end': 'train:epoch')` callbacks
- If `restart`, calls :meth:`reset_validation`.
**Arguments**:
record_progress: bool
If False, won't update `val_history`, `_val_iters`,
`_batches_validated`.
clear_cache: bool
If False, won't call :meth:`clear_cache`; useful for keeping
preds & labels acquired during validation.
restart: bool
If True, will call :meth:`reset_valiation` before validation loop
to reset validation attributes; useful for starting afresh (e.g.
if interrupted).
use_callbacks: bool
If False, won't call :meth:`apply_callbacks`
or :meth:`plot_history`.
**Interruption:**
- *Safe*: during `get_data`, which can be called indefinitely
without changing any attributes.
- *Avoid*: during `_val_postiter_processing`. Model remains
unaffected*, but caches are updated; a restart may yield duplicate
appending, which will error or yield inaccuracies.
(* forward pass may consume random seed if random ops are used)
- *In practice*: prefer interrupting immediately after
`_print_iter_progress` executes.
Interrupts can also be inspected by checking temp_history, val_temp_history, and cache attributes (e.g. _preds_cache); cache attributes clear by default when validate() finishes. Check help(train) and help(validate) for further interrupt guidelines.
DataGenerator attributes¶
set_nums_to_process holds the set nums remaining until the end of the epoch, after which it is reset to set_nums_original. “Set” refers to a data file to load.
[12]:
# We can check which set numbers remain to be processed in epoch or validation:
print(dg.set_nums_to_process)
print(vdg.set_nums_to_process)
# We can arbitrarily append to or pop from the list to skip or repeat a batch
['42', '38', '34', '20', '25', '41', '14', '33', '30', '5', '19', '32', '11', '28', '46', '40', '27', '24', '2', '21', '9', '17', '1', '29', '43', '26', '23', '36', '7', '6', '48', '4', '39', '13', '12', '37', '45', '18', '44', '35', '10', '31', '22', '47', '8', '16', '15', '3']
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36']
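The effect of popping from or appending to such a list can be sketched with plain Python lists, without touching a real DataGenerator. The variable names mirror the attributes above, but the consumption order (taking sets from the front of the list) is an assumption made for this illustration and may differ from what DataGenerator actually does.

```python
# Toy stand-ins for the DataGenerator attributes; plain lists, not the real objects.
set_nums_original = ['1', '2', '3', '4']
set_nums_to_process = set_nums_original.copy()

set_nums_to_process.remove('1')  # skip set '1' this epoch
set_nums_to_process.append('3')  # repeat set '3' at the end of the epoch

processed = []
while set_nums_to_process:
    # Assumption: sets are consumed from the front of the list.
    processed.append(set_nums_to_process.pop(0))

# At the end of the epoch the list is restored from the original.
set_nums_to_process = set_nums_original.copy()

print(processed)
print(set_nums_to_process)
```

Working on a copy keeps `set_nums_original` intact, so manual edits affect only the current epoch.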
Info function¶
Lastly, we can access most of the above via info():
[13]:
tg.info()
Epochs: 1/1
Train batches fit: 0/48 (in current epoch)
Val batches fit: 0/36 (in current validation)
--------------------------------------------------------------------------------
Best model directory: C:\deeptrain\examples\dir\models
Checkpoint directory: C:\deeptrain\examples\dir\logs\M5__model-nadam__min999.000
Load path: None
Model full name: M5__model-nadam__min.144
--------------------------------------------------------------------------------
Interrupted during validation loop within `validate()`; incomplete or not called `_val_postiter_processing()`.
Interrupted: train[no], validation[yes].
Flags checked:
_train_loop_done = True
_train_postiter_processed = True
_val_loop_done = False
_val_postiter_processed = False