2.1.1.1.1.8. emicroml.modelling.cbed.distortion.estimation.split_ml_dataset_file

split_ml_dataset_file(input_ml_dataset_filename, output_ml_dataset_filename_1='ml_dataset_for_training.h5', output_ml_dataset_filename_2='ml_dataset_for_validation.h5', output_ml_dataset_filename_3='ml_dataset_for_testing.h5', split_ratio=(80, 10, 10), rng_seed=None, rm_input_ml_dataset_file=False, max_num_ml_data_instances_per_file_update=100)

Split file storing a machine learning dataset.

The current function copies the machine learning (ML) data instances stored in an input HDF5 file, and distributes those copies among at most three new output HDF5 files.

The input HDF5 file and the output HDF5 files are assumed to have the same file structure as an HDF5 file generated by the function emicroml.modelling.cbed.distortion.estimation.generate_and_save_ml_dataset(). See the documentation of said function for a description of the file structure. Moreover, the input HDF5 file is assumed to have been created in a manner that is consistent with the way HDF5 files are generated by the function emicroml.modelling.cbed.distortion.estimation.generate_and_save_ml_dataset().

Unlike the combining of ML datasets, as implemented in emicroml.modelling.cbed.distortion.estimation.combine_ml_dataset_files(), no renormalization is performed in the process of splitting a machine learning dataset.

The actual number of output HDF5 files is determined by the parameter split_ratio. The distribution of the copies of the input ML data instances is determined by the total number of input ML data instances num_input_ml_data_instances and the parameters split_ratio and rng_seed.

From split_ratio and num_input_ml_data_instances, the current function calculates an adjusted split ratio adjusted_split_ratio and uses this adjusted split ratio to distribute the copies of the input ML data instances. adjusted_split_ratio is calculated by:

import numpy as np

adjusted_split_ratio = (num_input_ml_data_instances
                        * np.array(split_ratio)
                        / np.sum(split_ratio))
adjusted_split_ratio = np.round(adjusted_split_ratio).astype(int)

for idx, _ in enumerate(adjusted_split_ratio):
    discrepancy = (num_input_ml_data_instances 
                   - np.sum(adjusted_split_ratio))
    if discrepancy*adjusted_split_ratio[idx] != 0:
        adjustment_candidate = (adjusted_split_ratio[idx]
                                + np.sign(discrepancy))
        if adjustment_candidate >= 0:
            adjusted_split_ratio[idx] = adjustment_candidate
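To make the calculation above concrete, the following is a minimal sketch that wraps the same steps in a standalone function and evaluates it for two dataset sizes. The function name compute_adjusted_split_ratio is hypothetical and is not part of emicroml; it only mirrors the snippet above.

```python
import numpy as np

def compute_adjusted_split_ratio(num_input_ml_data_instances, split_ratio):
    """Hypothetical helper mirroring the documented calculation."""
    # Scale the split ratio so that it sums (before rounding) to the
    # total number of input ML data instances.
    adjusted = (num_input_ml_data_instances
                * np.array(split_ratio)
                / np.sum(split_ratio))
    adjusted = np.round(adjusted).astype(int)

    # Rounding can leave the sum off from the total; nudge nonzero
    # entries by one, in order, while a discrepancy remains.
    for idx, _ in enumerate(adjusted):
        discrepancy = num_input_ml_data_instances - np.sum(adjusted)
        if discrepancy * adjusted[idx] != 0:
            candidate = adjusted[idx] + np.sign(discrepancy)
            if candidate >= 0:
                adjusted[idx] = candidate

    return adjusted.tolist()

# 10 instances split 80/10/10: rounding is already exact.
print(compute_adjusted_split_ratio(10, (80, 10, 10)))  # [8, 1, 1]

# 7 instances split 80/10/10: rounding gives [6, 1, 1] (sum 8), so the
# first entry is reduced by one.
print(compute_adjusted_split_ratio(7, (80, 10, 10)))   # [5, 1, 1]
```

In the second case the entries of the result add up to the 7 input ML data instances.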

We describe below how the copies of the input ML data instances are effectively distributed.

1. Copy the input ML data instances.

2. Reorder the copy of the input ML data instances using a random number generator with the seed specified by rng_seed.

3. If adjusted_split_ratio[0] > 0 go to instruction 4. Otherwise, go to instruction 7.

4. Set i to 0.

5. Set j to i+adjusted_split_ratio[0]-1.

6. Store the copies of the input ML data instances indexed from i to j (i.e. including the j th instance) after reordering in a new output HDF5 file at a file location specified by the parameter output_ml_dataset_filename_1.

7. If adjusted_split_ratio[1] > 0 go to instruction 8. Otherwise, go to instruction 11.

8. Set i to j+1.

9. Set j to i+adjusted_split_ratio[1]-1.

10. Store the copies of the input ML data instances indexed from i to j (i.e. including the j th instance) after reordering in a new output HDF5 file at a file location specified by the parameter output_ml_dataset_filename_2.

11. If adjusted_split_ratio[2] > 0 go to instruction 12. Otherwise, stop.

12. Set i to j+1.

13. Set j to i+adjusted_split_ratio[2]-1.

14. Store the copies of the input ML data instances indexed from i to j (i.e. including the j th instance) after reordering in a new output HDF5 file at a file location specified by the parameter output_ml_dataset_filename_3.
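The instructions above amount to shuffling the instance indices and cutting the shuffled sequence into consecutive slices. The following is a rough sketch of that idea only: the function name distribute_indices is hypothetical, and the use of np.random.default_rng with permutation is an assumption, since the random number generator and reordering method that emicroml actually uses are not stated here.

```python
import numpy as np

def distribute_indices(num_input_ml_data_instances,
                       adjusted_split_ratio,
                       rng_seed=None):
    """Hypothetical sketch: return one index array per potential output file."""
    # Assumption: a NumPy Generator stands in for the library's RNG.
    rng = np.random.default_rng(rng_seed)
    reordered = rng.permutation(num_input_ml_data_instances)

    subsets = []
    start = 0
    for count in adjusted_split_ratio:
        stop = start + count
        # An empty slice corresponds to an output file that is not created.
        subsets.append(reordered[start:stop])
        start = stop
    return subsets

subsets = distribute_indices(10, (8, 1, 1), rng_seed=0)
for label, subset in zip(("training", "validation", "testing"), subsets):
    print(label, len(subset))
```

Every index lands in exactly one slice, and repeating the call with the same integer seed reproduces the same assignment.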

Parameters:
input_ml_dataset_filename : str

The relative or absolute filename of the input HDF5 file.

output_ml_dataset_filename_1 : str, optional

The relative or absolute filename of the first potential output HDF5 file.

output_ml_dataset_filename_2 : str, optional

The relative or absolute filename of the second potential output HDF5 file.

output_ml_dataset_filename_3 : str, optional

The relative or absolute filename of the third potential output HDF5 file.

split_ratio : array_like (float, ndim=1), optional

The split ratio. Must be a triplet of nonnegative numbers that add up to a positive number.

rng_seed : int | None, optional

rng_seed specifies the seed used in the random number generator that reorders the copies of the input ML data instances, and hence determines which instances are stored in which output HDF5 file.

rm_input_ml_dataset_file : bool, optional

If rm_input_ml_dataset_file is set to True, then the input HDF5 file is deleted after all the ML data instances stored in that input HDF5 file are copied into the output HDF5 files.

max_num_ml_data_instances_per_file_update : int, optional

The number of input ML data instances to distribute per file update. The larger the value, the larger the memory requirements.
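To illustrate the memory trade-off, here is a minimal sketch of how a cap on the number of instances per update translates into batches. The helper batch_bounds is hypothetical, and the assumption that instances are processed in consecutive batches of at most this size is ours; the library's internal batching is not described here.

```python
def batch_bounds(num_ml_data_instances, max_per_update):
    """Hypothetical helper: yield (start, stop) bounds of each batch."""
    for start in range(0, num_ml_data_instances, max_per_update):
        # The final batch may be smaller than max_per_update.
        yield start, min(start + max_per_update, num_ml_data_instances)

# 250 instances with at most 100 per update: three batches.
print(list(batch_bounds(250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

Under this assumption, a larger cap means fewer updates but more instances held in memory at once.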