2.1.1.1.1.8. emicroml.modelling.cbed.distortion.estimation.split_ml_dataset_file
- split_ml_dataset_file(input_ml_dataset_filename, output_ml_dataset_filename_1='ml_dataset_for_training.h5', output_ml_dataset_filename_2='ml_dataset_for_validation.h5', output_ml_dataset_filename_3='ml_dataset_for_testing.h5', split_ratio=(80, 10, 10), rng_seed=None, rm_input_ml_dataset_file=False, max_num_ml_data_instances_per_file_update=100)
Split file storing a machine learning dataset.
The current function copies the machine learning (ML) data instances stored in an input HDF5 file, and distributes those copies among at most three new output HDF5 files.
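As a rough illustration, a call might be set up as in the sketch below. The input filename, the seed value, and the helper names build_split_kwargs and split_dataset are placeholders introduced here and are not part of emicroml; the keyword names and the other values are taken from the signature above. The import is done lazily inside the helper so that the keyword arguments can be inspected without emicroml installed.

```python
def build_split_kwargs(input_ml_dataset_filename, rng_seed=0):
    """Collect keyword arguments for split_ml_dataset_file (hypothetical helper)."""
    return {
        "input_ml_dataset_filename": input_ml_dataset_filename,
        "output_ml_dataset_filename_1": "ml_dataset_for_training.h5",
        "output_ml_dataset_filename_2": "ml_dataset_for_validation.h5",
        "output_ml_dataset_filename_3": "ml_dataset_for_testing.h5",
        "split_ratio": (80, 10, 10),  # training / validation / testing
        "rng_seed": rng_seed,  # fixed seed so the split is reproducible
        "rm_input_ml_dataset_file": False,  # keep the input file
    }


def split_dataset(input_ml_dataset_filename, rng_seed=0):
    """Split the given input file (hypothetical wrapper; requires emicroml)."""
    # Imported here rather than at module level; see the note above.
    import emicroml.modelling.cbed.distortion.estimation as estimation

    kwargs = build_split_kwargs(input_ml_dataset_filename, rng_seed)
    estimation.split_ml_dataset_file(**kwargs)
    return kwargs
```

Calling split_dataset("ml_dataset.h5") assumes that a file with that name exists and was generated as described in this documentation.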
The input HDF5 file and the output HDF5 files are assumed to have the same file structure as an HDF5 file generated by the function emicroml.modelling.cbed.distortion.estimation.generate_and_save_ml_dataset(). See the documentation of said function for a description of the file structure. Moreover, the input HDF5 file is assumed to have been created in a manner that is consistent with the way HDF5 files are generated by the function emicroml.modelling.cbed.distortion.estimation.generate_and_save_ml_dataset().
Unlike the combining of ML datasets, as implemented in emicroml.modelling.cbed.distortion.estimation.combine_ml_dataset_files(), no renormalization is performed in the process of splitting a machine learning dataset.
The actual number of output HDF5 files is determined by the parameter split_ratio. The distribution of the copies of the input ML data instances is determined by the total number of input ML data instances num_input_ml_data_instances, and the parameters split_ratio and rng_seed.
From split_ratio and num_input_ml_data_instances, the current function calculates an adjusted split ratio adjusted_split_ratio and uses this adjusted split ratio to distribute the copies of the input ML data instances. adjusted_split_ratio
is calculated by:

    import numpy as np

    # Scale the split ratio so that its entries sum to the number of instances.
    adjusted_split_ratio = (num_input_ml_data_instances
                            * np.array(split_ratio)
                            / np.sum(split_ratio))
    adjusted_split_ratio = np.round(adjusted_split_ratio).astype(int)

    # Correct any rounding discrepancy, one nonzero entry at a time.
    for idx, _ in enumerate(adjusted_split_ratio):
        discrepancy = (num_input_ml_data_instances
                       - np.sum(adjusted_split_ratio))
        if discrepancy*adjusted_split_ratio[idx] != 0:
            adjustment_candidate = (adjusted_split_ratio[idx]
                                    + np.sign(discrepancy))
            if adjustment_candidate >= 0:
                adjusted_split_ratio[idx] = adjustment_candidate

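To make the calculation concrete, the sketch below wraps it in a function and evaluates it for a few small datasets. The function name compute_adjusted_split_ratio is hypothetical and is not part of emicroml; the body is assumed to follow the calculation described above. Note that np.round rounds halves to the nearest even integer, so 2.5 rounds to 2 and 1.5 rounds to 2.

```python
import numpy as np


def compute_adjusted_split_ratio(num_input_ml_data_instances, split_ratio):
    """Return integer subset sizes (hypothetical helper, not part of emicroml)."""
    adjusted = (num_input_ml_data_instances
                * np.array(split_ratio)
                / np.sum(split_ratio))
    adjusted = np.round(adjusted).astype(int)

    for idx, _ in enumerate(adjusted):
        # Number of instances still unassigned (positive) or
        # over-assigned (negative) after rounding.
        discrepancy = num_input_ml_data_instances - np.sum(adjusted)
        if discrepancy*adjusted[idx] != 0:
            candidate = adjusted[idx] + np.sign(discrepancy)
            if candidate >= 0:
                adjusted[idx] = candidate

    return adjusted


print(compute_adjusted_split_ratio(100, (80, 10, 10)).tolist())  # [80, 10, 10]
print(compute_adjusted_split_ratio(25, (80, 10, 10)).tolist())   # [21, 2, 2]
print(compute_adjusted_split_ratio(15, (80, 10, 10)).tolist())   # [11, 2, 2]
print(compute_adjusted_split_ratio(3, (80, 10, 10)).tolist())    # [3, 0, 0]
```

In the last case only the first entry is nonzero, so only the first output HDF5 file would be created.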
We describe below how the copies of the input ML data instances are distributed effectively.
1. Copy the input ML data instances.
2. Reorder the copy of the input ML data instances using a random number generator with the seed specified by rng_seed.
3. If adjusted_split_ratio[0] > 0, go to instruction 4. Otherwise, go to instruction 7.
4. Set i to 0.
5. Set j to i+adjusted_split_ratio[0]-1.
6. Store the copies of the input ML data instances indexed from i to j (i.e. including the j th instance) after reordering in a new output HDF5 file at a file location specified by the parameter output_ml_dataset_filename_1.
7. If adjusted_split_ratio[1] > 0, go to instruction 8. Otherwise, go to instruction 11.
8. Set i to j+1.
9. Set j to i+adjusted_split_ratio[1]-1.
10. Store the copies of the input ML data instances indexed from i to j (i.e. including the j th instance) after reordering in a new output HDF5 file at a file location specified by the parameter output_ml_dataset_filename_2.
11. If adjusted_split_ratio[2] > 0, go to instruction 12. Otherwise, stop.
12. Set i to j+1.
13. Set j to i+adjusted_split_ratio[2]-1.
14. Store the copies of the input ML data instances indexed from i to j (i.e. including the j th instance) after reordering in a new output HDF5 file at a file location specified by the parameter output_ml_dataset_filename_3.
- Parameters:
- input_ml_dataset_filename : str
The relative or absolute filename of the input HDF5 file.
- output_ml_dataset_filename_1 : str, optional
The relative or absolute filename of the first potential output HDF5 file.
- output_ml_dataset_filename_2 : str, optional
The relative or absolute filename of the second potential output HDF5 file.
- output_ml_dataset_filename_3 : str, optional
The relative or absolute filename of the third potential output HDF5 file.
- split_ratio : array_like (float, ndim=1), optional
The split ratio. Must be a triplet of nonnegative numbers that add up to a positive number.
- rng_seed : int | None, optional
rng_seed specifies the seed used in the random number generator, which determines how the ML data instances are distributed.
- rm_input_ml_dataset_file : bool, optional
If rm_input_ml_dataset_file is set to True, then the input HDF5 file is deleted after all the ML data instances stored in that input HDF5 file are copied into the output HDF5 files.
- max_num_ml_data_instances_per_file_update : int, optional
The number of input ML data instances to distribute per file update. The larger the value, the larger the memory requirements.
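The distribution procedure given in instructions 1 to 14 can be illustrated on instance indices alone, without any file I/O. The sketch below is an assumption-laden illustration and not the emicroml implementation: the function name distribute_indices is hypothetical, numpy.random.default_rng stands in for whatever random number generator the library actually uses, and j is initialized to -1 so that the index bookkeeping is defined even when the first subset is empty. An empty subset corresponds to an output HDF5 file that is not created.

```python
import numpy as np


def distribute_indices(num_input_ml_data_instances,
                       adjusted_split_ratio,
                       rng_seed=None):
    """Return one index array per potential output file (hypothetical helper)."""
    # Instruction 2: reorder the instances with a seeded generator
    # (assumed generator; the library may use a different one).
    rng = np.random.default_rng(rng_seed)
    reordered = rng.permutation(num_input_ml_data_instances)

    subsets = []
    j = -1  # assumption: no instance has been assigned yet
    for subset_size in adjusted_split_ratio:
        i = j + 1                # first index of the current subset
        j = i + subset_size - 1  # last index of the current subset (inclusive)
        subsets.append(reordered[i:j+1])

    return subsets


training, validation, testing = distribute_indices(25, [21, 2, 2], rng_seed=0)
print(len(training), len(validation), len(testing))  # 21 2 2
```

Every instance index appears in exactly one subset, and repeating the call with the same rng_seed reproduces the same subsets.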