2.1.1.1.1.1. emicroml.modelling.cbed.distortion.estimation.combine_ml_dataset_files
- combine_ml_dataset_files(input_ml_dataset_filenames, output_ml_dataset_filename='ml_dataset.h5', rm_input_ml_dataset_files=False, max_num_ml_data_instances_per_file_update=100)
Combine files storing machine learning datasets.
The current function copies the machine learning (ML) data instances stored in a set of input HDF5 files, and stores all those copies into a single new output HDF5 file.
The input HDF5 files and the output HDF5 file are assumed to have the same file structure as an HDF5 file generated by the function emicroml.modelling.cbed.distortion.estimation.generate_and_save_ml_dataset(). See the documentation of said function for a description of the file structure. Moreover, the input HDF5 files are assumed to have been created in a manner that is consistent with the way HDF5 files are generated by the function emicroml.modelling.cbed.distortion.estimation.generate_and_save_ml_dataset().
As discussed in the aforementioned documentation, some of the HDF5 datasets are normalizable. Prior to combining all the copies of the input ML data instances into a single new output HDF5 file, the copies of the normalizable input HDF5 datasets are unnormalized. After combining the input ML data instances into the new output HDF5 file, the normalizable HDF5 datasets therein are min-max normalized, only this time with respect to all ML data instances.
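The unnormalize, combine, and renormalize sequence described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: it assumes, for illustration only, that each input file carries the minimum and maximum that were used to normalize it, and the names unnormalize, min_max_normalize, and combine_normalized are hypothetical. How the actual files record their normalization is described in the documentation of generate_and_save_ml_dataset().

```python
def unnormalize(normalized_values, lo, hi):
    """Invert min-max normalization: x = x_norm * (hi - lo) + lo."""
    return [v * (hi - lo) + lo for v in normalized_values]


def min_max_normalize(values):
    """Min-max normalize values; return them with the min and max used."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical, so map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values], lo, hi
    return [(v - lo) / span for v in values], lo, hi


def combine_normalized(shards):
    """Combine (normalized_values, lo, hi) shards into one renormalized list.

    Hypothetical helper: the (values, lo, hi) layout is an assumption.
    """
    combined = []
    for normalized_values, lo, hi in shards:
        # Undo each shard's own normalization before pooling the values.
        combined.extend(unnormalize(normalized_values, lo, hi))
    # Renormalize with respect to all pooled values.
    return min_max_normalize(combined)


shard_a = ([0.0, 0.5, 1.0], 2.0, 4.0)  # raw values 2, 3, 4
shard_b = ([0.0, 1.0], 0.0, 10.0)      # raw values 0, 10
renormalized, global_lo, global_hi = combine_normalized([shard_a, shard_b])
print(renormalized, global_lo, global_hi)
```

The point of the sketch is that values normalized per file are not comparable across files: 1.0 means 4 in the first shard but 10 in the second, so the shards have to be mapped back to a common scale before a single normalization is applied.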
- Parameters:
- input_ml_dataset_filenames : array_like (str, ndim=1)
The relative or absolute filenames of the input HDF5 files storing the ML datasets of interest.
- output_ml_dataset_filename : str, optional
The relative or absolute filename of the output HDF5 file.
- rm_input_ml_dataset_files : bool, optional
If rm_input_ml_dataset_files is set to True, then the input HDF5 files are deleted after all the ML data instances stored in those input HDF5 files are copied into the output HDF5 file.
- max_num_ml_data_instances_per_file_update : int, optional
The maximum number of ML data instances to write to the output file per file update. The larger the value, the larger the memory requirements.
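A call to the function might look like the following sketch. The keyword arguments are those of the signature above. The wrapper combine_shards and the helper batch_ranges are hypothetical and not part of emicroml. batch_ranges only illustrates, under the assumption that instances are written in consecutive batches, how a cap on the batch size bounds the number of instances held in memory at once.

```python
def batch_ranges(num_instances, max_per_update=100):
    """Yield (start, stop) index pairs covering num_instances in batches.

    Hypothetical helper, not part of emicroml.
    """
    if max_per_update < 1:
        raise ValueError("max_per_update must be a positive integer")
    for start in range(0, num_instances, max_per_update):
        yield start, min(start + max_per_update, num_instances)


def combine_shards(input_filenames, output_filename="ml_dataset.h5"):
    """Combine shard files using the documented keyword arguments."""
    # Imported lazily so this sketch can be loaded without emicroml installed.
    import emicroml.modelling.cbed.distortion.estimation as estimation

    estimation.combine_ml_dataset_files(
        input_ml_dataset_filenames=input_filenames,
        output_ml_dataset_filename=output_filename,
        rm_input_ml_dataset_files=False,  # keep the input files on disk
        max_num_ml_data_instances_per_file_update=100,
    )


# 250 instances written at most 100 at a time are covered by three batches.
print(list(batch_ranges(250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

Leaving rm_input_ml_dataset_files at False is the cautious choice for a first run, since the input files can then be inspected or recombined if the output is not what was expected.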