2.1.1.1.1.2. emicroml.modelling.cbed.distortion.estimation.generate_and_save_ml_dataset
- generate_and_save_ml_dataset(num_cbed_patterns=1500, max_num_disks_in_any_cbed_pattern=90, cbed_pattern_generator=None, output_filename='ml_dataset.h5', max_num_ml_data_instances_per_file_update=100)
Generate a machine learning dataset.
According to the parameters described below, the current function generates a file storing a machine learning (ML) dataset that can be used to train and/or evaluate ML models represented by the class emicroml.modelling.cbed.distortion.estimation.MLModel.
The number of ML data instances to be generated is specified by the parameter num_cbed_patterns. Each ML data instance is derived from a “fake” CBED pattern, with each fake CBED pattern being generated from a fake CBED pattern generator that is specified by the parameter cbed_pattern_generator. The maximum number of (fake) CBED disks that can appear in any generated fake CBED pattern is specified by the parameter max_num_disks_in_any_cbed_pattern.
cbed_pattern_generator can be set to either None, or any object that satisfies the following:
1. cbed_pattern_generator must have a method called generate which returns an instance of the class fakecbed.discretized.CBEDPattern upon calling said method via cbed_pattern_generator.generate().
2. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), fake_cbed_pattern.core_attrs["num_pixels_across_pattern"] must yield the same integer value num_pixels_across_each_pattern.
3. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), fake_cbed_pattern.core_attrs["undistorted_disks"] must be a nonempty sequence undistorted_disks where for each element undistorted_disk of the sequence, undistorted_disk.core_attrs["support"] must yield an instance undistorted_disk_support of the class fakecbed.shapes.Circle, with undistorted_disk_support.core_attrs["radius"] yielding a positive number common_undistorted_disk_radius. The number common_undistorted_disk_radius has the same value for all elements of the sequence undistorted_disks of the same object fake_cbed_pattern. From one object fake_cbed_pattern returned by cbed_pattern_generator.generate() to another, the value of common_undistorted_disk_radius can change. Further below we refer to common_undistorted_disk_radius as the common undistorted disk radius.
4. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), fake_cbed_pattern.core_attrs["distortion_model"].is_standard must yield True.
5. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), (~fake_cbed_pattern.disk_absence_registry).sum().item() must yield an integer less than or equal to max_num_disks_in_any_cbed_pattern.
If cbed_pattern_generator is set to None, then the parameter will be reassigned to the value of emicroml.modelling.cbed.distortion.estimation.DefaultCBEDPatternGenerator(), which satisfies the same conditions described above.
As alluded to above, each valid object
fake_cbed_pattern returned by cbed_pattern_generator.generate() stores an instance distortion_model of the class distoptica.DistortionModel, accessed by fake_cbed_pattern.core_attrs["distortion_model"]. distortion_model is the distortion model that determines the distortion field applied to the fake CBED pattern represented by fake_cbed_pattern. See the documentation for distoptica.DistortionModel for additional context. As implied above, distortion_model is a “standard” distortion model, meaning that the corresponding coordinate transformation \(T_{⌑;x}\left(u_{x},u_{y}\right)\) that describes the optical distortions can be specified equivalently by an instance standard_coord_transform_params of distoptica.StandardCoordTransformParams. standard_coord_transform_params is the standard coordinate transformation parameter set of the fake CBED pattern. As discussed in the documentation for the class distoptica.StandardCoordTransformParams, each instance of said class specifies a distortion center \(\left(x_{c;D},y_{c;D}\right)\), a quadratic radial distortion amplitude \(A_{r;0,2}\), an elliptical distortion vector \(\left(A_{r;2,0},B_{r;1,0}\right)\), a spiral distortion amplitude \(A_{t;0,2}\), and a parabolic distortion vector \(\left(A_{r;1,1},B_{r;0,1}\right)\).
As alluded to above, each valid object
fake_cbed_pattern returned by cbed_pattern_generator.generate() stores a nonempty sequence undistorted_disks, accessed by fake_cbed_pattern.core_attrs["undistorted_disks"]. For every nonnegative integer k less than fake_cbed_pattern.num_disks, undistorted_disks[k] specifies the intensity pattern of the k-th undistorted fake CBED disk of the fake CBED pattern represented by fake_cbed_pattern. The center of the k-th undistorted fake CBED disk can be accessed by undistorted_disks[k].core_attrs["support"].core_attrs["center"]. During the process of deriving an ML data instance from fake_cbed_pattern, the intra-disk averages of the distorted fake CBED disks of the fake CBED pattern are calculated, where the k-th distorted fake CBED disk corresponds to the k-th undistorted fake CBED disk, i.e. the former is obtained by distorting the latter. The intra-disk average kth_intra_disk_avg of the k-th distorted fake CBED disk is calculated by
    kth_intra_disk_sum = (fake_cbed_pattern.image
                          * fake_cbed_pattern.disk_supports[k]).sum().item()
    kth_disk_area = (fake_cbed_pattern.disk_supports[k].sum().item()
                     / (fake_cbed_pattern.image.shape[0]**2))
    if (kth_disk_area > 0):
        kth_intra_disk_avg = kth_intra_disk_sum/kth_disk_area
    else:
        kth_intra_disk_avg = 0
We reference intra-disk averages again further below.
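As a rough illustration of conditions 2 to 5 on cbed_pattern_generator listed above, the sketch below checks a single generated object by duck typing. The helper names check_generated_pattern and make_stub_pattern are hypothetical and are not part of emicroml or fakecbed. The stub, built from types.SimpleNamespace and a NumPy boolean array, only stands in for a real fakecbed.discretized.CBEDPattern, and the class-membership requirements in conditions 1 and 3 are deliberately left unchecked because verifying them would require fakecbed itself.

```python
from types import SimpleNamespace

import numpy as np


def check_generated_pattern(pattern, num_pixels, max_num_disks):
    """Raise ValueError if `pattern` violates conditions 2 to 5 (hypothetical helper)."""
    core_attrs = pattern.core_attrs

    # Condition 2: every pattern must report the same pixel count.
    if core_attrs["num_pixels_across_pattern"] != num_pixels:
        raise ValueError("unexpected num_pixels_across_pattern")

    # Condition 3: nonempty disk sequence sharing one positive radius.
    disks = core_attrs["undistorted_disks"]
    if len(disks) == 0:
        raise ValueError("undistorted_disks is empty")
    radii = {disk.core_attrs["support"].core_attrs["radius"] for disk in disks}
    if len(radii) != 1 or min(radii) <= 0:
        raise ValueError("disks must share a single positive radius")

    # Condition 4: the distortion model must be standard.
    if not core_attrs["distortion_model"].is_standard:
        raise ValueError("distortion model is not standard")

    # Condition 5: count the disks that are not absent.
    num_present = (~pattern.disk_absence_registry).sum().item()
    if num_present > max_num_disks:
        raise ValueError("too many disks in pattern")


def make_stub_pattern(num_disks=3, radius=0.05, num_pixels=64):
    """Build a stand-in object exposing only the attributes checked above."""
    support = SimpleNamespace(core_attrs={"radius": radius})
    disks = [SimpleNamespace(core_attrs={"support": support})
             for _ in range(num_disks)]
    return SimpleNamespace(
        core_attrs={
            "num_pixels_across_pattern": num_pixels,
            "undistorted_disks": disks,
            "distortion_model": SimpleNamespace(is_standard=True),
        },
        # NumPy stand-in (assumption) for the real disk absence registry.
        disk_absence_registry=np.zeros(num_disks, dtype=bool),
    )


# A stub with 3 visible disks passes when up to 90 disks are allowed.
check_generated_pattern(make_stub_pattern(), num_pixels=64, max_num_disks=90)
```

With a real generator, one would presumably call such a check on a few objects returned by cbed_pattern_generator.generate() before starting a long dataset generation run.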
The ML data instances generated by the current function are stored in an HDF5 file, which has the following file structure:
- cbed_pattern_images: <HDF5 3D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “row”
  - dim_2: “col”
- disk_overlap_maps: <HDF5 3D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “row”
  - dim_2: “col”
- disk_objectness_sets: <HDF5 2D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “disk idx”
- disk_clipping_registries: <HDF5 2D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “disk idx”
- undistorted_disk_center_sets: <HDF5 3D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “disk idx”
  - dim_2: “vector cmpnt idx [0->x, 1->y]”
  - normalization_weight: <float>
  - normalization_bias: <float>
- common_undistorted_disk_radii: <HDF5 1D dataset>
  - dim_0: “cbed pattern idx”
  - normalization_weight: <float>
  - normalization_bias: <float>
- distortion_centers: <HDF5 2D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “vector cmpnt idx [0->x, 1->y]”
  - normalization_weight: <float>
  - normalization_bias: <float>
- quadratic_radial_distortion_amplitudes: <HDF5 1D dataset>
  - dim_0: “cbed pattern idx”
  - normalization_weight: <float>
  - normalization_bias: <float>
- spiral_distortion_amplitudes: <HDF5 1D dataset>
  - dim_0: “cbed pattern idx”
  - normalization_weight: <float>
  - normalization_bias: <float>
- elliptical_distortion_vectors: <HDF5 2D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “vector cmpnt idx [0->x, 1->y]”
  - normalization_weight: <float>
  - normalization_bias: <float>
- parabolic_distortion_vectors: <HDF5 2D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “vector cmpnt idx [0->x, 1->y]”
  - normalization_weight: <float>
  - normalization_bias: <float>
Note that the sub-bullet points listed immediately below a given HDF5 dataset display the HDF5 attributes associated with said HDF5 dataset. Each HDF5 dataset has a set of attributes with names of the form "dim_{}".format(i) with i being an integer ranging from 0 to the rank of said HDF5 dataset minus 1. Attribute "dim_{}".format(i) of a given HDF5 dataset labels the i-th dimension of the underlying array of the dataset. The "cbed pattern idx" dimension is of the size num_cbed_patterns, the "row" dimension is of the size num_pixels_across_each_pattern, the "col" dimension is of the size num_pixels_across_each_pattern, the "disk idx" dimension is of the size max_num_disks_in_any_cbed_pattern, and the "vector cmpnt idx [0->x, 1->y]" dimension is of the size 2.
The HDF5 datasets that have attributes named "normalization_weight" and "normalization_bias" are min-max normalized, and are referred to as “normalizable”. Let hdf5_dataset be the numerical data of such an HDF5 dataset. Furthermore, let normalization_weight and normalization_bias be the values stored in the attributes "normalization_weight" and "normalization_bias" of said HDF5 dataset respectively. hdf5_dataset in this scenario is already min-max normalized. To reverse the normalization, i.e. to unnormalize the data, simply calculate (hdf5_dataset-normalization_bias) / normalization_weight.
We describe below how the data of the HDF5 datasets are effectively calculated.
1. Set N to num_pixels_across_each_pattern.
2. Set cbed_pattern_images to np.zeros((num_cbed_patterns, N, N)), where np is an alias for the NumPy library numpy.
3. Set disk_overlap_maps to np.zeros((num_cbed_patterns, N, N), dtype="int").
4. Set disk_objectness_sets to np.zeros((num_cbed_patterns, max_num_disks_in_any_cbed_pattern)).
5. Set disk_clipping_registries to np.zeros((num_cbed_patterns, max_num_disks_in_any_cbed_pattern), dtype="bool").
6. Set undistorted_disk_center_sets to np.zeros((num_cbed_patterns, max_num_disks_in_any_cbed_pattern, 2)).
7. Set common_undistorted_disk_radii to np.zeros((num_cbed_patterns,)).
8. Set distortion_centers to np.zeros((num_cbed_patterns, 2)).
9. Set quadratic_radial_distortion_amplitudes to np.zeros((num_cbed_patterns,)).
10. Set spiral_distortion_amplitudes to np.zeros((num_cbed_patterns,)).
11. Set elliptical_distortion_vectors to np.zeros((num_cbed_patterns, 2)).
12. Set parabolic_distortion_vectors to np.zeros((num_cbed_patterns, 2)).
13. Set cbed_pattern_idx to -1.
14. Set cbed_pattern_idx to cbed_pattern_idx+1.
15. Set fake_cbed_pattern to cbed_pattern_generator.generate().
16. Store fake_cbed_pattern.image.numpy(force=True) in cbed_pattern_images[cbed_pattern_idx].
17. Store fake_cbed_pattern.disk_overlap_map.numpy(force=True) in disk_overlap_maps[cbed_pattern_idx].
18. Set intra_disk_avgs to np.zeros((fake_cbed_pattern.num_disks,)).
19. Set num_elems_to_pad to max_num_disks_in_any_cbed_pattern - fake_cbed_pattern.num_disks.
20. Set single_dim_slice to slice(0, max_num_disks_in_any_cbed_pattern).
21. For every nonnegative integer k less than fake_cbed_pattern.num_disks, store the intra-disk average of the k-th distorted fake CBED disk of fake_cbed_pattern in intra_disk_avgs[k].
22. Set new_disk_order to np.argsort(intra_disk_avgs)[::-1].
23. Set disk_objectness_set to (intra_disk_avgs > 0).astype("float").
24. Set disk_objectness_set to disk_objectness_set[new_disk_order].
25. Pad the end of the zeroth axis of disk_objectness_set with num_elems_to_pad elements of value 0.
26. Set disk_objectness_set to disk_objectness_set[single_dim_slice].
27. Store disk_objectness_set in disk_objectness_sets[cbed_pattern_idx].
28. Set disk_clipping_registry to fake_cbed_pattern.disk_clipping_registry.numpy(force=True).
29. Pad the end of the zeroth axis of disk_clipping_registry with num_elems_to_pad elements of value 0.
30. Set disk_clipping_registry to disk_clipping_registry[single_dim_slice].
31. Store disk_clipping_registry in disk_clipping_registries[cbed_pattern_idx].
32. Set undistorted_disk_center_set to np.ones((fake_cbed_pattern.num_disks, 2))/2.
33. For every nonnegative integer k less than fake_cbed_pattern.num_disks, if intra_disk_avgs[k]>0 then store the center of the k-th undistorted fake CBED disk of fake_cbed_pattern in undistorted_disk_center_set[k].
34. Pad the end of the zeroth axis of undistorted_disk_center_set with num_elems_to_pad elements of value 0.5.
35. Set undistorted_disk_center_set to undistorted_disk_center_set[single_dim_slice].
36. Store undistorted_disk_center_set in undistorted_disk_center_sets[cbed_pattern_idx].
37. Store the common undistorted disk radius of fake_cbed_pattern in common_undistorted_disk_radii[cbed_pattern_idx].
38. Store the distortion center of fake_cbed_pattern in distortion_centers[cbed_pattern_idx].
39. Store the quadratic radial distortion amplitude of fake_cbed_pattern in quadratic_radial_distortion_amplitudes[cbed_pattern_idx].
40. Store the spiral distortion amplitude of fake_cbed_pattern in spiral_distortion_amplitudes[cbed_pattern_idx].
41. Store the elliptical distortion vector of fake_cbed_pattern in elliptical_distortion_vectors[cbed_pattern_idx].
42. Store the parabolic distortion vector of fake_cbed_pattern in parabolic_distortion_vectors[cbed_pattern_idx].
43. If cbed_pattern_idx < num_cbed_patterns-1, then go to instruction 14. Otherwise, go to instruction 44.
44. Min-max normalize all normalizable HDF5 datasets.
45. Stop.
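The per-pattern bookkeeping in instructions 18 to 27 and 32 to 36 can be sketched with plain NumPy arrays. This is an illustrative reading of the instructions as written, not the library's implementation. The names intra_disk_avg and build_disk_targets are hypothetical, the toy image, disk_supports and disk_centers arrays stand in for the corresponding data of a real fake_cbed_pattern, and clamping num_elems_to_pad at zero is an assumption made so that np.pad never receives a negative width.

```python
import numpy as np


def intra_disk_avg(image, disk_support):
    """Intra-disk average as defined earlier in this section."""
    intra_disk_sum = float((image * disk_support).sum())
    # Support area expressed as a fraction of the N x N image.
    disk_area = float(disk_support.sum()) / (image.shape[0] ** 2)
    return intra_disk_sum / disk_area if disk_area > 0 else 0.0


def build_disk_targets(image, disk_supports, disk_centers, max_num_disks):
    """Return (disk_objectness_set, undistorted_disk_center_set) for one pattern."""
    num_disks = len(disk_supports)
    disk_centers = np.asarray(disk_centers, dtype=float)

    # Instructions 18 and 21: one intra-disk average per disk.
    intra_disk_avgs = np.zeros((num_disks,))
    for k in range(num_disks):
        intra_disk_avgs[k] = intra_disk_avg(image, disk_supports[k])

    # Instructions 19 and 20 (clamping at zero is an assumption of this sketch).
    num_elems_to_pad = max(max_num_disks - num_disks, 0)
    single_dim_slice = slice(0, max_num_disks)

    # Instructions 22 to 26: sort by decreasing average, pad, truncate.
    new_disk_order = np.argsort(intra_disk_avgs)[::-1]
    disk_objectness_set = (intra_disk_avgs > 0).astype("float")
    disk_objectness_set = disk_objectness_set[new_disk_order]
    disk_objectness_set = np.pad(disk_objectness_set, (0, num_elems_to_pad),
                                 constant_values=0)
    disk_objectness_set = disk_objectness_set[single_dim_slice]

    # Instructions 32 to 35: default center 0.5, overwritten for visible disks.
    undistorted_disk_center_set = np.ones((num_disks, 2)) / 2
    visible = intra_disk_avgs > 0
    undistorted_disk_center_set[visible] = disk_centers[visible]
    undistorted_disk_center_set = np.pad(undistorted_disk_center_set,
                                         ((0, num_elems_to_pad), (0, 0)),
                                         constant_values=0.5)
    undistorted_disk_center_set = undistorted_disk_center_set[single_dim_slice]

    return disk_objectness_set, undistorted_disk_center_set


# Toy 4 x 4 pattern with three disks; disk 0 has no pixels inside the image.
image = np.zeros((4, 4))
image[0:2, 0:2] = 3.0
image[2:4, 2:4] = 1.5
disk_supports = np.zeros((3, 4, 4), dtype=bool)
disk_supports[1, 0:2, 0:2] = True
disk_supports[2, 2:4, 2:4] = True
disk_centers = [[0.1, 0.1], [0.4, 0.6], [0.7, 0.2]]

objectness, centers = build_disk_targets(image, disk_supports, disk_centers,
                                         max_num_disks=5)
```

In this toy case the two visible disks come first in objectness, followed by the invisible disk and two padded entries. The center of the invisible disk and the padded rows of centers all hold the placeholder value 0.5.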
- Parameters:
- num_cbed_patterns : int, optional
  The number of images of fake CBED patterns to generate and store in the machine learning (ML) dataset.
- max_num_disks_in_any_cbed_pattern : int, optional
  The maximum number of CBED disks to appear in the image of any fake CBED pattern to be generated.
- cbed_pattern_generator : any_fake_cbed_pattern_generator | None, optional
  cbed_pattern_generator specifies the fake CBED pattern generator to be used.
- output_filename : str, optional
  The relative or absolute filename of the HDF5 file to which to store the ML dataset to be generated.
- max_num_ml_data_instances_per_file_update : int, optional
  The number of ML data instances to write to file per file update. The larger the value, the larger the memory requirements.
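When the output file is read back, the normalizable datasets can be unnormalized with the formula given in the description above. The round trip below is a minimal NumPy sketch of that relationship. The function names are hypothetical, and the way normalization_weight and normalization_bias are derived from the minimum and maximum is an assumption; the documentation only fixes the inverse, (hdf5_dataset-normalization_bias) / normalization_weight. The guard for constant data is likewise an assumption.

```python
import numpy as np


def min_max_normalize(raw):
    """Return (normalized, weight, bias) with normalized = raw*weight + bias.

    The choice of weight and bias is an assumption of this sketch.
    """
    lo, hi = float(raw.min()), float(raw.max())
    span = hi - lo
    weight = 1.0 / span if span > 0 else 1.0  # guard for constant data
    bias = -lo * weight
    return raw * weight + bias, weight, bias


def unnormalize(normalized, weight, bias):
    """Reverse the normalization as stated in the description above."""
    return (normalized - bias) / weight


# Toy stand-in for a normalizable dataset such as common_undistorted_disk_radii.
raw = np.array([0.2, 0.5, 0.8])
normalized, weight, bias = min_max_normalize(raw)
recovered = unnormalize(normalized, weight, bias)
```

For a real file, normalized would be the array read from a normalizable HDF5 dataset, and weight and bias would be the values of that dataset's "normalization_weight" and "normalization_bias" attributes.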