2.1.1.1.1.2. emicroml.modelling.cbed.distortion.estimation.generate_and_save_ml_dataset
- generate_and_save_ml_dataset(num_cbed_patterns=1500, max_num_disks_in_any_cbed_pattern=90, cbed_pattern_generator=None, output_filename='ml_dataset.h5', max_num_ml_data_instances_per_file_update=100)
Generate a machine learning dataset.
According to the parameters described below, the current function generates a file storing a machine learning (ML) dataset that can be used to train and/or evaluate ML models represented by the class emicroml.modelling.cbed.distortion.estimation.MLModel.

The number of ML data instances to be generated is specified by the parameter num_cbed_patterns. Each ML data instance is derived from a “fake” CBED pattern, with each fake CBED pattern being generated from a fake CBED pattern generator that is specified by the parameter cbed_pattern_generator. The maximum number of (fake) CBED disks that can appear in any generated fake CBED pattern is specified by the parameter max_num_disks_in_any_cbed_pattern.

cbed_pattern_generator can be set to either None, or any object that satisfies the following:

1. cbed_pattern_generator must have a method called generate which returns an instance of the class fakecbed.discretized.CBEDPattern upon calling said method via cbed_pattern_generator.generate().
2. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), fake_cbed_pattern.core_attrs["num_pixels_across_pattern"] must yield the same integer value num_pixels_across_each_pattern.
3. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), fake_cbed_pattern.core_attrs["undistorted_disks"] must be a nonempty sequence undistorted_disks where for each element undistorted_disk of the sequence, undistorted_disk.core_attrs["support"] must yield an instance undistorted_disk_support of the class fakecbed.shapes.Circle, with undistorted_disk_support.core_attrs["radius"] yielding a positive number common_undistorted_disk_radius. The number common_undistorted_disk_radius has the same value for all elements of the sequence undistorted_disks of the same object fake_cbed_pattern. From one object fake_cbed_pattern returned by cbed_pattern_generator.generate() to another, the value of common_undistorted_disk_radius can change. Further below we refer to common_undistorted_disk_radius as the common undistorted disk radius.
4. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), fake_cbed_pattern.core_attrs["distortion_model"].is_standard must yield True.
5. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), (~fake_cbed_pattern.disk_absence_registry).sum().item() must yield an integer less than or equal to max_num_disks_in_any_cbed_pattern.

If cbed_pattern_generator is set to None, then the parameter will be reassigned to the value of emicroml.modelling.cbed.distortion.estimation.DefaultCBEDPatternGenerator(), which satisfies the same conditions described above.

As alluded to above, each valid object
fake_cbed_pattern returned by cbed_pattern_generator.generate() stores an instance distortion_model of the class distoptica.DistortionModel, accessed by fake_cbed_pattern.core_attrs["distortion_model"]. distortion_model is the distortion model that determines the distortion field applied to the fake CBED pattern represented by fake_cbed_pattern. See the documentation for distoptica.DistortionModel for additional context. As implied above, distortion_model is a “standard” distortion model, meaning that the corresponding coordinate transformation \(T_{⌑;x}\left(u_{x},u_{y}\right)\) that describes the optical distortions can be specified equivalently by an instance standard_coord_transform_params of distoptica.StandardCoordTransformParams. standard_coord_transform_params is the standard coordinate transformation parameter set of the fake CBED pattern. As discussed in the documentation for the class distoptica.StandardCoordTransformParams, each instance of said class specifies a distortion center \(\left(x_{c;D},y_{c;D}\right)\), a quadratic radial distortion amplitude \(A_{r;0,2}\), an elliptical distortion vector \(\left(A_{r;2,0},B_{r;1,0}\right)\), a spiral distortion amplitude \(A_{t;0,2}\), and a parabolic distortion vector \(\left(A_{r;1,1},B_{r;0,1}\right)\).

As alluded to above, each valid object
fake_cbed_pattern returned by cbed_pattern_generator.generate() stores a nonempty sequence undistorted_disks, accessed by fake_cbed_pattern.core_attrs["undistorted_disks"]. For every nonnegative integer k less than fake_cbed_pattern.num_disks, undistorted_disks[k] specifies the intensity pattern of the k-th undistorted fake CBED disk of the fake CBED pattern represented by fake_cbed_pattern. The center of the k-th undistorted fake CBED disk can be accessed by undistorted_disks[k].core_attrs["support"].core_attrs["center"]. During the process of deriving an ML data instance from fake_cbed_pattern, the intra-disk averages of the distorted fake CBED disks of the fake CBED pattern are calculated, where the k-th distorted fake CBED disk corresponds to the k-th undistorted fake CBED disk, i.e. the former is obtained by distorting the latter. The intra-disk average kth_intra_disk_avg of the k-th distorted fake CBED disk is calculated by

    kth_intra_disk_sum = (fake_cbed_pattern.image
                          * fake_cbed_pattern.disk_supports[k]).sum().item()
    kth_disk_area = (fake_cbed_pattern.disk_supports[k].sum().item()
                     / (fake_cbed_pattern.image.shape[0]**2))
    if (kth_disk_area > 0):
        kth_intra_disk_avg = kth_intra_disk_sum/kth_disk_area
    else:
        kth_intra_disk_avg = 0

We reference intra-disk averages again further below.
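The conditions on cbed_pattern_generator listed above can be checked on a single generated pattern with a duck-typed sketch like the one below. The helper names check_fake_cbed_pattern and make_stub_pattern are hypothetical and not part of emicroml. The class checks from conditions 1 and 3 (fakecbed.discretized.CBEDPattern and fakecbed.shapes.Circle) are deliberately skipped so that the sketch runs without fakecbed installed, and the stub objects use NumPy arrays where a real pattern would presumably hold tensors.

```python
from types import SimpleNamespace

import numpy as np


def check_fake_cbed_pattern(fake_cbed_pattern, max_num_disks,
                            expected_num_pixels=None):
    """Return a list of violated conditions (empty if none were found)."""
    problems = []
    core_attrs = fake_cbed_pattern.core_attrs

    # Condition 2: every pattern must report the same pixel count.
    num_pixels = core_attrs["num_pixels_across_pattern"]
    if expected_num_pixels is not None and num_pixels != expected_num_pixels:
        problems.append("condition 2: inconsistent num_pixels_across_pattern")

    # Condition 3: nonempty sequence of disks sharing one positive radius.
    radii = [disk.core_attrs["support"].core_attrs["radius"]
             for disk in core_attrs["undistorted_disks"]]
    if len(radii) == 0:
        problems.append("condition 3: undistorted_disks is empty")
    elif min(radii) <= 0 or any(r != radii[0] for r in radii):
        problems.append("condition 3: radii are not one common positive value")

    # Condition 4: the distortion model must be standard.
    if not core_attrs["distortion_model"].is_standard:
        problems.append("condition 4: distortion model is not standard")

    # Condition 5: number of non-absent disks must not exceed the maximum.
    num_present = (~fake_cbed_pattern.disk_absence_registry).sum().item()
    if num_present > max_num_disks:
        problems.append("condition 5: too many disks in the pattern")

    return problems


def make_stub_pattern(radii, absent, is_standard=True, num_pixels=64):
    """Build a hypothetical stand-in exposing only the attributes used above."""
    disks = [
        SimpleNamespace(core_attrs={
            "support": SimpleNamespace(core_attrs={"radius": radius})})
        for radius in radii
    ]
    return SimpleNamespace(
        core_attrs={"num_pixels_across_pattern": num_pixels,
                    "undistorted_disks": disks,
                    "distortion_model": SimpleNamespace(is_standard=is_standard)},
        disk_absence_registry=np.array(absent, dtype=bool))


good = make_stub_pattern([0.05, 0.05], [False, False])
bad = make_stub_pattern([0.05, 0.04], [False, False], is_standard=False)

good_problems = check_fake_cbed_pattern(good, 90, expected_num_pixels=64)
bad_problems = check_fake_cbed_pattern(bad, 1, expected_num_pixels=64)
print(good_problems)
print(bad_problems)
```

In practice one would call such a check on a few outputs of cbed_pattern_generator.generate() before starting a long dataset generation run.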
The ML data instances generated by the current function are stored in an HDF5 file, which has the following file structure:
- cbed_pattern_images: <HDF5 3D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “row”
  - dim_2: “col”
- disk_overlap_maps: <HDF5 3D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “row”
  - dim_2: “col”
- disk_objectness_sets: <HDF5 2D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “disk idx”
- disk_clipping_registries: <HDF5 2D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “disk idx”
- undistorted_disk_center_sets: <HDF5 3D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “disk idx”
  - dim_2: “vector cmpnt idx [0->x, 1->y]”
  - normalization_weight: <float>
  - normalization_bias: <float>
- common_undistorted_disk_radii: <HDF5 1D dataset>
  - dim_0: “cbed pattern idx”
  - normalization_weight: <float>
  - normalization_bias: <float>
- distortion_centers: <HDF5 2D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “vector cmpnt idx [0->x, 1->y]”
  - normalization_weight: <float>
  - normalization_bias: <float>
- quadratic_radial_distortion_amplitudes: <HDF5 1D dataset>
  - dim_0: “cbed pattern idx”
  - normalization_weight: <float>
  - normalization_bias: <float>
- spiral_distortion_amplitudes: <HDF5 1D dataset>
  - dim_0: “cbed pattern idx”
  - normalization_weight: <float>
  - normalization_bias: <float>
- elliptical_distortion_vectors: <HDF5 2D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “vector cmpnt idx [0->x, 1->y]”
  - normalization_weight: <float>
  - normalization_bias: <float>
- parabolic_distortion_vectors: <HDF5 2D dataset>
  - dim_0: “cbed pattern idx”
  - dim_1: “vector cmpnt idx [0->x, 1->y]”
  - normalization_weight: <float>
  - normalization_bias: <float>

Note that the sub-bullet points listed immediately below a given HDF5 dataset display the HDF5 attributes associated with said HDF5 dataset. Each HDF5 dataset has a set of attributes with names of the form "dim_{}".format(i) with i being an integer ranging from 0 to the rank of said HDF5 dataset minus 1. Attribute "dim_{}".format(i) of a given HDF5 dataset labels the i-th dimension of the underlying array of the dataset. The "cbed pattern idx" dimension is of the size num_cbed_patterns, the "row" dimension is of the size num_pixels_across_each_pattern, the "col" dimension is of the size num_pixels_across_each_pattern, the "disk idx" dimension is of the size max_num_disks_in_any_cbed_pattern, and the "vector cmpnt idx [0->x, 1->y]" dimension is of the size 2.

The HDF5 datasets that have attributes named "normalization_weight" and "normalization_bias" are min-max normalized, and are referred to as “normalizable”. Let hdf5_dataset be the numerical data of such an HDF5 dataset. Furthermore, let normalization_weight and normalization_bias be the values stored in the attributes "normalization_weight" and "normalization_bias" of said HDF5 dataset respectively. hdf5_dataset in this scenario is already min-max normalized. To reverse the normalization, i.e. to unnormalize the data, simply calculate (hdf5_dataset-normalization_bias) / normalization_weight.

We describe below how the data of the HDF5 datasets are calculated effectively.

1. Set N to num_pixels_across_each_pattern.
2. Set cbed_pattern_images to np.zeros((num_cbed_patterns, N, N)), where np is an alias for the NumPy library numpy.
3. Set disk_overlap_maps to np.zeros((num_cbed_patterns, N, N), dtype="int").
4. Set disk_objectness_sets to np.zeros((num_cbed_patterns, max_num_disks_in_any_cbed_pattern)).
5. Set disk_clipping_registries to np.zeros((num_cbed_patterns, max_num_disks_in_any_cbed_pattern), dtype="bool").
6. Set undistorted_disk_center_sets to np.zeros((num_cbed_patterns, max_num_disks_in_any_cbed_pattern, 2)).
7. Set common_undistorted_disk_radii to np.zeros((num_cbed_patterns,)).
8. Set distortion_centers to np.zeros((num_cbed_patterns, 2)).
9. Set quadratic_radial_distortion_amplitudes to np.zeros((num_cbed_patterns,)).
10. Set spiral_distortion_amplitudes to np.zeros((num_cbed_patterns,)).
11. Set elliptical_distortion_vectors to np.zeros((num_cbed_patterns, 2)).
12. Set parabolic_distortion_vectors to np.zeros((num_cbed_patterns, 2)).
13. Set cbed_pattern_idx to -1.
14. Set cbed_pattern_idx to cbed_pattern_idx+1.
15. Set fake_cbed_pattern to cbed_pattern_generator.generate().
16. Store fake_cbed_pattern.image.numpy(force=True) in cbed_pattern_images[cbed_pattern_idx].
17. Store fake_cbed_pattern.disk_overlap_map.numpy(force=True) in disk_overlap_maps[cbed_pattern_idx].
18. Set intra_disk_avgs to np.zeros((fake_cbed_pattern.num_disks,)).
19. Set num_elems_to_pad to max_num_disks_in_any_cbed_pattern - fake_cbed_pattern.num_disks.
20. Set single_dim_slice to slice(0, max_num_disks_in_any_cbed_pattern).
21. For every nonnegative integer k less than fake_cbed_pattern.num_disks, store the intra-disk average of the k-th distorted fake CBED disk of fake_cbed_pattern in intra_disk_avgs[k].
22. Set new_disk_order to np.argsort(intra_disk_avgs)[::-1].
23. Set disk_objectness_set to (intra_disk_avgs > 0).astype("float").
24. Set disk_objectness_set to disk_objectness_set[new_disk_order].
25. Pad the end of the zeroth axis of disk_objectness_set with the value 0, num_elems_to_pad times.
26. Set disk_objectness_set to disk_objectness_set[single_dim_slice].
27. Store disk_objectness_set in disk_objectness_sets[cbed_pattern_idx].
28. Set disk_clipping_registry to fake_cbed_pattern.disk_clipping_registry.numpy(force=True).
29. Pad the end of the zeroth axis of disk_clipping_registry with the value 0, num_elems_to_pad times.
30. Set disk_clipping_registry to disk_clipping_registry[single_dim_slice].
31. Store disk_clipping_registry in disk_clipping_registries[cbed_pattern_idx].
32. Set undistorted_disk_center_set to np.ones((fake_cbed_pattern.num_disks, 2))/2.
33. For every nonnegative integer k less than fake_cbed_pattern.num_disks, if intra_disk_avgs[k]>0 then store the center of the k-th undistorted fake CBED disk of fake_cbed_pattern in undistorted_disk_center_set[k].
34. Pad the end of the zeroth axis of undistorted_disk_center_set with the value 0.5, num_elems_to_pad times.
35. Set undistorted_disk_center_set to undistorted_disk_center_set[single_dim_slice].
36. Store undistorted_disk_center_set in undistorted_disk_center_sets[cbed_pattern_idx].
37. Store the common undistorted disk radius of fake_cbed_pattern in common_undistorted_disk_radii[cbed_pattern_idx].
38. Store the distortion center of fake_cbed_pattern in distortion_centers[cbed_pattern_idx].
39. Store the quadratic radial distortion amplitude of fake_cbed_pattern in quadratic_radial_distortion_amplitudes[cbed_pattern_idx].
40. Store the spiral distortion amplitude of fake_cbed_pattern in spiral_distortion_amplitudes[cbed_pattern_idx].
41. Store the elliptical distortion vector of fake_cbed_pattern in elliptical_distortion_vectors[cbed_pattern_idx].
42. Store the parabolic distortion vector of fake_cbed_pattern in parabolic_distortion_vectors[cbed_pattern_idx].
43. If cbed_pattern_idx < num_cbed_patterns-1, then go to instruction 14. Otherwise, go to instruction 44.
44. Min-max normalize all normalizable HDF5 datasets.
45. Stop.
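The per-pattern part of the instructions above (instructions 18 to 36) can be illustrated with the following NumPy-only sketch. It is not the emicroml implementation: the function names intra_disk_avg and build_disk_targets are hypothetical, plain NumPy arrays stand in for the image, disk supports and registries of a fake CBED pattern, and clamping num_elems_to_pad at zero is an assumption made so that np.pad never receives a negative width. The sketch follows the listed instructions literally, so the reordering by new_disk_order is applied to the disk objectness set only.

```python
import numpy as np


def intra_disk_avg(image, disk_support):
    """Intra-disk average, mirroring the formula given in the description."""
    intra_disk_sum = (image * disk_support).sum().item()
    disk_area = disk_support.sum().item() / (image.shape[0] ** 2)
    return intra_disk_sum / disk_area if disk_area > 0 else 0


def build_disk_targets(intra_disk_avgs, clipping_registry, disk_centers,
                       max_num_disks):
    """Follow instructions 19-20 and 22-35 for a single pattern."""
    intra_disk_avgs = np.asarray(intra_disk_avgs, dtype=float)
    num_disks = intra_disk_avgs.shape[0]

    # Assumption: clamp at zero so that padding is always well defined.
    num_elems_to_pad = max(max_num_disks - num_disks, 0)
    single_dim_slice = slice(0, max_num_disks)

    # Instructions 22-26: objectness, sorted by descending intra-disk average.
    new_disk_order = np.argsort(intra_disk_avgs)[::-1]
    disk_objectness_set = (intra_disk_avgs > 0).astype("float")
    disk_objectness_set = disk_objectness_set[new_disk_order]
    disk_objectness_set = np.pad(disk_objectness_set, (0, num_elems_to_pad),
                                 constant_values=0)
    disk_objectness_set = disk_objectness_set[single_dim_slice]

    # Instructions 28-30: clipping registry, padded with False (i.e. 0).
    disk_clipping_registry = np.asarray(clipping_registry, dtype=bool)
    disk_clipping_registry = np.pad(disk_clipping_registry,
                                    (0, num_elems_to_pad),
                                    constant_values=False)
    disk_clipping_registry = disk_clipping_registry[single_dim_slice]

    # Instructions 32-35: centers default to (0.5, 0.5) for empty disks.
    undistorted_disk_center_set = np.ones((num_disks, 2)) / 2
    for k in range(num_disks):
        if intra_disk_avgs[k] > 0:
            undistorted_disk_center_set[k] = disk_centers[k]
    undistorted_disk_center_set = np.pad(undistorted_disk_center_set,
                                         ((0, num_elems_to_pad), (0, 0)),
                                         constant_values=0.5)
    undistorted_disk_center_set = undistorted_disk_center_set[single_dim_slice]

    return (disk_objectness_set, disk_clipping_registry,
            undistorted_disk_center_set)


# Toy 4x4 image with one disk support covering 4 of the 16 pixels.
image = np.full((4, 4), 2.0)
support = np.zeros((4, 4), dtype=bool)
support[:2, :2] = True
avg = intra_disk_avg(image, support)
empty_avg = intra_disk_avg(image, np.zeros((4, 4), dtype=bool))

objectness, clipping, centers = build_disk_targets(
    intra_disk_avgs=[0.2, 0.0, 0.7],
    clipping_registry=[False, True, False],
    disk_centers=[(0.3, 0.4), (0.1, 0.9), (0.6, 0.7)],
    max_num_disks=5)
print(avg, empty_avg)
print(objectness)
print(clipping)
print(centers)
```

The padding guarantees that every row written to disk_objectness_sets, disk_clipping_registries and undistorted_disk_center_sets has length max_num_disks_in_any_cbed_pattern along the "disk idx" dimension, regardless of how many disks a given pattern contains.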
- Parameters:
- num_cbed_patterns : int, optional
  The number of images of fake CBED patterns to generate and store in the machine learning (ML) dataset.
- max_num_disks_in_any_cbed_pattern : int, optional
  The maximum number of CBED disks to appear in the image of any fake CBED pattern to be generated.
- cbed_pattern_generator : any_fake_cbed_pattern_generator | None, optional
  cbed_pattern_generator specifies the fake CBED pattern generator to be used.
- output_filename : str, optional
  The relative or absolute filename of the HDF5 file to which to store the ML dataset to be generated.
- max_num_ml_data_instances_per_file_update : int, optional
  The number of ML data instances to write to file per file update. The larger the value, the larger the memory requirements.
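As a companion to the unnormalization rule stated in the description, (hdf5_dataset-normalization_bias) / normalization_weight, the sketch below shows a min-max normalization together with its inverse on a small NumPy array. The forward convention normalized = weight * raw + bias is inferred from that inverse formula. How emicroml actually computes the weight and bias, and how it treats constant data, is an assumption here, and the function names min_max_normalize and unnormalize are hypothetical.

```python
import numpy as np


def min_max_normalize(data):
    """Return (normalized, weight, bias) with normalized = weight*data + bias."""
    data = np.asarray(data, dtype=float)
    data_min = data.min()
    span = data.max() - data_min
    # Assumption: fall back to a unit weight when all values are equal.
    weight = 1.0 / span if span > 0 else 1.0
    bias = -weight * data_min
    return weight * data + bias, weight, bias


def unnormalize(hdf5_dataset, normalization_weight, normalization_bias):
    """Reverse the normalization using the rule given in the description."""
    return (hdf5_dataset - normalization_bias) / normalization_weight


# Toy values standing in for the raw contents of a normalizable dataset.
raw = np.array([1.0, 2.0, 5.0])
normalized, weight, bias = min_max_normalize(raw)
recovered = unnormalize(normalized, weight, bias)
print(normalized, weight, bias)
print(recovered)
```

When working with a generated file, hdf5_dataset, normalization_weight and normalization_bias would instead be read from a normalizable HDF5 dataset and its two attributes of the same names.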