2.1.1.1.1.2. emicroml.modelling.cbed.distortion.estimation.generate_and_save_ml_dataset

generate_and_save_ml_dataset(num_cbed_patterns=1500, max_num_disks_in_any_cbed_pattern=90, cbed_pattern_generator=None, output_filename='ml_dataset.h5', max_num_ml_data_instances_per_file_update=100)

Generate a machine learning dataset.

According to the parameters described below, the current function generates a file storing a machine learning (ML) dataset that can be used to train and/or evaluate ML models represented by the class emicroml.modelling.cbed.distortion.estimation.MLModel.

The number of ML data instances to be generated is specified by the parameter num_cbed_patterns. Each ML data instance is derived from a “fake” CBED pattern, with each fake CBED pattern being generated from a fake CBED pattern generator that is specified by the parameter cbed_pattern_generator. The maximum number of (fake) CBED disks that can appear in any generated fake CBED pattern is specified by the parameter max_num_disks_in_any_cbed_pattern.

cbed_pattern_generator can be set to either None, or any object that satisfies the following:

1. cbed_pattern_generator must have a method called generate which returns an instance of the class fakecbed.discretized.CBEDPattern upon calling said method via cbed_pattern_generator.generate().

2. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), fake_cbed_pattern.core_attrs["num_pixels_across_pattern"] must yield the same integer value num_pixels_across_each_pattern.

3. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), fake_cbed_pattern.core_attrs["undistorted_disks"] must be a nonempty sequence undistorted_disks where for each element undistorted_disk of the sequence, undistorted_disk.core_attrs["support"] must yield an instance undistorted_disk_support of the class fakecbed.shapes.Circle, with undistorted_disk_support.core_attrs["radius"] yielding a positive number common_undistorted_disk_radius. The number common_undistorted_disk_radius has the same value for all elements of the sequence undistorted_disks of the same object fake_cbed_pattern. From one object fake_cbed_pattern returned by cbed_pattern_generator.generate() to another, the value of common_undistorted_disk_radius can change. Further below we refer to common_undistorted_disk_radius as the common undistorted disk radius.

4. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), fake_cbed_pattern.core_attrs["distortion_model"].is_standard must yield True.

5. For each object fake_cbed_pattern returned by cbed_pattern_generator.generate(), (~fake_cbed_pattern.disk_absence_registry).sum().item() must yield an integer less than or equal to max_num_disks_in_any_cbed_pattern.

If cbed_pattern_generator is set to None, then the parameter will be reassigned to the value of emicroml.modelling.cbed.distortion.estimation.DefaultCBEDPatternGenerator(), which satisfies the same conditions described above.
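The conditions above can be spot-checked on a few generated patterns before launching a long generation run. The following is a minimal sketch and not part of emicroml: the helper name check_fake_cbed_pattern is hypothetical, it relies only on the attribute accesses quoted in conditions 2 to 5, and it deliberately omits the class checks of conditions 1 and 3 so that it can be exercised without fakecbed installed.

```python
def check_fake_cbed_pattern(fake_cbed_pattern,
                            num_pixels_across_each_pattern,
                            max_num_disks_in_any_cbed_pattern):
    """Hypothetical helper: raise ValueError if a pattern violates
    conditions 2-5; return the common undistorted disk radius otherwise.

    Class (isinstance) checks from conditions 1 and 3 are omitted.
    """
    core_attrs = fake_cbed_pattern.core_attrs

    # Condition 2: every pattern must have the same number of pixels across.
    if (core_attrs["num_pixels_across_pattern"]
            != num_pixels_across_each_pattern):
        raise ValueError("inconsistent num_pixels_across_pattern")

    # Condition 3: nonempty disk sequence sharing one positive radius.
    undistorted_disks = core_attrs["undistorted_disks"]
    if len(undistorted_disks) == 0:
        raise ValueError("undistorted_disks must be nonempty")
    radii = {undistorted_disk.core_attrs["support"].core_attrs["radius"]
             for undistorted_disk in undistorted_disks}
    if len(radii) != 1 or min(radii) <= 0:
        raise ValueError("disks must share one positive radius")

    # Condition 4: the distortion model must be standard.
    if not core_attrs["distortion_model"].is_standard:
        raise ValueError("distortion model is not standard")

    # Condition 5: number of non-absent disks must not exceed the maximum.
    num_present = (~fake_cbed_pattern.disk_absence_registry).sum().item()
    if num_present > max_num_disks_in_any_cbed_pattern:
        raise ValueError("too many disks present in pattern")

    return radii.pop()
```

A typical use would be to call the helper on a handful of objects returned by cbed_pattern_generator.generate(), passing the value of num_pixels_across_pattern of the first object as num_pixels_across_each_pattern.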

As alluded to above, each valid object fake_cbed_pattern returned by cbed_pattern_generator.generate() stores an instance distortion_model of the class distoptica.DistortionModel, accessed by fake_cbed_pattern.core_attrs["distortion_model"]. distortion_model is the distortion model that determines the distortion field applied to the fake CBED pattern represented by fake_cbed_pattern. See the documentation for distoptica.DistortionModel for additional context. As implied above, distortion_model is a “standard” distortion model, meaning that the corresponding coordinate transformation \(T_{⌑;x}\left(u_{x},u_{y}\right)\) that describes the optical distortions can be specified equivalently by an instance standard_coord_transform_params of distoptica.StandardCoordTransformParams. standard_coord_transform_params is the standard coordinate transformation parameter set of the fake CBED pattern. As discussed in the documentation for the class distoptica.StandardCoordTransformParams, each instance of said class specifies a distortion center \(\left(x_{c;D},y_{c;D}\right)\), a quadratic radial distortion amplitude \(A_{r;0,2}\), an elliptical distortion vector \(\left(A_{r;2,0},B_{r;1,0}\right)\), a spiral distortion amplitude \(A_{t;0,2}\), and a parabolic distortion vector \(\left(A_{r;1,1},B_{r;0,1}\right)\).

As alluded to above, each valid object fake_cbed_pattern returned by cbed_pattern_generator.generate() stores a nonempty sequence undistorted_disks, accessed by fake_cbed_pattern.core_attrs["undistorted_disks"]. For every nonnegative integer k less than fake_cbed_pattern.num_disks, undistorted_disks[k] specifies the intensity pattern of the k th undistorted fake CBED disk of the fake CBED pattern represented by fake_cbed_pattern. The center of the k th undistorted fake CBED disk can be accessed by undistorted_disks[k].core_attrs["support"].core_attrs["center"]. During the process of deriving an ML data instance from fake_cbed_pattern, the intra-disk averages of the distorted fake CBED disks of the fake CBED pattern are calculated, where the k th distorted fake CBED disk corresponds to the k th undistorted fake CBED disk, i.e. the former is obtained by distorting the latter. The intra-disk average kth_intra_disk_avg of the k th distorted fake CBED disk is calculated by

kth_intra_disk_sum = (fake_cbed_pattern.image
                      * fake_cbed_pattern.disk_supports[k]).sum().item()

kth_disk_area = (fake_cbed_pattern.disk_supports[k].sum().item()
                 / (fake_cbed_pattern.image.shape[0]**2))

if (kth_disk_area > 0):
    kth_intra_disk_avg = kth_intra_disk_sum/kth_disk_area
else:
    kth_intra_disk_avg = 0

We reference intra-disk averages again further below.
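For experimentation outside of emicroml, the snippet above can be wrapped in a small function. The sketch below is an illustration under stated assumptions: the function name intra_disk_avg is hypothetical, and NumPy arrays stand in for fake_cbed_pattern.image and fake_cbed_pattern.disk_supports[k]. As in the snippet, the disk area is expressed as a fraction of the area of a square image.

```python
import numpy as np


def intra_disk_avg(image, disk_support):
    """Hypothetical stand-in for the snippet above, using NumPy arrays.

    image        : square 2D array of pixel intensities.
    disk_support : boolean 2D array of the same shape, True inside the disk.
    """
    image = np.asarray(image, dtype=float)
    disk_support = np.asarray(disk_support, dtype=bool)

    # Sum of the intensities of the pixels lying inside the disk support.
    intra_disk_sum = (image * disk_support).sum().item()

    # Disk area as a fraction of the full (square) image area.
    disk_area = disk_support.sum().item() / (image.shape[0]**2)

    # A disk with an empty support is assigned an average of zero.
    if disk_area > 0:
        return intra_disk_sum / disk_area
    return 0.0


# Example: a uniform image with a 2x2 support in its interior.
image = np.full((4, 4), 2.0)
disk_support = np.zeros((4, 4), dtype=bool)
disk_support[1:3, 1:3] = True
example_avg = intra_disk_avg(image, disk_support)
```

Only the sign and the relative ordering of these averages matter in the instructions further below, where they determine disk objectness and disk order.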

The ML data instances generated by the current function are stored in an HDF5 file, which has the following file structure:

  • cbed_pattern_images: <HDF5 3D dataset>

    • dim_0: “cbed pattern idx”

    • dim_1: “row”

    • dim_2: “col”

  • disk_overlap_maps: <HDF5 3D dataset>

    • dim_0: “cbed pattern idx”

    • dim_1: “row”

    • dim_2: “col”

  • disk_objectness_sets: <HDF5 2D dataset>

    • dim_0: “cbed pattern idx”

    • dim_1: “disk idx”

  • disk_clipping_registries: <HDF5 2D dataset>

    • dim_0: “cbed pattern idx”

    • dim_1: “disk idx”

  • undistorted_disk_center_sets: <HDF5 3D dataset>

    • dim_0: “cbed pattern idx”

    • dim_1: “disk idx”

    • dim_2: “vector cmpnt idx [0->x, 1->y]”

    • normalization_weight: <float>

    • normalization_bias: <float>

  • common_undistorted_disk_radii: <HDF5 1D dataset>

    • dim_0: “cbed pattern idx”

    • normalization_weight: <float>

    • normalization_bias: <float>

  • distortion_centers: <HDF5 2D dataset>

    • dim_0: “cbed pattern idx”

    • dim_1: “vector cmpnt idx [0->x, 1->y]”

    • normalization_weight: <float>

    • normalization_bias: <float>

  • quadratic_radial_distortion_amplitudes: <HDF5 1D dataset>

    • dim_0: “cbed pattern idx”

    • normalization_weight: <float>

    • normalization_bias: <float>

  • spiral_distortion_amplitudes: <HDF5 1D dataset>

    • dim_0: “cbed pattern idx”

    • normalization_weight: <float>

    • normalization_bias: <float>

  • elliptical_distortion_vectors: <HDF5 2D dataset>

    • dim_0: “cbed pattern idx”

    • dim_1: “vector cmpnt idx [0->x, 1->y]”

    • normalization_weight: <float>

    • normalization_bias: <float>

  • parabolic_distortion_vectors: <HDF5 2D dataset>

    • dim_0: “cbed pattern idx”

    • dim_1: “vector cmpnt idx [0->x, 1->y]”

    • normalization_weight: <float>

    • normalization_bias: <float>

Note that the sub-bullet points listed immediately below a given HDF5 dataset display the HDF5 attributes associated with said HDF5 dataset. Each HDF5 dataset has a set of attributes with names of the form "dim_{}".format(i) with i being an integer ranging from 0 to the rank of said HDF5 dataset minus 1. Attribute "dim_{}".format(i) of a given HDF5 dataset labels the i th dimension of the underlying array of the dataset. The "cbed pattern idx" dimension is of the size num_cbed_patterns, the "row" dimension is of the size num_pixels_across_each_pattern, the "col" dimension is of the size num_pixels_across_each_pattern, the "disk idx" dimension is of the size max_num_disks_in_any_cbed_pattern, and the "vector cmpnt idx [0->x, 1->y]" dimension is of the size 2.
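The dimension labels can be read back and turned into an expected array shape. The sketch below is an assumption-laden illustration rather than part of emicroml: the helper names dim_labels and expected_shape are hypothetical, and the dataset argument only needs to expose attrs and ndim, which h5py.Dataset objects do.

```python
def dim_labels(hdf5_dataset):
    """Return the labels stored in the "dim_{i}" attributes, ordered by axis."""
    return tuple(hdf5_dataset.attrs["dim_{}".format(i)]
                 for i in range(hdf5_dataset.ndim))


def expected_shape(labels,
                   num_cbed_patterns,
                   num_pixels_across_each_pattern,
                   max_num_disks_in_any_cbed_pattern):
    """Map dimension labels to the dimension sizes described above."""
    sizes = {"cbed pattern idx": num_cbed_patterns,
             "row": num_pixels_across_each_pattern,
             "col": num_pixels_across_each_pattern,
             "disk idx": max_num_disks_in_any_cbed_pattern,
             "vector cmpnt idx [0->x, 1->y]": 2}
    return tuple(sizes[label] for label in labels)


# Possible usage with h5py (assumed to be installed; not executed here):
#
#     import h5py
#     with h5py.File("ml_dataset.h5", "r") as hdf5_file:
#         labels = dim_labels(hdf5_file["undistorted_disk_center_sets"])
```

Comparing expected_shape(...) against the shape attribute of each dataset is one quick way to confirm that a file was generated with the intended parameters.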

The HDF5 datasets that have attributes named "normalization_weight" and "normalization_bias" are min-max normalized, and are referred to as “normalizable”. Let hdf5_dataset be the numerical data of such an HDF5 dataset as stored in the file, i.e. already min-max normalized. Furthermore, let normalization_weight and normalization_bias be the values stored in the attributes "normalization_weight" and "normalization_bias" of said HDF5 dataset respectively. To reverse the normalization, i.e. to unnormalize the data, calculate (hdf5_dataset-normalization_bias) / normalization_weight.
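The unnormalization formula above can be sketched in NumPy as follows. The function unnormalize implements exactly that formula. The function min_max_normalize is an assumption added only to make the example round-trip: it maps data onto the interval [0, 1] using normalized = normalization_weight*unnormalized + normalization_bias, and its handling of constant data is a guess rather than documented behaviour.

```python
import numpy as np


def unnormalize(hdf5_dataset, normalization_weight, normalization_bias):
    """Reverse the min-max normalization, as described above."""
    return ((np.asarray(hdf5_dataset, dtype=float) - normalization_bias)
            / normalization_weight)


def min_max_normalize(unnormalized_data):
    """Assumed forward convention: weight*data + bias lies in [0, 1]."""
    unnormalized_data = np.asarray(unnormalized_data, dtype=float)
    data_min = unnormalized_data.min()
    data_max = unnormalized_data.max()

    # Assumption: constant data is left unscaled to avoid dividing by zero.
    if data_max > data_min:
        normalization_weight = 1.0 / (data_max - data_min)
    else:
        normalization_weight = 1.0
    normalization_bias = -normalization_weight * data_min

    normalized_data = (normalization_weight*unnormalized_data
                       + normalization_bias)
    return normalized_data, normalization_weight, normalization_bias


# Round trip on a small example array.
radii = np.array([2.0, 4.0, 6.0])
normalized_radii, weight, bias = min_max_normalize(radii)
recovered_radii = unnormalize(normalized_radii, weight, bias)
```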

We describe below how the data of the HDF5 datasets are, in effect, calculated.

1. Set N to num_pixels_across_each_pattern.

2. Set cbed_pattern_images to np.zeros((num_cbed_patterns, N, N)), where np is an alias for the NumPy library numpy.

3. Set disk_overlap_maps to np.zeros((num_cbed_patterns, N, N), dtype="int").

4. Set disk_objectness_sets to np.zeros((num_cbed_patterns, max_num_disks_in_any_cbed_pattern)).

5. Set disk_clipping_registries to np.zeros((num_cbed_patterns, max_num_disks_in_any_cbed_pattern), dtype="bool").

6. Set undistorted_disk_center_sets to np.zeros((num_cbed_patterns, max_num_disks_in_any_cbed_pattern, 2)).

7. Set common_undistorted_disk_radii to np.zeros((num_cbed_patterns,)).

8. Set distortion_centers to np.zeros((num_cbed_patterns, 2)).

9. Set quadratic_radial_distortion_amplitudes to np.zeros((num_cbed_patterns,)).

10. Set spiral_distortion_amplitudes to np.zeros((num_cbed_patterns,)).

11. Set elliptical_distortion_vectors to np.zeros((num_cbed_patterns, 2)).

12. Set parabolic_distortion_vectors to np.zeros((num_cbed_patterns, 2)).

13. Set cbed_pattern_idx to -1.

14. Set cbed_pattern_idx to cbed_pattern_idx+1.

15. Set fake_cbed_pattern to cbed_pattern_generator.generate().

16. Store fake_cbed_pattern.image.numpy(force=True) in cbed_pattern_images[cbed_pattern_idx].

17. Store fake_cbed_pattern.disk_overlap_map.numpy(force=True) in disk_overlap_maps[cbed_pattern_idx].

18. Set intra_disk_avgs to np.zeros((fake_cbed_pattern.num_disks,)).

19. Set num_elems_to_pad to max_num_disks_in_any_cbed_pattern - fake_cbed_pattern.num_disks.

20. Set single_dim_slice to slice(0, max_num_disks_in_any_cbed_pattern).

21. For every nonnegative integer k less than fake_cbed_pattern.num_disks, store the intra-disk average of the k th distorted fake CBED disk of fake_cbed_pattern in intra_disk_avgs[k].

22. Set new_disk_order to np.argsort(intra_disk_avgs)[::-1].

23. Set disk_objectness_set to (intra_disk_avgs > 0).astype("float").

24. Set disk_objectness_set to disk_objectness_set[new_disk_order].

25. Append num_elems_to_pad zeros to the end of disk_objectness_set along its zeroth axis.

26. Set disk_objectness_set to disk_objectness_set[single_dim_slice].

27. Store disk_objectness_set in disk_objectness_sets[cbed_pattern_idx].

28. Set disk_clipping_registry to fake_cbed_pattern.disk_clipping_registry.numpy(force=True).

29. Append num_elems_to_pad zeros (i.e. False values) to the end of disk_clipping_registry along its zeroth axis.

30. Set disk_clipping_registry to disk_clipping_registry[single_dim_slice].

31. Store disk_clipping_registry in disk_clipping_registries[cbed_pattern_idx].

32. Set undistorted_disk_center_set to np.ones((fake_cbed_pattern.num_disks, 2))/2.

33. For every nonnegative integer k less than fake_cbed_pattern.num_disks, if intra_disk_avgs[k]>0 then store the center of the k th undistorted fake CBED disk of fake_cbed_pattern in undistorted_disk_center_set[k].

34. Append num_elems_to_pad rows, each with both elements equal to 0.5, to the end of undistorted_disk_center_set along its zeroth axis.

35. Set undistorted_disk_center_set to undistorted_disk_center_set[single_dim_slice].

36. Store undistorted_disk_center_set in undistorted_disk_center_sets[cbed_pattern_idx].

37. Store the common undistorted disk radius of fake_cbed_pattern in common_undistorted_disk_radii[cbed_pattern_idx].

38. Store the distortion center of fake_cbed_pattern in distortion_centers[cbed_pattern_idx].

39. Store the quadratic radial distortion amplitude of fake_cbed_pattern in quadratic_radial_distortion_amplitudes[cbed_pattern_idx].

40. Store the spiral distortion amplitude of fake_cbed_pattern in spiral_distortion_amplitudes[cbed_pattern_idx].

41. Store the elliptical distortion vector of fake_cbed_pattern in elliptical_distortion_vectors[cbed_pattern_idx].

42. Store the parabolic distortion vector of fake_cbed_pattern in parabolic_distortion_vectors[cbed_pattern_idx].

43. If cbed_pattern_idx < num_cbed_patterns-1, then go to instruction 14. Otherwise, go to instruction 44.

44. Min-max normalize all normalizable HDF5 datasets.

45. Stop.
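Instructions 18 to 27 and 32 to 36 above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the emicroml implementation: both function names are hypothetical, the intra-disk averages and undistorted disk centers are passed in as plain arrays, and the number of elements to pad is clamped at zero (an assumption) so that np.pad also accepts patterns with more disks than max_num_disks_in_any_cbed_pattern.

```python
import numpy as np


def build_disk_objectness_set(intra_disk_avgs,
                              max_num_disks_in_any_cbed_pattern):
    """Sketch of instructions 19, 20, and 22 to 26."""
    intra_disk_avgs = np.asarray(intra_disk_avgs, dtype=float)
    num_disks = intra_disk_avgs.size

    # Instruction 19, clamped at zero (assumption) for use with np.pad.
    num_elems_to_pad = max(max_num_disks_in_any_cbed_pattern - num_disks, 0)
    # Instruction 20.
    single_dim_slice = slice(0, max_num_disks_in_any_cbed_pattern)

    # Instruction 22: disk indices ordered by decreasing intra-disk average.
    new_disk_order = np.argsort(intra_disk_avgs)[::-1]

    # Instructions 23 and 24: objectness is 1 where the average is positive.
    disk_objectness_set = (intra_disk_avgs > 0).astype("float")
    disk_objectness_set = disk_objectness_set[new_disk_order]

    # Instructions 25 and 26: pad with zeros, then truncate.
    disk_objectness_set = np.pad(disk_objectness_set,
                                 (0, num_elems_to_pad),
                                 constant_values=0)
    return disk_objectness_set[single_dim_slice]


def build_undistorted_disk_center_set(undistorted_disk_centers,
                                      intra_disk_avgs,
                                      max_num_disks_in_any_cbed_pattern):
    """Sketch of instructions 32 to 35."""
    centers = np.asarray(undistorted_disk_centers, dtype=float)
    intra_disk_avgs = np.asarray(intra_disk_avgs, dtype=float)
    num_disks = intra_disk_avgs.size
    num_elems_to_pad = max(max_num_disks_in_any_cbed_pattern - num_disks, 0)

    # Instruction 32: default every center to (0.5, 0.5).
    undistorted_disk_center_set = np.ones((num_disks, 2)) / 2

    # Instruction 33: keep the true center only for disks with positive average.
    is_present = intra_disk_avgs > 0
    undistorted_disk_center_set[is_present] = centers[is_present]

    # Instructions 34 and 35: pad with rows of 0.5, then truncate.
    undistorted_disk_center_set = np.pad(undistorted_disk_center_set,
                                         ((0, num_elems_to_pad), (0, 0)),
                                         constant_values=0.5)
    return undistorted_disk_center_set[0:max_num_disks_in_any_cbed_pattern]
```

For example, with intra-disk averages (0.2, 0.0, 0.5) and a maximum of 5 disks, the first function yields two ones followed by three zeros, and the second replaces the center of the middle disk with (0.5, 0.5) before appending two further rows of (0.5, 0.5).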

Parameters:
num_cbed_patterns : int, optional

The number of images of fake CBED patterns to generate and store in the machine learning (ML) dataset.

max_num_disks_in_any_cbed_pattern : int, optional

The maximum number of CBED disks to appear in the image of any fake CBED pattern to be generated.

cbed_pattern_generator : any_fake_cbed_pattern_generator | None, optional

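cbed_pattern_generator specifies the fake CBED pattern generator to be used. If set to None, the parameter is reassigned to the value of emicroml.modelling.cbed.distortion.estimation.DefaultCBEDPatternGenerator(). Any other object must satisfy the conditions listed in the summary documentation above.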

output_filename : str, optional

The relative or absolute filename of the HDF5 file to which to store the ML dataset to be generated.

max_num_ml_data_instances_per_file_update : int, optional

The maximum number of ML data instances to write to file per file update. The larger the value, the larger the memory requirements.
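To get a feel for the trade-off controlled by max_num_ml_data_instances_per_file_update, the sketch below lists how many ML data instances each file update would carry if instances were written in consecutive batches of at most that size. The batching scheme and the helper name file_update_sizes are assumptions made for illustration; the actual buffering strategy used by emicroml is not described here.

```python
def file_update_sizes(num_cbed_patterns,
                      max_num_ml_data_instances_per_file_update):
    """Hypothetical helper: batch sizes under an assumed batching scheme."""
    if max_num_ml_data_instances_per_file_update <= 0:
        raise ValueError("the maximum batch size must be positive")

    sizes = []
    num_remaining = num_cbed_patterns
    while num_remaining > 0:
        # Each update carries at most the maximum number of instances.
        size = min(num_remaining, max_num_ml_data_instances_per_file_update)
        sizes.append(size)
        num_remaining -= size
    return sizes


# With the default parameter values, this assumed scheme gives 15 updates.
default_sizes = file_update_sizes(1500, 100)
```

Under this assumption, a smaller maximum means more file updates with less data held in memory at once.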