
iotools


The iotools module (sub-directory) hosts APIs for data access. The code is organized into the following submodules.

  • collates.py
    • provides different methods to collate batch data through pytorch DataLoader.
  • parsers.py
    • provides functions to convert data from input file to an appropriate format/shape for algorithms.
  • factories.py
    • provides factory functions (e.g. loader_factory) to construct Dataset and DataLoader instances from a configuration.
  • samplers.py
    • custom entry number sampler classes for DataLoader.
  • datasets.py
    • provides pytorch Dataset APIs for supported data structures.

Concrete Example: script

If you prefer to look at or run a script, refer to this script, which is used for unit testing. You can run it like this:

python3 test_loader.py test_loader.cfg

... where test_loader.cfg is a yaml configuration file. You can provide your own configuration file as well. If the configuration file is not found at the given relative path, it is searched for under the config directory at the top level of this repository. In the above example, this file is used.

Concrete Example: English version

Suppose we want to extract the following data from the input larcv files.

  • Charge deposition profile in voxelized 3D coordinates (nx,ny,nz,q), represented by larcv::EventSparseTensor3D ... with the string identifier "input_data".
  • Semantic segmentation label for those points (nx,ny,nz,l), represented by larcv::EventSparseTensor3D ... with the string identifier "segment_label". ... and maybe one more specification:
  • Include a "batch ID" for training uresnet_pytorch using the SparseConvNet package.

Below we look into creating pytorch Dataset and DataLoader instances using iotools modules.

Create & Configure Dataset

First of all, LArCVDataset (in datasets.py) can interface with larcv data files. This class can take an arbitrary type and count of data products from the input files. At construction, a set of data blobs should be specified; in this case, the string keys "input_data" and "segment_label". Each key should come with a parser function, identified by the unique name of a function defined in parsers.py, and a list of input larcv data products identified by the larcv convention (e.g. the string "sparse3d_aho" uniquely identifies a larcv::EventSparseTensor3D with the label "aho").

In this example, we can use the parse_sparse3d(data) function, which calls larcv.fill_3d_pcloud() with data[0] and np.zeros(shape=[data[0].as_vector().size(),4],dtype=np.float32) to be filled with (x,y,z,q). Note that, outside this example, you can define a parser function that takes a data list with length larger than 1. For each data blob key string (i.e. "input_data" and "segment_label" in this case), the output of the specified parser function, executed with the list of input larcv data products, is stored.
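
As a rough sketch of what such a parser looks like, based only on the description above (the import style and exact return value are assumptions; see parsers.py for the actual implementation):

import numpy as np
from larcv import larcv  # assumed larcv python import style

def parse_sparse3d(data):
    # data is a list of larcv data products; here we only use the first,
    # a larcv::EventSparseTensor3D.
    event_tensor3d = data[0]
    num_points = event_tensor3d.as_vector().size()
    # Allocate an (N,4) array to be filled with (x,y,z,q) per voxel.
    np_voxels = np.zeros(shape=[num_points, 4], dtype=np.float32)
    larcv.fill_3d_pcloud(event_tensor3d, np_voxels)
    return np_voxels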

Finally, we can construct our Dataset using APIs in factories.py; in particular, the loader_factory(cfg) function takes configuration parameters in a dictionary format. Here is an example configuration in yaml format.

iotool:
  dataset:
    name: LArCVDataset
    data_dirs:
      - /gpfs/slac/staas/fs1/g/neutrino/kterao/data/dlprod_ppn_v10/combined
    data_key: train_512px
    limit_num_files: 10
    schema:
      input_data:
        - parse_sparse3d
        - sparse3d_data
      segment_label:
        - parse_sparse3d
        - sparse3d_fivetypes

name specifies the Dataset class name to be instantiated from iotools.datasets.py. Input files are listed from data_dirs, and only files whose names contain the (optional) data_key string will be used. Those files are listed in alphabetical order and, if provided, the first N files are used where N is specified by limit_num_files. schema defines the individual data in a batch data blob. In the above example, we are defining input_data and segment_label. For each data blob, we specify the parse_sparse3d function from parsers.py, and the list continues with an arbitrary number of larcv data identifier strings. In this example, parse_sparse3d only requires one data product, a larcv::EventSparseTensor3D, hence we specify one string.
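
As a minimal sketch of putting this together, assuming the yaml above is saved in a file named io.cfg and that loader_factory accepts the dictionary parsed from it (the file name and import path are illustrative assumptions):

import yaml
from iotools.factories import loader_factory  # assumed import path

# Parse the yaml configuration into a dictionary and hand it to the factory.
cfg = yaml.safe_load(open('io.cfg', 'r'))
dataset = loader_factory(cfg)

# Each entry is a data blob keyed by the schema names above, e.g.
# dataset[0]['input_data'] holds the (N,4) parser output (layout is an assumption).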

Create & Configure DataLoader

So the Dataset is enough to get an individual data sample. But what about adding the "batch ID", the 3rd requirement on the list? A natural place to do this is in the collate function of the pytorch DataLoader. We can use the CollateSparse function, which adds a "batch ID" to any data of type numpy.ndarray in 1D or 2D format. This includes our data, whose shape is (N,4) where N is the number of points.
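
As a rough illustration of the idea (this is not the actual CollateSparse implementation), appending a batch ID column to each (N,4) sample could look like this:

import numpy as np

def collate_with_batch_id(batch):
    # batch is a list of (N,4) numpy arrays, one per sample in the batch.
    out = []
    for batch_id, sample in enumerate(batch):
        # Append a constant batch-ID column so samples can be concatenated
        # into a single (sum_of_N, 5) array.
        ids = np.full(shape=[len(sample), 1], fill_value=batch_id, dtype=sample.dtype)
        out.append(np.concatenate([sample, ids], axis=1))
    return np.concatenate(out, axis=0)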

Like Dataset, there is a factory method for constructing a DataLoader from a configuration: loader_factory(cfg). Here is an example configuration string.

iotool:
  batch_size: 32
  shuffle: True
  num_workers: 4
  collate_fn: CollateSparse
  dataset:
    name: LArCVDataset
    data_dirs:
      - /gpfs/slac/staas/fs1/g/neutrino/kterao/data/dlprod_ppn_v10/combined
    data_key: train_512px
    limit_num_files: 10
    schema:
      input_data:
        - parse_sparse3d
        - sparse3d_data
      segment_label:
        - parse_sparse3d
        - sparse3d_fivetypes

The newly added configuration parameters should be self-descriptive here.
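
As a usage sketch, assuming again that the yaml above lives in io.cfg and that loader_factory returns a pytorch DataLoader when given the full configuration (the file name, import path, and batch layout are assumptions for illustration):

import yaml
from iotools.factories import loader_factory  # assumed import path

cfg = yaml.safe_load(open('io.cfg', 'r'))
loader = loader_factory(cfg)

# Each batch is assumed here to be a data blob keyed by the schema names,
# with CollateSparse having added a batch ID to every sample (see above).
for batch in loader:
    print(batch['input_data'].shape, batch['segment_label'].shape)
    break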

Custom Sampler

The above configuration includes shuffle: True, which enables random access to entries in your input file to form a batch of data. Sometimes you might want a custom data sampling method. For example, if the input data contains samples that are already randomly ordered, you could consider reading B consecutive events starting from an entry X in the file, where X is randomly drawn from all possible integers (e.g. with N total samples, X may vary from 0 to N-B). This can greatly improve the speed of data streaming for file formats that store samples in sequence, since B events can be read in a single block.
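
As a minimal sketch of this idea (not the actual RandomSequenceSampler in samplers.py), a pytorch Sampler that yields B consecutive indices starting from randomly drawn positions could look like this:

import numpy as np
from torch.utils.data import Sampler

class RandomSequenceSketch(Sampler):
    def __init__(self, data_size, batch_size):
        self._data_size  = data_size
        self._batch_size = batch_size

    def __len__(self):
        # Number of indices actually yielded per epoch.
        return (self._data_size // self._batch_size) * self._batch_size

    def __iter__(self):
        # Draw random start entries X, then yield B consecutive indices
        # from each so a batch can be read as one contiguous block.
        starts = np.random.randint(0, self._data_size - self._batch_size + 1,
                                   size=self._data_size // self._batch_size)
        return iter([i for x in starts for i in range(x, x + self._batch_size)])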

You can develop your custom sampling method in iotools.samplers.py and specify it in the configuration. When you do this, you must set shuffle: False, following the pytorch DataLoader documentation. Here is an example configuration string that uses RandomSequenceSampler.

iotool:
  batch_size: 32
  shuffle: False
  num_workers: 4
  collate_fn: CollateSparse
  sampler:
    name: RandomSequenceSampler
    batch_size: 32
  dataset:
    name: LArCVDataset
    data_dirs:
      - /gpfs/slac/staas/fs1/g/neutrino/kterao/data/dlprod_ppn_v10/combined
    data_key: train_512px
    limit_num_files: 10
    schema:
      input_data:
        - parse_sparse3d
        - sparse3d_data
      segment_label:
        - parse_sparse3d
        - sparse3d_fivetypes

... which is now identical to this file used for unit testing.