iotools

The iotools module (sub-directory) hosts APIs for data access. The code is organized in the following submodules:
- collates.py provides different methods to collate batch data through the pytorch DataLoader.
- parsers.py provides functions to convert data from input files into an appropriate format/shape for algorithms.
- factories.py provides APIs for constructing the pytorch Dataset and DataLoader.
- samplers.py provides custom entry-number sampler classes for the DataLoader.
- datasets.py provides pytorch Dataset APIs for supported data structures.
If you prefer looking at or running a script, refer to this script, which is used for unit testing. You can run it like this:

python3 test_loader.py test_loader.cfg

... where test_loader.cfg is a yaml configuration file. You can provide your own configuration file as well. If the configuration file is not found at the given relative path, it is searched for under the config directory at the top level of this repository. In the above example, this file is used.
Suppose we want the following data from input larcv files:

- A charge deposition profile in voxelized 3D coordinates (nx,ny,nz,q), represented by larcv::EventSparseTensor3D, with the string identifier "input_data".
- A semantic segmentation label for those points (nx,ny,nz,l), represented by larcv::EventSparseTensor3D, with the string identifier "segment_label".

... and maybe one more specification:

- Include a "batch ID" for training uresnet_pytorch using the SparseConvNet package.
Below we look into creating pytorch Dataset and DataLoader instances using the iotools modules.
First of all, LArCVDataset (in datasets.py) can interface larcv data files. This class can read an arbitrary type and number of data products from the input files. At construction, a set of data blobs should be specified; in this case, the string keys "input_data" and "segment_label". Each key should come with a parser function, identified by the unique name of a function defined in parsers.py, and a list of input larcv data products identified by the larcv naming convention (e.g. the string "sparse3d_aho" uniquely identifies a larcv::EventSparseTensor3D with the label "aho").
In this example, we can use the parse_sparse3d(data) function, which calls larcv.fill_3d_pcloud() with data[0] and np.zeros(shape=[data[0].as_vector().size(),4],dtype=np.float32) to be filled with (x,y,z,q). Note that, outside this example, you can define a parser function that takes a data list with length larger than 1. For each data blob key string (i.e. "input_data" and "segment_label" in this case), the output of the specified parser function, executed with the list of input larcv data products, is stored.
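For illustration, here is a minimal sketch of what such a parser could look like. This is a simplified stand-in for the parse_sparse3d defined in parsers.py, and it assumes the larcv python bindings are importable as shown:

```python
import numpy as np
from larcv import larcv  # assumption: larcv python bindings are available


def parse_sparse3d(data):
    """Simplified sketch: convert data[0], a larcv::EventSparseTensor3D,
    into an (N,4) numpy array of (x,y,z,q) values."""
    num_points = data[0].as_vector().size()
    # Pre-allocate the output array and let larcv fill it in place.
    np_voxels = np.zeros(shape=[num_points, 4], dtype=np.float32)
    larcv.fill_3d_pcloud(data[0], np_voxels)
    return np_voxels
```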
Finally, we can construct our Dataset using the APIs in factories.py; in particular, the loader_factory(cfg) function takes configuration parameters in a dictionary format. Here is an example configuration in yaml format.
```yaml
iotool:
  dataset:
    name: LArCVDataset
    data_dirs:
      - /gpfs/slac/staas/fs1/g/neutrino/kterao/data/dlprod_ppn_v10/combined
    data_key: train_512px
    limit_num_files: 10
    schema:
      input_data:
        - parse_sparse3d
        - sparse3d_data
      segment_label:
        - parse_sparse3d
        - sparse3d_fivetypes
```
name specifies the Dataset class name to be instantiated from iotools.datasets.py. Input files are listed from data_dirs, and only files whose names contain the (optional) data_key string will be used. Those files are listed in alphabetical order and, if limit_num_files is provided, only the first N files are used, where N is the value of limit_num_files. schema defines the individual data in a batch data blob. In the above example, we are defining input_data and segment_label. For each data key, we specify the parse_sparse3d function from parsers.py, and the list continues with an arbitrary number of larcv data identifier strings. In this example, parse_sparse3d only requires one data product of type larcv::EventSparseTensor3D, hence we specify one string.
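As a conceptual illustration (not the actual LArCVDataset implementation), each schema entry can be read as a recipe: the first element names a parser in parsers.py and the remaining elements name the larcv data products handed to that parser. The fetch_product callable below is a hypothetical stand-in for the real larcv file I/O, and the import path is an assumption:

```python
import iotools.parsers as parsers  # assumption: iotools is importable as a package


def build_blob(schema, fetch_product):
    """Build one data blob for a single event.

    schema        : dict like {'input_data': ['parse_sparse3d', 'sparse3d_data'], ...}
    fetch_product : hypothetical callable returning the larcv data product for an
                    identifier string such as 'sparse3d_data' in the current event.
    """
    blob = {}
    for key, spec in schema.items():
        parser_name, product_names = spec[0], spec[1:]
        parser = getattr(parsers, parser_name)        # e.g. parsers.parse_sparse3d
        products = [fetch_product(name) for name in product_names]
        blob[key] = parser(products)                  # e.g. an (N,4) numpy array
    return blob
```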
So the Dataset is enough to get an individual data sample. But what about adding a "batch ID", the 3rd requirement on the list? A natural place to do this is the collate function of the pytorch DataLoader. We can use the CollateSparse function, which adds a "batch ID" to any data of type numpy.ndarray with a 1D or 2D shape. This includes our data, whose shape is (N,4) where N is the number of points.
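To make the role of the collate function concrete, here is a minimal sketch of the idea. It is not the actual CollateSparse code; in particular, the column position and dtype of the batch ID are assumptions:

```python
import numpy as np


def collate_sparse_sketch(batch):
    """Sketch: batch is a list of per-event blobs (dicts of numpy arrays).
    For each key, append a batch-ID column to every (N,4) array and
    concatenate the events into a single (sum(N),5) array."""
    result = {}
    for key in batch[0]:
        pieces = []
        for batch_id, sample in enumerate(batch):
            data = sample[key]                                      # shape (N, 4)
            ids = np.full((data.shape[0], 1), batch_id, dtype=data.dtype)
            pieces.append(np.concatenate([data, ids], axis=1))      # shape (N, 5)
        result[key] = np.concatenate(pieces, axis=0)
    return result
```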
Like the Dataset, there is a factory method for constructing a DataLoader from a configuration: loader_factory(cfg). Here is an example configuration string.
```yaml
iotool:
  batch_size: 32
  shuffle: True
  num_workers: 4
  collate_fn: CollateSparse
  dataset:
    name: LArCVDataset
    data_dirs:
      - /gpfs/slac/staas/fs1/g/neutrino/kterao/data/dlprod_ppn_v10/combined
    data_key: train_512px
    limit_num_files: 10
    schema:
      input_data:
        - parse_sparse3d
        - sparse3d_data
      segment_label:
        - parse_sparse3d
        - sparse3d_fivetypes
```
The newly added configuration parameters should be self-descriptive here.
The above configuration includes shuffle: True, which enables random access to entries in your input file to form a batch of data. Sometimes you might want a custom data sampling method. For example, if the input data contains samples that are already randomly ordered, you could consider reading B consecutive events starting from an entry X in the file, where X is randomly drawn from all possible integers (e.g. for N total samples, X may vary from 0 to N-B). This can greatly improve the speed of data streaming for file formats that store samples in sequence, since the B events of a batch can be read in a single block.
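For illustration, a minimal sketch of a sampler implementing this idea is shown below. It is a simplified stand-in, not the actual RandomSequenceSampler found in samplers.py:

```python
import numpy as np
from torch.utils.data import Sampler


class RandomSequenceSamplerSketch(Sampler):
    """Sketch: yield runs of B consecutive entry numbers, each run starting
    at a randomly drawn entry X in [0, N-B]."""

    def __init__(self, data_size, batch_size):
        self._data_size = data_size
        self._batch_size = batch_size

    def __len__(self):
        return self._data_size

    def __iter__(self):
        num_batches = self._data_size // self._batch_size
        # Random starting entries X; each batch then reads B consecutive entries.
        starts = np.random.randint(0, self._data_size - self._batch_size + 1,
                                   size=num_batches)
        for x in starts:
            yield from range(x, x + self._batch_size)
```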
You can develop your own custom sampling method in iotools.samplers.py and specify it in the configuration. When you do this, you must set shuffle: False, following the pytorch DataLoader documentation. Here is an example configuration string that uses RandomSequenceSampler.
```yaml
iotool:
  batch_size: 32
  shuffle: False
  num_workers: 4
  collate_fn: CollateSparse
  sampler:
    name: RandomSequenceSampler
    batch_size: 32
  dataset:
    name: LArCVDataset
    data_dirs:
      - /gpfs/slac/staas/fs1/g/neutrino/kterao/data/dlprod_ppn_v10/combined
    data_key: train_512px
    limit_num_files: 10
    schema:
      input_data:
        - parse_sparse3d
        - sparse3d_data
      segment_label:
        - parse_sparse3d
        - sparse3d_fivetypes
```
... which is now identical to this file used for unit testing.
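Finally, as a hedged usage sketch, such a configuration can be turned into a DataLoader roughly as follows. The import path, the config file name my_loader.cfg, and the assumption that loader_factory accepts the parsed dictionary directly are all assumptions, not guarantees from the repository:

```python
import yaml

from iotools.factories import loader_factory  # assumed import path

# Parse the yaml configuration shown above (saved here as a hypothetical file).
with open('my_loader.cfg', 'r') as f:
    cfg = yaml.safe_load(f)

loader = loader_factory(cfg)   # builds the Dataset and wraps it in a DataLoader

for batch in loader:
    # batch is a collated data blob: e.g. batch['input_data'] should hold the
    # voxels of all 32 events, with a batch-ID column added by CollateSparse.
    break
```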