How to run HTTomo#

The next section gives an overview of the commands to quickly get started running HTTomo.

For those interested in learning about the different ways HTTomo can be configured to run, there is the In-depth Look at Running HTTomo section.

Quick Overview of Running HTTomo#

Required inputs#

In order to run HTTomo you require a data file (an HDF5 file) and a YAML process list file that describes the desired processing pipeline. For information on getting started creating this YAML file, please see Configure efficient pipelines and also ready-to-be-used Full YAML pipelines.

Running HTTomo Inside or Outside of Diamond#

As HTTomo was developed at the Diamond Light Source, some extra effort has been made to accommodate users at Diamond (for example, aliases for commands and launcher scripts). As such, there are some differences in how one runs HTTomo at Diamond vs. outside of Diamond, and the guidance on running HTTomo has been split into two sections accordingly.

Additionally, HTTomo is able to run in serial or in parallel depending on what computer hardware is available to the user, so some sections have been further split into these two subsections where relevant.

In-depth Look at Running HTTomo#

Interacting with HTTomo through the command line interface (CLI)#

The way to interact with the HTTomo software is through its “command line interface” (CLI).

As mentioned earlier, the preliminary step to accessing the installed HTTomo software depends on whether you are using a Diamond machine or not:

  • not on a Diamond machine: activate the conda environment that HTTomo was installed into (please refer to Installation Guide for instructions on how to install HTTomo)

  • on a Diamond machine: run the command module load httomo

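For example, outside of Diamond, activating the environment might look like the following (the environment name httomo here is illustrative; use whatever name you gave the environment during installation):

$ conda activate httomo
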
Once the appropriate step has been done, you will have access to the HTTomo CLI:

$ python -m httomo --help
Usage: python -m httomo [OPTIONS] COMMAND [ARGS]...

  httomo: High Throughput Tomography.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  check  Check a YAML pipeline file for errors.
  run    Run a processing pipeline defined in YAML on input data.

As can be seen from the output above, there are two HTTomo commands available: check and run.

The check command is used for checking a YAML process list file for errors, and running it before attempting to run the pipeline is highly recommended. Please see YAML Checker - Why use it? for more information about the checks being performed, the help information that is printed, etc.

The run command is used for running HTTomo with a pipeline on the given HDF5 input data.

Both commands take arguments, some required and some optional, and the run command additionally has several options/flags to customise its behaviour.

Condensed information regarding the arguments that the commands take, as well as the options for both commands, can be found directly from the command line by using the --help flag, such as python -m httomo check --help.

However, the next sections will describe each command in more detail, providing supplementary material to the information in the CLI.

Note

Diamond users will be able to use httomo as a shortcut for python -m httomo

The check command#

$ python -m httomo check --help
Usage: python -m httomo check [OPTIONS] YAML_CONFIG [IN_DATA]

  Check a YAML pipeline file for errors.

Options:
  --help  Show this message and exit.

Arguments#

For check, there is one required argument YAML_CONFIG, and one optional argument IN_DATA.

YAML_CONFIG (required)#

This is the filepath to the YAML process list file that is to be checked.

IN_DATA (optional)#

This is the filepath to the HDF5 input data that you are intending to run the YAML process list file on.

This is useful to provide because the configuration of the loader in the YAML process list file will have some references to internal paths within the HDF5 file, which must be typed correctly; otherwise, HTTomo will fail to access the intended dataset within the HDF5 file.

Providing the filepath to the HDF5 input data enables a check of the loader configuration in the YAML process list, determining whether the paths mentioned in it exist in the accompanying HDF5 file.
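
As a sketch, with hypothetical filenames pipeline.yaml and data.h5, the two forms of the command would be:

$ python -m httomo check pipeline.yaml
$ python -m httomo check pipeline.yaml data.h5

The second form additionally verifies that the paths referenced by the loader configuration exist within data.h5.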

Options/flags#

The check command has no options/flags.

The run command#

$ python -m httomo run --help
Usage: python -m httomo run [OPTIONS] IN_DATA_FILE YAML_CONFIG OUT_DIR

  Run a pipeline defined in YAML on input data.

Options:
  --output-folder-name DIRECTORY  Define the name of the output folder created
                                  by HTTomo
  --save-all                      Save intermediate datasets for all tasks in
                                  the pipeline.
  --gpu-id INTEGER                The GPU ID of the device to use.
  --reslice-dir DIRECTORY         Directory for temporary files potentially
                                  needed for reslicing (defaults to output
                                  dir)
  --max-cpu-slices INTEGER        Maximum number of slices to use for a block
                                  for CPU-only sections (default: 64)
  --max-memory TEXT               Limit the amount of memory used by the
                                  pipeline to the given memory (supports
                                  strings like 3.2G or bytes)
  --monitor TEXT                  Add monitor to the runner (can be given
                                  multiple times). Available monitors: bench,
                                  summary
  --monitor-output FILENAME       File to store the monitoring output.
                                  Defaults to '-', which denotes stdout
  --intermediate-format [hdf5]    Write intermediate data in hdf5 format
  --compress-intermediate         Write intermediate data in chunked format
                                  with BLOSC compression applied
  --syslog-host TEXT              Host of the syslog server
  --syslog-port INTEGER           Port on the host the syslog server is
                                  running on
  --frames-per-chunk INTEGER RANGE
                                  Number of frames per-chunk in intermediate
                                  data (0 = write as contiguous)  [x>=0]
  --help                          Show this message and exit.

Arguments#

For run, there are three required arguments:

  • IN_DATA_FILE

  • YAML_CONFIG

  • OUT_DIR

and no optional arguments.

IN_DATA_FILE (required)#

This is the filepath to the HDF5 input data that you are intending to process.

YAML_CONFIG (required)#

This is the filepath to the YAML process list file that contains the desired processing pipeline.

OUT_DIR (required)#

This is the path to a directory inside which HTTomo will create its output directory.

The output directory created by HTTomo contains a date and timestamp in the following format: {DAY}-{MONTH}-{YEAR}_{HOUR}_{MIN}_{SEC}_output/. For example, the output directory created for an HTTomo run on 1st May 2023 at 15:30:45 would be 01-05-2023_15_30_45_output/. If the OUT_DIR path provided was /home/myuser/, then the absolute path to the output directory created by HTTomo would be /home/myuser/01-05-2023_15_30_45_output/.
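
As a sketch, with hypothetical filenames data.h5 and pipeline.yaml, a run could look like:

$ python -m httomo run data.h5 pipeline.yaml /home/myuser/

which, for a run started on 1st May 2023 at 15:30:45, would write its results to /home/myuser/01-05-2023_15_30_45_output/.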

Options/flags#

The run command has 13 options/flags:

  • --output-folder-name

  • --save-all

  • --gpu-id

  • --reslice-dir

  • --max-cpu-slices

  • --max-memory

  • --monitor

  • --monitor-output

  • --intermediate-format

  • --compress-intermediate

  • --syslog-host

  • --syslog-port

  • --frames-per-chunk

--output-folder-name#

As described in the documentation for the OUT_DIR argument, the default name of the output directory created by HTTomo consists primarily of a timestamp. If one wishes to provide a name for the directory created by HTTomo instead of using the default timestamp name, then the --output-folder-name flag may be used to achieve this.

For example, if the OUT_DIR path provided was /home/myuser, and --output-folder-name=test-1 was given, then the absolute path of the output directory created by HTTomo would be /home/myuser/test-1/.
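
As a sketch (filenames are hypothetical):

$ python -m httomo run data.h5 pipeline.yaml /home/myuser --output-folder-name=test-1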

--save-all#

Regarding the output of methods, HTTomo’s default behaviour is to not write the output of a method to a file in the output directory unless one of the following conditions is satisfied:

  • the method is the last one in the processing pipeline

  • the save_result parameter has been provided a value of True in a method’s YAML configuration (see Saving intermediate files for more info on the save_result parameter)

However, there are certain cases, such as debugging, where saving the output of all methods to files in the output directory is beneficial, and this flag is a quick way of doing so.
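
As a sketch (filenames are hypothetical), saving the output of every method in the pipeline:

$ python -m httomo run data.h5 pipeline.yaml /home/myuser --save-all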

--gpu-id#

TODO

--reslice-dir#

By default, the directory containing the file used for the re-slice operation is the output directory that HTTomo creates.

If this output directory is on a network-mounted disk, then read/write operations to such a disk will in general be much slower compared to a local disk. In particular, this means that the re-slice operation will be much slower if the output directory is on a network-mounted disk rather than on a local disk.

This flag can be used to specify a different directory inside which the file used for re-slicing should reside.

In particular, if performing the re-slice with a file and the output directory is on a network-mounted disk, it is recommended to use this flag to choose a re-slice directory that is on a local disk where possible. This will drastically improve performance compared to performing the re-slice with a file on a network-mounted disk.
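
As a sketch (both paths are hypothetical; /tmp stands in for a local disk and the output path for a network-mounted one):

$ python -m httomo run data.h5 pipeline.yaml /mnt/network-share/output --reslice-dir /tmp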

Note

If running HTTomo across multiple machines, using a single local disk to contain the file used for re-slicing is not possible.

Below is a summary of the different re-slicing approaches and their relative performance:

  Re-slice type                   Speed
  ------------------------------  ---------
  In-memory                       Very fast
  File w/ local disk              Fast
  File w/ network-mounted disk    Very slow

--max-cpu-slices#

This flag is relevant only for runs using a pipeline that contains one or more sections composed purely of CPU methods.

Understanding this flag’s usage is dependent on knowledge of the concept of “chunks”, “blocks”, and “sections” within HTTomo’s framework, so please refer to Detailed concepts for information on these concepts.

The notion of a block is fully utilised to increase performance when a sequence of two or more GPU methods is being executed. When two or more CPU methods are executed in sequence, blocks play a less significant role in performance. The number of slices in a block is driven by the memory capacity of the GPU, but if no GPU is being used for executing a sequence of methods in the pipeline, there is no obvious way to choose the number of slices in a block (the “block size”).

In such cases the user may wish to tweak the block size to explore whether a specific value happens to improve performance for the CPU-only section(s).
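
As a sketch (filenames are hypothetical), trying a larger block size than the default of 64 for CPU-only sections:

$ python -m httomo run data.h5 pipeline.yaml /home/myuser --max-cpu-slices 128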

--max-memory#

HTTomo supports running on both:

  • a compute cluster, where RAM on the host machine is often quite large

  • a personal machine, where RAM is not nearly as large

This is achieved by a mechanism within HTTomo that holds data in RAM wherever there is enough RAM to do the required processing, and writes data to a file if there is not enough.

The --max-memory flag is for telling HTTomo how much RAM the machine has, so that it can switch to using a file during execution of the pipeline if necessary.
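
As a sketch (filenames are hypothetical), limiting HTTomo to 16 gigabytes of RAM (the flag accepts strings like 3.2G, or a plain number of bytes):

$ python -m httomo run data.h5 pipeline.yaml /home/myuser --max-memory 16G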

--monitor#

HTTomo can report information about the performance of the various methods involved in the specific pipeline being executed. Specifically:

  • time taken for methods to execute on the CPU/GPU

  • transfer time to and from the GPU

  • time taken to write to files (if HTTomo uses a file instead of RAM to hold data during pipeline execution)

There are two options for this flag: summary and bench.

--monitor=summary#

The summary option will produce a brief summary of the time taken for each method to execute in the pipeline, which will look something like the following:

Summary Statistics (aggregated across 1 processes):
  Total methods CPU time:     19.376s
  Total methods GPU time:     19.042s
  Total host2device time:      0.013s
  Total device2host time:      0.548s
  Total sources time    :      0.063s
  Total sinks time      :      0.028s
  Other overheads       :      0.362s
  ---------------------------------------
  Total pipeline time   :     19.829s
  Total wall time       :     19.829s
  ---------------------------------------
Method breakdowns:
                    data_reducer :      0.001s ( 0.0%)
                  find_center_vo :     11.586s (58.4%)
                  remove_outlier :      3.312s (16.7%)
                       normalize :      0.334s ( 1.7%)
     remove_stripe_based_sorting :      2.987s (15.1%)
                             FBP :      0.966s ( 4.9%)
          save_intermediate_data :      0.019s ( 0.1%)
                  save_to_images :      0.171s ( 0.9%)

--monitor=bench#

The bench option (short for “benchmark”) provides a much more in-depth breakdown of the time taken for each method to execute, dividing it into time taken on the CPU vs. the GPU and data transfer times to and from the GPU, and providing this information for all processes involved in the run.

This output is very verbose, but can provide some insight if, for example, wanting to see what parts of the pipeline may be slower than expected.
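
As a sketch (filenames are hypothetical), enabling both monitors for a run (the flag can be given multiple times):

$ python -m httomo run data.h5 pipeline.yaml /home/myuser --monitor summary --monitor bench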

--monitor-output#

By default, the output of any usage of the --monitor flag will be written to stdout (i.e., printed to the terminal). However, there are times when it is useful to write the monitoring output to a file, such as for performance analysis.

HTTomo supports writing the monitoring results in CSV format, so any filepath given to the --monitor-output flag will produce a file with the monitoring results written in CSV format.
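
As a sketch (filenames are hypothetical), writing the summary monitor's results to a CSV file:

$ python -m httomo run data.h5 pipeline.yaml /home/myuser --monitor summary --monitor-output monitoring.csv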

--intermediate-format#

TODO

--compress-intermediate#

TODO

--syslog-host#

TODO

--syslog-port#

TODO

--frames-per-chunk#

TODO