Single Jobs vs Multiple Jobs on HTCondor

Why two modes?

Not every analysis scales the same way. Sometimes you have one dataset and one script and you want to run it once on hardware you don’t have locally. Other times you have the same analysis but need to run it independently across many subsets of the data – once per species, once per site, once per simulation parameter, once per experimental condition.

HTCondor handles both cases, but the job setup is different. The container image, the submit file, the executable script, and the file transfer strategy all change depending on whether you are submitting one job or many. A setup that mixes the two modes is a common source of job failures, and the resulting errors are not always obvious.

This vignette walks through both modes side by side using the same analysis – a summary and visualization of the Palmer Penguins dataset. In single mode, the analysis runs once over the full dataset. In multiple mode, it runs once per species, producing independent results for Adelie, Chinstrap, and Gentoo penguins.

The R script is identical in both cases. What changes is how the data gets to the script and how the surrounding infrastructure is configured.

The analysis

The analysis is simple by design so the focus stays on the infrastructure. The R script loads a CSV file, computes a grouped summary of body mass and flipper length, produces a scatterplot, and writes both outputs to a results/ folder.

The script uses toolero::detect_execution_context() to resolve the input file path. In an interactive RStudio session, the path is hardcoded for convenience. When rendered via Quarto, it comes from the YAML params block. When run via Rscript on HTCondor, it comes from the first command-line argument:

context <- toolero::detect_execution_context()

input_file <- switch(context,
  interactive = "data-raw/sample.csv",               # hardcoded for convenience in RStudio
  quarto      = params$input_file,                   # from the YAML params block
  rscript     = commandArgs(trailingOnly = TRUE)[1]  # first command-line argument
)

penguins <- toolero::read_clean_csv(input_file)
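One thing the switch() above does not guard against is a missing argument: if the .sh script passes nothing, commandArgs(trailingOnly = TRUE)[1] is NA and the failure only shows up later, when the file is read. A minimal sketch of a defensive wrapper follows. The function name resolve_input_file and its arguments are hypothetical and not part of toolero; the three context names are taken from the code above.

```r
# Hypothetical helper (not part of toolero): resolve the input path and
# fail early with a clear message when nothing usable was supplied.
resolve_input_file <- function(context,
                               params = list(),
                               args = commandArgs(trailingOnly = TRUE)) {
  path <- switch(context,
    interactive = "data-raw/sample.csv",
    quarto      = params$input_file,
    rscript     = args[1],
    stop("Unknown execution context: ", context, call. = FALSE)
  )
  # params$input_file is NULL when absent; args[1] is NA when no argument was passed
  if (is.null(path) || is.na(path) || !nzchar(path)) {
    stop("No input file supplied for context '", context, "'", call. = FALSE)
  }
  path
}

# Simulating the HTCondor case, where the .sh script passes one argument
resolve_input_file("rscript", args = "adelie.csv")
```

Failing at this point puts the message in the job's .err file right next to its cause, rather than inside a later CSV-reading error.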

The output section writes results to a relative path:

results_dir <- "results"
dir.create(results_dir, showWarnings = FALSE, recursive = TRUE)

toolero::write_clean_csv(p_stats, file.path(results_dir, "results.csv"),
                         overwrite = TRUE)
ggplot2::ggsave(filename = file.path(results_dir, "plot.png"), plot = p_plot)

This is portable because the path is relative to the working directory. "results" resolves correctly in RStudio, in quarto render, and on HTCondor, where the executable script changes to HTCondor’s writable scratch space before calling Rscript.

Two kinds of manifest

The word “manifest” appears in two different contexts in the submitr workflow. They serve different purposes and should not be confused.

The data manifest is a CSV file on disk. It is produced by toolero::write_by_group(manifest = TRUE) and lists the subset data files created when splitting a dataset by a grouping column. htc_gen_submit() reads this file via the queue_from argument to produce the subdatasets.csv that HTCondor uses to dispatch one job per subset. The data manifest is only relevant in multiple mode.

The job manifest is a session-level R option that accumulates metadata as you work through the submitr pipeline. It is not a file. It lives in memory and is cleared when you call htc_start() or restart R. Each function in the workflow contributes a piece: htc_gen_submit() records the mode, the output file pattern, and the subset names. htc_gen_executable() records the script name and results folder. htc_submit() records the cluster ID assigned by HTCondor. At the end of the workflow, htc_download() reads the job manifest to determine which files to retrieve – no glob patterns or manual file lists required.

|             | Data manifest                     | Job manifest                                         |
|-------------|-----------------------------------|------------------------------------------------------|
| What is it  | A CSV file (manifest.csv)         | Session-level R option                               |
| Created by  | toolero::write_by_group()         | htc_gen_submit(), htc_gen_executable(), htc_submit() |
| Contains    | Subset filenames and row counts   | Mode, output pattern, subset names, cluster ID       |
| Used by     | htc_gen_submit(queue_from = ...)  | htc_download()                                       |
| Lives       | On disk in the project directory  | In memory for the R session                          |
| Relevant in | Multiple mode only                | Both modes                                           |
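
To make "session-level R option" concrete, here is a minimal sketch of how metadata could accumulate in an option across several function calls. The option name demo.job_manifest and the three helper functions are hypothetical, chosen for illustration; they are not submitr's internals, and the cluster ID is a placeholder.

```r
# Hypothetical helpers; "demo.job_manifest" is an illustrative option name,
# not the one submitr uses.
manifest_reset <- function() {
  options(demo.job_manifest = list())
  invisible(NULL)
}

manifest_add <- function(...) {
  current <- getOption("demo.job_manifest", default = list())
  # Merge the new fields into whatever earlier steps recorded
  options(demo.job_manifest = utils::modifyList(current, list(...)))
  invisible(NULL)
}

manifest_get <- function() getOption("demo.job_manifest", default = list())

manifest_reset()
manifest_add(mode = "multiple",
             subsets = c("adelie.csv", "chinstrap.csv", "gentoo.csv"))
manifest_add(cluster_id = 12345L)  # placeholder cluster ID
str(manifest_get())
```

Because the state lives in options(), it disappears when the R session ends, which matches the "in memory for the R session" row above.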

Single mode: one dataset, one job

In single mode, the full dataset and the R script are baked into the container image at build time. No input files are transferred at runtime – the container has everything it needs. HTCondor runs the job, the script writes results to results/, the executable script tars them up, and HTCondor transfers the tarball back to the submit node.

Building the container

The Dockerfile includes the R script and the data file via the code_file and data_file arguments:

containr::generate_dockerfile(
  r_version = "4.5.0",
  code_file = "R/analysis.R",
  data_file = "data-raw/sample.csv",
  comments  = TRUE,
  verbose   = TRUE
)

This produces COPY instructions that preserve the local directory structure inside the container:

COPY R/analysis.R /home/R/analysis.R
COPY data-raw/sample.csv /home/data-raw/sample.csv

Build and push the image:

containr::build_image(
  tag  = "registry.doit.wisc.edu/your.netid/penguins-analysis:1.0.0",
  tool = "docker"
)
containr::push_image(
  image_id    = "abc123",
  netid       = "your.netid",
  project     = "penguins-analysis",
  tag         = "1.0.0",
  tool        = "docker",
  check_login = FALSE
)

Generating the submit file and executable

submitr::htc_gen_submit(
  output_file     = "analysis.sub",
  container_image = "registry.doit.wisc.edu/your.netid/penguins-analysis:1.0.0",
  executable      = "analysis.sh",
  output_files    = "analysis-results.tar.gz"
)

submitr::htc_gen_executable(
  r_script       = "R/analysis.R",
  output_file    = "analysis.sh",
  data_files     = "data-raw/sample.csv",
  results_folder = "results"
)

The generated .sh file:

#!/bin/bash
set -euo pipefail

cd "${_CONDOR_SCRATCH_DIR:-$PWD}"

mkdir -p results
Rscript /home/R/analysis.R /home/data-raw/sample.csv
tar -czf analysis-results.tar.gz results

The script changes to the scratch directory (writable), creates the results folder, runs the R script using absolute paths to the baked-in files, and tars the results. The tarball lands in the scratch directory where HTCondor expects to find it.
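
For a local dry run from R, the shell fallback ${_CONDOR_SCRATCH_DIR:-$PWD} can be mirrored as in the sketch below. The helper name scratch_dir is hypothetical; the only thing taken from the script above is the _CONDOR_SCRATCH_DIR environment variable.

```r
# Hypothetical helper mirroring the shell expression ${_CONDOR_SCRATCH_DIR:-$PWD}
scratch_dir <- function() {
  dir <- Sys.getenv("_CONDOR_SCRATCH_DIR", unset = "")
  # In bash, ":-" falls back when the variable is unset OR empty,
  # so an empty string is treated the same as a missing variable.
  if (nzchar(dir)) dir else getwd()
}

scratch_dir()
```

Outside HTCondor the variable is normally unset, so this returns the current working directory. That is why the same .sh script can also be run by hand for testing.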

The generated .sub file:

container_image = docker://registry.doit.wisc.edu/your.netid/penguins-analysis:1.0.0
universe = container

executable = analysis.sh

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = analysis-results.tar.gz

request_cpus   = 1
request_memory = 4GB
request_disk   = 4GB

queue 1

No transfer_input_files – everything is inside the container.

Submitting and downloading

submitr::htc_start()
submitr::htc_upload(files = c("analysis.sub", "analysis.sh"))
job <- submitr::htc_submit("analysis.sub")
submitr::htc_status(cluster_id = job, watch = TRUE)
submitr::htc_download()

htc_download() takes no arguments here because the job manifest has everything it needs. htc_gen_submit() recorded that this is a single-mode job with analysis-results.tar.gz as the output. htc_submit() recorded the cluster ID. From those two pieces, htc_download() constructs the file list: one tarball plus three log files ({cluster_id}-0-job.log, {cluster_id}-0-job.err, and {cluster_id}-0-job.out).

You can also be explicit if you prefer:

submitr::htc_download(cluster_id = job)
submitr::htc_download(files = "*.tar.gz", local_path = "results/")

Multiple mode: one analysis, three species

In multiple mode, the same R script runs independently on each subset of the data. The container holds the R script and the software environment, but not the data – each subset file is transferred at runtime by HTCondor.

The motivation is straightforward: rather than analyzing all three penguin species together, you want to analyze each one independently. Maybe the analysis is computationally expensive, or maybe the subsets come from different sources and should be processed in isolation. HTCondor runs one job per subset, in parallel, across available compute resources.

Splitting the data

Use toolero::write_by_group() to split the dataset by species and produce a data manifest:

penguins <- toolero::read_clean_csv("data-raw/sample.csv")

toolero::write_by_group(
  penguins,
  group_col  = "species",
  output_dir = "data/jobs",
  manifest   = TRUE
)

This produces three CSV files (adelie.csv, chinstrap.csv, gentoo.csv) and a data manifest (manifest.csv) listing them. This data manifest is a file on disk – not to be confused with the job manifest that submitr builds in memory during the submission workflow.
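
Conceptually, the split step writes one file per group plus a manifest listing those files. The base R sketch below illustrates the idea only; it is not toolero's implementation. The function name split_with_manifest, the manifest column names file and n_rows, and the toy data frame are all assumptions made for the example.

```r
# Toy rows for illustration only; not real measurements.
toy <- data.frame(
  species = c("Adelie", "Adelie", "Chinstrap", "Gentoo"),
  value   = c(1, 2, 3, 4)
)

# Hypothetical helper: write one CSV per group and a manifest describing them.
split_with_manifest <- function(data, group_col, output_dir) {
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  groups <- split(data, data[[group_col]])
  files  <- paste0(tolower(names(groups)), ".csv")
  for (i in seq_along(groups)) {
    utils::write.csv(groups[[i]], file.path(output_dir, files[i]),
                     row.names = FALSE)
  }
  # Column names "file" and "n_rows" are assumptions for this sketch
  manifest <- data.frame(
    file   = files,
    n_rows = vapply(groups, nrow, integer(1), USE.NAMES = FALSE)
  )
  utils::write.csv(manifest, file.path(output_dir, "manifest.csv"),
                   row.names = FALSE)
  manifest
}

out_dir <- file.path(tempdir(), "jobs")
split_with_manifest(toy, "species", out_dir)
list.files(out_dir)
```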

Building the container

The container needs the R script but not the data. The data files are transferred at runtime:

containr::generate_dockerfile(
  r_version = "4.5.0",
  code_file = "R/analysis.R",
  comments  = TRUE,
  verbose   = TRUE
)

No data_file argument. The Dockerfile copies only the script:

COPY R/analysis.R /home/R/analysis.R

Build and push as before, this time under a new tag (2.0.0 in the examples that follow).

Generating the submit file and executable

submitr::htc_gen_submit(
  output_file     = "analysis.sub",
  container_image = "registry.doit.wisc.edu/your.netid/penguins-analysis:2.0.0",
  executable      = "analysis.sh",
  mode            = "multiple",
  queue_from      = "data/jobs/manifest.csv"
)

submitr::htc_gen_executable(
  r_script       = "R/analysis.R",
  output_file    = "analysis.sh",
  mode           = "multiple",
  results_folder = "results"
)

htc_gen_submit() reads the data manifest via queue_from, extracts the subset filenames, and writes subdatasets.csv alongside the submit file. It also records the mode, subset names, and output file pattern in the job manifest for htc_download() to use later.
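
The manifest-to-queue-file step can be pictured as follows: read the data manifest, pull out the filenames, and write them one per line. This is a sketch, not submitr's code. The function name write_queue_file is hypothetical, and it assumes the manifest stores filenames in a column called file.

```r
# Hypothetical helper: turn a data manifest into a one-filename-per-line queue file.
# Assumes the manifest has a column named "file" (an assumption for this sketch).
write_queue_file <- function(manifest_path, queue_path, file_col = "file") {
  manifest <- utils::read.csv(manifest_path, stringsAsFactors = FALSE)
  if (!file_col %in% names(manifest)) {
    stop("Column '", file_col, "' not found in ", manifest_path, call. = FALSE)
  }
  # Keep only the bare filename, dropping any directory prefix
  subsets <- basename(manifest[[file_col]])
  writeLines(subsets, queue_path)
  invisible(subsets)
}

# Usage against a throwaway manifest in a temporary directory
demo_dir <- file.path(tempdir(), "queue-demo")
dir.create(demo_dir, showWarnings = FALSE, recursive = TRUE)
utils::write.csv(
  data.frame(file = c("adelie.csv", "chinstrap.csv", "gentoo.csv")),
  file.path(demo_dir, "manifest.csv"),
  row.names = FALSE
)
write_queue_file(file.path(demo_dir, "manifest.csv"),
                 file.path(demo_dir, "subdatasets.csv"))
readLines(file.path(demo_dir, "subdatasets.csv"))
```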

The generated .sh file:

#!/bin/bash
set -euo pipefail

cd "${_CONDOR_SCRATCH_DIR:-$PWD}"

mkdir -p results
Rscript /home/R/analysis.R ${1}
tar -czf ${1}-results.tar.gz results

${1} is the subset filename passed by HTCondor – e.g. adelie.csv. The R script receives it as commandArgs(trailingOnly = TRUE)[1]. The tarball is named per-subset so jobs don’t overwrite each other.

The generated .sub file:

container_image = docker://registry.doit.wisc.edu/your.netid/penguins-analysis:2.0.0
universe = container

executable = analysis.sh
arguments = $(file)

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = $(file)
transfer_output_files = $(file)-results.tar.gz

request_cpus   = 1
request_memory = 4GB
request_disk   = 4GB

queue file from subdatasets.csv

Key differences from single mode: arguments = $(file) passes the subset filename to the executable, transfer_input_files = $(file) sends each subset to the execute node at runtime, and queue file from subdatasets.csv submits one job per line in the file.

Submitting and downloading

submitr::htc_start()
submitr::htc_upload(
  files = c("analysis.sub", "analysis.sh", "subdatasets.csv",
            "data/jobs/adelie.csv", "data/jobs/chinstrap.csv",
            "data/jobs/gentoo.csv")
)
job <- submitr::htc_submit("analysis.sub")
submitr::htc_status(cluster_id = job, watch = TRUE)
submitr::htc_download()

This is where the job manifest pays off. With three species there are 12 files to retrieve: three result tarballs plus three log files (.log, .err, .out) for each of the three jobs. htc_download() constructs the full list automatically: it reads the subset names from the job manifest, expands the output pattern into adelie.csv-results.tar.gz, chinstrap.csv-results.tar.gz, and gentoo.csv-results.tar.gz, and generates the nine log file names from the cluster ID and process count.
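
The arithmetic behind the 12 files can be checked with a short sketch. The helper expected_downloads is hypothetical and not part of submitr. It assumes that log names in multiple mode follow the single-mode pattern {cluster_id}-{process}-job.{log,err,out} and that process numbers start at 0 and follow the order of the subsets. The cluster ID is a placeholder.

```r
# Hypothetical helper: list the files one would expect after a multiple-mode run.
# Assumes logs are named {cluster_id}-{process}-job.{log,err,out}, process from 0.
expected_downloads <- function(cluster_id, subsets) {
  procs    <- seq_along(subsets) - 1L
  tarballs <- paste0(subsets, "-results.tar.gz")
  # Every process index combined with every log extension
  logs <- as.vector(outer(
    paste0(cluster_id, "-", procs, "-job"),
    c(".log", ".err", ".out"),
    paste0
  ))
  c(tarballs, logs)
}

files <- expected_downloads(12345, c("adelie.csv", "chinstrap.csv", "gentoo.csv"))
length(files)  # 12
```

With a single subset the same function yields four names, matching the single-mode count of one tarball plus three logs.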

One call, no globs, no guessing:

# These two calls are equivalent
submitr::htc_download()
submitr::htc_download(cluster_id = job)

Side-by-side comparison

What lives in the container

|                            | Single mode | Multiple mode |
|----------------------------|-------------|---------------|
| R + packages + system libs | Yes         | Yes           |
| R script                   | Yes         | Yes           |
| Data files                 | Yes         | No            |

The .sub file

| Directive             | Single mode             | Multiple mode                   |
|-----------------------|-------------------------|---------------------------------|
| container_image       | docker://...            | docker://...                    |
| universe              | container               | container                       |
| executable            | analysis.sh             | analysis.sh                     |
| arguments             | (absent)                | $(file)                         |
| transfer_input_files  | (absent)                | $(file)                         |
| transfer_output_files | analysis-results.tar.gz | $(file)-results.tar.gz          |
| queue                 | queue 1                 | queue file from subdatasets.csv |

The .sh file

| Line              | Single mode                                          | Multiple mode                        |
|-------------------|------------------------------------------------------|--------------------------------------|
| Working directory | cd "${_CONDOR_SCRATCH_DIR:-$PWD}"                    | cd "${_CONDOR_SCRATCH_DIR:-$PWD}"    |
| Results folder    | mkdir -p results                                     | mkdir -p results                     |
| Run script        | Rscript /home/R/analysis.R /home/data-raw/sample.csv | Rscript /home/R/analysis.R ${1}      |
| Package results   | tar -czf analysis-results.tar.gz results             | tar -czf ${1}-results.tar.gz results |

The R script

The R script is identical in both modes. The only difference is where the input file path comes from:

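| Context               | Single mode            | Multiple mode      |
|-----------------------|------------------------|--------------------|
| Interactive (RStudio) | Hardcoded path         | Hardcoded path     |
| Quarto render         | params$input_file      | params$input_file  |
| Rscript (HTCondor)    | Absolute path from .sh | ${1} from HTCondor |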

In both cases, detect_execution_context() resolves the right source and commandArgs(trailingOnly = TRUE)[1] picks up whatever the .sh script passes. The R script doesn’t know whether it’s running in single or multiple mode.

Files uploaded to the submit node

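|                   | Single mode | Multiple mode |
|-------------------|-------------|---------------|
| .sub file         | Yes         | Yes           |
| .sh file          | Yes         | Yes           |
| subdatasets.csv   | No          | Yes           |
| Subset data files | No          | Yes           |
| R script          | No          | No            |
| Full dataset      | No          | No            |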

Files downloaded after the job

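|                 | Single mode                       | Multiple mode          |
|-----------------|-----------------------------------|------------------------|
| Result tarballs | 1 (analysis-results.tar.gz)       | 1 per subset (3 total) |
| Log files       | 3 ({cluster}-0-job.{log,err,out}) | 3 per subset (9 total) |
| Total files     | 4                                 | 12                     |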

How htc_download() resolves files

In both modes, htc_download() reads the job manifest that was built automatically during the workflow. No file lists or glob patterns are needed.

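|                    | Single mode                  | Multiple mode                             |
|--------------------|------------------------------|-------------------------------------------|
| Job manifest knows | 1 tarball name, cluster ID   | Subset names, tarball pattern, cluster ID |
| Files resolved     | 1 tarball + 3 logs = 4 files | 3 tarballs + 9 logs = 12 files            |
| Researcher types   | htc_download()               | htc_download()                            |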

The same zero-argument call works for both modes because the job manifest captures the difference.

How the two manifests relate

The data manifest and the job manifest are connected but distinct. The data manifest feeds into the job manifest – when htc_gen_submit() reads the data manifest via queue_from, it extracts the subset filenames and stores them in the job manifest. From that point on, the job manifest carries the subset names forward so that htc_download() can reconstruct the tarball names without re-reading any files from disk.

toolero::write_by_group()
  |
  v
manifest.csv  (data manifest -- file on disk)
  |
  v
htc_gen_submit(queue_from = "manifest.csv")
  |
  +---> subdatasets.csv  (sent to HTCondor)
  +---> job manifest     (subset names stored in session)
          |
          v
        htc_submit()
          +---> job manifest  (cluster ID added)
                  |
                  v
                htc_download()  (resolves all files automatically)

When to use which mode

Use single mode when your analysis processes one dataset as a unit. The container has everything baked in. No input files are transferred at runtime. This is the simplest path and the right starting point for a first job at CHTC (the Center for High Throughput Computing).

Use multiple mode when you need to run the same analysis independently across subsets of the data. The data files are transferred at runtime, one per job. This scales naturally – adding more subsets means more jobs, not more configuration.

Start with single mode to confirm the analysis runs correctly on CHTC. Switch to multiple mode when you are confident the container, script, and results pipeline are working.