---
title: "Single Jobs vs Multiple Jobs on HTCondor"
output: 
  rmarkdown::html_vignette:
    css: styles.css
vignette: >
  %\VignetteIndexEntry{Single Jobs vs Multiple Jobs on HTCondor}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
date: "Created 2026-05-20 | Last updated `r Sys.Date()`"
---

## Why two modes?

Not every analysis scales the same way. Sometimes you have one dataset and one
script and you want to run it once on hardware you don't have locally. Other
times you have the same analysis but need to run it independently across many
subsets of the data -- once per species, once per site, once per simulation
parameter, once per experimental condition.

HTCondor handles both cases, but the job setup is different. The container
image, the submit file, the executable script, and the file transfer strategy
all change depending on whether you are submitting one job or many. Getting
the setup wrong is a common source of job failures, and the resulting errors
are not always obvious.

This vignette walks through both modes side by side using the same analysis --
a summary and visualization of the Palmer Penguins dataset. In single mode,
the analysis runs once over the full dataset. In multiple mode, it runs once
per species, producing independent results for Adelie, Chinstrap, and Gentoo
penguins.

The R script is identical in both cases. What changes is how the data gets to
the script and how the surrounding infrastructure is configured.

## The analysis

The analysis is simple by design so the focus stays on the infrastructure. The
R script loads a CSV file, computes a grouped summary of body mass and flipper
length, produces a scatterplot, and writes both outputs to a `results/` folder.

The script uses `toolero::detect_execution_context()` to resolve the input
file path. In an interactive RStudio session, the path is hardcoded for
convenience. When rendered via Quarto, it comes from the YAML `params` block.
When run via `Rscript` on HTCondor, it comes from the first command-line
argument:

```r
context <- toolero::detect_execution_context()

input_file <- switch(context,
  interactive = "data-raw/sample.csv",
  quarto      = params$input_file,
  rscript     = commandArgs(trailingOnly = TRUE)[1]
)

penguins <- toolero::read_clean_csv(input_file)
```

The output section writes results to a relative path:

```r
results_dir <- "results"
dir.create(results_dir, showWarnings = FALSE, recursive = TRUE)

toolero::write_clean_csv(p_stats, file.path(results_dir, "results.csv"),
                         overwrite = TRUE)
ggplot2::ggsave(filename = file.path(results_dir, "plot.png"), plot = p_plot)
```

This relative path is portable. In RStudio and under `quarto render`,
`results/` resolves against the current working directory. On HTCondor it
resolves against the writable scratch space, because the executable script
changes into that directory before calling `Rscript`.

## Two kinds of manifest

The word "manifest" appears in two different contexts in the submitr workflow.
They serve different purposes and should not be confused.

The **data manifest** is a CSV file on disk. It is produced by
`toolero::write_by_group(manifest = TRUE)` and lists the subset data files
created when splitting a dataset by a grouping column. `htc_gen_submit()` reads
this file via the `queue_from` argument to produce the `subdatasets.csv` that
HTCondor uses to dispatch one job per subset. The data manifest is only relevant
in multiple mode.

The **job manifest** is a session-level R option that accumulates metadata as
you work through the submitr pipeline. It is not a file. It lives in memory
and is cleared when you call `htc_start()` or restart R. Each function in the
workflow contributes a piece: `htc_gen_submit()` records the mode, the output
file pattern, and the subset names. `htc_gen_executable()` records the script
name and results folder. `htc_submit()` records the cluster ID assigned by
HTCondor. At the end of the workflow, `htc_download()` reads the job manifest
to determine which files to retrieve -- no glob patterns or manual file lists
required.

| | Data manifest | Job manifest |
|---|---|---|
| What it is | A CSV file (`manifest.csv`) | Session-level R option |
| Created by | `toolero::write_by_group()` | `htc_gen_submit()`, `htc_gen_executable()`, `htc_submit()` |
| Contains | Subset filenames and row counts | Mode, output pattern, subset names, script name, results folder, cluster ID |
| Used by | `htc_gen_submit(queue_from = ...)` | `htc_download()` |
| Where it lives | On disk in the project directory | In memory for the R session |
| Relevant in | Multiple mode only | Both modes |

## Single mode: one dataset, one job

In single mode, the full dataset and the R script are baked into the container
image at build time. No input files are transferred at runtime -- the container
has everything the script needs. HTCondor runs the job, the script writes
results to `results/`, the executable script tars them up, and HTCondor
transfers the tarball back to the submit node.

### Building the container

The Dockerfile includes the R script and the data file via the `code_file` and
`data_file` arguments:

```r
containr::generate_dockerfile(
  r_version = "4.5.0",
  code_file = "R/analysis.R",
  data_file = "data-raw/sample.csv",
  comments  = TRUE,
  verbose   = TRUE
)
```

This produces `COPY` instructions that preserve the local directory structure
inside the container:

```dockerfile
COPY R/analysis.R /home/R/analysis.R
COPY data-raw/sample.csv /home/data-raw/sample.csv
```

Build and push the image:

```r
containr::build_image(
  tag  = "registry.doit.wisc.edu/your.netid/penguins-analysis:1.0.0",
  tool = "docker"
)
containr::push_image(
  image_id    = "abc123",  # placeholder: substitute your own image ID
  netid       = "your.netid",
  project     = "penguins-analysis",
  tag         = "1.0.0",
  tool        = "docker",
  check_login = FALSE
)
```

### Generating the submit file and executable

```r
submitr::htc_gen_submit(
  output_file     = "analysis.sub",
  container_image = "registry.doit.wisc.edu/your.netid/penguins-analysis:1.0.0",
  executable      = "analysis.sh",
  output_files    = "analysis-results.tar.gz"
)

submitr::htc_gen_executable(
  r_script       = "R/analysis.R",
  output_file    = "analysis.sh",
  data_files     = "data-raw/sample.csv",
  results_folder = "results"
)
```

The generated `.sh` file:

```bash
#!/bin/bash
set -euo pipefail

cd "${_CONDOR_SCRATCH_DIR:-$PWD}"

mkdir -p results
Rscript /home/R/analysis.R /home/data-raw/sample.csv
tar -czf analysis-results.tar.gz results
```

The script changes to the scratch directory (writable), creates the results
folder, runs the R script using absolute paths to the baked-in files, and
tars the results. The tarball lands in the scratch directory where HTCondor
expects to find it.
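
The `${_CONDOR_SCRATCH_DIR:-$PWD}` expansion is what lets the same script run on an execute node and on a laptop. A minimal sketch of that fallback, runnable locally, is shown here. The `resolve_workdir` helper is hypothetical; it only wraps the expansion so the result can be printed instead of passed to `cd`.

```bash
# Hypothetical helper: print the directory the generated script would cd into.
resolve_workdir() {
  printf '%s\n' "${_CONDOR_SCRATCH_DIR:-$PWD}"
}

# Outside HTCondor the variable is unset, so the fallback is the current directory.
unset _CONDOR_SCRATCH_DIR
local_dir="$(resolve_workdir)"

# Simulate an execute node by setting the variable for a single call.
fake_scratch="$(mktemp -d)"
condor_dir="$(_CONDOR_SCRATCH_DIR="$fake_scratch" resolve_workdir)"

echo "without HTCondor: $local_dir"
echo "with HTCondor:    $condor_dir"
```

The first line prints the directory the sketch was run from, and the second prints the temporary directory standing in for HTCondor's scratch space.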

The generated `.sub` file:

```
container_image = docker://registry.doit.wisc.edu/your.netid/penguins-analysis:1.0.0
universe = container

executable = analysis.sh

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = analysis-results.tar.gz

request_cpus   = 1
request_memory = 4GB
request_disk   = 4GB

queue 1
```

No `transfer_input_files` -- everything is inside the container.

### Submitting and downloading

```r
submitr::htc_start()
submitr::htc_upload(files = c("analysis.sub", "analysis.sh"))
job <- submitr::htc_submit("analysis.sub")
submitr::htc_status(cluster_id = job, watch = TRUE)
submitr::htc_download()
```

`htc_download()` takes no arguments here because the job manifest has
everything it needs. `htc_gen_submit()` recorded that this is a single-mode
job with `analysis-results.tar.gz` as the output. `htc_submit()` recorded
the cluster ID. From those two pieces, `htc_download()` constructs the file
list: one tarball plus three log files (`{cluster_id}-0-job.log`, `.err`,
`.out`).

You can also be explicit if you prefer:

```r
# Name the cluster explicitly
submitr::htc_download(cluster_id = job)
# Or select files by pattern and set the local destination
submitr::htc_download(files = "*.tar.gz", local_path = "results/")
```

## Multiple mode: one analysis, three species

In multiple mode, the same R script runs independently on each subset of the
data. The container holds the R script and the software environment, but not
the data -- each subset file is transferred at runtime by HTCondor.

The motivation is straightforward: rather than analyzing all three penguin
species together, you want to analyze each one independently. Maybe the
analysis is computationally expensive, or maybe the subsets come from
different sources and should be processed in isolation. HTCondor runs one
job per subset, in parallel, across available compute resources.

### Splitting the data

Use `toolero::write_by_group()` to split the dataset by species and produce a
data manifest:

```r
penguins <- toolero::read_clean_csv("data-raw/sample.csv")

toolero::write_by_group(
  penguins,
  group_col  = "species",
  output_dir = "data/jobs",
  manifest   = TRUE
)
```

This produces three CSV files (`adelie.csv`, `chinstrap.csv`, `gentoo.csv`)
and a data manifest (`manifest.csv`) listing them. This data manifest is a
file on disk -- not to be confused with the job manifest that `submitr` builds
in memory during the submission workflow.

### Building the container

The container needs the R script but not the data. The data files are
transferred at runtime:

```r
containr::generate_dockerfile(
  r_version = "4.5.0",
  code_file = "R/analysis.R",
  comments  = TRUE,
  verbose   = TRUE
)
```

No `data_file` argument. The Dockerfile copies only the script:

```dockerfile
COPY R/analysis.R /home/R/analysis.R
```

Build and push as before, tagging the image `2.0.0` so that it matches the
`container_image` used in the submit file.

### Generating the submit file and executable

```r
submitr::htc_gen_submit(
  output_file = "analysis.sub",
  container_image = "registry.doit.wisc.edu/your.netid/penguins-analysis:2.0.0",
  executable  = "analysis.sh",
  mode        = "multiple",
  queue_from  = "data/jobs/manifest.csv"
)

submitr::htc_gen_executable(
  r_script       = "R/analysis.R",
  output_file    = "analysis.sh",
  mode           = "multiple",
  results_folder = "results"
)
```

`htc_gen_submit()` reads the data manifest via `queue_from`, extracts the
subset filenames, and writes `subdatasets.csv` alongside the submit file.
It also records the mode, subset names, and output file pattern in the job
manifest for `htc_download()` to use later.
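
As a rough illustration of that extraction step, the sketch below derives a `subdatasets.csv` from a data manifest using standard shell tools. The manifest layout (a header row, with the filename in the first column) and the column names `file` and `n_rows` are assumptions made for this example, and the row counts are illustrative. The file that `write_by_group()` actually writes may look different, and this is not a description of how `htc_gen_submit()` is implemented.

```bash
workdir="$(mktemp -d)"

# Hypothetical data manifest: column names and row counts are illustrative.
cat > "$workdir/manifest.csv" <<'EOF'
file,n_rows
adelie.csv,152
chinstrap.csv,68
gentoo.csv,124
EOF

# Drop the header and keep the first column: one subset filename per line,
# which is the shape `queue file from subdatasets.csv` reads.
tail -n +2 "$workdir/manifest.csv" | cut -d, -f1 > "$workdir/subdatasets.csv"

cat "$workdir/subdatasets.csv"
n_jobs="$(wc -l < "$workdir/subdatasets.csv" | tr -d ' ')"
echo "jobs to queue: $n_jobs"
```

Each line of the resulting file corresponds to one job in the cluster.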

The generated `.sh` file:

```bash
#!/bin/bash
set -euo pipefail

cd "${_CONDOR_SCRATCH_DIR:-$PWD}"

mkdir -p results
Rscript /home/R/analysis.R ${1}
tar -czf ${1}-results.tar.gz results
```

`${1}` is the subset filename passed by HTCondor -- e.g. `adelie.csv`. The
R script receives it as `commandArgs(trailingOnly = TRUE)[1]`. The tarball
is named per-subset so jobs don't overwrite each other.
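
The argument handling can be dry-run locally before anything is submitted. The sketch below is an illustration only: it replaces `Rscript` with a hypothetical stub function that records its arguments, then runs the same three steps as the generated script for one subset name. Quoting `"$1"` is a choice made in this sketch; the generated script uses the unquoted form shown above.

```bash
# Hypothetical stand-in for Rscript: records its arguments instead of running R.
Rscript() {
  echo "Rscript called with: $*" > results/call.txt
}

# Same steps as the generated script body, wrapped so "$1" is the subset name.
run_job() {
  mkdir -p results
  Rscript /home/R/analysis.R "$1"
  tar -czf "$1-results.tar.gz" results
}

# Work in a throwaway directory so nothing in the project is touched.
cd "$(mktemp -d)"
run_job adelie.csv

ls
tar -tzf adelie.csv-results.tar.gz
```

The listing should show `adelie.csv-results.tar.gz` next to the `results` folder, which confirms that the subset name reaches both the script call and the tarball name.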

The generated `.sub` file:

```
container_image = docker://registry.doit.wisc.edu/your.netid/penguins-analysis:2.0.0
universe = container

executable = analysis.sh
arguments = $(file)

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = $(file)
transfer_output_files = $(file)-results.tar.gz

request_cpus   = 1
request_memory = 4GB
request_disk   = 4GB

queue file from subdatasets.csv
```

Key differences from single mode: `arguments = $(file)` passes the subset
filename to the executable, `transfer_input_files = $(file)` sends each
subset to the execute node at runtime, and `queue file from subdatasets.csv`
submits one job per line in the file.
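
Since each line of `subdatasets.csv` becomes a job whose input is named by `$(file)`, a listed subset with no matching file leaves that job without its input. A small pre-flight check is sketched below. `count_missing` is a hypothetical helper, not part of submitr, and it assumes the subset files sit in the directory passed as the second argument.

```bash
# Hypothetical helper: count entries in a queue list that have no matching file.
# $1 = path to the list (one filename per line), $2 = directory holding the files
count_missing() {
  missing=0
  while IFS= read -r name || [ -n "$name" ]; do
    if [ -n "$name" ] && [ ! -f "$2/$name" ]; then
      echo "missing: $2/$name" >&2
      missing=$((missing + 1))
    fi
  done < "$1"
  echo "$missing"
}

# Demo in a temp directory: three subsets listed, only two present.
demo="$(mktemp -d)"
printf '%s\n' adelie.csv chinstrap.csv gentoo.csv > "$demo/subdatasets.csv"
touch "$demo/adelie.csv" "$demo/gentoo.csv"

n_missing="$(count_missing "$demo/subdatasets.csv" "$demo")"
echo "missing files: $n_missing"
```

In the demo, `chinstrap.csv` is reported on standard error and the count is printed on standard output, so the result can be used to stop before uploading.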

### Submitting and downloading

```r
submitr::htc_start()
submitr::htc_upload(
  files = c("analysis.sub", "analysis.sh", "subdatasets.csv",
            "data/jobs/adelie.csv", "data/jobs/chinstrap.csv",
            "data/jobs/gentoo.csv")
)
job <- submitr::htc_submit("analysis.sub")
submitr::htc_status(cluster_id = job, watch = TRUE)
submitr::htc_download()
```

This is where the job manifest pays off. Each of the three species produces
one result tarball and three log files, so there are 12 files to retrieve.
`htc_download()` constructs the full list automatically: it reads the subset
names from the job manifest, expands the output pattern into
`adelie.csv-results.tar.gz`, `chinstrap.csv-results.tar.gz`, and
`gentoo.csv-results.tar.gz`, and generates the nine log file names from the
cluster ID and process count.

One call, no globs, no guessing:

```r
# These two calls are equivalent
submitr::htc_download()
submitr::htc_download(cluster_id = job)
```
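
To make the count of 12 concrete, the expansion can be sketched in shell. This is an illustration rather than submitr's implementation. The cluster ID `12345` is hypothetical, and the process numbers 0 to 2 are inferred from the `{cluster_id}-0-job` naming shown for single mode.

```bash
cluster_id=12345   # hypothetical cluster ID
subsets="adelie.csv chinstrap.csv gentoo.csv"

files=""
proc=0
for subset in $subsets; do
  # One tarball per subset, following the $(file)-results.tar.gz pattern
  files="$files ${subset}-results.tar.gz"
  # Three log files per process (assumed naming: cluster-process-job.ext)
  for ext in log err out; do
    files="$files ${cluster_id}-${proc}-job.${ext}"
  done
  proc=$((proc + 1))
done

# Print one expected file per line, then the total
printf '%s\n' $files
count="$(printf '%s\n' $files | wc -l | tr -d ' ')"
echo "total: $count"
```

Three tarballs plus nine log files gives the 12 names that the job manifest lets `htc_download()` work out without any input from the researcher.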

## Side-by-side comparison

### What lives in the container

| | Single mode | Multiple mode |
|---|---|---|
| R + packages + system libs | Yes | Yes |
| R script | Yes | Yes |
| Data files | Yes | No |

### The `.sub` file

| Directive | Single mode | Multiple mode |
|---|---|---|
| `container_image` | `docker://...` | `docker://...` |
| `universe` | `container` | `container` |
| `executable` | `analysis.sh` | `analysis.sh` |
| `arguments` | *(absent)* | `$(file)` |
| `transfer_input_files` | *(absent)* | `$(file)` |
| `transfer_output_files` | `analysis-results.tar.gz` | `$(file)-results.tar.gz` |
| `queue` | `queue 1` | `queue file from subdatasets.csv` |

### The `.sh` file

| Line | Single mode | Multiple mode |
|---|---|---|
| Working directory | `cd "${_CONDOR_SCRATCH_DIR:-$PWD}"` | `cd "${_CONDOR_SCRATCH_DIR:-$PWD}"` |
| Results folder | `mkdir -p results` | `mkdir -p results` |
| Run script | `Rscript /home/R/analysis.R /home/data-raw/sample.csv` | `Rscript /home/R/analysis.R ${1}` |
| Package results | `tar -czf analysis-results.tar.gz results` | `tar -czf ${1}-results.tar.gz results` |

### The R script

The R script is identical in both modes. The only difference is where the
input file path comes from:

| Context | Single mode | Multiple mode |
|---|---|---|
| Interactive (RStudio) | Hardcoded path | Hardcoded path |
| Quarto render | `params$input_file` | `params$input_file` |
| Rscript (HTCondor) | Absolute path from `.sh` | `${1}` from HTCondor |

In both cases, `detect_execution_context()` resolves the right source and
`commandArgs(trailingOnly = TRUE)[1]` picks up whatever the `.sh` script
passes. The R script doesn't know whether it's running in single or multiple
mode.

### Files uploaded to the submit node

| | Single mode | Multiple mode |
|---|---|---|
| `.sub` file | Yes | Yes |
| `.sh` file | Yes | Yes |
| `subdatasets.csv` | No | Yes |
| Subset data files | No | Yes |
| R script | No | No |
| Full dataset | No | No |

### Files downloaded after the job

| | Single mode | Multiple mode |
|---|---|---|
| Result tarballs | 1 (`analysis-results.tar.gz`) | 1 per subset (3 total) |
| Log files | 3 (`{cluster}-0-job.{log,err,out}`) | 3 per subset (9 total) |
| Total files | 4 | 12 |
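
Every tarball in multiple mode wraps a folder with the same name, `results/`, so unpacking all three into one directory would overwrite the same files each time. One way to keep them apart is to extract each archive into a directory named after its subset. The sketch below builds three stand-in tarballs in a temp directory so that it can run anywhere. The `unpacked/` layout is an assumption made for this example, not something submitr creates.

```bash
cd "$(mktemp -d)"

# Stand-in tarballs so the sketch is self-contained.
for subset in adelie chinstrap gentoo; do
  mkdir -p results
  echo "species,$subset" > results/results.csv
  tar -czf "$subset.csv-results.tar.gz" results
  rm -r results
done

# Extract each tarball into its own directory, e.g. unpacked/adelie/results/.
for tarball in *-results.tar.gz; do
  subset="${tarball%.csv-results.tar.gz}"
  mkdir -p "unpacked/$subset"
  tar -xzf "$tarball" -C "unpacked/$subset"
done

find unpacked -type f | sort
```

Each species ends up with its own `results/` folder under `unpacked/`, so the three sets of outputs can be compared side by side.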

### How `htc_download()` resolves files

In both modes, `htc_download()` reads the job manifest that was built
automatically during the workflow. No file lists or glob patterns are needed.

| | Single mode | Multiple mode |
|---|---|---|
| Job manifest knows | 1 tarball name, cluster ID | Subset names, tarball pattern, cluster ID |
| Files resolved | 1 tarball + 3 logs = 4 files | 3 tarballs + 9 logs = 12 files |
| Researcher types | `htc_download()` | `htc_download()` |

The same zero-argument call works for both modes because the job manifest
captures the difference.

### How the two manifests relate

The data manifest and the job manifest are connected but distinct. The data
manifest feeds into the job manifest -- when `htc_gen_submit()` reads the
data manifest via `queue_from`, it extracts the subset filenames and stores
them in the job manifest. From that point on, the job manifest carries the
subset names forward so that `htc_download()` can reconstruct the tarball
names without re-reading any files from disk.

```
toolero::write_by_group()
  |
  v
manifest.csv  (data manifest -- file on disk)
  |
  v
htc_gen_submit(queue_from = "manifest.csv")
  |
  +---> subdatasets.csv  (sent to HTCondor)
  +---> job manifest     (subset names stored in session)
          |
          v
        htc_submit()
          +---> job manifest  (cluster ID added)
                  |
                  v
                htc_download()  (resolves all files automatically)
```

## When to use which mode

Use **single mode** when your analysis processes one dataset as a unit. The
container has everything baked in, so no input files are transferred at
runtime. This is the simplest path and the right starting point for a first
job on CHTC (the Center for High Throughput Computing).

Use **multiple mode** when you need to run the same analysis independently
across subsets of the data. The data files are transferred at runtime, one per
job. This scales naturally -- adding more subsets means more jobs, not more
configuration.

Start with single mode to confirm the analysis runs correctly on CHTC.
Switch to multiple mode when you are confident the container, script, and
results pipeline are working.