---
title: "From the Notebook to the Cluster: Your First CHTC Job with submitr"
output:
  rmarkdown::html_vignette:
    css: styles.css
vignette: >
  %\VignetteIndexEntry{From the Notebook to the Cluster: Your First CHTC Job with submitr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
date: "Created 2026-04-30 | Last updated `r Sys.Date()`"
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
# library(submitr)
```
## The promise of a first CHTC job
Many research coding projects begin in a notebook-style workflow: an RStudio
project, a Quarto document, a few scripts, a folder of input files, and enough
local experimentation to understand what the analysis needs to do.
That is a good place to start. A laptop is often the right place to explore
data, write early code, make plots, and decide what question the analysis is
actually answering. But at some point, the same local workflow can become the
wrong place to keep pushing.
Maybe the analysis takes too long. Maybe you need to run the same model across
hundreds of parameter combinations. Maybe you need one job per participant,
county, simulation, image, genome, or bootstrap sample. Maybe you simply want a
workflow that will be easier to rerun six months from now.
That is where high-throughput computing becomes useful.
The UW-Madison Center for High Throughput Computing (CHTC) gives researchers
access to large pools of computing capacity. Instead of asking one computer to
do everything in sequence, you can break work into independent jobs and let the
HTC system run those jobs when resources are available.
`submitr` helps you take the final step in the **From the Notebook to the
Cluster** workflow: sending a prepared R project to CHTC. It is designed for
researchers who know R but may not yet be comfortable with HTCondor submit
files, executable shell scripts, `ssh`, `scp`, or the rhythm of working on a
remote submit node. It is also useful for regular CHTC users who want to reduce
repetitive setup work and make job submission easier to reproduce, review, and
share.
The goal is not to hide CHTC from you. The goal is to make the standard path
visible, repeatable, and less fragile.
---
## The larger idea: make the right choice easy
`submitr` is part of a small family of R packages for research computing
workflows:
```text
local R project
  └─ toolero: organize the project and prepare job-sized inputs
      └─ containr: capture the R software environment in a container image
          └─ submitr: send the containerized job to CHTC
```
You can use each package on its own.
Use `toolero` if you want a better project skeleton, cleaner data-loading
habits, Quarto scaffolding, or a simple way to split a dataset into many
job-sized files.
Use `containr` if you already have a project with an `renv.lock` file and want
to build a container image that can run somewhere other than your laptop.
Use `submitr` if your project is already organized and containerized, and you
are ready to submit it to CHTC.
Used together, the packages support a practical arc: start with a project that
is easier to understand, make its software environment portable, then send it
to CHTC with fewer command-line hurdles.
---
## What submitr does
A CHTC job needs a few pieces of information:
- what code to run;
- what input files to transfer;
- what container image to use;
- how much CPU, memory, and disk to request;
- what output files to retrieve;
- how many jobs to queue.
In HTCondor, that information is split across two main files.
The **submit file** tells HTCondor how to run the job. It describes the
executable script, container image, input files, output files, log files,
resource requests, and queue instructions.
The **executable script** tells the job what to do after it starts. For an R
analysis, that usually means creating an output folder, running `Rscript`, and
packaging results.
`submitr` helps you generate those files and then use them:
```{r}
submitr::htc_config() # configure your submit-node connection
submitr::htc_gen_submit() # generate the HTCondor submit file
submitr::htc_gen_executable() # generate the executable shell script
submitr::htc_upload() # copy files to the submit node
submitr::htc_submit() # submit the job
submitr::htc_status() # check progress
submitr::htc_download() # copy results back
```
---
## Before you submit anything
A successful CHTC submission starts before `condor_submit`.
Before using `submitr`, confirm that:
- your R script runs with `Rscript analysis.R` outside RStudio;
- your container image is pushed to a registry CHTC can access;
- you have SSH access to a CHTC submit node such as `ap2002.chtc.wisc.edu`.
The most important check is simple: your analysis should run outside RStudio.
```bash
Rscript analysis.R
```
If that command fails locally, the same analysis is likely to fail on CHTC. Fix
that first. CHTC will not know about objects in your Global Environment, local
RStudio settings, manually clicked files, or packages that happen to be
installed on your laptop.
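
The same check can be run from an R session by launching a fresh `Rscript` process and inspecting its exit status. The sketch below uses only base R and assumes nothing about `submitr`; `runs_in_batch()` is a hypothetical helper name. The child process inherits the current working directory, so call it from the project root.

```{r}
# Hypothetical helper: run a script in a fresh R process and report success.
runs_in_batch <- function(script) {
  stopifnot(file.exists(script))
  rscript <- file.path(R.home("bin"), "Rscript")
  # system2() returns the exit status of the child process; 0 means success.
  status <- system2(rscript, shQuote(script), stdout = FALSE, stderr = FALSE)
  identical(as.integer(status), 0L)
}

# Usage, from the project root:
# runs_in_batch("analysis.R")
```

A `FALSE` result means the script depends on something that only exists in your interactive session.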
**Set up SSH connection reuse now, before anything else.** Every `submitr`
function that touches CHTC opens an SSH connection, which can trigger a Duo MFA
prompt. ControlMaster caches your authenticated session so all subsequent
calls -- uploads, submits, status checks, downloads -- reuse the same
connection without prompting again. The setup takes two minutes and is worth
doing before your first `htc_config()` call. Full instructions appear right
after Step 1.
For a first submission, choose something small and intentionally boring. The
goal is not to prove that your full analysis can scale yet. The goal is to
prove that the pathway works.
---
## A small example analysis
Suppose your project has this shape:
```text
my-analysis/
├── analysis.R
├── data.csv
├── renv.lock
└── results/
```
Your `analysis.R` script might look like this:
```{r}
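library(readr)
library(dplyr)

# Read the input file transferred with the job.
input <- read_csv("data.csv")

# One row per group: mean of `value` and number of observations.
group_summary <- input |>
  group_by(group) |>
  summarise(
    mean_value = mean(value, na.rm = TRUE),
    n = dplyr::n(),
    .groups = "drop"
  )

# Make sure the output folder exists before writing to it.
if (!dir.exists("results")) {
  dir.create("results")
}

write_csv(group_summary, "results/summary.csv")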
This script is deliberately modest. A first CHTC job should be easy to inspect.
Once the small version works, you can scale the pattern with more confidence.
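
To try the script without real data, a tiny stand-in `data.csv` is enough. The sketch below invents its values purely for testing; only the column names `group` and `value` come from the script above.

```{r}
# Illustrative stand-in data: the values are made up for testing only.
set.seed(1)
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  value = round(rnorm(12, mean = 10, sd = 2), 2)
)

# Write it where analysis.R expects to find its input.
write.csv(toy, "data.csv", row.names = FALSE)
```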
---
## Step 1: configure your CHTC connection
Load `submitr` and create a project-level configuration:
```{r}
library(submitr)
cfg <- htc_config()
```
On first use, `htc_config()` prompts for your NetID and submit node. It writes
an `htc.cfg` file to the project directory so later calls can reuse the same
connection information, and it displays ControlMaster setup instructions.
A later call should look something like this:
```{r}
cfg <- htc_config()
#> Reading HTC config from ./htc.cfg
#> ✔ Connected to "ap2002.chtc.wisc.edu" as "your.netid".
```
This configuration file is deliberately project-local. Different projects may
need different submit nodes, paths, or connection settings.
---
## Setting up SSH connection reuse
Before continuing, take two minutes to configure ControlMaster. Add this block
to `~/.ssh/config`:
```bash
Host *.chtc.wisc.edu
  ControlMaster auto
  ControlPersist 2h
  ControlPath ~/.ssh/connections/%r@%h:%p
Then create the directory used by `ControlPath`:
```bash
mkdir -p ~/.ssh/connections
```
With ControlMaster in place, all subsequent SSH connections reuse the same
authenticated session. You authenticate once when the connection is first
established; everything that follows -- file uploads, job submission, status
checks, result downloads -- happens without prompting for Duo MFA again. The
CHTC documentation covers this setup in more detail.
The rest of this vignette assumes ControlMaster is in place.
---
## Step 2: generate the submit file
The submit file is the main HTCondor instruction file. It answers the question:
what should the HTC system run, and what does it need?
```{r}
htc_gen_submit(
  output_file = "analysis.sub",
  container_image = "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0",
  executable = "analysis.sh",
  input_files = c("analysis.R", "data.csv"),
  output_files = "results.tar.gz",
  resources = "small",
  comments = TRUE,
  output = "."
)
```
For a first submission, keep `comments = TRUE`. The generated file includes
explanations of the main sections, making it easier to inspect, learn from,
and share with a collaborator or consultant.
The `resources` argument uses presets. For a first test, always start with
`"small"` regardless of what your eventual job will need:
| preset | cpus | memory | disk | when to use |
|--------|------|--------|-------|-------------|
| small | 1 | 4 GB | 4 GB | first test jobs, lightweight scripts |
| medium | 4 | 16 GB | 15 GB | moderate analyses, model fitting |
| large | 8 | 64 GB | 32 GB | memory-intensive work, large datasets |
The HTCondor log file reports actual resource usage after each run. That log
is the ground truth for tuning future submissions -- not guesswork. Requesting
too little can cause jobs to be held or to fail; requesting much more than you
need makes jobs harder to match with available resources.
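
For orientation, the sketch below writes a minimal submit file by hand from base R. It is not the output of `htc_gen_submit()`. The preset values mirror the table above, while the helper name `write_submit_sketch()` and the file name `job.out` are assumptions for illustration. The submit commands themselves (`container_image`, `transfer_input_files`, `request_cpus`, `queue`, and so on) are standard HTCondor syntax.

```{r}
# Resource presets copied from the table above.
presets <- list(
  small  = c(cpus = "1", memory = "4GB",  disk = "4GB"),
  medium = c(cpus = "4", memory = "16GB", disk = "15GB"),
  large  = c(cpus = "8", memory = "64GB", disk = "32GB")
)

# Hypothetical helper: assemble submit-file lines for a single job.
write_submit_sketch <- function(path, image, executable, inputs, outputs,
                                resources = "small") {
  res <- presets[[match.arg(resources, names(presets))]]
  lines <- c(
    paste("container_image =", image),
    paste("executable =", executable),
    paste("transfer_input_files =", paste(inputs, collapse = ", ")),
    paste("transfer_output_files =", paste(outputs, collapse = ", ")),
    "log = job.log",
    "output = job.out",
    "error = job.err",
    paste("request_cpus =", res[["cpus"]]),
    paste("request_memory =", res[["memory"]]),
    paste("request_disk =", res[["disk"]]),
    "queue"
  )
  writeLines(lines, path)
  invisible(lines)
}

# Write the sketch to a temporary location and print it for inspection.
sketch_path <- file.path(tempdir(), "analysis-sketch.sub")
write_submit_sketch(
  path = sketch_path,
  image = "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0",
  executable = "analysis.sh",
  inputs = c("analysis.R", "data.csv"),
  outputs = "results.tar.gz"
)
cat(readLines(sketch_path), sep = "\n")
```

Comparing this with the file that `htc_gen_submit()` actually writes is a quick way to learn which lines carry which decision.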
---
## Step 3: generate the executable script
The executable script answers a different question: once the job starts, what
commands should run?
```{r}
htc_gen_executable(
  r_script = "analysis.R",
  output_file = "analysis.sh",
  results_folder = "results",
  comments = TRUE
)
```
The generated script handles a standard sequence:
1. create the results folder;
2. run the R script with `Rscript`;
3. archive the results as `results.tar.gz`.
That sequence is not complicated, but it is exactly the kind of glue code that
can become a barrier for researchers who are new to shell scripts. `submitr`
writes the standard version so you can focus on the analysis.
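
The three steps above can be written as a short shell script. The sketch below generates one from base R. It is an assumption about the general shape of such a script, not the exact output of `htc_gen_executable()`, and `write_executable_sketch()` is a hypothetical name.

```{r}
# Hypothetical helper: write a minimal executable script for one R job.
write_executable_sketch <- function(path, r_script,
                                    results_folder = "results") {
  lines <- c(
    "#!/bin/bash",
    "set -e  # stop at the first failing command",
    paste("mkdir -p", results_folder),                    # 1. results folder
    paste("Rscript", r_script),                           # 2. run the analysis
    paste0("tar -czf ", results_folder, ".tar.gz ", results_folder)  # 3. archive
  )
  writeLines(lines, path)
  Sys.chmod(path, mode = "0755")  # mark the script as executable
  invisible(lines)
}

# Write the sketch to a temporary location and print it for inspection.
script_path <- file.path(tempdir(), "analysis-sketch.sh")
write_executable_sketch(script_path, r_script = "analysis.R")
cat(readLines(script_path), sep = "\n")
```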
---
## Step 4: preview and upload files
Before copying files to the submit node, do a dry run:
```{r}
htc_upload(
  files = c("analysis.sub", "analysis.sh", "analysis.R", "data.csv"),
  config = cfg,
  dry_run = TRUE
)
#> ✔ Dry run -- command that would be executed:
#> `scp analysis.sub analysis.sh analysis.R data.csv your.netid@ap2002.chtc.wisc.edu:~/`
```
A dry run is a safety habit. It lets you see the command before it changes
anything on the remote system. Once the command looks right, upload the files:
```{r}
htc_upload(
  files = c("analysis.sub", "analysis.sh", "analysis.R", "data.csv"),
  config = cfg
)
```
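
The dry-run habit is easy to reproduce in your own helpers. The sketch below shows the pattern in base R: build the command first, and only execute it when `dry_run` is `FALSE`. It illustrates the idea and is not `submitr`'s internal implementation; `upload_sketch()` and its arguments are hypothetical.

```{r}
# Hypothetical helper illustrating the dry-run pattern.
upload_sketch <- function(files, user, host, dry_run = TRUE) {
  dest <- paste0(user, "@", host, ":~/")
  cmd <- paste(c("scp", files, dest), collapse = " ")
  if (dry_run) {
    # Report the command without touching the remote system.
    message("Dry run -- command that would be executed: ", cmd)
    return(invisible(cmd))
  }
  # Only this branch copies anything to the submit node.
  status <- system2("scp", c(shQuote(files), dest))
  invisible(status)
}

cmd <- upload_sketch(
  files = c("analysis.sub", "analysis.sh", "analysis.R", "data.csv"),
  user = "your.netid",
  host = "ap2002.chtc.wisc.edu"
)
```

Defaulting `dry_run` to `TRUE` in your own helpers is a design choice worth considering: the safe behavior then requires no extra typing.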
---
## Step 5: submit the job
```{r}
cluster_id <- htc_submit(
  submit_file = "analysis.sub",
  config = cfg,
  verbose = TRUE
)
#> Submitting "analysis.sub" on "ap2002.chtc.wisc.edu"...
#> Submitting job(s)...
#> 1 job(s) submitted to cluster 6302860.
#> ✔ Job submitted from "~/analysis.sub" on "ap2002.chtc.wisc.edu".
```
The cluster ID is the handle for this submission. Store it in an object so you
can check the job later without having to look it up.
---
## Step 6: check progress
```{r}
# One-shot status check
htc_status(cluster_id = cluster_id, config = cfg)
# Watch until the job completes
htc_status(cluster_id = cluster_id, config = cfg, watch = TRUE)
```
For a small test job, `watch = TRUE` is useful. For larger workloads,
occasional one-shot checks are usually a better fit than keeping an R session
occupied.
---
## Step 7: download results
When the job is complete, download the result archive and logs:
```{r}
# Results
htc_download(
  files = "*.tar.gz",
  config = cfg,
  local_path = "results/"
)
# Logs
htc_download(
  files = c("job.log", "job.err"),
  config = cfg,
  local_path = "logs/"
)
```
The logs are not just for failures. They record what happened when the job ran
-- including actual resource usage, which informs future resource requests.
---
## From one test job to many HTC jobs
A first job proves that the path works. The next step is to think like an HTC
user: how can the analysis be divided into many independent pieces?
Common patterns include one job per simulation replicate, model specification,
input file, county, participant, sample, parameter set, or bootstrap iteration.
This is where `toolero::write_by_group()` helps upstream. It splits a data
frame into separate CSV files and writes a manifest describing those files.
Then `submitr` queues one job per row of the manifest:
```{r}
htc_gen_submit(
  output_file = "analysis.sub",
  container_image = "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0",
  executable = "analysis.sh",
  input_files = "analysis.R",
  mode = "multiple",
  queue_from = "data/manifest.csv",
  resources = "medium",
  comments = TRUE
)
htc_gen_executable(
  r_script = "analysis.R",
  output_file = "analysis.sh",
  results_folder = "results",
  mode = "multiple",
  comments = TRUE
)
```
In multiple-job mode, the generated executable passes the per-job input file
to your R script as the first command-line argument. Your script should read
that argument explicitly:
```{r}
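args <- commandArgs(trailingOnly = TRUE)
# Fail early with a clear message if the job did not pass an input file.
if (length(args) < 1) stop("Expected the input file path as the first argument.")
input_file <- args[[1]]
data <- readr::read_csv(input_file)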
```
This is a key pattern. The script stays the same; each job receives a different
input.
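
To make the splitting step concrete, the sketch below writes one CSV per group plus a manifest using base R only. It is not `toolero::write_by_group()` and makes no claim about that function's arguments or output; the helper name `split_to_manifest()`, the file naming, and the manifest columns are all assumptions for illustration. It also assumes group values are safe to use as file names.

```{r}
# Hypothetical helper: write one CSV per group plus a manifest of the files.
split_to_manifest <- function(data, group_col, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  pieces <- split(data, data[[group_col]], drop = TRUE)
  files <- file.path(out_dir, paste0(names(pieces), ".csv"))
  for (i in seq_along(pieces)) {
    write.csv(pieces[[i]], files[i], row.names = FALSE)
  }
  # One manifest row per job-sized file.
  manifest <- data.frame(
    group = names(pieces),
    file = basename(files),
    n_rows = unname(vapply(pieces, nrow, integer(1)))
  )
  write.csv(manifest, file.path(out_dir, "manifest.csv"), row.names = FALSE)
  manifest
}

# Example with a made-up data frame.
toy <- data.frame(group = c("a", "a", "b"), value = c(1.5, 2.5, 4))
split_to_manifest(toy, "group", file.path(tempdir(), "split-sketch"))
```

Each manifest row then corresponds to one job, and the per-job file name is what arrives in your script as `args[[1]]`.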
---
## Where containr fits
CHTC needs to know what software environment your job should use. Your laptop
may have the right R packages installed, but the execute node will not
automatically have the same setup. A container image solves that problem by
packaging the R version, packages, and system libraries needed to run the
analysis.
`containr` handles that step:
```{r}
containr::generate_dockerfile(r_version = "4.4.0", output = ".")
containr::build_image(verbose = TRUE)
imgs <- containr::list_images()
containr::push_image(
  image_id = imgs$image_id[1],
  netid = "your.netid",
  project = "my-analysis",
  tag = "1.0.0"
)
```
After the image is pushed to a registry CHTC can access, `submitr` can refer to
it in `container_image`.
Use explicit image tags such as `"1.0.0"` rather than `"latest"`. A versioned
tag makes it unambiguous which software environment was used for a particular
analysis.
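
One way to enforce that habit is to build the image reference through a small function that refuses a floating tag. The sketch below is a hypothetical base R helper, not part of `containr` or `submitr`; the reference format follows the `container_image` examples used earlier in this vignette.

```{r}
# Hypothetical helper: build an image reference and refuse floating tags.
image_ref <- function(netid, project, tag,
                      registry = "registry.doit.wisc.edu") {
  if (!nzchar(tag) || identical(tag, "latest")) {
    stop("Use an explicit version tag such as \"1.0.0\", not \"latest\".")
  }
  paste0("docker://", registry, "/", netid, "/", project, ":", tag)
}

image_ref("your.netid", "my-analysis", "1.0.0")
#> [1] "docker://registry.doit.wisc.edu/your.netid/my-analysis:1.0.0"
```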
---
## A practical first-submission checklist
Before scaling up, confirm that the small job works end to end:
- The script runs locally with `Rscript analysis.R`.
- ControlMaster is configured and the session is authenticated.
- The container image is pushed to a registry CHTC can access.
- The image tag is explicit, not `"latest"`.
- The submit file lists the correct executable and input files.
- The executable script points to the correct R script and results folder.
- The dry-run upload shows the expected files.
- `htc_config()` connects to the submit node without error.
- The resource request is reasonable for a test job.
- The job produces logs and a result archive.
Once this works, you have something valuable: a known-good pathway from local
R project to CHTC.
---
## What submitr does not do
`submitr` reduces friction, but it does not remove the need to make sound
research-computing decisions.
It does not:
- decide whether your workload is a good fit for CHTC;
- make interactive R code safe for batch execution;
- guarantee that your container image contains every system dependency;
- manage restricted or sensitive data;
- replace CHTC documentation or consultation for complex workflows.
That boundary is intentional. Good tools should make the common path easier
while still leaving the important decisions visible.
The [CHTC facilitation team](https://chtc.cs.wisc.edu/uw-research-computing/get-help)
is the right resource for complex workflow questions.
---
## A good first goal
Do not make your first submission your largest analysis.
Make your first goal smaller: send one boring job to CHTC, watch it run, and
download one result file.
After that, the cluster becomes less mysterious. You can inspect the generated
files, adjust resources, split work into many jobs, and grow the workflow with
more confidence. That is the role of `submitr`: to help you take the first
successful step from local R code to high-throughput research computing.