.. _snakefiles-rules:

====================
Snakefiles and Rules
====================

A Snakemake workflow defines a data analysis in terms of rules that are specified in the Snakefile.
Most commonly, rules consist of a name, input files, output files, and a shell command to generate the output from the input:

.. code-block:: python

    rule NAME:
        input: "path/to/inputfile", "path/to/other/inputfile"
        output: "path/to/outputfile", "path/to/another/outputfile"
        shell: "somecommand {input} {output}"

The name is optional and can be left out, creating an anonymous rule. It can also be overridden by setting a rule's ``name`` attribute.

.. sidebar:: Note

    Note that any placeholders in the shell command (like ``{input}``) are always evaluated and replaced
    when the corresponding job is executed, even if they are occuring inside a comment.
    To avoid evaluation and replacement, you have to mask the braces by doubling them,
    i.e. ``{{input}}``.

Inside the shell command, all local and global variables, especially input and output files can be accessed via their names in the `python format minilanguage <https://docs.python.org/py3k/library/string.html#formatspec>`_.
Here, input and output (and in general any list or tuple) automatically evaluate to a space-separated list of files (i.e. ``path/to/inputfile path/to/other/inputfile``).
From Snakemake 3.8.0 on, adding the special formatting instruction ``:q`` (e.g. ``"somecommand {input:q} {output:q}")``) will let Snakemake quote each of the list or tuple elements that contains whitespace.

By default shell commands will be invoked with ``bash`` shell (unless the workflow specifies a different default shell via ``shell.executable(...)``).

Instead of a shell command, a rule can run some python code to generate the output:

.. code-block:: python

    rule NAME:
        input: "path/to/inputfile", "path/to/other/inputfile"
        output: "path/to/outputfile", somename = "path/to/another/outputfile"
        run:
            for f in input:
                ...
                with open(output[0], "w") as out:
                    out.write(...)
            with open(output.somename, "w") as out:
                out.write(...)

As can be seen, instead of accessing input and output as a whole, we can also access by index (``output[0]``) or by keyword (``output.somename``).
Note that, when adding keywords or names for input or output files, their order won't be preserved when accessing them as a whole via e.g. ``{output}`` in a shell command.

Shell commands like above can also be invoked inside a python based rule, via the function ``shell`` that takes a string with the command and allows the same formatting like in the rule above, e.g.:

.. code-block:: python

    shell("somecommand {output.somename}")

Further, this combination of python and shell commands allows us to iterate over the output of the shell command, e.g.:

.. code-block:: python

    for line in shell("somecommand {output.somename}", iterable=True):
        ... # do something in python

Note that shell commands in Snakemake use the bash shell in `strict mode <http://redsymbol.net/articles/unofficial-bash-strict-mode/>`_ by default.

.. _snakefiles-wildcards:

Wildcards
---------

Usually, it is useful to generalize a rule to be applicable to a number of e.g. datasets. For this purpose, wildcards can be used.
Automatically resolved multiple named wildcards are a key feature and strength of Snakemake in comparison to other systems.
Consider the following example.

.. code-block:: python

    rule complex_conversion:
        input:
            "{dataset}/inputfile"
        output:
            "{dataset}/file.{group}.txt"
        shell:
            "somecommand --group {wildcards.group} < {input} > {output}"

Here, we define two wildcards, ``dataset`` and ``group``. By this, the rule can produce all files that follow the regular expression pattern ``.+/file\..+\.txt``, i.e. the wildcards are replaced by the regular expression ``.+``. If the rule's output matches a requested file, the substrings matched by the wildcards are propagated to the input files and to the variable wildcards, that is here also used in the shell command. The wildcards object can be accessed in the same way as input and output, which is described above.

For example, if another rule in the workflow requires the file ``101/file.A.txt``, Snakemake recognizes that this rule is able to produce it by setting ``dataset=101`` and ``group=A``.
Thus, it requests file ``101/inputfile`` as input and executes the command ``somecommand --group A  < 101/inputfile  > 101/file.A.txt``.
Of course, the input file might have to be generated by another rule with different wildcards.

Importantly, the wildcard names in input and output must be named identically. Most typically, the same wildcard is present in both input and output, but it is of course also possible to have wildcards only in the output but not the input section.


Multiple wildcards in one filename can cause ambiguity.
Consider the pattern ``{dataset}.{group}.txt`` and assume that a file ``101.B.normal.txt`` is available.
It is not clear whether ``dataset=101.B`` and ``group=normal`` or ``dataset=101`` and ``group=B.normal`` in this case.

Hence wildcards can be constrained to given regular expressions.
Here we could restrict the wildcard ``dataset`` to consist of digits only using ``\d+`` as the corresponding regular expression.
With Snakemake 3.8.0, there are three ways to constrain wildcards.
First, a wildcard can be constrained within the file pattern, by appending a regular expression separated by a comma:

.. code-block:: python

    output: "{dataset,\d+}.{group}.txt"

Second, a wildcard can be constrained within the rule via the keyword ``wildcard_constraints``:

.. code-block:: python

    rule complex_conversion:
        input:
            "{dataset}/inputfile"
        output:
            "{dataset}/file.{group}.txt"
        wildcard_constraints:
            dataset="\d+"
        shell:
            "somecommand --group {wildcards.group}  < {input}  > {output}"

Finally, you can also define global wildcard constraints that apply for all rules:

.. code-block:: python

    wildcard_constraints:
        dataset="\d+"

    rule a:
        ...

    rule b:
        ...

See the `Python documentation on regular expressions <https://docs.python.org/py3k/library/re.html>`_ for detailed information on regular expression syntax.

.. _snakefiles_aggregation:

Aggregation
-----------

Input files can be Python lists, allowing to easily aggregate over parameters or samples:

.. code-block:: python

    rule aggregate:
        input:
            ["{dataset}/a.txt".format(dataset=dataset) for dataset in DATASETS]
        output:
            "aggregated.txt"
        shell:
            ...

The above expression can be simplified in two ways.

.. _snakefiles_expand:

The expand function
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    rule aggregate:
        input:
            expand("{dataset}/a.txt", dataset=DATASETS)
        output:
            "aggregated.txt"
        shell:
            ...


Note that *dataset* is NOT a wildcard here because it is resolved by Snakemake due to the ``expand`` statement.
The ``expand`` function also allows us to combine different variables, e.g.

.. code-block:: python

    rule aggregate:
        input:
            expand("{dataset}/a.{ext}", dataset=DATASETS, ext=FORMATS)
        output:
            "aggregated.txt"
        shell:
            ...

If ``FORMATS=["txt", "csv"]`` contains a list of desired output formats then expand will automatically combine any dataset with any of these extensions.

Furthermore, the first argument can also be a list of strings. In that case, the transformation is applied to all elements of the list. E.g.

.. code-block:: python

    expand(["{dataset}/a.{ext}", "{dataset}/b.{ext}"], dataset=DATASETS, ext=FORMATS)

leads to

.. code-block:: python

    ["ds1/a.txt", "ds1/b.txt", "ds2/a.txt", "ds2/b.txt", "ds1/a.csv", "ds1/b.csv", "ds2/a.csv", "ds2/b.csv"]

Per default, ``expand`` uses the python itertools function ``product`` that yields all combinations of the provided wildcard values. However by inserting a second positional argument this can be replaced by any combinatoric function, e.g. ``zip``:

.. code-block:: python

    expand(["{dataset}/a.{ext}", "{dataset}/b.{ext}"], zip, dataset=DATASETS, ext=FORMATS)

leads to

.. code-block:: python

    ["ds1/a.txt", "ds1/b.txt", "ds2/a.csv", "ds2/b.csv"]

You can also mask a wildcard expression in ``expand`` such that it will be kept, e.g.

.. code-block:: python

    expand("{{dataset}}/a.{ext}", ext=FORMATS)

will create strings with all values for ext but starting with the wildcard ``"{dataset}"``.


.. _snakefiles-multiext:

The multiext function
~~~~~~~~~~~~~~~~~~~~~

``multiext`` provides a simplified variant of ``expand`` that allows us to define a set of output or input files that just differ by their extension:


.. code-block:: python

    rule plot:
        input:
            ...
        output:
            multiext("some/plot", ".pdf", ".svg", ".png")
        shell:
            ...

The effect is the same as if you would write ``expand("some/plot{ext}", ext=[".pdf", ".svg", ".png"])``, however, using a simpler syntax.
Moreover, defining output with ``multiext`` is the only way to use :ref:`between workflow caching <caching>` for rules with multiple output files.


.. _snakefiles-targets:

Targets and aggregation
-----------------------

By default snakemake executes the first rule in the snakefile. This gives rise to pseudo-rules at the beginning of the file that can be used to define build-targets similar to GNU Make:

.. code-block:: python

    rule all:
      input:
        expand("{dataset}/file.A.txt", dataset=DATASETS)


Here, for each dataset in a python list ``DATASETS`` defined before, the file ``{dataset}/file.A.txt`` is requested.
In this example, Snakemake recognizes automatically that these can be created by multiple applications of the rule ``complex_conversion`` shown above.

It is possible to overwrite this behavior to use the first rule as a default target, by explicitly marking a rule as being the default target via the ``default_target`` directive:

.. code-block:: python

    rule xy:
        input:
            expand("{dataset}/file.A.txt", dataset=DATASETS)
        default_target: True

Regardless of where this rule appears in the Snakefile, it will be the default target.
Usually, it is still recommended to keep the default target rule (and in fact all other rules that could act as optional targets) at the top of the file, such that it can be easily found.
The ``default_target`` directive becomes particularly useful when :ref:`combining several pre-existing workflows <use_with_modules>`.

.. _snakefiles-threads:

Threads
-------

Further, a rule can be given a number of threads to use, i.e.

.. code-block:: python

    rule NAME:
        input: "path/to/inputfile", "path/to/other/inputfile"
        output: "path/to/outputfile", "path/to/another/outputfile"
        threads: 8
        shell: "somecommand --threads {threads} {input} {output}"

.. sidebar:: Note

    On a cluster node, Snakemake uses as many cores as available on that node.
    Hence, the number of threads used by a rule never exceeds the number of physically available cores on the node.
    Note: This behavior is not affected by ``--local-cores``, which only applies to jobs running on the main node.

Snakemake can alter the number of cores available based on command line options. Therefore it is useful to propagate it via the built in variable ``threads`` rather than hardcoding it into the shell command.
In particular, it should be noted that the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. ``threads = min(threads, cores)`` with ``cores`` being the number of cores specified at the command line (option ``--cores``).

Hardcoding a particular maximum number of threads like above is useful when a certain tool has a natural maximum beyond which parallelization won't help to further speed it up.
This is often the case, and should be evaluated carefully for production workflows.
Also, setting a ``threads:`` maximum is required to achieve parallelism in tools that (often implicitly and without the user knowing) rely on an environment variable for the maximum of cores to use.
For example, this is the case for many linear algebra libraries and for OpenMP.
Snakemake limits the respective environment variables to one core by default, to avoid unexpected and unlimited core-grabbing, but will override this with the ``threads:`` you specify in a rule (the parameters set to ``threads:``, or defaulting to ``1``, are: ``OMP_NUM_THREADS``, ``GOTO_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, ``MKL_NUM_THREADS``, ``VECLIB_MAXIMUM_THREADS``, ``NUMEXPR_NUM_THREADS``).

If it is certain that no maximum for efficient parallelism exists for a tool, one can instead define threads as a function of the number of cores given to Snakemake:

.. code-block:: python

    rule NAME:
        input: "path/to/inputfile", "path/to/other/inputfile"
        output: "path/to/outputfile", "path/to/another/outputfile"
        threads: workflow.cores * 0.75
        shell: "somecommand --threads {threads} {input} {output}"

The number of given cores is globally available in the Snakefile as an attribute of the workflow object: ``workflow.cores``.
Any arithmetic operation can be performed to derive a number of threads from this. E.g., in the above example, we reserve 75% of the given cores for the rule.
Snakemake will always round the calculated value down (while enforcing a minimum of 1 thread).

Starting from version 3.7, threads can also be a callable that returns an ``int`` value. The signature of the callable should be ``callable(wildcards[, input])`` (input is an optional parameter).  It is also possible to refer to a predefined variable (e.g, ``threads: threads_max``) so that the number of cores for a set of rules can be changed with one change only by altering the value of the variable ``threads_max``.

Both threads can be defined (or overwritten) upon invocation (without modifying the workflow code) via `--set-threads` see :ref:`user_manual-snakemake_options` and via workflow profiles, see :ref:`profiles`.
To quickly exemplify the latter, you could provide the following workflow profile in a file ``profiles/default/config.yaml`` relative to the Snakefile or the current working directory:

.. code-block:: yaml

    set-threads:
        b: 16

to set the (maximum) number of threads rule ``b`` uses to 16.


.. _snakefiles-resources:

Resources
---------

In addition to threads, a rule can use arbitrary user-defined resources by specifying them with the resources-keyword:

.. code-block:: python

    rule a:
        input:     ...
        output:    ...
        resources:
            mem_mb=100
        shell:
            "..."

If limits for the resources are given via the command line, e.g.

.. code-block:: console

    $ snakemake --resources mem_mb=100


the scheduler will ensure that the given resources are not exceeded by running jobs.
Resources are always meant to be specified as total per job, not by thread (i.e. above ``mem_mb=100`` in rule ``a`` means that any job from rule ``a`` will require ``100`` megabytes of memory in total, and not per thread).

**Importantly**, there are some :ref:`standard resources <snakefiles-standard-resources>` that should be considered before making up your own.

In general, resources are just names to the Snakemake scheduler, i.e., Snakemake does not check on the resource consumption of jobs in real time.
Instead, resources are used to determine which jobs can be executed at the same time without exceeding the limits specified at the command line.
Apart from making Snakemake aware of hybrid-computing architectures (e.g. with a limited number of additional devices like GPUs) this allows us to control scheduling in various ways, e.g. to limit IO-heavy jobs by assigning an artificial IO-resource to them and limiting it via the ``--resources`` flag.
If no limits are given, the resources are ignored in local execution.

Resources can have any arbitrary name, and must be assigned ``int`` or ``str`` values.
In case of ``None``, the resource is considered to be unset (i.e. ignored) in the rule.

Resources can also be callables (e.g. functions or lambda expressions) that return ``int``, ``str`` or ``None`` values.
The signature of the callable must be ``callable(wildcards [, input] [, threads] [, attempt])`` (``input``, ``threads``, and ``attempt`` are optional parameters).
Such callables are evaluated immediately before the job is executed (or printed during a dry-run).

Since the callables can take e.g. ``input`` as an argument, they can for example be used to obtain the size of an input file and infer the amount of memory needed for the job.
In order to make this work with a dry-run, where the input files are not yet present, Snakemake automatically converts a ``FileNotFoundError`` that is raised by the callable into a placeholder called ``<TBD>`` that will be displayed during dry-run in such a case.

The parameter ``attempt`` allows us to adjust resources based on how often the job has been restarted (see :ref:`all_options`, option ``--retries``).
This is handy when executing a Snakemake workflow in a cluster environment, where jobs can e.g. fail because of too limited resources.
When Snakemake is executed with ``--retries 3``, it will try to restart a failed job 3 times before it gives up.
Thereby, the parameter ``attempt`` will contain the current attempt number (starting from ``1``).
This can be used to adjust the required memory as follows

.. code-block:: python

    def get_mem_mb(wildcards, attempt):
        return attempt * 100

    rule:
        input:    ...
        output:   ...
        resources:
            mem_mb=get_mem_mb
        shell:
            "..."

Here, the first attempt will require 100 MB memory, the second attempt will require 200 MB memory and so on.
When passing memory requirements to the cluster engine, you can by this automatically try out larger nodes if it turns out to be necessary.

Another application of callables as resources is when memory usage depends on the number of threads:

.. code-block:: python

    def get_mem_mb(wildcards, threads):
        return threads * 150

    rule b:
        input:     ...
        output:    ...
        threads: 8
        resources:
            mem_mb=get_mem_mb
        shell:
            "..."

Here, the value the function ``get_mem_mb`` returns grows linearly with the number of threads.
Of course, any other arithmetic could be performed in that function.

Both threads and resources can be defined (or overwritten) upon invocation (without modifying the workflow code) via `--set-threads` and `--set-resources`, see :ref:`user_manual-snakemake_options` and via workflow profiles, see :ref:`profiles`.
To quickly exemplify the latter, you could provide the following workflow profile in a file ``profiles/default/config.yaml`` relative to the Snakefile or the current working directory:

.. code-block:: yaml

    set-resources:
        b:
            mem_mb: 1000

to set the memory requirement of rule ``b`` to 1000 MB.

.. _snakefiles-standard-resources:

Standard Resources
~~~~~~~~~~~~~~~~~~

There are several **standard resources**, for total memory, disk usage, runtime, and the temporary directory of a job: ``mem``, ``disk``, ``runtime``, and ``tmpdir``.
All of these resources have specific meanings understood by snakemake and are treated in varying unique ways:

* The ``tmpdir`` resource automatically leads to setting the ``$TMPDIR`` variable for shell commands, scripts, wrappers and notebooks. In cluster or cloud setups, its evaluation is delayed until the actual execution of the job. This way, it can dynamically react on the context of the node of execution.

* The ``runtime`` resource indicates the amount of wall clock time a job needs to run.
  It can be given as string defining a time span or as integer defining **minutes**.
  In the former case, the time span can be defined as a string with a number followed by a unit
  (``ms``, ``s``, ``m``, ``h``, ``d``, ``w``, ``y`` for seconds, minutes, hours, days, and years, respectively).
  The interpretation happens via the `humanfriendly package <https://humanfriendly.readthedocs.io/en/latest/api.html?highlight=parse_timespan#humanfriendly.parse_timespan>`_.
  Cluster or cloud backends may use this to constrain the allowed execution time of the submitted job.
  See :ref:`the section below <resources_remote_execution>` for more information.

* ``disk`` and ``mem`` define the amount of memory and disk space needed by the job.
  They are given as strings with a number followed by a unit (``B``, ``KB``, ``MB``, ``GB``, ``TB``, ``PB``, ``KiB``, ``MiB``, ``GiB``, ``TiB``, ``PiB``).
  The interpretation of the definition happens via the `humanfriendly package <https://humanfriendly.readthedocs.io/en/latest/api.html?highlight=parse_timespan#humanfriendly.parse_size>`_.
  Alternatively, the two can be directly defined as integers via the resources ``mem_mb`` and ``disk_mb`` (to which ``disk`` and ``mem`` are also automatically translated internally).
  They are both locally scoped by default, a fact important for cluster and compute execution.
  :ref:`See below <resources_remote_execution>` for more info.
  They are usually passed to execution backends, e.g. to allow the selection of appropriate compute nodes for the job execution.

Because of these special meanings, the above names should always be used instead of possible synonyms (e.g. ``tmp``, ``time``, ``temp``, etc).

.. _default-resources:

Default Resources
~~~~~~~~~~~~~~~~~~

Since it could be cumbersome to define these standard resources for every rule, you can set default values at
the terminal or in a :ref:`profile <profiles>`.
This works via the command line flag ``--default-resources``, see ``snakemake --help`` for more information.
If those resource definitions are mandatory for a certain execution mode, Snakemake will fail with a hint if they are missing.
Any resource definitions inside a rule override what has been defined with ``--default-resources``.
If ``--default-resources`` are not specified, Snakemake uses ``'mem_mb=max(2*input.size_mb, 1000)'``,
``'disk_mb=max(2*input.size_mb, 1000)'``, and ``'tmpdir=system_tmpdir'``.
The latter points to whatever is the default of the operating system or specified by any of the environment variables ``$TMPDIR``, ``$TEMP``, or ``$TMP`` as outlined `here <https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir>`_.
If ``--default-resources`` is specified with some definitions, but any of the above defaults (e.g. ``mem_mb``) is omitted, these are still used.
In order to explicitly unset these defaults, assign them a value of ``None``, e.g. ``--default-resources mem_mb=None``.

.. _resources-remote-execution:

Resources and Remote Execution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

New to Snakemake 7.11. In cluster or cloud execution, resources may represent either a global constraint across all submissions (e.g. number of API calls per second), or a constraint local to each specific job sumbmission (e.g. the amount of memory available on a node).
Snakemake distinguishes between these two types of constraints using **resource scopes**.
By default, ``mem_mb``, ``disk_mb``, and ``threads`` are all considered ``"local"`` resources, meaning specific to individual submissions.
So if a constraint of 16G of memory is given to snakemake (e.g. ``snakemake --resources mem_mb=16000``), each group job will be allowed 16G of memory.
All other resources are considered ``"global"``, meaning they are tracked across all jobs across all submissions.
For example, if ``api_calls`` was limited to 5 and each job scheduled used 1 api call, only 5 jobs would be scheduled at a time, even if more job submissions were available.

These resource scopes may be modified both in the Snakefile and via the CLI parameter ``--set-resource-scopes``.
The CLI parameter takes priority.
Modification in the Snakefile uses the following syntax:

.. code-block:: python

    resource_scopes:
        gpus="local",
        foo="local",
        disk_mb="global"

Here, we set both ``gpus`` and ``foo`` as local resources, and we changed ``disk_mb`` from its default to be a ``global`` resource.
These options could be overriden at the command line using:

.. code-block:: console

    $ snakemake --set-resource-scopes gpus=global disk_mb=local

Resources and Group Jobs
~~~~~~~~~~~~~~~~~~~~~~~~

New to Snakemake 7.11.
When submitting :ref:`group jobs <job_grouping>` to the cluster, Snakemake calculates how many resources to request by first determining which component jobs can be run in parallel, and which must be run in series.
For most resources, such as ``mem_mb`` or ``threads``, a sum will be taken across each parallel layer.
The layer requiring the most resource (i.e. ``max()``) will determine the final amount requested.
The only exception is ``runtime``.
For it, ``max()`` will be used within each layer, then the total amount of time across all layers will be summed.
If resource constraints are provided (via ``--resources`` or ``--cores``) Snakemake will prevent group jobs from requesting more than the constraint.
Jobs that could otherwise be run in parallel will be run in series to prevent the violation of resource constraints.



Preemptible Jobs
~~~~~~~~~~~~~~~~


You can specify parameters ``preemptible-rules`` and ``preemption-default`` to request a `Google Cloud preemptible virtual machine <https://cloud.google.com/life-sciences/docs/reference/gcloud-examples#using_preemptible_vms>`_ for use with the `Google Life Sciences Executor <https://snakemake.readthedocs.io/en/stable/executing/cloud.html#executing-a-snakemake-workflow-via-google-cloud-life-sciences>`_. There are
several ways to go about doing this. This first example will use preemptible instances for all rules, with 10 repeats (restarts
of the instance if it stops unexpectedly).

.. code-block:: console

    snakemake --preemption-default 10


If your preference is to set a default but then overwrite some rules with a custom value, this is where you can use ``--preemtible-rules``:

.. code-block:: console

    snakemake --preemption-default 10 --preemptible-rules map_reads=3 call_variants=0


The above statement says that we want to use preemtible instances for all steps, defaulting to 10 retries,
but for the steps "map_reads" and "call_variants" we want to apply 3 and 0 retries, respectively. The final
option is to not use preemptible instances by default, but only for a particular rule:


.. code-block:: console

    snakemake --preemptible-rules map_reads=10


Note that this is currently implemented for the Google Life Sciences API.


GPU Resources
~~~~~~~~~~~~~

The Google Life Sciences API currently has support for
`NVIDIA GPUs <https://cloud.google.com/compute/docs/gpus#restrictions>`_, meaning that you can request a number of NVIDIA GPUs explicitly by adding ``nvidia_gpu`` or ``gpu`` to your Snakefile resources for a step:


.. code-block:: python

    rule a:
        output:
            "test.txt"
        resources:
            nvidia_gpu=1
        shell:
            "somecommand ..."


A specific `gpu model <https://cloud.google.com/compute/docs/gpus#introduction>`_ can be requested using ``gpu_model`` and lowercase identifiers like ``nvidia-tesla-p100`` or ``nvidia-tesla-p4``, for example: ``gpu_model="nvidia-tesla-p100"``. If you don't specify ``gpu`` or ``nvidia_gpu`` with a count, but you do specify a ``gpu_model``, the count will default to 1.



Messages
--------

When executing snakemake, a short summary for each running rule is given to the console. This can be overridden by specifying a message for a rule:


.. code-block:: python

    rule NAME:
        input: "path/to/inputfile", "path/to/other/inputfile"
        output: "path/to/outputfile", "path/to/another/outputfile"
        threads: 8
        message: "Executing somecommand with {threads} threads on the following files {input}."
        shell: "somecommand --threads {threads} {input} {output}"

Note that access to wildcards is also possible via the variable ``wildcards`` (e.g, ``{wildcards.sample}``), which is the same as with shell commands. It is important to have a namespace around wildcards in order to avoid clashes with other variable names.

Priorities
----------

Snakemake allows for rules that specify numeric priorities:


.. code-block:: python

    rule:
      input: ...
      output: ...
      priority: 50
      shell: ...

Per default, each rule has a priority of 0. Any rule that specifies a higher priority, will be preferred by the scheduler over all rules that are ready to execute at the same time without having at least the same priority.

Furthermore, the ``--prioritize`` or ``-P`` command line flag allows to specify files (or rules) that shall be created with highest priority during the workflow execution. This means that the scheduler will assign the specified target and all its dependencies highest priority, such that the target is finished as soon as possible.
The ``--dry-run`` (equivalently ``--dryrun``) or ``-n`` option allows you to see the scheduling plan including the assigned priorities.



Log-Files
---------

Each rule can specify a log file where information about the execution is written to:

.. code-block:: python

    rule abc:
        input: "input.txt"
        output: "output.txt"
        log: "logs/abc.log"
        shell: "somecommand --log {log} {input} {output}"

Log files can be used as input for other rules, just like any other output file.
However, unlike output files, log files are not deleted upon error.
This is obviously necessary in order to discover causes of errors which might become visible in the log file.

The variable ``log`` can be used inside a shell command to tell the used tool to which file to write the logging information.
The log file has to use the same wildcards as output files, e.g.

.. code-block:: python

    log: "logs/abc.{dataset}.log"


For programs that do not have an explicit ``log`` parameter, you may always use ``2> {log}`` to redirect stderr to a file (here, the ``log`` file) in Linux-based systems.
Note that it is also possible to have multiple named log files, which could be used to capture stdout and stderr:

.. code-block:: python

    rule abc:
        input: "input.txt"
        output: "output.txt"
        log: stdout="logs/foo.stdout", stderr="logs/foo.stderr"
        shell: "somecommand {input} {output} > {log.stdout} 2> {log.stderr}"

Non-file parameters for rules
-----------------------------

Sometimes you may want to define certain parameters separately from the rule body. Snakemake provides the ``params`` keyword for this purpose:


.. code-block:: python

    rule:
        input:
            ...
        params:
            prefix="somedir/{sample}"
        output:
            "somedir/{sample}.csv"
        shell:
            "somecommand -o {params.prefix}"

The ``params`` keyword allows you to specify additional parameters depending on the wildcards values. This allows you to circumvent the need to use ``run:`` and python code for non-standard commands like in the above case.
Here, the command ``somecommand`` expects the prefix of the output file instead of the actual one. The ``params`` keyword helps here since you cannot simply add the prefix as an output file (as the file won't be created, Snakemake would throw an error after execution of the rule).

Furthermore, for enhanced readability and clarity, the ``params`` section is also an excellent place to name and assign parameters and variables for your subsequent command.

Similar to ``input``, ``params`` can take functions as well (see :ref:`snakefiles-input_functions`), e.g. you can write

.. code-block:: python

    rule:
        input:
            ...
        params:
            prefix=lambda wildcards, output: output[0][:-4]
        output:
            "somedir/{sample}.csv"
        shell:
            "somecommand -o {params.prefix}"

.. sidebar:: Note

    When accessing auxiliary source files (i.e. files that are located relative to the current Snakefile, e.g. some additional configuration)
    it is crucial to not manually build their path but rather rely on Snakemake's special registration for these files, see :ref:`snakefiles-aux_source_files`.

to get the same effect as above. Note that in contrast to the ``input`` directive, the
``params`` directive can optionally take more arguments than only ``wildcards``, namely ``input``, ``output``, ``threads``, and ``resources``.
From the Python perspective, they can be seen as optional keyword arguments without a default value.
Their order does not matter, apart from the fact that ``wildcards`` has to be the first argument.
In the example above, this allows you to derive the prefix name from the output file.

.. _snakefiles-external_scripts:

External scripts
----------------

A rule can also point to an external script instead of a shell command or inline Python code, e.g.

Python
~~~~~~

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile",
            "path/to/other/inputfile"
        output:
            "path/to/outputfile",
            "path/to/another/outputfile"
        script:
            "scripts/script.py"

.. sidebar:: Note

    It is possible to refer to wildcards and params in the script path, e.g. by specifying ``"scripts/{params.scriptname}.py"`` or ``"scripts/{wildcards.scriptname}.py"``.

The script path is always relative to the Snakefile containing the directive (in contrast to the input and output file paths, which are relative to the working directory).
It is recommended to put all scripts into a subfolder ``scripts`` as above.
Inside the script, you have access to an object ``snakemake`` that provides access to the same objects that are available in the ``run`` and ``shell`` directives (input, output, params, wildcards, log, threads, resources, config), e.g. you can use ``snakemake.input[0]`` to access the first input file of above rule.

An example external Python script could look like this:

.. code-block:: python

    def do_something(data_path, out_path, threads, myparam):
        # python code

    do_something(snakemake.input[0], snakemake.output[0], snakemake.threads, snakemake.config["myparam"])

You can use the Python debugger from within the script if you invoke Snakemake with ``--debug``.

R and R Markdown
~~~~~~~~~~~~~~~~

Apart from Python scripts, this mechanism also allows you to integrate R_ and R Markdown_ scripts with Snakemake, e.g.

.. _R: https://www.r-project.org
.. _Markdown: https://rmarkdown.rstudio.com

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile",
            "path/to/other/inputfile"
        output:
            "path/to/outputfile",
            "path/to/another/outputfile"
        script:
            "scripts/script.R"

In the R script, an S4 object named ``snakemake`` analogous to the Python case above is available and allows access to input and output files and other parameters. Here the syntax follows that of S4 classes with attributes that are R lists, e.g. we can access the first input file with ``snakemake@input[[1]]`` (note that the first file does not have index ``0`` here, because R starts counting from ``1``). Named input and output files can be accessed in the same way, by just providing the name instead of an index, e.g. ``snakemake@input[["myfile"]]``.

An equivalent script (:ref:`to the Python one above <Python>`) written in R would look like this:

.. code-block:: r

    do_something <- function(data_path, out_path, threads, myparam) {
        # R code
    }

    do_something(snakemake@input[[1]], snakemake@output[[1]], snakemake@threads, snakemake@config[["myparam"]])


To debug R scripts, you can save the workspace with ``save.image()``, and invoke R after Snakemake has terminated. Then you can use the usual R debugging facilities while having access to the ``snakemake`` variable.
It is best practice to wrap the actual code into a separate function. This increases the portability if the code shall be invoked outside of Snakemake or from a different rule.
A convenience method, ``snakemake@source()``, acts as a wrapper for the normal R ``source()`` function, and can be used to source files relative to the original script directory.

An R Markdown file can be integrated in the same way as R and Python scripts, but only a single output (html) file can be used:

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile",
            "path/to/other/inputfile"
        output:
            "path/to/report.html",
        script:
            "path/to/report.Rmd"

In the R Markdown file you can insert output from a R command, and access variables stored in the S4 object named ``snakemake``

.. code-block:: R

    ---
    title: "Test Report"
    author:
        - "Your Name"
    date: "`r format(Sys.time(), '%d %B, %Y')`"
    params:
       rmd: "report.Rmd"
    output:
      html_document:
      highlight: tango
      number_sections: no
      theme: default
      toc: yes
      toc_depth: 3
      toc_float:
        collapsed: no
        smooth_scroll: yes
    ---

    ## R Markdown

    This is an R Markdown document.

    Test include from snakemake `r snakemake@input`.

    ## Source
    <a download="report.Rmd" href="`r base64enc::dataURI(file = params$rmd, mime = 'text/rmd', encoding = 'base64')`">R Markdown source file (to produce this document)</a>

A link to the R Markdown document with the snakemake object can be inserted. Therefore a variable called ``rmd`` needs to be added to the ``params`` section in the header of the ``report.Rmd`` file. The generated R Markdown file with snakemake object will be saved in the file specified in this ``rmd`` variable. This file can be embedded into the HTML document using base64 encoding and a link can be inserted as shown in the example above.
Also other input and output files can be embedded in this way to make a portable report. Note that the above method with a data URI only works for small files. An experimental technology to embed larger files is using Javascript Blob `object <https://developer.mozilla.org/en-US/docs/Web/API/Blob>`_.

Julia_
~~~~~~

.. _Julia: https://julialang.org

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile",
            "path/to/other/inputfile"
        output:
            "path/to/outputfile",
            "path/to/another/outputfile"
        script:
            "path/to/script.jl"

In the Julia_ script, a ``snakemake`` object is available, which can be accessed similar to the :ref:`Python case <Python>`, with the only difference that you have to index from 1 instead of 0.

Rust_
~~~~~

.. _Rust: https://www.rust-lang.org/

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile",
            "path/to/other/inputfile",
            named_input="path/to/named/inputfile",
        output:
            "path/to/outputfile",
            "path/to/another/outputfile"
        params:
            seed=4
        conda:
            "rust.yaml"
        log:
            stdout="path/to/stdout.log",
            stderr="path/to/stderr.log",
        script:
            "path/to/script.rs"

The ability to execute Rust scripts is facilitated by |rust-script|_.
As such, the script must be a valid ``rust-script`` script and ``rust-script``
(plus OpenSSL and a C compiler toolchain, provided by Conda packages ``openssl``, ``c-compiler``, ``pkg-config``)
must be available in the environment the rule is run in.
The minimum required ``rust-script`` version is 1.15.0, so in the example above, the contents of ``rust.yaml`` might look like this:

.. code-block:: yaml

    channels:
      - conda-forge
      - bioconda
    dependencies:
      - rust-script>=0.15.0
      - openssl
      - c-compiler
      - pkg-config

Some example scripts can be found in the
`tests directory <https://github.com/snakemake/snakemake/tree/main/tests/test_script/scripts>`_.

In the Rust script, a ``snakemake`` instance is available, which is automatically generated from the python snakemake object using |json_typegen|_.
It usually looks like this:

.. code-block:: rust

    pub struct Snakemake {
        input: Input,
        output: Ouput,
        params: Params,
        wildcards: Wildcards,
        threads: u64,
        log: Log,
        resources: Resources,
        config: Config,
        rulename: String,
        bench_iteration: Option<usize>,
        scriptdir: String,
    }

Any named parameter is translated to a corresponding ``field_name: Type``, such that ``params.seed`` from the example above can be accessed just like in python, i.e.:

.. code-block:: rust

    let seed = snakemake.params.seed;
    assert_eq!(seed, 4);

Positional arguments for ``input``, ``output``, ``log`` and ``wildcards`` can be accessed by index and iterated over:

.. code-block:: rust

    let input = &snakemake.input;

    // Input implements Index<usize>
    let inputfile = input[0];
    assert_eq!(inputfile, "path/to/inputfile");

    // Input implements IntoIterator
    //
    // prints
    // > 'path/to/inputfile'
    // > 'path/to/other/inputfile'
    for f in input {
        println!("> '{}'", &f);
    }


It is also possible to redirect ``stdout`` and ``stderr``:

.. code-block:: rust

    println!("This will NOT be written to path/to/stdout.log");
    // redirect stdout to "path/to/stdout.log"
    let _stdout_redirect = snakemake.redirect_stdout(snakemake.log.stdout)?;
    println!("This will be written to path/to/stdout.log");

    // redirect stderr to "path/to/stderr.log"
    let _stderr_redirect = snakemake.redirect_stderr(snakemake.log.stderr)?;
    eprintln!("This will be written to path/to/stderr.log");
    drop(_stderr_redirect);
    eprintln!("This will NOT be written to path/to/stderr.log");

Redirection of stdout/stderr is only "active" as long as the returned ``Redirect`` instance is alive; in order to stop redirecting, drop the respective instance.

In order to work, rust-script support for snakemake has some dependencies enabled by default:

#. ``anyhow=1``, for its ``Result`` type
#. ``gag=1``, to enable stdout/stderr redirects
#. ``json_typegen=0.6``, for generating rust structs from a json representation of the snakemake object
#. ``lazy_static=1.4``, to make a ``snakemake`` instance easily accessible
#. ``serde=1``, explicit dependency of ``json_typegen``
#. ``serde_derive=1``, explicit dependency of ``json_typegen``
#. ``serde_json=1``, explicit dependency of ``json_typegen``

If your script uses any of these packages, you do not need to ``use`` them in your script. Trying to ``use`` them will cause a compilation error.

.. |rust-script| replace:: ``rust-script``
.. _rust-script: https://rust-script.org/
.. |json_typegen| replace:: ``json_typegen``
.. _json_typegen: https://github.com/evestera/json_typegen


Bash
~~~~

Bash scripts work much the same as the other script languages above, but with some important differences. Access to the
rule's directives is provided through the use of `associative arrays <arrays_>`_ - **requiring Bash version 4.0 or greater**.
One "limitation" of associative arrays is they cannot be nested. As such, the following rule directives are found in a separate
variable, named as ``snakemake_<directive>``:

* ``input``
* ``output``
* ``log``
* ``wildcards``
* ``resources``
* ``params``
* ``config``

Access to the ``input`` directive is faciliated through the bash associative array named ``snakemake_input``. The
remaining directives can be found in the variable ``snakemake``.

.. sidebar:: Note

    As arrays cannot be nested in Bash, use of python's ``dict`` in directives is not supported. So, adding a ``params`` key of ``data={"foo": "bar"}`` will not be reflected - ``${snakemake_params[data]}`` actually only returns ``"foo"``.

Bash Example 1
^^^^^^^^^^^^^^

.. code-block:: python

    rule align:
        input:
            "{sample}.fq",
            reference="ref.fa",
        output:
            "{sample}.sam"
        params:
            opts="-a -x map-ont",
        threads: 4
        log:
            "align/{sample}.log"
        conda:
            "envs/align.yaml"
        script:
            "scripts/align.sh"



``align.sh``

.. code-block:: bash

    #!/usr/bin/env bash

    echo "Aligning sample ${snakemake_wildcards[sample]} with minimap2" 2> "${snakemake_log[0]}"

    minimap2 ${snakemake_params[opts]} -t ${snakemake[threads]} "${snakemake_input[reference]}" \
        "${snakemake_input[0]}" > "${snakemake_output[0]}" 2>> "${snakemake_log[0]}"


If you don't add a shebang, the default ``#!/usr/bin/env bash`` will be inserted for you. A tutorial on how to use
associative arrays can be found `here <https://www.xmodulo.com/key-value-dictionary-bash.html>`_.

You may also have noticed the mixed use of double-quotes when accessing some variables. It is generally good practice in
Bash to double-quote variables for which you want to `prevent word splitting <split_>`_; generally, you will want to
double-quote any variable that could contain a file name. However, `in some cases <exception_>`_, word splitting *is* desired,
such as ``${snakemake_params[opts]}`` in the above example.

Bash Example 2
^^^^^^^^^^^^^^

.. code-block:: python

    rule align:
        input:
            reads=["{sample}_R1.fq", "{sample}_R2.fq]"],
            reference="ref.fa",
        output:
            "{sample}.sam"
        params:
            opts="-M",
        threads: 4
        log:
            "align/{sample}.log"
        conda:
            "envs/align.yaml"
        script:
            "scripts/align.sh"


In this example, the ``input`` variable ``reads``, which is a python list, actually gets stored as a space-separated string
in Bash because, you guessed it, you can't nest arrays in Bash! So in order to access the individual members, we turn the
string into an array; allowing us to access individual elements of the list/array. See `this stackoverflow question <so_>`_ for other solutions.

``align.sh``

.. code-block:: bash

    #!/usr/bin/env bash

    exec 2> "${snakemake_log[0]}"  # send all stderr from this script to the log file

    reads=(${snakemake_input[reads]})  # don't double-quote this - we want word splitting

    r1="${reads[0]}"
    r2="${reads[1]}"

    bwa index "${snakemake_input[reference]}"
    bwa mem ${snakemake_params[opts]} -t ${snakemake[threads]} \
        "${snakemake_input[reference]}" "$r1" "$r2" > "${snakemake_output[0]}"

If, in the above example, the fastq reads were not in a named variable, but were instead just a list, they would be available
as ``"${snakemake_input[0]}"`` and ``"${snakemake_input[1]}"``.

.. _arrays: https://www.gnu.org/software/bash/manual/html_node/Arrays.html#Arrays
.. _split: https://github.com/koalaman/shellcheck/wiki/SC2046
.. _exception: https://github.com/koalaman/shellcheck/wiki/SC2046#exceptions
.. _so: https://stackoverflow.com/q/1469849/5299417

----

For technical reasons, scripts are executed in ``.snakemake/scripts``. The original script directory is available as ``scriptdir`` in the ``snakemake`` object.

.. _snakefiles_notebook-integration:

Jupyter notebook integration
----------------------------

Instead of plain scripts (see above), one can integrate Jupyter_ Notebooks.
This enables the interactive development of data analysis components (e.g. for plotting).
Integration works as follows (note the use of `notebook:` instead of `script:`):

.. _Jupyter: https://jupyter.org/

.. code-block:: python

    rule hello:
        output:
            "test.txt"
        log:
            # optional path to the processed notebook
            notebook="logs/notebooks/processed_notebook.ipynb"
        notebook:
            "notebooks/hello.py.ipynb"

.. sidebar:: Note

    Consider Jupyter notebook integration as a way to get the best of both worlds.
    A modular, readable workflow definition with Snakemake, and the ability to quickly explore and plot data with Jupyter.
    The benefit will be maximal when integrating many small notebooks that each do a particular job, hence allowing to get away from large monolithic, and therefore unreadable notebooks.

It is recommended to prefix the ``.ipynb`` suffix with either ``.py`` or ``.r`` to indicate the notebook language.
In the notebook, a snakemake object is available, which can be accessed in the same way as the with :ref:`script integration <snakefiles_external-scripts>`.
In other words, you have access to input files via ``snakemake.input`` (in the Python case) and ``snakemake@input`` (in the R case) etc..
Optionally it is possible to automatically store the processed notebook.
This can be achieved by adding a named logfile ``notebook=...`` to the ``log`` directive.

.. sidebar:: Note

    It is possible to refer to wildcards and params in the notebook path, e.g. by specifying ``"notebook/{params.name}.py"`` or ``"notebook/{wildcards.name}.py"``.

In order to simplify the coding of notebooks given the automatically inserted ``snakemake`` object, Snakemake provides an interactive edit mode for notebook rules.
Let us assume you have written above rule, but the notebook does not yet exist.
By running

.. code-block:: console

    snakemake --cores 1 --edit-notebook test.txt

you instruct Snakemake to allow interactive editing of the notebook needed to create the file ``test.txt``.
Snakemake will run all dependencies of the notebook rule, such that all input files are present.
Then, it will start a jupyter notebook server with an empty draft of the notebook, in which you can interactively program everything needed for this particular step.
Once done, you should save the notebook from the jupyter web interface, go to the jupyter dashboard and hit the ``Quit`` button on the top right in order to shut down the jupyter server.
Snakemake will detect that the server is closed and automatically store the drafted notebook into the path given in the rule (here ``hello.py.ipynb``).
If the notebook already exists, above procedure can be used to easily modify it.
Note that Snakemake requires local execution for the notebook edit mode.
On a cluster or the cloud, you can generate all dependencies of the notebook rule via

.. code-block:: console

    snakemake --cluster ... --jobs 100 --until test.txt

Then, the notebook rule can easily be executed locally.
An demo of the entire interactive editing process can be found by clicking below:

.. image:: images/snakemake-notebook-demo.gif
    :scale: 20%
    :alt: Notebook integration demo
    :align: center

Finally, it is advisable to combine the ``notebook`` directive with the ``conda`` directive (see :ref:`integrated_package_management`) in order to define a software stack to use.
At least, this software stack should contain jupyter and the language to use (e.g. Python or R).
For the above case, this means

.. code-block:: python

    rule hello:
        output:
            "test.txt"
        conda:
            "envs/hello.yaml"
        notebook:
            "notebooks/hello.py.ipynb"

with

.. code-block:: yaml

    channels:
      - conda-forge
    dependencies:
      - python =3.8
      - jupyter =1.0
      - jupyterlab_code_formatter =1.4

The last dependency is advisable in order to enable autoformatting of notebook cells when editing.
When using other languages than Python in the notebook, one needs to additionally add the respective kernel, e.g. ``r-irkernel`` for R support.

When using an IDE with built-in Jupyter support, an alternative to ``--edit-notebook`` is ``--draft-notebook``.
Instead of firing up a notebook server, ``--draft-notebook`` just creates a skeleton notebook for editing within the IDE.
In addition, it prints instructions for configuring the IDE's notebook environment to use the interpreter from the
Conda environment defined in the corresponding rule.
For example, running

.. code-block:: console

    snakemake --cores 1 --draft-notebook test.txt --use-conda

will generate skeleton code in ``notebooks/hello.py.ipynb`` and additionally print instructions on how to open and execute the notebook in VSCode.


Protected and Temporary Files
-----------------------------

A particular output file may require a huge amount of computation time. Hence one might want to protect it against accidental deletion or overwriting. Snakemake allows this by marking such a file as ``protected``:

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile"
        output:
            protected("path/to/outputfile")
        shell:
            "somecommand {input} {output}"

A protected file will be write-protected after the rule that produces it is completed.

Further, an output file marked as ``temp`` is deleted after all rules that use it as an input are completed:

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile"
        output:
            temp("path/to/outputfile")
        shell:
            "somecommand {input} {output}"

.. _snakefiles-directory_output:

Directories as outputs
----------------------

Sometimes it can be convenient to have directories, rather than files, as outputs of a rule.
As of version 5.2.0, directories as outputs have to be explicitly marked with ``directory``.
This is primarily for safety reasons; since all outputs are deleted before a job is executed, we don't want to risk deleting important directories if the user makes some mistake.
Marking the output as ``directory`` makes the intent clear, and the output can be safely removed.
Another reason comes down to how modification time for directories work.
The modification time on a directory changes when a file or a subdirectory is added, removed or renamed.
This can easily happen in not-quite-intended ways, such as when Apple macOS or MS Windows add ``.DS_Store`` or ``thumbs.db`` files to store parameters for how the directory contents should be displayed.
When the ``directory`` flag is used a hidden file called ``.snakemake_timestamp`` is created in the output directory, and the modification time of that file is used when determining whether the rule output is up to date or if it needs to be rerun.
Always consider if you can't formulate your workflow using normal files before resorting to using ``directory()``.

.. code-block:: python

    rule NAME:
        input:
            "path/to/inputfile"
        output:
            directory("path/to/outputdir")
        shell:
            "somecommand {input} {output}"

Ignoring timestamps
-------------------

For determining whether output files have to be re-created, Snakemake checks whether the file modification date (i.e. the timestamp) of any input file of the same job is newer than the timestamp of the output file.
This behavior can be overridden by marking an input file as ``ancient``.
The timestamp of such files is ignored and always assumed to be older than any of the output files:

.. code-block:: python

    rule NAME:
        input:
            ancient("path/to/inputfile")
        output:
            "path/to/outputfile"
        shell:
            "somecommand {input} {output}"

Here, this means that the file ``path/to/outputfile`` will not be triggered for re-creation after it has been generated once, even when the input file is modified in the future.
Note that any flag that forces re-creation of files still also applies to files marked as ``ancient``.

.. _snakefiles_ensure::

Ensuring output file properties like non-emptyness or checksum compliance
-------------------------------------------------------------------------

It is possible to annotate certain additional criteria for output files to be ensured after they have been generated successfully.
For example, this can be used to check for output files to be non-empty, or to compare them against a given sha256 checksum.
If this functionality is used, Snakemake will check such annotated files before considering a job to be successfull.
Non-emptyness can be checked as follows:

.. code-block:: python

    rule NAME:
        output:
            ensure("test.txt", non_empty=True)
        shell:
            "somecommand {output}"

Above, the output file ``test.txt`` is marked as non-empty.
If the command ``somecommand`` happens to generate an empty output,
the job will fail with an error listing the unexpected empty file.

A sha256 checksum can be compared as follows:

.. code-block:: python

    my_checksum = "u98a9cjsd98saud090923ßkpoasköf9ß32"

    rule NAME:
        output:
            ensure("test.txt", sha256=my_checksum)
        shell:
            "somecommand {output}"

In addition to providing the checksum as plain string, it is possible to provide a pointer to a function (similar to :ref:`input functions <snakefiles_input-functions>`).
The function has to accept a single argument that will be the wildcards object generated from the application of the rule to create some requested output files:

.. code-block:: python

    def get_checksum(wildcards):
        # e.g., look up the checksum with the value of the wildcard sample
        # in some dictionary
        return my_checksums[wildcards.sample]

    rule NAME:
        output:
            ensure("test/{sample}.txt", sha256=get_checksum)
        shell:
            "somecommand {output}"


Note that you can also use `lambda expressions <https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions>`_ instead of full function definitions.

Often, it is a good idea to combine ``ensure`` annotations with :ref:`retry definitions <snakefiles_retries>`, e.g. for retrying upon invalid checksums or empty files.

Shadow rules
------------

Shadow rules result in each execution of the rule to be run in isolated temporary directories.
This "shadow" directory contains symlinks to files and directories in the current workdir.
This is useful for running programs that generate lots of unused files which you don't want to manually cleanup in your snakemake workflow.
It can also be useful if you want to keep your workdir clean while the program executes,
or simplify your workflow by not having to worry about unique filenames for all outputs of all rules.

By setting ``shadow: "shallow"``, the top level files and directories are symlinked,
so that any relative paths in a subdirectory will be real paths in the filesystem.
The setting ``shadow: "full"`` fully shadows the entire subdirectory structure of the current workdir.
The setting ``shadow: "minimal"`` only symlinks the inputs to the rule,
and ``shadow: "copy-minimal"`` copies the inputs instead of just creating symlinks.
Once the rule successfully executes, the output file will be moved if necessary to the real path as indicated by ``output``.

Typically, you will not need to modify your rule for compatibility with ``shadow``,
unless you reference parent directories relative to your workdir in a rule.

.. code-block:: python

    rule NAME:
        input: "path/to/inputfile"
        output: "path/to/outputfile"
        shadow: "shallow"
        shell: "somecommand --other_outputs other.txt {input} {output}"

Shadow directories are stored one per rule execution in ``.snakemake/shadow/``,
and are cleared on successful execution.
Consider running with the ``--cleanup-shadow`` argument every now and then
to remove any remaining shadow directories from aborted jobs.
The base shadow directory can be changed with the ``--shadow-prefix`` command line argument.

.. _snakefiles_retries:

Defining retries for fallible rules
-----------------------------------

Sometimes, rules may be expected to fail occasionally.
For example, this can happen when a rule downloads some online resources.
For such cases, it is possible to defined a number of automatic retries for each job from that particular rule via the ``retries`` directive:

.. code-block:: python

    rule a:
        output:
            "test.txt"
        retries: 3
        shell:
            "curl https://some.unreliable.server/test.txt > {output}"

Often, it is a good idea to combine retry functionality with :ref:`ensure annotations <snakefiles_ensure>`, e.g. for retrying upon invalid checksums or empty files.

Note that it is also possible to define retries globally (via the ``--retries`` command line option, see :ref:`all_options`).
The local definition of the rule thereby overwrites the global definition.

Importantly the ``retries`` directive is meant to be used for defining platform independent behavior (like adding robustness to above download command).
For dealing with unreliable cluster or cloud systems, you should use the ``--retries`` command line option.

Flag files
----------

Sometimes it is necessary to enforce some rule execution order without real file dependencies. This can be achieved by "touching" empty files that denote that a certain task was completed. Snakemake supports this via the `touch` flag:

.. code-block:: python

    rule all:
        input: "mytask.done"

    rule mytask:
        output: touch("mytask.done")
        shell: "mycommand ..."

With the ``touch`` flag, Snakemake touches (i.e. creates or updates) the file ``mytask.done`` after ``mycommand`` has finished successfully.


.. _snakefiles-job_properties:

Job Properties
--------------

.. sidebar:: Note

    If there are more than 100 input and/or output files for a job, ``None`` will be used instead of listing all values. This is to prevent the jobscript from becoming larger than `Slurm jobscript size limits <https://slurm.schedmd.com/slurm.conf.html#OPT_max_script_size=#>`_.

When executing a workflow on a cluster using the ``--cluster`` parameter (see below), Snakemake creates a job script for each job to execute.
This script is then invoked using the provided cluster submission command (e.g. ``qsub``).
Sometimes you want to provide a custom wrapper for the cluster submission command that decides about additional parameters.
As this might be based on properties of the job, Snakemake stores the job properties (e.g. rule name, threads, input files, params etc.) as JSON inside the job script.
For convenience, there exists a parser function ``snakemake.utils.read_job_properties`` that can be used to access the properties.
The following shows an example job submission wrapper:

.. code-block:: python

    #!/usr/bin/env python3
    import os
    import sys

    from snakemake.utils import read_job_properties

    jobscript = sys.argv[1]
    job_properties = read_job_properties(jobscript)

    # do something useful with the threads
    threads = job_properties[threads]

    # access property defined in the cluster configuration file (Snakemake >=3.6.0)
    job_properties["cluster"]["time"]

    os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))

.. _snakefiles-input_functions:

Input functions
---------------

Instead of specifying strings or lists of strings as input files, snakemake can also make use of functions that return single **or** lists of input files:

.. code-block:: python

    def myfunc(wildcards):
        return [... a list of input files depending on given wildcards ...]

    rule:
        input:
            myfunc
        output:
            "someoutput.{somewildcard}.txt"
        shell:
            "..."

The function has to accept a single argument that will be the wildcards object generated from the application of the rule to create some requested output files.
Note that you can also use `lambda expressions <https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions>`_ instead of full function definitions.
By this, rules can have entirely different input files (both in form and number) depending on the inferred wildcards. E.g. you can assign input files that appear in entirely different parts of your filesystem based on some wildcard value and a dictionary that maps the wildcard value to file paths.

In additon to a single wildcards argument, input functions can optionally take a ``groupid`` (with exactly that name) as second argument, see :ref:`snakefiles_group-local` for details.

Finally, when implementing the input function, it is best practice to make sure that it can properly handle all possible wildcard values your rule can have.
In particular, input files should not be combined with very general rules that can be applied to create almost any file: Snakemake will try to apply the rule, and will report the exceptions of your input function as errors.

For a practical example, see the :ref:`tutorial` (:ref:`tutorial-input_functions`).

.. _snakefiles-unpack:

Input Functions and ``unpack()``
--------------------------------

In some cases, you might want to have your input functions return named input files.
This can be done by having them return ``dict()`` objects with the names as the dict keys and the file names as the dict values and using the ``unpack()`` keyword.

.. code-block:: python

    def myfunc(wildcards):
        return {'foo': '{wildcards.token}.txt'.format(wildcards=wildcards)}

    rule:
        input:
            unpack(myfunc)
        output:
            "someoutput.{token}.txt"
        shell:
            "..."

Note that ``unpack()`` is only necessary for input functions returning ``dict``.
While it also works for ``list``, remember that lists (and nested lists) of strings are automatically flattened.

Also note that if you do not pass in a *function* into the input list but you directly *call a function* then you shouldn't use ``unpack()``.
Here, you can simply use Python's double-star (``**``) operator for unpacking the parameters.

Note that as Snakefiles are translated into Python for execution, the same rules as for using the `star and double-star unpacking Python operators <https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists>`_ apply.
These restrictions do not apply when using ``unpack()``.

.. code-block:: python

    def myfunc1():
        return ['foo.txt']

    def myfunc2():
        return {'foo': 'nowildcards.txt'}

    rule:
        input:
            *myfunc1(),
            **myfunc2(),
        output:
            "..."
        shell:
            "..."

.. _snakefiles-code_tracking:

Code Tracking
-------------

Snakemake tracks the code that was used to create your files.
In combination with ``--summary`` or ``--list-code-changes`` this can be used to see what files may need a re-run because the implementation changed.
Re-run can be automated by invoking Snakemake as follows:

.. code-block:: console

    $ snakemake -R `snakemake --list-code-changes`


.. _snakefiles-job_lifetime_handlers:

Onstart, onsuccess and onerror handlers
---------------------------------------

Sometimes, it is necessary to specify code that shall be executed when the workflow execution is finished (e.g. cleanup, or notification of the user).
With Snakemake 3.2.1, this is possible via the ``onsuccess`` and ``onerror`` keywords:

.. code-block:: python

    onsuccess:
        print("Workflow finished, no error")

    onerror:
        print("An error occurred")
        shell("mail -s "an error occurred" youremail@provider.com < {log}")

The ``onsuccess`` handler is executed if the workflow finished without error. Otherwise, the ``onerror`` handler is executed.
In both handlers, you have access to the variable ``log``, which contains the path to a logfile with the complete Snakemake output.
Snakemake 3.6.0 adds an ``onstart`` handler, that will be executed before the workflow starts.
Note that dry-runs do not trigger any of the handlers.


Rule dependencies
-----------------

From version 2.4.8 on, rules can also refer to the output of other rules in the Snakefile, e.g.:

.. code-block:: python

    rule a:
        input:  "path/to/input"
        output: "path/to/output"
        shell:  ...

    rule b:
        input:  rules.a.output
        output: "path/to/output/of/b"
        shell:  ...

Importantly, be aware that referring to rule ``a`` here requires that rule ``a`` was defined above rule ``b`` in the file, since the object has to be known already.
This feature also allows us to resolve dependencies that are ambiguous when using filenames.

Note that when the rule you refer to defines multiple output files but you want to require only a subset of those as input for another rule, you should name the output files and refer to them specifically:

.. code-block:: python

    rule a:
        input:  "path/to/input"
        output: a = "path/to/output", b = "path/to/output2"
        shell:  ...

    rule b:
        input:  rules.a.output.a
        output: "path/to/output/of/b"
        shell:  ...


.. _snakefiles-ambiguous-rules:

Handling Ambiguous Rules
------------------------

When two rules can produce the same output file, snakemake cannot decide which one to use without additional guidance. Hence an ``AmbiguousRuleException`` is thrown.
Note: ruleorder is not intended to bring rules in the correct execution order (this is solely guided by the names of input and output files you use), it only helps snakemake to decide which rule to use when multiple ones can create the same output file!
To deal with such ambiguity, provide a ``ruleorder`` for the conflicting rules, e.g.

.. code-block:: python

    ruleorder: rule1 > rule2 > rule3

Here, ``rule1`` is preferred over ``rule2`` and ``rule3``, and ``rule2`` is preferred over ``rule3``.
Only if rule1 and rule2 cannot be applied (e.g. due to missing input files), rule3 is used to produce the desired output file.

Alternatively, rule dependencies (see above) can also resolve ambiguities.

Another (quick and dirty) possiblity is to tell snakemake to allow ambiguity via a command line option

.. code-block:: console

    $ snakemake --allow-ambiguity

such that similar to GNU Make always the first matching rule is used. Here, a warning that summarizes the decision of snakemake is provided at the terminal.

.. _snakefiles-local-rules:

Local Rules
-----------

When working in a cluster environment, not all rules need to become a job that has to be submitted (e.g. downloading some file, or a target-rule like `all`, see :ref:`snakefiles-targets`).
The keyword `localrules` allows to mark a rule as local, so that it is not submitted to the cluster and instead executed on the host node:

.. code-block:: python

    localrules: all, foo

    rule all:
        input: ...

    rule foo:
        ...

    rule bar:
        ...

Here, only jobs from the rule ``bar`` will be submitted to the cluster, whereas all and foo will be run locally.
Note that you can use the localrules directive **multiple times**. The result will be the union of all declarations.

Alternatively, you can also use the rule directive `localrule`:

.. code-block:: python

    rule all:
        input: ...
        localrule: True

    rule foo:
        ...
        localrule: True

    rule bar:
        ...

Benchmark Rules
---------------

Since version 3.1, Snakemake provides support for benchmarking the run times of rules.
This can be used to create complex performance analysis pipelines.
With the `benchmark` keyword, a rule can be declared to store a benchmark of its code into the specified location. E.g. the rule

.. code-block:: python

    rule benchmark_command:
        input:
            "path/to/input.{sample}.txt"
        output:
            "path/to/output.{sample}.txt"
        benchmark:
            "benchmarks/somecommand/{sample}.tsv"
        shell:
            "somecommand {input} {output}"

benchmarks the

* CPU time (in seconds),
* wall clock time,
* memory usage (`RSS <https://en.wikipedia.org/wiki/Resident_set_size>`_, `VMS <https://en.wikipedia.org/wiki/Virtual_memory>`_, `USS <https://en.wikipedia.org/wiki/Unique_set_size>`_, `PSS <https://en.wikipedia.org/wiki/Proportional_set_size>`_ in megabytes),
* CPU load (CPU time divided by wall clock time),
* I/O (in bytes)

of the command ``somecommand`` for the given output and input files.

For this, the shell or run body of the rule is executed on that data, and all run times are stored into the given benchmark tsv file (which will contain a tab-separated table of run times and memory usage in MiB).
Per default, Snakemake executes the job once, generating one run time.
However, the benchmark file can be annotated with the desired number of repeats, e.g.,

.. code-block:: python

    rule benchmark_command:
        input:
            "path/to/input.{sample}.txt"
        output:
            "path/to/output.{sample}.txt"
        benchmark:
            repeat("benchmarks/somecommand/{sample}.tsv", 3)
        shell:
            "somecommand {input} {output}"

will instruct Snakemake to run each job of this rule three times and store all measurements in the benchmark file.
The resulting tsv file can be used as input for other rules, just like any other output file.

.. sidebar:: Note

    Note that benchmarking is only possible in a reliable fashion for subprocesses (thus for tasks run through the ``shell``, ``script``, and ``wrapper`` directive).
    In the ``run`` block, the variable ``bench_record`` is available that you can pass to ``shell()`` as ``bench_record=bench_record``.
    When using ``shell(..., bench_record=bench_record)``, the maximum of all measurements of all ``shell()`` calls will be used but the running time of the rule execution including any Python code.


.. _snakefiles-scattergather:

Defining scatter-gather processes
---------------------------------

Via Snakemake's powerful and abitrary Python based aggregation abilities (via the ``expand`` function and arbitrary Python code, see :ref:`here <snakefiles_aggregation>`), scatter-gather workflows are well supported.
Nevertheless, it can sometimes be handy to use Snakemake's specific scatter-gather support, which allows to avoid boilerplate and offers additional configuration options.
Scatter-gather processes can be defined via a global ``scattergather`` directive:

.. code-block:: python

    scattergather:
        split=8

Each process thereby defines a name (here e.g. ``split``) and a default number of scatter items.
Then, scattering and gathering can be implemented by using globally available ``scatter`` and ``gather`` objects:

.. code-block:: python


    rule all:
        input:
            "gathered/all.txt"


    rule split:
        output:
            scatter.split("splitted/{scatteritem}.txt")
        shell:
            "touch {output}"


    rule intermediate:
        input:
            "splitted/{scatteritem}.txt"
        output:
            "splitted/{scatteritem}.post.txt"
        shell:
            "cp {input} {output}"


    rule gather:
        input:
            gather.split("splitted/{scatteritem}.post.txt")
        output:
            "gathered/all.txt"
        shell:
            "cat {input} > {output}"

Thereby, ``scatter.split("splitted/{scatteritem}.txt")`` yields a list of paths ``"splitted/1-of-n.txt"``, ``"splitted/2-of-n.txt"``, ..., depending on the number ``n`` of scatter items defined.
Analogously, ``gather.split("splitted/{scatteritem}.post.txt")``, yields a list of paths ``"splitted/0.post.txt"``, ``"splitted/1.post.txt"``, ..., which request the application of the rule ``intermediate`` to each scatter item.

The default number of scatter items can be overwritten via the command line interface.
For example

.. code-block:: bash

    snakemake --set-scatter split=2

would set the number of scatter items for the split process defined above to 2 instead of 8.
This allows to adapt parallelization according to the needs of the underlying computing platform and the analysis at hand.

For more complex workflows it's possible to define multiple processes, for example:

.. code-block:: python

    scattergather:
        split_a=8,
        split_b=3,

The calls to ``scatter`` and ``gather`` would need to reference the appropriate process name, e.g. ``scatter.split_a`` and ``gather.split_a`` to use the ``split_a`` settings.

.. _snakefiles-grouping:

Defining groups for execution
-----------------------------

From Snakemake 5.0 on, it is possible to assign rules to groups.
Such groups will be executed together in **cluster** or **cloud mode**, as a so-called **group job**, i.e., all jobs of a particular group will be submitted at once, to the same computing node.
When executing locally, group definitions are ignored.

Groups can be defined via the ``group`` keyword.
This way, queueing and execution time can be saved, in particular if one or several short-running rules are involved.

.. code-block:: python

  samples = [1,2,3,4,5]


  rule all:
      input:
          "test.out"


  rule a:
      output:
          "a/{sample}.out"
      group: "mygroup"
      shell:
          "touch {output}"


  rule b:
      input:
          "a/{sample}.out"
      output:
          "b/{sample}.out"
      group: "mygroup"
      shell:
          "touch {output}"


  rule c:
      input:
          expand("b/{sample}.out", sample=samples)
      output:
          "test.out"
      shell:
          "touch {output}"

Here, jobs from rule ``a`` and ``b`` end up in one group ``mygroup``, whereas jobs from rule ``c`` are executed separately.
Note that Snakemake always determines a **connected subgraph** with the same group id to be a **group job**.
Here, this means that, e.g., the jobs creating ``a/1.out`` and ``b/1.out`` will be in one group, and the jobs creating ``a/2.out`` and ``b/2.out`` will be in a separate group.
However, if we would add ``group: "mygroup"`` to rule ``c``, all jobs would end up in a single group, including the one spawned from rule ``c``, because ``c`` connects all the other jobs.

Alternatively, groups can be defined via the command line interface.
This enables to almost arbitrarily partition the DAG, e.g. in order to safe network traffic, see :ref:`here <job_grouping>`.

For execution on the cloud using Google Life Science API and preemptible instances, we expect all rules in the group to be homogenously set as preemptible instances (e.g., with command-line option ``--preemptible-rules``), such that a preemptible VM is requested for the execution of the group job.

.. _snakefiles_group-local:

Group-local jobs
~~~~~~~~~~~~~~~~

From Snakemake 7.0 on, it is further possible to ensure that jobs from a certain rule are executed separately within each :ref:`job group <job_grouping>`.
For this purpose we use :ref:`input functions <snakefiles-input_functions>`, which, in addition to the ``wildcards`` argument can expect a ``groupid`` argument.
In such a case, Snakemake passes the ID of the corresponding group job to the input function.
Consider the following example

.. code-block:: python

    rule all:
        input:
            expand("bar{i}.txt", i=range(3))


    rule grouplocal:
        output:
            "foo.{groupid}.txt"
        group:
            "foo"
        shell:
            "echo test > {output}"


    def get_input(wildcards, groupid):
        return f"foo.{groupid}.txt"


    rule consumer:
        input:
            get_input
        output:
            "bar{i}.txt"
        group:
            "foo"
        shell:
            "cp {input} {output}"

Here, the value of ``groupid`` that is passed by Snakemake to the input function is a `UUID <https://en.wikipedia.org/wiki/Universally_unique_identifier>`_ that uniquely identifies the group job in which each instance of the rule ``consumer`` is contained.
In the input function ``get_input`` we use this ID to request the desired input file from the rule ``grouplocal``.
Since the value of the corresponding wildcard ``groupid`` is now always a group specific unique ID, it is ensured that the rule ``grouplocal`` will run for every group job spawned from the group ``foo`` (remember that group jobs by default only span one connected component, and that this can be configured via the command line, see :ref:`job_grouping`).
Of course, above example would also work if the groups are not specified via the rule definition but entirely via the :ref:`command line <job_grouping>`.

.. _snakefiles-piped-output:

Piped output
------------

From Snakemake 5.0 on, it is possible to mark output files as pipes, via the ``pipe`` flag, e.g.:

.. code-block:: python

  rule all:
      input:
          expand("test.{i}.out", i=range(2))


  rule a:
      output:
          pipe("test.{i}.txt")
      shell:
          "for i in {{0..2}}; do echo {wildcards.i} >> {output}; done"


  rule b:
      input:
          "test.{i}.txt"
      output:
          "test.{i}.out"
      shell:
          "grep {wildcards.i} < {input} > {output}"

If an output file is marked to be a pipe, then Snakemake will first create a `named pipe <https://en.wikipedia.org/wiki/Named_pipe>`_ with the given name and then execute the creating job simultaneously with the consuming job, inside a **group job** (see above).
This works in all execution modes, local, cluster, and cloud.
Naturally, a pipe output may only have a single consumer.
It is possible to combine explicit group definition as above with pipe outputs.
Thereby, pipe jobs can live within, or (automatically) extend existing groups.
However, the two jobs connected by a pipe may not exist in conflicting groups.

As with other groups, Snakemake will automatically calculate the required resources for the group job (see :ref:`resources <snakefiles-resources>`.

.. _snakefiles-service-rules:

Service rules/jobs
------------------

From Snakemake 7.0 on, it is possible to define so-called service rules.
Jobs spawned from such rules provide at least one special output file that is marked as ``service``, which means that it is considered to provide a resource that shall be kept available until all consuming jobs are finished.
This can for example be the socket of a database, a shared memory device, a ramdisk, and so on.
It can even just be a dummy file, and access to the service might happen via a different channel (e.g. a local http port).
Service jobs are expected to not exit after creating that resource, but instead wait until Snakemake terminates them (e.g. via SIGTERM on Unixoid systems).

Consider the following example:

.. code-block:: python

    rule the_service:
        output:
            service("foo.socket")
        shell:
            # here we simulate some kind of server process that provides data via a socket
            "ln -s /dev/random {output}; sleep 10000"


    rule consumer1:
        input:
            "foo.socket"
        output:
            "test.txt"
        shell:
            "head -n1 {input} > {output}"


    rule consumer2:
        input:
            "foo.socket"
        output:
            "test2.txt"
        shell:
            "head -n1 {input} > {output}"

Snakemake will schedule the service with all consumers to the same physical node (in the future we might provide further controls and other modes of operation).
Once all consumer jobs are finished, the service job will be terminated automatically by Snakemake, and the service output will be removed.

Group-local service jobs
~~~~~~~~~~~~~~~~~~~~~~~~

Since Snakemake supports arbitrary partitioning of the DAG into so-called :ref:`job groups <job-grouping>`, one should consider what this implies for service jobs when running a workflow in a cluster of cloud context:
since each group job spans at least one connected component (see :ref:`job groups <job-grouping>` and `the Snakemake paper <https://doi.org/10.12688/f1000research.29032.2>`), this means that the service job will automatically connect all consumers into one big group.
This can be undesired, because depending on the number of consumers that group job can become too big for efficient execution on the underlying architecture.
In case of local execution, this is not a problem because here DAG partitioning has no effect.

However, to make a workflow portable across different backends, this behavior should always be considered.
In order to circumvent it, it is possible to model service jobs as group-local, i.e. ensuring that each group job gets its own instance of the service rule.
This works by combining the service job pattern from above with the :ref:`group-local pattern <snakefiles_group-local>` as follows:

.. code-block:: python

    rule the_service:
        output:
            service("foo.{groupid}.socket")
        shell:
            # here we simulate some kind of server process that provides data via a socket
            "ln -s /dev/random {output}; sleep 10000"


    def get_socket(wildcards, groupid):
        return f"foo.{groupid}.socket"


    rule consumer1:
        input:
            get_socket
        output:
            "test.txt"
        shell:
            "head -n1 {input} > {output}"


    rule consumer2:
        input:
            get_socket
        output:
            "test2.txt"
        shell:
            "head -n1 {input} > {output}"

.. _snakefiles-paramspace:

Parameter space exploration
---------------------------

The basic Snakemake functionality already provides everything to handle parameter spaces in any way (sub-spacing for certain rules and even depending on wildcard values, the ability to read or generate spaces on the fly or from files via pandas, etc.).
However, it usually would require some boilerplate code for translating a parameter space into wildcard patterns, and translate it back into concrete parameters for scripts and commands.
From Snakemake 5.31 on (inspired by `JUDI <https://pyjudi.readthedocs.io>`_), this is solved via the Paramspace helper, which can be used as follows:

.. code-block:: python

    from snakemake.utils import Paramspace
    import pandas as pd

    # declare a dataframe to be a paramspace
    paramspace = Paramspace(pd.read_csv("params.tsv", sep="\t"))


    rule all:
        input:
            # Aggregate over entire parameter space (or a subset thereof if needed)
            # of course, something like this can happen anywhere in the workflow (not
            # only at the end).
            expand("results/plots/{params}.pdf", params=paramspace.instance_patterns)


    rule simulate:
        output:
            # format a wildcard pattern like "alpha~{alpha}/beta~{beta}/gamma~{gamma}"
            # into a file path, with alpha, beta, gamma being the columns of the data frame
            f"results/simulations/{paramspace.wildcard_pattern}.tsv"
        params:
            # automatically translate the wildcard values into an instance of the param space
            # in the form of a dict (here: {"alpha": ..., "beta": ..., "gamma": ...})
            simulation=paramspace.instance
        script:
            "scripts/simulate.py"


    rule plot:
        input:
            f"results/simulations/{paramspace.wildcard_pattern}.tsv"
        output:
            f"results/plots/{paramspace.wildcard_pattern}.pdf"
        shell:
            "touch {output}"


In above example, **please note** the Python ``f``-string formatting (the ``f`` before the initial quotes) applied to the input and output file strings that contain ``paramspace.wildcard_pattern``.
This means that the file that is registered as input or output file by Snakemake does not contain a wildcard ``{paramspace.wildcard_pattern}``, but instead this item is replaced by a pattern of multiple wildcards derived from the columns of the paramter space dataframe.
This is done by the Python ``f``-string formatting before the string is registered in the rule.
Given that `params.tsv` contains:

.. code-block:: none

    alpha	beta	gamma
    1.0	0.1	0.99
    2.0	0.0	3.9


This workflow will run as follows:

.. code-block:: none

    [Fri Nov 27 20:57:27 2020]
    rule simulate:
        output: results/simulations/alpha~2.0/beta~0.0/gamma~3.9.tsv
        jobid: 4
        wildcards: alpha=2.0, beta=0.0, gamma=3.9

    [Fri Nov 27 20:57:27 2020]
    rule simulate:
        output: results/simulations/alpha~1.0/beta~0.1/gamma~0.99.tsv
        jobid: 2
        wildcards: alpha=1.0, beta=0.1, gamma=0.99

    [Fri Nov 27 20:57:27 2020]
    rule plot:
        input: results/simulations/alpha~2.0/beta~0.0/gamma~3.9.tsv
        output: results/plots/alpha~2.0/beta~0.0/gamma~3.9.pdf
        jobid: 3
        wildcards: alpha=2.0, beta=0.0, gamma=3.9


    [Fri Nov 27 20:57:27 2020]
    rule plot:
        input: results/simulations/alpha~1.0/beta~0.1/gamma~0.99.tsv
        output: results/plots/alpha~1.0/beta~0.1/gamma~0.99.pdf
        jobid: 1
        wildcards: alpha=1.0, beta=0.1, gamma=0.99


    [Fri Nov 27 20:57:27 2020]
    localrule all:
        input: results/plots/alpha~1.0/beta~0.1/gamma~0.99.pdf, results/plots/alpha~2.0/beta~0.0/gamma~3.9.pdf
        jobid: 0


Naturally, it is possible to create sub-spaces from ``Paramspace`` objects, simply by applying all the usual methods and attributes that Pandas data frames provide (e.g. ``.loc[...]``, ``.filter()`` etc.).
Further, the form of the created ``wildcard_pattern`` can be controlled via additional arguments of the ``Paramspace`` `constructor <https://snakemake-api.readthedocs.io/en/latest/api_reference/snakemake_utils.html#snakemake.utils.Paramspace>`_.
In particular, using the argument ``single_wildcard`` the default behavior of encoding each column as a wildcard can be replaced with a single given wildcard name.
This can be handy in case a rule shall serve multiple param spaces with different sets of columns.

.. _snakefiles-checkpoints:

Data-dependent conditional execution
------------------------------------

From Snakemake 5.4 on, conditional reevaluation of the DAG of jobs based on the content outputs is possible.
The key idea is that rules can be declared as checkpoints, e.g.,

.. code-block:: python

  checkpoint somestep:
      input:
          "samples/{sample}.txt"
      output:
          "somestep/{sample}.txt"
      shell:
          "somecommand {input} > {output}"

Snakemake allows to re-evaluate the DAG after the successful execution of every job spawned from a checkpoint.
For this, every checkpoint is registered by its name in a globally available ``checkpoints`` object.
The ``checkpoints`` object can be accessed by :ref:`input functions <snakefiles-input_functions>`.
Assuming that the checkpoint is named ``somestep`` as above, the output files for a particular job can be retrieved with

.. code-block:: python

  checkpoints.somestep.get(sample="a").output

.. sidebar:: Note

    Note that output files of checkpoints that are accessed via this mechanism should not be marked as temporary.
    Otherwise, they would require to trigger reruns of the checkpoint whenever the DAG shall be reevaluated (because they are already missing at that point).

Thereby, the ``get`` method throws ``snakemake.exceptions.IncompleteCheckpointException`` if the checkpoint has not yet been executed for these particular wildcard value(s).
Inside an input function, the exception will be automatically handled by Snakemake, and leads to a re-evaluation after the checkpoint has been successfully passed.

To illustrate the possibilities of this mechanism, consider the following complete example:

.. code-block:: python

  # a target rule to define the desired final output
  rule all:
      input:
          "aggregated/a.txt",
          "aggregated/b.txt"


  # the checkpoint that shall trigger re-evaluation of the DAG
  checkpoint somestep:
      input:
          "samples/{sample}.txt"
      output:
          "somestep/{sample}.txt"
      shell:
          # simulate some output value
          "echo {wildcards.sample} > somestep/{wildcards.sample}.txt"


  # intermediate rule
  rule intermediate:
      input:
          "somestep/{sample}.txt"
      output:
          "post/{sample}.txt"
      shell:
          "touch {output}"


  # alternative intermediate rule
  rule alt_intermediate:
      input:
          "somestep/{sample}.txt"
      output:
          "alt/{sample}.txt"
      shell:
          "touch {output}"


  # input function for the rule aggregate
  def aggregate_input(wildcards):
      # decision based on content of output file
      # Important: use the method open() of the returned file!
      # This way, Snakemake is able to automatically download the file if it is generated in
      # a cloud environment without a shared filesystem.
      with checkpoints.somestep.get(sample=wildcards.sample).output[0].open() as f:
          if f.read().strip() == "a":
              return "post/{sample}.txt"
          else:
              return "alt/{sample}.txt"


  rule aggregate:
      input:
          aggregate_input
      output:
          "aggregated/{sample}.txt"
      shell:
          "touch {output}"

As can be seen, the rule aggregate uses an input function.

.. sidebar:: Note

    You don't need to use the checkpoint mechanism to determine parameter or resource values of downstream rules that would be based on the output of previous rules.
    In fact, it won't even work because the checkpoint mechanism is only considered for input functions.
    Instead, you can simply use normal parameter or resource functions that just assume that those output files are there. Snakemake will evaluate them immediately before
    the job is scheduled, when the required files from upstream rules are already present.

Inside the function, we first retrieve the output files of the checkpoint ``somestep`` with the wildcards, passing through the value of the wildcard sample.
Upon execution, if the checkpoint is not yet complete, Snakemake will record ``somestep`` as a direct dependency of the rule ``aggregate``.
Once ``somestep`` has finished for a given sample, the input function will automatically be re-evaluated and the method ``get`` will no longer return an exception.
Instead, the output file will be opened, and depending on its contents either ``"post/{sample}.txt"`` or ``"alt/{sample}.txt"`` will be returned by the input function.
This way, the DAG becomes conditional on some produced data.

It is also possible to use checkpoints for cases where the output files are unknown before execution.
Consider the following example where an arbitrary number of files is generated by a rule before being aggregated:

.. code-block:: python

  # a target rule to define the desired final output
    rule all:
        input:
            "aggregated.txt"


    # the checkpoint that shall trigger re-evaluation of the DAG
    # an number of file is created in a defined directory
    checkpoint somestep:
        output:
            directory("my_directory/")
        shell:'''
        mkdir my_directory/
        cd my_directory
        for i in 1 2 3; do touch $i.txt; done
        '''



    # input function for rule aggregate, return paths to all files produced by the checkpoint 'somestep'
    def aggregate_input(wildcards):
        checkpoint_output = checkpoints.somestep.get(**wildcards).output[0]
        return expand("my_directory/{i}.txt",
                    i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)


    rule aggregate:
        input:
            aggregate_input
        output:
            "aggregated.txt"
        shell:
            "cat {input} > {output}"

Because the number of output files is unknown beforehand, the checkpoint only defines an output :ref:`directory <snakefiles-directory_output>`.
This time, instead of explicitly writing

.. code-block:: python

  checkpoints.somestep.get(sample=wildcards.sample).output[0]

we use the shorthand

.. code-block:: python

  checkpoints.somestep.get(**wildcards).output[0]

which automatically unpacks the wildcards as keyword arguments (this is standard python argument unpacking).
If the checkpoint has not yet been executed, accessing ``checkpoints.somestep.get(**wildcards)`` ensures that Snakemake records the checkpoint as a direct dependency of the rule ``aggregate``.
Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed.
Here, we retrieve the values of the wildcard ``i`` based on all files named ``{i}.txt`` in the output directory of the checkpoint.
Because the wildcard ``i`` is evaluated only after completion of the checkpoint, it is nescessay to use ``directory`` to declare its output, instead of using the full wildcard patterns as output.

A more practical example building on the previous one is a clustering process with an unknown number of clusters for different samples, where each cluster shall be saved into a separate file.
In this example the clusters are being processed by an intermediate rule before being aggregated:

.. code-block:: python

  # a target rule to define the desired final output
  rule all:
      input:
          "aggregated/a.txt",
          "aggregated/b.txt"


  # the checkpoint that shall trigger re-evaluation of the DAG
  checkpoint clustering:
      input:
          "samples/{sample}.txt"
      output:
          clusters=directory("clustering/{sample}")
      shell:
          "mkdir clustering/{wildcards.sample}; "
          "for i in 1 2 3; do echo $i > clustering/{wildcards.sample}/$i.txt; done"


  # an intermediate rule
  rule intermediate:
      input:
          "clustering/{sample}/{i}.txt"
      output:
          "post/{sample}/{i}.txt"
      shell:
          "cp {input} {output}"


  def aggregate_input(wildcards):
      checkpoint_output = checkpoints.clustering.get(**wildcards).output[0]
      return expand("post/{sample}/{i}.txt",
             sample=wildcards.sample,
             i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)


  # an aggregation over all produced clusters
  rule aggregate:
      input:
          aggregate_input
      output:
          "aggregated/{sample}.txt"
      shell:
          "cat {input} > {output}"

Here a new directory will be created for each sample by the checkpoint.
After completion of the checkpoint, the ``aggregate_input`` function is re-evaluated as previously.
The values of the wildcard ``i`` is this time used to expand the pattern ``"post/{sample}/{i}.txt"``, such that the rule ``intermediate`` is executed for each of the determined clusters.


.. _snakefiles-rule-inheritance:

Rule inheritance
----------------

With Snakemake 6.0 and later, it is possible to inherit from previously defined rules, or in other words, reuse an existing rule in a modified way.
This works via the ``use rule`` statement that also allows to declare the usage of rules from external modules (see :ref:`snakefiles-modules`).
Consider the following example:

.. code-block:: python

    rule a:
        output:
            "test.out"
        shell:
            "echo test > {output}"


    use rule a as b with:
        output:
            "test2.out"


As can be seen, we first declare a rule a, and then we reuse the rule a as rule b, while changing only the output file and keeping everything else the same.
In reality, one will often change more.
Analogously to the ``use rule`` from external modules, any properties of the rule (``input``, ``output``, ``log``, ``params``, ``benchmark``, ``threads``, ``resources``, etc.) can be modified, except the actual execution step (``shell``, ``notebook``, ``script``, ``cwl``, or ``run``).
All unmodified properties are inherited from the parent rule.

.. _snakefiles-aux_source_files:

Accessing auxiliary source files
--------------------------------

Snakemake workflows can refer to various other source files via paths relative to the current Snakefile.
This happens for example with the :ref:`script directive <snakefiles-external_scripts>` or the :ref:`conda directive <integrated_package_management>`.
Sometimes, it is necessary to access further source files that are in a directory relative to the current Snakefile.
Since workflows can be imported from remote locations (e.g. when using :ref:`modules <snakefiles-modules>`), it is important to not do this manually, so that Snakemake has the chance to cache these files locally before they are accessed.
This can be achieved by accessing their path via the ``workflow.source_path``, which (a) computes the correct path relative to the current Snakefile such that the file can be accessed from any working directory, and (b) downloads remote files to a local cache:

.. code-block:: python

    rule a:
        output:
            "test.out"
        params:
            json=workflow.source_path("../resources/test.json")
        shell:
            "somecommand {params.json} > {output}"


.. _snakefiles-template-integration:

Template rendering integration
------------------------------

Sometimes, data analyses entail the dynamic rendering of internal configuration files that are required for certain steps.
From Snakemake 7 on, such template rendering is directly integrated such that it can happen with minimal code and maximum performance.
Consider the following example:

.. code-block:: python

    rule render_jinja2_template:
        input:
            "some-jinja2-template.txt"
        output:
            "results/{sample}.rendered-version.txt"
        params:
            foo=0.1
        template_engine:
            "jinja2"

Here, Snakemake will automatically use the specified template engine `Jinja2 <https://jinja.palletsprojects.com/>`_ to render the template given as input file into the given output file.
The template_engine instruction has to be specified at the end of the rule.
Template rendering rules may only have a single output file.
If the rule needs more than one input file, there has to be one input file called ``template``, pointing to the main template to be used for the rendering:

.. code-block:: python

    rule render_jinja2_template:
        input:
            template="some-jinja2-template.txt",
            other_file="some-other-input-file-used-by-the-template.txt"
        output:
            "results/{sample}.rendered-version.txt"
        params:
            foo=0.1
        template_engine:
            "jinja2"

The template itself has access to ``input``, ``params``, ``wildcards``, and ``config``,
which are the same objects you can use for example in the ``shell`` or ``run`` directive,
and the same objects as can be accessed from ``script`` or ``notebook`` directives (but in the latter two cases they are stored behind the ``snakemake`` object which serves as a dedicated namespace to avoid name clashes).

An example Jinja2 template could look like this:

.. code-block:: jinja2

    This is some text and now we access {{ params.foo }}.

Apart from Jinja2, Snakemake supports `YTE <https://github.com/koesterlab/yte>`_ (YAML template engine), which is particularly designed to support templating of the ubiquitious YAML file format:

.. code-block:: python

    rule render_jinja2_template:
        input:
            "some-yte-template.yaml"
        output:
            "results/{sample}.rendered-version.yaml"
        params:
            foo=0.1
        template_engine:
            "yte"

Analogously to the jinja2 case YTE has access to ``params``, ``wildcards``, and ``config``:

.. code-block:: yaml

    ?if params.foo < 0.5:
      x:
        - 1
        - 2
        - 3
    ?else:
      y:
        - a
        - b
        - ?config["threshold"]

Template rendering rules are always executed locally, without submission to cluster or cloud processes (since templating is usually not resource intensive).

.. _snakefiles_mpi_support:

MPI support
-----------

Highly parallel programs may use the MPI (:ref: message passing interface<https://en.wikipedia.org/wiki/Message_Passing_Interface>) to enable a programm to span work across an invidual compute node's boundary.
The command to run the MPI program (in below example we assume there exists a program ``calc-pi-mpi``) has to be specified in the ``mpi``-resource, e.g.:

.. code-block:: python

  rule calc_pi:
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        tasks=10,
        mpi="mpiexec",
    shell:
        "{resources.mpi} -n {resources.tasks} calc-pi-mpi 10 > {output} 2> {log}"

Thereby, additional parameters may be passed to the MPI-starter, e.g.:

.. code-block:: python

  rule calc_pi:
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        tasks=10,
        mpi="mpiexec -arch x86",
    shell:
        "{resources.mpi} -n {resources.tasks} calc-pi-mpi 10 > {output} 2> {log}"

As any other resource, the `mpi`-resource can be overwritten via the command line e.g. in order to adapt to a specific platform (see :ref:`snakefiles-resources`):

.. code-block:: console

  $ snakemake --set-resources calc_pi:mpi="srun --hint nomultithread" ...

Note that in case of distributed, remote execution (cluster, cloud), MPI support might not be available.
So far, explicit MPI support is implemented in the :ref:`SLURM backend <cluster-slurm>`.
