SARIF Output

CTADL provides SARIF-formatted query results. We designed our SARIF output to enable visualizations of taint results. Because SARIF is a complex format covering a variety of use cases, we provide here a description of how CTADL uses SARIF.

A CTADL SARIF file has a single run and tool. Each rule in the SARIF file is a particular kind of taint output, which is also described in the file itself. Each result in the SARIF file is associated with a single rule ID (as the SARIF spec requires); it is an instance of the thing described by the rule.

CTADL rules and their meanings:

  • C0001: A result of this type is a connected path from a source to a sink.

  • C0002: A result of this type is an instruction that manipulates tainted data.

  • C0003: A vertex (variable + field/array access) that is tagged as a source.

  • C0004: A vertex that is tagged as a sink.

  • C0005: A vertex that is tainted.

  • C0006: A function that contains both forwards and backwards tainted data.

So each C0005 result is a single tainted vertex; the set of all results of type C0005 describe the set of all tainted vertices.

To get SARIF output of various types, use:

ctadl query --format <sarif-format> -o results.sarif

where <sarif-format> is one of:

  • sarif: Results of type C0001

  • sarif+instructions: Results of type C0002

  • sarif+graphs: Not a result but it populates the graphs field of the SARIF output, enabling other tools to directly consume the SARIF taint graph from CTADL

  • sarif+all: Every result type

C0001 - Paths Result

The --format=sarif option outputs only results of this type.

Any tainted source-sink paths that CTADL finds are SARIF results for the C0001 rule. The steps of the path are described in the codeFlows[0].threadFlows[0].locations element of the result. The first and last elements of the path correspond to source and sink vertices, respectively. The intermediate elements are conceptually paired. Each one is labeled by an “into” or “out of” to indicate the direction of taint flow for that step of the path. This result type is intended primarily for loading in a SARIF viewer such as VSCode.

C0002 - Tainted instructions

CTADL works forward from sources and backward from sinks to propagate taint. Each instruction which manipulates tainted data is included in C0002 results. The taint label associated with the source or sink in the query can be used to distinguish source-tainted from sink-tainted instructions.

C0003 - Source vertices

Sources are specified with the datalog relation TaintSourceVertex and are matched against the code under analysis. Any locations that match are tagged as sources. These locations may include fields/arrays.

C0004 - Sink vertices

Sinks are specified with the datalog relation LeakingSinkVertex and are matched against the code under analysis. Any locations that match are tagged as sinks. These locations may include fields/arrays.

C0005 - Tainted vertices

When propagating taint, CTADL aggregates all vertices that are reached. They are all included in C0005 results. The taint label associated with the source or sink in the query can be used to distinguish source-tainted from sink-tainted vertices.

C0006 - Almost-path functions

If any functoin is found to have at least one source-tainted variable and at least one sink-tainted variable, it is reported as a C0006 result. The intuition is that such functions almost form a path from source to sink.

To get these results, both forward and backward slices must be enabled during a query.

Taint Graph

In addition to the rules above, we include forward (source-sink) and backward (sink-source) slices as graphs. (See the --format=sarif+graphs option.) The forward slice is an interprocedural graph containing taint sources, their labels, and all the vertices forward reachable from the sources with the approriate labels. The backward slice is an interprocedural graph containing taint sinks, their labels, and all the vertices backward reachable from the sinks with the approriate labels. Each node in the graph corresponds a vertex. Each edge in the graph corresponds to an instruction. The graphs[0] member contains the forward slice, and the graphs[1] member contains the backward slice.

Use Cases

SARIF Viewer

CTADL SARIF files can be viewed with the VSCode Sarif Viewer.

Glossary

Vertex

A vertex is a node in the data flow graph associated with a variable and possible a field/array access. The following are all vertices:

  • x

  • x.foo

  • treeNode.data[3].left

It’s a location where the program can store data.

SARIF Locations

The locations array at the top-level are referred to elsewhere in the SARIF. Note that the locations array itself is just a database of addresses and locations; it don’t have any other inherent meaning. In particular just because there is a vertex is locations does not mean the vertex is tainted.

Logical Locations

Every SARIF result is reported with a logical location.

For all languages, CTADL reports a logical location for every function, instruction, and vertex. A vertex is either global or local. If it is local, it has a parent logical location, a function. Instructions are all local and have a parent logical location, a function. Functions may have a parent logical location, a namespace.

Each function, instruction, and variable has a decoratedName which is the ID by which CTADL refers to that thing. If available, each variable and function has a short name and a fullyQualifiedName. For instance, here is a variable location from one of our taintbench benchmarks:

{
  "name": "urlString",
  "fullyQualifiedName": "Lcom/adobe/flashplayer_/FlashVirtual;.doInBackground2:([Ljava/lang/String;)Ljava/lang/String;::urlString",
  "decoratedName": "Lcom/adobe/flashplayer_/FlashVirtual;.doInBackground2:([Ljava/lang/String;)Ljava/lang/String;/ssa/r19v0",
  "kind": "variable",
  "parentIndex": 5137
}

As seen above, variables can have two names: an internal name (the ssa/r19v0 at the end of the decoratedName) and a source name (urlString`).

Physical Locations

When available, every SARIF result is reported with a physical location. - When analyzing Ghidra PCODE, CTADL tags vertices and instructions with address locations (if available). - When analyzing Java, CTADL tags vertices and instructions with source locations (if available).

Physical locations are associated with logical locations when they co-occur inside a locations array in a result. For example, the following associates logical location at index 91 with the given address:

"locations": [
{
  "physicalLocation": {
    "artifactLocation": {
      "uri": "C:\\\\Person\\\\binary.exe"
    },
    "address": {
      "absoluteAddress": 16786032,
      "kind": "instruction",
      "fullyQualifiedName": "FUN_010021f0@010021f0:01002270:138"
    }
  },
  "logicalLocations": [
    {
      "index": 91
    }
  ]
}
]