SARIF Output ==================== CTADL provides `SARIF-formatted `__ query results. We designed our SARIF output to enable visualizations of taint results. Because SARIF is a complex format covering a variety of use cases, we provide here a description of how CTADL uses SARIF. A CTADL SARIF file has a single run and tool. Each rule in the SARIF file is a particular kind of taint output, which is also described in the file itself. Each result in the SARIF file is associated with a single rule ID (as the SARIF spec requires); it is an instance of the thing described by the rule. CTADL rules and their meanings: - ``C0001``: A result of this type is a connected path from a source to a sink. - ``C0002``: A result of this type is an instruction that manipulates tainted data. - ``C0003``: A vertex (variable + field/array access) that is tagged as a source. - ``C0004``: A vertex that is tagged as a sink. - ``C0005``: A vertex that is tainted. - ``C0006``: A function that contains both forwards and backwards tainted data. So each ``C0005`` result is a single tainted vertex; the set of all results of type ``C0005`` describe the set of all tainted vertices. To get SARIF output of various types, use: .. code-block:: sh ctadl query --format -o results.sarif where ```` is one of: * ``sarif``: Results of type ``C0001`` * ``sarif+instructions``: Results of type ``C0002`` * ``sarif+graphs``: Not a result but it populates the ``graphs`` field of the SARIF output, enabling other tools to directly consume the SARIF taint graph from CTADL * ``sarif+all``: Every result type ``C0001`` - Paths Result ------------------------ The ``--format=sarif`` option outputs only results of this type. Any tainted source-sink paths that CTADL finds are SARIF results for the ``C0001`` rule. The steps of the path are described in the ``codeFlows[0].threadFlows[0].locations`` element of the result. The first and last elements of the path correspond to source and sink vertices, respectively. The intermediate elements are conceptually paired. Each one is labeled by an “into” or “out of” to indicate the direction of taint flow for that step of the path. This result type is intended primarily for loading in a SARIF viewer such as VSCode. ``C0002`` - Tainted instructions -------------------------------- CTADL works forward from sources and backward from sinks to propagate taint. Each instruction which manipulates tainted data is included in ``C0002`` results. The taint label associated with the source or sink in the query can be used to distinguish source-tainted from sink-tainted instructions. ``C0003`` - Source vertices --------------------------- Sources are specified with the datalog relation ``TaintSourceVertex`` and are matched against the code under analysis. Any locations that match are tagged as sources. These locations may include fields/arrays. ``C0004`` - Sink vertices ------------------------- Sinks are specified with the datalog relation ``LeakingSinkVertex`` and are matched against the code under analysis. Any locations that match are tagged as sinks. These locations may include fields/arrays. ``C0005`` - Tainted vertices ---------------------------- When propagating taint, CTADL aggregates all vertices that are reached. They are all included in ``C0005`` results. The taint label associated with the source or sink in the query can be used to distinguish source-tainted from sink-tainted vertices. ``C0006`` - Almost-path functions --------------------------------- If any functoin is found to have at least one source-tainted variable and at least one sink-tainted variable, it is reported as a ``C0006`` result. The intuition is that such functions *almost* form a path from source to sink. To get these results, both forward and backward slices must be enabled during a query. Taint Graph ----------- In addition to the rules above, we include forward (source-sink) and backward (sink-source) slices as graphs. (See the ``--format=sarif+graphs`` option.) The forward slice is an interprocedural graph containing taint sources, their labels, and all the vertices forward reachable from the sources with the approriate labels. The backward slice is an interprocedural graph containing taint sinks, their labels, and all the vertices backward reachable from the sinks with the approriate labels. Each node in the graph corresponds a vertex. Each edge in the graph corresponds to an instruction. The ``graphs[0]`` member contains the forward slice, and the ``graphs[1]`` member contains the backward slice. Use Cases --------- SARIF Viewer ^^^^^^^^^^^^ CTADL SARIF files can be viewed with the `VSCode Sarif Viewer `__. Glossary -------- Vertex ^^^^^^ A *vertex* is a node in the data flow graph associated with a variable and possible a field/array access. The following are all vertices: - ``x`` - ``x.foo`` - ``treeNode.data[3].left`` It’s a location where the program can store data. SARIF Locations ^^^^^^^^^^^^^^^ The ``locations`` array at the top-level are referred to elsewhere in the SARIF. Note that the locations array itself is just a database of addresses and locations; it don’t have any other inherent meaning. In particular just because there is a vertex is ``locations`` does not mean the vertex is tainted. Logical Locations ~~~~~~~~~~~~~~~~~ Every SARIF result is reported with a logical location. For all languages, CTADL reports a logical location for every function, instruction, and vertex. A vertex is either global or local. If it is local, it has a parent logical location, a function. Instructions are all local and have a parent logical location, a function. Functions may have a parent logical location, a namespace. Each function, instruction, and variable has a ``decoratedName`` which is the ID by which CTADL refers to that thing. If available, each variable and function has a short ``name`` and a ``fullyQualifiedName``. For instance, here is a variable location from one of our taintbench benchmarks: .. code:: json { "name": "urlString", "fullyQualifiedName": "Lcom/adobe/flashplayer_/FlashVirtual;.doInBackground2:([Ljava/lang/String;)Ljava/lang/String;::urlString", "decoratedName": "Lcom/adobe/flashplayer_/FlashVirtual;.doInBackground2:([Ljava/lang/String;)Ljava/lang/String;/ssa/r19v0", "kind": "variable", "parentIndex": 5137 } As seen above, variables can have two names: an internal name (the ``ssa/r19v0`` at the end of the ``decoratedName) and a source name (``\ urlString\`). Physical Locations ~~~~~~~~~~~~~~~~~~ When available, every SARIF result is reported with a physical location. - When analyzing Ghidra PCODE, CTADL tags vertices and instructions with address locations (if available). - When analyzing Java, CTADL tags vertices and instructions with source locations (if available). Physical locations are associated with logical locations when they co-occur inside a ``locations`` array in a ``result``. For example, the following associates logical location at index 91 with the given address: .. code:: json "locations": [ { "physicalLocation": { "artifactLocation": { "uri": "C:\\\\Person\\\\binary.exe" }, "address": { "absoluteAddress": 16786032, "kind": "instruction", "fullyQualifiedName": "FUN_010021f0@010021f0:01002270:138" } }, "logicalLocations": [ { "index": 91 } ] } ]