Storm Reference - Advanced - Functions

This section provides an overview of the types of functions available in Storm, along with some tips, caveats, and basic examples. It is meant to introduce the concepts around functions in Storm; it is not meant as a Storm programming tutorial.

Storm Developers can refer to the Synapse Developer Guide for additional information.

Overview

Functions can be used to encapsulate a set of Storm logic that executes when the function is invoked. Functions are declared and then invoked within a Storm query using the function name and any required parameters. Separating the function’s logic from the logic of your “executing” Storm query makes your Storm cleaner and easier to read (and also allows for easier code reuse).

Functions and the Storm Pipeline

We regularly emphasize the Storm Operating Concepts, especially when writing more complex Storm queries and Storm logic. In particular, it is important to pay attention to Storm’s pipeline behavior and the way the pipeline affects your working set.

A function in Storm has its own node pipeline, independent of any Storm logic that invokes the function. Functions do not inherit the pipeline of the invoking query, and do not modify the invoking pipeline by default.

Because the function itself is a Storm pipeline, all caveats about pipelines and awareness of your working set still apply to the Storm within the function.

All of Storm’s features and capabilities are available for use within a function. This includes Storm operations, commands, variables, methods, control flow, libraries, etc.

You can use an init block or a fini block within a function to execute a subset of Storm logic a single time before operating on any nodes in the functions’s pipeline or after any nodes have exited the function’s pipeline, respectively.

There are two subtle but important aspects of function behavior to keep in mind:

Nodes are not “sent inbound” to a function to “cause” it to execute; a function runs when it is invoked as part of a Storm query (i.e., by an invoking Storm pipeline). This means that:
- an invoked function can execute even if there are no nodes in the invoking pipeline, as long as the associated Storm logic executes; and
- a function within an invoking pipeline will run once each time that pipeline executes; and an invoking pipeline may execute multiple times (for example, if multiple nodes are passing through the invoking pipeline). In this case the nodes themselves don’t “cause” the function to execute; the pipeline runs once per node, and the function invoked by the pipeline runs once each time the pipeline runs.
Nodes do not “pass into” functions by default, as the function and the invoking Storm logic are two separate pipelines. It is possible to invoke a function so that it operates on a node or nodes; but the function will not “automatically” do so.

The sections below on Operating on Nodes and Runtsafe vs. Non-Runtsafe Functions discuss these behaviors in more detail.

Function Basics

Storm supports three types of functions. Each is explained in more detail below.

callable functions, which are “regular” functions (similar to functions in other programming languages). Callable functions return a value.
data emitter functions, which emit data.
node yielder functions, which yield nodes.

Declaring Functions

All functions are declared in the same way, using the function keyword.

function myFunction() { <do stuff> }
function myFunction(foo) { <do stuff> }
function myFunction(bar, baz=$lib.null) { <do stuff> }

Invoking Functions

All functions are invoked using the function name preceded by the dollar sign ( $ ), and by passing any required parameters to the function.

$myFunction()
$myFunction(foo)
$myFunction($foo)
$myFunction(bar, baz=wheee)

Parameters can be passed as literals, variables, or keyword=$valu arguments. For example, given a function that takes an organization name as input:

function $myFunction(orgname) { <do stuff> }

The name can be passed directly:

$myFunction("The Vertex Project")

…as a variable:

$name="The Vertex Project"
$myFunction($name)

…or as a keyword/value pair:

$myFunction(orgname="The Vertex Project")

$name="The Vertex Project"
$myFunction(orgname=$name)

Operating on Nodes

Functions do not inherit or operate on the invoking Storm pipeline by default. If you want a function to operate on nodes in a pipeline, you must invoke the function in such a way as to pass the node (or a property or properties of the node) as input to the function.

For example, if your invoking pipeline consists of a set of inet:ipv4 nodes, a function can take the :asn property as input:

$myFunction(:asn)

Or:

$asn=:asn
$myFunction($asn)

Alternatively, you can pass the entire $node object to the function and use the yield keyword within the function to yield the node into the function’s pipeline:

//Declare function
function $myFunction(inboundNode) {

    yield $inboundNode
    <do stuff>
    return()
}

//Invoke function
$myFunction($node)

If another function yields the node(s) you want to operate on, that function can be used as input to a second function. A simple example:

Function 1 (node yielder):

function getIPs() {
    //Lift 10 IPv4 addresses
    inet:ipv4 | limit 10
}

Function 2 (callable function):

//Takes a generator object as input
function counter(genr) {

    //Yield the generator content into the function pipeline
    yield $genr

    //Print the human-readable representation of each node
    $lib.print($node.repr())

    //Return the output of the function
    fini { return() }
}

Function 2 invoked with Function 1 as input:

$counter($getIPs())

When executed, the function produces the following output. Note that the $counter() function simply prints the nodes’ human-readable representation ($node.repr()) as an example; it does not return or yield the inet:ipv4 nodes:

storm>
function getIPs() {
    inet:ipv4 | limit 10
}

function counter(genr) {
    yield $genr
    $lib.print($node.repr())
    fini { return() }
}

$counter($getIPs())
1.1.1.1
2.2.2.2
3.3.3.3
4.4.4.4
5.5.5.5
6.6.6.6
7.7.7.7
8.8.8.8
9.9.9.9
10.10.10.10

Runtsafe vs. Non-Runtsafe Functions

Just as variables may be runtime-safe (runtsafe) or non-runtime-safe (non-runtsafe), functions can be invoked in a runtsafe manner (or not) based on the parameters passed to the function.

If a function is invoked with a runtsafe (typically static) value, the function is considered runtsafe. A function that takes an Autonomous System (AS) number as input and is passed a static AS number as a parameter is invoked in a runtsafe manner:

$myFunction(9009)

Or:

$asn=9009
myFunction($asn)

If the same function is invoked with a per-node, non-runtsafe value or values, the function is considered non-runtsafe, such as the example above where the invoking pipeline contains inet:ipv4 nodes and the function is invoked with the value of each node’s :asn property:

$myFunction(:asn)

Or:

$asn=:asn
$myFunction($asn)

Tip

Keep in mind that functions execute when they are invoked. This has some implications with respect to runtime safety (“runtsafety”):

A non-runtsafe function (i.e., that is dependent on a per-node value) will not execute when invoked if there are no nodes in the invoking pipeline. Synapse will not generate an error but the function will not “do anything”.
A runtsafe function (i.e., one whose parameters are not node-dependent) will still execute once each time it is invoked. If the invoking Storm executes multiple times, this can result in the runtsafe function running repeatedly while simply “doing the same thing” each time (based on its runtsafe input parameters). If the function should only execute once, it can be placed in a fini block (or an init block as appropriate).

Function Output

Functions do not modify the invoking Storm pipeline by default. To access the output of a function (whether nodes, data, or a value), you can:

Assign the output of the function to a variable:
```
$x = $myFunction()
```
Iterate over the function’s output (used with data emitters and node yielders):
```
for $x in $myFunction() { <do stuff> }
```
Add the node or nodes generated by the function directly to the invoking Storm pipeline with the yield keyword (used with node yielders and callable functions that return a node):
```
yield $myFunction()
```

Types of Functions

Note

Because all functions in Storm are declared and invoked the same way, the Storm syntax parser relies on the presence (or absence) of specific keywords within a function to identify the type of function and how to execute it.

Callable functions must include a return() statement (and must not use emit).
Data emitter functions must use the emit keyword (and must not use return()).
Node yielder functions must not include the keywords emit or return().

Both data emitters and node yielders may optionally include the keyword stop to cleanly halt execution and exit the function. (Using stop in a callable function will generate a StormStop error.)

Functions can be declared and invoked on their own, but are most often used when authoring more extensive Storm code to implement a set of related functionality, such as a Rapid Power-Up. A set of functions, each encapsulating Storm logic to perform a specific task, can work together to implement more complex capabilities. Given this architecture, it is common for functions to invoke other functions as part of their code, or to take the output of another function as an input parameter to perform another operation, as seen in some of the examples below.

See the Rapid Power-Up Development section of the Synapse Developer Guide for a more in-depth discussion of how to integrate multiple Storm components into a larger package.

Callable Functions

Callable functions are “regular” functions, similar to those in other programming languages. A callable function is invoked (called) and returns a value using a return() statement. A return() statement must be present for a callable function to execute properly even if the function does not return a specific value.

Callable functions are executed in their entirety before returning. They return exactly one value.

Tip

Callable functions may contain multiple return() statements, based on the function’s logic. The first return() encountered during the function’s execution will cause the function to stop execution and return. If you are performing multiple actions within the function and want to ensure they all complete before the function returns, place the return() in a fini block so it executes once at the end of the function’s pipeline.

Use Cases

Callable functions can be used to:

Check a condition and return a status (e.g., return((0)) vs. return((1))).
Return a value (such as a count).
Return a single node.
Perform isolated operations on a node in the pipeline.
Retrieve data from an external API.

Pseudocode

function callable() {

    <do stuff>
    return()
}

Examples

Return a node

A callable function can take input and attempt to create (or lift) a node.

//Takes a value expected to be an IPv4 or IPv6 as input
function makeIP(ip) {

    //Attempt to create (or lift) an IPv4 from the input
    //Return the IPv4 and exit if successful
    [ inet:ipv4 ?= $ip ]
    return($node)

    //Otherwise, atempt to create (or lift) an IPv6 from the input
    //Return the IPv6 and exit if successful
    [ inet:ipv6 ?= $ip ]
    return($node)

    //If the input is not a valid IPv4 or IPv6, the function
    // will execute but will not return a node.
}

//Invoke the function with the specified input and
// yield the result (if any) into the pipeline
yield $makeIP(8.8.8.8)

Return a node using secondary property deconfliction

When ingesting or creating guid-based nodes, a common deconfliction strategy is to check for existing nodes using one or more secondary properties (known as secondary property deconfliction). A callable function that takes a secondary property value (or values) as input and returns (or creates) the node simplifies this process.

//Create an ou:org node based on an org name (ou:name)

//Declare function - takes 'name' as input
function genOrgByName(name) {

    //Check whether input is valid for an ou:name value
    //If not, return / exit
    ($ok, $name) = $lib.trycast(ou:name, $name)
    if (not $ok) { return() }

    //If name is valid, attempt to identify an existing ou:org
    //Lift the ou:name node for 'name' (if it exists)
    // and pivot to an org with that name (if it exists)
    //Return the existing node if found
    ou:name=$name -> ou:org
    return($node)

    //If an org is not found, create a new ou:org using 'gen' and the name
    // as input for the org's guid; set the :name prop
    //Return the new node
    [ ou:org=(gen, $name) :name=$name ]
    return($node)
}

//Invoke the function with input name "The Vertex Project" and yield
// the result into the pipeline
yield $genOrgByName("The Vertex Project")

Tip

Synapse includes gen.* (generator) Storm commands and $lib.gen APIs that can generate many common guid-based forms using secondary property deconfliction.

Return a value

Some data sources provide feed-like APIs that allow you to retrieve either the entire feed or just retrieve any new items added since your last update. The “last update” time can be stored as Node Data on the meta:source node for the data source. A callable function can retrieve the “last updated” date (e.g., to pass the value to another function used to retrieve only the latest feed data).

function getLastReportDate() {

    //Invoke an existing function to create (initialize) or retrieve the meta:source node
    // and yield the node into the function's pipeline
    yield $initMetaSource()

    //Set the $date variable to the value of the node data key mysource:report:date from
    // the meta:source node.
    $date = $node.data.get(mysource:report:date)

    //If there is no value for this key return the integer 0
    if ($date = $lib.null) { return((0)) }

    //Otherwise return the date
    return($date)
}

//Assign the value returned by this function to the variable $date for use by the
// invoking Storm pipeline. This value can be passed to another function that retrieves
// the latest feed data.
$date = $getLastReportDate()

Data Emitter Functions

Data emitter functions emit data using the emit keyword. The stop keyword can optionally be used to halt processing and exit the function. The emit keyword must be present for a data emitter function to execute properly.

Data emitter functions stream data (technically, they return a generator object that is iterated over). They are designed to emit data to the invoking pipeline as it is available; they may be invoked with for or while loops for this purpose. When data is emitted, execution of the function is paused until the invoking pipeline requests the next value, at which point the function’s execution resumes.

Use Cases

Data emitter functions can be used to:

Consume data from sources that paginate results, where you want to mask the pagination (i.e., a data emitter can consume and emit the first page of results; then consume and emit the next page; and so on).
Consume data from sources that stream results, where the data emitter is used to continue the streaming behavior.

Tip

Data emitters can be used to emit nodes (e.g., emit $node), though this is an uncommon use case. The ability of data emitters to emit data incrementally is useful when consuming large result sets from an API. “Subsets” of results (such as individual JSON objects from a JSON blob) can be made available more quickly (e.g., to another function responsible for creating nodes from the JSON) while the emitter continues to process data.

In contrast, if the same set of API results was consumed by a callable function, the function would need to consume the entire result set before returning.

Pseudocode

function data_emitter() {

    for $thing in $things {
        <do stuff>
        emit $thing
    }
}

Or:

function data_emitter() {

    for $thing in $things {
        <do stuff>
        emit $thing

        if ($thing = "badthing") {
            stop
        }
    }
}

Or:

function data_emitter() {

    while (1) {
        <do stuff>
        emit $thing

        if (<end condition>) { stop }

        <update something to continue while loop>
    }
}

Example

Some data sources may paginate results, returning X number of objects (e.g., in a JSON blob) at a time until all results are returned. A data emitter function can emit individual JSON objects from the blob (e.g., for consumption by another function that processes the object and creates nodes) until all of the results have been received.

function emitReportFeed() {

    //Set variables for the current time and the # of objects to retrieve per page
    $now = $lib.time.now()
    $pagelim = 100

    //Set a variable for API query parameters
    $params = ({
        "limit": $pagelim,
    })

    //Set a variable for the initial offset
    $offset = (0)

    //While loop to retrieve records
    while (1) {

        //Set the value of the 'offset' parameter
        $params.offset = $offset

        //Invoke an existing function to retrieve the JSON using $params as parameters
        // to the API request.
        //Assign the returned JSON to the variable $data
        $data = $getJson("/reports", params=$params)

        //If no data is returned, stop and exit this function
        if ($data = $lib.null) { stop }

        //If data is returned, loop over the JSON and emit each item / ojbect
        for $item in $data.data { emit $item }

        //Set $datasize to the size (number of items) in the returned JSON
        $datasize = $data.data.size()

        //Check whether the # of records returned is less than our page limit
        //If so we have retrieved all available records
        if ($datasize < $pagelim) {

            //Print status to CLI if debug is in use
            if $lib.debug { $lib.print(`Reports ingested up to {$now}`) }

            //Invoke an existing function to update the 'last retrieved' date to the current time
            //E.g., this value may be stored as node data on the feed's meta:source node
            $setReportFeedLast($now)

            //Stop and exit the function
            stop
        }

        //If $datasize is NOT < $pagelim there is more data
        //Update the $offset value and execute the while loop again
        $offset = ($offset + $pagelim)
    }
}

Node Yielder Functions

Node yielder functions yield nodes. If a function does not include either of the keywords return or emit, it is presumed to be a node yielder.

Node yielder functions stream nodes; (technically, they return a generator object that is iterated over). They are designed to yield nodes as they are available while continuing to execute. They may be invoked with the yield keyword or with a for loop for this purpose.

Use Cases

Node yielder functions can be used to:

Isolate different node construction pipelines during complex data ingest logic.

Pseudocode

function node_yielder() {
    <do stuff>
}

Examples

Some data sources allow you to retrieve specific records or reports (e.g., based on a record or report number). A node yielder function can request the record(s) and yield the node(s) created from those records (e.g., a report retrieved from a data source may be used to create a media:news node).

//Function takes one or more IDs as input
function reportByID(reportids) {

    //Loop over report IDs
    for $reportid in $reportids {

        //Invoke an existing privileged function to retrieve the report object (i.e., a JSON response)
        //A privileged module may be invoked to mask sensitive data such as an API key from a normal user
        $report = $privsep.getReportById($reportid)

        //Print the JSON to CLI if debug is in use
        if $lib.debug { $lib.pprint($report) }

        //Yield the node (e.g., media:news node) created by invoking an existing function that
        // creates the media:news node from the $report
        yield $ingest.addReport($report)
    }
}

Functions and Privilege Separation

Functions can be used to support privilege separation (“privsep”) for things like custom Power-Up development. Storm logic that requires access to sensitive information (such as API keys or other credentials) can be encapsulated in a function that is not accessible to unprivileged users. The function can return non-sensitive data that is “safe” for viewing or consumption.

See the Rapid Power-Up Development Guide and in particular the section on privileged modules for more information.

Function Debugging Tips

Functions execute Storm, so standard Storm debugging tips still apply to all code within the function itself (and to the Storm code that invokes the function, of course). The following additional tips apply to functions in particular.

Use the right type of function for your use case. Each Storm function serves a different purpose; be clear on what type of function you need for a given situation.

For example, a node yielder can yield multiple nodes. A callable function can also yield multiple nodes (e.g., by returning a set or list object). But there can be significant (even damaging) performance differences between the two, depending on the nature of the function.

A node yielder yields a generator object that can incrementally provide results (i.e., for a streaming effect). When written as a node yielder, a function to lift every node in a Cortex is workable, even for large result sets:

function allnodes() { .created }

You could write the same function as a callable function, but it would likely blow up your system by consuming all available memory. A callable function can only return exactly one object; it can’t stream results. You could write a callable function to lift each node, add it to a set object, and have the function return the set. But the callable function will need to construct and store the entire set in memory until the object can be returned:

// NEVER DO THIS
function allnodes() {

    $set = ([])
    .created
    $set.add($node)
    fini { return($set) }
}

While this is an extreme example, it serves to illustrate some of the differences between function types.

Ensure necessary keywords are present for your function type. Synapse determines “what kind” of function is present and how to execute it based on keywords (e.g., return() for callable functions, emit for data emitters). If you write a node yielder function with a return() statement, Synapse will attempt to execute it as a callable function. Similarly, a callable function that is missing a return() will not execute properly.

Note

Data emitters and node yielders may fail to emit data or yield nodes, based on the input to the function and the function’s code. In these situations it can be challenging to determine whether a function that is “not doing anything” is a yielder / emitter that is failing to produce output, or a callable function that is missing a return() statement.

Understand pipeline interactions between functions and Storm logic that invokes them. By default, functions do not interact with the Storm pipeline that invokes them.

If you want a function to operate on nodes in the invoking Storm pipeline, you must invoke the function in such a way as to do this.

Note

If a function is written to operate on or iterate over nodes, and there are no nodes in the pipeline (based on previously executing Storm logic), the function will not execute.

If you want the invoking Storm pipeline to operate on the function’s output, you must ensure that the output is returned to the pipeline (e.g., assign the function’s output to a variable; use the yield keyword to yield any nodes into the pipeline; use a for loop to iterate over function results; etc.).