Storm Reference - Advanced - Functions
This section provides an overview of the types of functions available in Storm, along with some tips, caveats, and basic examples. It is meant to introduce the concepts around functions in Storm; it is not meant as a Storm programming tutorial.
Storm Developers can refer to the Synapse Developer Guide for additional information.
Overview
Functions can be used to encapsulate a set of Storm logic that executes when the function is invoked. Functions are declared and then invoked within a Storm query using the function name and any required parameters. Separating the function’s logic from the logic of your “executing” Storm query makes your Storm cleaner and easier to read (and also allows for easier code reuse).
Functions and the Storm Pipeline
We regularly emphasize the Storm Operating Concepts, especially when writing more complex Storm queries and Storm logic. In particular, it is important to pay attention to Storm’s pipeline behavior and the way the pipeline affects your working set.
A function in Storm has its own node pipeline, independent of any Storm logic that invokes the function. Functions do not inherit the pipeline of the invoking query, and do not modify the invoking pipeline by default.
Because the function itself is a Storm pipeline, all caveats about pipelines and awareness of your working set still apply to the Storm within the function.
All of Storm’s features and capabilities are available for use within a function. This includes Storm operations, commands, variables, methods, control flow, libraries, etc.
You can use an init block or a fini block within a function to execute a subset of Storm logic a single time before operating on any nodes in the function’s pipeline or after any nodes have exited the function’s pipeline, respectively.
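As a minimal sketch of init and fini blocks inside a function (the function name countedIPs and the printed message are illustrative assumptions, not part of Synapse), the following node yielder counts the nodes that pass through its own pipeline:

```storm
// Hypothetical node yielder; the name countedIPs is illustrative only.
function countedIPs() {

    // Runs once, before any node enters the function's pipeline
    init { $count = (0) }

    inet:ipv4 | limit 5 |

    // Runs once for each node in the function's pipeline
    $count = ($count + 1)

    // Runs once, after the last node has left the function's pipeline
    fini { $lib.print(`Yielded {$count} inet:ipv4 nodes`) }
}

// Yield the function's nodes into the invoking pipeline
yield $countedIPs()
```

The count reflects only the nodes inside the function's pipeline; the invoking pipeline receives nodes solely because the function is invoked with yield.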
There are two subtle but important aspects of function behavior to keep in mind:
Nodes are not “sent inbound” to a function to “cause” it to execute; a function runs when it is invoked as part of a Storm query (i.e., by an invoking Storm pipeline). This means that:
an invoked function can execute even if there are no nodes in the invoking pipeline, as long as the associated Storm logic executes; and
a function within an invoking pipeline will run once each time that pipeline executes; and an invoking pipeline may execute multiple times (for example, if multiple nodes are passing through the invoking pipeline). In this case the nodes themselves don’t “cause” the function to execute; the pipeline runs once per node, and the function invoked by the pipeline runs once each time the pipeline runs.
Nodes do not “pass into” functions by default, as the function and the invoking Storm logic are two separate pipelines. It is possible to invoke a function so that it operates on a node or nodes; but the function will not “automatically” do so.
The sections below on Operating on Nodes and Runtsafe vs. Non-Runtsafe Functions discuss these behaviors in more detail.
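The two behaviors above can be sketched with a trivial callable function. The function name hello and its message are assumptions made for illustration, and the number of times the second invocation runs depends on how many inet:ipv4 nodes exist:

```storm
// Hypothetical callable function used only for illustration.
function hello() {
    $lib.print("hello from the function")
    return()
}

// No nodes have been lifted yet, but this Storm still executes once,
// so the function runs once.
$hello()

// Here the invoking pipeline runs once per inet:ipv4 node (up to three),
// so the function runs once for each of those nodes.
inet:ipv4 | limit 3 | $hello()
```

In neither case does the function receive the nodes; it is simply invoked each time the invoking Storm executes.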
Function Basics
Storm supports three types of functions. Each is explained in more detail below.
callable functions, which are “regular” functions (similar to functions in other programming languages). Callable functions return a value.
data emitter functions, which emit data.
node yielder functions, which yield nodes.
Declaring Functions
All functions are declared in the same way, using the function keyword.
function myFunction() { <do stuff> }
function myFunction(foo) { <do stuff> }
function myFunction(bar, baz=$lib.null) { <do stuff> }
Invoking Functions
All functions are invoked using the function name preceded by the dollar sign ( $ ), and by passing any required parameters to the function.
$myFunction()
$myFunction(foo)
$myFunction($foo)
$myFunction(bar, baz=wheee)
Parameters can be passed as literals, variables, or keyword=$valu arguments. For example, given a function that takes an organization name as input:
function myFunction(orgname) { <do stuff> }
The name can be passed directly:
$myFunction("The Vertex Project")
…as a variable:
$name="The Vertex Project"
$myFunction($name)
…or as a keyword/value pair:
$myFunction(orgname="The Vertex Project")
$name="The Vertex Project"
$myFunction(orgname=$name)
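A small sketch that combines positional arguments, keyword arguments, and a default value is shown below. The function greet and its parameters are hypothetical and exist only to illustrate the calling styles described above:

```storm
// Hypothetical callable function with one required parameter
// and one optional parameter that has a default value.
function greet(name, greeting="Hello") {
    return(`{$greeting}, {$name}!`)
}

// Positional argument only; the default greeting is used
$lib.print($greet("The Vertex Project"))

// Positional argument plus a keyword argument
$lib.print($greet("The Vertex Project", greeting="Welcome"))

// Keyword arguments only, with the value supplied by a variable
$name = "The Vertex Project"
$lib.print($greet(name=$name, greeting="Welcome"))
```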
Operating on Nodes
Functions do not inherit or operate on the invoking Storm pipeline by default. If you want a function to operate on nodes in a pipeline, you must invoke the function in such a way as to pass the node (or a property or properties of the node) as input to the function.
For example, if your invoking pipeline consists of a set of inet:ipv4 nodes, a function can take the :asn property as input:
$myFunction(:asn)
Or:
$asn=:asn
$myFunction($asn)
Alternatively, you can pass the entire $node object to the function and use the yield keyword within the function to yield the node into the function’s pipeline:
//Declare function
function myFunction(inboundNode) {
    yield $inboundNode
    <do stuff>
    return()
}
//Invoke function
$myFunction($node)
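As a concrete sketch of this pattern (the function name describe and its return format are assumptions for illustration), the following callable function yields the inbound node into its own pipeline and returns a short description of it:

```storm
// Hypothetical callable function that operates on a node passed as input.
function describe(inboundNode) {

    // Bring the node into the function's own pipeline
    yield $inboundNode

    // $node now refers to the node inside the function's pipeline
    $form = $node.form()
    $valu = $node.repr()

    return(`{$form}={$valu}`)
}

// The invoking pipeline passes each node to the function explicitly
inet:ipv4 | limit 3 |
$lib.print($describe($node))
```

Because the function is invoked with a per-node value, it runs once for each node in the invoking pipeline and does not run if that pipeline is empty.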
If another function yields the node(s) you want to operate on, that function can be used as input to a second function. A simple example:
Function 1 (node yielder):
function getIPs() {
    //Lift 10 IPv4 addresses
    inet:ipv4 | limit 10
}
Function 2 (callable function):
//Takes a generator object as input
function counter(genr) {
    //Yield the generator content into the function pipeline
    yield $genr
    //Print the human-readable representation of each node
    $lib.print($node.repr())
    //Return the output of the function
    fini { return() }
}
Function 2 invoked with Function 1 as input:
$counter($getIPs())
When executed, the function produces the following output. Note that the $counter() function simply prints the nodes’ human-readable representation ($node.repr()) as an example; it does not return or yield the inet:ipv4 nodes:
storm>
function getIPs() {
    inet:ipv4 | limit 10
}
function counter(genr) {
    yield $genr
    $lib.print($node.repr())
    fini { return() }
}
$counter($getIPs())
1.1.1.1
2.2.2.2
3.3.3.3
4.4.4.4
5.5.5.5
6.6.6.6
7.7.7.7
8.8.8.8
9.9.9.9
10.10.10.10
Runtsafe vs. Non-Runtsafe Functions
Just as variables may be runtime-safe (runtsafe) or non-runtime-safe (non-runtsafe), functions can be invoked in a runtsafe manner (or not) based on the parameters passed to the function.
If a function is invoked with a runtsafe (typically static) value, the function is considered runtsafe. A function that takes an Autonomous System (AS) number as input and is passed a static AS number as a parameter is invoked in a runtsafe manner:
$myFunction(9009)
Or:
$asn=9009
$myFunction($asn)
If the same function is invoked with a per-node, non-runtsafe value or values, the function is considered non-runtsafe, such as the example above where the invoking pipeline contains inet:ipv4 nodes and the function is invoked with the value of each node’s :asn property:
$myFunction(:asn)
Or:
$asn=:asn
$myFunction($asn)
Tip
Keep in mind that functions execute when they are invoked. This has some implications with respect to runtime safety (“runtsafety”):
A non-runtsafe function (i.e., that is dependent on a per-node value) will not execute when invoked if there are no nodes in the invoking pipeline. Synapse will not generate an error but the function will not “do anything”.
A runtsafe function (i.e., one whose parameters are not node-dependent) will still execute once each time it is invoked. If the invoking Storm executes multiple times, this can result in the runtsafe function running repeatedly while simply “doing the same thing” each time (based on its runtsafe input parameters). If the function should only execute once, it can be placed in a fini block (or an init block as appropriate).
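The difference between invoking a runtsafe function in the main pipeline and invoking it in a fini block can be sketched as follows. The function note and its messages are hypothetical, and the per-node count depends on how many inet:ipv4 nodes are lifted:

```storm
// Hypothetical callable function that prints a message.
function note(msg) {
    $lib.print($msg)
    return()
}

inet:ipv4 | limit 3 |

// Runtsafe invocation in the main pipeline:
// runs once for each node that passes through
$note("invoked per node")

// Runtsafe invocation in a fini block:
// runs a single time after all nodes have passed through
fini { $note("invoked once, at the end") }
```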
Function Output
Functions do not modify the invoking Storm pipeline by default. To access the output of a function (whether nodes, data, or a value), you can:
Assign the output of the function to a variable:
$x = $myFunction()
Iterate over the function’s output (used with data emitters and node yielders):
for $x in $myFunction() { <do stuff> }
Add the node or nodes generated by the function directly to the invoking Storm pipeline with the yield keyword (used with node yielders and callable functions that return a node):
yield $myFunction()
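The three approaches can be sketched side by side. Both functions below (ipCount and getIPs) are hypothetical examples, not built-in functions:

```storm
// Hypothetical callable function: returns a single value (a count)
function ipCount() {
    init { $count = (0) }
    inet:ipv4 | limit 3 |
    $count = ($count + 1)
    fini { return($count) }
}

// Hypothetical node yielder: yields up to three nodes
function getIPs() {
    inet:ipv4 | limit 3
}

// 1. Assign the returned value to a variable
$total = $ipCount()
$lib.print(`Counted {$total} nodes`)

// 2. Iterate over the function's output without adding it to the pipeline
for $ip in $getIPs() {
    $lib.print($ip.repr())
}

// 3. Yield the function's nodes into the invoking pipeline
yield $getIPs()
```

Only the third form changes the invoking pipeline's working set; the first two make the output available as variables.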
Types of Functions
Note
Because all functions in Storm are declared and invoked the same way, the Storm syntax parser relies on the presence (or absence) of specific keywords within a function to identify the type of function and how to execute it.
Callable functions must include a return() statement (and must not use emit).
Data emitter functions must use the emit keyword (and must not use return()).
Node yielder functions must not include the keywords emit or return().
Both data emitters and node yielders may optionally include the keyword stop to cleanly halt execution and exit the function. (Using stop in a callable function will generate a StormStop error.)
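As a rough illustration of how the keywords determine the function type, the three hypothetical functions below differ only in the keywords they contain (the function names are assumptions for illustration):

```storm
// Contains return() and no emit: treated as a callable function
function asCallable() {
    return((3))
}

// Contains emit and no return(): treated as a data emitter
function asEmitter() {
    for $i in $lib.range(3) { emit $i }
}

// Contains neither emit nor return(): treated as a node yielder
function asYielder() {
    inet:ipv4 | limit 3
}

// Each type is consumed in the way that suits its output
$valu = $asCallable()
$lib.print($valu)

for $i in $asEmitter() { $lib.print($i) }

yield $asYielder()
```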
Functions can be declared and invoked on their own, but are most often used when authoring more extensive Storm code to implement a set of related functionality, such as a Rapid Power-Up. A set of functions, each encapsulating Storm logic to perform a specific task, can work together to implement more complex capabilities. Given this architecture, it is common for functions to invoke other functions as part of their code, or to take the output of another function as an input parameter to perform another operation, as seen in some of the examples below.
See the Rapid Power-Up Development section of the Synapse Developer Guide for a more in-depth discussion of how to integrate multiple Storm components into a larger package.
Callable Functions
Callable functions are “regular” functions, similar to those in other programming languages. A callable function is invoked (called) and returns a value using a return() statement. A return() statement must be present for a callable function to execute properly even if the function does not return a specific value.
Callable functions are executed in their entirety before returning. They return exactly one value.
Tip
Callable functions may contain multiple return() statements, based on the function’s logic. The first return() encountered during the function’s execution will cause the function to stop execution and return. If you are performing multiple actions within the function and want to ensure they all complete before the function returns, place the return() in a fini block so it executes once at the end of the function’s pipeline.
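The effect of where return() is placed can be sketched with two hypothetical functions that are identical except for the location of the return() statement (the names firstOnly and allThenReturn are illustrative assumptions):

```storm
// return() in the main pipeline: the function exits as soon as
// the first node reaches it, so only one node is printed.
function firstOnly() {
    inet:ipv4 | limit 3 |
    $lib.print($node.repr())
    return()
}

// return() in a fini block: every node is processed first,
// so all lifted nodes (up to three) are printed before returning.
function allThenReturn() {
    inet:ipv4 | limit 3 |
    $lib.print($node.repr())
    fini { return() }
}

$firstOnly()
$allThenReturn()
```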
Use Cases
Callable functions can be used to:
Check a condition and return a status (e.g., return((0)) vs. return((1))).
Return a value (such as a count).
Return a single node.
Perform isolated operations on a node in the pipeline.
Retrieve data from an external API.
Pseudocode
function callable() {
    <do stuff>
    return()
}
Examples
Return a node
A callable function can take input and attempt to create (or lift) a node.
//Takes a value expected to be an IPv4 or IPv6 as input
function makeIP(ip) {
    //Attempt to create (or lift) an IPv4 from the input
    //Return the IPv4 and exit if successful
    [ inet:ipv4 ?= $ip ]
    return($node)
    //Otherwise, attempt to create (or lift) an IPv6 from the input
    //Return the IPv6 and exit if successful
    [ inet:ipv6 ?= $ip ]
    return($node)
    //If the input is not a valid IPv4 or IPv6, the function
    // will execute but will not return a node.
}
//Invoke the function with the specified input and
// yield the result (if any) into the pipeline
yield $makeIP(8.8.8.8)
Return a node using secondary property deconfliction
When ingesting or creating guid-based nodes, a common deconfliction strategy is to check for existing nodes using one or more secondary properties (known as secondary property deconfliction). A callable function that takes a secondary property value (or values) as input and returns (or creates) the node simplifies this process.
//Create an ou:org node based on an org name (ou:name)
//Declare function - takes 'name' as input
function genOrgByName(name) {
    //Check whether input is valid for an ou:name value
    //If not, return / exit
    ($ok, $name) = $lib.trycast(ou:name, $name)
    if (not $ok) { return() }
    //If name is valid, attempt to identify an existing ou:org
    //Lift the ou:name node for 'name' (if it exists)
    // and pivot to an org with that name (if it exists)
    //Return the existing node if found
    ou:name=$name -> ou:org
    return($node)
    //If an org is not found, create a new ou:org using 'gen' and the name
    // as input for the org's guid; set the :name prop
    //Return the new node
    [ ou:org=(gen, $name) :name=$name ]
    return($node)
}
//Invoke the function with input name "The Vertex Project" and yield
// the result into the pipeline
yield $genOrgByName("The Vertex Project")
Tip
Synapse includes gen.* (generator) Storm commands and $lib.gen APIs that can generate many common guid-based forms using secondary property deconfliction.
Return a value
Some data sources provide feed-like APIs that allow you to retrieve either the entire feed or only the items added since your last update. The “last update” time can be stored as Node Data on the meta:source node for the data source. A callable function can retrieve the “last updated” date (e.g., to pass the value to another function used to retrieve only the latest feed data).
function getLastReportDate() {
    //Invoke an existing function to create (initialize) or retrieve the meta:source node
    // and yield the node into the function's pipeline
    yield $initMetaSource()
    //Set the $date variable to the value of the node data key mysource:report:date from
    // the meta:source node.
    $date = $node.data.get(mysource:report:date)
    //If there is no value for this key return the integer 0
    if ($date = $lib.null) { return((0)) }
    //Otherwise return the date
    return($date)
}
//Assign the value returned by this function to the variable $date for use by the
// invoking Storm pipeline. This value can be passed to another function that retrieves
// the latest feed data.
$date = $getLastReportDate()
Data Emitter Functions
Data emitter functions emit data using the emit keyword. The stop keyword can optionally be used to halt processing and exit the function. The emit keyword must be present for a data emitter function to execute properly.
Data emitter functions stream data (technically, they return a generator object that is iterated over). They are designed to emit data to the invoking pipeline as it is available; they may be invoked with for or while loops for this purpose. When data is emitted, execution of the function is paused until the invoking pipeline requests the next value, at which point the function’s execution resumes.
Use Cases
Data emitter functions can be used to:
Consume data from sources that paginate results, where you want to mask the pagination (i.e., a data emitter can consume and emit the first page of results; then consume and emit the next page; and so on).
Consume data from sources that stream results, where the data emitter is used to continue the streaming behavior.
Tip
Data emitters can be used to emit nodes (e.g., emit $node), though this is an uncommon use case.
The ability of data emitters to emit data incrementally is useful when consuming large result sets from
an API. “Subsets” of results (such as individual JSON objects from a JSON blob) can be made available
more quickly (e.g., to another function responsible for creating nodes from the JSON) while the
emitter continues to process data.
In contrast, if the same set of API results was consumed by a callable function, the function would need to consume the entire result set before returning.
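The incremental behavior can be sketched with a self-contained emitter that simulates paginated results. The function emitPages, the item strings, and the page sizes are all assumptions made for illustration; no real API is involved:

```storm
// Hypothetical data emitter that simulates three pages of two items each.
function emitPages() {
    for $page in $lib.range(3) {
        // Stands in for an API request that fetches one page
        $lib.print(`fetching page {$page}`)
        for $i in $lib.range(2) {
            emit `item-{$page}-{$i}`
        }
    }
}

// The consumer receives each item as soon as it is emitted
for $item in $emitPages() {
    $lib.print(`got {$item}`)
    // Stop consuming early; the emitter is not resumed after this point
    if ($item = "item-1-0") { break }
}
```

Because the emitter only resumes when the consumer requests the next value, breaking out of the loop early means the remaining pages should never be fetched. A callable function would have had to fetch and collect every page before returning anything.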
Pseudocode
function data_emitter() {
    for $thing in $things {
        <do stuff>
        emit $thing
    }
}
Or:
function data_emitter() {
    for $thing in $things {
        <do stuff>
        emit $thing
        if ($thing = "badthing") {
            stop
        }
    }
}
Or:
function data_emitter() {
    while (1) {
        <do stuff>
        emit $thing
        if (<end condition>) { stop }
        <update something to continue while loop>
    }
}
Example
Some data sources may paginate results, returning X number of objects (e.g., in a JSON blob) at a time until all results are returned. A data emitter function can emit individual JSON objects from the blob (e.g., for consumption by another function that processes the object and creates nodes) until all of the results have been received.
function emitReportFeed() {
    //Set variables for the current time and the # of objects to retrieve per page
    $now = $lib.time.now()
    $pagelim = 100
    //Set a variable for API query parameters
    $params = ({
        "limit": $pagelim,
    })
    //Set a variable for the initial offset
    $offset = (0)
    //While loop to retrieve records
    while (1) {
        //Set the value of the 'offset' parameter
        $params.offset = $offset
        //Invoke an existing function to retrieve the JSON using $params as parameters
        // to the API request.
        //Assign the returned JSON to the variable $data
        $data = $getJson("/reports", params=$params)
        //If no data is returned, stop and exit this function
        if ($data = $lib.null) { stop }
        //If data is returned, loop over the JSON and emit each item / object
        for $item in $data.data { emit $item }
        //Set $datasize to the size (number of items) in the returned JSON
        $datasize = $data.data.size()
        //Check whether the # of records returned is less than our page limit
        //If so we have retrieved all available records
        if ($datasize < $pagelim) {
            //Print status to CLI if debug is in use
            if $lib.debug { $lib.print(`Reports ingested up to {$now}`) }
            //Invoke an existing function to update the 'last retrieved' date to the current time
            //E.g., this value may be stored as node data on the feed's meta:source node
            $setReportFeedLast($now)
            //Stop and exit the function
            stop
        }
        //If $datasize is NOT < $pagelim there is more data
        //Update the $offset value and execute the while loop again
        $offset = ($offset + $pagelim)
    }
}
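A consumer for this emitter might look like the sketch below. It assumes that emitReportFeed() is declared as shown above and that $ingest refers to an imported module with an addReport() function that creates a node from one report object; both the module and the function are hypothetical:

```storm
// Consume the emitter one item at a time.
// $ingest.addReport() is a hypothetical function that creates
// (and returns) a node from a single report object.
for $item in $emitReportFeed() {

    // Print the raw item to the CLI if debug is in use
    if $lib.debug { $lib.pprint($item) }

    // Create the node and yield it into the invoking pipeline
    yield $ingest.addReport($item)
}
```

Each report can be turned into a node as soon as it is emitted, while the emitter continues to request the remaining pages only as the loop asks for more items.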
Node Yielder Functions
Node yielder functions yield nodes. If a function does not include either of the keywords return or emit, it is presumed to be a node yielder.
Node yielder functions stream nodes (technically, they return a generator object that is iterated over). They are designed to yield nodes as they are available while continuing to execute. They may be invoked with the yield keyword or with a for loop for this purpose.
Use Cases
Node yielder functions can be used to:
Isolate different node construction pipelines during complex data ingest logic.
Pseudocode
function node_yielder() {
    <do stuff>
}
Examples
Some data sources allow you to retrieve specific records or reports (e.g., based on a record or report number). A node yielder function can request the record(s) and yield the node(s) created from those records (e.g., a report retrieved from a data source may be used to create a media:news node).
//Function takes one or more IDs as input
function reportByID(reportids) {
    //Loop over report IDs
    for $reportid in $reportids {
        //Invoke an existing privileged function to retrieve the report object (i.e., a JSON response)
        //A privileged module may be invoked to mask sensitive data such as an API key from a normal user
        $report = $privsep.getReportById($reportid)
        //Print the JSON to CLI if debug is in use
        if $lib.debug { $lib.pprint($report) }
        //Yield the node (e.g., media:news node) created by invoking an existing function that
        // creates the media:news node from the $report
        yield $ingest.addReport($report)
    }
}
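The function above could be invoked in either of the two ways described for node yielders. In this sketch the report IDs are made-up placeholder values, and reportByID() is assumed to be declared as shown above (along with the $privsep and $ingest modules it depends on):

```storm
// Placeholder report IDs, for illustration only
$ids = (["1001", "1002"])

// Option 1: yield the resulting nodes into the invoking pipeline
yield $reportByID($ids)

// Option 2: iterate over the nodes without adding them to the pipeline
for $report in $reportByID($ids) {
    $lib.print($report.repr())
}
```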
Functions and Privilege Separation
Functions can be used to support privilege separation (“privsep”) for things like custom Power-Up development. Storm logic that requires access to sensitive information (such as API keys or other credentials) can be encapsulated in a function that is not accessible to unprivileged users. The function can return non-sensitive data that is “safe” for viewing or consumption.
See the Rapid Power-Up Development Guide and in particular the section on privileged modules for more information.
Function Debugging Tips
Functions execute Storm, so standard Storm debugging tips still apply to all code within the function itself (and to the Storm code that invokes the function, of course). The following additional tips apply to functions in particular.
Use the right type of function for your use case. Each Storm function serves a different purpose; be clear on what type of function you need for a given situation.
For example, a node yielder can yield multiple nodes. A callable function can also provide multiple nodes (e.g., by returning a set or list object that contains them). But there can be significant (even damaging) performance differences between the two, depending on the nature of the function.
A node yielder yields a generator object that can incrementally provide results (i.e., for a streaming effect). When written as a node yielder, a function to lift every node in a Cortex is workable, even for large result sets:
function allnodes() { .created }
You could write the same function as a callable function, but it would likely blow up your system by consuming all available memory. A callable function can only return exactly one object; it can’t stream results. You could write a callable function to lift each node, add it to a set object, and have the function return the set. But the callable function will need to construct and store the entire set in memory until the object can be returned:
// NEVER DO THIS
function allnodes() {
    //Create an empty set object to hold every node
    $set = $lib.set()
    .created
    $set.add($node)
    fini { return($set) }
}
While this is an extreme example, it serves to illustrate some of the differences between function types.
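As a small sketch of why the streaming form is workable, the node yielder version can be consumed incrementally. This example assumes the node yielder form of allnodes() shown earlier and uses the limit command to stop after a few nodes:

```storm
// Node yielder version: nodes are produced only as they are requested
function allnodes() { .created }

// Only the first five nodes are consumed; the function does not
// need to build the full result set before any node is available.
yield $allnodes() | limit 5
```

The callable version cannot be consumed this way, because it must finish building its entire return value before the invoking pipeline receives anything.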
Ensure necessary keywords are present for your function type. Synapse determines “what kind” of function is present and how to execute it based on keywords (e.g., return() for callable functions, emit for data emitters). If you write a node yielder function with a return() statement, Synapse will attempt to execute it as a callable function. Similarly, a callable function that is missing a return() will not execute properly.
Note
Data emitters and node yielders may fail to emit data or yield nodes, based on the input to the function and the function’s code. In these situations it can be challenging to determine whether a function that is “not doing anything” is a yielder / emitter that is failing to produce output, or a callable function that is missing a return() statement.
Understand pipeline interactions between functions and Storm logic that invokes them. By default, functions do not interact with the Storm pipeline that invokes them.
If you want a function to operate on nodes in the invoking Storm pipeline, you must invoke the function in such a way as to do this.
Note
If a function is written to operate on or iterate over nodes, and there are no nodes in the pipeline (based on previously executing Storm logic), the function will not execute.
If you want the invoking Storm pipeline to operate on the function’s output, you must ensure that the output is returned to the pipeline (e.g., assign the function’s output to a variable; use the yield keyword to yield any nodes into the pipeline; use a for loop to iterate over function results; etc.).