Data Model - Form Categories¶
Synapse forms can also be broadly grouped based on how their primary properties (
<form> = <valu>) are structured or formed.
<form> = <valu> must be unique for all forms of a given type. In other words, the
<valu> must be defined so that it uniquely identifies any given node of that form; it represents that form’s “essence” or “thinghood” in a way that allows the unambiguous deconfliction of all possible nodes of that form.
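The uniqueness rule above can be illustrated with a small sketch. This is a toy in-memory model written for this page, not Synapse code: the `ToyCortex` class, its `add_node` method, and the `note` property are all hypothetical names. It only shows that when nodes are keyed on the `<form>` / `<valu>` pair, adding the same primary property twice resolves to one node.

```python
# Toy in-memory model (NOT Synapse code): nodes are keyed on (form, valu).
class ToyCortex:
    def __init__(self):
        self.nodes = {}

    def add_node(self, form, valu, **props):
        """Create the node if it is new; otherwise update the existing one."""
        node = self.nodes.setdefault((form, valu), {})
        node.update(props)
        return node


cortex = ToyCortex()
first = cortex.add_node("inet:fqdn", "woot.com")
second = cortex.add_node("inet:fqdn", "woot.com", note="seen again")  # hypothetical property
other = cortex.add_node("inet:fqdn", "example.com")

same_node = first is second      # both calls resolve to one node
node_count = len(cortex.nodes)   # only two distinct primary properties exist
```

Because the key is the primary property itself, deconfliction happens as a side effect of node creation in this sketch.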
Conceptually speaking, the general categories of forms in Synapse are:
- Simple Form
- Composite (Comp) Form
- Guid Form
- Edge Representations
- Generic Form
This list represents a conceptual framework for understanding the Synapse data model.
A simple form is a form whose primary property is a single typed
<valu>. Simple forms are commonly used to represent an Entity, and so tend to be the most readily understood from a modeling perspective.
- IP addresses. An IP address (IPv4 or IPv6) must be unique within its address space and can be defined by the address itself:
inet:ipv4 = 18.104.22.168. Secondary properties include the associated Autonomous System number and whether the IP belongs to a specialized or reserved group (e.g., private, multicast, etc.).
- Email addresses. An email address must be unique in order to route email to the correct account / individual and can be defined by the address itself:
inet:email = firstname.lastname@example.org. Secondary properties include the domain where the account receives mail and the username for the account.
Composite (Comp) Form¶
A composite (comp) form is one where the primary property is a comma-separated list of two or more typed
<valu> elements. While no single element makes the form unique, a combination of elements can uniquely define a given node of that form. Comp forms are often (though not universally) used to represent a Relationship.
Fused DNS A records. A DNS A record can be uniquely defined by the combination of the domain (
inet:fqdn) and the IP address (
inet:ipv4) in the A record. Synapse’s
inet:dns:a form represents the knowledge that a given domain has ever resolved to a specific IP (fused knowledge):
inet:dns:a = (woot.com, 22.214.171.124).
Web-based accounts. An account at an online service (such as Github or Gmail) can be uniquely defined by the combination of the domain where the service is hosted (
inet:fqdn) and the unique user ID (
inet:user) used to identify the account:
inet:web:acct = (twitter.com, joeuser).
Social networks. Many online services allow users to establish relationships with other users of that service. These relationships may be one-way (you can follow someone on Twitter) or two-way (you can mutually connect with someone on LinkedIn). A given one-way social network relationship can be uniquely defined by the two users (
inet:web:acct) involved in the relationship:
inet:web:follows = ((twitter.com,alice), (twitter.com,bob)). (A two-way relationship can be defined by two one-way relationships.)
Note that each of the elements in the
inet:web:follows comp form is itself a comp form (
Subsidiaries. An organization / sub-organization relationship (e.g., corporation / subsidiary, company / division, government / ministry, etc.) can be uniquely defined by the specific parent / child entities (
ou:suborg = (084e295272e839afcf3f1fe10c6c97b9, 237e88a35439fdb566d909e291339154).
Note that each of the organizations (
ou:org) in the relationship is represented by a 128-bit Globally Unique Identifier (guid), each an example of a Guid Form.
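As a rough illustration of the comp form examples above, the sketch below models a comp primary property as a plain Python tuple. This is an assumption made for illustration only and says nothing about how Synapse stores comp values. It shows that a comp value can nest other comp values, and that a one-way relationship and its reverse are two distinct nodes.

```python
# Toy model (NOT Synapse code): a comp primary property as a tuple of elements.
acct_alice = ("twitter.com", "alice")   # inet:web:acct value
acct_bob = ("twitter.com", "bob")       # inet:web:acct value

# inet:web:follows is a comp whose elements are themselves comp values.
alice_follows_bob = (acct_alice, acct_bob)
bob_follows_alice = (acct_bob, acct_alice)

nodes = {
    ("inet:web:follows", alice_follows_bob),
    ("inet:web:follows", bob_follows_alice),
    ("inet:web:follows", (acct_alice, acct_bob)),  # same value again, deconflicts
}

follows_count = len(nodes)  # two one-way relationships, not three nodes
```

The two remaining nodes together represent a two-way relationship, matching the note above that a two-way relationship can be defined by two one-way relationships.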
A guid (Globally Unique Identifier) form is uniquely defined by a machine-generated 128-bit number. Guids account for cases where it is impossible to uniquely define a thing based on a specific set of properties no matter how many individual elements are factored into a comp form. A guid form can be considered a special case of a Simple Form where the typed
<valu> is of type
Guid forms can be arbitrary (generated ad-hoc by Synapse) or predictable / deconflictable (generated based on a specific set of inputs). See the guid section of Storm Reference - Type-Specific Storm Behavior for a more detailed discussion of this concept.
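The difference between arbitrary and deconflictable guids can be sketched as follows. This is not Synapse's actual guid algorithm: the function names are hypothetical, and the choice of JSON serialization plus a 128-bit BLAKE2b digest is an assumption made only to show the idea of deriving a stable 128-bit value from a set of inputs.

```python
import hashlib
import json
import secrets


def arbitrary_guid():
    """Return a random 128-bit value as 32 hex characters."""
    return secrets.token_hex(16)


def deconflictable_guid(*inputs):
    """Return a 128-bit value derived from the inputs (same inputs, same guid)."""
    blob = json.dumps(inputs).encode("utf-8")
    return hashlib.blake2b(blob, digest_size=16).hexdigest()


# Hypothetical inputs chosen for illustration.
guid_a = deconflictable_guid("example org", "us")
guid_b = deconflictable_guid("example org", "us")
guid_c = deconflictable_guid("another org", "us")
random_a = arbitrary_guid()
random_b = arbitrary_guid()
```

Here `guid_a` and `guid_b` are identical, so two ingests of the same inputs would resolve to one node, while `random_a` and `random_b` differ on every call.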
While certain types of data could be represented by a comp form based on a sufficient number of properties of the data, there are advantages to using a guid instead:
- In a comp form, the elements used to create the primary property are required in order to create a node of that form. It is not uncommon for real world data to be incomplete. Using a guid allows all of those elements to be defined as optional secondary properties, so the node can be created with as much (or as little) data as is available.
- Some data sources are such that individual records can be considered unique a priori. This often applies to event-type forms for large quantities of events. In this case it is sufficient to distinguish the nodes from each other using a guid as opposed to being uniqued over a subset of properties.
- There is a potential performance benefit to representing forms using arbitrary guids, because they are guaranteed to be unique for a given Cortex. In particular, when ingesting data presumed to be unique, creating guid-based forms instead of comp forms eliminates the need to parse and deconflict nodes before they are created. This benefit can be significant over large data sets.
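The first advantage in the list above (incomplete data) can be sketched in a few lines. This is a toy model, not Synapse code: the field names in `REQUIRED`, the helper functions, and the sample record are all hypothetical. The comp-style helper refuses to build a primary property when an element is missing, while the guid-style helper creates the node and keeps whatever properties are available.

```python
import secrets

# Hypothetical elements that a comp-style primary property would require.
REQUIRED = ("host", "exe", "path", "time")


def make_comp_valu(record):
    """Build a comp-style value; every element must be present."""
    missing = [name for name in REQUIRED if record.get(name) is None]
    if missing:
        raise ValueError(f"cannot build comp value, missing: {missing}")
    return tuple(record[name] for name in REQUIRED)


def make_guid_node(record):
    """Build a guid-style node; available fields become optional properties."""
    props = {name: valu for name, valu in record.items() if valu is not None}
    return secrets.token_hex(16), props


# An incomplete record, as is common in real world data.
partial = {"host": "host01", "exe": None, "path": "c:/temp/a.txt", "time": None}

try:
    make_comp_valu(partial)
    comp_created = True
except ValueError:
    comp_created = False

guid, props = make_guid_node(partial)
```

In this sketch the incomplete record cannot become a comp node at all, but it can become a guid node carrying two of the four properties.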
People. Synapse uses a guid as the primary property for a person (
ps:person) node. There is no single property or set of properties that uniquely and unambiguously define a person. A person’s full name, date of birth, or place of birth (or the combination of all three) are not guaranteed to be fully unique across an entire population. Identification numbers (such as Social Security or National ID numbers) are country-specific, and not all countries require each citizen to have an ID number. Even a person’s genome is not guaranteed to be unique (such as in the case of identical twins).
Secondary properties include the person’s name (including given, middle, or family names) and date of birth.
Host execution / sandbox data. The ability to model detailed behavior of a process executing on a host (or in a sandbox) is important for a range of disciplines, including incident response and malware analysis. Modeling this data is challenging because of the number of effects that execution may have on a system (files read, written, or deleted; network activity initiated). Even if we focus on a specific effect (“a process wrote a new file to disk”), there are still a number of details that may define a “unique instance” of “process writes file”: the specific host (
it:host) where the process ran, the program (
file:bytes) that wrote the file to disk, the process (
file:bytes) that launched the program, the time the execution occurred, the file that was written (
file:bytes), the file’s path (
file:path), and so on. While all of these elements could be used to create a comp form, in the “real world” not all of this data may be available in all cases, making a guid a better option for forms such as
Unique DNS responses. Similar to host execution data, an individual DNS response to a request could potentially be uniqued based on a comp form containing multiple elements (time, DNS query, server that replied, response code, specific response, etc.). However, the same issues described above apply and it is preferable to use a guid for forms such as
Recall that a Relationship can be the hypergraph equivalent of an edge connecting two nodes in a directed graph. A standard relationship form (such as
inet:dns:a) represents a specific relationship (“has DNS A record for”) between two explicitly typed nodes (
inet:ipv4). Synapse’s strong typing and type safety ensure that all primary and secondary properties are explicitly typed, which facilitates both normalization of data and the ability to readily pivot across disparate properties that share the same data type. However, this means that types for all primary and secondary properties for a form representing a relationship must be defined in the data model ahead of time.
Some relationships are generic enough to apply to a wide variety of forms. One example is “has”: <thing a> “has” <thing b>. While it is possible to explicitly define typed forms for every possible variation of that relationship (“person has telephone number”, “company has social media account”), you would still need to update the data model every time a new variation of what is essentially the same “has” relationship is identified.
Synapse provides two options to represent generic “edge-type” relationships between arbitrary forms. Both methods allow this data to be incorporated into a Cortex without code modifications to update the data model: the Digraph (Edge) Form and the Lightweight (Light) Edge.
Digraph (Edge) Form¶
A digraph form (“edge” form) is a specialized Composite (Comp) Form whose primary property value consists of two
<form>,<valu> pairs (“node definitions”, or ndefs). An edge form is a specialized relationship form that can be used to link two arbitrary forms in a generic relationship. In the “has” example above, a variety of entities (people, organizations) may “have” a variety of things (email addresses, social media accounts, company cars). It would be nice to have a single generic “has” form that could link two arbitrary objects without having to explicitly define relationship forms such as “person has email address” or “company has office location”.
Synapse addresses this issue by defining a node’s ndef (
<form>,<valu> pair) as a data Type. Properties of type
ndef can thus effectively specify both a type (
<form>) and a
<valu> at the time of node creation. This allows for generic relationship forms (such as
edge:has) that can link two “arbitrary” node types.
Generic edge forms are best suited for representing relationships where you need to capture additional detail about the relationship (via secondary properties) or observations about the relationship (via tags).
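The structure of a digraph form can be sketched with plain tuples. This is a toy model, not Synapse code: the `ndef` and `add_edge_has` helpers, the store layout, the email address, the dates, and the tag name are all hypothetical. It shows that the primary property is a pair of ndefs, and that because the edge is itself a node it can carry secondary properties and tags.

```python
def ndef(form, valu):
    """A node definition: the (form, valu) pair that identifies a node."""
    return (form, valu)


def add_edge_has(store, n1, n2, **props):
    """Create or update a toy edge:has node keyed on its two ndefs."""
    key = ("edge:has", (n1, n2))
    node = store.setdefault(key, {"props": {}, "tags": set()})
    node["props"].update(props)
    return node


store = {}
org = ndef("ou:org", "084e295272e839afcf3f1fe10c6c97b9")
acct = ndef("inet:web:acct", ("twitter.com", "joeuser"))
email = ndef("inet:email", "user@example.com")   # hypothetical address

# The same generic form links an org to two different kinds of things.
edge = add_edge_has(store, org, acct, seen=("2019-01-01", "2020-01-01"))
edge["tags"].add("example.tag")                   # hypothetical tag
add_edge_has(store, org, email)

edge_count = len(store)
```

Note that neither ndef is restricted to a particular form, which is what lets one relationship form cover "organization has account" and "organization has email address" alike.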
Lightweight (Light) Edge¶
Digraph forms are useful, but have some disadvantages in terms of performance, representation, and navigation for many common use cases. Lightweight (light) edges address these limitations.
Similar to edge forms, light edges are used to link two arbitrary forms. However, unlike edge forms, light edges are not forms at all. They consist solely of a user-defined verb (that describes the linking relationship) and the two forms (nodes) being linked. Light edges typically have an implied direction (as many relationships represented by light edges are “one-way”). However, the direction is not an inherent part of the definition of the light edge itself; instead the direction is “defined” via the Storm syntax used to join the nodes. That is, nothing in Synapse prevents you from joining any two forms in any direction via a light edge, but only some of those joins will make sense given the meaning of the edge verb.
Light edges have some advantages over edge forms:
- Because they are nodes, edge forms incur additional performance overhead in general. This overhead is amplified in use cases where the edge represents a many-to-one relationship and the “many” is high. Light edges will always be more efficient than edge forms, and the performance benefit is significant in many cases.
- Edge forms represent generic relationships, but the edge form itself must still exist in the data model before it can be used. Synapse includes edge forms for common generic relationships (e.g.,
edge:has), but introducing additional relationships would require extending the data model. Light edges can be created on the fly (with appropriate permissions) as the need arises.
- The primary property of an edge form is two elements of type Ndef. Because of Synapse’s type-awareness, this may exclude edge forms from certain types of navigation (such as wildcard (“refs out” / “refs in”) pivots - see Storm Reference - Pivoting). This makes it slightly more complicated to “show me all the things” connected to a given node when those connections may include things linked by edge forms vs. things linked by light edges.
Light edges have some disadvantages - namely, since they are not forms, they cannot store any additional “detail” about the relationship they represent outside of their verb. They do not support secondary properties, and you cannot apply tags to light edges.
In addition, because light edges are not forms, they cannot be viewed in a Cortex via Synapse’s model introspection features (see Storm Reference - Model Introspection). The Storm model commands allow you to list and otherwise work with the light edges in a Cortex (note that there are no light edges defined in a Cortex by default).
Whether to use an edge form or a light edge to represent data in your Cortex will depend on your specific needs.
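For contrast with edge forms, a light edge can be sketched as nothing more than a triple. This is a toy model, not Synapse code: the set-based store, the helper functions, and the report guid are hypothetical. It shows that a light edge holds only a verb and the two nodes it joins, that there is nowhere to attach properties or tags, and that direction comes purely from which node is placed first.

```python
# Toy model (NOT Synapse code): a light edge is just (n1, verb, n2).
light_edges = set()


def add_light_edge(n1, verb, n2):
    """Join n1 to n2 with a verb; adding the same edge twice is a no-op."""
    light_edges.add((n1, verb, n2))


def edges_out(node, verb):
    """Return the nodes reached from `node` by following `verb` outward."""
    return {n2 for (n1, v, n2) in light_edges if n1 == node and v == verb}


report = ("media:news", "00000000000000000000000000000001")  # hypothetical guid
fqdn = ("inet:fqdn", "woot.com")

add_light_edge(report, "refs", fqdn)
add_light_edge(report, "refs", fqdn)   # duplicate, nothing new is stored
add_light_edge(fqdn, "refs", report)   # nothing prevents this, but it makes little sense

edge_total = len(light_edges)
referenced = edges_out(report, "refs")
```

The reversed edge is accepted without complaint, which mirrors the point above that only some joins will make sense given the meaning of the verb.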
“References”. There are a number of use cases where it is helpful to note that a thing “references” another thing. Examples include:
- A report (
media:news) that contains threat indicators, such as hashes (
hash:sha256), domains (
inet:fqdn), email addresses (
- A photograph (
file:bytes) that depicts a person (
ps:person), a location (
geo:place), a landmark (
- A news article (
media:news) that describes an event such as a conference (
“References” is a very simple generic relationship. It is also likely to represent large many-to-one relationships, at least for some use cases; while some blogs may include only a handful of indicators, comprehensive whitepapers or internal documents such as incident reports may contain hundreds or thousands of indicators and referenced objects. “References” is also unlikely to have an associated time element; that is, if a report contains (references) an indicator (such as an FQDN), that relationship is unlikely to change. A report may be revised, but then it is technically a different report; the original still contains the reference.
For these reasons a “references” relationship would be better represented by a light edge vs. an edge form.
“Has”. There are a number of use cases where it is helpful to note that a thing owns or possesses (“has”) another thing. Examples include:
- A company (
ou:org) owns a corporate office (
mat:item), a range of IP addresses (
inet:cidr4), or a delivery van (
- A person (
ps:person) has an email address (
inet:email) or telephone number (
In some cases the relationship of a person or organization owning or possessing (“having”) a resource (a social media account, or an email address) may be indirectly apparent via existing pivots in the Synapse hypergraph. For example, an organization (
ou:org) may have a name that is shared by a social media account (
ou:org:name -> inet:web:acct:realname) where the social media account also references the organization’s web page (
inet:web:acct:webpage -> ou:org:url). However, it may be desirable to more tightly link an “owning” entity to things that it “has”. In addition, there may be things that an organization or person “has” that are not as easily identified via primary and secondary property pivots. In these cases the “has” form can represent this relationship between the “owning” entity and the arbitrary thing owned.
Like “references”, “has” seems like a very simple generic relationship. Whether to use an edge form or a light edge depends in part on the number of many-to-one relationships you need to model, and whether you need to capture additional information about the relationship (such as if something was “had” only for a specific period of time).
If the number of many-to-one relationships is relatively small AND you need to capture data such as a time interval, an edge form (
edge:has) may be best. For large many-to-one relationships, or cases where things like time are not relevant (or where the time element is captured elsewhere), light edges are preferable.
- An organization (
ou:org) may “have” an office location (
geo:place) only for a period of time; the organization may lease or buy a different space if the business grows, for example. If this time element is relevant, an
edge:has node can be used to represent the relationship, with the
.seen property capturing the time interval.
- An IP address (
inet:ipv6) may be part of a netblock, either directly (
inet:cidr6) or as part of a netblock referenced in a network registration record (
inet:whois:iprec). Depending on the size of the netblock, the many-to-one relationship may be extremely large. In addition, an IP address may be part of more than one netblock / registration record, given network range suballocations and so on. In some cases a time element is irrelevant (i.e., a defined CIDR block is a fixed thing; an IP that is part of a /24 will never not be part of that /24). In cases of network registration records, the
inet:whois:iprec form contains time values; if that record changes (specifically, if the IP range is allocated differently) that would represent a new
inet:whois:iprec with a new “has” relationship with the IPs in that range. In these cases (IP as part of CIDR, IP referenced by netblock in registration record) light edges are preferable - for example,
inet:cidr4 -(has)> inet:ipv4 to show an IP is part of a CIDR block or
inet:whois:iprec -(has)> inet:ipv4 to show that an IP is part of a netblock referenced in a registration record. These light edges can be represented by a generic verb (“has”) or a more relationship-specific verb (e.g., “hasip”) depending on preference or need.
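To give a sense of how quickly the many-to-one count grows in the netblock example above, the sketch below enumerates the edges for a single /24 using Python's standard `ipaddress` module. This is a toy model, not Synapse code: the `cidr_light_edges` helper and the tuple representation are hypothetical, and 192.0.2.0/24 is a documentation range chosen only for illustration.

```python
import ipaddress


def cidr_light_edges(cidr, verb="has"):
    """Return one (cidr, verb, ip) triple for every address in the block."""
    network = ipaddress.ip_network(cidr)
    n1 = ("inet:cidr4", str(network))
    return [(n1, verb, ("inet:ipv4", str(ip))) for ip in network]


edges = cidr_light_edges("192.0.2.0/24")
edge_count = len(edges)   # one edge per address in the /24
first_edge = edges[0]
last_edge = edges[-1]
```

A single /24 already yields 256 relationships to the same CIDR node, and larger blocks grow by powers of two, which is the kind of volume where the lower overhead of light edges matters.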
“Went to”. “Went to” can be used to represent that a thing (often a person, potentially an object such as a bus) traveled to a place (a city, an office building, a set of geolocation coordinates) or that a person attended an event (a conference, a party). It would be natural to want to record “when” this event occurred, such as via a “time” secondary property (for a single point in time, such as an arrival time). Alternately, the
.seen universal property could be used to record a start and end time if the “went to” needed to capture a duration. Because of this need to track additional information about the relationship, an edge form (
edge:wentto) would be more appropriate.
The Synapse data model includes a number of “generic” forms that can be used to represent metadata and / or arbitrary data.
In an ideal world, all data represented in a Synapse hypergraph would be accurately modeled using an appropriate form to properly capture the data’s unique (primary property) and contextual (secondary property) characteristics. However, designing an appropriate data model may require extended discussion, subject matter expertise, and testing against “real world” data - not to mention development time to implement model changes. In addition, there are use cases where data needs to be added to a Cortex for reference or analysis purposes, but simply does not have sufficient detail to be represented accurately, even if appropriate data forms exist.
While the use of generic forms is not ideal (the representation of data is lossy, which may impact effective analysis), these forms allow for the addition of arbitrary data to a hypergraph, either because that is the only way the data can be represented; or because an appropriate model does not yet exist but the data is needed now.
Generic forms such as
graph:event can be used for this purpose. Similarly, the generic
graph:cluster node can be used to link (via
refs light edges or
edge:refs forms) a set of nodes of arbitrary size (“someone says these things are all related”) in the absence of greater detail.
The Synapse data model includes forms such as
meta:source that can be used to track data sources for data ingested into a Cortex. “Sources” may include sensors or third-party services or connectors. Structures such as
seen light edges or
meta:seen forms can be used to track that a particular piece of data (e.g., a node) was observed by or from a particular source.