HTMLtoJSON

Synapse FileParser includes functionality to convert an HTML file into JSON using a user-defined template.

It is recommended to use the Storm API to invoke HTMLtoJSON:

$fp = $lib.import(fileparser)
($ok, $data) = $fp.htmlToJson($sha256, $template)
if (not $ok) {
    $lib.warn("FileParser HTMLtoJSON failed")
} else {
    // do something with $data
}

The htmlToJson API returns a tuple. The first value is a bool indicating whether execution succeeded. If it is True, the second value contains the extracted data, structured according to the template.

Template Schema

To define an HTML element to extract, specify a CSS selector as the select key. For additional details on the CSS selector syntax please see the SoupSieve documentation.

If the element is not found, the return value is None. Otherwise, the text is extracted and passed to any additional post-processing steps.

type is a required key for all templates, and supported template types are detailed below.
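The required keys named in this section and in the type sections that follow can be collected into a small pre-flight check. The following is a minimal sketch in Python, not part of FileParser: the names REQUIRED_KEYS and check_template are hypothetical, and the table only encodes the requirements this document states explicitly.

```python
# Required keys per template type, as stated in this document.
# 'type' itself is required for every template.
REQUIRED_KEYS = {
    'text': (),
    'attr': ('name',),
    'list': ('select', 'template'),
    'dict': ('fields',),
    'groups': ('fields', 'groups'),
}


def check_template(template):
    """Return a list of problems found in a template (empty if none).

    Hypothetical helper for illustration; FileParser performs its own
    validation, which may differ.
    """
    if not isinstance(template, dict):
        return ['template must be a dictionary']

    ttype = template.get('type')
    if ttype not in REQUIRED_KEYS:
        return [f'missing or unsupported type: {ttype!r}']

    problems = []
    for key in REQUIRED_KEYS[ttype]:
        if key not in template:
            problems.append(f'{ttype} template is missing required key {key!r}')

    # Recurse into the nested template of a list ...
    if 'template' in template:
        problems.extend(check_template(template['template']))

    # ... and into each named field of a dict or groups template.
    fields = template.get('fields')
    if isinstance(fields, dict):
        for sub in fields.values():
            problems.extend(check_template(sub))

    return problems
```

Running a check like this before calling htmlToJson can surface a missing name or fields key early, without a round trip through the API.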

The following are additional post-processing keys that can be provided to text and attr templates:

  • regex: Search the text for a match to a regex pattern. Return the matching text, or None if there is no match.

  • findall: Return a list of all matches to a regex pattern.

  • strip: If specified, remove leading and trailing whitespace.

  • split: Split the text on the specified delimiter string and return the resulting list of strings.
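To make the effect of these keys concrete, the following Python sketch emulates them with the standard re module. It is an illustration only, not FileParser's implementation: the function name postprocess is hypothetical, and the order in which the keys are applied (strip, regex, findall, split) is an assumption that this document does not specify.

```python
import re


def postprocess(text, template):
    """Emulate the post-processing keys on already-extracted text.

    Assumption: keys are applied in the order strip, regex, findall,
    split. The real ordering inside FileParser is not documented here.
    """
    if text is None:
        return None

    if template.get('strip'):
        text = text.strip()

    if 'regex' in template:
        match = re.search(template['regex'], text)
        if match is None:
            return None
        text = match.group(0)  # the matching text

    if 'findall' in template:
        return re.findall(template['findall'], text)

    if 'split' in template:
        return text.split(template['split'])

    return text


# Inputs borrowed from the examples in this document.
print(postprocess('\nI linked to example.com\n', {'strip': True}))
print(postprocess('Some foo text', {'regex': r'(\w{4})'}))
print(postprocess('Some bar text', {'split': ' '}))
```

Under these assumptions the three calls print 'I linked to example.com', 'Some', and ['Some', 'bar', 'text'], matching the values shown in the text and dict examples.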

Type: text

Return the human-readable text inside an element. If the element has children, the strings of all descendants are concatenated.

Example:

HTML:

<html>
    <a id="mylink" href="http://example.com/">\nI linked to <i>example.com</i>\n</a>
</html>

Template:

{
    'type': 'text',
    'select': '#mylink',
    'strip': True,
}

Returns:

'I linked to example.com'

Type: attr

Return the value of the named attribute for the element. If the attribute does not exist, returns None.

name is an additional required keyword.

Example:

HTML:

<html>
    <a id="mylink" href="http://example.com/">\nI linked to <i>example.com</i>\n</a>
</html>

Template:

{
    'type': 'attr',
    'name': 'href',
    'select': '#mylink',
}

Returns:

'http://example.com/'

Type: list

Extract data using a template for all elements that match a CSS selector.

select and template are additional required keywords.

Example:

HTML:

<html>
    <p class="foo">Some foo text</p>
    <p class="bar">Some bar text</p>
    <p class="foo">More foo text</p>
</html>

Template:

{
    'type': 'list',
    'select': 'p.foo',
    'template': {
        'type': 'text',
    },
}

Returns:

[
    'Some foo text',
    'More foo text',
]

Type: dict

Extract named fields using individual templates per field.

The fields keyword is required, and must be a dictionary containing <name>: <template> key/value pairs.

Example:

HTML:

<html>
    <p class="foo">Some foo text</p>
    <p class="bar">Some bar text</p>
    <p class="foo">More foo text</p>
</html>

Template:

{
    'type': 'dict',
    'fields': {
        'foos': {
            'type': 'list',
            'select': 'p.foo',
            'template': {
                'type': 'text',
                'regex': '(\w{4})',
            },
        },
        'bar': {
            'type': 'text',
            'select': 'p.bar',
            'split': ' ',
        },
    },
}

Returns:

{
    'foos': ['Some', 'More'],
    'bar': ['Some', 'bar', 'text'],
}

Type: groups

Extract groups of named fields using individual templates per field. This type is useful when attempting to extract groups of siblings as a list, since a CSS selector returns a flat list of matching elements and cannot group related siblings together.

The fields keyword is required, and must be a dictionary containing <name>: <template> key/value pairs.

The groups keyword is required, and must be a list of dictionaries with field and tag keys. The list index corresponds to the position of the element within a group returned by the selector. field is the name of the field whose template will be used to parse the element. tag is the element tag that is validated before the group is parsed.

Example:

HTML:

<html>
    <h3>A cool header</h3>
    <h4>Foo subheader</h4>
    <div>Some foo01 text<div>Some cool text</div></div>
    <p>Some foo02 text</p>
    <h4>Bar subheader</h4>
    <div>Some bar01 text<div>More cool text</div></div>
    <p>Some bar02 text</p>
</html>

Template:

{
    'type': 'groups',
    'select': 'h3 ~ h4, h4+div, h4+div+p',
    'groups': [
        {'field': 'htext', 'tag': 'h4'},
        {'field': 'dtext', 'tag': 'div'},
        {'field': 'ptext', 'tag': 'p'},
    ],
    'fields': {
        'htext': {
            'type': 'text',
        },
        'dtext': {
            'type': 'text',
            'select': 'div > div',
        },
        'ptext': {
            'type': 'text',
        },
    },
}

Returns:

[
    {'htext': 'Foo subheader', 'dtext': 'Some cool text', 'ptext': 'Some foo02 text'},
    {'htext': 'Bar subheader', 'dtext': 'More cool text', 'ptext': 'Some bar02 text'},
]
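One way to picture how a groups template turns the flat selector result into records is as fixed-size chunking with a tag check. The Python sketch below illustrates that idea only. It is not FileParser's code: chunk_groups is a hypothetical name, elements are stand-in (tag, label) pairs rather than parsed HTML, and skipping a chunk whose tags do not match is an assumption, since this document does not say what happens on a mismatch.

```python
def chunk_groups(elements, groups):
    """Split a flat list of (tag, label) pairs into records.

    Each record maps a field name to the element at the same index in
    the chunk. Applying the per-field template is left out of the sketch.
    """
    size = len(groups)
    if size == 0:
        return []

    records = []
    for start in range(0, len(elements) - size + 1, size):
        chunk = elements[start:start + size]

        # Validate the tag at each position before using the chunk.
        tags_ok = all(tag == spec['tag'] for (tag, _), spec in zip(chunk, groups))
        if not tags_ok:
            continue  # assumption: a mismatched chunk is skipped

        records.append({spec['field']: label for spec, (_, label) in zip(groups, chunk)})

    return records


# Stand-ins for the six elements matched by the selector in the
# example above, in document order.
matched = [
    ('h4', 'Foo subheader'), ('div', 'foo div'), ('p', 'Some foo02 text'),
    ('h4', 'Bar subheader'), ('div', 'bar div'), ('p', 'Some bar02 text'),
]
groups = [
    {'field': 'htext', 'tag': 'h4'},
    {'field': 'dtext', 'tag': 'div'},
    {'field': 'ptext', 'tag': 'p'},
]
print(chunk_groups(matched, groups))
```

In the example above, the dtext field template would then run its own 'div > div' selector inside the outer div to reach the nested text.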