HTMLtoJSON
Synapse FileParser includes functionality to convert an HTML file into JSON using a user-defined template.
It is recommended to use the Storm API to invoke HTMLtoJSON:
$fp = $lib.import(fileparser)
($ok, $data) = $fp.htmlToJson($sha256, $template)
if (not $ok) {
    $lib.warn("FileParser HTMLtoJSON failed")
} else {
    // do something with $data
}
The htmlToJson API returns a tuple, with the first value a bool indicating whether the execution was successful. If True, the second tuple item contains the data according to the structure of the template.
Template Schema
To define an HTML element to extract, specify a CSS selector as the select key.
For additional details on the CSS selector syntax please see the SoupSieve documentation.
If the element is not found, the return value is None; otherwise the text is extracted and passed to any additional post-processing steps.
type is a required key for all templates, and the supported template types are detailed below.
The following additional post-processing keys can be provided to text and attr templates:
regex: Search the text for a match to a regex pattern. Returns the matching text or None.
findall: Return a list of all matches to a regex pattern.
strip: If specified, remove leading and trailing whitespace.
split: Return a list of strings using the specified delimiter string.
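The post-processing keys above can be pictured with a small Python sketch built on the standard re module. This is not the FileParser implementation: the post_process helper is hypothetical, and the order in which the keys are applied (strip first, then one of regex, findall, or split) and the handling of capture groups are assumptions made only for illustration.

```python
import re


def post_process(text, regex=None, findall=None, strip=False, split=None):
    """Hypothetical helper mimicking the post-processing keys (assumed order)."""
    if text is None:
        # Nothing was extracted, so there is nothing to post-process.
        return None
    if strip:
        text = text.strip()
    if regex is not None:
        # Return the first matching text, or None when nothing matches.
        match = re.search(regex, text)
        return match.group(0) if match else None
    if findall is not None:
        # Return every match as a list.
        return re.findall(findall, text)
    if split is not None:
        # Split on the literal delimiter string.
        return text.split(split)
    return text


print(post_process("  Some foo text \n", strip=True))   # Some foo text
print(post_process("Some foo text", regex=r"\w{4}"))    # Some
print(post_process("a1 b22 c333", findall=r"\d+"))      # ['1', '22', '333']
print(post_process("Some bar text", split=" "))         # ['Some', 'bar', 'text']
```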
Type: text
Return the human-readable text inside an element. If the element has children, all strings will be concatenated together.
Example:
HTML:
<html>
<a id="mylink" href="http://example.com/">\nI linked to <i>example.com</i>\n</a>
</html>
Template:
{
    'type': 'text',
    'select': '#mylink',
    'strip': True,
}
Returns:
'I linked to example.com'
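For readers who want to reproduce this result outside of FileParser, the following Python sketch approximates the text type using only the standard library. It is an illustration under assumptions: xml.etree.ElementTree stands in for a real HTML parser (which only works because the sample is well-formed), and the XPath expression stands in for the CSS selector '#mylink'.

```python
import xml.etree.ElementTree as ET

HTML = (
    '<html>\n'
    '<a id="mylink" href="http://example.com/">\nI linked to <i>example.com</i>\n</a>\n'
    '</html>'
)

root = ET.fromstring(HTML)

# Stand-in for the CSS selector '#mylink'.
link = root.find(".//*[@id='mylink']")

# Concatenate the element's own text with the text of all of its children.
text = "".join(link.itertext()) if link is not None else None

# Equivalent of 'strip': True in the template.
result = text.strip() if text is not None else None

print(repr(result))  # 'I linked to example.com'
```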
Type: attr
Return the value of the named attribute for the element.
If the attribute does not exist, returns None.
name is an additional required keyword.
Example:
HTML:
<html>
<a id="mylink" href="http://example.com/">\nI linked to <i>example.com</i>\n</a>
</html>
Template:
{
    'type': 'attr',
    'name': 'href',
    'select': '#mylink',
}
Returns:
'http://example.com/'
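The attr behavior, including the None result for a missing attribute, can be approximated with the Python standard library. This is a sketch rather than the FileParser code: extract_attr is a hypothetical helper, and the XPath expression is an assumed stand-in for the CSS selector '#mylink'.

```python
import xml.etree.ElementTree as ET

HTML = (
    '<html>\n'
    '<a id="mylink" href="http://example.com/">\nI linked to <i>example.com</i>\n</a>\n'
    '</html>'
)


def extract_attr(root, xpath, name):
    """Hypothetical helper: return the named attribute, or None if absent."""
    elem = root.find(xpath)
    if elem is None:
        # The element itself was not found.
        return None
    # Element.get returns None when the attribute does not exist.
    return elem.get(name)


root = ET.fromstring(HTML)
href = extract_attr(root, ".//*[@id='mylink']", "href")
missing = extract_attr(root, ".//*[@id='mylink']", "title")

print(repr(href))     # 'http://example.com/'
print(repr(missing))  # None
```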
Type: list
Extract data using a template for all elements that match a CSS selector.
select and template are additional required keywords.
Example:
HTML:
<html>
<p class="foo">Some foo text</p>
<p class="bar">Some bar text</p>
<p class="foo">More foo text</p>
</html>
Template:
{
    'type': 'list',
    'select': 'p.foo',
    'template': {
        'type': 'text',
    },
}
Returns:
[
    'Some foo text',
    'More foo text',
]
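The list type applies the inner template to each matching element in document order. A rough Python equivalent using only the standard library is sketched below. The XPath predicate is an assumed stand-in for the CSS selector 'p.foo' and only matches an exact class attribute value, which is narrower than real CSS class matching.

```python
import xml.etree.ElementTree as ET

HTML = """<html>
<p class="foo">Some foo text</p>
<p class="bar">Some bar text</p>
<p class="foo">More foo text</p>
</html>"""

root = ET.fromstring(HTML)

# Stand-in for the CSS selector 'p.foo'; findall returns matches in document order.
matches = root.findall(".//p[@class='foo']")

# Apply the inner 'text' template to every matched element.
result = ["".join(p.itertext()) for p in matches]

print(result)  # ['Some foo text', 'More foo text']
```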
Type: dict
Extract named fields using individual templates per field.
The fields keyword is required, and must be a dictionary containing <name>: <template> key/value pairs.
Example:
HTML:
<html>
<p class="foo">Some foo text</p>
<p class="bar">Some bar text</p>
<p class="foo">More foo text</p>
</html>
Template:
{
    'type': 'dict',
    'fields': {
        'foos': {
            'type': 'list',
            'select': 'p.foo',
            'template': {
                'type': 'text',
                'regex': '(\w{4})',
            },
        },
        'bar': {
            'type': 'text',
            'select': 'p.bar',
            'split': ' ',
        },
    },
}
Returns:
{
    'foos': ['Some', 'More'],
    'bar': ['Some', 'bar', 'text'],
}
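To show how the nested templates combine, here is a standard-library Python sketch that produces the same structure. It is illustrative only: text_of and first_match are hypothetical helpers, the XPath expressions are assumed stand-ins for the CSS selectors, and returning the whole match for the regex key is an assumption about how capture groups are handled.

```python
import re
import xml.etree.ElementTree as ET

HTML = """<html>
<p class="foo">Some foo text</p>
<p class="bar">Some bar text</p>
<p class="foo">More foo text</p>
</html>"""


def text_of(elem):
    """Concatenate all text inside an element (the 'text' type)."""
    return "".join(elem.itertext())


def first_match(pattern, text):
    """Return the first regex match, or None (the 'regex' key, as assumed here)."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


root = ET.fromstring(HTML)
bar = root.find(".//p[@class='bar']")

result = {
    # 'list' of 'text' templates, each post-processed with the regex.
    "foos": [first_match(r"(\w{4})", text_of(p))
             for p in root.findall(".//p[@class='foo']")],
    # Single 'text' template post-processed with split.
    "bar": text_of(bar).split(" ") if bar is not None else None,
}

print(result)  # {'foos': ['Some', 'More'], 'bar': ['Some', 'bar', 'text']}
```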
Type: groups
Extract groups of named fields using individual templates per field. This type is useful when attempting to extract groups of siblings as a list, since CSS selectors do not allow for selecting siblings from an element.
The fields keyword is required, and must be a dictionary containing <name>: <template> key/value pairs.
The groups keyword is also required, and must be a list of dictionaries with field and tag keys. The list index corresponds to the index of the element in a group returned by the selector. field is the field name whose template will be used to parse the element. tag is the element tag that will be validated before parsing the group.
Example:
HTML:
<html>
<h3>A cool header</h3>
<h4>Foo subheader</h4>
<div>Some foo01 text<div>Some cool text</div></div>
<p>Some foo02 text</p>
<h4>Bar subheader</h4>
<div>Some bar01 text<div>More cool text</div></div>
<p>Some bar02 text</p>
</html>
Template:
{
    'type': 'groups',
    'select': 'h3 ~ h4, h4+div, h4+div+p',
    'groups': [
        {'field': 'htext', 'tag': 'h4'},
        {'field': 'dtext', 'tag': 'div'},
        {'field': 'ptext', 'tag': 'p'},
    ],
    'fields': {
        'htext': {
            'type': 'text',
        },
        'dtext': {
            'type': 'text',
            'select': 'div > div',
        },
        'ptext': {
            'type': 'text',
        },
    },
}
Returns:
[
    {'htext': 'Foo subheader', 'dtext': 'Some cool text', 'ptext': 'Some foo02 text'},
    {'htext': 'Bar subheader', 'dtext': 'More cool text', 'ptext': 'Some bar02 text'},
]
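The grouping logic can be pictured as chunking the flat list of selected elements into runs the length of the groups list, then applying each field's template by position. The Python sketch below follows that reading using only the standard library. It rests on several assumptions: the tag filter stands in for the CSS selector, the FIELDS callables stand in for the fields templates, and skipping a chunk whose tags do not match is a guess at how validation failures are handled.

```python
import xml.etree.ElementTree as ET

HTML = """<html>
<h3>A cool header</h3>
<h4>Foo subheader</h4>
<div>Some foo01 text<div>Some cool text</div></div>
<p>Some foo02 text</p>
<h4>Bar subheader</h4>
<div>Some bar01 text<div>More cool text</div></div>
<p>Some bar02 text</p>
</html>"""

# (field, tag) pairs mirroring the 'groups' key of the template.
GROUPS = [("htext", "h4"), ("dtext", "div"), ("ptext", "p")]


def text_of(elem):
    """Concatenate all text inside an element, or None if it is missing."""
    return "".join(elem.itertext()) if elem is not None else None


# Hypothetical extractors standing in for the 'fields' templates.
FIELDS = {
    "htext": text_of,
    "dtext": lambda e: text_of(e.find("div")),  # stand-in for 'div > div'
    "ptext": text_of,
}

root = ET.fromstring(HTML)

# Stand-in for 'h3 ~ h4, h4+div, h4+div+p': top-level elements in document order.
selected = [e for e in root if e.tag in ("h4", "div", "p")]

result = []
size = len(GROUPS)
for start in range(0, len(selected), size):
    chunk = selected[start:start + size]
    # Validate the tags of the chunk before parsing it (assumed: skip on mismatch).
    if [e.tag for e in chunk] != [tag for _, tag in GROUPS]:
        continue
    result.append({field: FIELDS[field](e) for (field, _), e in zip(GROUPS, chunk)})

for group in result:
    print(group)
```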