I’ll be using JavaScript here. With it we can write this in <100 lines of code.
Harness
We’ll start by laying out our code. I’ll be using closures, but you could —
and my first version did — use a class instead. We’ll need a
parse_html
function, inside we’ll put a helper function called
pull
and a parse_content
function which we’ll fill
in later.
If you’d like to follow along, the finished code is here: regex-html
export default function parse_html(input) {
const root = { children: [] };
function pull(regex, handler = () => {}) {
const match = regex.exec(input);
if (match !== null) {
const [full_match, ...captures] = match;
input = input.substr(full_match.length);
handler(...captures);
return true;
} else {
return false;
}
}
function parse_content(cursor) {
}
parse_content(root);
return root.children;
}
The cursor will be a reference to a node in our DOM tree. Tag nodes will have a tag property for their name, an attributes map, and a children array. Our root node just has a children array so that it looks like a tag to the parse_content function, but then we’ll just return the children array because we can have multiple top level nodes in HTML.
Tags
The first regular expression we’ll write will parse opening HTML tags. Open
tags look something like this <tag-name>
so we’ll use
/^<([a-zA-Z][a-zA-Z0-9\-]*)>/
. Of course, we’ll also need
to match closing tags, which are the same but with a slash:
/^\/([a-zA-Z][a-zA-Z0-9\-]*)>/.
Since
we’re matching on the beginning of the input, we need the ‘^’ to anchor to the
start of the string.
function parse_content(cursor) {
let run = true;
while (run && input.length > 0) {
// Parse an open tag
const success = pull(/^<([a-zA-Z][a-zA-Z0-9\-]*)>/, tag => {
const new_tag = { tag, attributes: {}, children: [] };
cursor.children.push(new_tag);
parse_content(new_tag);
}) ||
// Parse close tag
pull(/^<\/([a-zA-Z][a-zA-Z0-9\-]*)>/, tag => {
if (cursor.tag !== tag) {
throw new Error("Unmatched close tag");
}
run = false;
});
if (!success) {
throw new Error("Parse error");
}
}
}
Since our pull function returns a boolean, we can use short circuit evaluation to chain calls to pull and stop when one consumes some input. It’s kinda like choice in parser combinators.
One complication of using this pull function is not being able to break out of the loop, which is why there’s this silly run variable.
This gets us past our first test which doesn’t have any text nodes:
<p><b></b><i></i></p><div><span></span></div>
Text Nodes
Let’s add text nodes next. For simplicity, we’ll assume that ‘<’ characters don’t exist in text.
// Parse an open tag
const success = pull(/^<([a-zA-Z][a-zA-Z0-9\-]*)>/, tag => {
const new_tag = { tag, attributes: {}, children: [] };
cursor.children.push(new_tag);
parse_content(new_tag);
}) ||
// Parse close tag
pull(/^<\/([a-zA-Z][a-zA-Z0-9\-]*)>/, tag => {
if (cursor.tag.toLowerCase() !== tag.toLowerCase()) {
throw new Error("Unmatched close tag");
}
run = false;
})
// Parse a text node
|| pull(/^([^<]+)/, text => {
cursor.children.push({
text
});
});
This gets us past our second test which looks like this:
<p>
<b>bold content</b>
</p>
The parsed tree for that test looks like this:
[{ tag: 'p', children: [
{ text: "\n\t" },
{ tag: 'b', children: [
{ text: "bold content" }
]},
{ text: "\n" }
]}]
Attributes
While we could probably extend our open tag regular expression to accommodate attributes, that might require a variable number of captures so I’d rather put it in its own function. We’ll need to adjust our open tag handler to not require a closing ‘>’:
pull(/^<([a-zA-Z][a-zA-Z0-9\-]*)/, tag => {
const new_tag = { tag, attributes: {}, children: [] };
cursor.children.push(new_tag);
parse_attributes(new_tag);
parse_content(new_tag);
})
And then we’ll write a parse_attributes
function. (Please excuse
the wacky while loop):
function parse_attributes(cursor) {
while(pull(/^\s+([a-zA-Z][a-zA-Z0-9\-]+)="([^"]*)"/, (
name,
value
) => {
cursor.attributes[name] = value;
})) {}
if (!pull(/^\s*>/)) {
throw new Error("Malformed open tag");
}
}
Comment Nodes
We won’t go into parsing script tags, style tags, or cdata, but let’s add comments real quick. We’ll put another regular expression between open tags and closing tags like so:
// Parse a comment node:
|| pull(/^<!--((?:[^-]|-(?!->))*)-->/, comment => {
cursor.children.push({
comment
})
})
Void Tags
The last thing we’ll do for our little parser is to correctly handle void
tags. Tags like img
and source
cannot have children
and can’t have a closing tag. For them we’ll just skip parsing content. I
found a
list of void tags
that we’ll use:
const VOID_TAGS = [
// List from: https://riptutorial.com/html/example/4736/void-elements
'area',
'base',
'br',
'hr',
'img',
'input',
'link',
'meta',
'param',
'command',
'keygen',
'source'
];
// Parse an open tag
const success = pull(/^<([a-zA-Z][a-zA-Z0-9\-]*)/, tag => {
const new_tag = { tag, attributes: {}, children: [] };
cursor.children.push(new_tag);
parse_attributes(new_tag);
if (!VOID_TAGS.includes(tag.toLowerCase())) {
parse_content(new_tag);
}
})
Conclusion
Congratulations! You just built a recursive descent parser for a simplified version of HTML!
The moral of this story is that parsing non-regular languages is a pretty small step once you understand regular languages. The first time I heard “You can’t parse HTML with regular expressions” I thought I must need some completely different thing. I think there’s value in a tutorial that is a minimal extension on top of regular expressions and HTML is not syntactically complex.
I wouldn’t recommend using this parser in anything real, but for a little project it might work. I’d recommend adding helpful error messages at the very least.
Links