Hacking at 0300

Extending Parsedown

All my intellectual free time has been going into my Kavanot site, so I haven't been doing any independent programming. But that site uses raw HTML, which is a pain to type, so I decided to start writing in Markdown instead. After a little trial and error, I settled on Parsedown with Parsedown Extra.

See the code.

This gives me tables and blockquotes along with simple URLs and <em> and <strong>. But it's not perfect.

(As an aside, tables were a bit of work to figure out. They have to start with | whatever | whatever and the next line has to be the divider, |---|---|, with exactly the same number of cells. Only that number of cells will display, so

| first header | second header 
|--------------|--------------
| first element| second element| third element

will only produce

<table>
<thead>
<tr>
<th>first header</th>
<th>second header</th>
</tr>
</thead>
<tbody>
<tr>
<td>first element</td>
<td>second element</td>
</tr>
</tbody>
</table>

losing that third column. Also, there's no way to eliminate the header entirely, but if the header cells are blank, then the empty <thead> will take minimal space.)

Under the Hood

I wanted to add things that would make my life easier, such as adding language attributes (since I go between English and Hebrew text, with a smattering of Greek and even some Hieroglyphics) and easily entering <cite> and <i> elements.

So that meant looking at the source code. There is a tutorial for creating extensions, but it is not based on the most recent version (which as of this writing is 1.8.0-beta-7), so it's incomplete.

Parsedown has only one useful public method, Parsedown::text($text). It works by breaking the text into lines, then calling linesElements($lines), which iterates over the lines and passes their content on to lineElements($text) (yes, it's confusing to have the only difference being an 's' in the middle of the name). The result is an array of "element"s, each of which is an array of the form:

array(
  'name' => 'tag name',
  'attributes' => array('attribute name' => 'attribute value'),
  'rawHtml' => 'a string of HTML that can optionally be escaped as unsafe',
  // OR
  'text' => 'a string of text that will not be further parsed',
  // OR
  'element' => array('a single "element" array that represents the child of this element'),
  // OR
  'elements' => array(array('an array of "element" arrays that represent all the children of this element')),
  // OR
  'handler' => array('an array that tells Parsedown that further processing is needed')
);

and the 'handler' array is:

array(
  'function' => 'name of method that will parse the text into markup, which will be either the "lineElements" or "linesElements" methods',
  'argument' => 'the text to be passed to "function", which is either a string for "lineElements" or an array of strings for "linesElements"',
  'destination' => 'index to insert the parsed text, which will be one of "rawHtml", "text", "element", or "elements"'
);
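To convince myself I understood the deferred-processing idea, I wrote a small standalone sketch. It is not Parsedown's code: resolveHandler() and the $parsers lookup table are names I made up, the key names inside the handler array are this sketch's own and should be checked against the Parsedown source, and the toy "lineElements" just splits on spaces.

```php
<?php
// Sketch only: resolves a deferred 'handler' entry on an element array.
// $parsers maps a handler name to a callable (hypothetical; Parsedown
// itself calls methods on the parser object).
function resolveHandler(array $element, array $parsers): array
{
    if (!isset($element['handler'])) {
        return $element;                       // nothing deferred
    }

    $handler = $element['handler'];
    $parse = $parsers[$handler['function']];   // 'function' is this sketch's key name

    // Store the parsed result under the requested key, then drop the handler.
    $element[$handler['destination']] = $parse($handler['argument']);
    unset($element['handler']);

    return $element;
}

// Toy inline parser standing in for lineElements(): one plain-text child per word.
$parsers = array(
    'lineElements' => function (string $text): array {
        $children = array();
        foreach (explode(' ', $text) as $word) {
            $children[] = array('text' => $word);
        }
        return $children;
    },
);

$paragraph = array(
    'name' => 'p',
    'handler' => array(
        'function' => 'lineElements',
        'argument' => 'hello world',
        'destination' => 'elements',
    ),
);

$resolved = resolveHandler($paragraph, $parsers);
print_r($resolved);
```

After the call, the 'handler' key is gone and the parsed children sit under 'elements', which is the shape the rendering step expects.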

The method elements(array $Elements) then recursively processes the elements to produce a string of markup.
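The recursion is easier to see in a stripped-down form. The sketch below is my own stand-in, not the library's implementation: renderElement() and renderElements() are hypothetical names, only the 'name', 'attributes', 'text', 'element' and 'elements' keys are handled, and I am assuming that plain text gets HTML-escaped on output.

```php
<?php
// Hypothetical stand-in for the recursive rendering step; not Parsedown's code.
function renderElements(array $elements): string
{
    $markup = '';
    foreach ($elements as $element) {
        $markup .= renderElement($element);
    }
    return $markup;
}

function renderElement(array $element): string
{
    // Work out the content first; one of these keys is expected at most.
    if (isset($element['elements'])) {
        $content = renderElements($element['elements']);
    } elseif (isset($element['element'])) {
        $content = renderElement($element['element']);
    } elseif (isset($element['text'])) {
        $content = htmlspecialchars($element['text'], ENT_QUOTES, 'UTF-8');
    } else {
        $content = null;                       // void element such as <hr />
    }

    // An element without a 'name' contributes only its content.
    if (!isset($element['name'])) {
        return (string) $content;
    }

    $attributes = '';
    foreach ($element['attributes'] ?? array() as $key => $value) {
        $attributes .= ' ' . $key . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
    }

    $open = '<' . $element['name'] . $attributes;
    return $content === null
        ? $open . ' />'
        : $open . '>' . $content . '</' . $element['name'] . '>';
}

$tree = array(
    'name' => 'blockquote',
    'attributes' => array('lang' => 'he'),
    'elements' => array(
        array('name' => 'p', 'text' => 'a < b'),
        array('name' => 'cite', 'text' => 'source'),
    ),
);

echo renderElement($tree), "\n";
```

Each level only needs to know how to wrap its own tag around whatever its children produce, which is why nesting comes for free.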

The Details: Block level elements

Parsing a line consists of looking for a marker of a "block element" as the first character:

 protected $BlockTypes = array(
        '#' => array('Header'),
        '*' => array('Rule', 'List'),
        '+' => array('List'),
        '-' => array('SetextHeader', 'Table', 'Rule', 'List'),
        '0' => array('List'),
        '1' => array('List'),
        '2' => array('List'),
        '3' => array('List'),
        '4' => array('List'),
        '5' => array('List'),
        '6' => array('List'),
        '7' => array('List'),
        '8' => array('List'),
        '9' => array('List'),
        ':' => array('Table'),
        '<' => array('Comment', 'Markup'),
        '=' => array('SetextHeader'),
        '>' => array('Quote'),
        '[' => array('Reference'),
        '_' => array('Rule'),
        '`' => array('FencedCode'),
        '|' => array('Table'),
        '~' => array('FencedCode'),
    );

or no marker, in which case the line becomes either a <p> or a <pre><code> element, depending on whether it is indented. Parsedown then creates a method name of 'block'.$blockType (for instance blockQuote), and calls that with the line to be parsed and the current state of the parser, which is called a "Block" and is an array:

array(
  'type' => 'the name from the array above',
  'element' => array('element array as defined above, for the most recently defined element'),
  'interrupted' => NULL, // or the number of blank lines before the current line. Blank lines separate blocks. It's not clear why he counts them; the only thing that matters is if it is set or not
  'continuable' => TRUE or FALSE, // TRUE if this block automatically continues on the next line, like a <table>, or FALSE if it only spans one line, like an <h1>
  'identified' => TRUE or FALSE, // TRUE if the function is returning the same block or FALSE if a whole new one
  // and other aspects of the state.
);

The function returns NULL if it cannot handle the text, returns the original "Block" array (modified as necessary), or returns a new "Block" array (in that case, the previous "Block" is processed to produce an array of "element"s).
If the "Block" is marked 'continuable', then the method 'block'.$blockType.'Continue' (for instance blockQuoteContinue) is called with the next line. When a "Block" is processed, the method 'block'.$blockType.'Complete' (for instance blockCodeComplete) is called, if the class defines one.

If the handling function returns NULL, the next handler in $BlockTypes[$marker] is called, until the "Block" is handled or the paragraph handler is reached.
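Here is a minimal sketch of that dispatch, as I understand it. TinyBlockParser, parseLine() and the three handlers are all made up for illustration, with a two-entry $BlockTypes; the real handlers take the current "Block" as well and are considerably more involved.

```php
<?php
// Minimal sketch of marker-based dispatch; the class and its handlers are
// illustrative, not Parsedown's implementation.
class TinyBlockParser
{
    protected $BlockTypes = array(
        '#' => array('Header'),
        '-' => array('Rule', 'List'),
    );

    public function parseLine(string $line): array
    {
        $marker = $line === '' ? '' : $line[0];

        // Try each candidate type for this marker until one claims the line.
        foreach ($this->BlockTypes[$marker] ?? array() as $blockType) {
            $method = 'block' . $blockType;        // e.g. blockHeader
            $Block = $this->$method($line);
            if ($Block !== null) {
                $Block['type'] = $blockType;
                return $Block;
            }
        }

        // No handler claimed the line: fall back to a paragraph.
        return array(
            'type' => 'Paragraph',
            'element' => array('name' => 'p', 'text' => $line),
        );
    }

    protected function blockHeader(string $line): ?array
    {
        $level = strspn($line, '#');
        if ($level > 6) {
            return null;
        }
        return array('element' => array('name' => 'h' . $level, 'text' => trim($line, "# \t")));
    }

    protected function blockRule(string $line): ?array
    {
        // A rule here is three or more dashes and nothing else.
        if (!preg_match('/^-{3,}\s*$/', $line)) {
            return null;
        }
        return array('element' => array('name' => 'hr'));
    }

    protected function blockList(string $line): ?array
    {
        if (!preg_match('/^-\s+(.+)$/', $line, $matches)) {
            return null;
        }
        return array('element' => array('name' => 'li', 'text' => $matches[1]));
    }
}

$parser = new TinyBlockParser();
print_r($parser->parseLine('- item'));   // Rule declines, List accepts
```

The order of the names in each $BlockTypes entry matters: '- item' is offered to the Rule handler first and only reaches the List handler because Rule returned NULL.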

Block-level handlers generally create "elements" that have "handler" == "linesElements", and the continuation handlers append the line to the "argument", so processing will continue recursively and elements can nest.
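A rough sketch of that pattern for a blockquote follows. These are free-standing functions I wrote to illustrate the idea, not the library's methods: the handler key names are this sketch's assumption, and I have left out lazy continuation lines (quote text that carries on without a leading '>').

```php
<?php
// Illustrative only: a quote block that defers its body to "linesElements".
function blockQuote(string $line): ?array
{
    if (!preg_match('/^>[ ]?(.*)$/', $line, $matches)) {
        return null;                           // not a quote line
    }
    return array(
        'continuable' => true,
        'element' => array(
            'name' => 'blockquote',
            'handler' => array(
                'function' => 'linesElements', // key name assumed for this sketch
                'argument' => array($matches[1]),
                'destination' => 'elements',
            ),
        ),
    );
}

function blockQuoteContinue(string $line, array $Block): ?array
{
    if (!preg_match('/^>[ ]?(.*)$/', $line, $matches)) {
        return null;                           // the quote has ended
    }
    // Append the stripped line; it is parsed later, when the handler runs.
    $Block['element']['handler']['argument'][] = $matches[1];
    return $Block;
}

$lines = array('> # Title', '> body text', 'not quoted');
$Block = blockQuote(array_shift($lines));
foreach ($lines as $line) {
    $next = blockQuoteContinue($line, $Block);
    if ($next === null) {
        break;
    }
    $Block = $next;
}

print_r($Block['element']['handler']['argument']);
```

The collected lines ('# Title' and 'body text') are still raw Markdown at this point. Because they are later fed back through the block parser, the header inside the quote gets recognized, which is how nesting works.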

The Details: inline elements

Once there are no more markers for block elements, each line is scanned for markers for inline elements. For some reason, the program lists these in two places:

$inlineMarkerList = '!*_&[:<`~\\';
// AND
$InlineTypes = array(
  '!' => array('Image'),
  '&' => array('SpecialCharacter'),
  '*' => array('Emphasis'),
  ':' => array('Url'),
  '<' => array('UrlTag', 'EmailTag', 'Markup'),
  '[' => array('Link'),
  '_' => array('Emphasis'),
  '`' => array('Code'),
  '~' => array('Strikethrough'),
  '\\' => array('EscapeSequence'),
);

where he could have just done


$inlineMarkerList = implode ('', array_keys($InlineTypes));

in the constructor. I would do that for any Parsedown extension.
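Something like the following is what I have in mind. BaseParser is a stand-in with only the two properties modeled, so the sketch runs without Parsedown installed; in a real extension the parent class would be Parsedown or ParsedownExtra, and any parent constructor would need to be called. The '^' marker and the 'Superscript' type are hypothetical.

```php
<?php
// Stand-in for the parent parser; only the two properties discussed are modeled.
class BaseParser
{
    protected $InlineTypes = array(
        '!' => array('Image'),
        '*' => array('Emphasis'),
        '\\' => array('EscapeSequence'),
    );

    protected $inlineMarkerList = '!*\\';
}

class ExtendedParser extends BaseParser
{
    public function __construct()
    {
        // Register a new marker in one place only...
        $this->InlineTypes['^'] = array('Superscript');   // hypothetical inline type

        // ...and rebuild the scan list from the keys so the two cannot drift apart.
        $this->inlineMarkerList = implode('', array_keys($this->InlineTypes));
    }

    public function markers(): string
    {
        return $this->inlineMarkerList;
    }
}

$parser = new ExtendedParser();
echo $parser->markers(), "\n";
```

The rebuilt string lists the markers in array-key order rather than the order of the hand-written list. As far as I can tell that makes no difference, since the list is only used as a set of characters to search for.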

But the handling is similar to that for block elements. For each line, Parsedown scans for any of the characters in $inlineMarkerList, then for each of the types listed for that marker in $InlineTypes, it creates a method name 'inline'.$inlineType (for instance inlineEmphasis) and calls that with the string to be parsed (starting from the marker, ending at the newline). The handler decides whether it wants to handle the text. If not, it returns NULL. If so, it returns an array with two values:

array(
  'extent' => 'number of characters that the handler is consuming',
  'element' => array('element array as defined above')
);

Processing then continues with the rest of the line. Any text not handled is left untouched.
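To check my reading of the scan loop, I sketched it with a single handler. Everything here is my own simplification: scanLine() is a made-up name, only '*' is a marker, and inlineEmphasis() is a free function taking a plain string rather than the library's method.

```php
<?php
// Sketch of the inline scan loop with a single handler; not Parsedown's code.
function inlineEmphasis(string $excerpt): ?array
{
    // Claims "*text*" at the start of the excerpt, otherwise declines.
    if (!preg_match('/^\*([^*]+)\*/', $excerpt, $matches)) {
        return null;
    }
    return array(
        'extent' => strlen($matches[0]),
        'element' => array('name' => 'em', 'text' => $matches[1]),
    );
}

function scanLine(string $text): array
{
    $markerList = '*';
    $inlineTypes = array('*' => array('Emphasis'));
    $elements = array();

    while ($text !== '') {
        // Jump to the next marker character, if there is one.
        $excerpt = strpbrk($text, $markerList);
        if ($excerpt === false) {
            break;
        }
        $position = strlen($text) - strlen($excerpt);

        $Inline = null;
        foreach ($inlineTypes[$excerpt[0]] as $inlineType) {
            $Inline = call_user_func('inline' . $inlineType, $excerpt);
            if ($Inline !== null) {
                break;
            }
        }

        if ($Inline === null) {
            // Unclaimed marker: keep it as plain text and move past it.
            $elements[] = array('text' => substr($text, 0, $position + 1));
            $text = substr($text, $position + 1);
            continue;
        }

        if ($position > 0) {
            $elements[] = array('text' => substr($text, 0, $position));
        }
        $elements[] = $Inline['element'];
        $text = substr($text, $position + $Inline['extent']);
    }

    if ($text !== '') {
        $elements[] = array('text' => $text);   // whatever is left over
    }
    return $elements;
}

print_r(scanLine('a *b* c * d'));
```

The 'extent' value is what lets the loop resume right after the consumed text. The lone '*' near the end is declined by the handler and simply stays in the output as plain text.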

Now I know enough to create some extensions.

This entry was posted by Danny on May 22, 2020 at 5:15 pm under Parsedown, PHP.
