
Extending Parsedown: attributes

Markdown Extra (including Parsedown Extra) allows for attributes to be applied to certain elements: headers, fenced code blocks, links, and images. I'd like to be able to apply them to any element. I'm going to use the syntax of Python Markdown attributes, but the attribute lists go before the elements. For block level elements, they go on their own line before the element.

My attribute lists start with {: (not just {) and end with }. Anything that would be legal in HTML is fine, since I didn't write my own parser; I just used DOMDocument. There are three special cases:

  • .foo is changed to class="foo". Note that this is different from the Python code, which appends class names that start with a period. Repeated attribute names in actual HTML are ignored, so to use two classes, use class="foo bar", not .foo .bar.
  • #foo is changed to id="foo".
  • Two letters alone are changed to lang=xx, since I use that attribute so much.
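
A minimal sketch of how the three special cases could be expanded, assuming the attribute list has already been isolated as a string. The function names expandShorthand and expandAttributeList are hypothetical and are not taken from the actual code, which hands the result to DOMDocument.

```php
<?php
// Hypothetical helper: expand one whitespace-separated token of an attribute list.
function expandShorthand(string $token): string {
    if (preg_match('/^\.([\w-]+)$/', $token, $m)) {
        return 'class="' . $m[1] . '"';              // .foo -> class="foo"
    }
    if (preg_match('/^#([\w-]+)$/', $token, $m)) {
        return 'id="' . $m[1] . '"';                 // #foo -> id="foo"
    }
    if (preg_match('/^[a-z]{2}$/i', $token)) {
        return 'lang="' . strtolower($token) . '"';  // two letters alone -> lang
    }
    return $token; // anything else is passed through as ordinary HTML attribute text
}

// Hypothetical helper: turn "{: ... }" into a string of HTML attributes.
function expandAttributeList(string $list): string {
    if (!preg_match('/^\{:\s*(.*?)\s*\}$/', $list, $m)) {
        return ''; // not an attribute list (it must start with "{:" and end with "}")
    }
    // Split on whitespace, but keep a quoted value attached to its name.
    preg_match_all('/[^\s"\']+(?:"[^"]*"|\'[^\']*\')?/', $m[1], $tokens);
    return implode(' ', array_map('expandShorthand', $tokens[0]));
}

echo expandAttributeList('{: .note #intro he title="a b"}'), "\n";
```

Because each shorthand becomes its own attribute, writing .foo .bar would produce two class attributes, which is why class="foo bar" is needed for two classes.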


Extending Parsedown: Block elements

Continuing work on extending Parsedown.

See the actual code.

Adding block-level elements is not much different from adding inline elements. Kavanot.name originally used <footer> elements to indicate the source of a block quote:

<blockquote>
  Do or do not. There is no try.
  <footer>Yoda</footer>
</blockquote>

While marking blockquotes this way is now acceptable, for a long time it wasn't, and the recommended way was with <figure> and <figcaption>. KavanotParsedown uses the latter model:

<figure>
  <blockquote>
    Do or do not. There is no try.
  </blockquote>
  <figcaption>Yoda</figcaption>
</figure>

But I start with creating a <footer> and then modifying the DOM. So I want to have a block element that I will indicate with "--" at the start of the line.

function __construct(){
  $this->BlockTypes['-'][] = 'Source'; // only line needed to indicate a block level element
  // ... rest of the constructor
}

protected function blockSource($Line, $Block = null){
  if (preg_match('/^--[ ]*(.+)/', $Line['text'], $matches)) {
    return array(
      'element' => array(
        'name' => 'footer',
        'handler' => array(
          'function' => 'lineElements',
          'argument' => $matches[1],
          'destination' => 'elements'
        ),
        'attributes' => array('class' => 'source') // for styling, add a class automatically
      )
    );
  }
}

so

>Do or do not. There is no try.
--Yoda

becomes

<blockquote>
    Do or do not. There is no try.
  <footer class="source" >Yoda</footer>
</blockquote>

I realized that I might want to add an attribution to an image as well, without it being in a <blockquote>, as

<p>
  <img src=/blog/blogfiles/pdf/smiley.png alt="Smile!"/>
  <footer class="source" >Some file I found on the web</footer>
</p>

But as it stands,

![Smile!](/blog/blogfiles/pdf/smiley.png)
--Some file I found on the web

doesn't work; the <p> ends before the <footer> starts:

<p>
  <img src=/blog/blogfiles/pdf/smiley.png alt="Smile!"/>
</p>
<footer class="source" >Some file I found on the web</footer>

so we need to check that the previous block wasn't a paragraph. If it was, then parse this line and add it to the paragraph as an internal element:

protected function blockSource($Line, $Block = null){
  if (preg_match('/^--[ ]*(.+)/', $Line['text'], $matches)) {
    if ($Block && $Block['type'] === 'Paragraph'){
      $Block['element']['handler']['argument'] .= "\n".$this->element($this->blockSource($Line)['element']);
      return $Block;
    }
    return array(
      'element' => array(
        'name' => 'footer',
        'handler' => array(
          'function' => 'lineElements',
          'argument' => $matches[1],
          'destination' => 'elements'
        ),
        'attributes' => array('class' => 'source') // for styling, add a class automatically
      )
    );
  }
}

and now it works, except that the footer is a child of the <p> instead of the <blockquote>. We'll have to fix that.
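
One way the fix might look, as a sketch only: after rendering, find any source footer that ended up inside a <p> within a <blockquote> and move it up to the blockquote. The function name hoistSourceFooters is hypothetical, and the example loads a small well-formed fragment with loadXML purely to keep the demonstration predictable.

```php
<?php
// Hypothetical post-processing step, not the author's actual code.
function hoistSourceFooters(DOMDocument $dom): void {
    $xpath = new DOMXPath($dom);
    // Copy the matches into an array before moving any of them.
    $footers = iterator_to_array(
        $xpath->query('//blockquote//p/footer[@class="source"]')
    );
    foreach ($footers as $footer) {
        // Walk up to the nearest enclosing <blockquote>.
        $ancestor = $footer->parentNode;
        while ($ancestor !== null && $ancestor->nodeName !== 'blockquote') {
            $ancestor = $ancestor->parentNode;
        }
        if ($ancestor !== null) {
            // appendChild on a node that is already in the tree moves it.
            $ancestor->appendChild($footer);
        }
    }
}

$dom = new DOMDocument();
$dom->loadXML(
    '<blockquote><p>Do or do not. There is no try.'
    . '<footer class="source">Yoda</footer></p></blockquote>'
);
hoistSourceFooters($dom);
echo $dom->saveXML($dom->documentElement), "\n";
```

A footer in a plain paragraph (the image case) is left alone, since the query only matches footers that have a <blockquote> ancestor.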

Extending Parsedown: Inline elements

Extending Parsedown involves adding elements to the $InlineTypes and $BlockTypes arrays, then adding methods to handle them.

See the actual code.

Italics

I use <i> a lot, to indicate transliterated words. So I could use "/" to indicate that:
/Shabbat/ is a Hebrew word becomes <i>Shabbat</i> is a Hebrew word. To do that:

class myParsedown extends Parsedown{
  function __construct(){
    $this->InlineTypes['/'] []= 'Italic';
    // after adding all the new inline types, create the list of characters
    $this->inlineMarkerList = implode ('', array_keys($this->InlineTypes));
    // allow the character to be escaped by '\'
    $this->specialCharacters []= '/';
  }

  protected function inlineItalic($excerpt){
    if (preg_match('#^/(.+?)/#', $excerpt['text'], $matches)) {
      return array(
        'extent' => strlen($matches[0]), 
        'element' => array(
          'name' => 'i',
          'handler' => array(
            'function' => 'lineElements',
            'argument' => $matches[1],
            'destination' => 'elements'
          )
        )
      );
    }
  }
}

Now, my transliterated words are almost always Hebrew, so I can automatically add the lang=he attribute:


  protected function inlineItalic($excerpt){
    if (preg_match('#^/(.+?)/#', $excerpt['text'], $matches)) {
      return array(
        'extent' => strlen($matches[0]), 
        'element' => array(
          'name' => 'i',
          'handler' => array(
            'function' => 'lineElements',
            'argument' => $matches[1],
            'destination' => 'elements'
          ),
          'attributes' => array('lang' => 'he') // Add attributes
        )
      );
    }
  }

and now /Shabbat/ is a Hebrew word becomes <i lang=he>Shabbat</i> is a Hebrew word.
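
To see what the pattern captures and what 'extent' ends up being, here is a small standalone sketch that runs the same regular expression outside Parsedown. The function italicExtent is hypothetical. It assumes, as the handlers above do, that the text passed in starts at the marker character.

```php
<?php
// Hypothetical helper: apply the inlineItalic pattern to text that starts at a "/".
function italicExtent(string $text): ?array {
    if (preg_match('#^/(.+?)/#', $text, $matches)) {
        return array(
            'extent' => strlen($matches[0]), // characters consumed, slashes included
            'inner'  => $matches[1],         // text that would go inside <i>
        );
    }
    return null; // no closing slash, so the "/" stays literal
}

var_dump(italicExtent('/Shabbat/ is a Hebrew word'));
// The lazy (.+?) stops at the first closing slash, so two unrelated
// slashes in one line would also pair up:
var_dump(italicExtent('/or and yes/no'));
var_dump(italicExtent('/ unmatched'));
```

The second call shows a side effect of choosing "/" as a marker: text such as either/or and yes/no contains a matching pair of slashes, so those slashes need the backslash escape that the constructor enables.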

Cite

I also use the <cite> element. I'm running out of single characters to indicate elements, so I'm going to redefine "_". I don't need two different markers for <em>.

  function __construct(){
    $this->InlineTypes['_'] = ['Cite']; // redefinition; I am replacing the old array (which was ['Emphasis'])
    // ... rest of the constructor as above
  }

  protected function inlineCite($excerpt){
    if (preg_match('#^_(.+?)_#', $excerpt['text'], $matches)) {
      return array(
        'extent' => strlen($matches[0]), 
        'element' => array(
          'name' => 'cite',
          'handler' => array(
            'function' => 'lineElements',
            'argument' => $matches[1],
            'destination' => 'elements'
          )
        )
      );
    }
  }

And now _A Tale of Two Cities_ becomes <cite>A Tale of Two Cities</cite>

String Replacement in PHP

Working with Parsedown, I want to do string manipulation, but only in certain parts of the text: for instance, on text that is not inside HTML tags or not inside quotes. The right way to do that is with a real parser. The easy way is to remove the unwanted strings, replace them with a marker that won't come up in normal text, do the manipulation, then replace the markers with the originals. (It is the restoring step that requires "a marker that won't come up in normal text"; you don't want to replace text that was present in the original.)

I would use a marker that can't be typed but still is legal HTML; turns out that U+FFFC (OBJECT REPLACEMENT CHARACTER) is perfect for that. So I made a pair of functions, `StringReplace\remove` and `StringReplace\restore`, to make that easy.

StringReplace\remove ($re, $target)
Any string that matches the regular expression $re in $target is replaced by a numbered marker: the number between two U+FFFC characters. The new string is returned. So for instance,

$rawtext = StringReplace\remove ('#</?[^>]*>#', $html);

will replace every tag with a marker.

StringReplace\restore ($target)
Returns a string with the markers replaced by their original versions.

The code

namespace StringReplace;

define ('OBJECT_REPLACEMENT_CHARACTER', "\xEF\xBF\xBC"); // U+FFFC encoded as UTF-8
define ('RE_REPLACEMENT', '/'.OBJECT_REPLACEMENT_CHARACTER.'(\d+)'.OBJECT_REPLACEMENT_CHARACTER.'/');

$strings = array();

$remover = function ($matches){
  global $strings;
  $strings []= $matches[0];
  return OBJECT_REPLACEMENT_CHARACTER.count($strings).OBJECT_REPLACEMENT_CHARACTER;
};

$replacer = function ($matches){
  global $strings;
  return $strings[$matches[1]-1];
};

function remove ($re, $target){
  global $remover;
  return preg_replace_callback ($re, $remover, $target);
}

function restore ($target){
  global $replacer;
  return preg_replace_callback (RE_REPLACEMENT, $replacer, $target);
}
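
For a self-contained illustration of the remove/manipulate/restore cycle, here is a sketch of the same idea written as a small class so that it needs no globals. The class name MarkerStash is hypothetical and is not part of the StringReplace code above; only the technique is the same.

```php
<?php
// Hypothetical class-based variant of the remove/restore pair.
final class MarkerStash {
    private const MARK = "\xEF\xBF\xBC"; // U+FFFC encoded as UTF-8
    private array $strings = [];

    // Replace every match of $re with a numbered marker and remember the original.
    public function remove(string $re, string $target): string {
        return preg_replace_callback($re, function (array $m): string {
            $this->strings[] = $m[0];
            return self::MARK . count($this->strings) . self::MARK;
        }, $target);
    }

    // Put the remembered strings back in place of their markers.
    public function restore(string $target): string {
        $re = '/' . self::MARK . '(\d+)' . self::MARK . '/';
        return preg_replace_callback($re, function (array $m): string {
            return $this->strings[(int) $m[1] - 1];
        }, $target);
    }
}

$stash = new MarkerStash();
$html = '<p title="big cat">a cat sat</p>';
$text = $stash->remove('#</?[^>]*>#', $html); // tags become markers
$text = str_replace('cat', 'dog', $text);     // only touches text outside the tags
$result = $stash->restore($text);
echo $result, "\n"; // <p title="big cat">a dog sat</p>
```

The "cat" inside the title attribute survives because the whole tag was hidden behind a marker while the replacement ran.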

Extending Parsedown

I've been spending all my intellectual free time on working on my Kavanot site, so I haven't been doing any independent programming. But that site uses raw HTML, which is a pain to type. So I decided to start using Markdown to make writing easier. After a little trial and error, I decided to use Parsedown with Parsedown Extra.

See the code.


H&R Block Updating Errors

(It's been almost a year since I've posted. My intellectual life is busy with other things)

I use H&R Block software to prepare my taxes and have been generally very happy, but this year it would not update the program after January 1. Using the automatic update downloaded the updater but it silently failed. Manually downloading from http://www.hrblock.com/tax-software/updates-back-editions/federal-windows.html then running that would run WinZip to unpack it, then silently fail. The tech support at H&R Block was useless.

Some Google-ing led to this solution:

  1. Run the update installer from the hrblock.com site as though it was going to work
  2. Wait for it to fail
  3. Find the unpacked downloaded installer:
    1. Windows-R key then type %temp% to open the temporary download folder
    2. Find the most recent folder (it will have a GUID-type name, like {deaf-face-1234}). Open it
    3. There should be an msp file like H&R Block Deluxe 2016 Update.msp
  4. Double click that to run it
  5. Run H&R Block

That's all it took! Obviously the unpacking worked correctly, so I don't know why the program wouldn't run the actual installer.

‘ob_gzhandler’ conflicts with ‘zlib output compression’

Nearly Free Speech has been a great hosting service, and they upgrade the stack consistently, which usually doesn't cause problems. But with the most recent upgrade, I started getting the above error. Looks like they turned on compression at the server level, so doing it on each page is redundant.

Evidently they announced it on the blog, but I never keep up with that.

Changing all my ob_start('ob_gzhandler'); to ob_start(); fixes it. Hope this helps someone else.
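
If the same code also has to run on a host that does not compress at the server level, a hedged alternative is to check the zlib.output_compression setting at run time and only use ob_gzhandler when it is off. This is a sketch, and the function name chooseOutputCallback is hypothetical.

```php
<?php
// Hypothetical helper: pick the output callback based on the server's configuration.
function chooseOutputCallback(): ?string {
    $setting = ini_get('zlib.output_compression'); // false if the option is unknown
    $serverCompresses = $setting !== false
        && !in_array(strtolower(trim($setting)), ['', '0', 'off'], true);
    if ($serverCompresses || !function_exists('ob_gzhandler')) {
        return null; // plain buffering; compressing again would conflict
    }
    return 'ob_gzhandler';
}

ob_start(chooseOutputCallback());
echo "page content\n";
```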

Amazon Wish List Hack on github

I've put an Amazon Wishlist Widget for WordPress on my github site, that uses the techniques described before. You can see it running on the sidebar here.

Getting multiple pages in the Amazon Wish List

I figured out how to get all the pages from screen-scraping the Amazon wish list. Basically, look for the "Next" button (it's in a <li class=a-last> element). If that element is present, look for the next page.

function getwishlistitems ($listID, $page=1){
	// ignore parsing warnings
	$wishlistdom = new DOMDocument();
	@$wishlistdom->loadHTMLFile("http://www.amazon.com/gp/registry/wishlist/$listID?disableNav=1&page=$page");
	$wishlistxpath = new DOMXPath ($wishlistdom);
	$items = iterator_to_array($wishlistxpath->query("//div[starts-with(@id,'item_')]"));
	if ($wishlistxpath->evaluate("count(//li[@class='a-last'])")) { // this is the "Next->" button
		$items = array_merge($items, getwishlistitems($listID, $page+1));
	}
	return $items;
}

Note that this creates a complication: the array of items now includes nodes from different documents, so you can't use one saved DOMXPath. Instead, where the original code has $wishlistxpath->evaluate($xpath, $node), use

(new DOMXPath($node->ownerDocument))->evaluate($xpath, $node);
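
Here is a small standalone sketch of that situation. Two short, made-up HTML strings stand in for two wish list pages; the markup and the function name titleOf are assumptions for illustration only.

```php
<?php
// Made-up markup standing in for two separately loaded wish list pages.
$pages = [
    '<html><body><div id="item_1"><a id="itemName_1">First</a></div></body></html>',
    '<html><body><div id="item_2"><a id="itemName_2">Second</a></div></body></html>',
];

// Collect item nodes from every page into one array.
$items = [];
foreach ($pages as $html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $found = $xpath->query("//div[starts-with(@id,'item_')]");
    $items = array_merge($items, iterator_to_array($found));
}

// Hypothetical helper: build the DOMXPath from whichever document owns the node.
function titleOf(DOMNode $item): string {
    $xpath = new DOMXPath($item->ownerDocument);
    return $xpath->evaluate("string(.//a[starts-with(@id,'itemName')])", $item);
}

$titles = array_map('titleOf', $items);
print_r($titles);
```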

Hope this helps someone.

Hacking My Way Again to an Amazon Wishlist Widget

Amazon long ago eliminated its API for getting wishlists. Four years ago I made a screen-scraping WordPress widget to display my wishlist. Unfortunately, as happens with screen-scraping, Amazon changed their format and URLs. And now I can't seem to get the ItemLookup API to work either.

doitlikejustin has a vanilla PHP wishlist scraper, but PHP 5 now has its own HTML parser in DOMDocument, so I implemented my own.

The wishlist page has a simple structure. All links to Amazon products have "dp/{ASIN}" as part of the URL, where {ASIN} is the Amazon ID number. Each individual item is contained in a <div> whose id starts with "item_", and the title is in a link whose id starts with "itemName". The image and author list are in consistent positions relative to those. Other advertisements for Amazon products that you see on the page are added with JavaScript, so they won't show up when we grab the page with PHP.

Image URLs have the format "http://ecx.images-amazon.com/images/I/{idcode}._SL{size}.jpg" (with possibly some extra parameters before the "SL"). I just pull the relevant idcode out and create my own URL with the desired size.

function wishlist($listID){
	$size = 100;
	$ret = array();
	$wishlistdom = new DOMDocument();
	// ignore parsing warnings
	@$wishlistdom->loadHTMLFile("http://www.amazon.com/gp/registry/wishlist/$listID?disableNav=1");
	$wishlistxpath = new DOMXPath ($wishlistdom);
	// I want to be able to limit and rearrange the list, so I turn it into an array
	$items = iterator_to_array($wishlistxpath->query("//div[starts-with(@id,'item_')]"));
	// filter $items as desired, then pull out the data
	foreach ($items as $item){
		$link = $wishlistxpath->evaluate(".//a[starts-with(@id, 'itemName')]", $item)->item(0);
		$href = $link->attributes->getNamedItem('href')->nodeValue;
		if (preg_match ('|/dp/\w+|', $href, $matches)){
			$href = "http://amazon.com$matches[0]"; // simplify the URL
		}else{
			$href = "http://amazon.com$href";
		}
		$title = $link->textContent;
		$author = $link->parentNode->nextSibling->textContent;
		$image = $wishlistxpath->query(".//img", $item)->item(0)->attributes->getNamedItem('src')->nodeValue;
		if (preg_match ('|http://ecx.images-amazon.com/images/I/[^.]+|', $image, $matches)){
			$image = $matches[0]."._SL$size.jpg";
		}else{
			$image = "http://ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._SL{$size}_.jpg";
		}
		$image = "<img src='$image' alt='$title'><br/>";
		$ret[] = "<a href='$href'>$image$title<br/>$author</a>";
	}
	return $ret;
}
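
The image URL step can be looked at on its own. The sketch below isolates it in a hypothetical function, resizeAmazonImage, and runs it on a made-up idcode; the URL shape is the one described above, and nothing here is checked against what Amazon currently serves.

```php
<?php
// Hypothetical helper: keep the idcode part of the image URL and request another size.
function resizeAmazonImage(string $src, int $size): ?string {
    // Capture everything up to (not including) the first dot after the idcode.
    $pattern = '|^(https?://ecx\.images-amazon\.com/images/I/[^.]+)\.|';
    if (preg_match($pattern, $src, $m)) {
        return "{$m[1]}._SL{$size}.jpg";
    }
    return null; // not a recognized image URL; the caller picks a fallback
}

// Made-up idcode, with an extra parameter before the "SL" part.
$src = 'http://ecx.images-amazon.com/images/I/41abcDEF12L._SX35_SL500_.jpg';
echo resizeAmazonImage($src, 100), "\n";
```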

Now this only gets the first page (25 items) of a wish list. I modified it to allow finding all the items on a wish list.