Archive for April 5th, 2016

Amazon long ago eliminated its API for getting wishlists. Four years ago I made a screen-scraping WordPress widget to display my wishlist. Unfortunately, as happens with screen-scraping, Amazon has since changed its page format and URLs. And now I can't seem to get the ItemLookup API to work either.

doitlikejustin has a vanilla PHP wishlist scraper, but PHP 5 now has its own HTML parser in DOMDocument, so I implemented my own scraper on top of that.

The wishlist page has a simple structure. Every link to an Amazon product has "dp/{ASIN}" as part of the URL, where {ASIN} is the Amazon ID number. Each individual item is contained in a <div> whose id starts with "item_", and its title is in a link whose id starts with "itemName". The image and author list are in consistent positions relative to those. The other advertisements for Amazon products that you see on the page are added with JavaScript, so they won't show up when we grab the page with PHP.
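To make that structure concrete, here is a minimal sketch that runs the same XPath queries against a small inline fragment instead of the live page. The markup, the ASINs and the titles are invented for illustration and are not real Amazon output; only the id prefixes and the "dp/{ASIN}" pattern come from the description above.

```php
<?php
// Hypothetical fragment imitating the structure described above; not real Amazon markup.
$html = <<<HTML
<div id="item_I1ABC">
  <h5><a id="itemName_I1ABC" href="/dp/B000EXAMPLE/ref=wl_it_dp_o">Example Book</a></h5>
</div>
<div id="sidebar"><a href="/dp/B000IGNORED">Not a wishlist item</a></div>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from imperfect markup
$xpath = new DOMXPath($dom);

$asins = array();
foreach ($xpath->query("//div[starts-with(@id,'item_')]") as $item) {
    // The title link is the <a> whose id starts with "itemName".
    $link = $xpath->query(".//a[starts-with(@id,'itemName')]", $item)->item(0);
    if ($link === null) {
        continue; // container without a title link
    }
    // Capture the ASIN that follows "/dp/" in the href.
    if (preg_match('|/dp/(\w+)|', $link->getAttribute('href'), $m)) {
        $asins[$m[1]] = trim($link->textContent);
    }
}
print_r($asins);
```

The link in the sidebar <div> is skipped because only containers whose id starts with "item_" are searched.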

Image URLs have the format "http://ecx.images-amazon.com/images/I/{idcode}._SL{size}.jpg" (with possibly some extra parameters before the "SL"). I just pull the relevant idcode out and create my own URL with the desired size.
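As a standalone sketch of that rewrite, the helper below pulls out everything up to the idcode and appends a new size suffix. The function name resize_image_url and the sample idcode are made up for illustration, and accepting https as well as http is my assumption rather than something the page is known to use.

```php
<?php
// Hypothetical helper: rebuild an Amazon image URL at the requested size.
function resize_image_url($src, $size) {
    // Capture the URL up to the first dot after /images/I/, i.e. up to the end of the idcode.
    if (preg_match('|^(https?://ecx\.images-amazon\.com/images/I/[^.]+)\.|', $src, $m)) {
        return $m[1] . "._SL{$size}.jpg";
    }
    return null; // unrecognised format; the caller picks a fallback image
}

// Made-up idcode with extra parameters after the first dot.
echo resize_image_url('http://ecx.images-amazon.com/images/I/51AbCdEfGhL._SL500_SS135_.jpg', 100), "\n";
```

Returning null for an unrecognised URL leaves the choice of placeholder image to the caller.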

function wishlist($listID){
	$size = 100;
	$ret = array();
	$wishlistdom = new DOMDocument();
	// ignore parsing warnings
	@$wishlistdom->loadHTMLFile("http://www.amazon.com/gp/registry/wishlist/$listID?disableNav=1");
	$wishlistxpath = new DOMXPath ($wishlistdom);
	// I want to be able to limit and rearrange the list, so I turn it into an array
	$items = iterator_to_array($wishlistxpath->query("//div[starts-with(@id,'item_')]"));
	// filter $items as desired, then pull out the data
	foreach ($items as $item){
		$link = $wishlistxpath->evaluate(".//a[starts-with(@id, 'itemName')]", $item)->item(0);
		if ($link === null) continue; // skip containers that have no title link
		$href = $link->attributes->getNamedItem('href')->nodeValue;
		if (preg_match ('|/dp/\w+|', $href, $matches)){
			$href = "http://amazon.com$matches[0]"; // simplify the URL
		}else{
			$href = "http://amazon.com$href";
		}
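		// escape the title so quotes or angle brackets can't break the generated HTML
		$title = htmlspecialchars($link->textContent, ENT_QUOTES);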
		$author = $link->parentNode->nextSibling->textContent;
		$image = $wishlistxpath->query(".//img", $item)->item(0)->attributes->getNamedItem('src')->nodeValue;
		if (preg_match ('|http://ecx.images-amazon.com/images/I/[^.]+|', $image, $matches)){
			$image = $matches[0]."._SL$size.jpg";
		}else{
			$image = "http://ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._SL{$size}_.jpg";
		}
		$image = "<img src='$image' alt='$title'><br/>";
		$ret[] = "<a href='$href'>$image$title<br/>$author</a>";
	}
	return $ret;
}

Now this only gets the first page (25 items) of a wish list. I modified it to allow finding all the items on a wish list.
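The modified version isn't reproduced here, but one way to walk every page can be sketched as follows. This is not the actual modification: the function name wishlist_all_items, the $loadPage callable and the $maxPages cap are all hypothetical, and the idea that further pages are reachable through a "page" query parameter is an assumption I have not verified.

```php
<?php
// Collect the id of every item <div> across pages.
// $loadPage is a hypothetical callable: given a page number, it returns that page's HTML.
function wishlist_all_items(callable $loadPage, $maxPages = 20) {
    $all = array();
    $seen = array();
    for ($page = 1; $page <= $maxPages; $page++) {
        $html = $loadPage($page);
        if (!is_string($html) || $html === '') {
            break; // fetch failed or returned nothing
        }
        $dom = new DOMDocument();
        @$dom->loadHTML($html); // ignore parsing warnings
        $xpath = new DOMXPath($dom);
        $new = 0;
        foreach ($xpath->query("//div[starts-with(@id,'item_')]") as $item) {
            $id = $item->getAttribute('id');
            if (!isset($seen[$id])) {
                $seen[$id] = true;
                $all[] = $id;
                $new++;
            }
        }
        if ($new === 0) {
            break; // empty page, or a repeat of a page already seen
        }
    }
    return $all;
}

// Usage with canned pages, so the sketch runs without touching the network.
$fakePages = array(
    1 => '<div id="item_A"></div><div id="item_B"></div>',
    2 => '<div id="item_C"></div>',
);
$loader = function ($page) use ($fakePages) {
    return isset($fakePages[$page]) ? $fakePages[$page] : '<html><body></body></html>';
};
print_r(wishlist_all_items($loader));
```

A real loader might fetch the wishlist URL with an extra page parameter appended (again, an assumption), and the loop body would pull out the same link, author and image fields as wishlist() instead of just the ids. Stopping when a page yields nothing new guards against a site that keeps serving the last page for out-of-range page numbers.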