If a site doesn't provide an API, you can still access its data through the raw HTML. I was just thinking: what if you could create a generic component that gathers data out of HTML pages, quickly normalizes it, and wraps a REST endpoint on top of it (at another domain)? Would this be of any value? A hacky way to RESTify legacy websites?
I think the data in an HTML page is often so badly organized that this would be a difficult task, but if you could provide this as a service, perhaps you could charge per REST call.
Any thoughts from fellow hackers?
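To make the idea a bit more concrete, here is a rough sketch of the "gather and normalize" part, using only the Python standard library. It assumes the data sits in a plain HTML table whose first row holds the column names and whose cells have explicit closing tags. The names TableScraper, table_to_records and SAMPLE_PAGE are made up for illustration; a real page would need its own rules.

```python
import json
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text chunks of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so layout noise does not leak out.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Use the first row as field names and return one dict per later row."""
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    if not scraper.rows:
        return []
    header = [name.lower().replace(" ", "_") for name in scraper.rows[0]]
    return [dict(zip(header, row)) for row in scraper.rows[1:]]


# Hypothetical page fragment standing in for a downloaded legacy page.
SAMPLE_PAGE = """
<html><body><table>
  <tr><th>Name</th><th>Unit Price</th></tr>
  <tr><td>Widget</td><td> 9.99 </td></tr>
  <tr><td>Gadget &amp; Co</td><td>19.50</td></tr>
</table></body></html>
"""

records = table_to_records(SAMPLE_PAGE)
body = json.dumps(records)  # what a REST endpoint could send back
```

The JSON string in body is the kind of payload the REST wrapper would return. The hard part in practice is probably not this code but writing and maintaining the per-site extraction rules.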
In the end I came up with a shell script using the following UNIX/Cygwin tools:
1.) curl to download the HTML page to a file;
2.) iconv to convert the HTML to UTF-8 if it was in a different encoding (which was the case once);
3.) tidy -asxml -numeric -utf8 to convert the HTML page to XML;
4.) xmlstarlet (http://xmlstar.sourceforge.net) with the sel command and a bunch of XPath expressions to extract the data I needed from the page and pipe it to other Unix tools. Watch out: the data retrieved with xmlstarlet may contain XML-escaped characters, so I run it through "xmlstarlet unesc".
This approach worked pretty well for me.
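For anyone without those tools at hand, the encoding, XPath and unescape steps can be roughly mirrored with the Python standard library. This is a swapped-in sketch, not the shell script itself: it assumes the page has already been converted to well-formed XHTML (the job tidy does above) and that the root element declares the XHTML namespace, which is why the XPath expression needs a prefix. The names to_utf8, extract_links and SAMPLE are invented for the example.

```python
import html
import xml.etree.ElementTree as ET

# Namespace that XHTML documents declare on their root element.
XHTML = {"x": "http://www.w3.org/1999/xhtml"}


def to_utf8(raw, source_encoding):
    """Re-encode bytes from source_encoding to UTF-8 (the iconv step)."""
    return raw.decode(source_encoding).encode("utf-8")


def extract_links(xhtml_bytes):
    """Return (href, text) pairs for every <a href=...> in the document."""
    root = ET.fromstring(xhtml_bytes)
    links = []
    for a in root.iterfind(".//x:a[@href]", XHTML):
        text = "".join(a.itertext()).strip()
        # The parser already decodes one level of escaping; html.unescape
        # only matters for text that was escaped twice in the source page.
        links.append((a.get("href"), html.unescape(text)))
    return links


# Stand-in for a downloaded page after conversion to XHTML, in Latin-1.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><a href="/cafe">Caf\u00e9 &amp;amp; Bar</a></p>'
    '<p><a href="/faq">FAQ</a></p>'
    '</body></html>'
).encode("latin-1")

links = extract_links(to_utf8(SAMPLE, "latin-1"))
```

Forgetting the namespace prefix is an easy mistake here: without it, the XPath expression silently matches nothing.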