Ask HN: What technology do you use to collect data from an HTML file?

reirob · on Nov 26, 2010

I have had to do this several times: collect data from HTML pages and put it in a small DB for further analysis.

In the end I came up with a shell script using the following UNIX/cygwin tools:

1.) curl to download the HTML page to a file;

2.) iconv to convert the HTML to UTF-8 if it was in a different encoding (which was the case once);

3.) tidy -asxml -numeric -utf8 to convert the HTML page to XML;

4.) xmlstarlet (http://xmlstar.sourceforge.net) with the sel command and a bunch of XPath expressions to extract the data I needed from the page and pipe it to other UNIX tools. Watch out: the data retrieved with xmlstarlet may contain XML-escaped characters, so I run it through "xmlstarlet unesc".

This approach worked pretty well for me.
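
For readers without those tools at hand, the same steps (decode to a known encoding, parse the markup, pull out the cells, unescape entities) can be roughly sketched with Python's standard library. This is not the shell script described above: it swaps tidy and xmlstarlet for html.parser, it walks <td> cells instead of evaluating XPath, and the names CellCollector and extract_rows are made up for illustration. The download step is left out; the function takes the raw bytes of a page that has already been fetched.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, one list per <tr> row."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us,
        # which plays the role of the "xmlstarlet unesc" step.
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:  # tolerate a <td> with no enclosing <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def extract_rows(raw_bytes, source_encoding="utf-8"):
    """Decode a downloaded page and return its table cells as lists of strings.

    source_encoding is an assumption the caller has to supply,
    just as iconv needs to be told the input encoding.
    """
    text = raw_bytes.decode(source_encoding)  # the iconv step
    parser = CellCollector()
    parser.feed(text)
    parser.close()
    # Drop rows that had no <td> cells (for example header-only rows).
    return [[cell.strip() for cell in row] for row in parser.rows if row]
```

Calling extract_rows(page_bytes, "latin-1") on a Latin-1 page returns a list of rows that can be written straight to a CSV file or a small DB. html.parser is forgiving about broken markup, but it does not repair it the way tidy does, so very messy pages may still need a cleanup pass first.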

_5csa · on Nov 26, 2010

This would vary from site to site, and if the HTML of one changes, the "collector" will need to be updated too.

It's a ton of work even for a few sites. Making it generic: yeah, you could do that, but then you'd have to provide unorganised data as a result, which wouldn't be useful at all.

epynonymous · on Nov 27, 2010

that's one of the problems that i considered since html is not that descriptive. i could also see sites purposely changing format just to cause incompatibility if they're not excited about you ripping data from them.

i feel that html should be slightly overhauled to add context to all data, so instead of a plain <td>, the element would say it holds a stock's 52 week high or something along those lines.

i just had a thought, perhaps rss is the answer! do sites still use that? i kind of feel rss is a dead technology.
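
To illustrate the "add context to the data" idea, here is a small sketch of how a collector could look if cells carried a descriptive label. It assumes, purely for illustration, that a site marks its cells with class names such as week52-high; the class names, the LabeledValueCollector class and the extract_labeled function are all hypothetical, and the sample values used with it are made up.

```python
from html.parser import HTMLParser


class LabeledValueCollector(HTMLParser):
    """Map the class name of each labeled <td> to the text it contains."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.values = {}
        self._label = None  # class of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Hypothetical convention: the class attribute names the datum.
            self._label = dict(attrs).get("class")
            if self._label:
                self.values.setdefault(self._label, "")

    def handle_endtag(self, tag):
        if tag == "td":
            self._label = None

    def handle_data(self, data):
        if self._label:
            self.values[self._label] += data


def extract_labeled(html_text):
    """Return {label: text} for every <td> that has a class attribute."""
    parser = LabeledValueCollector()
    parser.feed(html_text)
    parser.close()
    return {label: text.strip() for label, text in parser.values.items()}
```

Because the lookup goes by label and not by position, a page that moves the cell to another row or column still yields the same dictionary, which is the robustness the comment above is asking for. It only helps, of course, if the site chooses to publish such labels and keeps them stable.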

mateo999 · on Nov 26, 2010

open.dapper.net is pretty good for this.

epynonymous · on Nov 27, 2010

wow, thanks matt, this is powerful!