Ask HN: What technology do you use to collect data from an HTML file?
1 point by epynonymous on Nov 26, 2010 | 5 comments
if a site doesn't provide an api, you can still access its data through raw html. i was just thinking, what if you could create a generic component that gathers data out of html pages, quickly normalizes it, and wraps a REST endpoint around it (at another domain)? would this be of any value? a hacky way to RESTify legacy websites?

i think the data in an html page is often so poorly organized that this would be difficult to do, but perhaps if you could provide this service, you could monetize it per REST call.
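
To make the idea concrete, here is a rough sketch of the "REST wrapper" half only, assuming the scraping and normalizing have already happened. The `SCRAPED` dictionary, its sample values, the `/quotes` path and the port are all made up for illustration; nothing here comes from a real site.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for data already scraped and normalized from a legacy page.
SCRAPED = {"quotes": [{"symbol": "ACME", "high_52w": "12.34"}]}


def lookup(path):
    """Map a request path such as /quotes to a (status, JSON body) pair."""
    key = path.strip("/")
    if key in SCRAPED:
        return 200, json.dumps(SCRAPED[key])
    return 404, json.dumps({"error": "not found"})


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the normalized data as JSON; unknown paths get a 404.
        status, body = lookup(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the endpoint on localhost (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/quotes` would return the sample list as JSON. The hard part, filling `SCRAPED` reliably from arbitrary pages, is deliberately left out.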

Any thoughts from fellow hackers?



I have had to collect data from HTML pages several times, putting it into a small DB for further analysis.

In the end I came up with a shell script using the following UNIX/cygwin tools:

1.) curl to download the HTML page to a file;

2.) iconv to convert the HTML to UTF-8 if it was in a different encoding (which was the case once);

3.) tidy -asxml -numeric -utf8 to convert the HTML page to XML;

4.) xmlstarlet (http://xmlstar.sourceforge.net) with the sel command and a bunch of XPath expressions to extract the data I needed from the page and pipe it to other unix tools. Watch out: the data retrieved with xmlstarlet might contain XML-escaped characters, so I run it through "xmlstarlet unesc".

This approach worked pretty well for me.
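
For comparison, the same four steps can be approximated with the Python standard library alone. This is a minimal sketch, not the shell script described above: `urllib` stands in for curl, `bytes.decode` for iconv, and the tolerant `html.parser` replaces both tidy and the XPath step by simply collecting table cells. The names `fetch`, `CellCollector` and `extract_cells` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1 (curl): download the raw bytes of a page."""
    with urlopen(url) as response:
        return response.read()


class CellCollector(HTMLParser):
    """Steps 3-4: walk the tag soup and collect <td> text, one list per row."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; (the "unesc" step).
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:  # tolerate a <td> that appears without a <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def extract_cells(raw_bytes, encoding="utf-8"):
    """Step 2 (iconv): decode to text, then parse and return the non-empty rows."""
    text = raw_bytes.decode(encoding, errors="replace")
    parser = CellCollector()
    parser.feed(text)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows if row]
```

For a page served as Latin-1, something like `extract_cells(fetch(url), "latin-1")` yields a list of rows that can be piped into a small DB. Unlike XPath, this only understands table cells, so anything more selective still needs per-page logic.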


This would vary from site to site, and if the HTML of one changes, the "collector" will need to be updated too.

It's a ton of work even for a few sites. Making it generic: yeah, you could do that, but then you'd have to provide unorganised data as a result, which wouldn't be useful at all.
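
A small sketch of that trade-off, under made-up assumptions: `quotes.example.com`, its `class="high"` cell and the field name `high_52w` are hypothetical. A site-specific collector returns labelled data but breaks when the markup changes, while the generic fallback keeps working and returns only unlabelled text.

```python
import re
from urllib.parse import urlparse


def extract_example_quotes(html):
    """Site-specific rule for a hypothetical layout: <td class="high">12.34</td>."""
    match = re.search(r'<td class="high">([^<]+)</td>', html)
    return {"high_52w": match.group(1).strip()} if match else {}


def extract_generic(html):
    """Generic fallback: every text fragment between tags, with no labels."""
    fragments = re.findall(r">([^<>]+)<", html)
    return {"text": [f.strip() for f in fragments if f.strip()]}


# One entry per supported site; each must be maintained as that site changes.
COLLECTORS = {"quotes.example.com": extract_example_quotes}


def collect(url, html):
    """Pick the collector registered for the URL's host, else the generic one."""
    host = urlparse(url).hostname
    return COLLECTORS.get(host, extract_generic)(html)
```

With the sample row `<tr><td class="high">12.34</td><td>5.67</td></tr>`, the registered host gets `{"high_52w": "12.34"}` and any other host gets `{"text": ["12.34", "5.67"]}`. Rename the class on the site and the specific collector silently returns `{}`.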


that's one of the problems that i considered since html is not that descriptive. i could also see sites purposely changing format just to cause incompatibility if they're not excited about you ripping data from them.

i feel that html should be slightly overhauled to add context to all data, so instead of a bare <td>, a cell would be labelled as a stock's 52-week high or something along those lines.
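
As an illustration of what labelled cells would buy a scraper, here is a sketch that assumes pages annotate values with the HTML microdata `itemprop` attribute. The property name `fiftyTwoWeekHigh` is invented for this example, and the collector is deliberately naive (it does not handle an annotated element nested inside another element of the same tag).

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect the text of elements carrying an itemprop attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.values = {}
        self._prop = None  # property currently being read, if any
        self._tag = None   # tag name that opened it

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("itemprop")
        if prop:
            self._prop, self._tag = prop, tag
            self.values[prop] = ""

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._prop = self._tag = None

    def handle_data(self, data):
        if self._prop is not None:
            self.values[self._prop] += data


def extract_props(html):
    """Return a {property name: text} mapping for all annotated elements."""
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.values.items()}
```

Given `<td itemprop="fiftyTwoWeekHigh">12.34</td><td>5.67</td>`, `extract_props` returns only the labelled value, regardless of which column it sits in.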

i just had a thought, perhaps rss is the answer! do sites still use that? i kind of feel rss is a dead technology.
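
For sites that do publish a feed, reading it is far simpler than scraping. A minimal sketch, assuming a plain RSS 2.0 document with `item` elements containing `title`, `link` and `pubDate`; the function name `parse_rss` is made up here.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Return one dict per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        })
    return items
```

The catch is that a feed only carries what the site chooses to publish, usually headlines and links, so it does not replace scraping for data such as table contents.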


open.dapper.net is pretty good for this.


wow, thanks matt, this is powerful!



