Microformats: they must be important - there's a t-shirt!
Many, many years ago, I got hired by a small investment firm to write a program that looked up stock information on an online service, extracted the relevant numbers from the displays the service sent down, and loaded them into a database for analysis. I thought it was a really exciting, cutting-edge effort.
Nowadays the automated collection of information from web pages is called "scraping". For example, shortly after I post this entry, you will be able to go to Technorati and find it by searching for new items with the nptech tag. How does Technorati index me accurately? It scrapes the tag off of my page. It finds it using a relatively new idea that I think is going to have some significant impact on data-driven websites: Microformats. The microformats.org website is celebrating its first anniversary this week, so it's probably time the concept got out and about a little more.
What are microformats? The microformats webpage says:
"Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards."
More precisely, microformats combine traditional HTML markup - the way data has traditionally been laid out and formatted on the web - with the purely semantic, non-visual approach of XML. In other words, microformats provide a way of laying out information on a web page that can be both read by a person and interpreted by a program.
To go back to that Technorati example: you've seen the text blocks on blog posts itemizing the tags the author wishes to assign. Let's get technical for just a minute. Each tag in that block is an HTML link (an a element), and each of those links carries the attribute rel="tag". Technorati scans blog pages for links with this attribute, and uses them to index the content. Pretty simple. When I did my scraping project way back when, I had to rely on knowing exactly where the data was on the page... if the vendor changed the display, I was out of luck. With semantic markup, we can find specific content no matter where it is.
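To make the idea concrete, here is a minimal sketch in Python of how a scraper might pick tags off a page. It is only an illustration, not a description of how Technorati actually does it: the class name RelTagParser, the function extract_tags, and the sample URLs in the usage note are all made up for this example. It uses only the standard library's html.parser module and looks for links whose rel attribute includes the value "tag".

```python
from html.parser import HTMLParser


class RelTagParser(HTMLParser):
    """Collect (text, href) for every <a> whose rel attribute includes "tag".

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.tags = []      # (link text, href) pairs found so far
        self._href = None   # href of the rel="tag" link we are inside, if any
        self._text = []     # pieces of text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="nofollow tag"
        rel_values = (attrs.get("rel") or "").lower().split()
        if "tag" in rel_values:
            self._href = attrs.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.tags.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_tags(html):
    """Return the rel="tag" links in an HTML string, wherever they appear."""
    parser = RelTagParser()
    parser.feed(html)
    parser.close()
    return parser.tags
```

Calling extract_tags on a fragment such as '<a href="http://example.org/tag/nptech" rel="tag">nptech</a>' should return the tag text paired with its link, and it does so without knowing anything about where on the page the tag block sits. That is the whole point: the scraper keys on the markup's meaning, not its position.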
More complex microformats include the hCard standard, for contact information, and hCalendar, for appointments and events. These are "translations" into HTML of the widely used vCard and iCalendar standards.
Why would anyone care? A programmer could, for example, write a little plug-in for Firefox that alerts you whenever it sees an hCard on a page you are reading, and asks you if you want to copy it into your contact manager. I think tools like this will become common as these techniques for adding semantics to the web become more widespread. And the microformat enhances visual display as well as semantic use: a web page designer can write a style sheet to control the display of standard microformats so that anytime one is used in a page, it will conform to the site-wide appearance.
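The detection half of such a tool can be sketched in a few lines of Python. This is a rough illustration under some loud assumptions: the names HCardParser and find_hcards are invented here, the page is assumed to have properly closed tags, only a handful of hCard properties are handled (fn, org, tel, email, url), and values are read from the element's text only. What it does rely on from the hCard format is that a card is marked with class="vcard" and its fields with class names such as "fn" for the formatted name.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}

# The small subset of hCard properties this sketch understands.
PROPERTIES = {"fn", "org", "tel", "email", "url"}


class HCardParser(HTMLParser):
    """Hypothetical helper: collect one dict per class="vcard" block."""

    def __init__(self):
        super().__init__()
        self.cards = []     # finished cards
        self._card = None   # the card being read, if we are inside one
        self._stack = []    # one (is_card_root, property, text pieces) per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        is_root = self._card is None and "vcard" in classes
        if is_root:
            self._card = {}
        prop = None
        if self._card is not None:
            # Property classes only count inside a card.
            prop = next((c for c in classes if c in PROPERTIES), None)
        self._stack.append((is_root, prop, []))

    def handle_data(self, data):
        # Text belongs to every open property element that contains it.
        for _, prop, pieces in self._stack:
            if prop is not None:
                pieces.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        is_root, prop, pieces = self._stack.pop()
        if prop is not None and self._card is not None:
            # Collapse whitespace; keep the first value seen for each property.
            self._card.setdefault(prop, " ".join("".join(pieces).split()))
        if is_root:
            self.cards.append(self._card)
            self._card = None


def find_hcards(html):
    """Return a list of dicts, one per hCard found in the HTML string."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards
```

A plug-in built on something like this would call find_hcards on the page being viewed and, if the list is not empty, offer to copy each card into the contact manager. A real tool would need a forgiving HTML parser and fuller coverage of the hCard properties.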
Tags: nptech, xml, microformats
image uploaded to Flickr by Tantek