The state of RSS and news feeds

The nice thing about standards is that you have so many to choose from.

This quote comes to mind a lot while I have been trying to build an RSS/news feed reader. For something like RSS and Atom feeds that have nicely defined specs, it seems that everyone doesnt really follow them. A lot of feeds have extra fields that are undefined, custom namespaces, or are missing fields all together. Why is it so hard to follow a spec?

RSS vs. Atom

Atom and RSS set out to solve what is effectively the same problem; provide a means for news syndication. Atom, supposedly, boasts an IETF spec making it better than RSS (whose spec wasnt all that official and had numerous shortcomings that Atom sought to fix) yet Ive seen the same problems and inconsistencies in both. Part of me thinks that these problems stem from the fact that both Atom and RSS are done using XML. At this point in time, XML seems like such an old and verbose markup language. Maybe it’s time to move on to something like JSON which is more lightweight and can be parsed easily by basically every programming language, not to mention it’s easy to read.

Lack of attention to detail

I think the main problem however is that people simply don’t know what they are doing half the time. For example, one feed that I managed to pick up through an automated crawler I built was a feed hosted by Rackspace. This feed seemed to be used for keeping track of the status of something. Turns out they were serving all 13,000 entries (probably since the beginning of time) ALL AT ONCE. I was stumped for a little while as to why my feed collector was choking and taking forever until I realized that it was waiting to download each of these then insert them all one at a time into the database. By the way, this is all within the RSS 2.0 spec.

A channel may contain any number of <item>s.

Now if only there were a way to paginate through past entries…

Pick a mime-type, any mime-type

This makes me think that people set up their RSS feeds, build there own, use Wordpress or whatever and never properly set up their web server to serve the correct type of content. Ive seen feeds served up as:

application/xml
text/xml
application/rss+xml
text/html
application/rdf+xml
application/atom+xml

Sure all of these will produce some kind of valid XML document, but Im of the belief that you should be sending the correct headers for your document. Sending an RSS feed? Your Content-Type should be application/rss+html. Sending an Atom feed? You should be using application/atom+xml. C’mon, is it really that difficult? (hint, no its not).

At least provide the important fields

In the world of news, one of the more important fields is the date on which the article, story, event, item, whatever was published. Some feeds neglect to provide this important piece of information (thats right, Im looking at you HackerNews).

Defining the bare minimum

The way we consume news and media has changed a lot in the last few years. No longer are we looking at just a page of words. As we can see with apps like Flipboard, content is king. People like to see pictures and images. RSS doesn’t have a field to provide such images and the spec for its thumbnail image is too small. Atom has a generic "media:thumbnail" element, but some people (cough Mashable cough) like to be difficult and define their own namespace for their thumbnails (e.g. "mash:thumbnail"). So lets get some things straight here:

On the top level, we need to describe the feed:

title
description
last updated
link
site logo

These are pretty standard. Its the feed/item/article definition where things get a bit messy. But heres what we need in a world like today:

title
publish date
authors name
tags/categories
content
description (should be shortened preview of content)
image
unique id
original link/permalink

One of the more important fields in that list would be the unique id. Currently, it is rather difficult to determine if an article is unique. You can't go on title alone as someone could easily have two articles in the same feed with the same name. So it ends up being a comparison of normalizing a bunch of fields like the permalink/article link, title, and the feed its come from in order to tell if its unique or not. So why not include something like a UUID? With a UUID, you could then determine uniqueness on a feed by feed basis which is more than acceptable.

Personally, in the end Id love to see a new protocol built with JSON that people actually adhere to. The internet is already series of APIs and web services using JSON as a payload medium, so why not extend that to RSS and other news type feeds? Why not make it more like an API where you can actually request a number of entries, or a date range for enries, or at the very least paginate through entries so that you arent sending 13,000 of them all at once?