Sean McGary

Software Engineer, builder of webapps

Using SSL/HTTPS with HAProxy


Update (6/27/2014) - On June 19th, 2014, HAProxy 1.5.x was released and is now considered stable.

Last time I posted about HAProxy, I walked you through how to support domain access control lists (also known as "virtual hosts" for those of you using Apache and Nginx) so that you can route to different applications based on the incoming domain name. Since then, I've had a few requests on how to support SSL and HTTPS with HAProxy, since it's not the most obvious thing.

The reason it's not obvious is that it's not "officially" supported in the current stable release (1.4), but it is available in the current 1.5 dev branch. If you intend to use this in a production setting, proceed with caution. As of June 19th, 2014, the 1.5.x branch is considered stable.

Compiling

For this example, we'll be using Ubuntu 12.04 LTS as our base operating system and will be building HAProxy from source. Before we start to build it, we need to make sure we have the dependencies installed.

sudo aptitude update
sudo aptitude install build-essential make g++ libssl-dev

Next, let's download the latest version of HAProxy and compile it with the SSL option.

wget http://haproxy.1wt.eu/download/1.5/src/devel/haproxy-1.5-dev21.tar.gz
tar -xzf haproxy-1.5-dev21.tar.gz
cd haproxy-1.5-dev21
make USE_OPENSSL=1
sudo make install

Setup

Cool, now we have HAProxy installed and it's time to set up our config file. In the following example config, we will set up HAProxy to accept connections on a single domain, but force a redirect to the secure connection.

global
    log 127.0.0.1    local0
    log 127.0.0.1    local1 notice
    maxconn 4096
    user haproxy
    group haproxy
    daemon

defaults
    log    global
    mode    http
    option    httplog
    option    dontlognull
    option forwardfor
    option http-server-close
    stats enable
    stats auth someuser:somepassword
    stats uri /haproxyStats

frontend http-in
    bind *:80
    reqadd X-Forwarded-Proto:\ http
    default_backend application-backend

frontend https-in
    bind *:443 ssl crt /etc/ssl/*your ssl key*
    reqadd X-Forwarded-Proto:\ https
    default_backend application-backend

backend application-backend
    redirect scheme https if !{ ssl_fc }
    balance leastconn
    option httpclose
    option forwardfor
    cookie JSESSIONID prefix

    #enter the IP of your application here
    server node1 10.0.0.1 cookie A check

A lot of the stuff at the top of the config is fairly basic boilerplate. What we want to pay attention to is everything below the defaults. As you can see, we're telling HAProxy to listen on both ports 80 and 443 (HTTP and HTTPS respectively), and each uses the backend "application-backend" as the default.

A quick side note: everything we learned in the previous post on access control lists can be applied directly to this situation.

The new section here is the additional https-in frontend. This tells HAProxy to listen on port 443 (the default port for HTTPS) and specifies the SSL certificate to use. Generating SSL certificates can be a huge pain in the ass and sometimes depends on the authority that is issuing it. The one thing to know, though, is that the certificate (unless it's a wildcard cert) MUST be issued for the domain that you are sending through HAProxy.
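
HAProxy expects that crt argument to point at a single PEM file containing the certificate, any intermediate certificates, and the private key concatenated together. Something along these lines should do it (the file names here are just placeholders for your own cert, chain, and key):

# order matters: certificate first, then any intermediates, then the private key
cat yourdomain.com.crt intermediate-ca.crt yourdomain.com.key > /etc/ssl/yourdomain.com.pem
chmod 600 /etc/ssl/yourdomain.com.pem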

Now, in our backend definition, the first line is really the only thing that's different. Since we're using the same backend for both HTTP and HTTPS, it tells HAProxy to redirect any request that did not come in over SSL to the same route using HTTPS (that's the !{ ssl_fc } condition).

Wrap-up

That pretty much does it. It's not all that different from the config in the previous exercise, but it can be a little tricky to set up and configure, especially if your cert isn't configured correctly or doesn't have the correct permissions.

Beginner's Guide to Elasticsearch


For the uninitiated, Elasticsearch is a schema-free, JSON document based search server built on top of the indexing library Lucene. While it provides a powerful full-text search system, Elasticsearch offers many other features that make it great for things like aggregating statistics. In this post, I am going to walk through how to set up Elasticsearch and the basics of storing and querying data.

Setup

For this setup, we'll be using Ubuntu 12.04 LTS as our base operating system.

Because Elasticsearch is written in Java, we need to install the JVM.

sudo aptitude install openjdk-7-jre-headless

We won't be needing any of the UI related features of Java, so we can save some space and time by installing the headless version.

Next, we need to download Elasticsearch. We're going to just download the standalone tarball, but there is also a deb package available if you wish to install it with dpkg.

wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.9.tar.gz
tar -xvf elasticsearch-0.90.9.tar.gz

Now that we have it downloaded, running it is as simple as executing the included binary.

cd elasticsearch-0.90.9
./bin/elasticsearch

By default, Elasticsearch will run as a background daemon process. For this post, we're going to run it in the foreground so that we can watch the logs. This can be accomplished by providing the -f flag.

./bin/elasticsearch -f

Terminology

Before we start, there is some vocabulary that we need to familiarize ourselves with:

Node

A node is a Java process running Elasticsearch. Typically each node will run on its own dedicated machine.

Cluster

A cluster is one or more nodes with the same cluster name.

Shard

A shard is a Lucene index. Every index can have one or more shards. Shards can be either the primary shard or a replica.

Index

An index is the rough equivalent of a database in relational database land. The index is the top-most level and can be found at http://yourdomain.com:9200/<your index>

Types

Types are objects that are contained within indexes. Think of them like tables. Being a child of the index, they can be found at http://yourdomain.com:9200/<your index>/<some type>

Documents

Documents are found within types. These are basically JSON blobs that can be of any structure. Think of them like rows found in a table.

Querying Elasticsearch

Out of the box, Elasticsearch comes with a RESTful API that we'll be using to make our queries. I'm running Elasticsearch locally on localhost, so all examples will refer to it; simply replace localhost with your fully qualified domain. By default, this means the URL we'll be using is http://localhost:9200/

Creating an index

First thing we need to do is create an index. We're going to call our index "testindex". To do this, we simply need to make a POST request to http://localhost:9200/testindex

curl -X POST http://localhost:9200/testindex -d  '{}'

When you create an index, there are a number of options that you can pass along, such as mapping definitions and settings for the number of shards and replicas. For now, we're just going to post an empty object. We'll revisit mappings later on in a more advanced post.
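
For example, if you did want to control the shard and replica counts up front, the request might look roughly like this (the numbers are just illustrative):

curl -X POST http://localhost:9200/testindex -d '
{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
    }
}'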

Inserting a document

To insert our first document, we need a type. For this example, we'll be using mySuperCoolType and we'll be inserting a document with a full name field and a field for a twitter handle.

curl -X POST http://localhost:9200/testindex/mySuperCoolType -d '
{
    "fullName": "Sean McGary",
    "twitterHandle": "@seanmcgary"
}'

// response
{"ok":true,"_index":"testindex","_type":"mySuperCoolType","_id":"N_c9-SQ8RRSrRwPIBqG6Ow","_version":1}

Querying

Now that we have a document, we can start to query our data. Since we didn't provide any specifics around field mappings, Elasticsearch will try to determine the type of each field (string, number, object, etc.) and run the default analyzers and indexers on it.

To test this, we'll query our collection to try and match the full name field.

curl -X GET http://localhost:9200/testindex/mySuperCoolType/_search -d '
{
    "query": {
        "match": {
            "fullName": "Sean"
        }
    }
}'

// result
{
   "took":2,
   "timed_out":false,
   "_shards":{
      "total":5,
      "successful":5,
      "failed":0
   },
   "hits":{
      "total":1,
      "max_score":0.19178301,
      "hits":[
         {
            "_index":"testindex",
            "_type":"mySuperCoolType",
            "_id":"N_c9-SQ8RRSrRwPIBqG6Ow",
            "_score":0.19178301,
            "_source":{
               "fullName":"Sean McGary",
               "twitterHandle":"@seanmcgary"
            }
         }
      ]
   }
}

When you make a query, Elasticsearch will spit back a bunch of data: whether it timed out, how long it took, how many shards it queried against, and how many succeeded/failed. The last field that it returns is the "hits" object. This is where all of your results will appear. In the root hits object, it will tell you the number of matches found, the max score of all the hits, and of course, the array of hits. Each hit includes meta info (prepended with an underscore) such as the (auto-assigned) ID, the score, and the source, which is the original document data you inserted. Full-text searching is one of the strong features of Elasticsearch, so when it performs the search, it will rank all matches based on their score. The score is how close of a match each document is to the original query. The score can be modified if you wish to add additional weight based on certain parameters. We'll cover that in a later, more advanced post.

As you can see here in our results, we got one match by querying for "Sean" in the fullName field. Because we didn't specify a mapping, Elasticsearch applied the "standard analyzer" to the fullName field. The standard analyzer takes the contents of the field (in this case a string), lowercases all letters, removes common stopwords (words like "and" and "the"), and splits the string on spaces. This is why querying for "Sean" matches "Sean McGary".

Let's take a look at another query. This time, though, we're going to apply a filter to the results.

curl -X GET http://localhost:9200/testindex/mySuperCoolType/_search -d '
{
    "query": {
        "match": {
            "fullName": "Sean"
        }
    },
    "filter": {
        "query": {
            "match": {
                "twitterHandle": "seanmcgary"
            }
        }
    }
}'

This particular request returns exactly what we had before, but let's break it down a little bit. To start, it's important to understand the difference between queries and filters. Queries are performed initially on the entire dataset; here that is the "mySuperCoolType" type. Elasticsearch then applies the filter to the result set of the query before returning the data. Unlike queries, though, filters are cached, which can improve performance.
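
Another way to express the same idea is the filtered query, which wraps the query and filter together in one place (sketched here with a term filter; remember the standard analyzer has already lowercased the handle and dropped the "@"):

curl -X GET http://localhost:9200/testindex/mySuperCoolType/_search -d '
{
    "query": {
        "filtered": {
            "query": {
                "match": {
                    "fullName": "Sean"
                }
            },
            "filter": {
                "term": {
                    "twitterHandle": "seanmcgary"
                }
            }
        }
    }
}'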

Conclusion

This concludes our introduction to Elasticsearch. In a followup post, I'll introduce some more advanced features such as setting up mappings, custom analyzers and indexers, and get into how to use facets for things such as analytics and creating histograms.

Creating a blogging platform - TryCompose.com


Just a blog

Over the last 5 years or so, I've often found myself bouncing from blog platform to blog platform in search of something that will make me happy. I've hosted my own Wordpress blog, I've used Tumblr, I've tried Github pages, and I've built my own blogging "engines" probably about three times by now (usually to learn something new). But now I've decided that I'm tired of moving around from place to place and want to build something that not only meets my needs, but hopefully (at least some of) the needs of the community as well.

For years, Wordpress has been the go-to blog platform. It's a veteran, (mostly) stable, has plugins for basically everything you need and any theme you can think of, and you can customize it. All of that, though, requires you to host and manage it yourself. Wordpress.com (the paid and hosted flavor) does exist, but the free version has ads and is very limiting: you can't install themes (you have to use what they have), you can't use any plugins, you can't use your own domain, and you can't use Google Analytics.

That's a problem.

Wordpress has become a great, extendable content management system. People even go so far as to abuse and hack it into something entirely different. This problem needs solving...

Ideal features

Here are some features I think a great blogging platform should have:

Completely hosted

As much as I like tending to servers from time to time, I don't want to think about a blog server. Running your own server means you not only have to keep your blog platform up to date, but also keep the entire machine (or virtual machine) up to date and secure. I don't want to worry about that. I want to click a button and start writing.

Markdown

Wordpress really shows its age with its old rich text/HTML post editor. HTML is a pain to write for anyone, especially when using it to format a blog, and I don't want to have to click on various modifier buttons to format my content. Markdown is great in that it's simple and intuitive enough for anyone to learn and doesn't break the writing flow.

Take ownership of your brand/identity

I want to be able to use my own domain without paying some fee. I don't want a subdomain, I want MY domain. I also want to own the content I create and take it with me anywhere I go should I choose to switch platforms.

Programmer/Hacker friendly

A lot of hosted platforms (aside from Github pages) don't provide support for syntax highlighted code blocks by default. That's definitely something I would love to have.

Google Analytics

Google Analytics is definitely a heavyweight in the world of web analytics, as it's super simple to set up and use. Just let me enter my site's ID and start tracking stuff. Don't give me some half-assed, baked-in solution. Give me the option to use what I want.

Content that looks great

This is an area where I think Medium and Svbtle excel. Both format your writing in a way that is free of distractions and very easy to read. They both have some drawbacks (Medium lacks control over your personal brand, and Svbtle is invite-only), but they both have a really great content consumption experience. I don't care if your platform has a million themes if they're all crap. Give me a choice between a few themes that look great.

File/image hosting

Generally I like to include images with various posts, so a way to upload and manage files would be a great feature. This simplifies things greatly, as it means I don't need to upload things to third parties or try to host them myself on a private file server.

Shut up and take my money

Let me pay for a service like this. Make it affordable, but let me throw money at you so I can have it. I fully believe that services that charge for their product end up being better in the end because they manage to weed out users that expect everything for nothing and can provide great support and features to those that really want to support the product.

Try Compose

Compose strives to include all of the features listed above and that is just the beginning. Over the coming weeks, we'll be opening up a free beta for people to try out so that we can gather feedback to make the platform even better. If you're interested in getting access to the beta, head on over to the signup page, provide us with your email, and we'll let you know when you can start using it!

HAProxy - route by domain name


I tend to build a lot of web applications in NodeJS using the Express.js webserver. When you have a few of these apps running on one server, you generally want to run them on unique ports and put some kind of proxy in front of them. Nginx works great for this and Apache can be another decent, though more bloated, alternative. Recently I decided to branch out for the sake of variety and to learn something new. HAProxy filled that role.

For the uninformed, HAProxy is more than just a reverse proxy; it's a high performance load balancer. Sites with lots of traffic will use something like HAProxy to funnel traffic to a cluster of web servers or even balance traffic between database servers. But what happens when we want to route multiple domains (or subdomains) to different hosts or clusters?

Setting up a single proxy

First, let's take a look at the basics of proxying to an app server.

# config for haproxy 1.5.x

global
        log 127.0.0.1   local0
        log 127.0.0.1   local1 notice
        maxconn 4096
        user haproxy
        group haproxy
        daemon

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        option forwardfor
        option http-server-close
        stats enable
        stats auth someuser:somepassword
        stats uri /haproxyStats

frontend http-in
        bind :80
        default_backend web-app-cluster

backend web-app-cluster
        balance leastconn
        option httpclose
        cookie JSESSIONID prefix
        server node1 10.0.0.1:8080 cookie A check
        server node2 10.0.0.2:8080 cookie A check
        server node3 10.0.0.3:8080 cookie A check

So this is a pretty basic config that will load balance across 3 application servers, each of which is on a unique IP and probably on its own dedicated machine. Generally you'll also want to run your load balancer(s) on a separate server.

So what does this all mean? global and defaults should be pretty self-explanatory; then we have a frontend and a backend. The frontend, as you can see, tells HAProxy what to bind to and defines a default backend. There are a lot of things that can be specified in the frontend, and you can also have multiple frontend definitions (for example, if you wanted to provide an insecure route running on port 80 and SSL on port 443, with different, or the same, backends for each). We'll go over some other options in the multiple domain example.

Diving into multiple domains and ACLs

Now let's take a look at how to route to multiple domains by matching specific domain names.

# config for haproxy 1.5.x

global
        log 127.0.0.1   local0
        log 127.0.0.1   local1 notice
        maxconn 4096
        user haproxy
        group haproxy
        daemon

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        option forwardfor
        option http-server-close
        stats enable
        stats auth someuser:somepassword
        stats uri /haproxyStats

frontend http-in
        bind *:80

        # Define hosts
        acl host_bacon hdr(host) -i ilovebacon.com
        acl host_milkshakes hdr(host) -i bobsmilkshakes.com

        ## figure out which one to use
        use_backend bacon_cluster if host_bacon
        use_backend milkshake_cluster if host_milkshakes

backend bacon_cluster
        balance leastconn
        option httpclose
        option forwardfor
        cookie JSESSIONID prefix
        server node1 10.0.0.1:8080 cookie A check
        server node2 10.0.0.2:8080 cookie A check
        server node3 10.0.0.3:8080 cookie A check


backend milkshake_cluster
        balance leastconn
        option httpclose
        option forwardfor
        cookie JSESSIONID prefix
        server node1 10.0.0.4:8080 cookie A check
        server node2 10.0.0.5:8080 cookie A check
        server node3 10.0.0.6:8080 cookie A check

So here we are routing between two applications: ilovebacon.com and bobsmilkshakes.com. Each one has its own cluster of app servers that we want to load balance between. Let's take a closer look at the frontend, where all the magic happens.

frontend http-in
        bind *:80

        # Define hosts
        acl host_bacon hdr(host) -i ilovebacon.com
        acl host_milkshakes hdr(host) -i bobsmilkshakes.com

        ## figure out which one to use
        use_backend bacon_cluster if host_bacon
        use_backend milkshake_cluster if host_milkshakes

If you've ever used nginx or Apache as a reverse proxy, you'd generally set things up using virtual hosts. HAProxy instead uses the notion of access control lists (ACLs), which can be used to direct traffic.

After we bind to port 80, we set up two ACLs. The hdr (short for header) check matches on the Host header. We also specify -i to make it case insensitive, then provide the domain name that we want to match. You could also set up ACLs to match routes, file types, file names, etc. If you want to know more, feel free to check the docs. So now we effectively have two variables: host_bacon and host_milkshakes. Then we tell HAProxy which backend to use by checking whether each variable is true. One other thing we could do after the use_backend checks is specify a default_backend <backend name>, like we did in the first example, to act as a catch-all for traffic hitting our load balancer that doesn't match either of the domain names, as shown below.
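
For example (reusing the clusters from above, with bacon_cluster as an arbitrary catch-all):

frontend http-in
        bind *:80

        acl host_bacon hdr(host) -i ilovebacon.com
        acl host_milkshakes hdr(host) -i bobsmilkshakes.com

        use_backend bacon_cluster if host_bacon
        use_backend milkshake_cluster if host_milkshakes

        # anything that doesn't match either domain falls through to this
        default_backend bacon_cluster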

Next time, I'll go over how to add HTTPS/SSL support to your websites.

FeedStash.net - Your news made simple


In less than 5 days, Google will be shutting down Google Reader. A lot of people are now scrambling to find a new service to migrate to. The closing of Reader has opened up a relatively large market for new reader apps. Some focus on social and sharing, some are bold and are rethinking the news reading experience, and some are trying to be as simple as possible, or as close to Reader as possible.

Today, I'm launching an app that I've been working on for the past few months (I actually started it before Google decided to close down Reader) to compete with the others to become the next Google Reader replacement.

Introducing FeedStash.net

FeedStash started out as a side experiment when I wanted to play around with Elasticsearch. I needed a fairly large set of textual data, and RSS feeds from blogs seemed to fit the bill quite nicely. I began writing a collection system to grab stories from a set of RSS feeds, quickly discovering that RSS is probably one of the most inconsistently implemented formats in the world and every site does things a little differently than the next. After building out a solid collector, it seemed fairly natural to put a web interface in front of Elasticsearch. This is where things started moving away from playing with Elasticsearch and toward building an RSS reader.

I had managed to get a basic web UI up and running to view and manage feeds. It was right around this time that Google dropped the bomb that it would be closing Reader on July 1st. Everyone was in an uproar over it, and it was very clear that the seemingly dormant RSS community was very much alive, and boy was it angry. In the few weeks immediately following the apocalyptic announcement, RSS and news reader apps started popping up all over the place. It looked like it was time to get my shit together if I wanted to build something that would actually be used by people.

Keeping things simple

The benefit of all these other people and companies building apps early on is that they ended up doing a bit of market research that would benefit everyone else. The countless posts on Reddit and HackerNews about people launching new apps spawned threads of very useful feedback. Turns out people don't want the super social applications; they just want to be able to read their news in peace, with the ability to organize and filter their various subscriptions. They also want it to be super simple. There's no need to take a radical new approach to UI design here. Just make it easy for people to use.

This first iteration of FeedStash is just that - simple. You can import your feeds from Google Reader either by uploading the exported OPML file or by signing in with your Google account and letting FeedStash grab it using the Reader API. This will import and subscribe you to your existing subscriptions.

We decided to keep reading and organizing feeds really simple so that we could get this out the door in time for July 1st. On the right side, you’ll start with a single “folder” called “All Feeds”. This is the master list of all of your subscriptions. By default it will show a stream of all your feeds combined sorted by the date the headline was published. If you want to view just the headlines for a specific feed, simply click on the feed name on the left side of the screen.

You can create as many folders as you’d like. Just click the “Add Folder” link on the left side. It’ll prompt you for a display name, then allow you to add any of your current subscriptions to it. When you click on that folder on the left side, it will slide open much like the “All Feeds” folder and will give you a stream of only the feeds in that folder. Again, if you want to view headlines from just one feed, simply click on it.

Headlines are displayed in chronological order based on published date on the right side of the screen. Clicking on the headline title will expand it and show you the preview content. Content quality and length will vary from site to site. Some sites put entire articles rich with pictures and video in their previews, and some put hardly anything at all. Clicking the “continue reading” button will open the headline in a new tab and navigate to its source. Some people actually like sharing, so I've included some very subtle social share buttons that are visible when you expand the headline. You can share on Facebook, Twitter, Google+, and App.net. Also, since a decent number of people were requesting it from other apps, I added a link to add the story to Pocket so that you can read it later. None of these require you to connect your account, as they are all handled through web intents. When you click one to share, it will pop open a small window on each site, which will render its own share dialog pre-populated with the story title and link.

Personally, I like to be able to view the headlines and stories that I've already read, rather than trying to dig through a huge list to find them, so I have included a dedicated page displaying headlines that you have read in the past. Headlines are marked as read as soon as you click on them in the stream.

Favoriting is another pretty basic but widespread notion, so I've included a dedicated page for favorites as well. To favorite a headline, just look for the “favorite” link with the heart next to it and click it.

Feeding the servers

With the decision to launch FeedStash publicly, I have also decided to charge $24/year for it. The decision to do so came down to a few things:

  1. Keeping servers running to support the app costs money.
  2. People have been quite vocal that they are willing to pay money for a service like this.
  3. Charging for it keeps me on top of everything. All too often I see people create free services that fall by the wayside because the creator either forgot about it or moved on. Having paying customers is far more motivating because they are often more loyal and demand quality.

The road ahead

There are a number of features that I'm looking to add to make the FeedStash experience even better. Here's a look at what's to come.

A RESTful API

Syncing seems to be one of the most desired features of any news feed service. Everyone consumes news and content in their own way, and establishing an API that can sync with our core system would allow users to build applications to suit their needs.

Better mobile experience

This first iteration of FeedStash was built to be mostly usable on desktop browsers and tablets (it happens to fit perfectly on an iPad in landscape mode). We hope to bring a better mobile experience to smaller form factor devices like smart phones through the web interface and then (hopefully) eventually through native apps.

Bookmarklet

Sometimes finding that pesky RSS icon on the site you’re browsing is hard to do. With our bookmarklet you won't have to search for it. Thankfully there is a standard meta tag that defines where the RSS feed for the current page is located. When you click the bookmarklet, it finds those tags, lists the feeds available, and allows you to subscribe right on the spot. No copying and pasting into the app or anything. We want to provide the quickest way to subscribe without disrupting your web browsing.

Searching

This is something that I haven't seen all that often in news and RSS applications. I'd love to be able to search for new feeds and headlines, or even search through those that I have already read, or through my favorites. FeedStash stores each feed uniquely in the database and keeps records of the posts that it pulls in. Indexing that data to make it searchable could open up a world of new discovery options.

Personalization

Some people like to organize feeds into folders, some prefer to get even more specific and create tags on the post level. We want to expand the amount of personalization possible so that you can organize your feeds in almost any way you see fit. Allowing users to have such fine customization over the content they read could also allow us to further analyze your content to do things such as surface feed and story suggestions.

The state of RSS and news feeds


The nice thing about standards is that you have so many to choose from.

This quote comes to mind a lot while I have been trying to build an RSS/news feed reader. For something like RSS and Atom feeds, which have nicely defined specs, it seems that hardly anyone actually follows them. A lot of feeds have extra fields that are undefined, custom namespaces, or are missing fields altogether. Why is it so hard to follow a spec?

RSS vs. Atom

Atom and RSS set out to solve what is effectively the same problem: provide a means for news syndication. Atom, supposedly, boasts an IETF spec, making it better than RSS (whose spec wasn't all that official and had numerous shortcomings that Atom sought to fix), yet I've seen the same problems and inconsistencies in both. Part of me thinks that these problems stem from the fact that both Atom and RSS are done using XML. At this point in time, XML seems like such an old and verbose markup language. Maybe it's time to move on to something like JSON, which is more lightweight and can be parsed easily by basically every programming language, not to mention it's easy to read.

Lack of attention to detail

I think the main problem, however, is that people simply don't know what they are doing half the time. For example, one feed that I managed to pick up through an automated crawler I built was a feed hosted by Rackspace. This feed seemed to be used for keeping track of the status of something. Turns out they were serving all 13,000 entries (probably since the beginning of time) ALL AT ONCE. I was stumped for a little while as to why my feed collector was choking and taking forever, until I realized that it was waiting to download each of these entries and then insert them one at a time into the database. By the way, this is all within the RSS 2.0 spec:

A channel may contain any number of <item>s.

Now if only there were a way to paginate through past entries…

Pick a mime-type, any mime-type

This makes me think that people set up their RSS feeds, build their own, or use Wordpress or whatever, and never properly set up their web server to serve the correct type of content. I've seen feeds served up as:

  • application/xml
  • text/xml
  • application/rss+xml
  • text/html
  • application/rdf+xml
  • application/atom+xml

Sure, all of these will produce some kind of valid XML document, but I'm of the belief that you should be sending the correct headers for your document. Sending an RSS feed? Your Content-Type should be application/rss+xml. Sending an Atom feed? You should be using application/atom+xml. C’mon, is it really that difficult? (Hint: no, it's not.)

At least provide the important fields

In the world of news, one of the more important fields is the date on which the article, story, event, item, whatever was published. Some feeds neglect to provide this important piece of information (that's right, I'm looking at you, HackerNews).

Defining the bare minimum

The way we consume news and media has changed a lot in the last few years. No longer are we looking at just a page of words. As we can see with apps like Flipboard, content is king. People like to see pictures and images. RSS doesn't have a field to provide such images, and the spec for its thumbnail image is too small. Atom has a generic "media:thumbnail" element, but some people (cough Mashable cough) like to be difficult and define their own namespace for their thumbnails (e.g. "mash:thumbnail"). So let's get some things straight here:

On the top level, we need to describe the feed:

  • title
  • description
  • last updated
  • link
  • site logo

These are pretty standard. It's the feed/item/article definition where things get a bit messy. But here's what we need in a world like today:

  • title
  • publish date
  • author's name
  • tags/categories
  • content
  • description (should be shortened preview of content)
  • image
  • unique id
  • original link/permalink

One of the more important fields in that list would be the unique id. Currently, it is rather difficult to determine whether an article is unique. You can't go on title alone, as someone could easily have two articles in the same feed with the same name. So it ends up being a comparison of a bunch of normalized fields, like the permalink/article link, the title, and the feed it came from, in order to tell if it's unique or not. So why not include something like a UUID? With a UUID, you could then determine uniqueness on a feed by feed basis, which is more than acceptable.

Personally, in the end I'd love to see a new protocol built with JSON that people actually adhere to. The internet is already a series of APIs and web services using JSON as a payload medium, so why not extend that to RSS and other news type feeds? Why not make it more like an API where you can actually request a number of entries, or a date range of entries, or at the very least paginate through entries so that you aren't sending 13,000 of them all at once?
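
Purely as a sketch of what that could look like (none of these field names are part of any existing spec; they just map to the lists above):

{
    "title": "Some Blog",
    "description": "A blog about things",
    "link": "http://example.com",
    "logo": "http://example.com/logo.png",
    "lastUpdated": "2013-06-20T12:00:00Z",
    "entries": [
        {
            "id": "3f2504e0-4f89-11d3-9a0c-0305e82c3301",
            "title": "Hello World",
            "publishDate": "2013-06-19T15:30:00Z",
            "author": "Sean McGary",
            "tags": ["rss", "json"],
            "description": "A shortened preview of the content...",
            "content": "<p>The full article content goes here.</p>",
            "image": "http://example.com/images/hello.jpg",
            "permalink": "http://example.com/posts/hello-world"
        }
    ]
}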

Getting Started with Native NodeJS Modules


NodeJS has quickly gained popularity since its inception in 2009 due to its wide adoption in the web app community, largely because if you already know javascript, very little has to be learned to begin developing with it. As evidenced by its modules page on Github, you can pretty much find any library you are looking for (it's sort of starting to remind me of PHP, where if you can think of something, chances are there is a module for it).

One of the things that I think people forget is that you can develop NodeJS modules not only in javascript, but in native C/C++. For those of you that forgot, NodeJS is possible because it uses Google's V8 javascript engine. With V8 you can build extensions in C/C++ that are exposed to javascript. Recently, I decided to dive into the world of native modules because I needed a way to use the imagemagick image manipulation library directly from javascript. All of the libraries currently listed on the NodeJS modules page take a rather round-about approach by forking a new process and calling out to the command line binaries provided by the imagemagick library. This is VERY VERY VERY SLOW, and since image manipulation can be very intense, being able to use the library directly will make things MUCH faster.

Part one: babby’s first native module

This is the first part in what will hopefully be a multipart tutorial as I write a native module for imagemagick. Today, we will take a look at making the most basic of native modules and how to use it with Node.

To start off, let's create a file called testModule.cpp. This is where everything (for now) will happen. Here's what we need to start:

Note: this assumes you already have NodeJS installed and on your path (if you don't, go do that!). We need to import both the Node header and the V8 header.
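
Concretely, that boils down to something like this at the top of testModule.cpp:

// testModule.cpp
#include <node.h>
#include <v8.h>

using namespace v8;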

To build our module, we will be using the node-waf tool that comes bundled with NodeJS. In the same directory as testModule.cpp create a file called wscript and put the following stuff in it:
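
A minimal wscript along these lines should do the trick (it's just Python; the only name that really matters here is the testModule target):

srcdir = '.'
blddir = 'build'
VERSION = '0.0.1'

def set_options(opt):
    opt.tool_options('compiler_cxx')

def configure(conf):
    conf.check_tool('compiler_cxx')
    conf.check_tool('node_addon')

def build(bld):
    t = bld.new_task_gen('cxx', 'shlib', 'node_addon')
    t.target = 'testModule'
    t.source = 'testModule.cpp'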

The wscript file sets up the environment and the libraries that need to be linked at compile time. Think of it as a kind of makefile. The t.target property needs to match the name used in your module's export line (I'll point this out when we get there).

Now, to build your module simply run the following:
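
Assuming node-waf is on your path:

node-waf configure build

The compiled testModule.node file will end up under the build/ directory (the exact subdirectory depends on your Node version).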

Alright, now that we have those basics out of the way, let's make a module that, when called, returns the string "Hello World".
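
A sketch of that function might look like this (HelloWorld is just the name I'm using for the example):

// returns the string "Hello World" to the javascript caller
Handle<Value> HelloWorld(const Arguments& args) {
    HandleScope scope;
    return scope.Close(String::New("Hello World"));
}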

So to quote the V8 handbook:

A handle is a pointer to an object. All V8 objects are accessed using handles, they are necessary because of the way the V8 garbage collector works.

A scope can be thought of as a container for any number of handles. When you’ve finished with your handles, instead of deleting each one individually you can simply delete their scope.

So if we were to think of this in javascript, we’d basically have something like:
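
Roughly the equivalent of:

function helloWorld() {
    return 'Hello World';
}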

So now that we have a function that can do some kind of work, how do we expose it to Node? Let's take a look:
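
Sketched out, with the HelloWorld function from above exposed to javascript as helloWorld:

void TestModule(Handle<Object> target) {
    // attach our function to the exports object under the name "helloWorld"
    target->Set(String::NewSymbol("helloWorld"),
                FunctionTemplate::New(HelloWorld)->GetFunction());
}

// the first argument here must match t.target from the wscript
NODE_MODULE(testModule, TestModule)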

The function TestModule takes an object handle and basically shoves our function in it. This is how exports work in C++. In javascript we’d have:
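
Which is essentially just:

exports.helloWorld = helloWorld;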

Now, a note on the NODE_MODULE(…) line. Earlier I said t.target needed to match something; this is where. The first argument of NODE_MODULE needs to be the same as your target value.

Once you have all of that, build your new node module. To try it out, run node and import your module.
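
Something like this, keeping in mind the exact build output path varies between Node versions (build/default on older releases, build/Release on newer ones):

$ node
> var testModule = require('./build/default/testModule');
> testModule.helloWorld();
'Hello World'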

That’s it! You now have your first native module. In my next post, we’ll dig deeper into building something a bit more substantial. One of the main draws of NodeJS is its asynchronous nature, so next time we’ll take a look at how to go about building a module with asynchronous function using libuv that its at the very core of NodeJS.

Drag and Drop File Uploads with Javascript


The other day I was working on building a file upload interface in Javascript where a user could drag and drop files to upload to a server. I already knew that this was possible using the drag and drop API. Users can drag files from their desktop or another folder to a defined dropzone on the page, and the drop event exposes a FileList. I use Google Chrome as my default browser, so here's what I started with:
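
A sketch of that first attempt (the dropzone element and its id are just assumptions for the example):

var dropzone = document.getElementById('dropzone');

dropzone.addEventListener('drop', function (e) {
    e.stopPropagation();
    e.preventDefault();

    // FileList of everything the user dropped
    var files = e.dataTransfer.files;
    for (var i = 0; i < files.length; i++) {
        console.log(files[i].name, files[i].size);
    }
}, false);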

The first thing to keep in mind is that, by default, browsers will try to open a file when you drag it into the window. To prevent that, we need to prevent the default action as well as stop the event from propagating up the DOM tree. After that, we are able to access the FileList. I then fired up Firefox just to make sure it worked across different browsers, knowing that just because it works in Chrome doesn't mean it will work elsewhere. Upon trying it in Firefox, it loaded the file in the browser. Turns out, non-Chrome browsers require a bit of an extra step: you need to listen for the ‘dragover’ event and prevent that from propagating and taking effect. Here's the revised code:
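
Same idea, with the extra 'dragover' handler added:

var dropzone = document.getElementById('dropzone');

// Firefox (and others) need 'dragover' cancelled too, otherwise the
// browser will just navigate to the dropped file instead of firing 'drop'
dropzone.addEventListener('dragover', function (e) {
    e.stopPropagation();
    e.preventDefault();
}, false);

dropzone.addEventListener('drop', function (e) {
    e.stopPropagation();
    e.preventDefault();

    var files = e.dataTransfer.files;
    for (var i = 0; i < files.length; i++) {
        console.log(files[i].name, files[i].size);
    }
}, false);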

Now our drop event will work in Chrome, Firefox, and Safari. I haven't had a chance yet to try it in Internet Explorer, but according to caniuse.com, it looks like IE10 with Windows 8 will support drag and drop events. For those curious, here's a jsfiddle of the above example.

A Lack of Usability in the Photo Sharing World


Recently I've noticed that photo sharing sites (e.g. flickr, Smugmug, etc.) have rather poor user interfaces and user experiences. UIs seem to have become overly complex, pushing rudimentary features out of the way to places that are not immediately accessible or take a bit of work to find. When I'm using a photo sharing site, there are a couple of really basic features that I believe should be very prominent upon logging in.

The “upload” button should be VERY easy to find.

When I first log into flickr, it is not immediately obvious how I can upload photos or videos.

When I first see this page, I immediately look at the top row where the various navigation items and menus are. There's nothing at the root level that links to an upload page. However, under the “You” dropdown is an upload link. For a site that relies on users uploading photos, I'd think they would make it a single click away and very obvious. Instead, they nest one of the most important actions in a dropdown menu. As you make your way down the page, you'll realize that there IS an upload link on the main page; however, it's the same style as their section headers and doesn't immediately stand out. This really should be styled more like a call to action so that it stands out better. Smugmug, in comparison, has an upload link in their top row of navigation, but they require you to first create a gallery if you don't have one, or pick a gallery to upload to if you have already created one. This is definitely a step in the right direction. I'll get to my issues with their gallery structure in a little bit.

Galleries, sets, and categories oh my!

For those of you that are old enough, think back ten years or so. Chances are your mother, grandmother, or some other family member has a closet full of photo albums from your childhood, or even their own childhood. Photo albums are the most basic and rudimentary method of organizing photos. Have a bunch of photos that happened all at once? Maybe a vacation, birthday party, or other special event? Put them all in one place, like a photo album, so that you can find them later! Even Facebook has this down: you can upload photos and then organize them into albums. It's simple and easy. Flickr and Smugmug, on the other hand, make it a bit more difficult. Flickr has the notion of “sets”. Sets are essentially the same idea as an album; you name the set and select photos to put in it. Beyond that, it's really easy to get lost. Organizing photos into sets is relatively simple; select “your sets” from the “organize” dropdown. Though upon doing this you are brought to an entirely different user interface than you were just on. The entire root site navigation has disappeared and you are shown a pseudo full-screen page. Adding photos is pretty easy - just drag and drop from your “photostream” on the bottom. The flow of this organize process could be handled better, as it seems like they are trying to cram too many features into one page and have thus made it a bit complex to navigate. As a note, my mother, who is not the greatest with computers, has never been able to figure out how to use this particular interface on flickr.

Well how about Smugmug; how does it stack up against flickr? The first thing that drives me nuts is that you HAVE to create a gallery (the equivalent of an album) in order to upload any photos at all. Flickr allows its users to simply upload photos and then organize them later. Smugmug also doesn't allow you to include one photo in multiple galleries. There is absolutely no way to organize your photos a la flickr. Everything MUST go into a gallery, and one gallery only, and only into that gallery by uploading something to it. Smugmug also has categories. An extra level of hierarchy and organization seems like a good idea, but their flow is very limiting, much like their flow for adding photos to albums.

Help help, my photos are being held hostage!

One large point of contention on the internet right now is over reclaiming your own data from a website that you are using. Google+ does a great job of addressing this by allowing users to “liberate” their data by downloading a zip archive of it. Recently, my flickr pro subscription ran out. Currently I have a few hundred photos hosted through them. However, when your pro subscription runs out, flickr essentially holds your photos hostage, allowing you access to only the 200 most recent photos. What if I didn't have a backup of my photos? (Yes, stupid, I know.) The only way to get them back would be to pay $25 to upgrade just to download them all. Flickr also doesn't provide a batch download feature to reclaim all of your uploaded photos. Smugmug is a little bit different. They don't follow a freemium model like flickr. They are purely a subscription service that you pay for yearly. So once your subscription runs out, you have to renew it to get back to your photos. Then again, Smugmug targets professional photographers that are using them as a showcase of their work as well as for white label printing. Smugmug, as far as I can tell, also doesn't have a way to batch download photos that you have uploaded.

How I would do it differently

Both flickr and Smugmug have features that are good and features that are not so good. If they were combined and implemented a little bit differently, you would have one hell of a site. So I am going to attempt to do just that: take features and ideas from both and improve upon them. The internet is a much different place now than it was when flickr and Smugmug first launched in the early 2000s. There is now a larger focus on building social communities and applications that are incredibly easy to use. Here is how I would do it:

Freemium Model

This app is going to be a “pay for what you use” type of deal. There will be a pricing model with a free tier, where the amount you pay is based on the amount of space that you are consuming. The free tier will offer a certain amount of space rather than a limit on the number of photos. If you want to compress your photos down to a few kilobytes and upload a few hundred, go for it! However, if you want to upload photos at their full resolution that are a few megabytes a piece, then you may want to look into one of the paid tiers. This way, people that are into “casual” photography have a place to upload and share photos for free or cheap, and professionals have an affordable way to host their photos as well. Pricing will be focused around space consumed, and not necessarily the features available to you. There might be a point where some features appear that are more geared toward professionals and might be offered to paying users only, but for the most part everyone will be on an equal playing field as far as features are concerned.

Focus on the basics

As I explained above, doing the simple things, like uploading and organizing photos and albums, has gotten rather difficult. In this application, viewing photos and albums, uploading photos, and organizing your photos will be the primary focus. The interface will be designed in a way that even my own mother will be able to use it without having to call me up to walk her through the process. I figure if she can do it without help, then most everyone else should be able to as well.

Building communities

What use would a hosting site be if you couldn't interact with people and talk about your love of photography? Users will have the option of enabling commenting on albums and photos. Flickr has some very large communities because of their commenting system, but there is also a high level of spam comments. Users will be able to monitor and moderate comment threads on their own photos and albums to hopefully keep spam from coming up. Users will also be able to favorite photos and albums as well as follow other users. If two users follow each other, they will be classified as friends. On your dashboard, you will see activity on the things you have followed: when a new photo is added to an album, when comments are made, when a user creates a new album or uploads some photos. User activity and engagement will play a key role in this new application.

Liberate your data

Users will be able to download a zip file containing all of the photos that they have uploaded. ‘nuff said.

Building an Editor for MarkdownWiki


The MarkdownWiki Editor

In my free time lately, I have been building a web application to refresh the wiki market. MarkdownWiki is a new cloud hosted platform that allows users to create and collaborate on wiki pages. It preserves the original purpose of wikis - to provide a place for users to present their knowledge, information, notes, and documentation. The possibilities are endless.

The reason I started building MarkdownWiki was to build a wiki platform that is up-to-date with today's latest and greatest technologies. The first thing that I decided to start with was an editor that makes editing and creating wikis easy for everyone (maybe even for my own mother!). In this post on the MarkdownWiki blog, I talk about the built-in editor and how it will make creating wiki pages many times easier than it currently is.