An alternative Approach to Tagging

Avatar von Martin Brotzeller

The Term Tagging

The popular feature of ‚tagging‘ content is nothing new. The average
netizen should have encountered it by now. Tagging was made popular
by sites like del.icio.us and flickr, where users can attach free-form
strings, so-called ‚tags‘ to their bookmarks and images. The viewer can
then use these tags to navigate through one or more user’s contents and
locate related content.

Scientific Background

The most-cited work on tagging is this research paper from HP, which
starts categorizing tagging as ‚folksonomy‘ (folk taxonomy) in contrast
to the conventional term taxonomy. A taxonomy is usually a categorization
of content according to a hierarchical and exclusive tree of attributes,
while the folksonomy is neither. Also, usually a taxonomy is created by
an individual or a small group, whereas the folksonomy can be created
and expanded by any number of users.

Use of tagging

The most obvious use of tags attached to a set of items is to be able
to quickly search for a subset without having to care about spelling
or being limited because the items are only indexed by title. The most
common user interface for searching according to tags is the ‚tag cloud‘,
a cluster of tags sorted alphabetically. For ease of use the font size
in which each individual tag is shown corellates to the number of items
it is attached to.

It is quite convenient to search in a tag cloud with a couple of hundred
items. A drawback of the cloud, though is that one can only search for
one tag at a time. When the search hits are shown, a list of related
tags is shown that share at least one item with the chosen tag. The
user can then proceed to view the items attached to one of these tags,
but the first tag is subsequently forgotten.

To understand why tagging was invented it is neccessary to look at the
prior status quo of hierarchical ordering like it is done in directories
like dmoz.org, ISP portals or most probably in your browser’s bookmarks
feature. Items are sorted into
a kind of taxonomy. This leads to difficulties since the universe of
topics is multidimensional.
One might have a folder ‚Tutorials‘ for sites that contain tutorials,
and a folder ‚Software‘ for sites on software. Consider further, that
he might want to divide the bookmarks between ‚Windows‘ and ‚Linux‘. The
hierarchical model now makes it necessary to duplicate one set of the
tree’s nodes to attach them to both branches, i.e. you either have a
first-level dichotomy between Windows and Linux and two second level
ones between Tutorials and Software or the other way round.
Since there’s more aspects to expect, this is not a feasible solution,
not even if you flatten out the hierarchy by combining the two aspects
and create four folders.

The Alternative: Instant Hierarchy

A method that combines the flexibility of tagging with the
search-narrowing power of a deep hierarchy is to combine the tags to an
‚instant hierarchy‘: The user chooses from a pool of keywords. Like in
tag clouds, he gets to see some items then and a list of subsequient
keywords. He can then choose a second keyword and get the items that
are tagged with both chosen keywords, and so on. This instant tree is
at it’s deepest in the path of the item with the most tags.




These four images show an example of navigation through the instant
hierarchy. The choice of keywords in the selectboxes narrows downn, in
each step the items that match the exact set of keywords are shown. The
best results are achieved when the items are tagged diversely and
with a diverse number of keywords, because every step then show some
information. If all items have, for instance, three keywords, the first
two steps show nothing new.

Database-wise, this approach is a lot more complicated than the
single-tag navigation of del.icio.us or technorati. Depending on the
database available, n-fold joining or subquerying can be used. In both
cases we start with a set of three tables, ‚bookmarks‘, ‚keywords‘ and
‚relations‘ since we want to map an n:m relationship:

+--------------+  +--------------+ +--------------+
| bookmarks    |  | keywords     | | relations    |
+--------------+  +--------------+ +--------------+
| ID           |  | ID           | | ID           |
| url          |  | keyword      | | bookmark     |
| title        |  +--------------+ | keyword      |
+--------------+                   +--------------+

The ID columns are the primary keys (although the relations table might
not need an extra key), the keywords should be unique anyway as long as
no ownership is implemented.

N-fold Joins

The query to retrieve the first set of keywords is obvious:

SELECT DISTINCT k.id,k.keyword
FROM relations as r, bookmarks as b, keywords as k
WHERE r.keyword=k.id
AND r.bookmark=b.id

That might look complicated, but it easily sorts out the keywords that have nothing attached, and it fits in with the recursion that follows
The second step looks like this:

SELECT DISTINCT k.id,k.keyword
FROM relations as r, bookmarks as b, keywords as k, relations as r0
WHERE r.keyword=k.id
AND r.bookmark=b.id
AND r0.keyword=5

since the id of keyword „Programming“ is 5 (you can do this with another join of course, but the information has to retrieved to get the bookmarks beforehand)
The third step is where it gets complicated – we need to make sure the database looks for both keywords and does not return either

SELECT DISTINCT k.id,k.keyword
FROM relations as r, bookmarks as b, keywords as k, relations as r0, relations as r1
WHERE r.keyword=k.id
AND r.bookmark=b.id
AND r0.bookmark=r.bookmark
AND r.keyword!=5
AND r0.keyword=5
AND r1.bookmark=r.bookmark
AND r.keyword!=7
AND r1.keyword=7

From there, it’s simply recursion.
The queries for the list of bookmarks are similar, but a bit simpler. The fourth list of bookmarks is retrieved with this query:

SELECT DISTINCT b.ID,b.url,b.title
FROM bookmarks as b, relations as rc, relations as r1, relations as r2, relations as r3
WHERE rc.bookmark=b.id
AND r1.keyword=’5′
AND r1.bookmark=b.id
AND r2.keyword=’7′
AND r2.bookmark=b.id
AND r3.keyword=’12‘
AND r3.bookmark=b.id
GROUP BY b.id

Subqueries

Newer MySQL versions offer the use of subqueries to replace some joins. A query to get a list of keywords with a prior choice looks like this:

SELECT DISTINCT k.id,k.keyword,
(SELECT count(r0.keyword)
FROM relations r0
WHERE r0.bookmark=b.id
AND r0.keyword IN (5,7,12)) x
FROM bookmarks b,
relations r,
keywords k
WHERE k.id=r.keyword
AND b.id=r.bookmark
AND k.id not in (5,7,12)
HAVING x=3
ORDER BY k.id

We retrieve all keywords that have x other keywords attached, where the other keywords are chosen from our list.
In the same way we can subselect and count the attached keywords to retrieve an appropriate list of bookmarks here.

Implementation

I assembled a small Addon for our OpenSource software PHProjekt and
plan to release it on PHProjekt.com as soon as it’s presentable
(it’s a hobby project so it proceeds a little slower than I would wish
it to). If you are interested in a snapshot, drop me a mail (brotzeller at mayflower dot de).
Currently it lets the user add and delete keywords and bookmarks
and assign keywords to the bookmarks. The above screenshots show the
navigation, which can also return xml and can therefore be used as sidebar
in gecko browser. I also added a feature to get RSS feeds for keyword
choices, so firefox users can add dynamic bookmarks. Since it’s about
bookmarks, usage is observed via redirection, so the RSS feeds can also
show new or most used or last used bookmark lists.

Software-Modernisierung

Avatar von Martin Brotzeller

Kommentare

13 Antworten zu „An alternative Approach to Tagging“

  1. Good article. It points out the most important problem with tags. However, del.icio.us tags can be combined by using the ‚+‘ sign. These ‚+‘-characters are shown in the list of related items.

  2. Avatar von Tom Ward
    Tom Ward

    This is pretty much off the top of my head, but I think you can avoid the subselect with a query like (where the count equals the number of keywords):

    SELECT linked_keywords.keyword, linked_keywords.id
    FROM keywords, bookmarks, relations, keywords linked_keywords, relations linked_relations
    WHERE keywords.id = relations.keyword_id
    AND bookmarks.id = relations.bookmark_id
    AND keywords.keyword IN (‚a‘, ‚d‘)
    AND linked_keywords.id = linked_relations.keyword_id
    AND linked_keywords.id != keywords.id
    AND linked_relations.bookmark_id = bookmarks.id
    GROUP BY linked_keywords.keyword, linked_keywords.id
    HAVING COUNT(DISTINCT keywords.id) = 2

  3. GameSpot.com uses this sort of hierarchy approach in their ‚tag browser.‘

  4. This idea of an instant hierarchy by combining Tags is something I started to develop 6 years ago. I called the product XTend and it employeed the concept of ‚Instant Hierarchy‘ to allow you to manage your files. I gave up on the project because at the time ‚tagging‘ was not a common term and no one understood what I was talking about. I have some macromedia flash demos of the XTend here: http://www.base4.net/Blog.aspx?ID=14
    I’d be interested to here what you think…

  5. I found a post about an alternative approach to tagging today and thought it would be interesting if someone made a wordpress plugin that would use this idea. It seems like it might be a nice alternative to the normal site map that most sites have when…

  6. ewww, the horror you have to go through with mySQL. Or perhaps you don’t know the JOIN syntax in SQL? I think mySQL supports it in some basic forms.

    1. Avatar von Martin Brotzeller
      Martin Brotzeller

      Which „it“ are you referring to? Those commata in the line starting with „FROM“ are inner joins.

  7. I may not be understanding you correctly but this is already being done at delicious. Select any top level tag – the next page then gives you all the tags that are related and the ability to combine them (click on the tiny + symbol)and so on.

    It’s also a form of multifaceted navigation that’s popular with many shopping sites.

  8. Avatar von Martin Brotzeller
    Martin Brotzeller

    Ok, i see del.icio.us can do it. But apparently it only works on your own bookmarks. When i’m not logged on i don’t have that + sign.

  9. del.icio.us direc.tor also provides the dynamic cascading listbox-like interface you describe.

    http://johnvey.com/features/deliciousdirector/

  10. Just to add another data point to the discussion, I believe that the majority of large scale tagging implementations are done with full-text indexing.

    The „+“ in the del.icio.us tagging is indicative of MySQL Boolean Full Text searches.

  11. How is this an „alternative approach to tagging“? The „alternative approach“ you are talking about IS tagging. This article simply talks about one of the many ways of PRESENTING tagged data to the user.

    1. If you feel the title is misleading, feel free to come up with an alternative.
      If i like it i might even consider changing the title :-)

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert


Für das Handling unseres Newsletters nutzen wir den Dienst HubSpot. Mehr Informationen, insbesondere auch zu Deinem Widerrufsrecht, kannst Du jederzeit unserer Datenschutzerklärung entnehmen.