HT06, Tagging Paper, Taxonomy, Flickr, Academic Article, To Read

Cameron Marlow¹, Mor Naaman¹, danah boyd¹,², Marc Davis¹,²
¹ Yahoo! Research Berkeley, 1950 University Avenue, Suite 200, Berkeley, CA 94704-1024
² UC Berkeley School of Information, 102 South Hall, Berkeley, CA 94720-4600
{cameronm, mor, danah, marcd}@yahoo-inc.com

ABSTRACT
In recent years, tagging systems have become increasingly popular. These systems enable users to add keywords (i.e., “tags”) to Internet resources (e.g., web pages, images, videos) without relying on a controlled vocabulary. Tagging systems have the potential to improve search, spam detection, reputation systems, and personal organization while introducing new modalities of social communication and opportunities for data mining. This potential is largely due to the social structure that underlies many of the current systems.

Despite the rapid expansion of applications that support tagging of resources, tagging systems are still not well studied or understood. In this paper, we provide a short description of the academic related work to date. We offer a model of tagging systems, specifically in the context of web-based systems, to help us illustrate the possible benefits of these tools. Since many such systems already exist, we provide a taxonomy of tagging systems to help inform their analysis and design, and thus enable researchers to frame and compare evidence for the sustainability of such systems. We also provide a simple taxonomy of incentives and contribution models to inform potential evaluative frameworks. While this work does not present comprehensive empirical results, we present a preliminary study of the photosharing and tagging system Flickr to demonstrate our model and explore some of the issues in one sample system. This analysis helps us outline and motivate possible future directions of research in tagging systems.

Categories and Subject Descriptors
H.1.1 [Information Systems]: Models and Principles – Systems and Information Theory.

General Terms: Algorithms, Design, Human Factors.

Keywords
Tagging systems, taxonomy, folksonomy, tagsonomy, Flickr, categorization, classification, social networks, social software, models, incentives, research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
HT’06, August 22–25, 2006, Odense, Denmark.
Copyright 2006 ACM 1-59593-417-0/06/0008...$5.00.

1. INTRODUCTION
Web-based tagging systems such as Del.icio.us, Technorati and Flickr allow participants to annotate a particular resource, such as a web page, a blog post, an image, a physical location, or just about any imaginable object with a freely chosen set of keywords (“tags”). In this paper, we aim to articulate a framework for studies of such systems.

One approach to tagging has emerged in “social bookmarking” tools where the act of tagging a resource is similar to categorizing personal bookmarks. In this model, tags allow users to store and collect resources and retrieve them using the tags applied. Similar keyword-based systems have existed in web browsers, photo repository applications, and other collection management systems for many years; however, these tools have recently increased in popularity as elements of social interaction have been introduced, connecting individual bookmarking activities to a rich network of shared tags, resources, and users.

Social tagging systems, as we refer to them, allow users to share their tags for particular resources. In addition, each tag serves as a link to additional resources tagged the same way by others. Because of their lack of predefined taxonomic structure, social tagging systems rely on shared and emergent social structures and behaviors, as well as related conceptual and linguistic structures of the user community. Based on this observation, the popular tags in social tagging systems have recently been termed folksonomy [22], a folk taxonomy of important and emerging concepts within the user group.

Social tagging systems may afford multiple added benefits. For instance, a shared pool of tagged resources enhances the metadata for all users, potentially distributing the workload for metadata creation amongst many contributors. These systems may offer a way to overcome the Vocabulary Problem, first articulated by George Furnas et al. in [8], where different users use different terms to describe the same things (or actions). This disagreement in vocabulary can lead to missed information or inefficient user interactions. The taxonomy of tagging systems articulated in this paper, and the results of our preliminary experiments on the relationship between tag overlap and social connection, both point to the possibility that thoughtful sociotechnical design of tagging systems may uncover ways to overcome the Vocabulary Problem without requiring either the rigidity and steep learning curve of tightly controlled vocabularies, or the computational complexity and relatively low success of purely automatic approaches to term disambiguation.

Figure 1 shows a conceptual model for social tagging systems. In this model, users assign tags to a specific resource; tags are represented as typed edges connecting users and resources. Resources may also be connected to each other (e.g., as links between web pages) and users may be associated by a social network, or sets of affiliations (e.g., users that work for the same company).

Variations in the model described in Figure 1 are possible. For example, links between resources could be absent, and likewise for users. Nevertheless, in these circumstances, we can still observe connections between users, tags, and resources. These connections define an implicit relationship between resources through the users that tag them; similarly, users are connected by the resources they tag.

Figure 1. A model of tagging systems.

The three individual elements of the model depicted in Figure 1 have been studied independently in the past, usually in the context of web-based systems:

• Resources. The relationship between resources and links is a well-researched area. Most prominently, PageRank [18] has made analysis of link structure on the web a household name.

• Users. Analysis of social ties and social networks is an established subfield of sociology [25] and has received attention from physicists, computer scientists, economists, and numerous other areas of study.

• Tags. Recently, the aggregation and semantic aspects of tags have been discussed and debated at length [16]. This discussion has mainly focused on the quality of information produced by tagging systems and the possible tradeoffs between folksonomies and crafted ontologies [17, 20]. Furthermore, the challenges of shared vocabularies for description have been studied in the information science and library science communities for many years [8].

Despite these individual contributions (which we will revisit in more detail in Section 2), to fully understand tagging systems we believe a holistic approach is necessary. Walker [24] describes tagging as “feral hypertext”, a structure out of control, where the same tag is assigned to different resources with different semantic senses, and thus associates otherwise unrelated resources. However, by considering the entire model, computer systems could make inferences that “domesticate” (to use Walker’s terms) these “feral” tags. For example, tag semantics and synonyms could potentially be inferred by analyzing the structure of the social network, and identifying certain portions of the network that use certain tags for the same resource, or related resources, interchangeably. These tags may be synonymous.

A unified user-tag-resource approach might be useful for many key web technologies, including: search and information retrieval; information organization, discovery and communication; spam filtering; reducing effects of link spam, and improving on trust metrics; identifying trends and emerging topics globally and within communities; and locating experts and opinion leaders in specific domains.

In order to better frame the space of social tagging systems, we describe two organizational taxonomies for social tagging systems, developed by analyzing and comparing the design and features of many existing social tagging systems. The taxonomies describe:

• System design and attributes. We claim that the place of a tagging system in this taxonomy will greatly affect the nature and distribution of tags, and therefore the attributes of the information collected by the system.

• User incentives. User behaviors are largely dictated by the forms of contribution allowed and the personal and social motivations for adding input to the system. The place of a tagging system in this taxonomy will affect its overall characteristics and benefits.

To demonstrate how these classifications affect the properties of tags and users, we will present a study of Flickr, one of the most popular tagging systems on the web today. We compare our findings from the Flickr study to the work of Golder and Huberman [9] on Del.icio.us. Flickr and Del.icio.us are complementary examples of tagging systems in our taxonomies; we present initial evidence that the dynamics of these systems are quite different.

To this end, the next section provides details on related work, mostly concentrating on academic research work in related areas. Section 3 briefly outlines a number of current tagging systems used as illustration in different parts of this paper. Section 4 describes our taxonomies of tagging system design choices and incentives. In Section 5 we present the results of our study of tagging in Flickr, a photo-sharing tagging system. We present a summary and outline future directions of research in Section 6.

2. RELATED WORK
Despite a considerable amount of attention in academic circles, as represented in various blog posts [17, 20], little academic research work has been invested in tagging systems to date.

Perhaps the most significant formal study of tagging systems appears in the work of Golder and Huberman [9]. The authors study the information dynamics in “collaborative tagging systems”, specifically the Del.icio.us system. The authors discuss the information dynamics in such a system, including how tags by individual users are used over time, and how tags for an individual resource (in the case of Del.icio.us, web resources) change (or, more specifically, stabilize) over time. We refer to their findings again in Section 5.

Golder and Huberman also discuss the semantic difficulties of tagging systems. As they point out, polysemy (when a single word has multiple related meanings) and synonymy (when different words have the same meaning) in the tag database both hinder the precision and recall of tagging systems. In addition, the different expertise and purposes of tagging participants may result in tags that use various levels of abstraction to describe a resource: a photo can be tagged at the “basic level” of abstraction [14] as “cat” or at a superordinate level as “animal” or at various subordinate levels below the basic level as “Persian cat” or “Felis silvestris catus longhair Persian.”

In [11], the authors of Connotea provide a hands-on description of tagging systems. The study includes a snapshot of the tagging systems available, as of early 2005, and a breakdown of the key technologies behind these systems into a two-dimensional taxonomy. The first two facets of the first dimension in their taxonomy represent the identity of taggers: “tag user” and “content creator”. Both facets can be classified in a second dimension as either “self” or “others”. Other categorizations that the authors offer divide the space of tagging systems according to the “audience” (scholarly or general) and the “type of object store in the system” (URLs versus actual content). The same authors describe their own system, a social tagging system for academic articles, in a second article [12]; the technological and interaction techniques are described in depth, and an initial study of tag distribution is offered. The taxonomy we provide in Section 4 will expand upon the dimensions noted in their classifications.

Inherent in our model of tagging systems are connections or links between resources. As mentioned above, research on link-based systems in the context of the web is hardly new [1]. Obviously, the PageRank algorithm [18] had a significant impact on the field and on the way we use the web today, by supplying a mechanism to assess the importance of web pages. Lately, link analysis has been suggested to help fight web spam [10] by identifying trusted resources and propagating trust to resources that are linked from trusted resources. In tagging systems, similar concepts can utilize the information and trust in the social network and the links from users to resources (as well as between resources as before) to reason about the importance and trust of users and resources.

Perhaps more closely related to our tagging system model, Kleinberg [13] suggested an algorithm to identify web pages that are “hubs” and nodes that are “authorities” in a linked graph of resources, given a query term. In his model, Kleinberg views the hubs and authorities as a bi-partite graph, similar to the way we depict users and resources in our model in Figure 1. Taking the same hubs and authorities approach an inch closer to our model, Chakrabarti et al. [5] extended Kleinberg’s work to include anchor text. Anchor text, the text that appears around a link to a certain resource, can be considered to have a similar role to tags in our model. Traditionally, the anchor text is associated with the resource the link is pointing to. The exact way the text is picked and associated with the resource varies between systems. Tags have the potential to increase comprehensiveness and accuracy of anchor-text based methods by treating the user and the resource separately in relevance metrics.

Also inherent in our model of tagging systems are relationships between users, a form typically described as a social network. While the social network literature related to tagging systems is too broad for the focus of this paper, we will summarize some of the important contributions. Social networks can be used both as a methodology for studying the social nature of tagging in these systems, as well as a tool for systems to expose relationships to users. A number of measures are applicable to each of these tasks, both from systemic and user-based perspectives. Centrality is a measure of how integral an individual is to a network [7], and can expose users whose social ties or tagging practices establish them as an influencer or opinion leader. Structural equivalence describes the similarity between two users based on the overlap in their personal networks [4], and can be used to find analogous users within the system. Partitioning a network into smaller structures can be helpful to both users and researchers; clustering addresses this problem by finding cohesive subgroups [25], while blockmodeling finds groups of users with similar roles within the network [26].

Like tagging systems, collaborative filtering (CF) is concerned with the relationships between people and resources, and the extent to which these connections can be leveraged to help users find new resources and people they would otherwise miss. Some of these systems have leveraged user-contributed metadata in the matching process, but this extra information is typically used as a filter after a match has been made [15]. To this extent, social tagging systems could be seen as complementary to CF, as tags are the primary means of finding similar resources; people have stipulated that these two systems would marry well, feeding each other with recommended content [21]. CF techniques have been studied extensively [3], and many are employed in popular tools, such as Amazon.com.

The research related to tagging systems separately covers each part of our model: people, resources, tags and the pairwise connections between them. To accurately describe the properties of systems including and connecting all of these components, we have integrated and extended background research for each of these components, spanning the fields of computer science, information science, and social networks. Each of these components is necessary to understand the relationship between objects, the words that describe them, and the motivations people have to do so. In the following section we will introduce a number of example tagging systems, followed by a descriptive taxonomy that shows how all of these pieces fit together in practice.

3. EXAMPLE TAGGING SYSTEMS
In this paper, we reference numerous tagging systems to show variations in architecture and incentives. We do not analyze most of these potentially ephemeral sites in depth, although we provide references to them in order to ground the reader with examples. For the sake of legibility, here is a brief description of sites we reference. There are many other tagging systems in existence, but we chose twelve that are representative of the diversity of those that are currently well used.

• Del.icio.us (http://del.icio.us): a “social bookmarking site,” allowing users to save and tag web pages and resources.
• Yahoo! MyWeb2.0 (http://myweb.yahoo.com): similar to Del.icio.us, but including a social network of contacts.
• CiteULike (http://www.citeulike.org/): a site allowing users to tag citations and references, e.g. academic papers or books.
• Flickr (http://www.flickr.com): a photo sharing system allowing users to store and tag their personal photos, as well as maintain a network of contacts and tag others’ photos.
• YouTube (http://www.youtube.com): a video sharing system allowing users to upload video content and describe it with tags.
• ESP Game (http://www.espgame.org/) [23]: an internet game of tagging where users are randomly paired with each other, and try to guess tags the other would use when presented with a random photo.
• Last.fm (http://www.last.fm): a music information database allowing members to tag artists, albums, and songs.
• Yahoo! Podcasts (http://podcasts.yahoo.com/): a site that indexes podcasts (regularly updated audio content), and allows users to tag them.
• Odeo (http://www.odeo.com/): another podcast information system supporting tagging and search.
• Technorati (http://www.technorati.com/): a weblog aggregator and search tool allowing blog authors to tag their posts.
• LiveJournal (http://www.livejournal.com/): a weblog and community website allowing users to tag their personal profile, along with individual blog posts.
• Upcoming (http://upcoming.org/): a collaborative events database where users can enter future events (e.g., concerts, exhibits, plays, etc.)
and tag them.

4. TAXONOMY OF TAGGING SYSTEMS
While we sometimes refer to social tagging systems as a coherent set of applications, it is clear that differences between tagging systems have a significant amount of influence on resultant tags and information dynamics. It is also clear that the personal and social incentives that prompt individuals to participate affect the system itself in various ways. We have developed two tagging taxonomies to analyze how 1) characteristics of system design and 2) user incentives and motivations may influence the resultant tags in a tagging system.

Different designs and user incentives can have a major influence on the usefulness of information for various purposes and applications, and in a reciprocal fashion, on how users appropriate and utilize these systems. The design of the system may solicit tagging useful for discovery, retrieval, remembrance, social interaction, or possibly, all of the above.

4.1 System Design and Attributes
We describe some key dimensions of tagging systems’ design that may have immediate and considerable effect on the content and usefulness of tags generated by the system. For each dimension in our taxonomy, we note the ways in which the location of a system on this dimension may impact the behavior of the system. Some of the dimensions listed below interact; a decision along one of them may determine, or at least be correlated with, the system’s placement in another.

• Tagging Rights. Possibly the most important characterization of a tagging system design is the system’s restriction on group tagging. A tagging system can be restricted to self-tagging, where users only tag the resources they created (e.g., Technorati) or allow free-for-all tagging, where any user can tag any resource (e.g., Yahoo! Podcasts). This is not the strict dichotomy that it seems, as systems can allow varying levels of compromise.
For instance, systems can choose the resources users are to tag (such as images in the ESP Game) or specify different levels of permissions to tag (as with the friends, family, and contact distinctions in Flickr). Likewise, systems can determine who may remove a tag, whether no one (e.g., Yahoo! Podcasts), anyone (e.g., Odeo), the tag creator (e.g., Last.fm) or the resource owner (e.g., Flickr). The implication for the nature of the tags that emerge is that free-for-all systems are obviously broad, both in the magnitude of the group of tags assigned to a resource, and in the nature of the tags assigned. For instance, tags that are assigned to a photo may be radically divergent depending on whether the tagging is performed by the photographers, their friends, or strangers looking at their photos.

• Tagging Support. The mechanism of tag entry can have great impact on tagging system behavior. Observed systems fall into three distinct categories: blind tagging, where a tagging user cannot view tags assigned to the same resource by other users while tagging (e.g., Del.icio.us); viewable tagging, where the user can see the tags already associated with a resource (e.g., Yahoo! Podcasts); and suggestive tagging, where the system suggests possible tags to the user (e.g., Yahoo! MyWeb2.0). The suggested tags may be based on existing tags by the same user or on tags assigned to the same resource by other users. Suggested tags can also be generated from other sources of related tags, such as automatically gathered contextual metadata or machine-suggested tag synonyms. The implication of suggested tagging may be a quicker convergence to a folksonomy (see [9]). In other words, a suggestive system may help consolidate the tag usage for a resource, or in the system, much faster than a blind tagging system would. A convergent folksonomy is more likely to be generated when tagging is not blind.
But it is not clear that consolidation is necessarily a good thing; arguably, a suggestive model may be applied carefully so that the agreement is not too widespread. As for viewable tagging, a possible implication is the overweighting of certain tags that were associated with the resource first, even if they would not have arisen otherwise.

• Aggregation. Another related feature of group dynamics comes from the aggregation of tags around a given resource. The system may allow for a multiplicity of tags for the same resource, which may result in duplicate tags from different users; we term this approach the bag-model for tag entry (e.g., Del.icio.us). Alternatively, many systems ask the group to collectively tag an individual resource, thus denying any repetition; this interface we call a set-model approach for tag input (e.g., YouTube, Flickr). When a bag-model is used, the system can use aggregate statistics for a given resource to present users with the collective opinions of the taggers; for instance, the tags around a popular link on Del.icio.us can be shown to the user to help characterize the breadth of opinions of the taggers. Furthermore, these data can be used to more accurately find relationships between users, tags, and resources given the added information of tag frequencies.

• Type of object. The type of resource being tagged is an important consideration. Sample object types that are prominent in today’s systems include, but are far from being restricted to, web pages (e.g., Del.icio.us, Yahoo! MyWeb2.0), bibliographic material (e.g., CiteULike), blog posts (e.g., Technorati, LiveJournal), images (e.g., Flickr, ESP Game), users (e.g., LiveJournal), video (e.g., YouTube) and audio objects such as songs (e.g., Last.fm) or podcasts (e.g., Yahoo! Podcasts, Odeo). In reality, any object that can be virtually represented can be tagged or used in a tagging system.
For example, systems exist that let users tag physical locations or events (e.g., Upcoming). The implications for the nature of the resultant tags are numerous; a trivial example is that we suspect tags given to textual resources may differ from tags for resources/objects with no such • • • textual representation, like images or audio, although this has not yet been empirically tested. the design choices on the resultant tags and the type of benefits that can be derived from the system. Source of material. Resources to be tagged can be supplied by the participants (e.g., YouTube, Flickr, Technorati, Upcoming), by the system (e.g., ESP Game, Last.fm, Yahoo! Podcasts), or, alternatively, a system can be open for tagging of any web resource (e.g., Del.icio.us, Yahoo! MyWeb2.0). Some systems restrict the source through architecture (e.g., Flickr), while others restrict the source solely through social norms (e.g., CiteULike). 4.2 User Incentives Incentives and motivations for users also play a significant role in affecting the tags that emerge from social tagging systems. Users are motivated both by personal needs and sociable interests. The motivations of some users stem from a prescribed purpose, while other users consciously repurpose available systems to meet their own needs or desires, and still others seek to contribute to a collective process. A large part of the motivations and influences of tagging system users is determined by the system design and the method by which they are exposed to inherent tagging practices. While tagging has the potential to be valuable for numerous applications, users can be unaware of or uninterested in the broader design motivations; they might instead be persuaded by the norms of their friends and how they think that a particular system fits into their use. Tagging can be a public and sociable activity, but not all tags emerge with an intended audience. 
Many users begin with the conception that they are tagging for themselves; some begin to appreciate the sociable aspects over time, while others have no interest in that component. Since user incentives are influenced by the design of a given system, the motivations underlying tagging vary both by people and by systems. Evaluating these practices requires an understanding of why people contribute and the resulting effects on output and performance of the tagging system. In this section we will articulate the various incentives that can be outwardly observed in current social tagging systems and show how they can influence the use and utility of tags. The motivations to tag can be categorized into two high-level practices: organizational and social. The first arises from the use of tagging as an alternative to structured filing; users motivated by this task may attempt to develop a personal standard and use common tags created by others. The latter expresses the communicative nature of tagging, wherein users attempt to express themselves, their opinions, and specific qualities of the resources through the tags they choose. Both of these practices differ based on intended audience and future expectation of use. The following list of incentives expresses the range of potential motivations that influence tagging behavior. They are not intended to be mutually exclusive; instead we expect that most users are motivated by a number of them simultaneously. • Future retrieval: to mark items for personal retrieval of either the individual resource or the resultant collection of clustered resources (examples: tagging a group of papers on Del.icio.us in preparation for writing a book, tagging songs on Last.FM to create an adhoc playlist, tagging Flickr photos `home’ to be able to find all photos taken at home later). These tags may also be used to incite an activity or act as reminders to oneself or others (e.g., the “to read” tag). 
These descriptive tags are exceptionally helpful in providing metadata about objects that have no other tags associated. • Contribution and sharing: to add to conceptual clusters for the value of either known or unknown audiences. (Examples: tag vacation websites for a partner, contribute concert photos and identifying tags to Flickr for anyone who attended the show). • Attract Attention: to get people to look at one’s own resource because they are common tags. When “tag clouds” or other such lists that reflect popularity of tags are visible in the Resource connectivity. Resources in the system can be linked to each other independent of the user tags. Connectivity can be roughly categorized as linked, grouped, or none. For example, web pages are connected by directed links; Flickr photos can be assigned to groups; and events in Upcoming have connections based on the time, city and venue associated with the event. Implications for resultant tags and usefulness may include convergence on similar tags for connected resources, especially in suggested and viewable tagging support scenarios. Social connectivity. Some systems allow users within the system to be linked together. Like resource connectivity, the social connectivity could be defined as linked, grouped, or none. Many other dimensions are present in social networks, for example, whether links are typed (like in Flickr’s contacts/friends model) and whether links are directed, where a connection between users is not necessarily symmetric (in Flickr, for example, none of the link types is symmetric). Implications of social connectivity include, possibly, the adoption of localized folksonomies based on social structure in the system. Table 1. 
Dimensions in the tagging system design taxonomy and possible implications Dimension Main categories Summary of Potential implications Tagging Rights Self-tagging, permissionbased, Free-forall Nature and type of resultant tags; role of tags in system Tagging Support Blind, suggested, viewable Convergence on folksonomy or overweighting of tags Aggregatio n model Bag, set Availability of aggregate statistics Object type Textual, nontextual Nature and type of resultant tags Source of material Usercontributed, system, global Different incentives, nature and type of resultant tags Resource connectivity Links, groups, none Convergence on similar tags for linked resources Social connectivity Links, groups, none Convergence on localized folksonomy The design options taxonomy for tagging systems is summarized in Table 1, including a brief summary of the potential impact of 35 environment, where tags act as a primary navigational tool for finding similar resources and people. As previously noted, the most extensive analysis of a tagging system has been completed on data collected from the social bookmarking site Del.icio.us [9]. We have chosen Flickr to provide an alternate interpretation to the conclusions derived from this study. In nearly every category within our system taxonomy, Flickr occupies an alternative space from Del.icio.us: it contains user-contributed resources as opposed to global; tagging rights are restricted to self-tagging (and at best permission-based, although in practice self-tagging in most prevalent) instead of a free-for-all; tags are aggregated in sets instead of bags; and finally, the interface mostly affords for blind-tagging instead of suggested-tagging. These design decisions shape the incentive structures that drive people to tag resources. Since Del.icio.us is largely task-focused, namely storing bookmarks for future retrieval, organizational motivations are most dominant. 
While the social element of tagging is evident from the leveraging of the community contribution, a lack of communication systems (e.g., messaging or explicit social networks) deemphasizes non-organizational social incentives. Flickr users, on the other hand, are also likely to tag for their own retrieval, but coupled with an abundance of communication mechanisms, the system design encourages gaming and exploration of tag use. Users are primarily motivated by social incentives, including the opportunities to share and play.

In the following section we present a preliminary analysis of tag usage within Flickr. We have had the opportunity to work directly with a subset of the database used by Flickr, specifically information about photos, tags, and the explicit social relationships between users (i.e., the “contact” network). Because our focus is on the usage of tags, we have selected only those users who have utilized this feature (i.e., used at least one tag to describe a photo) and only those photos that have had at least one tag applied. Of the millions of Flickr users, we have randomly selected a set of 25,000 for our analysis of individual behaviors; for the more complicated case of network analysis, we have chosen a further subset of 2,500.

This study is only a preliminary look at the dynamics of the Flickr system and is meant to expose interesting trends and topics in the Flickr data. These topics illustrate various aspects of tagging systems and their incentive structure, but we do not attempt to prove or assert any general conclusions about all tagging systems.

system, users may be incentivized to contribute tags that might affect that global view (and even to create spam tags).

• Play and Competition: to produce tags based on an internal or external set of rules. In some cases, the system devises the rules such as the ESP Game’s incentive to tag what others might also tag. 
In others, groups develop their own rules to engage in the system such as when groups seek out all items with a particular feature and tag their existence. Some users take advantage of what is available and try to alter the system in the way they see fit. Knowing that tags appeared in a tag cloud based on the frequency of a given tag for a podcast, Odeo users attempted to construct sentences by adding and removing tags to change the order of the tags in the interface. • Self Presentation: to write a user’s own identity into the system as a way of leaving their mark on a particular resource. (for example, the “seen live” tag in Last.FM marks an individual’s identity or personal relation to the resource.) • Opinion Expression: to convey value judgments that they wish to share with others (for example, the “elitist” tag in Yahoo!’s Podcast system is utilized by some users to convey an opinion.) This range of motivations in turn affects the types of tags that are produced for a given resource. Golder and Huberman have outlined 7 individual types of tags observed in their study of Del.icio.us [9]. The first five types they mention roughly identify properties of the objects, such as the source, attributes, category membership or qualitative properties; these tags could arise from organizational motivations, social ones, or both depending on the perceived audience. The sixth tag type, self-reference (e.g., mystuff or mywork), reflects a probable intent to communicate this ownership to an outside audience, or alternatively to be used for personal organization. The final type, task-organization (e.g. toread or jobsearch) suggests an intent for personal organization. The architecture of a social tagging system reflected by the taxonomy provided in Section 4.1 does not explicitly affect the type of tag that users produce; instead, the design may influence the incentives that drive individuals to use the system. 
The types of tags observed can be seen as a resulting artifact of the different forms of motivation expressed through the resulting interaction.

5. Case Study: Flickr

Due to their popularity, social tagging systems have grown to cover a wide range of resources and communities, spanning the entire range of incentives described in the previous section. Instead of simply classifying a long list of potentially ephemeral tools, we will give a complementary example to those provided in previous work. The system we have chosen to investigate is Flickr, a popular photo-sharing site that considers tags as a core element to the sharing, retrieval, navigation, and discovery of user-contributed images.

Flickr allows users to upload their personal photos to be stored online, but unlike other online photo tools, Flickr makes these photos publicly viewable and easily discoverable by default. This design decision, along with the emphasis on tagging, has allowed the site to expand quite rapidly over its short lifespan. This growth has in part been due to the wide array of social interactions Flickr supports: in addition to uploading photos, users can also create networks of friends, join groups, send messages to other users, comment on photos, tag photos, choose their favorite photos, and so on. This abundance of communication tools and forms of social organization creates a highly interconnected media ecology that can lead users to distant people and places with only a few clicks. Tags are an important part of this

5.1 Tag Usage

Tags are not mandatory in the Flickr usage model. Within a social tagging system, tags are typically an optional feature in a larger resource organization task. Like Del.icio.us, the Flickr interface prompts users for metadata about each resource identified: a title, a caption, and a list of tags. In the case of both systems, the tag input comes third in the input interface, but also differentiates them from other resource management tools. 
In addition to tagging one’s own photos, the Flickr system also allows users to tag their friends’ photos. However, this feature is not widely used; of the 58 million tags we have observed, only a small subset are of this type; an overwhelming majority of tags are applied by the owners of photos.

Tag usage patterns vary quite drastically among Flickr users, and as expected, so does the adoption of tagging behavior. Figure 2 shows the cumulative distribution function (CDF) for tag vocabulary size across the set of users. The value at a given point is the probability (Y-axis) that a random user has a set of distinct tags that is larger than that collection size (X-axis). For example, the probability that a Flickr user has more than 750 distinct tags is roughly 0.1%. This distribution illustrates the fact that most users have very few distinct tags while a small group has extremely large sets of tags.

do they continue to grow as her experiences change? In studying Del.icio.us, Golder and Huberman show examples of how certain users’ sets of distinct tags continue to grow linearly as new resources are added. At the same time, they claim that the tags for a given resource tend to stabilize after only a few users have tagged it [9]. Since Flickr uses a set-model for representing tags, we cannot reexamine the latter observation, but we can look at the growth of a user’s tags over time.

The relationship between tag usage and other types of input can be a good indicator of how useful or important users believe tags are to the experience of using the system. Within Del.icio.us, Golder and Huberman found that there was not a strong association between the number of bookmarks made and the number of tags used to annotate those bookmarks [9]. We studied three activities within the Flickr environment: the number of uploaded photos, the count of the user’s distinct tags, and the number of contacts designated by the user. 
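As a minimal sketch of how the per-user distinct tag count and the distribution plotted in Figure 2 could be computed, the following Python assumes tagging records arrive as `(user_id, photo_id, tag)` tuples. That record layout, the function names, and the toy data are all illustrative assumptions and do not describe the Flickr database. Note that the quantity described in the text, the probability that a user has more than a given number of distinct tags, is the complementary form of the CDF.

```python
from collections import defaultdict


def distinct_tag_counts(tag_events):
    """Count distinct tags per user.

    tag_events is an iterable of (user_id, photo_id, tag) tuples; this
    record layout is a hypothetical stand-in for the real data.
    """
    vocabulary = defaultdict(set)
    for user_id, _photo_id, tag in tag_events:
        vocabulary[user_id].add(tag)
    return {user: len(tags) for user, tags in vocabulary.items()}


def prob_more_than(counts, size):
    """Fraction of users whose distinct-tag count is strictly above size."""
    if not counts:
        raise ValueError("need at least one user")
    above = sum(1 for n in counts.values() if n > size)
    return above / len(counts)


# Toy data, invented for illustration only.
events = [
    ("u1", "p1", "beach"), ("u1", "p1", "sunset"), ("u1", "p2", "beach"),
    ("u2", "p3", "cat"),
    ("u3", "p4", "paris"), ("u3", "p4", "night"), ("u3", "p5", "louvre"),
    ("u4", "p6", "dog"), ("u4", "p7", "dog"),
]
counts = distinct_tag_counts(events)
for size in (0, 1, 2, 3):
    print(size, prob_more_than(counts, size))
```

Repeated uses of the same tag by one user collapse into a single vocabulary entry, which matches the notion of "distinct tags" used throughout this section.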
For example, a certain user can have 100 photos with a total of 200 distinct tags across these photos, and be connected to 50 different contacts.

Figure 3 shows the growth of distinct tags for 10 randomly selected users over the course of uploaded photos. The users were selected as both frequent uploaders (greater than 100 photos) and frequent taggers (greater than 100 tags). Each point on this graph shows the number of distinct tags (Y-axis) for a given user after the given photo number (X-axis). It is apparent from this plot that a number of different behaviors emerge from this social tagging system. In some cases (such as user A in Figure 3), new tags are added consistently as photos are uploaded, suggesting a supply of fresh vocabulary and constant incentive for using tags. Sometimes only a few tags are used initially with a sudden growth spurt later on, suggesting that the user either discovered tags or found new incentives for using them, as with user B. For many users, such as those with few distinct tags in the graph, distinct tag growth declines over time, indicating either agreement on the tag vocabulary, or diminishing returns on their usage. Despite the heavy usage of tags for each of the individuals whose tags are depicted in the figure, a number of classes of behavior have arisen, implying that the interaction between user, tag, and utility is a varied one.

Figure 2. Distribution of distinct tag collections, represented as the probability that a r
(X-axis: number of distinct tags, logarithmic scale; Y-axis: probability, logarithmic scale.)

Table 2 shows the pair-wise Pearson correlation [19] between photo collection size, distinct tags and number of contacts across the set of users. We computed this correlation for a set of 25,000 users randomly selected from our dataset. For example, the correlation between tags and photos is 0.518, suggesting a strong linear relationship between these variables, i.e. 
an increase in photo collection size implies an increase in the number of distinct tags. The strongest relationship between these three items (photos, distinct tags, and contacts) comes between photos and distinct tags, a likely relationship due to the fact that tagging one’s own photos is the dominant form of tagging. The association between contacts and photos is much weaker than the one between contacts and distinct tags, which might suggest that tagging is related to social activity to some degree.

Table 2. Flickr usage correlation

| | Tags | Photos | Contacts |
|---|---|---|---|
| Tags | 1 | .518 | .386 |
| Photos | .518 | 1 | .192 |
| Contacts | .386 | .192 | 1 |

* N = 25,000. ** p < 0.001 for all values.

Figure 3. Number of distinct tags at given points in 10 random users’ collections
(X-axis: photo index by time; Y-axis: number of distinct tags; the curves for users A and B are labeled.)

Whereas Golder and Huberman highlighted one form of tag vocabulary growth, namely growing at a diminishing rate over time, the graph illustrates two additional use classes each with several possible explanations. Is the case of linear growth related to the type of media being tagged, namely photos that are taken of constantly evolving subject matter? Or does it evolve from a motivation to continually attract new individuals to the users’ photos? Likewise, the case of gradual increase could reflect a change in personal motivations (e.g., a need to start organizing photos once the collection grows above a certain size), or a social one (e.g., a sudden realization that tags can bring new people to see one’s photos). These questions could be answered by looking at the relationship between the growth of users’ tag collections and various forms of participation, such as the popularity of their photos or their use of the social network system.

In addition to social implications, another feature of tags worth investigating is an individual’s use of tags over time. How does the frequency of tags change as a user becomes acclimated to the system? 
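The pair-wise correlations reported in Table 2 can be illustrated with a short sketch. The Python below computes the Pearson coefficient from its textbook definition and fills a symmetric table for three per-user measures. The function names and the five-user toy data are assumptions made for illustration; the values are not drawn from the Flickr sample, and the significance test behind the reported p-values is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_table(columns):
    """Pair-wise Pearson correlations keyed by (name_a, name_b)."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Toy per-user measurements, invented for illustration only.
usage = {
    "tags": [5, 9, 16, 20, 40],
    "photos": [10, 20, 30, 40, 90],
    "contacts": [3, 1, 4, 2, 8],
}
table = correlation_table(usage)
for (a, b), r in table.items():
    print(f"{a:>8} vs {b:<8} {r:+.3f}")
```

A coefficient near 1 indicates that two measures rise together roughly linearly across users; it does not by itself establish that one causes the other.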
Do her tags become a cohesive taxonomy over time, or

“lects” within a larger sociolinguistic system. Some of these example lects include: dialect (a lect used by a geographically defined community); sociolect (a lect used by a socially defined community); ethnolect (a lect spoken by a particular ethnic group); ecolect (a lect spoken within a household or family); and idiolect (a lect particular to a certain person). If we conceptualize social tagging systems within the theoretical frame of sociolinguistics, these and other “lects” seem especially applicable to understanding and classifying the apparent isomorphism between social and linguistic structures we observed in Flickr. The structures, changes, and diffusion within and amongst various “lects” in social tagging systems will likely have similar patterns to those found in social network analyses and in sociolinguistic language maps. Considering these sociolinguistic categories as we attempt to compute structural isomorphism and the interactions between social structures and tagging structures (for example, hubs, bridges, and diffusion) may prove exceptionally useful in explaining the formation, efficacy, and dynamics of social tagging systems.

5.2 Vocabulary Formation

All of the tagging systems we have mentioned in this paper are arguably social in nature; in some cases the social aspect comes from leveraging the community’s collective intelligence, and in others there is explicit social interaction around the use of tags. Because Flickr allows users to enumerate social networks and develop communities of interest, there is a huge potential for social influence in the development of tag vocabularies. One feature of the contact network is a user’s ability to easily follow the photos being uploaded by their friends. This provides a continuous awareness of the photographic activity of their Flickr contacts, and by transitivity, a constant exposure to tagging practices. 
Do these relationships affect the formation of tag vocabularies, or are individuals guided by other stimuli? To expand on this question, we have randomly chosen 2500 users with a considerable number of tags (greater than 100) and paired them with two other individuals: one randomly chosen from the rest of the set, and the other from their list of contacts. From these pairings we have calculated the overlap in their tag sets; the overlap is computed as |A ∩ B| / |A ∪ B|, where A and B are the sets of tags from our two users.

The results of this inquiry are depicted in Figure 4. This graph shows two frequency distributions for the overlaps between sets of users: the overlap between the given user and another randomly chosen one, shown with a dashed blue line, and the overlap between the same user and one of their contacts, shown with a solid red line. The random users are much more likely to have a smaller overlap in common tags, while contacts are more distributed, and have a higher overall mean.

These questions call for a much deeper investigation of this phenomenon, a study that could answer many questions about the relationship between people, communication, and the emergence of common lects in social tagging systems.

6. CONCLUSIONS

Social tagging systems have the potential to improve on traditional solutions to many well-studied web and information systems problems. Such problems include personalized or biased link analysis, organizing information, identifying synonyms and homonyms, building networks of trust to combat link spam, monitoring trends and drift in information systems and more. The prospects of reasoning about tags, users, and resources in unity are encouraging.

In order to study these systems, researchers should observe the system’s place within the taxonomy of architectures described in Section 3.1. 
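The tag-set overlap measure used in the vocabulary study, |A ∩ B| / |A ∪ B|, is the Jaccard index. The Python sketch below shows one way the pairing could be carried out, under stated assumptions: `tags_by_user` and `contacts_by_user` are hypothetical dictionaries, the pairing helper is an illustrative reading of the procedure described in Section 5.2, and the three-user data set is invented.

```python
import random


def tag_overlap(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two tag collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty vocabularies
    return len(a & b) / len(union)


def paired_overlaps(tags_by_user, contacts_by_user, rng):
    """Map each user to (overlap with a random user, overlap with a contact).

    Users without a usable contact or without any other user are skipped.
    """
    users = sorted(tags_by_user)
    results = {}
    for user in users:
        others = [u for u in users if u != user]
        contacts = [c for c in contacts_by_user.get(user, [])
                    if c in tags_by_user and c != user]
        if not others or not contacts:
            continue
        random_user = rng.choice(others)
        contact = rng.choice(contacts)
        results[user] = (
            tag_overlap(tags_by_user[user], tags_by_user[random_user]),
            tag_overlap(tags_by_user[user], tags_by_user[contact]),
        )
    return results


# Toy data, invented for illustration only.
tags_by_user = {
    "u1": {"beach", "sunset"},
    "u2": {"beach", "sunset", "surf"},
    "u3": {"macro"},
}
contacts_by_user = {"u1": ["u2"], "u2": ["u1"], "u3": []}
pairs = paired_overlaps(tags_by_user, contacts_by_user, random.Random(0))
for user, (vs_random, vs_contact) in pairs.items():
    print(user, round(vs_random, 3), round(vs_contact, 3))
```

Collecting the two overlap values across many users and plotting their frequency distributions would give a comparison of the kind the study reports; a seeded `random.Random` instance keeps the random pairing reproducible.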
Studies should also consider the incentives driving participation, and the extent to which the system supports or restrains these motivations. In studying Flickr, we showed that the dynamics of interaction and participation are different than those of Del.icio.us. Indeed, Flickr and Del.icio.us are rather distinct when positioning them in the dimensions of our taxonomy. Del.icio.us is a free-for-all, suggestive, bag-model (to mention just three key dimensions) system. Del.icio.us is therefore likely to generate a different use model and output than Flickr, a (mostly) self-tagging, viewable, set-model system. Moreover, the incentive models of Flickr and Del.icio.us are also substantially disparate, suggesting even more expected differences in the systems’ output.

Figure 4. Vocabulary overlap distribution for random users and contacts (n=2500)
(X-axis: tag vocabulary overlap (%).)

This result, while still preliminary, shows a relationship between social affiliation and tag vocabulary formation and use even though the photos may be of completely different subject matter. This commonality could arise from similar descriptive tags (e.g., bright, contrast, black and white, or other photo features), similar content (photos taken on the same vacation), or similar subjects (co-occurring friends and family), each suggesting different modes of diffusion.

We hope that system designers will consider these design decisions in architecting their tagging systems. By laying out the implications of the choices in each dimension of our hierarchy, we hope to assist planners as well as researchers and academics.

Finally, by no means do we contend that the design taxonomy and incentive taxonomy we describe are complete. New uses for tagging systems are invented every day; users of such systems appropriate them with an ever-changing set of goals, motives, and aspirations. 
We hope that our taxonomy can serve as a foundation for researchers and enable a more complete understanding of the constraints and affordances of tag-based information systems.

Other likely explanations for the observed correlation between social connection and common tag usage may be found in the descriptive categories of sociolinguistics which studies how different geographic and social formations structure the coherence and diffusion of semantic and syntactic structures in various

7. ACKNOWLEDGMENTS

The authors would like to thank the members of the Flickr team, and the users of Flickr for providing us with fascinating data to study.

8. REFERENCES

[1] Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Addison-Wesley, 1999.

[2] Breiger, R.L. 1991. Explorations in Structural Analysis: Dual and Multiple Networks of Social Structure. New York: Garland Press.

[3] Breese, J.S., Heckerman, D. and Kadie, C.M. Empirical analysis of predictive algorithms for collaborative filtering. Microsoft Research Technical Report (MSR-TR-98-12), October 1998.

[4] Burt, R. 1992. Structural Holes: The Social Structure of Competition. Cambridge, MA: Harvard University Press.

[5] Chakrabarti, S., Dom, B., Raghavan, P., Rajagopalan, S., Gibson, D., and Kleinberg, J. 1998. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International Conference on World Wide Web 7 (Brisbane, Australia).

[6] Coates, T. Two cultures of fauxonomies collide. June 4, 2005. http://www.plasticbag.org/archives/2005/06/two_cultures_of_fauxonomies_collide.shtml

[7] Freeman, L. C. 1979. Centrality in Social Networks: Conceptual Clarification. Social Networks. 1, 215-239.

[8] Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. The vocabulary problem in human-system communication. Commun. ACM 30, 11 (1987).

[9] Golder, S., and Huberman, B. A. The Structure of Collaborative Tagging Systems. HP Labs technical report, 2005. Available from http://www.hpl.hp.com/research/idl/papers/tags/

[10] Gyöngyi, Z., Garcia-Molina, H., Pedersen, J. Combating spam with trustrank. In Proceedings of the 30th International Conference on Very Large Databases (VLDB), 2004.

[11] Hammond, T., Hannay, T., Lund, B. and Scott, J. Social Bookmarking Tools – A General Overview. D-Lib Magazine 11, 4 (April 2005).

[12] Hammond, T., Hannay, T., Lund, B. and Scott, J. Social Bookmarking Tools – A Case Study. D-Lib Magazine 11, 4 (April 2005).

[13] Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, 1998).

[14] Lakoff, G. Women, Fire and Dangerous Things. University of Chicago Press, Chicago, 2005.

[15] Maltz, D. and Ehrlich, K. Pointing the way: Active collaborative filtering. In the Proceedings of CHI 1995.

[16] Mathes, A. Folksonomies – Cooperative Classification and Communication Through Shared Metadata. UIC Technical Report, 2004.

[17] Merholz, P. Clay Shirky's Viewpoints are Overrated. http://www.peterme.com/archives/000558.html

[18] Page, L., Brin, S., Motwani, R. and Winograd, T. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.

[19] Rice, J.A. Mathematical statistics and data analysis. Belmont, CA: Duxbury Press (1995).

[20] Shirky, C. Ontology is Overrated: Categories, Links, and Tags. http://shirky.com/writings/ontology_overrated.html

[21] Udell, Jon. Collaborative filtering with Del.icio.us. June 23, 2005. http://weblog.infoworld.com/udell/2005/06/23.html

[22] Vander Wal, T. Folksonomy Definition and Wikipedia. November 2, 2005. http://www.vanderwal.net/random/entrysel.php?blog=1750

[23] von Ahn, L. and Dabbish, L. 2004. Labeling images with a computer game. CHI 2004 (Vienna, Apr. 2004). ACM Press, 319-326.

[24] Walker, J. Feral hypertext: when hypertext literature escapes control. In Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia (Salzburg, Austria, Sept. 2005). HYPERTEXT '05. ACM Press, New York, NY, 46-53.

[25] Wasserman, S. and Faust, K. Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press, 1994.

[26] White, H.C., Boorman, S.A., and Breiger, R.L. 1976. Social structure from multiple networks: Blockmodels of roles and positions. American Journal of Sociology. 81, 730-779.