I recently boiled down some of the advice I try to give students about how to carry out searches and formulate research questions, which I’ll reproduce here.
I start with the basic insight that I’ve picked up from Swarthmore’s library staff, that the point where many students struggle in research is not with finding credible or authoritative sources once they’ve settled on a topic but with understanding what is researchable or knowable within the constraints of the assignment, the resources, the disciplinary framework and so on. I feel as if too many of my colleagues are still focused on the former issue rather than the latter one, still too worried that students aren’t finding the “right” sources that have scholarly legitimacy, in favor of Wikipedia or whatever they can find as full-text at 2 a.m. I don’t think this is a big issue, both because I have a much higher opinion of Wikipedia and such than many of my colleagues and because I find that students actually have fairly good skills for finding properly authoritative sources and material. As long as they’ve gotten the research framed correctly at the outset, that is.
So what I focus on is processes of discovery that students should use to find out what’s known and knowable, how researchable a particular question is, what the shape or character of information about that question looks like, and how to make smart decisions about where to invest labor and time in developing a research assignment.
Here are the most important points I usually make:
1. Rapid iterability of search is important.
Especially early in a possible project, when a researcher is testing its viability, it is important to move rapidly through multiple searches. For this reason, it is always legitimate to favor a fast database over a slow one. (And a good browser and a fast Internet connection, etcetera.) A database that returns results slowly or in a form that is difficult to read or digest quickly is a database that must confer some extraordinary advantage in quality and type of information over fast, efficient databases in order to justify using it. Much of the time, those kinds of advantages only pay off in a big way late in a research project, when a researcher has a very well-developed sense of the topic and is looking for extremely specific kinds of information to round out their analysis or inquiry.
2. Always prefer databases that default to simple interfaces. Databases that default to advanced interfaces (or worse yet, require them) are committing aggression towards anyone who is not already an expert user of such a database.
Particularly early in a project, if a student researcher goes to a new database and is immediately greeted by an interface that spams the whole browser window with fifteen different data fields and Booleans-a-plenty, they should close the browser window and find a new database. Advanced search UIs should always be on a toggle, and never a default for a new user. (It’s fine if the database allows users to set persistent preferences so that the researcher who prefers the advanced UI can default to it if they like.) This advice goes hand-in-glove with my point about rapid iteration. Discovery practices early in a project require simplicity and speed because the student should be trying to get an overall sense of an entire information ecology, not to find single authoritative sources or understand a particular topic.
3. Work consciously on developing and refining heuristics for interpreting lists of search results.
I spend a lot of time in class showing students how I make sense of a list of search results and make decisions about whether the results are showing me a viable topic. I show them how to determine whether I’ve got the right keyword or search string, and how to evaluate the type and nature of the information or knowledge that I’m seeing listed. I often try to do this live, without rehearsal, in response to suggestions from the students, so that I’m not “salting the mine” and picking a search where I know in advance what kinds of results I’m going to get. Just looking at my desk for a similar unrehearsed term, I see Hämäläinen’s Comanche Empire, mentioned several postings back. Let’s say a student had read the first part of the book and got interested in Uto-Aztecan migrations after the fifteenth century, and in how to read or understand various oral traditions and records relating to Aztlán, the Uto-Aztecan “homeland”.
So if a student enters the keyword “Uto-Aztecan” in Tripod, our local library database, they get 13 results. If the student knows nothing about linguistics, they’re probably going to find the titles of most of the results baffling or obscure. Here’s where personal heuristics enter the picture. What I’d point out to the student is that they know what they want: history, oral tradition, Aztlan. What they have discovered, even if they don’t know what morphology, cognate sets or syntactic change are, is that this search term is used by some other knowledge community besides historians. (In fact, if the student researcher does know linguistics, they’ll know instantly that this term is first and foremost a designation of a family of languages, much as “Bantu” is in African history.) There is a history-themed title at the bottom, but one thing I’d be pointing out in the class discussion is that it’s from 1937. (Actually, it’s a Ph.D. dissertation from 1933.) If the student then tries “Aztlan” as a search term in Tripod, they’re going to find just two search results, which is very limited. But the second result, a single title, is a catalog from an exhibition which “contains nineteen essays by an international team of scholars and artists who investigate the concept of Aztlan as a metaphoric center and allegorical place of origin for the various peoples of the Southwest and Mexico”.
So what has the student learned? That this may be a difficult topic to research, but also that there is one title which is worth looking at immediately to further test out the viability of what the student had in mind in the first place. (Hämäläinen’s footnotes are another place to start, and I often point that out to students as well.) It’s not important just yet that the student understand Uto-Aztecan migrations, Aztlan as a place, linguistic history and archaeology as methodologies, or oral tradition and conceptions of origin. What they’re trying to find out by parsing a screen of search results is what kinds of information are produced by different queries, and how to read those results at a glance.
4. Learn how to generate and harvest keywords across multiple searches.
The metaphor I sometimes use to describe this process is tacking into and against the informational wind, as in sailing. A searcher needs to learn how to explore all the permutations and variants of a useful keyword, how to get a feel for when the discovery potential of a keyword concept has been exhausted, and how to leap to a completely new keyword concept and begin again. Some of this involves working through all the variations of a keyword concept inside a single digital database, sometimes it involves trying the same keyword across five or six databases.
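As a rough illustration of this tacking process, here is a minimal Python sketch run against a toy in-memory catalog; the `search` and `tack` helpers and the catalog itself are hypothetical stand-ins for a real database, not any actual library API.

```python
# Toy catalog standing in for a real library database; titles are drawn
# from the discussion here purely for illustration.
TOY_CATALOG = [
    "Comanche Empire",
    "Numic Mythologies",
    "Across the West: Human Population Movement and the Expansion of the Numa",
    "Uto-Aztecan: Comparative Phonology",
    "Hopi Oral Tradition and the Archaeology of Identity",
]

def search(keyword):
    """Case-insensitive substring match, a crude stand-in for a catalog query."""
    kw = keyword.lower()
    return [title for title in TOY_CATALOG if kw in title.lower()]

def tack(variants):
    """Work through each variant of a keyword concept, noting when a
    variant adds nothing new -- a rough signal the concept is exhausted
    and it is time to leap to a fresh one."""
    seen = set()
    for kw in variants:
        hits = set(search(kw))
        if not (hits - seen):
            print(f"{kw!r}: nothing new -- time to leap to a fresh concept")
        seen |= hits
    return seen

results = tack(["Numic", "Numa", "Uto-Aztecan"])
```

The same loop could just as easily run one keyword across several catalogs instead of several keywords across one, which is the other half of the tacking metaphor.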
Take the Aztlan example above. A student who pursued that keyword in a larger database than Tripod, say, the Library of Congress catalog or WorldCat, would probably realize fairly quickly that the concept is hugely important to the cultural imagination of Chicano activists, writers and intellectuals. The student researcher would have to decide at this point whether they’re going to refocus the paper on this use of the concept, or whether they want to study the historical migrations of Uto-Aztecan peoples out of the southern Sierra Nevada mountains from the original “Aztlan” southward and northeastward (the Aztecs and Shoshones respectively). Re-reading Hämäläinen, the student would see that there was another name for this territory, Teguayo, as well as learning the ethnonyms Numic and Shoshone.
This is a stage where I often advise students to use Wikipedia aggressively, to generate a rich base of keywords. Looking at the entry on Aztlán, the student should harvest Nahua, Mesoamerica, Chicomoztoc, Mexica, and Ute as being of interest, as well as getting a much clearer picture of the two uses of the concept and some of the scholarly debates about the migrations and history.
If the student tried “Teguayo”, they would find almost nothing. “Numic”, on the other hand, turns up works that are clearly close to the student’s interests as well as works that are primarily about linguistics. (Such as Daniel Myers, Numic Mythologies and David Madsen & David Rhode, eds., Across the West: Human Population Movement and the Expansion of the Numa.) An important part of keyword harvesting is to know when it’s time to go and read materials for a deeper understanding of the topic. At this point the student should go and read those books as well as sources garnered from an “Aztlán” search that are about the Uto-Aztecan homeland rather than the contemporary Chicano concept of the term. At the end of that process, the student will (hopefully) understand many of the historiographical issues. Now they’re facing a new choice: is this paper going to focus on Uto-Aztecan migrations in general, on oral traditions of Aztlán among many or some particular descendant people, on Spanish colonial interpretations of those oral traditions, on debates about the actual location of the historical Aztlán, or some other focus? Each of those emphases leads to a different set of branching keywords, some of them very general in nature. For example, if the student wants to think about oral traditions of origin and migration, maybe there are broader texts drawn from anthropology, history, Native American studies and so on which will be of great use. So the next round of searching and harvesting begins from that point, and requires a completely fresh take. “Oral tradition” + migration might be one interesting starting point, and that would turn up in the LC catalog Wesley Bernardini, Hopi Oral Tradition and the Archaeology of Identity, which looks very promising for developing this line of research.
5. Associations and folksonomies are underutilized and powerful (but don’t forget bibliographies).
This takes me back to a point that I made some years ago that turned out to be more provocative to many librarians than I anticipated, namely, that Amazon.com’s search tools that associate books through the aggregated preferences of consumers were exceptionally powerful tools for research discovery whose only analogues in conventional library catalogs were difficult-to-use citational databases. Seven years later, I think that’s still the case. Most conventional databases are still only barely Web 2.0-like in what they offer to researchers, or have tried to leapfrog into a Semantic Web-compliant form which aids cataloguers but not most users.
In the case of the search example I’ve been using here, this approach is going to help somewhat less than it might for other searches, as the student is developing the project towards more scholarly works that are only rarely purchased on Amazon or have significant folksonomies on a site like LibraryThing. Let’s just say my hypothetical student is a gifted researcher and notices on the Amazon page for Myers’ Numic Mythologies a recommendation to look at Peter Jones, Respect for the Ancestors, which engages the contentious relationship between the repatriation of human remains and archaeological evidence and Native American oral traditions. This is a connection that would have been more difficult to find through conventional LC-subject heading search strategies, but once the student has made the jump into this new literature, they may recognize where the most exciting or lively analytic stakes of this topic lie. After all, an undergraduate doing research on this subject is going to find it very difficult to say much about scholarly debates about the archaeological and linguistic evidence for the particular location or nature of a Uto-Aztecan polity prior to the fifteenth century. Once the researcher has found the Jones book on Amazon, a big range of interesting, relevant works opens up via the “Customers who bought…also bought” tool, such as David Thomas’ Skull Wars. The LC-subject headings for Jones, in contrast, don’t lead to that debate in any direct way. Here Amazon is showing the student researcher something that readers “know”, but authority-driven cataloging does not know, which is what the Jones book is “really” about, and therefore also, one of the best answers to the question “So what?” in reference to debates about the location and character of Aztlán.
I also point out, however, that bibliographies and footnotes in an existing authoritative source are another fantastic version of this kind of discovery tool, basically a guide to what a researcher read and considered. I suggest that the most recent source with the scope that most closely matches a student’s interests is especially useful in this way.
6. Balancing triage and intellectual depth
I talk a lot with my students about how to apply pragmatic judgments about when to end a search and discovery process in order to concentrate on the completion of an assignment. Discovery can go on endlessly, and never become clearly irrelevant or unimportant. The hypothetical student in my example could decide to actually link up debates about oral tradition and repatriation with the Chicano use of Aztlán, perhaps via reading about the politics and production of collective memory. They could do a comparative analysis of different debates about migration and oral tradition in the historiography of Native Americans across the Americas, or in relationship to immigrant communities in North America. Or a comparative study of other debates about indigeneity and autochthony elsewhere in the world. And so on.
Everything a student does, or any researcher does in any context, has a point at which there are diminishing returns to discovery simply because of limitations of time, attention, ability and purpose. The important thing for a student to understand is that they shouldn’t feel guilty when they bring a research process to a halt for this reason. There’s no set way to know when you have enough, or what counts as thorough. So I also talk a good deal about how to judge what is required for a given assignment. Some of that is dependent upon a student’s impression of the total space of information about a given topic: is it a huge, contentious, sprawling kind of space or is it a relatively placid, narrow, constrained space? If it’s the latter, for a comprehensive or ambitious research assignment like a senior thesis, a student might be expected to have some knowledge of every source or publication. If it’s the former, not so much. I also talk a lot about developing intuitions about what professors or audiences really want or expect, as opposed to what they might formally say on an assignment sheet, and letting that intuition dictate how much time or effort to put into a research process.
7. Authority and quality assessment: what you know, what I know, what you can learn to know (at this moment)
Only at the very end of talking about research do I talk about how to assess the quality or authority of the sources and information that a discovery process turns up.
Again, this is partly because I think many contemporary students at Swarthmore are already fairly skilled at recognizing basic signs of unreliable or authoritative information. So what I focus on is how to work on developing and refining more sophisticated, semi-scholarly guesses about authority and influence: how to read a search result for signs that a particular author is especially active in a given field of research (present as an author or editor in many anthologies, prominent in citational databases) or that a particular older source has retained its importance or centrality (still in print, cited or referenced in many later works).
I also encourage students to recognize that some of the judgments their professors make at a glance about the authority or influence of sources are not reasonable expectations for undergraduates. I have ways of making guesses about the reputation or influence of authors that relate to my membership in various “invisible colleges”, my understanding of the sociology of academic publishing and so on. I show students how I read a screen of search results and try to distinguish what takes a lifetime of scholarly practice to “know” at a glance and what they can reasonably learn how to do through their own experiences in a given class or over the course of four years of study of a discipline.
I often research questions that I don’t have the background for. I commonly start with CiteSeer and do a bunch of searches to look around, read a few papers semi-randomly to get an initial idea of what’s going on in the vicinity. (Is this area big or small? Active? Contentious? In a creative or in a consolidating mode?) Then, if there’s any likelihood they exist, I look for review papers, which are normally easy to find. Without a review, it takes me much longer to suss out the map.
But I think the best plan varies a lot by field, because they all have different communication cultures. You have to look for the information where it is. Physicists put tons of stuff online freely, and I gather that historians are less comfortable with that. I tried “uto-aztecan” on CiteSeer and got 27 hits, every one in linguistics.
Thanks for a great post. I have long wanted to talk about the heuristics of search, and have been shy since I’m not sure I know what I’m talking about.
A bunch of librarians are discussing this post on FriendFeed. What’s the deal with your preference for simple searches? It seems like students usually ignore the parts they don’t need, but might find them a useful cue as to what else they might be able to do with their searches in the database when needed. You don’t agree?
My beef with advanced search interfaces (which are sometimes what librarians set, sometimes what vendors set) is two-fold.
First, most of what an advanced search UI is offering an undergraduate researcher, or any researcher who is trying to get an initial sense of the overall information ecology around a given topic, are options which are not yet needed. Most options that winnow or narrow search results already require some knowledge both of the topic at hand AND knowledge about information itself to be used effectively.
Take, for example, date limits. If you set a date limit very early in a search, it implies that you already intuit something about where the best bang for the buck is to be found. I only want a student doing that when they use a keyword and what comes back is an enormous result set, and even then, I’d rather they spend some time first trying to find a keyword that by itself and of itself limits the results. Advanced search UIs are great on the second or third pass of a search process, but in the very beginning they can often confuse and prevent a researcher from understanding the big informational picture.
Second, my hating on advanced search UIs has a lot to do with my deep irritation with the constant fiddling with those UIs that vendors (and yeah, librarians) have been doing for years, and the extent to which there are few common standards and the design of many search UIs is cluttered. Every time a search UI gets changed, even a little bit, is time I have to spend learning where a particular box or button or toggle has gone to. Vendors really don’t seem to give a crap about imposing that cost on users, and their changes often have as much to do with branding as anything else. I wrote about this a bit in a past entry.
Sometimes I think it’s important to inform students about the ways in which the assumptions of older research tools lurk skeletally beneath the smooth (or not so smooth) surface of a searchable database.
E.g. L’Année Philologique. Not easy to use, unfortunately. I always tell students to take a little time just to play around with it when using it for the first time. But something else that helps is to show how the hard-copy volumes are organized and indexed, and how that reflects previous assumptions about what categories of work would be most common in the field. Then it’s a bit clearer why some search terms are more likely to be productive than others.
Thanks for the response. I agree that setting too many (any?) limiters early on is a mistake, I just wonder if students will find those limiters later if they have to decide to click on an “advanced search” tab. For that reason, I’m liking the trend toward faceted browsing that makes it easy to start with a keyword search and then see at a glance what limiting by date or subject heading or author or whatever might do for you.
Faceted browsing is a good way to do it. I should also be clear that I show students how to use advanced search limits and where they tend to hide in simple-UI defaults. But I really don’t want them messing with those tools until they’re very clear what the usefulness of putting limits on a search might be. Keywords themselves should provide limits as you develop them–a keyword that brings you more than you can handle is a bad one, and is telling you something about a problem you have with your conceptualization of your interests. Take the hypothetical student in my post: if they start off saying, “Well, I’m interested in migration” or “I’m interested in oral traditions”, I want them to figure out why that’s not going to fly before they start messing around with limits to their searching.
As others have said, this is a wonderful post. I am totally down with everything except I have a different opinion about #2. This really depends on the search engine. In Google, for example, one can type in a simple keyword search and get results. From there, one can narrow by adding additional keywords, using options in the left menu bar, etc., so this works okay. In PubMed, I’ve noticed that the automated term mapping has greatly improved over even the past five years. So, again, a simple keyword search gets pretty good results and one can narrow, mine for keywords and subject headings, iteratively refine the search (which I always strongly stress), look at similar articles, etc. In EBSCO databases, though, simple keyword searches in the basic search box often retrieve too few results. In my experience, students expect the basic search in EBSCO to behave like Google, but the search algorithm is totally different. In my opinion, in order to get decent recall in the basic search box, one needs to do some fancypants nesting, Boolean, truncation, etc., which I think is much harder than using the advanced search screen. I tell them, “This is a structured database that doesn’t behave like Google. So, you need to do a structured search.” I teach students to put one concept per line, to leave the default ANDs between the rows alone and then to use ORs between synonyms/closely related terms within each row. I might also tell them about truncation, and they tend to love certain limits (although it is true that they can get carried away and check too many limits and end up with nothing). Students take to this strategy and it has the extra bonus of helping them to define their research question. So, I guess what I’m saying is until vendors like EBSCO greatly improve their term mapping &/or vastly improve their relevancy ranking algorithms, I’ll continue to teach the advanced search interfaces. (FWIW, all of our databases are set to the simple search interfaces by default).
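For what it’s worth, the “one concept per line” structure can be sketched in a few lines of Python; the `build_query` helper and the sample terms below are a hypothetical illustration of the Boolean shape of such a search, not any vendor’s actual syntax.

```python
# Sketch of the "one concept per line" strategy: synonyms within a
# concept are ORed together, and the concepts are ANDed, mirroring the
# rows of an advanced-search form.

def build_query(concept_rows):
    """Each row is a list of synonyms for one concept; rows are ANDed,
    synonyms within a row are ORed."""
    rows = ["(" + " OR ".join(terms) + ")" for terms in concept_rows]
    return " AND ".join(rows)

query = build_query([
    ["Aztlan", "Aztlán"],                      # concept 1: the homeland
    ['"oral tradition"', "myth*", "legend*"],  # concept 2: oral sources
])
print(query)
# (Aztlan OR Aztlán) AND ("oral tradition" OR myth* OR legend*)
```

The truncation wildcards and phrase quotes are left to the searcher, just as they would be on the advanced-search screen.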
Also, much as I would like to, I don’t think I could convince our students that they “should be trying to get an overall sense of an entire information ecology, not to find single authoritative sources or understand a particular topic.” I’m all for this, don’t get me wrong, but students are much more interested in finding and obtaining full text quickly than in searching or understanding the search process or lay of the information landscape. Some students are willing to back up the truck and spend some time getting the lay of the land but others just want their five freaking articles, thank you very much. 🙂
I teach an information literacy class and I’m already thinking about how I can incorporate some of the concepts you articulate so well in this post. Very helpful, thank you.
This for me would be a reason why undergraduates should never begin a search in an EBSCO database. 🙂 I’m only being a bit flip here. Those are precisely the databases I’m trying to keep them away from unless they know they’re working with a topic where their use is specifically recommended and justified, or they’re at a stage in the research where they want or need to drill down a bit. There are research communities and research traditions where this advice wouldn’t hold, but in most of my classes, starting in a “big” environment like the LC catalog, WorldCat, our own Tripod, Google, Amazon, is the right move.
On the bigger point of information literacy: well, yeah, now THAT is the really big challenge. Students often just want their five articles now because they also accurately gauge what I mention above, namely, what the faculty actually want. The demand for this kind of understanding of information is going to have to be driven by faculty expectation, which in turn takes faculty thinking this way about information. Most faculty don’t, in my experience, while many librarians and information scientists do. I don’t really know how we’re going to get to where we ought to be.
Yes, I teach students to start with background and overview sources, when I have a chance. Often, they try to jump straight to a research question (or, even worse, a thesis statement) and straight to finding scholarly articles, which leads to all kinds of problems.
Also, searching, reading and writing are an integrated, iterative process. You know, find some initial background material, read some of it, think about it, write something about it, figure out what interests you, do some searching for sources on that, mine the source for potential search terminology, write some more, refine the topic, start thinking about potential theses, lather, rinse, repeat.
Oh, one more thing: I HATE that certain vendors have monopolies on certain indexes. We give these vendors too much power and then are stuck with their sucky interfaces (e.g., CINAHL, my current pet peeve, as the whole universe likely knows by now). I do like Google Scholar very much indeed.
Certainly one of my major pet peeves. We ought to be the whip hand in these relationships. Who else is going to buy most of these vendor products? And yet we passively accept terrible design decisions, hoarding of information, aggressive attempts to break interoperability and much else. This is another reason why faculty need to be more literate about contemporary information environments, so that we can insist that things change or demand that our institutions stop buying the products that don’t service our needs or specifications.
Amen. Academic librarians around the world hug you (not all at the same time, of course).
Over at the Friendfeed site, the librarians really disagree with your number 2. However, I think you are absolutely right for History and related disciplines. My high school students are completely flummoxed by most advanced search features and have no idea how to use them. It’s only after they get a gazillion hit results that the categories start to make sense.
Right. I think using an advanced interface makes sense only when you start with a very specific need, you’re working in a disciplinary environment where there are more constraints on information and where research traditions don’t necessarily involve working extensively with textual or archival data at the discovery stage. My colleague Colin Purrington has written up advice on how to get primary literature citations for lab reports, for example, where he directs students very appropriately to specialized databases right at the start of their inquiry.
One thing I’ve learned during my indexing training is that there are underlying reasons so many of the big databases are so researcher unfriendly, and why it’s worth trying different variants of keywords during your search. Basically, they are not seamless across the content; most of them are made of conglomerations of smaller databases from different sources, and not all of them are indexed consistently over time. When those smaller databases are later combined and aggregated by the central hosting service, the indexing that’s done at that level is perforce more simplistic, because the indexing “thesaurus” – the lexicon of acceptable keywords – is more restricted and has to apply widely across disciplines, instead of focusing narrowly on the needs of a single field. Moreover, because they are indexing so many articles so rapidly, most of them are limited to a handful of keywords at best, and the accuracy of the keywords depends both on the experience of the indexer and the clarity of the keyword descriptions. Imagine a series of drop-down menus suggesting possible terms, and only five fields that you can fill with terms, and that’s more-or-less what you’re dealing with. Moreover, as meanings of terms and concentrations of concepts evolve over time, new keywords are added to the thesaurus – but they are not back-added to older entries. So, for example, if you search under “cell phone” you will not get older entries using the keyword phrase “mobile phone,” the phrase that “cell phone” came to replace.
The result is that you have at least three overlapping sets of searchable text – the original keywords from the original database, the new keywords imposed by the aggregating service, and the keywords embedded in the text of the article – and each of those sets evolves and changes over time. Expecting an accurate and perfect set of results from a single search is therefore unreasonable, even if one has a high degree of experience coming up with viable search strings. Instead, the most useful way is to run at the topic from as many directions as you can, and pin it down that way.
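A toy sketch of that last point, with invented records and keywords: a single search against one thesaurus term misses everything indexed under the term it replaced, while running at the topic from several directions and unioning the hits recovers it.

```python
# Toy index where older records carry the superseded thesaurus term and
# newer records carry its replacement; the record IDs and term history
# are invented for illustration.
INDEX = {
    "mobile phone": {"rec-1994-a", "rec-1999-b"},  # pre-update keyword
    "cell phone": {"rec-2006-c", "rec-2011-d"},    # replacement keyword
}

def search_one(term):
    """One query against one keyword set."""
    return INDEX.get(term, set())

def search_many(terms):
    """Run at the topic from several directions and union the hits."""
    hits = set()
    for term in terms:
        hits |= search_one(term)
    return hits

single = search_one("cell phone")                       # newer half only
combined = search_many(["cell phone", "mobile phone"])  # both halves
```

Since keywords are not back-added to older entries, no single term can see the whole literature; the union across variants is the only reliable move.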