Chemical Information Sources/General Search Strategies
Introduction: Search Engines versus Databases
The most common first step in finding information of any type is to use an Internet search engine, such as Google. A search engine is a computer program designed to retrieve Internet-based resources (web pages, files, images, etc.) that correspond to an entered search term. Usually, there is little to no additional information provided with the search results. The search results themselves may differ from engine to engine, depending on the program used to compile and return resources. For specialized or scholarly information (and chemical information in particular), general search engines fall short in two key aspects:
- They are, at a basic level, very broad. This leads to user frustration when an unrefined search for information retrieves too many irrelevant results, some of which may not be appropriate for an academic or industrial research project.
- They are by their nature limited to items available online. Thus, in the case of journals or books that do not have online representation (rare but still a factor particularly for older titles), search engines will not find them.
There are several search strategies that can be employed for any given text-based electronic search engine to help alleviate Problem 1. These include using Boolean operators to narrow or widen a search with specific terms, employing truncation or “wildcard” symbols to provide for variable matches on a base search term, and enclosing phrases in quotation marks to ensure exact phrase matching. Section 2 - Search Strategies (in progress) describes these techniques in more detail.
The use of subject-specific databases helps with both Problem 1 and Problem 2. A database is a searchable repository of actual information, in contrast to a search engine, which only provides links to information. Databases may exist in print, electronically (but not online - as a DVD for example), or online. Databases are usually (but not always) maintained by individuals or organizations that control how the database is structured, what preferred words are used within the database, and what types of searches can be performed. By choosing an appropriate database - one that covers the subject of interest - the search is more likely to initially return more relevant results. Subject-specific databases also focus on all the literature available, even from journals or resources still produced or more popularly available in print, or archival materials that have not yet been digitized.
Two key skills are important to navigating subject databases successfully:
- Knowing what the database covers with respect to subdisciplines (and even individual journals) and time periods;
- Knowing the language of the subject (and database), including preferred search terms (referred to as index terms, supplementary terms, keywords, etc.), as well as categorization terms (which may vary from database to database).
Developing Skill 1 involves familiarization with specific databases. Usually coverage information can be found on the company or database website. Skill 2 is naturally developed by researching a topic, and taking advantage of information provided at both the article level and database level. Skill 2 development also involves the use of the search strategies mentioned earlier, and will be included throughout this chapter in the relevant database or resource sections. In particular, the chemical literature can be searched by visual terms (i.e. chemical structures) in addition to text, which poses its own challenges and opportunities, and will be discussed in Section X (to add).
Section 3 - Types of Electronic Information Sources (in progress) outlines different electronic sources available for searching different types of information, and recommends an appropriate approach for each one.
Section 4 - Chemistry Databases and Search Engines (in progress) provides an overview of some of the most popular chemical information databases and associated access portals. Some databases have several access portals.
Finally, Section 5 (in progress) contains links to further reading and supplemental information.
This section will list a few helpful search strategies for using online search engines and databases.
Boolean Search Operators
Online search systems offer BOOLEAN SEARCH OPERATORS that show the logical relationship among different concepts or words in the search.
Let's assume that we are sending orders for home delivery to Doc's Gourmet Bakery using Boolean operators to express our orders. In the examples below, assume that the plates on which the desserts are delivered are documents, and the pie, cake, and ice cream are words in those documents. The tray on which the plates are sitting represents the answer set.
The most common Boolean operators are:
- OR - Concepts linked with the OR operator are synonymous or related in some fashion.
The OR operator broadens the scope of the search by including acronyms, abbreviations, and similar terms that may be used in the indexing of the documents in the database. One document in the answer set might contain only one of the terms, a different document might have another one, and a third might contain two, three, or all of the terms in an OR statement. The OR Boolean operator puts all of these documents into the final answer set, even if only one of the terms is actually present in a given document.
The normal use of the English word "or" implies a choice, with only one thing possible in the final selection. In a Boolean sense, OR gathers every item that matches any of the terms and puts them all into one set. A special variant of the OR operator is XOR (exclusive OR). XOR retrieves a document only if exactly one of the terms in the statement is present, and skips any document that contains both terms.
Example: pie OR cake
If each of the pieces of pie and each of the pieces of cake in Doc's Gourmet Bakery were placed on its own plate and arranged on a huge tray, we would satisfy the search (pie OR cake), and the tray would represent our answer set. Since the XOR operator was not used, there could even be some plates on which both pie and cake were found. In the Venn diagram, everything that is represented by the top two circles would be pulled and delivered in the order. The overlapping segment of the top two circles implies that some of the plates would definitely have both pie and cake on them.
- AND - Different concepts are combined with the AND operator to ensure that both are found in the same document(s).
In conversational English, "and" is used to group things that may or may not be similar. In a Boolean search, all terms connected with the AND operator must appear in each document in the answer set.
Example: cake AND ice cream
In this example, each of the pieces of cake in our order would be on its own plate with some ice cream on top in order to satisfy the search, and only those plates would be on the tray that is delivered. The two segments of the bottom circle that are shaded in the upper right-hand area represent this search.
- NOT - A concept is excluded from the final answer set with the use of the NOT operator.
Example: (cake AND ice cream) NOT chocolate
Now, let's add a further refinement to the search that is not really illustrated in the Venn diagram. Let's assume that you are allergic to chocolate, but that Doc's Gourmet Bakery at the time of your order has only chocolate cake left. You would not get any dessert, because NOT removes from the answer set every plate on which the excluded term appears. It throws out each of the plates containing the chocolate cake even if the ice cream on top is your favorite, vanilla.
Let's try another search for pie on the same day that Doc's has only chocolate cake on the shelves.
Example: (pie AND ice cream) NOT chocolate
In this case, our order would get us some pie (as long as it wasn't chocolate pie or the pie didn't have chocolate ice cream on it).
From the examples, you should realize that the NOT command must be used with caution in online searching since it could eliminate some documents that are of interest if they also happen to discuss aspects of a topic that are not of interest. In the last NOT example, for instance, you would not get any plate that had both pie and chocolate cake on it.
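The bakery orders above can be mimicked with Python sets. This is only a minimal sketch: the plate numbers and their contents are invented for illustration, and real search systems evaluate Boolean logic against database indexes rather than small in-memory sets.

```python
# Hypothetical plates (documents) and the desserts (words) found on each one.
plates = {
    1: {"pie"},
    2: {"cake"},
    3: {"pie", "cake"},
    4: {"cake", "ice cream", "chocolate"},
    5: {"pie", "ice cream"},
    6: {"cake", "ice cream"},
}


def having(term):
    """Return the numbers of all plates that contain the given term."""
    return {number for number, contents in plates.items() if term in contents}


pie, cake = having("pie"), having("cake")
ice_cream, chocolate = having("ice cream"), having("chocolate")

pie_or_cake = pie | cake                          # OR: union of the two sets
pie_xor_cake = pie ^ cake                         # XOR: one term but not both
cake_and_ice_cream = cake & ice_cream             # AND: intersection
cake_order = (cake & ice_cream) - chocolate       # (cake AND ice cream) NOT chocolate
pie_order = (pie & ice_cream) - chocolate         # (pie AND ice cream) NOT chocolate

print(sorted(pie_or_cake))         # [1, 2, 3, 4, 5, 6]
print(sorted(pie_xor_cake))        # [1, 2, 4, 5, 6]
print(sorted(cake_and_ice_cream))  # [4, 6]
print(sorted(cake_order))          # [6]
print(sorted(pie_order))           # [5]
```

Plate 4 drops out of the cake order even though it satisfies the AND part, which is the same caution about NOT that the text describes.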
There are more specific variants of the AND command that can be used to define the spatial relationships of search terms. These are called POSITIONAL or PROXIMITY OPERATORS. On STN, they are:
- (A) - terms must be adjacent without regard to order
- (W) - terms must be adjacent and in the order specified
- (L) - terms must occur in the same logical unit (field)
- (S) - terms must be in the same sentence within the same field.
Note that on STN the (A) and (W) operators mean the same in all files; other proximity operators may yield different results depending on the file. STN assumes that multi-word phrases are to be searched using the (W) operator in the absence of explicit positional or other Boolean operators.
See "Operators for Relating Search Terms" for some examples of Boolean search operators on the STN system.
Some of the examples illustrate the use of NESTING, placing terms in parentheses so that the search system knows to perform those functions first before moving on to other operators.
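The effect of the positional operators listed above can be approximated in a few lines of Python. This is a rough sketch, not STN's implementation: it assumes (W) means adjacent in the stated order, (A) adjacent in either order, and (S) within the same sentence, and it uses a naive tokenizer and sentence splitter. The function names and sample sentences are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def adjacent_in_order(text, first, second):
    """(W)-style check: `first` is immediately followed by `second`."""
    words = tokens(text)
    return any(a == first and b == second for a, b in zip(words, words[1:]))


def adjacent_any_order(text, first, second):
    """(A)-style check: the two terms are adjacent in either order."""
    return (adjacent_in_order(text, first, second)
            or adjacent_in_order(text, second, first))


def same_sentence(text, first, second):
    """(S)-style check: both terms occur within a single sentence."""
    for sentence in re.split(r"[.!?]+", text):
        words = tokens(sentence)
        if first in words and second in words:
            return True
    return False


reversed_order = "Nanorods polymer composites were prepared."
one_sentence = "The polymer was cast into films containing gold nanorods."
two_sentences = "A polymer was cast. Gold nanorods were added later."

print(adjacent_in_order(reversed_order, "polymer", "nanorods"))   # False
print(adjacent_any_order(reversed_order, "polymer", "nanorods"))  # True
print(same_sentence(one_sentence, "polymer", "nanorods"))         # True
print(same_sentence(two_sentences, "polymer", "nanorods"))        # False
```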
Truncation (Masking) of Characters to Expand a Search
In many cases where subject searches are concerned, we are looking for topics that involve words built on a common root word, or that have some other variations that are easily signaled to a computer by means of a special symbol. TRUNCATION is the technique that tells the computer to form an answer set consisting of all records that contain words with the characters input for the search, but could also contain related words with suffixes (or, in some cases, prefixes) or variable characters at a given point in the word. It is NOT possible to use the truncation technique on SciFinder research topic searches. However, it can be applied on command-driven searches such as those done on STN. For examples, see:
Truncation can occur at the left end or the right end of a word stem or within the word. STN now allows all three types of truncation in the CA File Basic Index, an index of subject words from the title words, words in the abstracts, or index terms (including Registry Numbers for compounds discussed in the documents). The limit of terms that can be gathered in a set by truncation is 30,000 stems. For left truncation the search term must have at least four characters.
On the STN system, truncation symbols are:
|Symbol||Matches||Example|
|exclamation point (!)||Exactly one character||cataly!e|
|hash mark (#)||One or no character||alcohol#|
|question mark (?)||Any number of characters||?therap?|
As noted in the table, the # sign can be used at the end of a word to pick up both singular and plural forms of a word. Another way of accomplishing the same thing on STN using the command language option is to enter SET PLURALS ON at the system prompt. Both left- and right-hand truncations are allowed with the "?".
There are limits to the number of terms that can be gathered into a set using truncation. Therefore, caution must be exercised in using truncation to prevent too many search terms (or unexpected words) from entering the answer set.
Novice searchers and even professionals sometimes make gross errors with truncation, especially in systems that allow both left- and right-hand truncation. Think what would happen if a search were run with these character strings truncated on both sides:
Every occurrence of the word "chemical" or "chemistry" or "biochemical," etc. would be pulled in the first search, but also documents containing words such as "hemisphere". In the second case, every document that contains an English word that ends in -ION would be pulled. Probably not what the searcher would have wanted!
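One way to get a feel for the three truncation symbols is to translate them into regular expressions and try them on a word list. The sketch below is an approximation under stated assumptions: it treats "!" as exactly one word character, "#" as zero or one, and "?" as any number, and it matches against a small invented vocabulary rather than a real index. The search term "?act?" is a made-up example of over-truncation, not one taken from the text.

```python
import re

# Assumed translation of the truncation symbols into regex fragments.
SYMBOLS = {"!": r"\w", "#": r"\w?", "?": r"\w*"}


def truncation_to_regex(term):
    """Translate a truncated search term into a case-insensitive regex."""
    body = "".join(SYMBOLS.get(ch, re.escape(ch)) for ch in term)
    return re.compile(body, re.IGNORECASE)


def matches(term, words):
    """Return the words that match the truncated term in full."""
    pattern = truncation_to_regex(term)
    return [word for word in words if pattern.fullmatch(word)]


# Invented vocabulary standing in for the words of a database index.
vocabulary = [
    "catalyze", "catalyse", "catalyst",
    "alcohol", "alcohols", "alcoholic",
    "therapy", "chemotherapy", "therapeutic",
    "reaction", "extraction", "bacteria", "factor",
]

print(matches("cataly!e", vocabulary))  # ['catalyze', 'catalyse']
print(matches("alcohol#", vocabulary))  # ['alcohol', 'alcohols']
print(matches("?therap?", vocabulary))  # ['therapy', 'chemotherapy', 'therapeutic']
print(matches("?act?", vocabulary))     # ['reaction', 'extraction', 'bacteria', 'factor']
```

The last line shows how a short stem truncated on both sides sweeps in unrelated words, which is the kind of error the text warns about.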
Unfortunately, there is no uniformity of symbols used to designate truncation among different vendors or search engines, although often we find an asterisk (*) used to indicate the right-hand truncation point. That is the case with the Web of Science, for example.
With SciFinder, no truncation is used. The searcher simply types into the Research Topic search window the natural language expression that defines the search, without even trying to insert Boolean search terms. The SciFinder search algorithm has some built-in intelligence to look for relevant word forms for the search. For instance, the search system automatically searches for both singular and plural subject words.
Let's consider the results of a research topic search run a few years ago on SciFinder for the analytical technique "Electron Spectroscopy for Chemical Analysis (ESCA)," including results from both the CAplus and Medline databases.
At the time it was run, the search as entered found 4395 references where the two concepts "electron spectroscopy" and "chemical analysis" were closely associated with each other and only 582 where the phrase as entered was found. In this case, let's repeat the search using the acronym for the analytical technique (ESCA) and also use a synonymous acronym, XPS. (The technique is also known as X-Ray Photoelectron Spectroscopy.) We have the option of entering synonymous words in parentheses, following a term or phrase. Thus, entering the research topic search on SciFinder as:
would imply to the system that you are looking for synonymous terms (an OR search). This search found considerably more documents: 114,511 at the time of the search on October 3, 2004. However, many of the 35,609 records pulled by the ESCA part of the search were false drops that matched the word "escape"! Entering ESCA by itself pulled 7516 records with the term "as entered," and it appeared that all but the oldest (a 1918 record) were relevant. Thus, the technique of entering synonyms in parentheses must be used with caution on SciFinder.
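The "escape" false drops suggest that the acronym was being matched as the start of longer words. How SciFinder actually matches terms is not described here, so the following is only a guess at the mechanism, using invented titles, that contrasts a loose stem match with a whole-word match.

```python
import re

# Invented titles for illustration only.
titles = [
    "ESCA study of oxidized polymer surfaces",
    "Escape of electrons from thin films",
    "XPS and ESCA characterization of catalysts",
]


def loose_match(term, text):
    """Match any word that merely begins with the term (prone to false drops)."""
    return re.search(rf"\b{re.escape(term)}", text, re.IGNORECASE) is not None


def exact_match(term, text):
    """Match the term only when it appears as a complete word."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None


loose_hits = [title for title in titles if loose_match("ESCA", title)]
exact_hits = [title for title in titles if exact_match("ESCA", title)]

print(len(loose_hits))  # 3: includes the "Escape" title
print(len(exact_hits))  # 2: only the titles that contain the acronym itself
```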
Enclosing a phrase in quotation marks considerably narrows a search by limiting results to those in which the exact phrase appears, in the order in which it is entered. A basic example would be searching for polymer nanorods versus "polymer nanorods":
polymer nanorods: Most search engines will perform an AND search with the terms polymer and nanorods, resulting in extraneous results.
"polymer nanorods": Enclosing the term in quotes will ensure the results returned contain polymer nanorods as adjacent terms.
Types of Electronic Information Sources
Bibliographic versus Non-Bibliographic
When searching for peer-reviewed scientific information, two broad types of search engines and databases can be distinguished:
Non-bibliographic sources: These include property databases, chemical structure databases, dictionaries, and encyclopedias that provide actual answers to questions without the need to consult another source.
Examples: Encyclopedia Britannica, the CRC Handbook of Chemistry and Physics, SciFinder, ChemSpider
Bibliographic databases: These include records of published works, perhaps with abstracts, and increasingly with links to the full texts of the primary documents.
Examples: Web of Science, SciFinder, Compendex
Web search engines do not have access to library databases such as the Web OPACs that tell you the holdings of the libraries, nor can they access any of the commercial vendors' offerings. Nevertheless, the search engines are very powerful tools, and for certain types of questions, they can be very useful in a search for information. For example, many people, including chemists, maintain their own personal Web pages nowadays. For locating someone and perhaps finding a full or selective bibliography or a curriculum vitae (CV) of a chemist, the Web may offer the best route to reliable, up-to-date information. Likewise, very new or hot topics may be discussed in Web news groups, discussion lists, or blogs long before they appear in traditional journals and, later, in abstracting and indexing services. For all of these reasons, we are beginning to see the commercial vendors add options to transfer the search strategy used in a commercial database search to the Internet for further information. One example is Elsevier's Scirus, with which you can search both Elsevier journals and the Web.
In spite of the ease of accessing the Web, it ought to be a fairly rare case that you begin a subject search for information with a Web search engine if you have easy access to online commercial databases in your organization. Databases such as the Web of Science (including Science Citation Index potentially all the way back to 1900), Elsevier's Reaxys databases (among which are the Gmelin and Beilstein databases that cover the literature of modern inorganic, organic, and organometallic chemistry back to their beginnings in the 18th and 19th centuries), and Chemical Abstracts (that covers all areas of chemistry in a comprehensive manner back to 1907 and even earlier in some cases) are usually much better first choices, if they are available to you.
The options for database searching include:
- ONLINE SEARCHING of commercial databases located outside your organization.
Vendors of online search services (for example, STN International) lease or acquire databases from the database producers (such as Chemical Abstracts Service or Thomson Reuters) and make them available on remote computers. For a given vendor, which may have dozens or hundreds of databases on its computers, the databases are all searched by a common command language or graphical user interface. In the vast majority of these cases, there is a fee for searching the databases.
- WEB SEARCH ENGINES.
As noted above, the powerful search engines of today can provide a useful supplement to traditional online searches.
- FREE CHEMISTRY DATABASES ON THE WEB.
Some databases that are available for searching free on the Internet are of very high quality, for example, those produced by the National Library of Medicine or other government agencies or commercial organizations. However, the quality of most databases that are freely accessible on the Internet is likely not to be as high as that of commercial databases. In addition, there are many differences in the search interfaces that the user encounters among free Internet databases. Nevertheless, they should not be ignored for certain types of searches.
- IN-HOUSE SEARCHING of databases within the organization.
Chemical and pharmaceutical companies now routinely load databases on their own computers.
Chemistry Databases and Search Engines
Costs and Benefits of Online Searching
The costs of a commercial online search are usually not fixed, but are dependent on several factors, including telecommunications network charges (even a connection via the Internet is not free on a commercial system), connect time on the vendor's computer, royalties charged for the information extracted from the database (known as HIT CHARGES), and on some systems, charges for the search terms input in the search strategy.
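The cost factors just listed add up in a simple way, which the sketch below illustrates. Every rate in it is invented for the example and the function and parameter names are hypothetical; actual charges vary by vendor and database and must be taken from the vendor's price list.

```python
def estimate_search_cost(connect_minutes, hits_displayed, search_terms,
                         network_rate_per_hour, connect_rate_per_hour,
                         hit_charge, term_charge=0.0):
    """Estimate the cost of one online search session.

    Time-based charges (network and connect time) scale with session length;
    hit charges and search-term charges scale with what was displayed or entered.
    """
    hours = connect_minutes / 60
    time_cost = hours * (network_rate_per_hour + connect_rate_per_hour)
    hit_cost = hits_displayed * hit_charge
    term_cost = search_terms * term_charge
    return time_cost + hit_cost + term_cost


# Hypothetical session: 30 minutes online, 20 records displayed, 5 search terms.
cost = estimate_search_cost(
    connect_minutes=30,
    hits_displayed=20,
    search_terms=5,
    network_rate_per_hour=10.0,   # invented rate
    connect_rate_per_hour=60.0,   # invented rate
    hit_charge=2.50,              # invented rate
    term_charge=1.00,             # invented rate
)
print(cost)  # 90.0
```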
The benefits of using an online vendor to search databases include:
- Command language is uniform across all databases on that vendor's system. (Unfortunately, there is little movement toward adoption of a Common Command Language among vendors.)
- More years of the database are searchable than with other formats of the database, such as CD-ROM, and those years can be searched simultaneously.
- Trained Help Desk personnel will assist you when problems arise.
STN International is at present the only online vendor to have available the abstracts from Chemical Abstracts. The abstract's summary of the document provides a quick way to assess whether the document itself should be read for further information.
The CA, Registry, and Other CAS-Produced Files on STN: CAS Databases
Chemical Abstracts is the largest and most nearly comprehensive abstracting service for information in chemistry. It covers a very broad range of topics and has been published since 1907. Chemical Abstracts Service (CAS) has also added machine-translated records from the old German abstracting service Chemisches Zentralblatt for the 1897-1906 time period. Over 180,000 records from the pre-1907 era were thus added to the CA database. Other sources of 19th-century chemical literature have pushed the coverage for some categories back to 1840. See CAS Content at a Glance for more details.
At present Chemical Abstracts Service creates two main files and several related databases. These include the CA File of literature that extends back to 1907. The Registry File contains searchable information that leads to the rapid identification of a compound, when a name, molecular structure, or other pertinent data is known about it. The Registry File also links these substances to the information that is indexed in the CA File and other chemical databases on the STN system through the Registry Numbers assigned by Chemical Abstracts Service to chemical substances. The CAS REGISTRY NUMBER is a unique number assigned to each chemical substance in the Registry File. For isatin, it is 91-56-5. To accommodate the continuing growth of substance information in the Registry file, CAS began to assign 10-digit CAS Registry Number (CAS RN) identifiers for newly registered substances in mid-January 2008. With more than 135,000,000 chemical substances in the Registry File, it is the largest compilation of such data in the world.
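A CAS Registry Number has a built-in safeguard that the text above does not discuss: its final digit is a check digit. As a minimal sketch, assuming the publicly documented rule (weight the remaining digits 1, 2, 3, ... from right to left, sum the products, and take the result modulo 10), the isatin number can be verified as follows. The function name is hypothetical.

```python
import re


def is_valid_cas_rn(cas_rn):
    """Check the format and the check digit of a CAS Registry Number string."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


print(is_valid_cas_rn("91-56-5"))  # True: 6*1 + 5*2 + 1*3 + 9*4 = 55, and 55 % 10 = 5
print(is_valid_cas_rn("91-56-4"))  # False: wrong check digit
print(is_valid_cas_rn("isatin"))   # False: not in Registry Number format
```

A check like this catches most single-digit typing errors before a number is used as a search key, but it cannot tell whether a well-formed number has actually been assigned to a substance.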
Also produced by CAS are the CASREACT file of organic reaction data, the CHEMCATS file that links chemical substances with commercial suppliers, the CHEMLIST file of regulatory data, a special variant of the CA File, CAplus, that offers rapid coverage of the articles in the main journals of chemistry, and the MARPAT file that facilitates retrieval of structures covered in patents through a technique called Markush searching.
The CA File covers chemical literature found in journals, patents, patent families, technical reports, books, conference proceedings, and dissertations from all areas of chemistry, biochemistry, chemical engineering, and related sciences from 1907 to the present. The CAplus file is a special version of the CA File that even has records for articles published before 1907. Since October 1994 it contains all articles from more than 1,500 core chemical journals, including records for document types not covered in Chemical Abstracts (CA): biographical items, book reviews, editorials, errata, letters to the editor, news announcements, product reviews, meeting abstracts, and miscellaneous items. Bibliographic information and abstracts for the articles from the key chemical journals are added within one week of journal receipt. Both the CA and CAplus files were retrospectively converted to include earlier information. By the end of 2002, all CA bibliographic data that appeared in the printed Chemical Abstracts was included in the CA and CAplus files.
There are low-cost learning files that correspond to:
- CA File, the bibliographic file that now has more than 32,990,000 records dating from 1907 to the present. It includes full indexing and abstracting of the original documents. Examples are found in the LCA Database Summary Sheet.
- Registry File, the file containing information on more than 135,230,000 substances, including the CAS Registry Number, the CAS Index Name, other Chemical Names, Molecular Formula, and structural depiction for each substance. (Examples of the Learning Registry File searches and records are found in the LREGISTRY Database Summary Sheet.)
SciFinder and Other Front-end Software and WWW Access
Learning the command language of STN International, DIALOG, or other vendors can be a significant barrier to online searching for some. There are programs that can help the novice searcher. One such FRONT-END program is STN Express, a product that is free to all STN customers. Questel's IMAGINATION software is another front-end software package.
The most recent efforts by the major vendors to win online searchers have been directed toward the Internet. For example, STN EASY allows direct access to many STN databases on a pay-as-you-search basis with a relatively straightforward graphical user interface. Most recently, STN has developed for professional searchers STN on the Web. The U.S. National Library of Medicine's PubMed gives free and easy access to a version of the NLM's main database, Medline. NLM's PubChem is a free database that provides some of the same coverage and searching capabilities as the CAS Registry File.
Another CAS product is SciFinder, which makes the searching of some of the CAS databases (CAplus, Registry, CHEMLIST, CHEMCATS, and CASREACT) relatively effortless.
The Explore References option is used to conduct a subject search. Note that the search is entered as a natural language statement instead of using Boolean logic. You also have the option to limit the search by the type of document you want to see in the answer set (see below for the document types in the CA database). You can also search the database by author using the Explore References screen. The result will be a bibliography of articles and other documents, complete with abstracts.
SciFinder's Explore Substances option lets you draw a 2D representation of the compound as a search key. Other options are to enter a Molecular Formula or an identifier, such as the CAS Registry Number for a compound. The searches can be limited to particular types of compounds (e.g., polymers), and further defined by whether the substance must be a single component (not part of a mixture), commercially available, and have at least one reference in the bibliographic database. (Note that there are some substances in the CA files that have no documents linked to them.) The result will be a file of chemical substances, complete with much information about them (variant names, molecular formula, some property data, etc.)
Explore Reactions lets you search the CASREACT database to find articles in which the chemical substance of interest was a participant in the reaction (starting material, product, etc.). A number of options are available to further refine the search, and those will be discussed in a later chapter.
SciFinder removes the need to know the STN search commands. It even makes it unnecessary to know the proper use of Boolean operators in a subject (Research Topic) search or to know how to use truncation symbols. It employs sophisticated built-in intelligence to deduce the relationships the searcher desires among the various words and phrases. Nevertheless, many online search systems, including Internet search engines, require at least a passing knowledge of these techniques in order to use them effectively.
Formats: Document Types
In the printed "Chemical Abstracts," a B or P immediately before an abstract number designates a book or a patent respectively. In the online CA file, these and other documents are found in the Document Type (DT) field of the CA File:
Thus, combining an answer set number with one or more codes or words can limit the answer set to a particular document type or eliminate an unwanted type, e.g.,
=> S L4 NOT P/DT
=> S L4 AND J/DT
Eight new document types (biography, book review, editorial, errata, letter, miscellaneous, news announcement, and product review) were introduced to the CAplus file in 1994.
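The effect of the two STN document-type commands shown above can be pictured as filtering an answer set on one field. The sketch below uses an invented answer set and assumes the codes J, P, and B stand for journal, patent, and book; it imitates the logic only and is not STN syntax.

```python
# Hypothetical answer set L4: record identifier -> document type code.
# Assumed code meanings: J = journal, P = patent, B = book.
answer_set_l4 = {"rec1": "J", "rec2": "P", "rec3": "B", "rec4": "J", "rec5": "P"}


def with_type(answer_set, code):
    """Keep only records of the given document type (like AND code/DT)."""
    return {record for record, doc_type in answer_set.items() if doc_type == code}


def without_type(answer_set, code):
    """Drop records of the given document type (like NOT code/DT)."""
    return {record for record, doc_type in answer_set.items() if doc_type != code}


not_patents = without_type(answer_set_l4, "P")  # analogous to: S L4 NOT P/DT
journals_only = with_type(answer_set_l4, "J")   # analogous to: S L4 AND J/DT

print(sorted(not_patents))    # ['rec1', 'rec3', 'rec4']
print(sorted(journals_only))  # ['rec1', 'rec4']
```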
SciFinder allows you to refine the answer set by document type and by many other options, including research topic, author, company name, publication year, language, and database.
On the Web of Science (Science Citation Index), General Searches can be limited to:
- Abstract of published item
- Biographical item
- Correction, Addition
- Meeting Abstract
- News item
and several other choices.
Further Reading and Resources
The commercial databases offer many advantages over free Web searches, including much more in-depth indexing of the material and more sophisticated search techniques. These include the use of Boolean operators and truncation to home in on a search topic, as well as the inclusion of the document type and other search options to limit the search results.
CIIM Link for further study (Major Tools or Databases)
Chemical Abstracts Databases vs. the Printed Chemical Abstracts
David Flaxbart, chemistry librarian at the University of Texas, pointed out some of the reasons to retain the printed Chemical Abstracts volumes in library collections (CHMINF-L, 8 June 2010). He notes: SciFinder is not identical to Chemical Abstracts. All (or nearly all) the content of the latter is included in the CAPLUS file and robustly substance-indexed via the Registry file. But it is an oversimplification to say that you can do everything in SciFinder that you could do in the print.
- The Collective subject/substance/formula indexes allow browsing of chemical names, formulas, and subject headings in a way that isn't possible in SciFinder. SciFinder is great for snapshots, but it doesn't provide any view of the hierarchical structure of the CA database, or its indexing and nomenclature practices; nor does it allow browsing for derivatives, salts, and other variants of a parent structure. In other words, you can't browse online for nearby entries like you can in the print, which removes a serendipity factor. For some purposes, this is an important distinction. (Browsing index entries is possible in STN.)
- When you can't figure out how CAS has defined the structure or formula of certain types of compounds, especially inorganic (salts, hydrates, ions, decimals, etc.), coordination compounds, and multicomponent substances, SciFinder can be frustrating. Using the Index Guide and Chemical Substance Index can actually save some time, and when you find the Registry number then you can go back to SciFinder, locate the substance record and complete the literature search. (Of course, this method only works for compounds registered before your last Collective Index.)
- Pre-1967 CA abstract numbers are not searchable or displayed in SciFinder, and can only be looked up or verified in the print or on STN. These numbers are occasionally cited in the older literature, especially as stand-ins for obscure and foreign documents.
- Some printed abstracts may contain structure graphics that aren't duplicated online.
- Some older CA records were not properly converted and are missing from SciFinder or merged with adjacent records. CAS will fix these when notified, and it seems to be a rare occurrence.
- SciFinder is not available to unaffiliated users, per license restrictions. CA print is a potential fallback. (Unless it's in storage. Indexes stored remotely will almost certainly never be used again, and can't be used for their intended purpose, so this is essentially no different than discarding them.) Naturally, CA in print is only for historical searching. Even if you were to lose access to SciFinder, print CA could not fill the gap or be an acceptable substitute for modern users.
- Even if you decide to discard the bulk of CA, consider retaining the most valuable parts, such as the Index Guides (very useful for finding index terms, synonyms, controlled vocabulary, Registry Numbers, etc.); patent indexes; formula and name indexes; and the Ring Systems Handbook. Also, general wisdom suggests that the older (and smaller) pre-1967 portion of CA is more valuable archivally than the post-1967 volumes, which are somewhat more expendable.
See also from Chemical Abstracts Service: Transitioning from CA Print to CAS' Electronic Products for more information.