Chemical Information Sources/Structure Searches

Introduction

STRUCTURE SEARCHING uses a graphic depiction of the chemical structure as input for a search. Such searches are generally run against structural data in online chemical substance files, such as STN's Registry File, Reaxys, or the freely available ChemSpider. Depending on the type of structure search allowed by the system, the answer set will contain either the complete molecule alone or any compound containing the embedded structure of the molecule. Unlimited substitution of the input molecule may be allowed at free sites on the molecule (a FULL SUBSTRUCTURE SEARCH) or substitution may be limited to certain sites (a CLOSED SUBSTRUCTURE SEARCH). On the STN system, once an answer set is formed in the Registry File, it can be crossed over to the CAplus or other literature database files to conduct further subject searches of the compounds thus isolated in a structure search. In these cases, it is actually the CAS Registry Numbers (a unique identification number for each chemical) of the compounds that are being searched in the crossover files. The ability to access additional information about compounds such as toxicity, spectra, and literature references is a common feature of nearly all databases providing structure searching.

Note that it is now possible to conduct a search that takes into account the stereochemistry of the chiral centers and double bonds. Stereo searching can be performed in the Registry File and the REAXYSFILE on STN or on the Reaxys system (which includes both the older Beilstein and Gmelin content and significant newer material). A SIMILARITY SEARCH finds target molecules that are like the query structure in some respects. The point of similarity might be some biological property, such as drug absorption or metabolism-related toxicity. Often, it is the similarity in functional groups that is measured. Finally, MARKUSH STRUCTURE SEARCHING, an important technique in patent searches that allows for considerable variability in the structures retrieved, is another option in some files.
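
The idea of measuring similarity in functional groups can be illustrated with a small sketch. It assumes each compound is described by a hand-assigned set of functional-group labels and scores pairs with the Tanimoto coefficient (shared features divided by all distinct features). Both the scoring method and the feature sets are assumptions chosen for illustration; the commercial systems do not publish their scoring in this form.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Shared features divided by all distinct features (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical feature sets, assigned by hand for illustration only.
LIBRARY = {
    "aspirin": frozenset({"aromatic ring", "carboxylic acid", "ester"}),
    "salicylic acid": frozenset({"aromatic ring", "carboxylic acid", "phenol"}),
    "paracetamol": frozenset({"aromatic ring", "amide", "phenol"}),
    "ethanol": frozenset({"alcohol"}),
}


def similarity_search(query, library, threshold=0.4):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, feats)) for name, feats in library.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


results = similarity_search(LIBRARY["salicylic acid"], LIBRARY)
for name, score in results:
    print(f"{name}: {score:.2f}")
```

Raising the threshold narrows the answer set toward near-identical compounds, while lowering it admits more distant analogs.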

Why Use Structure Searching?

There are many reasons to do a substructure search, among them:

Particular structural features can be focused on.
Unwanted features can be excluded.
The complexities of nomenclature can be avoided.
The novelty of a compound can be assessed.
The structure(s) can be correlated with chemical or physical properties or biological activities.
The structure(s) can be linked to chemical reaction databases to see model compounds or to look for specific reaction conditions.
Competitive products or market leads can be found.

In combination with other types of searches, structure searching is a very powerful supplement.

Structure Searches in the STN Registry and Other Files

As of December 31, 2013, over 78 million registered chemical substances and over 65 million biosequences appear in the Chemical Abstracts Service Registry File. Most of those have been registered since 1965, but, of course, not all of the compounds in the Registry File were discovered since that date. In 2002, Chemical Abstracts Service embarked on a project to retrospectively index all documents in the CA database. Thus, many compounds that have had no new information published about them since the establishment of the CA or CAplus Files (i.e., since 1967) have now been added to the Registry File.

Most of the millions of compounds in the Registry File have their Registry Numbers linked to the databases on the STN system. The LC (File Locator) field of a Registry File record shows which STN databases contain the Registry Number. In addition to the Registry File, structure searches can be conducted in such databases on STN as REAXYSFILE, CASREACT, and others. A similar file locator function is included in other chemical dictionary files, such as NLM's ChemIDplus.

There are several types of structure searches possible in the Registry File, as well as different options for views of the molecules and different methods of inputting the structure. SciFinder masks to a certain extent the relationship between the Registry File and the CAplus File, CASREACT, and other databases intertwined with its software.

Within the SciFinder search stage itself, considerable information can be gleaned about the answer set to be retrieved. In the Preview option, the sample answer set can be analyzed by atom attachments, or, if the drawn structure contains them, by system-defined or user-defined variable groups. Once the structure is built and the answer set is retrieved, such information can also be found for the full answer set. At this point, the search can proceed as it would if the compounds had been identified by name or molecular formula searches, allowing you to "Get References" from the CAplus part of the SciFinder system or to link to any of the icons in the retrieved Registry File records.

The structure search can be further refined with additional structural features or by limiting it to commercially available substances. Once the search is refined, the references that have the Registry Numbers of the compounds in their indexing can be retrieved.

The following types of structure searches are possible on STN:

EXACT SEARCH--retrieves the substance as drawn plus any stereoisomers, ionic substances, or homopolymers, as well as isotopically labeled compounds with that structure
FAMILY SEARCH--retrieves the same set of compounds as the EXACT search, but will also retrieve any multi-component compounds represented in the Registry File (salts, mixtures, or copolymers)
CLOSED SUBSTRUCTURE SEARCH--allows variable nodes at certain defined positions only
FULL SUBSTRUCTURE SEARCH--retrieves any record in the file that contains the input structure, with substitution allowed at all free sites
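
The difference between a closed and a full substructure search in the list above can be sketched with a toy matcher. This is a minimal illustration, not STN's algorithm: molecules are assumed to be lists of heavy atoms plus bonds between atom indices, hydrogens and bond orders are ignored, and the `locked` parameter is a hypothetical name for the query atoms that may not carry further substituents.

```python
def adjacency(molecule):
    """Map each atom index to the set of its bonded neighbours."""
    atoms, bonds = molecule
    adj = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def substructure_match(query, target, locked=frozenset()):
    """True if the query is embedded in the target.

    Query atoms listed in `locked` must have exactly as many heavy-atom
    neighbours in the target as in the query (no extra substitution).
    """
    q_atoms, t_atoms = query[0], target[0]
    q_adj, t_adj = adjacency(query), adjacency(target)

    def extend(mapping):
        q = len(mapping)  # next query atom to place
        if q == len(q_atoms):
            return True
        for t in range(len(t_atoms)):
            if t in mapping.values() or q_atoms[q] != t_atoms[t]:
                continue
            if q in locked and len(t_adj[t]) != len(q_adj[q]):
                continue
            # Every already-placed query neighbour must be bonded in the target.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


# Query: the C-C(O)O skeleton of acetic acid; atom 0 is the methyl carbon.
QUERY = (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
ACETIC_ACID = (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
PROPANOIC_ACID = (["C", "C", "C", "O", "O"], [(0, 1), (1, 2), (2, 3), (2, 4)])
METHYL_ACETATE = (["C", "C", "O", "O", "C"], [(0, 1), (1, 2), (1, 3), (3, 4)])

print(substructure_match(QUERY, PROPANOIC_ACID))              # full substructure
print(substructure_match(QUERY, PROPANOIC_ACID, locked={0}))  # methyl carbon closed
```

With no locked atoms, all three targets are retrieved. Locking the methyl carbon excludes propanoic acid, and locking both oxygens as well excludes methyl acetate.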

For more details on structure searching of STN databases, see the STN training web site.

With SciFinder, EXACT, SUBSTRUCTURE, SIMILARITY, and MARKUSH searches are possible. Again, for more details, visit the SciFinder training web site.

There are actually several stages of a Registry File structure search. The first stage involves a screening of the huge file for compounds that have the requisite substituents and other features, without regard to their position on the molecule. The much more computer-intensive iteration stage involves an atom-by-atom, bond-by-bond look at the candidate molecules isolated in the screen search. Since this stage requires so much of STN's computer resources, there are limits on the number of compounds that can be looked at during the iteration stage. A sample search must be run on approximately 5% of the file, after which a prediction is given as to whether the full file search will run to completion. Assuming the prediction is favorable, the candidate molecules found in the screening of the full file can be compared to the structure. Otherwise, the structure must be modified so that the search can run to completion. With SciFinder, there is some built-in intelligence that offers to "autofix" a molecule that might give the system trouble.
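
The screen-then-iterate flow described above can be sketched on a toy file. This is only an illustration under stated assumptions: the screen here compares element counts, the iteration stage is a simple backtracking match that ignores hydrogens and bond orders, and `max_candidates` is a hypothetical stand-in for the system's resource limit. The real Registry screens and limits are not reproduced here.

```python
from collections import Counter

# Toy "file": name -> (heavy atoms, bonds between atom indices).
FILE = {
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "dimethyl ether": (["C", "O", "C"], [(0, 1), (1, 2)]),
    "acetic acid": (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)]),
    "ethane": (["C", "C"], [(0, 1)]),
    "methylamine": (["C", "N"], [(0, 1)]),
}


def passes_screen(query, target):
    """Screen stage: compare element counts only, ignoring positions."""
    need, have = Counter(query[0]), Counter(target[0])
    return all(have[element] >= n for element, n in need.items())


def adjacency(molecule):
    atoms, bonds = molecule
    adj = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def matches(query, target):
    """Iteration stage: atom-by-atom, bond-by-bond backtracking match."""
    q_adj, t_adj = adjacency(query), adjacency(target)

    def extend(mapping):
        q = len(mapping)  # next query atom to place
        if q == len(query[0]):
            return True
        for t in range(len(target[0])):
            if t in mapping.values() or query[0][q] != target[0][t]:
                continue
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})


def structure_search(query, file, max_candidates=1000):
    """Screen the whole file, then iterate over the surviving candidates."""
    candidates = [name for name, mol in file.items() if passes_screen(query, mol)]
    if len(candidates) > max_candidates:
        raise RuntimeError("too many candidates; refine the query structure")
    hits = [name for name in candidates if matches(query, file[name])]
    return candidates, hits


QUERY = (["C", "C", "O"], [(0, 1), (1, 2)])  # C-C-O fragment
candidates, hits = structure_search(QUERY, FILE)
print("screened in:", candidates)
print("confirmed:  ", hits)
```

Dimethyl ether passes the screen because it has the right elements, but it is rejected in the iteration stage because its atoms are connected C-O-C rather than C-C-O. This is why the second stage is needed and why it is the expensive one.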

Structure Searching on Reaxys

It is also possible to do very precise structure searching on Elsevier's Reaxys system, which contains the vast majority of the legacy information from the Beilstein Handbook of Organic Compounds and the Gmelin Handbook of Inorganic and Organometallic Compounds, a patent database segment, and on-going indexing of substances, reactions, and property data in the current chemical literature. Reaxys provides extensive coverage of chemical research from the 18th century to the present. As of November 2013, the Reaxys database contained more than 22 million compounds, 35 million reactions, and 45 million literature references.

Figure: Reaxys structure editing screen with isatin molecule (above) and two of the search results (below)

Reaxys has very similar structure drawing and search options to SciFinder. Exact and substructure searches can be executed. Variable groups and atoms can be included in the structure. Specific sites can be locked, preventing any additional atoms from being attached at those sites. As with most vendors, Elsevier provides a number of excellent training videos and guides, including the Reaxys guide Creating Structure Queries for Substances and Reactions.

As a general rule, vendor training material should be consulted for the most up-to-date information on all the resources described anywhere in this Chemical Information Sources wikibook. It simply is not feasible to go into great detail on the mechanics of searching each of the sources described, and even if this were done within the wikibook, it would quickly go out of date because these search systems and databases change and gain new features so frequently.

Beilstein and Gmelin

Beilstein and Gmelin are two classic print compendiums of chemical information. Most of the information in the print volumes was converted to electronic form and for a period of time existed as separate databases. All of the digital information from these two sources, as well as from other databases, has now been merged into Reaxys, a unified database system created and maintained by Elsevier, which is also available on STN International as the REAXYSFILE. Large academic research libraries often have significant print holdings that still have value for the patient, diligent searcher. Chemistry librarians at such institutions maintain helpful guides to the print runs, for example, the University at Buffalo (Beilstein, Gmelin) and the University of Texas at Austin (Beilstein, Gmelin).

Beilstein is for organic compounds, whereas Gmelin is for inorganic and organometallic compounds. Beilstein covers compounds containing carbon along with the following elements:

          H
          Li, Be      B,  C,  N,  O,  F
          Na, Mg          Si, P,  S,  Cl
          K,  Ca              As, Se, Br
          Rb, Sr                  Te, I
          Cs, Ba

Compounds can be single components or salts and mixtures (if they have at least one organic component). Peptides are covered if they contain twelve or fewer amino acids. Polymers or polycondensation products are not treated. The following are not typically treated as Beilstein compounds, but would be found in Gmelin:

CO, CS, CO2, CS2, COS, C3O2, C3S2
Carbonic acid and its thio analogs along with their salts with inorganic cations
HCN, HOCN, HSCN and the corresponding iso-acids and all the metal salts and complexes of these acids
Dicyanogen
Phosgene
Metal salts of formic acid, acetic acid, and oxalic acid
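
As a rough illustration, the element rule and the simplest exceptions above can be turned into a first-pass scope check on a molecular formula. This sketch is an assumption-laden simplification, not the handbook's actual assignment logic. It compares compositions only, so isomers sharing a composition are treated alike. The formulas COCl2 for phosgene and C2N2 for dicyanogen are supplied here, and the rules for salts, peptides, and polymers are not modeled.

```python
import re
from collections import Counter

BEILSTEIN_ELEMENTS = {
    "H", "Li", "Be", "B", "C", "N", "O", "F", "Na", "Mg", "Si", "P", "S",
    "Cl", "K", "Ca", "As", "Se", "Br", "Rb", "Sr", "Te", "I", "Cs", "Ba",
}


def parse_formula(formula):
    """Turn a simple formula such as 'C8H5NO2' into element counts."""
    counts = Counter()
    position = 0
    for match in re.finditer(r"([A-Z][a-z]?)(\d*)", formula):
        if match.start() != position:
            raise ValueError(f"cannot parse formula: {formula!r}")
        counts[match.group(1)] += int(match.group(2) or 1)
        position = match.end()
    if position != len(formula) or not counts:
        raise ValueError(f"cannot parse formula: {formula!r}")
    return counts


# Small molecules listed above as Gmelin rather than Beilstein compounds.
# The salt-based exceptions are NOT modeled in this sketch.
GMELIN_EXCEPTIONS = [
    parse_formula(f)
    for f in ("CO", "CS", "CO2", "CS2", "COS", "C3O2", "C3S2",
              "HCN", "HOCN", "HSCN", "C2N2", "COCl2")
]


def in_beilstein_scope(formula):
    """First-pass check: carbon present, allowed elements only, not an exception."""
    counts = parse_formula(formula)
    if "C" not in counts:
        return False
    if not set(counts) <= BEILSTEIN_ELEMENTS:
        return False
    return counts not in GMELIN_EXCEPTIONS


for formula in ("C8H5NO2", "CO2", "C10H10Fe", "NaCl"):
    print(formula, in_beilstein_scope(formula))
```

Isatin (C8H5NO2) passes, while carbon dioxide, ferrocene (C10H10Fe), and sodium chloride do not. A formula such as sodium acetate would wrongly pass, because the metal-salt exceptions need structural information that a formula alone does not carry.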

Gmelin covers compounds not covered in Beilstein, i.e., inorganic and organometallic chemistry as well as related fields such as mineralogy and metallurgy. Compounds are indexed with terms such as coordination compounds, alloys, ceramics, and inorganic polymers.

Beilstein Lawson Numbers

Compounds in the Beilstein database are also indexed by the Lawson Number, which represents certain structural fragments and can be used for structural similarity searches. In general, the smaller the Lawson Number, the more common the fragment. Every substance in Beilstein has at least one Lawson Number assigned to it. Dividing the Lawson Number by 8 gives roughly the Beilstein system number, which identifies the printed Beilstein volume that contains the compound. The compounds are divided into 3 major groups in the printed Beilstein Handbook:

1. Acyclic Compounds, Volumes 1-4; System Numbers 1-449
2. Isocyclic Compounds, Volumes 5-16; System Numbers 450-2358
3. Heterocyclic Compounds, Volumes 17-27; System Numbers 2359-4720.
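
The arithmetic above can be sketched as a small lookup. The division by 8 is only approximate, so the result should be read as a starting point for browsing the printed volumes, and the sample Lawson Numbers below are hypothetical values chosen for illustration.

```python
# Major divisions of the printed Beilstein Handbook, by system number.
DIVISIONS = [
    (1, 449, "Acyclic Compounds", "Volumes 1-4"),
    (450, 2358, "Isocyclic Compounds", "Volumes 5-16"),
    (2359, 4720, "Heterocyclic Compounds", "Volumes 17-27"),
]


def approximate_system_number(lawson_number: int) -> int:
    """Rough Beilstein system number: the Lawson Number divided by 8."""
    return max(1, lawson_number // 8)


def locate(lawson_number: int):
    """Return (approximate system number, division name, volume range)."""
    system = approximate_system_number(lawson_number)
    for low, high, name, volumes in DIVISIONS:
        if low <= system <= high:
            return system, name, volumes
    raise ValueError(f"system number {system} is outside the range 1-4720")


for lawson in (800, 4000, 20000):  # hypothetical Lawson Numbers
    print(lawson, locate(lawson))
```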

Unfortunately, the Beilstein Institute never published the meanings of the 4,720 system numbers used to classify organic compounds. However, the Lawson Number Descriptions can now be found on the web. The Lawson Number is effective when used in combination with other search keys, such as molecular formula, element ranges, etc. It is also useful when combined with NOT in substructure searches.

Chemisches Zentralblatt

Chemisches Zentralblatt is the oldest abstracting journal in the field of chemistry. It covers the chemical literature from 1830 to 1969. In the course of those 140 years, Chemisches Zentralblatt published 900,000 pages, including 2 million abstracts. Chemisches Zentralblatt introduced formula indexes using the Richter system (different from the Hill system) in 1925. In 1956 it changed to the Hill system. The previous title Chemisches Central-Blatt (1856-1906) has only author, subject, and patent number indexes. InfoChem has performed automatic chemical named entity recognition of the entire text of this abstracting journal to produce the Chemisches Zentralblatt Structural Database which is structure-searchable. The database is offered either as a Web application or for in-house loading. It links to digitized versions of the original paper product that were produced by FIZ-Chemie.

Summary

Structure searching considerably expands the ability of a chemist to retrieve information from a database since the search key is the "native language" of the chemist, the chemical structure. Any chemist, regardless of his or her native tongue, understands a chemical structure. Thus, the structure searching systems speak the universal language of chemistry. The development of graphical user interfaces that allow easy drawing of the desired structure on a computer screen was a major advance in chemical searching. There are now several commercial databases, such as Chemical Abstracts and Reaxys (Beilstein/Gmelin), that have this capability, as do public systems such as PubChem and ChemSpider. It may take some time to explore and learn all of the capabilities of the structure searching systems, but the reward in enhanced search retrieval is well worth it.

CIIM Link for further study

SIRCh Link for Structure Searches

Problem Set on this topic