SMILES™, SMARTS™
Codenames: smiles, smarts=smiles:s
Marvin imports and exports SMILES strings with the following specification
rules:
- Atoms:
- Atoms are represented by their atomic symbols.
- Isotopic specifications are indicated by a mass number preceding the atomic symbol.
- Any atom except hydrogen is represented by '*'.
- Bonds:
- Single, double, triple, and aromatic bonds are represented
by the symbols -, =, #, and :, respectively.
- Single and aromatic bonds may be omitted.
- Branches are specified by enclosing them in parentheses.
The implicit connection to a parenthesized expression (a branch)
is to the left.
- Cyclic structures are represented by breaking one single
(or aromatic) bond in each ring; the missing bond is denoted
by matching connection placeholder numbers.
- Disconnected structures:
- Disconnected compounds are written as individual structures separated
by a period.
- Isomeric specification
- Configuration around double bonds is specified by "directional bonds":
/ and \.
- Configuration around tetrahedral centers may be indicated by a
simplified chiral specification (parity) @ or @@.
- Unique SMILES.
The name "unique" can sometimes be misleading when dealing with
compounds with stereo centres.
Daylight's SMILES specification (3.1.
SMILES Specification Rules) defines generic, unique, isomeric and absolute SMILES as:
- generic SMILES: representing a molecule (there can be many different
representations)
- unique SMILES: generated from generic SMILES by a certain algorithm [1]
- isomeric SMILES: string with information about isotopism, configuration
around double bonds and chirality
- absolute SMILES: unique SMILES with isomeric information (in Marvin, the isomeric
information is also treated as an atom invariant during graph canonicalization)
The name canonical SMILES is used for absolute or unique SMILES,
depending on whether the string contains isomeric information or not (both
strings are "canonicalized", meaning that the atom/bond order is unambiguous).
Marvin always generates canonical SMILES with isomeric information if it
can be determined from the input file. The molecule graph is always
canonicalized using the algorithm in article [1],
but this is not guaranteed to give absolute SMILES for all isomeric
structures. Option u currently uses an
approximation to make the SMILES string as absolute (unique for isomeric
structures) as possible. For exact structure searching, the
MolSearch and JChemSearch classes of JChem Base or the jc_equals
SQL operator of the JChem Cartridge are recommended.
The initial ranks of atoms for the canonicalization are calculated
using the following atom invariants:
- number of connections
- sum of non-H bond orders
(single=1, double=2, triple=3, aromatic=1.5, any=0)
- atomic number (list=110, any atom=112)
- sign of charge:
0 for nonnegative, 1 for negative charge
- formal charge
- number of attached hydrogens
- isotope mass number
See ref. [1] for details.
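The invariant list above can be illustrated with a minimal Python sketch. It is not Marvin's implementation: the `Atom` class, the `invariant` and `initial_ranks` functions, the choice to count only non-H connections, and the assumption that the invariants are compared in the order they are listed are all hypothetical and serve illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Hypothetical container for the per-atom values listed above."""
    connections: int       # number of connections (assumed: non-H neighbours)
    bond_order_sum: float  # single=1, double=2, triple=3, aromatic=1.5, any=0
    atomic_number: int     # list atom = 110, any atom = 112
    formal_charge: int
    attached_h: int        # number of attached hydrogens
    isotope: int = 0       # isotope mass number (assumed 0 when unspecified)


def invariant(atom: Atom) -> tuple:
    """Combine the listed invariants into one sortable tuple.

    Assumption: the invariants are compared in the order of the list.
    """
    charge_sign = 1 if atom.formal_charge < 0 else 0
    return (
        atom.connections,
        atom.bond_order_sum,
        atom.atomic_number,
        charge_sign,
        atom.formal_charge,
        atom.attached_h,
        atom.isotope,
    )


def initial_ranks(atoms: list) -> list:
    """Give every atom a rank; atoms with equal invariants share a rank."""
    keys = [invariant(a) for a in atoms]
    rank_of = {key: rank for rank, key in enumerate(sorted(set(keys)), start=1)}
    return [rank_of[key] for key in keys]
```

Atoms that receive the same initial rank are the ones the canonicalization algorithm still has to distinguish in later refinement steps (see ref. [1]).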
With option u, chirality can be included in the graph invariants.
Use this option with care, since for molecules with numerous chirality centres
the canonicalization can be very CPU demanding [2].
The SMILES canonicalization algorithm is not standardized:
it depends on the software package,
so canonical SMILES strings are best compared only within one software package.
- Stereochemistry
- Parity is a general type of chirality specification
based on the local chirality.
Only the most common class, tetrahedral, is implemented.
An atom can have parity (odd, even)
if the following conditions are met:
- number of ligands + implicit H > 3
- implicit + explicit H < 2
- number of ligands is < 5
- if the atom is not in ring then the graph
invariants of the ligands must differ
Parity value 0 is used for atoms which cannot have parity.
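The parity conditions above can be written down as a small predicate. This is only a sketch that transcribes the four listed conditions; the function name `can_have_parity` and its parameters are hypothetical, explicit H atoms are assumed to be counted among the ligands, and "must differ" is read here as "pairwise different".

```python
def can_have_parity(ligands: int, implicit_h: int, explicit_h: int,
                    in_ring: bool, ligand_invariants: list) -> bool:
    """Return True if an atom satisfies the listed parity conditions.

    ligands            number of ligands (assumed to include explicit H atoms)
    implicit_h         number of implicit hydrogens
    explicit_h         number of explicit hydrogens
    in_ring            whether the atom is a ring member
    ligand_invariants  graph invariants of the ligands (any hashable values)
    """
    if not ligands + implicit_h > 3:
        return False
    if not implicit_h + explicit_h < 2:
        return False
    if not ligands < 5:
        return False
    # Assumption: "must differ" means that no two ligands share an invariant.
    if not in_ring and len(set(ligand_invariants)) != len(ligand_invariants):
        return False
    return True
```

In terms of the text, an atom for which this predicate is false would get parity value 0.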
- Cis-trans isomerism
Double bonds in small rings (size < 8) are cis by default,
so the cis configuration is not written explicitly.
Use import option c
to override this behaviour.
- Reactions
- syntax:
reactant(s)>agent(s)>product(s), where
reactants = reactant1.reactant2. ...
agents = agent1.agent2. ...
products = product1.product2. ...
Agents are molecular structures that do not take part in the chemical reaction
but are added to the reaction equation for informational purposes only.
All of the above sections are optional. For example:
- a reaction with no agents: reactant(s)>>product(s)
- a reaction with no agents and no products (mainly used in reaction search):
reactant(s)>>
- a reaction with no agents and no reactants (mainly used in reaction search):
>>product(s)
- atom maps
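The reaction syntax above can be taken apart with plain string operations. The following Python sketch is an illustration only, not Marvin's parser: `Reaction` and `split_reaction` are hypothetical names, and the code assumes plain reaction SMILES in which '>' and '.' occur only as section and component separators.

```python
from typing import List, NamedTuple


class Reaction(NamedTuple):
    reactants: List[str]
    agents: List[str]
    products: List[str]


def split_reaction(smiles: str) -> Reaction:
    """Split reactant(s)>agent(s)>product(s) into its three sections."""
    sections = smiles.split(">")
    if len(sections) != 3:
        raise ValueError("expected exactly two '>' separators: %r" % smiles)

    def components(section: str) -> List[str]:
        # An empty section (as in "reactant(s)>>") yields an empty list.
        return [part.strip() for part in section.split(".") if part.strip()]

    return Reaction(*(components(section) for section in sections))
```

Because every section is optional, `split_reaction("CCO>>")` yields one reactant and empty agent and product lists.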
- Unsupported SMILES features:
- A branch specified where there is no atom to its left.
- General chiral specification: allene-like, square-planar,
trigonal-bipyramidal, octahedral.
Marvin imports and exports SMARTS strings with the following features:
- SMARTS features interpreted during import/export as full-functional
(editable) query features:
- atom lists like [C,N,P] and 'NOT' lists like [!#6!#7!#15]
- any bond: ~
- ring bond: C@C
- hydrogen count: H0, H1, H2, H3, H4
- valence: v0, v1, ..., v8
- connectivity: X0, X1, X2, X3, X4
- in ring: R
ring count: R0, R1, ..., R6
- size of smallest ring: r3, r4, r5, r..
- number of ring bonds: x2, x3, x4
at least one ring bond: x
- aromatic and aliphatic atoms: a, A
- aliphatic, aromatic, aliphatic_or_aromatic atom query properties
- single_or_double, single_or_aromatic, double_or_aromatic bonds
(used in Marvin)
- directional or unspecified bonds: C\C=C/?C
- chiral or unspecified atoms: C[C@?H](Cl)Br
- component level grouping: (C).(O) (C.O)
- A subset of SMARTS features are imported as SMARTS atoms/bonds.
These atoms/bonds have limited editing support in the Marvin GUI,
but can be exported and evaluated
(e.g. JChem structure searching handles them correctly):
- implicit hydrogen count: h2, h3, h..
- degree: D2, D3, D..
- more complex logical expressions in atom or bond expressions (operators: & , ; !)
(Simpler cases, like atom lists, not lists, "and"-expressions are handled by the above features.)
- recursive SMARTS: [$(CCC)]
- A subset of features are exported as SMARTS atoms/bonds.
- MDL Substitution Count query atom property
s<n> is converted to degree Dn.
In case of s* the non-H neighbours are counted and exported as
degree D<number>.
- MDL Unsaturated Atom query atom property u is converted to
recursive SMARTS: $([*,#1]=,#,:[*,#1]) is appended after
the SMARTS atom.
In case of SMARTS:
- Implicit H atoms are not written inside brackets. E.g.: [C:1]
- Query H atoms are written inside brackets without using the low-precedence "and" operator ';'. E.g.: [CH3]
Implicit bond types:
The default bond types for import and export strongly depend on the atoms connected by the bond.
- Aromatic bonds are not written explicitly if neither atom is
aliphatic and both atoms are in a ring.
E.g.: c1ccccc1 But: c:c, c:[c;a], [#6]:c
- Single bonds are not written explicitly if at least one atom
is not aromatic.
Eg: CC, C[c;a], Cc, C[C;A], [#6]C But: [#6]-[c;a], c1ccc(cc1)-c2ccccc2
- Single_or_aromatic bonds are not written explicitly if both atoms of
the bond are aromatic and any of them is not in the same ring.
Eg: [#6]cc, [#6][c;a], [#6][#6]
Import options:
f: {fFIELD1,fFIELD2,...} |
Import data fields from a multi-column file.
The fields must be separated by tab characters.
The first column contains the SMILES/SMARTS strings,
the second contains the data field called FIELD1,
the third contains FIELD2, etc.
Example:
molconvert sdf "foo.smi{fname,fID}"
reads the SMILES string, the name and the ID from each line of the foo.smi
file and converts them to SDF format. |
d |
Import with Daylight compatibility for query H.
In Daylight SMARTS, H is only interpreted as a hydrogen atom when
the atom expression has the syntax
[<mass>H<charge><map>]
(mass, charge and map are optional).
Otherwise it is interpreted as a query H count.
Example: without the d option, [!H!#6] is imported as
an atom which is not H and not C.
With the d option, it is imported as an atom which
does not have exactly one H attached and which is not C.
Use "H1", "#1" or "#1A" instead of "H" to avoid
the ambiguous meaning of H: "H1" always means a query H count,
"#1" always means an H atom, and "#1A" means an aliphatic H atom. |
c |
Do not fix double bond stereo information in small rings.
Double bonds in small rings (ring size < 8) are imported
automatically with cis stereo information. If the c option is set,
the double bond stereo information is not changed to cis
during import. |
Z |
Import compressed SMILES. The compressed format must be specified
explicitly, as the importer does not recognize it automatically. |
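The Daylight rule used by the d import option, that H denotes a hydrogen atom only in an expression of the form [<mass>H<charge><map>], can be sketched as a regular expression. This is an illustration, not Marvin's parser: the function name `h_is_hydrogen_atom` is hypothetical, and the accepted charge spellings (+, ++, -, --, +2, -1) are an assumption about the <charge> part.

```python
import re

# [<mass>H<charge><map>], where mass, charge and map are all optional.
_H_ATOM = re.compile(
    r"^\["
    r"(\d+)?"               # optional isotope mass
    r"H"
    r"(\++|-+|[+-]\d+)?"    # optional charge (assumed spellings)
    r"(:\d+)?"              # optional atom map
    r"\]$"
)


def h_is_hydrogen_atom(atom_expr: str) -> bool:
    """True if H in this bracket expression denotes a hydrogen atom under
    the Daylight rule; otherwise H is read as a query H count."""
    return _H_ATOM.match(atom_expr) is not None
```

For [!H!#6] the function returns False, which corresponds to the d-option reading of H as a query H count.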
Export options can be specified in the format string. The format descriptor
and the options are separated by a colon.
... |
Basic options for aromatization and
H atom adding/removal. |
0 |
Do not include parity and double bond
stereo (cis/trans) information.
Examples: "smiles:0" (no stereo),
"smiles:a0" (aromatic, no stereo) |
q |
Check atom equivalences using graph invariants at double bonds.
The graph invariant describes the symmetry
at each atom in the molecule. If this option is set and one end of the double bond
is symmetric (its ligands' graph invariants are equal),
then the / and \ signs are not used in the
description of the double bond.
Example: molconvert smiles:q -s "C/C=C(/C)C" results in CC=C(C)C |
s |
Write query SMARTS. (See query SMARTS for details.) |
u |
Write unique SMILES (also considering chirality information [2]).
Note: use this option if you want unique SMILES export.
|
h |
Convert explicit H atoms to query hydrogen count. |
Tf1:f2:... |
Export the SDF fields f1, f2, ...
The fields are separated by tab characters.
If '-' is given before the T option, as in '-Tf1:f2:...', then no header
line is written.
|
n |
Export molecule name (the first line of an MDL molfile). |
Z |
Use compressed format, and compress the SMILES string.
Note that the compressed format is not recognized by the importer,
so it must be specified explicitly on import. |
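Since the format descriptor and the options are separated by a colon, and the T option itself uses colons between field names, a format string is best split at the first colon only. The Python sketch below illustrates this; `split_format` is a hypothetical helper, not part of Marvin.

```python
def split_format(format_string: str) -> tuple:
    """Split e.g. "smiles:a0" into the descriptor and the option string.

    Only the first colon separates the descriptor from the options;
    later colons belong to the options (for example in Tf1:f2).
    """
    descriptor, _, options = format_string.partition(":")
    return descriptor, options
```

With this reading, "smiles:a0" has the descriptor "smiles" and the options "a0", and a format string without a colon has an empty option string.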
™: SMILES, SMARTS, and SMIRKS are trademarks of Daylight Chemical Information Systems.