A major problem with many current naming conventions is that the names often combine locus and allele information, are tied to a particular technology, and/or contain bold, italic, superscript, subscript, or other characters or formatting that are incompatible with electronic databases.
For example:
* E(CCA)M(AGC)114 or ACAAAC110 : AFLP locus, name gives discriminatory bases and band size
* OH03_700: RAPD locus with band size given
* U379_500T: RAPD locus with band size; reason for T not known
This is obviously not ideal, as the same locus in a different variety might well have a different band size.
There are at least two kinds of names needed:
• physical clones
• genetic loci and their alleles
There are of course subclasses of these (e.g. sequences of clones), but these cover the major groups.
The Soybean Genetics Committee recommends the following guidelines, which will result in names that are compatible with electronic databases. The ones for physical clones are already in use in several labs and seem to be acceptable.
An EST example:
Gm-c1049-2803
• Gm -- Glycine max - use the appropriate genus & species initials; 2 chars only
• c -- cDNA; r -- rerack of a previous library
• 1049 -- library number; start at 1001; numbers coordinated by SoyBase - thus c1049 is the library name
• 2803 -- clone number
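The EST name layout above can be checked mechanically. The following is a minimal sketch, not part of the Committee's guidelines: the function name `parse_est_name` and the exact pattern (for instance, that library numbers have at least four digits and that the species code is one capital followed by one lowercase letter) are assumptions made for illustration.

```python
import re

# Assumed pattern for names such as "Gm-c1049-2803":
# 2-char genus/species code, library type (c or r), library number, clone number.
EST_NAME = re.compile(r"^([A-Z][a-z])-([cr])(\d{4,})-(\d+)$")


def parse_est_name(name):
    """Split an EST clone name into its parts, or return None if it does not fit."""
    match = EST_NAME.match(name)
    if match is None:
        return None
    species, lib_type, lib_number, clone = match.groups()
    if int(lib_number) < 1001:  # library numbers start at 1001
        return None
    return {
        "species": species,               # e.g. "Gm" for Glycine max
        "library": lib_type + lib_number,  # e.g. "c1049", the library name
        "clone": clone,                   # clone number within the library
    }


print(parse_est_name("Gm-c1049-2803"))
```

A name that embeds a band size, such as OH03_700, does not match the pattern and is rejected.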
A BAC example:
Gm_UMb001_090_G08
• Gm -- Glycine max - use the appropriate genus & species initials; 2 chars only
• UM -- University of Minnesota; IS = Iowa State Univ; other 2 char abbreviations to be assigned as needed
• b -- BAC
• 001 -- library number; use leading zeroes to pad to 3 chars - thus UMb001 is the library name
• 090 -- plate number
• G08 -- row and column of clone
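The BAC name components listed above can be assembled and taken apart with a few lines of code. This is a sketch under stated assumptions: the helper names `format_bac_name` and `parse_bac_name` are hypothetical, and the guidelines state zero padding only for the library number, so the 3-character plate number and 2-digit column are inferred from the example name.

```python
import re

# Assumed pattern for names such as "Gm_UMb001_090_G08".
BAC_NAME = re.compile(
    r"^([A-Z][a-z])_([A-Z]{2})([a-z])(\d{3})_(\d{3})_([A-Z])(\d{2})$"
)


def format_bac_name(species, institution, library, plate, row, column, clone_type="b"):
    """Build a clone name, zero-padding library and plate to 3 chars, column to 2."""
    return f"{species}_{institution}{clone_type}{library:03d}_{plate:03d}_{row}{column:02d}"


def parse_bac_name(name):
    """Return the parts of a BAC clone name, or None if it does not fit the pattern."""
    match = BAC_NAME.match(name)
    if match is None:
        return None
    species, institution, clone_type, library, plate, row, column = match.groups()
    return {
        "species": species,
        "library": institution + clone_type + library,  # e.g. "UMb001"
        "plate": plate,
        "well": row + column,                           # row and column of clone
    }


print(format_bac_name("Gm", "UM", 1, 90, "G", 8))
print(parse_bac_name("Gm_UMb001_090_G08"))
```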
Subclones of BACs are named by appending characters that identify the subclone to the BAC name:
• Gm_UMb001_090_G08_s_row_column where s indicates a subclone
• F or R appended as needed
Other single character abbreviations might be a = AFLP, f = RFLP, etc.
• Append a U or R followed by an integer for the walking step to the BAC or subclone name.
• For example Gm_ISb002_091_F11U3 for the 3rd step from the U sequencing primer.
• For consistency T3/T7, etc. should be referred to as U/R with the actual sequencing primer given in the database record.
• A clone's completed sequence would be Gm_ISb002_091_F11C where the C indicates that this is a completed sequence.
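The suffix rules for sequences can be expressed as a small helper. This is only an illustrative sketch: the function name `sequence_name` and its arguments are hypothetical, and it assumes walking steps are counted from 1.

```python
def sequence_name(clone_name, primer=None, step=None):
    """Name a sequence derived from a BAC or subclone.

    With no primer, the name of the completed sequence is returned (suffix C).
    Otherwise primer must be "U" or "R", followed by the walking step number.
    """
    if primer is None:
        return clone_name + "C"  # C marks a completed sequence
    if primer not in ("U", "R"):
        # T3/T7 etc. are recorded in the database record, not in the name
        raise ValueError("primer must be 'U' or 'R'")
    if step is None or step < 1:
        raise ValueError("step must be a positive integer")
    return f"{clone_name}{primer}{step}"


print(sequence_name("Gm_ISb002_091_F11", "U", 3))
print(sequence_name("Gm_ISb002_091_F11"))
```

Rejecting primer labels other than U and R keeps vendor-specific primer names out of the sequence name, in line with the recommendation to store the actual primer in the database record.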
For clones that are already in use and that cannot easily be fitted into this scheme, a simplified version of the above may be used.
• cDNA Gm-cn-arbitrary text: subclones and sequences would be named as above.
• RFLP Gm-pn-current rflp name: all current G. max probes would be considered 'library' 1. An example would be Gm-p1-A632.
• AFLP Gm-an-arbitrary text: subclones and sequences would be named as above.
• RAPD Gm-r-primer name from supplier: subclones and sequences would be named as above.
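The simplified names can be split in a similar way. The sketch below is an assumption-laden illustration rather than a prescribed format: the function name `parse_simplified_name` is hypothetical, the type letters are taken from the list above, and the library number is treated as optional because the RAPD form omits it.

```python
import re

# Type letters as used in the simplified names above.
CLONE_TYPES = {"c": "cDNA", "p": "RFLP", "a": "AFLP", "r": "RAPD"}

# Assumed pattern: species code, type letter, optional library number, free text.
SIMPLIFIED_NAME = re.compile(r"^([A-Z][a-z])-([cpar])(\d*)-(.+)$")


def parse_simplified_name(name):
    """Split a simplified clone name, or return None if it does not fit."""
    match = SIMPLIFIED_NAME.match(name)
    if match is None:
        return None
    species, type_letter, library, label = match.groups()
    return {
        "species": species,
        "type": CLONE_TYPES[type_letter],
        "library": library,  # empty string when no library number is given
        "label": label,      # existing probe name, primer name, or arbitrary text
    }


print(parse_simplified_name("Gm-p1-A632"))
```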
This is obviously an area badly in need of some consistency. In general it is inappropriate to include such details as band sizes, coordinates, restriction sites, cloning vehicle history, primer IDs, dates, etc. in locus names, as these quickly become irrelevant. Additionally, reconciling new and old names as the technique, laboratory, and species change, or when one locus overlaps, includes, or is identical to another, is extremely difficult. We recommend as a general rule that the locus name should not contain an overabundance of information, but rather just enough to point to a database entry where all of the rest of the information can be found. This is the model used by the public sequence databases. A similar approach has worked well for soybean genes, and we think it can work for other kinds of genetic loci also. Many of the loci we are using will eventually be used in other species, and it would be inconvenient to have to either rename loci for each species or carry along a lot of baggage that is of historical interest only.
The Committee recommends that: