Once we start editing DNA on a large scale, we will need to keep track of what we do: maintain revision histories, comment the new genes and add copyright notices. This is a suggested standard for entering ASCII information into the genome:
We will use 4-base codons to encode 7-bit ASCII. I know it is a bit primitive, but I think it does well enough and we might want to use the extra bit (see below). Each base codes two bits, and the complementary base codes the inverse:
A: 00 G: 01 C: 10 T: 11
Thus each character will be coded as four bases, read in the canonical 5'->3' direction.
The letters 'DNA' will thus become
01000100 01001110 01000001
GAGA     GATC     GAAG
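The encoding step can be sketched in a few lines of Python. This is only an illustration of the table above; the names `BASE_FOR_BITS` and `encode` are made up for this example and are not part of the proposed standard.

```python
# Two bits per base, following the table above (A: 00, G: 01, C: 10, T: 11).
BASE_FOR_BITS = {"00": "A", "01": "G", "10": "C", "11": "T"}


def encode(text: str) -> str:
    """Encode 7-bit ASCII text as DNA, four bases per character (5'->3')."""
    bases = []
    for ch in text:
        code = ord(ch)
        if code > 127:
            raise ValueError(f"not 7-bit ASCII: {ch!r}")
        bits = format(code, "08b")  # eight bits; the leading bit is always 0
        for i in range(0, 8, 2):
            bases.append(BASE_FOR_BITS[bits[i:i + 2]])
    return "".join(bases)


print(encode("DNA"))  # GAGAGATCGAAG
```

The output matches the hand-worked example: D gives GAGA, N gives GATC and A gives GAAG.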
The problem when reading a DNA string is: which strand should we read? If we read the complementary strand, we will get the string inverted and backwards. But since we use 7-bit ASCII, we can test whether every 8th bit is a one or a zero, and deduce which strand we are on. The reading process thus tries out the eight starting frames, and chooses the one which gives an unbroken stretch of ones or zeros in that position. If the stretch is all zeros, the bases are read and converted directly; if it is all ones, the bases are read to the end of the message, then inverted and reversed before conversion. Note that some errors become detectable this way, as interruptions in the stretch of identical bits.
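A rough Python sketch of the reading process follows. It rests on a few assumptions of mine. Since each base carries two bits, only four of the eight bit offsets fall on base boundaries, so the sketch tests four base offsets and checks both strand orientations at each. On the written strand the unused high bit is the first bit of the first base in each codon, so that base must be A or G. Read from the complementary strand, the same bit shows up inverted in the last base of each codon, so that base must be C or T. The reversal is done base by base (pairs of bits), not bit by bit. All function names are hypothetical.

```python
BITS_FOR_BASE = {"A": "00", "G": "01", "C": "10", "T": "11"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Invert every base and reverse the order: the other strand, read 5'->3'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def decode_forward(seq: str) -> str:
    """Convert whole 4-base codons to characters; leftover bases are ignored."""
    chars = []
    for i in range(0, len(seq) - len(seq) % 4, 4):
        bits = "".join(BITS_FOR_BASE[b] for b in seq[i:i + 4])
        chars.append(chr(int(bits, 2)))
    return "".join(chars)


def read_comment(seq: str):
    """Try each base offset; return the decoded text, or None if no frame fits."""
    for offset in range(4):
        frame = seq[offset:]
        usable = len(frame) - len(frame) % 4
        if usable == 0:
            continue
        codons = [frame[i:i + 4] for i in range(0, usable, 4)]
        # High bit 0 in every codon: we are on the strand that was written.
        if all(c[0] in "AG" for c in codons):
            return decode_forward("".join(codons))
        # High bit 1 at the end of every codon: we are on the other strand.
        if all(c[3] in "CT" for c in codons):
            return decode_forward(reverse_complement("".join(codons)))
    return None


print(read_comment("GAGAGATCGAAG"))  # DNA
print(read_comment("CTTCGATCTCTC"))  # DNA (the complementary strand)
```

For short messages more than one frame can pass the test by chance; this sketch simply returns the first frame that fits, whereas a real reader would want the longest unbroken stretch.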
To delineate the comments, we need markers. A standard could be the sequence corresponding to "COMMENT COMMENT COMMENT..." repeated a number of times (we don't want to use a long stretch of similar bases, since it would influence the bending of DNA, which might lead to unwanted effects).
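To see what such a marker would look like, and to check it for long stretches of identical bases, one could use something like the following Python sketch. The trailing space after each "COMMENT" and the choice of three repeats are my assumptions, and the helper names are made up for the example.

```python
from itertools import groupby

# Same two-bit table as above (A: 00, G: 01, C: 10, T: 11).
BASE_FOR_BITS = {"00": "A", "01": "G", "10": "C", "11": "T"}


def encode(text: str) -> str:
    """Encode 7-bit ASCII text as DNA, four bases per character."""
    out = []
    for ch in text:
        bits = format(ord(ch), "08b")
        out.extend(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, 8, 2))
    return "".join(out)


def longest_run(seq: str) -> int:
    """Length of the longest stretch of one repeated base."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


marker = encode("COMMENT " * 3)  # assumed marker text: three repeats
print(marker[:32])               # GAATGATTGATGGATGGAGGGATCGGGAACAA
print(longest_run(marker))       # 3
```

With this marker text no base repeats more than three times in a row, which gives a simple way to compare candidate markers against the bending concern.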
A problem is that we might accidentally create active regions in the DNA with these comments; ideally we should choose a coding that minimizes the biological effects of the comment. Methylating the cytosine bases will also inactivate the comment. If the comment can be marked as an intron, it could also be placed inside exons, making sure that the comment follows the gene it belongs to.
Thanks to John D. Gleason for the methylating and intron ideas.