Once we start editing DNA on a large scale, we will need to keep track of what we do: maintain revision histories, comment the new genes and add copyright notices. This is a suggested standard for entering ASCII information into the genome:
We will use 4-base codons to encode 7-bit ASCII. I know it is a bit primitive, but I think it does well enough and we might want to use the extra bit (see below). Each base codes two bits, and the complementary base codes the inverse:
A: 00 G: 01 C: 10 T: 11
Thus each character will be coded as four bases, read in the canonical 5'->3' direction.
The letters 'DNA' will thus become
01 00 01 00   01 00 11 10   01 00 00 01
G  A  G  A    G  A  T  C    G  A  A  G
or GAGAGATCGAAG.
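The encoding above can be sketched in a few lines of Python. This is only an illustration of the scheme as described; the names BASES and encode are made up for this example and are not part of the proposal.

```python
# Two-bit value of each base, per the table above: A=00, G=01, C=10, T=11.
# The position of a base in this string is its two-bit value.
BASES = "AGCT"


def encode(text: str) -> str:
    """Encode 7-bit ASCII text as DNA, four bases per character."""
    bases = []
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError(f"not 7-bit ASCII: {ch!r}")
        # Emit the four bit pairs from most to least significant,
        # so the high (always zero) bit lands in the first base.
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(code >> shift) & 0b11])
    return "".join(bases)


print(encode("DNA"))  # GAGAGATCGAAG
```

Because the high bit of every 7-bit character is zero, the first base of every codon produced this way is always A or G.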
The problem when reading a DNA string is: which strand should we read? If we read the complementary strand, we get the message backwards and with every bit inverted. But since we use 7-bit ASCII, the eighth (high) bit of every character is always zero, so we can test whether every eighth bit is a one or a zero and deduce which strand we are on. The reading process thus tries out the eight starting frames and chooses the one that gives an unbroken stretch of ones or zeros. If the bits are zeros, the bases are read and converted directly; if they are ones, the bases are read to the end of the message, then inverted and reversed. Note that some errors become detectable this way, as interruptions in the run of identical bits.
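A minimal sketch of the reading process follows. It rests on one assumption of mine: the eight starting frames are taken to be the four codon offsets on each of the two strands. Instead of looking for a run of ones and then inverting, it takes the reverse complement first and looks for zeros on both strands, which amounts to the same test. A high bit of zero means the first base of a codon is A or G. All function names are hypothetical, and the input is assumed to contain only the uppercase letters A, G, C and T.

```python
BASES = "AGCT"  # position in the string = two-bit value of the base
COMPLEMENT = str.maketrans("AGCT", "TCGA")


def reverse_complement(seq: str) -> str:
    """Return the other strand, read in its own 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def decode_frame(seq: str, offset: int):
    """Decode whole codons starting at offset.

    Returns None if any codon has its high bit set (first base not A or G),
    or if the frame contains no complete codon.
    """
    n_codons = (len(seq) - offset) // 4
    if n_codons < 1:
        return None
    chars = []
    for i in range(n_codons):
        start = offset + 4 * i
        codon = seq[start:start + 4]
        if codon[0] not in "AG":
            return None
        value = 0
        for base in codon:
            value = (value << 2) | BASES.index(base)
        chars.append(chr(value))
    return "".join(chars)


def candidate_readings(seq: str):
    """Try all eight frames; return (strand, offset, text) for each that passes."""
    results = []
    strands = (("forward", seq), ("reverse", reverse_complement(seq)))
    for strand, strand_seq in strands:
        for offset in range(4):
            text = decode_frame(strand_seq, offset)
            if text is not None:
                results.append((strand, offset, text))
    return results
```

For a message as short as 'DNA' more than one frame can pass the test by chance, so the sketch returns every candidate rather than a single answer. For longer comments the wrong frames become increasingly unlikely to survive.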
To delineate the comments, we need markers. A standard could be the sequence corresponding to "COMMENT COMMENT COMMENT..." repeated a number of times (we don't want to use a long stretch of similar bases, since it would influence the bending of DNA, which might lead to unwanted effects).
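As a rough illustration, a marker of this kind can be generated and checked for long runs of identical bases. The repeat count of three and the helper names make_marker and longest_run are assumptions made for this sketch; the proposal does not fix the number of repetitions.

```python
from itertools import groupby

BASES = "AGCT"  # A=00, G=01, C=10, T=11


def encode(text: str) -> str:
    """Encode 7-bit ASCII text as DNA, four bases per character."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError(f"not 7-bit ASCII: {ch!r}")
        for shift in (6, 4, 2, 0):
            out.append(BASES[(code >> shift) & 0b11])
    return "".join(out)


def make_marker(repeats: int = 3) -> str:
    """Encode 'COMMENT' repeated, separated by spaces (repeat count is assumed)."""
    return encode(" ".join(["COMMENT"] * repeats))


def longest_run(seq: str) -> int:
    """Length of the longest stretch of identical consecutive bases."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


marker = make_marker()
print(marker)
print("longest run of identical bases:", longest_run(marker))
```

The same longest_run check could be applied to any candidate marker text to compare how repetitive its base sequence is.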
A problem is that we might accidentally create active regions in the DNA with these comments; ideally we should choose a coding that minimizes the biological effects of the comment. Methylating the cytosine bases could also help inactivate the comment. If the comment can be marked as an intron, it could even be placed inside exons, ensuring that it stays with the gene it belongs to.
Thanks to John D. Gleason for the methylating and intron ideas.
The Personal Genome Project, initiated in 2005, is a vision and coalition of projects across the world dedicated to creating public genome, health, and trait data. Sharing data is critical to scientific progress, but has been hampered by traditional research practices. The PGP approach is to invite willing participants to publicly share their personal data for the greater good.
A nonprofit DNA and genealogy research website. You have to upload your DNA sequencing report to be allowed to search the database. Requires registration.
OpenPCR is a fully functional yet affordable (US$599) PCR (polymerase chain reaction) device, used for replicating DNA for the purposes of sequencing or barcoding (species determination). The whole kit, from the software to the hardware itself, is open source, so you can download the code, CAD, and Eagle files and build your own if you don't want to buy a kit.
An open source software project for converting sequenced DNA into music. It goes well beyond simply assigning notes to nucleotides: pieces are generated by interpreting the electrochemical properties of a DNA sequence in different ways. The site allows you to upload your own pieces as well as listen to those of others (oh, and download the software).
Online store for genetic research equipment and DIY kits.