I got to spend last week in sunny California. I forgot how wonderful it is to sit and eat lunch outside! I was participating in a workshop held at the Department of Energy's Joint Genome Institute (JGI). The workshop was entitled Microbial Genomics and Metagenomics. Basically, I spent the week learning about the tools available to help biologists deal with the flood of data coming out of sequencing technologies, which keep getting faster and cheaper. Since microbes are not exactly easy to observe with one's eyes, microbiologists rely heavily on genetic data to tell us about our organisms of interest. For environmental microbiologists, whose organisms often cannot currently be grown in the lab, knowing what genes our "bugs" contain tells us what processes they might be capable of and can also provide insight into their evolutionary history. We can extract the total DNA from a soil sample and begin to get a picture of what the community living within that soil is capable of. However, before we do that, we need to take the massive files of As, Cs, Ts, and Gs and figure out how to interpret them.
Imagine a text document of one of Shakespeare's plays (or even a page of said play) with all of the spaces removed. Imagine each line cut out and shredded into a few random pieces. Imagine that you had 20 copies of that document and each was shredded differently. These multiple copies (or coverage in the bioinformatics world) allow you to attempt to piece together the play by finding pieces that overlap. This can be tricky if there are certain words or phrases that repeat frequently, but given the right computer program you can start to put some of the strips into larger phrases. Lining up overlapping pieces this way is referred to as aligning your sequences, and stitching them into longer stretches is usually called assembly.
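To make the shredded-play idea a bit more concrete, here is a minimal sketch in Python of piecing strips together by their overlaps. It is only a toy under stated assumptions: the function names (`overlap`, `greedy_assemble`), the `min_len` cutoff, and the example strips are all made up for illustration, and the greedy "merge the best overlap first" strategy is a simplification, not what any particular assembler does. As noted above, repeated phrases can fool an approach like this.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len
    (min_len is an arbitrary illustrative cutoff).
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of strips with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop strips that sit entirely inside a longer strip.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps enough to merge
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Two "copies" of the same line, shredded differently (hypothetical strips).
strips = ["allthewor", "ldsastage", "allth", "eworldsas", "tage"]
print(greedy_assemble(strips))
```

With only one copy (`"allthewor"` and `"ldsastage"`), nothing overlaps by three or more letters and the strips stay separate. The second copy supplies `"eworldsas"`, which bridges the gap, and that is the point of having coverage.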
Once you have aligned sequences, there are tools available to search them for segments that could represent genes. A genome that has been searched for known and recognizable genes is said to be annotated.
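As a rough illustration of what "searching for segments that could represent genes" can mean, here is a toy sketch in Python that looks for open reading frames: stretches that begin with the start codon ATG and run, three letters at a time, to a stop codon. The function name `find_orfs`, the `min_codons` cutoff, and the example sequence are invented for this post. It scans only the forward strand, and real gene-finding tools do far more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons in the standard genetic code


def find_orfs(seq, min_codons=3):
    """Return (start, end) positions of simple ORFs on the forward strand.

    An ORF here is ATG ... stop codon, read in one of the three frames.
    Positions are 0-based, and end is exclusive (Python slice style).
    min_codons is an arbitrary illustrative length cutoff.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # remember the first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Hypothetical sequence with one short ORF hidden in the third reading frame.
dna = "CCATGAAATTTGGGTAACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

Candidate segments like these still have to be compared against known genes before anything can be called annotated.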
There is an interesting paradox here. Technology keeps improving, which means the volume of genetic sequence data we have to analyze is growing faster than the tools and programs we have to analyze it. However, the more bioinformatic data that exists, the better our analyses will be, because there will be fewer unrecognized genes and more organisms will have been discovered. I can't wait to see how archaic the program that I spent 5 hours struggling with today (ARB for anyone who is familiar... uggh!) will seem in 10 or even 2 years!