How are genome mapping and assembly performed?
A high school student from California asks:
"My teammates and I are doing an interdisciplinary project on human genome mapping and we can’t seem to understand the method of human genome mapping. How is a human genome mapping performed?"
Probably not quite like you might think. When scientists sequence a person’s genome, they don’t get to read off the whole three-billion-letter sequence at once. Instead, they get millions of short sequences, each a few hundred letters long or shorter.
It is as if you wanted to read a book, but someone had cut several copies of it up into little pieces. Each piece had a few words on it, but you didn’t know what order the pieces were in. It would be pretty hard to figure out what was going on!
This is why the next step is to take these millions of bits of DNA and piece them all back together in the right order. As you’ll see below, this is a lot trickier if you are sequencing a brand new beast. Luckily this isn’t as big a problem for human sequences because we’ve already more or less figured out where the sequences should go.
Scientists have put together what they call a “reference genome”, which should be very close to most people’s genome sequence. This is true because even though every person’s DNA is unique, 99.9% of the sequence will be the same between people.
Scientists try to map each individual sequencing read to where in the genome it matches best. Many sequences will map perfectly to a region of the genome.
Other times, a sequence will have one or a few differences but will otherwise match fairly well. These sequences are the ones that are the most interesting as they tell the scientists how someone’s DNA is different from the reference genome.
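To make the idea of mapping concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how real mapping software works: the tiny `reference` string, the two reads, and the helper names `hamming`, `map_read`, and `differences` are all made up for illustration. It slides each read along the reference, keeps the spot with the fewest mismatched letters, and then lists the differences.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read: str, reference: str, max_mismatches: int = 2):
    """Return (position, mismatches) for the best placement, or None.

    Brute force: try every position in the reference. If several
    positions tie, the leftmost one is kept.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        d = hamming(read, window)
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best


def differences(read: str, reference: str, pos: int):
    """List (reference_position, reference_letter, read_letter) per mismatch."""
    return [
        (pos + i, reference[pos + i], letter)
        for i, letter in enumerate(read)
        if reference[pos + i] != letter
    ]


# Hypothetical 25-letter "reference genome" and two made-up reads.
reference = "ACGTTGCAAGGCTTACCGATAGGCT"
perfect_read = "AAGGCTTAC"   # matches the reference exactly
variant_read = "AAGGCATAC"   # same place, but one letter is different

print(map_read(perfect_read, reference))
print(map_read(variant_read, reference))
print(differences(variant_read, reference, 7))
```

The second read lands in the same place as the first but carries one changed letter, which is exactly the kind of difference scientists look for. Checking every position like this would be far too slow for three billion letters, so real programs rely on much faster search strategies.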
For example, when scientists try to understand the causes of a patient’s cancer they might look for the mutations specific to the tumor of that patient. Or when trying to figure out the genes responsible for a disease like diabetes, scientists will try to figure out all the DNA differences that are shared by a large group of people with that disease.
Mapping millions of short sequences to a three-billion-letter sequence would be impossible for a person to do without the help of computers. Computer programs have been written to do this mapping quickly and effectively.
So there you have it. Scientists can easily read off a bit of someone’s DNA at a time and then a computer matches it to the right place in the genome. Do this a few million times and you have a new genome sequenced!
For the next part of this answer, I will go over one of the major challenges in genome mapping — repeated DNA. I will then go into how you’d figure out the genome for a plant or animal that has not been sequenced before.
There are parts of the genome that are really hard to read. And some of the most difficult parts are those where a DNA sequence is repeated over and over.
Think back to our example of the book cut up into little pieces. If the phrase “Genetics is awesome” is repeated many times in the book, we wouldn’t be able to figure out where any one copy of that phrase came from.
The human genome contains a lot of repeated sequences. Repeats can happen for several reasons. For example, over time, many parts of our genome have been copied several times because of mistakes in how DNA gets copied and passed on from one generation to the next.
Our genome is also home to genetic parasites — sequences of DNA for which making more copies of themselves is their only goal. Sometimes called “jumping genes”, these parasites can be harmful if they jump into an important gene. A few have been co-opted to do some important work in our cells but mostly these transposons (as they are called) just take up space.
These repetitive sequences make mapping all of a person’s DNA very hard. Getting longer sequencing reads can help with this problem.
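A small Python sketch can show why repeats cause trouble and why longer reads help. Everything here is invented for illustration: the toy `reference`, the repeat unit `GATTACA`, and the helper `exact_hits`. A short read that sits entirely inside the repeat matches in several places, while a longer read that reaches into the unique sequence on either side matches in only one.

```python
def exact_hits(read: str, reference: str) -> list[int]:
    """Return every start position where the read matches exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Keep searching just past the previous hit.
        start = reference.find(read, start + 1)
    return hits


# Toy reference: the same repeat appears three times,
# separated by stretches of unique sequence.
repeat = "GATTACA"
reference = "CCGT" + repeat + "TTGC" + repeat + "AGGA" + repeat + "CTAG"

short_read = "GATTACA"          # lies entirely inside the repeat
long_read = "TTGCGATTACAAGGA"   # second repeat plus its unique neighbors

print(exact_hits(short_read, reference))  # several possible homes
print(exact_hits(long_read, reference))   # only one possible home
```

The short read could have come from any of three places, so there is no way to know which copy it belongs to. The long read includes enough unique sequence to pin it down.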
Fortunately, for most purposes mapping 100% of the genome perfectly is not all that important. Scientists generally think that many repetitive regions of the genome don’t do very much, although there is some debate about this question.
If you were putting together a book and there was a whole chapter repeating the sentence “Genetics is awesome”, you probably wouldn’t care that much if you couldn’t figure out exactly where each copy of the sentence belonged. You’d probably be more interested in the parts of the book with unique information. The same is usually true for genomes too.
So far we have focused on how to sequence something that already has a reference genome. It becomes much trickier to sequence a new plant or animal.
In that case, scientists would have to figure out the genome from scratch. Assembling a genome is much harder than just mapping reads to a genome that already exists. If you already have a complete copy of a book, it’s a lot easier to figure out where all those fragments came from. If you’re starting from scratch, it’s far more difficult!
For assembling a genome, computer programs rely on the overlaps between different sequencing reads. These overlaps can be used to string together short sequences into longer sequences.
Going back to our book example, we might get the following two sentence fragments:
- “Genetics is awesome and everyone”
- “and everyone should learn more about it”
We could use the overlapping words “and everyone” to guess that those fragments belong together to make the sentence “Genetics is awesome and everyone should learn more about it.” Of course, there are probably lots of “and everyone” phrases in a book with three billion letters, which makes things harder.
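The book example can be sketched in a few lines of Python. This is only an illustration of the overlap idea, not a real assembler: the helper names `longest_overlap` and `merge` and the `min_len` cutoff are assumptions made for this sketch. The code looks for the longest ending of one fragment that is also the beginning of the other, then glues the two together.

```python
def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is a prefix of `right`.

    Overlaps shorter than `min_len` are ignored, since very short
    overlaps are likely to occur by chance.
    """
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left: str, right: str, min_len: int = 3):
    """Join two fragments on their overlap, or return None if none is found."""
    size = longest_overlap(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


a = "Genetics is awesome and everyone"
b = "and everyone should learn more about it"

print(longest_overlap(a, b))
print(merge(a, b))
print(merge(b, a))  # wrong order: no overlap, so nothing to join
```

An assembly program repeats this kind of step over and over across millions of fragments, which is where repeated phrases start to cause wrong joins.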
This is why longer reads are really helpful for assembly: they give more and longer overlaps, which are less likely to match the wrong place by chance. Even with long reads, scientists usually need additional information to complete a new genome.
Scientists may first create a rough map of where particular bits of DNA sequence sit, called a linkage map. You can think of a linkage map as an outline for our book. To build one, scientists work out roughly how far apart bits of sequence are by measuring how often they are passed down together from a parent to its offspring. Bits that sit close together tend to be inherited together more often.
These initial linkage maps can be useful because they can let scientists break the genome into smaller pieces that are each sequenced and assembled on their own. The map then helps with putting those pieces together.
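As a rough illustration of how "how often they are passed down together" turns into a map, here is a small Python sketch. The marker names `A`, `B`, and `C`, the offspring counts, and the helper `map_distance_cm` are all hypothetical. It uses the common rule of thumb that, for nearby markers, the percentage of offspring in which two markers were split up approximates their distance in map units (centimorgans).

```python
def map_distance_cm(split_up: int, offspring: int) -> float:
    """Approximate map distance: percent of offspring in which the
    two markers were NOT inherited together."""
    if offspring <= 0:
        raise ValueError("offspring must be positive")
    return 100 * split_up / offspring


# Hypothetical counts: (offspring where the pair was split up, offspring scored).
counts = {
    ("A", "B"): (10, 200),
    ("B", "C"): (30, 200),
    ("A", "C"): (38, 200),
}

distances = {pair: map_distance_cm(*c) for pair, c in counts.items()}

# The pair that is split up most often sits at the two ends of the map;
# the remaining marker must lie between them.
outer = max(distances, key=distances.get)
middle = ({"A", "B", "C"} - set(outer)).pop()
order = [outer[0], middle, outer[1]]

print(distances)
print(order)
```

With these made-up numbers, A and C are separated most often, so B is placed between them. That ordering is the kind of outline that helps arrange separately assembled pieces.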
Author: Alicia Schep
When this answer was published in 2014, Alicia was a Ph.D. candidate in the Department of Genetics, studying the role of nucleosome positioning in gene regulation in Will Greenleaf’s laboratory. She wrote this answer while participating in the Stanford at The Tech program.