Sunday, September 28, 2014

I Think I Understand the AncestryDNA Methodology Now?? i4GG

Some of my own SNPs
One letter comes from Mom and one from Dad in a random order

I watched the i4GG video "AncestryDNA matching: large-scale findings and technology breakthroughs". I've been curious and confused about the methodology used by AncestryDNA. From the start their autosomal testing has been a mysterious and secretive process, which has given rise to suspicions. They wouldn't release raw data to customers in the beginning, and many people felt they were hiding the fact that most matches were spurious. The fact that they still don't have anything like a chromosome browser leaves us wondering about the validity of the results. On the other hand, the fact that they phase their results should lead to better, more confident matches than the other companies. The phasing process hadn't been completely clear to me until I listened to Dr. Julie Granka's presentation, where she explained it in greater detail. I believe I understand it now?

This is my understanding of the phasing process (I never excelled in science or math in school). If anyone has a better understanding please let me know:

Dr. Julie Granka emphasized the large size of the AncestryDNA data collection, generated from over 500,000 customers, which is leading to more accurate results. The phasing process attempts to separate your results into two groups, one for each parent. At the position of an SNP you get one marker (A, C, G or T) from your mother and one from your father. If, for instance, you are AG at a position, your mother is AG at that same position, and your father is AT, we can infer that the A is from your father and the G is from your mother, because your father has no G to pass on. So your genotype, the marker combinations, comes from both parents. The phasing process is designed to separate your single genotype into the haplotypes you got from your mother and father.

The phasing process relies on comparing your genotype with those of people with known haplotypes (haplotypes are just strings of markers, the ACGTs that are the building blocks of DNA, at a run of SNPs, shared by groups of people). Your haplotypes are then inferred from the results of these comparisons. The process is complicated by the fact that at many positions they don't know which of the two markers came from which parent, so the results can't simply be read off in a continuous line. There is some sort of formula for reading these scrambled marker pairs and separating them into haplotypes for Mom and Dad. The process can misinterpret a block of DNA as a haplotype when it's actually a mix of markers inherited from both parents that happens to look like a known haplotype. It's also possible that one of your haplotypes has not been seen before. When a mismatch occurs it throws the rest of the phasing off, so it's important to limit mismatching.

Their old phasing process took 7 to 10 hours for 1000 tests and resulted in 3 errors per 100 heterozygous sites; the new process takes 5 minutes and results in only 1 error. So the process continues to be refined. Still, around half of our thousands of matches are IBS, so it's not perfect.
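The Mom-and-Dad example above (me AG, Mom AG, Dad AT) can be worked through in a few lines of Python. This is only my own sketch of the simple rule "a parent can only pass on a letter they carry"; the function name `phase_with_parents` is made up for illustration, and it is not AncestryDNA's method, which has to infer haplotypes from reference data without the parents' results.

```python
def phase_with_parents(child, mother, father):
    """Return every (from_mother, from_father) split of the child's two
    letters that is consistent with both parents' genotypes.

    Each genotype is a two-letter string such as "AG".
    An empty list means the three genotypes don't fit together;
    two entries mean the site can't be phased from the parents alone.
    """
    first, second = child
    # The child's letters can be split two ways (only one way if both match).
    candidates = {(first, second), (second, first)}
    return sorted(
        (from_mom, from_dad)
        for from_mom, from_dad in candidates
        if from_mom in mother and from_dad in father
    )


# Me AG, Mom AG, Dad AT: Dad has no G, so the G must be Mom's.
print(phase_with_parents("AG", "AG", "AT"))  # [('G', 'A')]

# If all three of us are AG, either split works, so the site stays ambiguous.
print(phase_with_parents("AG", "AG", "AG"))  # [('A', 'G'), ('G', 'A')]
```

The second call shows why phasing is hard even with both parents tested: when everyone is heterozygous for the same two letters, that position by itself tells you nothing about who gave what.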

The haplotypes are very important in the AncestryDNA matching process. To be a high confidence match, your match has to share a certain amount of DNA with you and also have the same haplotype on that particular segment.

Sometimes these haplotypes proliferated because they were advantageous. Dr. Granka used the example of lactose intolerance. Ancient populations were all lactose intolerant. When animals were domesticated and people began using their milk, the genetic mutation that allowed milk to be digested became an advantage. It allowed the person who carried it and their descendants to get more nourishment and reproduce at a higher rate. So many of us share some of these blocks because they provided a genetic advantage.

The fact that many people share the same DNA blocks presented AncestryDNA with a problem: do all of these people share a common ancestor in the genealogical time frame? They determined that blocks shared by huge numbers of people were IBS and should not be used for matching. This led to a smaller number of matches, although I still have 11,000.

Some other very interesting points:
  1. In a group of 200 people there is a 97% chance of finding a pair of 4th cousins
  2. If you can't find evidence of an ancestor in your DNA (and they are several generations removed from you) it could be you just didn't inherit any perceptible DNA from them.
  3. "Absence of evidence isn't evidence of absence."
  4. We have 120,000 7th cousins, which increases our odds of finding a match at that distance
  5. There are 30 million 4th cousin matches at AncestryDNA out of around 500,000 in the database
  6. The average person has five 3rd cousin matches at AncestryDNA (I don't have any. My Mom has 7)
  7. The average person has 147 4th cousin matches at AncestryDNA
  8. At 20 generations we share DNA with around 1200 of our 1 million ancestors
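
The "1 million ancestors" in the last point is just the number of places on the family tree doubling every generation. Here is a quick check in Python, using the 1200 figure from the talk. The function name is my own, and the count is of positions on the tree rather than distinct people, since the same person can fill more than one position when cousins marry.

```python
def ancestor_slots(generations_back):
    """Number of positions on the pedigree at a given generation:
    2 parents, 4 grandparents, 8 great-grandparents, and so on."""
    return 2 ** generations_back


slots = ancestor_slots(20)
print(slots)  # 1048576, roughly 1 million

# Share of those positions we carry DNA from, using the 1200 quoted above.
share = 1200 / slots
print(f"{share:.2%}")  # 0.11%
```

So at 20 generations back we would carry detectable DNA from only about one in a thousand of those tree positions, which fits with point 2 above.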
