Pike Surname DNA Project

Technical Questions

This list of questions and answers is probably far from exhaustive, so if you have a question that isn't answered here, email David and/or Stuart at dapike@mun.ca or cocostu@spamex.com.

What part of participants' DNA is tested?

The DNA segments that are tested belong to the Y-chromosome, which is just one of the 24 distinct chromosomes (the 22 non-sex chromosomes plus X and Y) found in human males. The Y-chromosome is inherited by male children from their fathers. Since the Y-chromosome of a male child is not genetically influenced by his mother's DNA, its genetic variation is solely the result of occasional mutations that occur naturally. What this means is that each man's Y-chromosome is extremely similar to those of other men in his paternal family, which in turn makes Y-chromosome analysis an ideal tool for genealogists.

In contrast to the Y-chromosome (and also the X-chromosome), the 22 non-sex human chromosomes obtain half of their genetic matter from a father and the other half from a mother. It is these chromosomes (which are not being tested) that constitute our individual genetic uniqueness.

Each of us also has mitochondrial DNA (mtDNA), which we inherit solely from our mothers. FamilyTreeDNA does offer tests that perform mtDNA analysis, but since mtDNA comes from our maternal side, we generally cannot associate it with a single surname. Surname inheritance usually comes from one's father, and so it's paternal DNA (i.e. the Y-chromosome) that is being tested for the Pike DNA Project. That said, some participants in the Pike Surname DNA Project have had their mtDNA tested; these test results are being presented on a separate "mtDNA Results" page.

What does "DYS" mean?

DYS stands for DNA Y-chromosome Segment. These segments are the markers that are tested.

What is Genetic Distance?

The Genetic Distance between two individuals is calculated by summing the differences in their DYS values (well, except for a few particular DYS's that are described in the following questions). Genetic Distance serves as a simple measure of how closely or distantly two people are related.
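As a rough illustration of the basic idea, here is a minimal Python sketch that sums the differences over ordinary single-copy markers only. The function name and the marker values for the two hypothetical men are made up for this example, and the special-case markers mentioned above are deliberately left out.

```python
def simple_genetic_distance(person_a, person_b):
    """Sum the absolute differences over the markers both people have tested.

    Only ordinary single-copy markers should be passed in; the special-case
    markers need separate handling and are not covered here.
    """
    shared_markers = person_a.keys() & person_b.keys()
    return sum(abs(person_a[m] - person_b[m]) for m in shared_markers)


# Made-up values for two hypothetical men (not real project results).
allan = {"393": 13, "390": 24, "391": 11, "426": 12, "388": 12}
zachary = {"393": 13, "390": 25, "391": 10, "426": 12, "388": 12}

print(simple_genetic_distance(allan, zachary))
```

Here the two men differ by 1 on DYS-390 and by 1 on DYS-391, so the tally is 2.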

A Genetic Distance of zero constitutes a perfect match between the two people, which generally indicates that they share a common forefather (especially if they are comparing 37 or more markers). The greater the Genetic Distance between two people, the less likely that they share a common forefather in recent generations. FamilyTreeDNA has prepared a few charts that correlate Genetic Distance with the likelihood of two people being related. These charts are available for comparisons of 12, 25, 37, 67 and 111 markers.

In the analysis descriptions for several of the Pike test results presented earlier in this webpage, percentages have been stated, estimating the likelihood of a common forefather within particular time frames. These estimates are obtained from FamilyTreeDNA's Time Predictor tool, which is available to project participants who closely match other participants' results. As noted in the FTDNATiP FAQ, the time predictor takes into account the particular markers on which two people are mismatched, as well as the individual mutation rates for these markers.

Why are some DYS's described as (a), (b), etc?

DYS's 385 (a and b), 459 (a and b), 464 (a-d), YCA II (a and b), and CDY (a and b) are multi-copy markers. Multi-copy markers have been found to occur at 2 or more specific sites on the Y-chromosome (and hence there are multiple copies of the marker). The values of the copies found at the locations tested are reported in increasing numerical order as copy (a), (b), etc. So when comparing test results for two people, you should treat multi-copy markers as sets, rather than as individual values. For example, if Allan's DYS-385 values are 10 and 12, while Zachary's are 12 and 15, then they actually match on 1 of the 2 copies. The match would be on Allan's DYS-385(b) and Zachary's DYS-385(a) values.
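The "treat them as sets" idea can be sketched in Python by comparing the copies as multisets rather than position by position. The helper name shared_copies is made up for this illustration; it only counts matching copies and does not attempt to assign a Genetic Distance.

```python
from collections import Counter


def shared_copies(copies_a, copies_b):
    """Count how many copy values two people have in common on a multi-copy marker.

    The copies are compared as multisets, so the (a)/(b) labels are ignored.
    """
    overlap = Counter(copies_a) & Counter(copies_b)  # multiset intersection
    return sum(overlap.values())


allan_385 = [10, 12]
zachary_385 = [12, 15]

print(shared_copies(allan_385, zachary_385))
```

For the Allan and Zachary example above, the shared value is the 12, so the function reports 1 matching copy out of 2.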

Multi-copy markers are further discussed in the March 2004 issue of the FamilyTreeDNA newsletter.

How are DYS-389(i) and DYS-389(ii) connected to one another?

When two people are determining their Genetic Distance, they generally tally up the number (and magnitude) of the differences for each DYS. They have to be a bit careful for multi-copy markers (see the previous question), but they also have to be careful when comparing DYS-389(i) and DYS-389(ii).

Comparing values for DYS-389(i) is done in the normal way, by simply determining the difference between individual values. For DYS-389(ii), the comparison that needs to be performed is between each person's difference of (DYS-389(ii) minus DYS-389(i)). As an example, if Allan's DYS-389(i) and (ii) values are 12 and 30, while Zachary's are 13 and 30, then they would tally 2: 1 for DYS-389(i) alone, and then another 1 for the difference between Allan's (30-12)=18 and Zachary's (30-13)=17.
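The DYS-389 rule just described can be written out as a short Python sketch. The function name dys389_tally is a made-up one for illustration; the arithmetic follows the description above.

```python
def dys389_tally(a_i, a_ii, b_i, b_ii):
    """Tally the differences on DYS-389(i) and DYS-389(ii) for two people.

    DYS-389(i) is compared directly. For DYS-389(ii), the quantity compared
    is each person's (DYS-389(ii) minus DYS-389(i)).
    """
    diff_i = abs(a_i - b_i)
    diff_ii = abs((a_ii - a_i) - (b_ii - b_i))
    return diff_i + diff_ii


# Allan has 12 and 30; Zachary has 13 and 30.
print(dys389_tally(12, 30, 13, 30))
```

With Allan's and Zachary's values this gives 1 + 1 = 2, even though their reported DYS-389(ii) values are both 30.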

What do the numerical values for each DYS actually represent?

Each DYS is a DNA segment that consists of a sequence of nucleotides (there are 4 types of nucleotides: Adenine, Thymine, Cytosine, and Guanine, which are usually denoted by A, T, C, and G, respectively). When analysing each DNA segment, the goal is to look for (and count) short subsequences that are repeated (these are more technically known as short tandem repeats or STR's). It is this count that is reported as the DYS value. For instance, DYS-393's value reflects the number of times that the subsequence "AGAT" occurs in the tested DNA segment.
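To make the counting idea concrete, here is a small Python sketch that finds the longest run of back-to-back copies of a motif in a string of nucleotides. The sequence below is a toy one invented for this example (it is not the real DYS-393 segment), and the conventions that a laboratory actually uses to report a value can be more involved than this.

```python
import re


def longest_tandem_run(sequence, motif):
    """Return the largest number of consecutive copies of motif found in sequence."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Toy sequence: some flanking letters around 13 consecutive copies of AGAT.
toy_segment = "CCTG" + "AGAT" * 13 + "TTGA"

print(longest_tandem_run(toy_segment, "AGAT"))
```

For this toy segment the count is 13, which is the kind of number that would be reported as the marker's value.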

The Sorenson Molecular Genealogy Foundation has a webpage that gives several examples, including an example that shows how a value of 25 could occur for DYS-390. As another reference, Table 1 of this scientific publication lists the specific substrings (called motifs) that are counted for several DYS's.

Exactly what are mutations, and when do they occur?

A mutation is considered to have occurred when somebody has a DYS value that differs from that of his father.

Exactly how mutations occur is a matter of some debate, but the prevailing opinion seems to be that they result from occasional errors made while replicating DNA during the production of sperm cells. Hence mutations are something that people are born with.

Does each DYS mutate at the same rate?

No. Although a mutation can occur in any generation, a general estimate is that any particular DYS will mutate about once in 500 generations (that is, there's about a 0.2% probability that a given DYS will mutate in any given generation). However, some DYS's have been observed to mutate at much faster rates. These "fast-moving markers" are 385(a and b), 439, 458, 449, 464(a-d), 456, 576, 570, and CDY(a and b). For ease of reference, each of these fast-moving markers has its DYS name displayed in red in the header of each table of result values on the "Results" page.

With what frequency does a given value occur for a given DYS?

Click here to view a webpage that reports frequency rates within several individual haplogroups.

What is a "recLOH" mutation?

A recLOH mutation is a recombinant loss of heterozygosity. This type of mutation sometimes happens to multi-copy markers, that is, markers with (a), (b), etc. components (like DYS's 385, 459, 464, etc). When a recLOH mutation happens, one copy of the marker involved will take on the value of another copy. For example, a DYS-385 combination of 11-14 might suddenly mutate to become 11-11 or 14-14. As another example, a DYS-464 value of 13-15-15-17 could suddenly become 13-13-17-17 if the 13 and 17 each replace one of the two 15s.

This can distort normal genetic distance calculations, for which 11-11 and 11-14 would normally be treated as having a genetic distance of 3 between them (suggesting that it would take a sequence of 3 mutations to get from one to the other). Moreover, it is a challenge to determine the pre-mutation values of a marker that has undergone a recLOH mutation. For example, if we see a DYS-464 with a value of 13-13-17-17, it could be the result of a recLOH mutation on a DYS-464 that had the four values 13, 17, X, and Y, where X and Y could have been pretty much anything.

It has been suggested that when a recLOH mutation occurs with one multi-copy marker, then there is a strong probability that it will simultaneously occur with other multi-copy markers.

In our project, we have a few test results that look like possible candidates for having experienced a recLOH mutation. For example, Johnathan (kit 63479) has a DYS-385 value of 11-11. Also, Steven (kit 60440) not only has a DYS-385 value of 14-14, but he has a DYS-464 value of 15-15-17-17.

How can the origin of a mutation be identified?
How can DYS values for deceased ancestors be figured out?

These two questions are very closely related to each other, so let's address both of them at once. There are two guiding principles when it comes to mutations:

mutations are rare events
but they can occur with any birth

Let's consider a hypothetical example: Doug and Frank are cousins, whose most recent common ancestor is Adam. Suppose that Adam is Doug's 4xgreat grandfather and Frank's 6xgreat grandfather, so Doug and Frank are fifth cousins twice removed. Let's now count the number of times that the Y-chromosome has been passed on: from Adam to Doug it was transmitted 6 times (once for each birth from Adam's son onwards) and from Adam to Frank it was transmitted 8 times. So when Doug and Frank compare their test results, they are actually comparing results for which there were 14 opportunities for a change to occur between them.

Let's assume, for the sake of this example, that each marker mutates at a constant rate of 0.2%, so that each time a son is born the probability of a mutation on any given marker is 0.002. Then with 14 birth events separating them, the probability that Doug and Frank have the same value for the first of their markers is (1 - 0.002)¹⁴ = (0.998)¹⁴ = 0.972361..., or about 97%. So while it is highly likely that they share the same value for the first marker, it is not a certainty. Continuing with our 0.2% assumption, then each of the other markers would also have this same 97% probability of matching. If Doug and Frank both did the 37-marker test then the probability that all 37 markers are a perfect match would actually be ((0.998)¹⁴)³⁷ = 0.354503..., or about 35%, assuming also that each marker mutates independently of other markers. Had Doug and Frank only done 12-marker tests, then the probability of a perfect 12-marker match would be ((0.998)¹⁴)¹² = 0.71438..., or about 71%.
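For anybody who would like to try this arithmetic with other numbers, here is a short Python sketch of the same calculation. It relies on the same simplifying assumptions as the example (a constant 0.2% rate for every marker, and markers mutating independently of one another); the function names are made up for this illustration.

```python
MUTATION_RATE = 0.002   # assumed constant rate per marker per birth
TRANSMISSIONS = 14      # births separating Doug and Frank


def prob_marker_unchanged(rate, transmissions):
    """Probability that one marker passes through every birth without mutating."""
    return (1 - rate) ** transmissions


def prob_perfect_match(rate, transmissions, markers):
    """Probability that none of the markers mutates, assuming independence."""
    return prob_marker_unchanged(rate, transmissions) ** markers


for markers in (1, 12, 37):
    p = prob_perfect_match(MUTATION_RATE, TRANSMISSIONS, markers)
    print(f"{markers:>2} markers: {p:.6f}")
```

Changing TRANSMISSIONS shows how quickly the chance of a perfect match drops as the common ancestor gets more distant.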

The questions about pinpointing the source of a mutation, and figuring out an ancestor's DYS values have not yet been answered, but it was important to first introduce the above example. Let's now get back to looking at just one marker value, say DYS-393, which is the first of the markers in everybody's test results. When Doug and Frank compare their test results on this marker, there are only two possibilities: either the values match, or else their values differ.

If they differ, then we know with absolute certainty that a mutation has in fact occurred. However, knowing that there was a mutation does not tell us when it happened. It could have been with any of the 14 births that separate Doug and Frank in their family tree. There's even a small chance that DYS-393 might have mutated two or more times, since it had 14 opportunities to do so.

On the other hand, suppose that Doug and Frank have the same value for DYS-393. Technically, this does not guarantee that no mutations have occurred, because there is a remote possibility that one of the following scenarios might have taken place:

There was a mutation in Doug's line of descent from Adam, and the same mutation also occurred in Frank's line.
There was a mutation part-way along Doug's line, and another one farther along Doug's line, such that the second mutation was the opposite of the first one (for instance, DYS-393 might have mutated from a value of 12 to 13 with the first mutation, and 13 to 12 with the second mutation). This scenario also might occur within Frank's line.

To quantify just how remote these possibilities are, let's do some more math:

The probability of 0 mutations occurring we've already calculated to be (0.998)¹⁴ = 0.972361...
The probability of exactly 1 mutation occurring among the 14 opportunities is (0.998)¹³ x (0.002)¹ x 14 = 0.027280...
So the probability of 2 or more mutations occurring is 1 - (0.972361...) - (0.027280...) = 0.000358...

So the probability of this remote scenario, in which we need to have at least 2 mutations, ends up being at most about 0.04% (and in fact less, since the mutations would also have to cancel each other out). While this doesn't rule it out, it does show that it is very unlikely. Given that it is so unlikely, it is reasonable to conclude that if Doug and Frank have the same marker value, then that marker mutated zero times. This in turn allows us to conclude that Adam's DYS-393 value was the same as Doug and Frank's (as is the DYS-393 value for each of the other 12 people in between Doug and Frank). Additional matches with other relatives' test results would serve to strengthen this conclusion even more.
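The three probabilities listed above come from the binomial distribution, and they can be reproduced with a few lines of Python. This is only a sketch under the same 0.2% assumption; the function name prob_exactly is made up for this illustration.

```python
from math import comb


def prob_exactly(k, n, rate):
    """Binomial probability of exactly k mutations among n opportunities."""
    return comb(n, k) * rate**k * (1 - rate) ** (n - k)


n, rate = 14, 0.002
p0 = prob_exactly(0, n, rate)   # no mutations at all
p1 = prob_exactly(1, n, rate)   # exactly one mutation
p2_plus = 1 - p0 - p1           # two or more mutations

print(f"0 mutations:  {p0:.6f}")
print(f"1 mutation:   {p1:.6f}")
print(f"2 or more:    {p2_plus:.6f}")
```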

That deals with the scenario in which Doug and Frank have a matching value, so let's now return to the scenario in which their DYS-393 values differ. As already noted, this means that at least one mutation must have occurred. Any of the 14 births in the family tree between Doug and Frank are now possible suspects. The only way to isolate which births introduced mutations is to get more DNA samples from other relatives. Each time that a pair of test results is the same, we can use the reasoning presented above to infer the marker value for that pair's most recent common ancestor (and everybody in between). For example, if Doug's first cousin gets tested and his DYS-393 value matches Doug's, then we can rule out Doug's birth as well as Doug's father's birth as possible sources of the mutation (but Doug's grandfather's birth is not ruled out). On the other hand, if Doug's first cousin matches with Frank, then we can rule out all of the births except for Doug's and Doug's father's.
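The process of elimination described above can be sketched in Python by recording who each person's father is and then striking off the births that a matching pair of results clears. Everything here is hypothetical: the placeholder names (D3, F7, Uncle and so on) are invented for the unnamed people in the example, and the sketch assumes, as argued earlier, that a match between two men means no mutation occurred anywhere on the path between them.

```python
# Each line of descent is listed from ancestor to descendant.
doug_line = ["Adam", "D5", "D4", "D3", "DougGrandfather", "DougFather", "Doug"]
frank_line = ["Adam", "F7", "F6", "F5", "F4", "F3", "F2", "F1", "Frank"]
cousin_line = ["DougGrandfather", "Uncle", "Cousin"]

father = {}
for line in (doug_line, frank_line, cousin_line):
    for parent, child in zip(line, line[1:]):
        father[child] = parent


def ancestors(person):
    """Return the person followed by his father, grandfather, and so on."""
    chain = [person]
    while chain[-1] in father:
        chain.append(father[chain[-1]])
    return chain


def births_between(a, b):
    """Everyone whose birth lies on the path between a and b.

    The most recent common ancestor himself is excluded, because his birth
    is not one of the transmissions separating a and b.
    """
    chain_a, chain_b = ancestors(a), ancestors(b)
    common = next(p for p in chain_a if p in chain_b)
    return set(chain_a[:chain_a.index(common)]) | set(chain_b[:chain_b.index(common)])


# Doug and Frank differ, so all births between them are suspects.
suspects = births_between("Doug", "Frank")

# Case 1: the cousin matches Doug, clearing the births between those two.
if_cousin_matches_doug = suspects - births_between("Doug", "Cousin")

# Case 2: the cousin matches Frank, clearing the births between those two.
if_cousin_matches_frank = suspects - births_between("Cousin", "Frank")

print(len(suspects))
print(sorted(if_cousin_matches_doug))
print(sorted(if_cousin_matches_frank))
```

In the first case Doug and his father drop off the list while his grandfather stays on it; in the second case only Doug and his father remain.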

In some cases we may be able to narrow down where a mutation occurred, but we might never be able to completely isolate it. Supposing that we have narrowed a mutation down to either a particular person's birth (maybe Doug's) or his father's, then the only way to isolate which birth saw the introduction of the mutation is to figure out the father's marker value. If the father's DNA cannot be directly tested (which would be the case if he is deceased), then we would have to rely on the person in question (Doug) having a brother whose marker value we could determine (this might involve getting two descendants of two different sons of a brother to get their DNA tested so that we can try to interpolate the brother's value). If the person's father has passed away, and either the person in question has no brothers or marker values cannot be determined for any of them, then we may never know which of the two births produced the mutation.

What are haplotypes and haplogroups?

The set of marker values for a particular individual is what constitutes that person's haplotype.

As noted in a question above, the markers being studied for the Pike DNA project are based on mutations in STR's (Short Tandem Repeats). However, there is another, different, type of mutation that can occur, known as a Single Nucleotide Polymorphism (or SNP). SNP's are very specific mutations that are so rare that they generally occur once in history and are then passed on forever. Each haplogroup is defined by a particular collection of SNP mutations.

At this point it might be helpful to take a look at the collection of Y-haplogroup classifications, displayed in the form of a tree, on this webpage. The major ancestral groups (named A, B, C, etc.) are pretty clear in this diagram. Note also that each group has many subgroups. Each subgroup (represented by a branch in the tree) is defined by its own SNP mutation. To help understand this, note that the R1b group branches off from R1, which branches off from R, which branches off from P, and so forth going backwards in time. So anybody in the R1b group has inherited the R1b mutation, as well as the older mutations that define each of R1, R, P, and the ancestral groups of P. One feature of SNP mutations is that the defining mutation for each group or subgroup appears in its own unique location on the Y-chromosome. This means that mutations will accumulate over an anthropological timescale. It is by studying the cumulative mutations in individual people's DNA that this tree has been worked out. As more DNA samples are studied, the tree will be updated to reflect newly discovered groups.

If that's a bit too technical, then just understand that haplogroups are based on SNP's, which are a different kind of genetic marker than the STR's that we are testing for. However, it is being discovered that certain collections of STR's (i.e. certain haplotypes) occur in very close correspondence to haplogroups. So, based on each person's haplotype, FamilyTreeDNA is usually able to estimate the person's haplogroup without the need for a SNP test.

Anybody who is interested in confirming their haplogroup prediction with a SNP test can order one, either via FamilyTreeDNA or another company. FamilyTreeDNA offers SNP tests for most haplogroups (priced at $65) as well as for subgroups of some of the major haplogroups (priced at $79); information about these subgroup tests can be found here. Another company, DNA Heritage, offers a comprehensive SNP test for $99 US, in which they start at the top of the haplogroup tree and work their way down to your particular haplogroup, and often also into a subgroup. For more details, follow this link. To order a SNP test with DNA Heritage, go to their SNP order page.

As for the usefulness of haplogroups, they apply more to anthropology than genealogy. However, they can help to indicate the geographic and ethnic origins of people's very distant paternal ancestry. The haplogroups identified so far for members of the Pike project, and their estimated very-far-in-the-past origins (tens of thousands of years ago), are listed below:

E1b1a - Africa
G - Near East
I - northwestern Europe (such as Scandinavia)
J1 - Middle East
J2 - fertile crescent area of southwest Asia
R1a - Eurasian Steppes, north of the Black and Caspian Seas
R1b - Europe

More information about various haplogroups is available from:

Also of interest is this haplogroup map (by J.D. McDonald) that shows the geographical prevalence of various haplogroups, which in turn gives a general indication of the ethnic and geographic origins of various haplogroups. An interactive version of a similar map is available here, courtesy of DNA Heritage.

FamilyTreeDNA has prepared a pie chart showing the relative frequency of each haplogroup among the samples analysed by FamilyTreeDNA. The top three are R1b at 18.93%, I at 9.59%, and R1a at 8.63%.

What is WAMH?

WAMH is the abbreviation that is used for the Western/Atlantic Modal Haplotype. In lay terms, the WAMH is the most common 12-marker genetic signature that has been found in the Western (i.e. Atlantic) region of Europe. The 12 markers that form this haplotype are shown in the following table:

DYS Markers and Their Values

Name   Broad Haplogroup   393   390   394*   391   385(a)   385(b)   426   388   439   389(i)   392   389(ii)
WAMH   R1b                13    24    14     11    11       14       12    12    12    13       13    29

* DYS-394 is sometimes also called DYS-19.

FamilyTreeDNA has prepared a webpage with some information for those who have this haplotype. Of particular importance is that the WAMH is so common that an estimated 2.5% of Western European males have this haplotype. Indeed, FamilyTreeDNA has well over a thousand people in its database with this exact 12-marker signature, including a few who are in our project.

Within the Pike project, we generally recommend that people have at least 25 markers tested, so that genealogically irrelevant/misleading 12-marker matches can be identified and disregarded. This is especially important for people who have a common 12-marker haplotype such as the WAMH. We have seen some cases in which people with a large number of 12-marker matches still had several matches at 25 markers that were of doubtful genealogical relevance, and in these cases it has taken the additional precision of a 37-marker test to finally narrow down a genetic signature that represents just the Pike family line that is of interest.

In summary, if you closely match the WAMH (or have some other particularly common 12-marker haplotype), then you may want to consider having at least 37 markers tested in order to be able to tell which genetic matches are truly meaningful.

Last Modified: Tuesday, 09 April 2019, 19:08:04 NDT