Analysis output will appear in this frame after clicking the "Process" button in the top frame.
This particular utility is computationally intensive, and
may take two or three minutes to complete its processing.
Some notes about this utility:
- This utility had a bug that was fixed on 07 April 2012. Results obtained before that date ought to be discarded and the analysis re-run.
- This utility takes as input the raw data files for several siblings and both of their parents.
- If data are unavailable for one or both parents, you can create a dummy file with nothing but No-Calls and specify this file for
the parent(s) without actual data.
- As output, this utility attempts to phase the SNPs of the children.
It also attempts to phase both of the parents.
- The output consists of:
  - SNP data (the chromosome number, the SNP's name, and the SNP's position)
  - the raw unphased data for the mother
  - the raw unphased data for the father
  - for each child:
    - the child's raw data (either phased or unphased)
    - the allele inherited from the mother
    - the allele inherited from the father
    - the allele from the mother that was not inherited by the child
    - the allele from the father that was not inherited by the child
  - two pattern sequences, the first being the overall pattern of inherited alleles from the mother,
    and the second being the overall pattern of inheritance of alleles from the father.
  - two alleles from the mother, phased so that each column represents one of the mother's two chromosomes
  - two alleles from the father, phased so that each column represents one of the father's two chromosomes
- Regarding the two sequences of inheritance patterns that are shown,
two symbols are used (+ and O). As an example, a pattern of ++O+O would indicate that children
#1, #2 and #4 inherited the same allele from the parent in question, whereas children #3 and #5 inherited
the complementary allele from the parent.
Note that the pattern OO+O+ also indicates that children
#1, #2 and #4 inherited the same allele from the parent in question, whereas children #3 and #5 inherited
the complementary allele from the parent.
If present, a question mark within a pattern indicates that it could not be determined which of the two alleles a child inherited from the parent.
- The presence of an asterisk to the right of an inheritance pattern denotes the detection of a change in inheritance pattern.
In determining which of the two equivalent patterns to display for the new pattern (for instance, ++O+O and OO+O+ have the same meaning),
whichever of the two most closely resembles the previous pattern is chosen for display.
So, for example, if the previous pattern had been ++OOO and the new pattern is to be one of ++O+O or OO+O+,
then ++O+O is chosen since it differs from the previous pattern in only one position.
The intent is that the same symbol (+ or O) should consistently refer to the same phased parental chromosome.
- Note that there are cases in which there is a tie regarding which new pattern to choose, as would
be the case if the previous pattern had been OOOOOO but the new pattern should be either OOO+++ or +++OOO.
In such cases, one of the two choices is selected at random (unless you specify the outcomes for coin tosses).
There is a 50% chance of having chosen the wrong pattern, so a double asterisk is displayed to highlight that this scenario has been encountered.
If you find that the wrong choice was made, then some things that you can try doing are:
- Run the analysis again. Since the tie-breaking process is randomised, the other choice ought to be made within a few tries.
- Add another file to the analysis. For example, if you've used raw data for children Alice, Bob, Charlie and Doris,
then try doubling up on one of them. You can use the exact same file more than once, so you could
process Alice as Child #1, Alice also as Child #2, and then Bob, Charlie and Doris as Children #3, #4 and #5.
- Instead of relying on random choices, you can force coin toss outcomes by entering a string
of zeros and ones. The first entry in your string will be used for the first coin toss, the second entry for the second decision, etc.
Note that a coin toss can alter the meaning of subsequent coin tosses (so it is best to force initial tosses first, before
determining which subsequent tosses should be forced).
Random choices will be made for any decisions beyond those for which you have entered values.
The set of all decisions will be printed as a binary string just prior to when overall processing is completed.
- To the right of the pair of inheritance patterns are four columns of alleles that represent the parents' phased data.
The first two represent the alleles of the mother that respectively correspond to the O and + in the inheritance pattern.
The last two represent the alleles of the father that respectively correspond to the O and + in the inheritance pattern.
Barring any errors in bridging between different inheritance patterns (such errors are possible, especially in the event of
ties denoted by double asterisks), these four columns represent the phased data for the two parents.
- The notation "In" to the right of an inheritance pattern indicates that it is "informative" in the sense that
there is an obvious distinction between the alleles passed on to different children
(which is to say that the parent in question is heterozygous).
This notation is omitted for patterns that have asterisks shown.
- The tally of phased and unphased alleles at the end of the output indicates how many of the children's alleles
could and could not be phased.
- In the event that you have already phased the data for one or more siblings, then this can be taken into consideration.
The data file for a phased individual should be prepared to have the same format as for an unphased individual, with
the exception that each pair of alleles needs to be presented in order, with the first allele being inherited from the
parent designated as the mother and the second allele from the parent designated as the father. If multiple siblings have
already been phased, it is imperative that their alleles are consistently ordered (i.e., so that the first allele is always from
the parent designated as the mother, and that the second allele is always from the parent designated as the father).
- To phase X chromosome data, it can be useful to edit the input files for all males and manually phase their data
in advance. For instance, SNPs such as AA, CC, GG, and TT can be revised to be AD, CD, GD and TD to reflect the macro-deletion
that they would have inherited from their father. No-Calls in the males can also be revised, from ?? to ?D.
Heterozygous SNPs in males may be erroneous and can also be revised to be No-Calls of the form ?D.
- The protocol used by this utility is similar to that described in the paper
"Phasing the Chromosomes of a Family Group When One Parent is Missing" by T. Whit Athey
in the Fall 2010 issue of the
Journal of Genetic Genealogy.
- Only SNPs that occur in all files will be considered. All other SNPs will be ignored.
- New as of 18 June 2017:
You can enter the positions of SNPs for which you wish to have all alleles converted to No-Calls,
forcibly over-riding the genotype information in the raw data files for these positions.
The positions can be entered with commas or spaces as separators, either on one line or on multiple lines.
If multiple SNP positions are entered, they do not need to be sorted.
- No attempt whatsoever is made to phase discordant SNPs. These SNPs will either be converted to No-Calls
or you can opt to omit them (in either case, all discordant SNPs will be reported though).
- When possible, values for No-Called alleles are inferred based upon the data available from the other people.
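The rule described above for choosing between a new inheritance pattern and its complement can be sketched roughly as follows. This is only an illustration in Python, not the utility's own JavaScript; the function names, the way question marks are skipped when counting differences, and the convention that a forced toss of 0 keeps the pattern as given while 1 selects its complement are all assumptions made for the example.

```python
import random

def complement(pattern):
    """Swap + and O; a ? (undetermined child) is left unchanged."""
    swap = {"+": "O", "O": "+"}
    return "".join(swap.get(symbol, symbol) for symbol in pattern)

def distance(a, b):
    """Count positions at which two patterns differ.

    Assumption: positions holding a ? in either pattern are skipped.
    """
    return sum(1 for x, y in zip(a, b) if "?" not in (x, y) and x != y)

def choose_pattern(previous, new, forced_tosses=None, rng=random):
    """Pick whichever of `new` or its complement is closer to `previous`.

    Returns (pattern, marker): marker is "*" for an ordinary change of
    pattern and "**" when a tie had to be broken by a coin toss.
    `forced_tosses` is an optional iterator over "0"/"1" characters;
    once it is exhausted, tosses are random (hypothetical convention:
    "0" keeps `new`, "1" selects its complement).
    """
    alternative = complement(new)
    d_new = distance(previous, new)
    d_alt = distance(previous, alternative)
    if d_new < d_alt:
        return new, "*"
    if d_alt < d_new:
        return alternative, "*"
    # Tie: use the next forced outcome if one is available, else toss a coin.
    toss = next(forced_tosses, None) if forced_tosses is not None else None
    if toss is None:
        toss = rng.choice("01")
    return (new if toss == "0" else alternative), "**"
```

With the example from the notes, `choose_pattern("++OOO", "OO+O+")` selects `++O+O`, since it differs from the previous pattern in only one position. For the tie case, `choose_pattern("OOOOOO", "OOO+++", iter("1"))` is forced to the complement `+++OOO` and is flagged with a double asterisk.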
Available utilities for processing unzipped autosomal files from Family Tree DNA and/or 23andMe:
These utilities were developed by David Pike.
My original motivation for developing these utilities was so that I could privately perform some advanced analysis of autosomal DNA results, with my
objective being to better pursue genealogical research within my own family. Instead of limiting these utilities to my own personal use, I have made
them available in the hope that they might assist other members of the genetic genealogy community with their own individual research goals.
Given the do-it-yourself nature of the intended user, the utilities are presented with minimal documentation and without verbose explanation of analysis
results. Also note that the analysis performed by these utilities is based on my own methodology and implementation,
whereas Family Tree DNA, 23andMe and AncestryDNA
have their own proprietary analysis methodologies. Differences in methodology may give rise to minor interpretive differences between their analysis and
that performed by my utilities.
Some other notes about these utilities:
- They only accept *UNZIPPED* raw data files from Family Tree DNA's Illumina-based FamilyFinder (.csv files), 23andMe (.txt files)
or AncestryDNA (.txt files).
- They should work with Chrome, Safari, and some other browsers.
But they probably will not work with Internet Explorer.
They should work with Firefox 3.6 Beta or later versions of Firefox.
Earlier versions of Firefox will not work with these utilities.
Note, however, that some of the utilities are computationally intensive and might run much slower in Firefox
than in a different browser such as Chrome.
- They do their processing on your computer and not mine.
My real point here is that your raw data files are not sent over the internet by these utilities.
I should also mention that the output from the utilities is also not sent over the internet,
and in particular, no copy of it gets saved anywhere (it gets shown on your screen, but that's all).
- No-Calls are reported differently by FTDNA versus 23andMe versus AncestryDNA.
These utilities convert all No-Called SNPs to question marks.
- It is assumed that the files have been pre-sorted to have the SNPs on each chromosome listed in ascending position value.
The ROH utility performs a validation check to ensure that the input file is properly sorted.
- Utilities that involve the comparison of two or more raw data files do not need the files to all be from one company or to have the same file format.
Nor do the files need to contain the same set of SNPs.
Only those SNPs that occur in all files being processed will be compared. All other SNPs will be ignored.
Synchronisation of SNPs between files is performed based on the position numbers of the SNPs.
It is assumed that the files have been pre-sorted to have the SNPs on each chromosome listed in ascending position value.
- Note that Family Tree DNA's X Chromosome files tend not to be pre-sorted in this manner,
so it is necessary that they be edited accordingly (in which case it is imperative to save the resulting file
in one of the data formats outlined below).
- Note also that data files based on different reference standards (such as Build 36 versus Build 37) should not be directly compared
without first converting to a common standard.
- If you want to trick these utilities into using other text files, then you'll need to conform to one of the following file formats:
- A tab-separated file with an initial line beginning with # and containing the string "23andMe", and no quotes on any data.
Data items should be in order: RSID, Chromosome Number, SNP Position Number, a single string with both alleles
- A tab-separated file with an initial line beginning with # and containing the string "AncestryDNA", and no quotes on any data.
Data items should be in order: RSID, Chromosome Number, SNP Position Number, first allele, second allele
- A comma-separated file with each data item enclosed in quotes.
Data items should be in order: RSID, Chromosome Number, SNP Position Number, a single string with both alleles
- A space-separated file with no quotes on any data.
Data items should be in order: RSID, Chromosome Number, SNP Position Number, a single string with both alleles
- A comma-separated file with no quotes on any data.
Data items should be in order: RSID, Chromosome Number, SNP Position Number, a single string with both alleles
Any line beginning with the # symbol or containing the string "RSID" will otherwise be ignored.
Note also that if you wish to process files from FTDNA's former Affymetrix-based FamilyFinder then
you should translate all instances of "--" to "DD"
before subjecting your data to analysis;
otherwise the utilities will erroneously treat these deletions as No-Calls.
Instances of "---" in Affymetrix-based data will still be correctly treated as No-Calls.
- If Firefox complains about "A script on this page is causing Mozilla to run slowly" then you might want
to configure Firefox to not complain as quickly when a script is running.
- Javascript source code is contained within the top frame of each individual utility.
If you want to view how I've implemented things, that's where you'll find the code that I've written.
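As a rough illustration of the file formats and SNP synchronisation described in the notes above, the following Python sketch reads the accepted layouts and keeps only the SNPs common to every file. It is not the utilities' own JavaScript. The function names are hypothetical, and two points are assumptions: that No-Calls appear as "-" or "0" characters in the raw data, and that chromosome labels are written the same way in every file being compared.

```python
def parse_raw_data(text):
    """Return {(chromosome, position): genotype} for one raw data file."""
    lines = text.splitlines()
    first = lines[0] if lines else ""
    # An initial # line naming AncestryDNA means two separate allele columns.
    ancestry = first.startswith("#") and "AncestryDNA" in first
    snps = {}
    for line in lines:
        # Comment lines and header lines containing "RSID" are ignored.
        if not line.strip() or line.startswith("#") or "RSID" in line:
            continue
        if "\t" in line:
            fields = line.split("\t")
        elif "," in line:
            # Comma-separated, with or without quotes around each item.
            fields = [item.strip('"') for item in line.split(",")]
        else:
            fields = line.split()
        chromosome, position = fields[1], int(fields[2])
        genotype = "".join(fields[3:5]) if ancestry else fields[3]
        # Assumption: "-" and "0" mark No-Calls; show them as question marks.
        genotype = "".join("?" if c in "-0" else c for c in genotype)
        snps[(chromosome, position)] = genotype
    return snps

def common_snps(*files):
    """Return the (chromosome, position) keys present in every parsed file."""
    return set(files[0]).intersection(*files[1:])
```

Keying on chromosome and position rather than on RSID mirrors the note that synchronisation is based on position numbers. Note that this sketch would also turn the "--" deletions of Affymetrix-based files into No-Calls, which is why the translation to "DD" mentioned above would need to happen first.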
If you want to express thanks for these utilities, then here are some helpful things you can do:
If you happen to know any males with the surname PIKE or PYKE then encourage them to join the
Pike DNA Project.
The Family History Society of Newfoundland and Labrador is a registered charity that
could use some assistance. Financial donations are tax deductible (at least in Canada).
And a special note to people in academia or industry who make substantive use of these utilities:
Can you let me know how you use them? Thanks.