CAPITAL EXPENDITURES

a learning investment in Data Science, Entrepreneurship, and Biotech

by
Vanessa Mahoney

Rudolph the PureBred Pitbull

12/28/2017

This Christmas, a popular gift in our family was AncestryDNA, a pretty cool service that estimates your genetic origins. My father-in-law had recently discovered he was 9% Italian, which prompted a slew of hilarious, if politically incorrect, imitations of Italian accents. Using a saliva sample, AncestryDNA tests your DNA for single nucleotide polymorphisms, or SNPs. As the name implies, an SNP is a single-nucleotide replacement, such as a C (cytosine) in place of a T (thymine). SNPs usually occur between genes (away from exons or control regions), so most of the time they have no effect on health or development. These little replacements are relatively common in the human genome, occurring roughly once in every 300 nucleotides, which works out to about 10 million potential SNP sites in the genome.

What makes SNPs so great for genetic analysis is that their inheritance is relatively stable: if your mom has an SNP, there's a good chance you inherited it too. However, SNP frequencies vary between populations, since over the centuries ethnic cohorts have developed their own signature blend of SNPs. While there are ~10 million SNPs, it's been shown that analyzing just 19 of them can identify an ethnic group with high probability. It looks like AncestryDNA uses a chip that can analyze up to 700K SNPs, so your results are probably pretty damn accurate.
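If you want to check the back-of-the-envelope math, here's a quick sketch. The genome length of roughly 3 billion nucleotides is my own assumption (it isn't stated above); the other two numbers come straight from the paragraph.

```python
# Assumption (not from the post): the human genome is ~3 billion nucleotides long.
genome_length = 3_000_000_000
snp_spacing = 300        # from the text: ~1 SNP per 300 nucleotides
chip_snps = 700_000      # from the text: SNPs the chip can analyze

estimated_snps = genome_length // snp_spacing
chip_fraction = chip_snps / estimated_snps

print(estimated_snps)            # 10000000
print(f"{chip_fraction:.0%}")    # 7%
```

So under that assumption, a 700K chip reads on the order of 7% of the potential SNP sites, which is still vastly more than the 19 mentioned above.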

I'm mildly interested in my own genetic heritage, but something that really DID interest me was the genetic history of ... my dog. Before you roll your eyes and stop reading, I was mostly curious, but I was also interested because my little man is getting older and I wanted to watch out for the beginnings of breed-specific health problems. I rescued my pitbull Rudy from a shelter in Brooklyn, and we had always speculated whether my little brown guy was part chocolate lab, boxer, Rhodesian Ridgeback, etc. So, I shoved a swab in his mouth, sent the sample off to Wisdom Panel, and anxiously awaited the results. Well, the results came, and to my shock, Rudy is a purebred American Staffordshire! They threw in a cute little genetic fingerprint (probably full of a few of the SNPs that they mapped), a phylogenetic tree, a certificate that he's purebred, and lots of plots that show Rudy is squarely a Staffie. (Most people, including myself, call Staffies pitbulls. However, they're a little different, and I think we can actually see that in the following chart.)

[Picture: PCA chart of Rudy, the American Staffordshire sub-breeds, and the all-breeds outgroup]
In this chart, principal component analysis (PCA) has been run on Rudy, the sub-breeds of Staffies, and a cohort of all other breeds. The graph shows PC2 (principal component 2) on the y axis and PC1 (principal component 1) on the x axis. Each point is a dog, so each dog has a PC1 and a PC2 score. Dogs have hundreds of characteristics, and in this study hundreds of SNPs were analyzed, so why are we graphing just 2 properties of each dog? The answer is, we're not! Behind each of these components, we're getting inputs from a whole bunch of characteristics. In other words, PC1 and PC2 are new features created as linear combinations of the actual variables.
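To make "linear combinations" a bit more concrete, here's a minimal sketch in Python with NumPy. Everything in it is made up for illustration: the two groups of dogs, the number of SNPs, and the allele frequencies are my assumptions, and this is certainly not Wisdom Panel's actual pipeline. It just shows how a table of SNP calls gets boiled down to one (PC1, PC2) pair per dog.

```python
import numpy as np

rng = np.random.default_rng(0)

n_snps = 200         # hypothetical number of SNPs per dog
n_per_group = 30     # hypothetical number of dogs per group

# Assumed allele frequencies for two made-up groups of dogs.
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = rng.uniform(0.1, 0.9, n_snps)

# Genotypes coded as 0, 1, or 2 copies of the variant allele.
group_a = rng.binomial(2, freq_a, size=(n_per_group, n_snps))
group_b = rng.binomial(2, freq_b, size=(n_per_group, n_snps))
genotypes = np.vstack([group_a, group_b]).astype(float)

# Center each SNP column, then take the principal axes from the SVD.
centered = genotypes - genotypes.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

loadings = vt[:2]                 # one weight per SNP, for PC1 and PC2
scores = centered @ loadings.T    # one (PC1, PC2) pair per dog

# PC1 for the first dog is a weighted sum over all of its centered SNP values.
pc1_first_dog = float(np.sum(centered[0] * loadings[0]))

print(scores.shape)               # (60, 2)
```

Each row of `scores` is one point on a chart like Rudy's, and `loadings` holds the weight every single SNP contributes to that point.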

How do we decide how much each variable contributes to PC1 and PC2? If you think about it, we want to lean on the variables that vary the most between breeds - we want variables that tell us something. We wouldn't want to use something like "has 4 legs" to distinguish between dog breeds, because that tells us nothing about a sample. However, a trait like "snub nosed" will be very useful in helping us characterize our dog. A good analogy is the game 20 questions. You wouldn't waste your time asking dumb questions; you want to ask the questions that eliminate choices and help you home in on the target. That's the same goal as PCA. Mathematically, PC1 is the linear combination of the variables that has the largest possible variance across the samples, and PC2 is the next most variable combination that is uncorrelated with PC1.

Another advantage of PCA is that it takes variables that may be correlated and transforms them into variables that are uncorrelated. For example, if height and weight are highly correlated - i.e., they increase and decrease together very closely - PCA could be used to make one variable that is a combination of height and weight. We are reducing the dimensionality by combining these 2 variables into one new feature, with little loss of information. In 20 questions, if you had already learned that something is round, you wouldn't want to ask next whether it rolls. It probably does, but that information doesn't tell you much more than you already knew. Instead, you'd want your next question to give you the best incremental clarity on what the object is - again, what PCA is mathematically trying to accomplish.
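Here's a small sketch of the height and weight example, using NumPy. The numbers are invented (I'm assuming weight is roughly half of height plus a little noise), so treat it as an illustration of the idea rather than real dog data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up heights (cm) and weights (kg) that rise and fall together.
height = rng.normal(60.0, 8.0, size=500)
weight = 0.5 * height + rng.normal(0.0, 1.0, size=500)

data = np.column_stack([height, weight])
# Standardize so that neither unit dominates the analysis.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# Principal components are the eigenvectors of the covariance matrix.
cov = np.cov(standardized, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()    # share of variance per component
scores = standardized @ eigenvectors           # PC1 and PC2 for every sample

input_corr = np.corrcoef(height, weight)[0, 1]
pc_corr = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]

print(f"height/weight correlation: {input_corr:.2f}")
print(f"PC1/PC2 correlation:       {pc_corr:.2f}")
print(f"variance captured by PC1:  {explained[0]:.1%}")
```

With inputs this tightly correlated, PC1 alone carries nearly all of the variance, which is why keeping just that one new feature loses so little information, and the two components come out uncorrelated with each other.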

Let's look back at Rudy's PCA graph. Basically, we have no idea what sort of characteristics we are looking at for each of these dogs - that's not the point here. What PCA does allow us to see is the different signatures of dog breeds. By design, PC1 and PC2 are built from the characteristics that vary the most across these dogs, so they tell us the most about the breeds. As you can see in the graph, these signatures are distinct. Rudy is clearly not in the "All Breeds Outgroup" cluster. However, he is in a few clusters, because the signatures of the 3 American Staffordshire terrier sub-breeds overlap. What this means is that, according to Wisdom Panel's analysis, there are at least 3 genetically distinguishable types of American Staffordshires. My guess is that one of these American Staffordshire groups is actually representative of pitbulls.
[Picture: PCA chart comparing Rudy to the next closest breeds]
I think that because in the next chart, they also compared Rudy to the "next closest breeds". As you can see, the next 2 closest breeds are American Bulldogs and Bulldogs, rather than Pitbulls, so I would guess they're calling Pitbulls an American Staffie subgroup. (I sure hope that's not reluctance to tell someone that their dog is a pitbull, because it's a great thing, and not using the word will only feed the stereotype.) One more thing to notice about this graph: we didn't just see these 2 new breeds as new clusters on our previous graph, because a new PCA has been run. Traits that previously separated Rudy from other breeds might be a lot less informative now that we're trying to compare him to bulldogs. Back to our 20 questions analogy - you gotta ask different questions as the possible targets start getting more similar.
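Here's a toy sketch of that "different questions" effect, again in NumPy. The three groups, the two blocks of SNPs, and all of the allele frequencies are hypothetical: I'm assuming block 1 is a set of SNPs that separates an outgroup from the two close breeds, and block 2 is a set that separates the two close breeds from each other. The group names are just labels for the simulation, not real breed data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_dogs = 40                  # hypothetical dogs per group
block1, block2 = 40, 20      # SNPs in each made-up block

def simulate(freq_block1, freq_block2):
    """Draw 0/1/2 genotypes for one group from assumed allele frequencies."""
    freqs = np.r_[np.full(block1, freq_block1), np.full(block2, freq_block2)]
    return rng.binomial(2, freqs, size=(n_dogs, freqs.size)).astype(float)

outgroup = simulate(0.1, 0.5)
staffie = simulate(0.9, 0.8)
bulldog = simulate(0.9, 0.2)

def first_component(samples):
    """Return the PC1 loadings (one weight per SNP) for the given dogs."""
    centered = samples - samples.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def block2_share(loadings):
    """Fraction of PC1's squared weight that sits on the block-2 SNPs."""
    return float(np.sum(loadings[block1:] ** 2))

pc1_all = first_component(np.vstack([outgroup, staffie, bulldog]))
pc1_close = first_component(np.vstack([staffie, bulldog]))

share_all = block2_share(pc1_all)
share_close = block2_share(pc1_close)

print(f"block-2 share of PC1, all dogs:          {share_all:.2f}")
print(f"block-2 share of PC1, close breeds only: {share_close:.2f}")
```

When the outgroup is included, PC1 leans almost entirely on the block-1 SNPs. Drop the outgroup and rerun PCA, and PC1 shifts its weight to the block-2 SNPs, because those are now the ones that vary the most among the remaining dogs.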

That's just a little look at the science beneath these DNA services! Catch ya later, my little purebred is begging me to go outside! :)

    Vanessa Mahoney, PhD

    Biomedical scientist & data analyst who loves learning how things work - from mortgage-backed securities to cardiac electrophysiology to Donald Trump's comb over

     
    The postings on this site are my own and don't necessarily represent IBM's positions, strategies, or opinions. 
