Updated 8/16/2018

How are clade geographic origins computed?

  1. Filtering of samples
    • Discard samples outside a latitude-longitude bounding box
      • bounds:{'minLat': 0, 'maxLat': 75.0, 'minLon': -25.0, 'maxLon': 97.0}
      • The next version will include the rest of Asia and Oceania west of the International Date Line, and will introduce Y-haplogroups O and C. All I need to do this is country geometry files; contact me if you want to help.
      • A subsequent version will incorporate the Americas and haplogroup Q. It will require code changes to handle the complications the International Date Line introduces into the pathing, closest-point and averaging computations.
    • Discard samples that have not been tested down to a terminal subclade or confirmed as basal
      • This avoids obscuring the migration path from a parent to a child clade when samples used to compute the parent clade's position later turn out to be positive for a child SNP
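A minimal sketch of the step 1 filter. The bounds are the ones listed above; the sample layout (a dict with 'lat', 'lon' and a 'terminal_or_basal' flag) and the inclusive treatment of the box edges are assumptions made for illustration, not the actual data model.

```python
# Bounds copied from the text above.
BOUNDS = {'minLat': 0, 'maxLat': 75.0, 'minLon': -25.0, 'maxLon': 97.0}

def in_bounds(lat, lon, bounds=BOUNDS):
    """Return True if (lat, lon) lies inside the box (edges included, by assumption)."""
    return (bounds['minLat'] <= lat <= bounds['maxLat']
            and bounds['minLon'] <= lon <= bounds['maxLon'])

def filter_samples(samples, bounds=BOUNDS):
    """Keep samples that are inside the box and are terminal or basal.

    Each sample is a hypothetical dict with keys 'lat', 'lon' and
    'terminal_or_basal' (bool).
    """
    return [s for s in samples
            if s['terminal_or_basal'] and in_bounds(s['lat'], s['lon'], bounds)]
```

For example, a sample at latitude 35.7, longitude 139.7 is discarded by this filter because its longitude exceeds maxLon.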
  2. Compute initial positions for terminal subclades and basal clades
    • Weighted average of the latitude and longitude of the samples, where the weight of each sample is the product of
      • A regional sampling factor (the inverse of the number of samples of any haplogroup from the surrounding area, counted using a normal distribution with a standard deviation of 100 km)
        • Absolute world sampling map generated from this method.
          • dark grey: < 3.125
          • white: 3.125-12.5
          • peach: 12.5-50
          • orange: 50-200
          • red: 200+
      • An age factor (linear function of sample age)
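The weighting in step 2 can be sketched as follows. The 100 km standard deviation comes from the text; everything else is an assumption: the Gaussian-weighted count is computed with haversine distances, the linear age factor uses placeholder slope and intercept values (the real coefficients are not given here), and samples are hypothetical dicts with 'lat', 'lon' and 'age'.

```python
import math

EARTH_RADIUS_KM = 6371.0
SIGMA_KM = 100.0  # standard deviation stated in the text

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def sampling_factor(lat, lon, all_samples, sigma_km=SIGMA_KM):
    """Inverse of the Gaussian-weighted count of samples near (lat, lon).

    Assumes the point itself is one of all_samples, so the count is >= 1.
    """
    density = sum(
        math.exp(-haversine_km(lat, lon, s['lat'], s['lon']) ** 2 / (2 * sigma_km ** 2))
        for s in all_samples
    )
    return 1.0 / density

def age_factor(age, slope=1.0, intercept=1.0):
    """Hypothetical linear age factor; slope and intercept are placeholders."""
    return intercept + slope * age

def initial_position(clade_samples, all_samples):
    """Weighted average of latitude and longitude for one clade."""
    weights = [sampling_factor(s['lat'], s['lon'], all_samples) * age_factor(s['age'])
               for s in clade_samples]
    total = sum(weights)
    lat = sum(w * s['lat'] for w, s in zip(weights, clade_samples)) / total
    lon = sum(w * s['lon'] for w, s in zip(weights, clade_samples)) / total
    return lat, lon
```

The effect of the sampling factor is that a sample from a heavily tested region counts for less than an isolated sample, so well-sampled countries do not pull every clade toward themselves.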
  3. Initialization of entire tree
    • Starting at the leaf nodes, go up the tree computing each clade's location as the average of its children's locations
    • If a clade has basal samples, its computed basal position is treated as an additional child in the above computation
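A rough sketch of the step 3 pass, assuming a plain unweighted mean and a hypothetical tree of dicts (the keys 'children', 'position' and 'basal_position' are invented for this example).

```python
def init_positions(clade):
    """Post-order pass: set each clade's position to the mean of its children.

    clade is a hypothetical dict with:
      'children': list of child clade dicts (empty for a terminal subclade)
      'position': (lat, lon) from step 2, present on terminal subclades
      'basal_position': optional (lat, lon) computed from basal samples
    """
    children = clade.get('children', [])
    if not children:
        return clade['position']
    points = [init_positions(child) for child in children]
    # A basal position, when present, counts as one more child.
    if clade.get('basal_position') is not None:
        points.append(clade['basal_position'])
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)
    clade['position'] = (lat, lon)
    return clade['position']
```

Calling init_positions on the root fills in a position for every internal clade in one traversal.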
  4. Directionality Refinement
    • Starting at the nodes directly above the leaves, go up the tree refining each clade's location with a pathing algorithm that takes two parameters
      • its parent clade's location
      • the polygon defined by its child clades' locations
    • If the parent is inside the polygon, refine the clade's location as the midpoint between its own precomputed location and its parent's
    • If the parent is outside the polygon, refine the clade's location as the point on the child polygon closest to the parent
    • Use a different method to refine the root clade, as the above algorithm cannot be used given that the root has no parent
      • Refine the root clade's location as the point that minimizes the total distance between itself and all of its child clades
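The two refinement rules and the root rule in step 4 can be sketched like this. It is an illustration under simplifying assumptions, not the actual pathing code: coordinates are treated as planar (lat, lon) pairs rather than points on a sphere, the child locations are assumed to be already ordered as polygon vertices, and the root is found with Weiszfeld's iteration for the geometric median, which is one common way to minimize total distance.

```python
import math

def point_in_polygon(pt, poly):
    """Ray-casting test; poly is a list of (x, y) vertices in order."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def closest_on_segment(pt, a, b):
    """Point on segment a-b nearest to pt."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    denom = dx * dx + dy * dy
    if denom == 0:
        return a
    t = ((pt[0] - a[0]) * dx + (pt[1] - a[1]) * dy) / denom
    t = max(0.0, min(1.0, t))
    return (a[0] + t * dx, a[1] + t * dy)

def refine(clade_pos, parent_pos, child_poly):
    """Apply the inside/outside rule for one non-root clade."""
    if point_in_polygon(parent_pos, child_poly):
        return ((clade_pos[0] + parent_pos[0]) / 2, (clade_pos[1] + parent_pos[1]) / 2)
    n = len(child_poly)
    candidates = [closest_on_segment(parent_pos, child_poly[i], child_poly[(i + 1) % n])
                  for i in range(n)]
    return min(candidates,
               key=lambda q: math.hypot(q[0] - parent_pos[0], q[1] - parent_pos[1]))

def geometric_median(points, iters=200, eps=1e-9):
    """Weiszfeld iteration for the point minimizing total distance to points."""
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        num_x = num_y = den = 0.0
        for px, py in points:
            d = math.hypot(x - px, y - py)
            if d < eps:
                return (px, py)  # simplification: stop if we land on an input point
            num_x += px / d
            num_y += py / d
            den += 1.0 / d
        nx, ny = num_x / den, num_y / den
        if math.hypot(nx - x, ny - y) < eps:
            return (nx, ny)
        x, y = nx, ny
    return (x, y)
```

With one or two children the "polygon" degenerates to a point or a segment; in this sketch the parent then always counts as outside, and the clade moves to the nearest point on that point or segment.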