Allele-Dependent STR Mutation Rates Calculated from YFull YTree

Studies have shown that STR length (the number of repeats) affects the mutation rate. Generally, the longer the allele, the higher the mutation rate and the greater the chance of a deletion of several repeats.

Motivation

Out of curiosity, and with the aim of eventually improving STR Match Finder so that it computes genetic distance from STR-allele differences more accurately, I decided to use the large data set of haplotypes computed for the YFull YTree to calculate allele-based STR mutation rates. The calculation covers the STRs in FTDNA's 111-STR set for which YFull has computed haplotypes.

Data

YFull has developed their own algorithm for computing haplotypes based on the STRs of each sample. They obtain the STRs of their samples either through the customer's direct import of STR test results or by extraction from the BAM file.

To compute mutation rates from these haplotypes, I used the YFull YTree's TMRCA estimates.

I used the haplotypes that were publicly available on YFull between August 1 and 6, 2022, and the estimated TMRCAs from YFull YTree version 10.04.

There are three potential sources of error in the input data that I have used to compute allele-dependent STR mutation rates:

  1. Errors in extracting STRs from the BAM file
  2. Errors in the algorithm YFull uses to compute haplotypes
  3. Errors in TMRCA estimates

Methodology

For each haplogroup root, I traversed the YFull YTree, keeping track of how many years elapsed between successive TMRCA estimates and whether or not a given STR-allele changed.

I refer to the years elapsed between the TMRCA of a parent clade and the TMRCA of its child as the bottleneck length.

In the event that the downstream haplogroup's haplotype is undefined for a particular STR, that data point is ignored.
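
The traversal described above can be sketched roughly as follows. This is a minimal illustration and not the code used to produce the results: the `Clade` class, its fields (`tmrca`, `haplotype`, `children`), and `collect_observations` are hypothetical names, and skipping a data point when the parent allele is undefined is an assumption (the text only mentions the downstream haplotype).

```python
from dataclasses import dataclass, field


@dataclass
class Clade:
    """Hypothetical tree node: TMRCA in years plus computed STR alleles."""
    name: str
    tmrca: int
    haplotype: dict  # STR name -> allele (int), or None when undefined
    children: list = field(default_factory=list)


def collect_observations(root):
    """Return {(str_name, parent_allele): [(bottleneck_years, mutated), ...]}."""
    observations = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            # Bottleneck length: years between parent TMRCA and child TMRCA.
            bottleneck = parent.tmrca - child.tmrca
            for str_name, parent_allele in parent.haplotype.items():
                child_allele = child.haplotype.get(str_name)
                # Ignore the data point when an allele is undefined
                # (the parent-side check is an assumption).
                if parent_allele is None or child_allele is None:
                    continue
                key = (str_name, parent_allele)
                mutated = child_allele != parent_allele
                observations.setdefault(key, []).append((bottleneck, mutated))
            stack.append(child)
    return observations


# Made-up three-clade tree, for illustration only.
root = Clade("A", 5000, {"DYS393": 13, "DYS19": 14}, [
    Clade("A1", 4200, {"DYS393": 13, "DYS19": 15}),
    Clade("A2", 5000, {"DYS393": 12, "DYS19": None}),
])
obs = collect_observations(root)
```

Keying the observations by (STR, parent allele) keeps each STR-allele combination separate, which is what the later binning step needs.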

Next, I used binning to convert each [bottleneck length (years), mutation or no mutation] observation into a mutation rate per time-interval bin for each STR-allele.

The bin cutoffs I used to quantify the average mutation rate of each STR-allele combination are, in years: [100,200,400,600,800,1000,1200,1500,2000,2500,3500,4500]

Where YFull estimated a bottleneck of zero years, I adjusted it to 50 years.

As long as at least 30 data points fall within a bin, I include that bin in the set of [x,y] points used to solve for a best-fit Poisson model, where:

x = average bottleneck length of the points in the bin (years)
y = observed mutation rate in the bin (the fraction of points showing a mutation, i.e. a probability)
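
The binning step could look something like the sketch below. It is an illustration under stated assumptions rather than the code behind the results: `bin_observations` and the constant names are hypothetical, and treating each cutoff as an exclusive upper bound (with the first bin starting at zero) is an assumption, since the text does not say how the edges are handled.

```python
BIN_EDGES = [100, 200, 400, 600, 800, 1000, 1200, 1500, 2000, 2500, 3500, 4500]
MIN_POINTS = 30             # bins with fewer points are left out of the fit
ZERO_BOTTLENECK_YEARS = 50  # replacement for bottlenecks estimated as zero


def bin_observations(observations, edges=BIN_EDGES, min_points=MIN_POINTS):
    """Turn [(bottleneck_years, mutated), ...] into ([x, ...], [y, ...])."""
    bins = [[] for _ in edges]
    for years, mutated in observations:
        if years == 0:
            years = ZERO_BOTTLENECK_YEARS
        for i, upper in enumerate(edges):
            if years < upper:  # exclusive upper bound (assumption)
                bins[i].append((years, mutated))
                break
        # Bottlenecks at or above the last cutoff match no bin and are dropped.
    xs, ys = [], []
    for points in bins:
        if len(points) < min_points:
            continue
        # x: average bottleneck length; y: fraction of points with a mutation.
        xs.append(sum(years for years, _ in points) / len(points))
        ys.append(sum(1 for _, mutated in points if mutated) / len(points))
    return xs, ys
```

The function is meant to be called once per STR-allele combination, on that combination's list of observations.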

The reasons I do not use mutation rates from longer bottlenecks are:

  • Fewer data points in these bins make them less reliable.
  • The longer the elapsed time between observations, the greater the chance that 'no observed change' masks a hidden mutation away from the original value followed by a mutation back to it. This can badly distort the Poisson best fit, because in reality the probability of a mutation should always increase with bottleneck length.
  • For DYS456 = 15, I found odd apparent stability at bottleneck lengths of 10,000 years and older, which did not match its observed behavior of being less stable than DYS456 = 14 at shorter bottlenecks. This could be due to a low sample count combined with the effect above. Alternatively, in some haplogroups (most predominantly E), the ancestral DYS456 = 15 allele may have other base pairs mixed into it that make it more stable than other DYS456 alleles (a possible explanation I read about in another paper).
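
The second point can be illustrated numerically. The sketch below assumes, purely for illustration, a symmetric single-step model: mutations arrive as a Poisson process with an expected `m` mutations over the bottleneck, and each one adds or removes a single repeat with equal probability. Under that assumption the net change follows a Skellam distribution, and the probability of observing the original allele is exp(-m) * I0(m), which is larger than the probability exp(-m) that no mutation happened at all. This model and the function names are hypothetical and are not the model used for the results in this post.

```python
import math


def bessel_i0(x, terms=60):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total = 0.0
    term = 1.0  # k = 0 term of sum((x/2)^(2k) / (k!)^2)
    for k in range(terms):
        total += term
        term *= (x / 2) ** 2 / ((k + 1) ** 2)
    return total


def p_no_mutation(m):
    """Probability that no mutation occurred, given m expected mutations."""
    return math.exp(-m)


def p_same_allele_observed(m):
    """Probability that the net change is zero (includes away-and-back paths)."""
    return math.exp(-m) * bessel_i0(m)


for m in (0.1, 0.5, 1.0, 2.0):
    hidden = p_same_allele_observed(m) - p_no_mutation(m)
    print(f"m = {m}: P(no change observed but a mutation occurred) = {hidden:.4f}")
```

The gap between the two probabilities grows with `m` over this range, which is consistent with restricting the fit to shorter bottlenecks.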

Results

There is a very strong positive correlation between mutation rate and allele length for a given STR, as previous studies led us to expect.

The increase in mutation rate is sometimes as large as a factor of ten: compare DYS388 = 12 with DYS388 = 16 or 17 in the table below (a smaller value in years corresponds to a higher mutation rate), where:

  • 'n' is number of data points, i.e. number of times that this STR's allele was computed as an ancestral haplotype for a clade that has a direct child branch with this STR also computed and for which the bottleneck is less than 4500 years
  • 'deltas' is the number of times the child had a different allele than the parent
  • Results were only computed for STR-alleles with n > 500 and deltas > 25

A more human-readable HTML table is also available; the underlying JSON object can be retrieved from its JavaScript.

Columns: Allele, Mutation Rate (years), n, deltas. Rows are grouped under each STR name.
DYF406
9 29216 2197 44
10 17745 7029 219
11 15826 7499 287
12 14848 2759 127
DYS19
13 18001 1709 51
14 17783 10647 316
15 18020 4364 208
16 7351 2252 192
DYS388
12 101045 10878 79
13 48722 1804 32
14 22354 1807 33
15 18966 1596 61
16 9239 854 44
17 10148 1595 55
DYS389I
12 24270 3875 99
13 15984 12549 503
14 15364 2943 155
DYS389II-I
15 6278 759 111
16 10788 10262 638
17 7947 6479 575
18 5622 1718 221
DYS390
22 20139 2417 76
23 20283 7389 238
24 11635 6430 345
25 10224 2623 171
DYS391
10 28875 11332 271
11 9058 7668 468
DYS392
11 83239 11381 85
13 28544 4390 97
14 21737 1935 51
DYS393
12 50845 5521 64
13 24632 10889 237
14 15755 2516 122
DYS425
12 95952 16209 101
DYS434
9 121403 17917 85
DYS435
11 185942 19151 61
DYS436
12 223948 19203 50
DYS437
14 105415 10091 60
15 22848 6271 185
16 41644 2950 38
DYS438
10 70793 10169 81
11 73080 3144 33
12 34357 3942 60
DYS439
10 14526 2852 101
11 12153 8268 451
12 9377 7549 702
13 4438 840 59
DYS441
13 23570 6356 148
14 20390 5245 185
15 13819 3242 128
16 10952 2782 181
18 2727 595 65
DYS442
11 17999 5524 203
12 11573 10689 583
13 7347 1643 160
14 5565 1268 139
DYS444
11 22024 1787 72
12 12206 9452 490
13 9931 5442 387
14 7369 2366 192
DYS445
10 36389 1763 26
11 39420 7401 93
12 31151 9633 206
DYS446
11 11052 620 26
12 9737 3359 199
13 9241 7942 554
14 12440 4163 241
15 7340 1642 145
16 5052 765 91
DYS447
23 16196 2422 125
24 12007 2274 138
25 9001 5412 397
26 9822 3637 291
27 12323 704 57
DYS448
19 23432 6659 190
20 18277 9344 279
21 15864 2076 115
DYS449
25 3956 1739 141
26 3493 641 69
27 6139 933 99
28 7874 2126 291
29 5732 4337 719
30 6703 2614 356
31 4869 1894 298
32 4733 2500 481
33 5511 679 98
DYS450
8 133620 14098 66
DYS452
29 20376 3052 62
30 18158 6733 291
31 17987 5080 198
32 4687 507 44
DYS454
11 110109 17102 86
12 32383 1956 34
DYS455
11 158727 17169 96
DYS456
14 15434 5004 164
15 12116 8133 592
16 6599 4343 491
17 4891 1477 153
DYS458
15 8300 4098 333
16 7581 4093 446
17 6129 5989 776
18 7556 2835 304
19 3013 785 77
DYS460
9 19255 980 34
10 15741 7786 317
11 15052 10650 484
DYS461
11 30086 6986 120
12 15951 10085 396
13 11227 2026 123
DYS462
11 51412 10393 107
12 41681 7767 113
13 12228 1241 64
DYS463
21 28588 3385 81
22 23623 4911 131
23 19246 910 30
24 18818 4094 160
DYS481
21 12716 1048 49
22 8996 5030 404
23 9418 3210 245
24 7212 1827 188
25 6410 4340 426
26 7091 1770 167
27 3955 792 114
DYS485
14 36493 1900 32
15 23223 12503 335
16 18570 959 33
17 12509 1007 64
DYS487
12 35236 1843 26
13 38597 9808 212
14 19626 4143 153
15 17968 821 46
DYS490
12 90375 16251 134
DYS492
12 128950 15093 63
DYS494
9 160359 16604 54
DYS495
14 83398 1892 27
15 28902 10312 206
16 29674 4844 118
17 10641 1923 83
DYS497
13 21826 1357 26
14 28223 10425 265
15 32281 6210 142
DYS504
13 22707 1899 47
14 13081 2736 134
15 10912 5190 294
16 8906 3500 318
17 6708 5238 532
18 5712 505 45
DYS505
11 24336 5154 148
12 15519 7670 337
13 15750 5503 234
DYS510
16 27091 1448 39
17 13673 11752 574
18 8966 4925 342
19 5064 884 104
DYS511
9 24832 7395 127
10 19559 9880 298
11 18768 2064 96
DYS513
11 21970 5602 146
12 14334 8968 409
13 13434 4461 246
DYS520
18 39058 1368 28
20 17822 7962 282
21 15522 7068 275
22 10325 1361 80
DYS522
10 32436 5237 96
11 21700 6913 199
12 15094 5102 296
13 4808 1780 191
14 2955 538 54
DYS525
9 37482 2341 43
10 19365 13029 314
11 16326 3250 177
12 7193 966 72
DYS531
10 66577 2174 28
11 70472 15955 157
DYS532
9 13502 900 47
10 14218 2957 102
11 11748 5253 279
12 9328 2909 197
13 7046 4369 408
14 6779 1674 197
15 9773 552 54
DYS533
10 16240 1061 36
11 17847 8162 240
12 14003 8855 448
13 6673 1100 83
DYS534
13 9833 2043 125
14 8788 1997 176
15 6773 7213 769
16 6922 4546 602
17 6729 2605 341
18 2084 576 83
DYS537
10 37390 5478 98
11 31193 10374 237
12 7569 2946 166
DYS540
11 30002 5451 90
12 20402 12700 420
DYS549
11 12448 2318 99
12 13447 10647 584
13 8105 6254 572
DYS552
23 14691 993 31
24 9935 9486 474
25 11416 5058 304
26 9621 2298 199
27 7964 1219 117
DYS556
11 42804 7950 118
12 17801 10202 292
DYS557
14 14937 3150 147
15 12508 4680 248
16 9909 5319 370
17 11047 1372 122
18 7167 3296 278
19 5447 1048 154
DYS561
14 37432 3546 60
15 27535 13166 298
16 13675 2609 141
DYS565
11 99154 10717 76
12 24135 5490 175
13 8957 1774 105
DYS568
11 76262 15173 138
12 13177 2687 145
DYS570
16 7891 1420 149
17 6015 5764 661
18 6342 5454 670
19 5691 3964 588
20 3543 1517 282
DYS572
10 50002 3320 59
11 23010 11221 357
12 9068 2298 143
DYS575
10 266733 19200 48
DYS576
15 7523 890 94
16 6037 2783 308
17 6294 6081 812
18 5415 7786 1145
19 6519 1369 171
DYS578
8 181190 13555 47
DYS587
18 27689 12123 250
19 25114 4867 144
20 11307 737 39
21 9030 733 55
DYS589
11 38656 6032 113
12 38405 7794 141
13 23742 3668 137
DYS590
8 208217 18205 44
DYS593
15 145757 16877 70
16 56697 2249 29
DYS594
10 48663 14013 124
11 60558 4862 63
DYS607
12 22845 1281 29
13 17345 2486 90
14 17463 7249 261
15 14681 5754 293
16 6886 2480 186
DYS617
12 53219 9820 134
13 31346 3308 81
DYS635
20 13267 1115 79
21 10481 7702 584
22 6256 2859 378
23 12195 6159 286
24 9542 977 77
DYS636
11 137379 14125 66
12 48176 4710 57
DYS638
11 58648 14726 163
12 16319 2180 53
DYS640
11 82075 12946 76
12 39882 6153 106
DYS641
10 111017 18081 128
11 6726 561 28
DYS643
9 41657 3518 51
10 20703 7977 216
11 20914 2853 98
12 15005 4116 178
13 8354 782 49
DYS650
17 13612 1533 118
18 7692 3511 412
19 5709 6135 791
20 7133 4218 452
21 5589 1717 221
DYS710
30 4208 901 137
31 4928 1723 300
32 4849 2408 457
33 4957 3487 613
34 5208 2666 505
35 3917 2756 564
36 5167 1013 163
DYS712
19 7751 3234 273
20 5723 6291 791
21 5427 1788 283
22 4448 3024 545
23 3610 1074 231
24 3122 876 193
26 2913 813 202
27 4132 560 134
28 4159 619 154
DYS714
21 12178 533 38
22 7795 1001 92
23 7524 1956 166
24 7359 3424 343
25 6144 4645 584
26 5540 4290 624
27 5729 1591 208
DYS715
21 12658 768 34
22 15239 4230 176
23 11588 6001 337
24 8187 7172 571
25 8343 1029 78
DYS716
26 57439 2537 42
27 32288 4457 87
DYS717
19 50117 13654 184
20 41485 3845 63
21 25128 812 32
DYS726
12 144620 16550 70
Y-GATA-A10
12 17476 6129 247
13 9134 10197 733
14 5391 2601 246
Y-GATA-H4
10 22304 8945 244
11 12389 8470 461
12 5693 1307 111
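
To make the table values easier to interpret, they can be converted into probabilities. The sketch below assumes that the 'Mutation Rate (years)' column is the fitted `mu` of the model `p = 1 - 0.5 ** (t / mu)` that appears in the loss function in this post, so that `mu` is the bottleneck length at which an allele has a 50% chance of being observed as changed. That reading of the column is an assumption, and `p_change` is a hypothetical helper.

```python
def p_change(years, mu):
    """Probability that an allele is observed as changed after `years`, given mu in years."""
    return 1 - 0.5 ** (years / mu)


# Values copied from the DYS388 rows of the table above.
dys388_mu = {12: 101045, 16: 9239, 17: 10148}

for allele, mu in dys388_mu.items():
    print(f"DYS388 = {allele}: p(change within 1000 years) = {p_change(1000, mu):.4f}")
```

Under this reading, the roughly tenfold difference in `mu` between DYS388 = 12 and DYS388 = 16 translates into a roughly tenfold difference in the chance of a change over a short bottleneck.
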
Poisson Loss Function

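def bucketize_loss(params, x_array, y_array):
    """Sum of squared errors between predicted and observed mutation probabilities."""
    # params[0] is mu, the bottleneck length in years at which the predicted
    # probability of a mutation reaches 0.5
    mu = params[0]
    loss = 0
    for i in range(len(x_array)):
        # predicted probability of a mutation over a bottleneck of x_array[i] years
        p_mutation = 1 - 0.5 ** (x_array[i] / mu)
        error = p_mutation - y_array[i]
        error2 = error * error
        loss += error2
    return loss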
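
The post does not say which optimizer was used to minimize this loss. As one possible approach, the sketch below runs a golden-section search over log(mu). The search routine `fit_mu`, its bounds, and the synthetic data points are all assumptions made for illustration; the loss function is repeated in compact form so the example runs on its own.

```python
import math


def bucketize_loss(params, x_array, y_array):
    """Same squared-error loss as above, written compactly."""
    mu = params[0]
    return sum((1 - 0.5 ** (x / mu) - y) ** 2 for x, y in zip(x_array, y_array))


def fit_mu(x_array, y_array, lo=100.0, hi=1_000_000.0, iterations=200):
    """Golden-section search on log(mu); assumes a single minimum in [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    for _ in range(iterations):
        loss_c = bucketize_loss([math.exp(c)], x_array, y_array)
        loss_d = bucketize_loss([math.exp(d)], x_array, y_array)
        if loss_c < loss_d:
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return math.exp((a + b) / 2)


# Synthetic, noise-free points generated from mu = 10000 (not real YFull data).
xs = [75, 150, 300, 500, 700, 900]
ys = [1 - 0.5 ** (x / 10000) for x in xs]
mu_hat = fit_mu(xs, ys)
print(f"fitted mu: {mu_hat:.1f} years")
```

Searching on a log scale is a convenience here, because plausible values of `mu` span several orders of magnitude. A general-purpose minimizer would work just as well with the same `params` signature.
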
These posts are the opinion of Hunter Provyn, a haplogroup researcher in J-M241 and J-M102.
