## AncesTrees:

### ancestry estimation with randomized decision trees

In forensic anthropology, ancestry estimation is essential in establishing the individual biological profile. The aim of this study is to present a new program, AncesTrees, developed for assessing ancestry based on metric analysis. AncesTrees relies on a machine learning ensemble algorithm, random forest, to classify the human skull. In the ensemble learning paradigm, several models are generated and used jointly to arrive at the final decision. The random forest algorithm creates ensembles of decision tree classifiers, a non-linear and non-parametric classification technique. The database used in AncesTrees is composed of 23 craniometric variables from 1,734 individuals, representative of six major ancestral groups and selected from Howells' craniometric series. The program was tested on 128 adult crania from the following collections: the African slaves' skeletal collection of Valle da Gafaria, and the Medical School Skull Collection and the Identified Skeletal Collection of the 21st Century, both curated at the University of Coimbra. The first stage of the test analysis was to perform ancestry estimation including all the ancestral groups of the database. The second stage was to conduct ancestry estimation including only the European and the African ancestral groups. In the first stage, 75% of the individuals of African ancestry and 79.2% of the individuals of European ancestry were correctly identified. The model involving only the African and European ancestral groups performed better: 93.8% of all individuals were correctly classified. These results show that AncesTrees can be a valuable tool in forensic anthropology.

International Journal of Legal Medicine, September 2015, Volume 129, Issue 5, pp. 1145-1153

#### Metric Pattern Analysis:

1. GOL Glabello-occipital length

Greatest length, from the glabellar region, in the median sagittal plane.

2. NOL Nasio-occipital length

Greatest cranial length in the median sagittal plane, measured from nasion.

3. BBH Basion-bregma height

Distance from basion to bregma, as defined.

4. XCB Maximum cranial breadth

The maximum cranial breadth perpendicular to the median sagittal plane, above the supramastoid crests.

5. XFB Maximum frontal breadth

The maximum breadth at the coronal suture, perpendicular to the median plane.

6. FMB Bifrontal breadth

The breadth across the frontal bone between frontomalare anterior on each side, i.e., the most anterior point on the fronto-malar suture.

7. ZYB Bizygomatic breadth

The direct distance between the two zygia, located at the most lateral points of the zygomatic arches.

8. AUB Biauricular breadth

The least exterior breadth across the roots of the zygomatic processes, wherever found.

9. MAB Palate breadth, external

The greatest breadth across the alveolar borders, wherever found, perpendicular to the median plane.

10. ASB Biasterionic breadth

Direct measurement from one asterion to the other.

11. JUB Bijugal breadth

The external breadth across the malars at the jugalia, i.e., at the deepest points in the curvature between the frontal and temporal processes of the malars.

12. ZMB Bimaxillary breadth

The breadth across the maxillae, from one zygomaxillare [anterior] to the other.

13. WMH Cheek height

The minimum distance, in any direction, from the lower border of the orbit to the lower margin of the maxilla, mesial to the masseter attachment, on the left side.

14. NPH Nasion-prosthion height

Upper facial height from nasion to prosthion, as defined.

15. BPL Basion-prosthion length

The facial length from basion to prosthion, as defined.

16. BNL Basion-nasion length

Direct length between basion and nasion.

17. NLH Nasal height

The average height from nasion to the lowest point on the border of the nasal aperture on either side.

18. NLB Nasal breadth

The distance between the anterior edges of the nasal aperture at its widest extent.

19. EKB Biorbital breadth

The breadth across the orbits from ectoconchion to ectoconchion.

20. DKB Interorbital breadth

The breadth across the nasal space from dacryon to dacryon.

21. OBH Orbit height, left

The height between the upper and lower borders of the left orbit, perpendicular to the long axis of the orbit and bisecting it.

22. OBB Orbit breadth, left

Breadth from ectoconchion to dacryon, as defined, approximating the longitudinal axis which bisects the orbit into equal upper and lower parts.

23. FRC Nasion-bregma chord, Frontal chord

The frontal chord, or direct distance from nasion to bregma, taken in the midplane and at the external surface.

24. PAC Bregma-lambda chord, Parietal chord

The external parietal chord, or direct distance from bregma to lambda, taken in the midplane and at the external surface.

25. OCC Lambda-opisthion chord, Occipital chord

The external occipital chord, or direct distance from lambda to opisthion, taken in the midplane and at the external surface.

26. SSS Zygomaxillary subtense

The projection or subtense from subspinale to the bimaxillary width [ZMB].

27. NAS Nasio-frontal subtense

The subtense from nasion to the bifrontal breadth.

28. FRS Nasion-bregma subtense, Frontal subtense

The maximum subtense, at the highest point on the convexity of the frontal bone in the midplane, to the nasion-bregma chord.

29. PAS Bregma-lambda subtense, Parietal subtense

The maximum subtense, at the highest point on the convexity of the parietal bones in the midplane, to the bregma-lambda chord.

30. OCS Lambda-opisthion subtense, Occipital subtense

The maximum subtense, at the most prominent point on the basic contour of the occipital bone in the midplane, to the lambda-opisthion chord.
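The chord and subtense measurements above (for example FRC and FRS) have a simple geometric reading: the chord is the straight-line distance between two landmarks, and the subtense is the largest perpendicular distance from the bone contour to that chord. The following is a minimal sketch of that geometry, assuming the landmarks and contour points are available as 2D coordinates in the midsagittal plane. The function names and the coordinates are hypothetical and are not part of AncesTrees, which takes caliper measurements as input.

```python
import math


def chord_length(a, b):
    """Straight-line distance between two landmarks given as (x, y) in mm."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def perpendicular_distance(a, b, p):
    """Distance from point p to the line through landmarks a and b."""
    chord = chord_length(a, b)
    if chord == 0:
        raise ValueError("chord endpoints coincide")
    # |cross product| is twice the triangle area; dividing by the base
    # (the chord) gives the triangle height, i.e. the perpendicular distance.
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) / chord


def max_subtense(a, b, contour):
    """Largest perpendicular distance from any contour point to chord a-b."""
    return max(perpendicular_distance(a, b, p) for p in contour)


# Hypothetical midsagittal coordinates in mm (illustrative, not real data).
nasion = (0.0, 0.0)
bregma = (60.0, 80.0)
frontal_contour = [(5.0, 30.0), (10.0, 55.0), (30.0, 70.0)]

frc = chord_length(nasion, bregma)                   # frontal chord
frs = max_subtense(nasion, bregma, frontal_contour)  # frontal subtense
print(f"FRC = {frc:.1f} mm, FRS = {frs:.1f} mm")
```

The same two functions would apply to the parietal (PAC, PAS) and occipital (OCC, OCS) pairs by swapping in the corresponding landmarks.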

tournamentForest implements a recursive, full-elimination round-robin tournament classification algorithm built upon randomForest classifiers that use LDA-projected predictors.

The algorithm needs at least 3 groups to run and automatically selects the best binary classifier for the data supplied by the user.

It follows a divide-and-conquer approach: in each iteration of the tournament, the least likely ancestral group is discarded as a viable hypothesis. The tournament ends when only two ancestral groups remain in "competition".
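The elimination loop described above can be sketched as follows. This is only an illustration of the control flow, not the tournamentForest implementation: the `pairwise_prob` callback stands in for a fitted binary classifier, the rule of ranking groups by summed pairwise probability is an assumption, and the group labels and affinity values are made up.

```python
from itertools import combinations


def tournament(groups, pairwise_prob):
    """Drop the least supported group each round until two remain.

    pairwise_prob(a, b) is a hypothetical stand-in for a binary classifier:
    it returns the probability that the individual belongs to a rather than b.
    """
    if len(groups) < 3:
        raise ValueError("the tournament needs at least 3 groups")
    remaining = list(groups)
    eliminated = []
    while len(remaining) > 2:
        # Round-robin: every remaining group meets every other group once.
        support = {g: 0.0 for g in remaining}
        for a, b in combinations(remaining, 2):
            p = pairwise_prob(a, b)
            support[a] += p
            support[b] += 1.0 - p
        # Assumed rule: the group with the lowest total support is discarded.
        loser = min(remaining, key=support.get)
        remaining.remove(loser)
        eliminated.append(loser)
    # Final binary decision between the two finalists.
    a, b = remaining
    winner = a if pairwise_prob(a, b) >= 0.5 else b
    return winner, remaining, eliminated


# Toy affinities for one individual (made-up numbers, placeholder labels).
affinity = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}


def toy_prob(a, b):
    return affinity[a] / (affinity[a] + affinity[b])


winner, finalists, dropped = tournament(list(affinity), toy_prob)
print(winner, finalists, dropped)
```

In this toy run the groups are discarded in order of increasing affinity, and the final pairwise comparison decides between the two that are left.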

The algorithm explores a hypothesis space composed of $\frac{N(N-1)}{2}$ classifiers, performing every possible pairwise comparison between the $N$ ancestral groups in order to establish the most likely one. It is best suited for cases where little to no background knowledge on a possible ancestry is available.
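As a quick check of the $\frac{N(N-1)}{2}$ count, the pairwise matchups can be enumerated directly. The sketch below uses placeholder labels for six groups (the number of ancestral groups in the reference database); the labels themselves are hypothetical.

```python
from itertools import combinations


def pairwise_matchups(groups):
    """Return every unordered pair of groups, one per binary classifier."""
    return list(combinations(groups, 2))


# Placeholder labels for six groups (not the database's actual group names).
groups = ["G1", "G2", "G3", "G4", "G5", "G6"]
matchups = pairwise_matchups(groups)

n = len(groups)
expected = n * (n - 1) // 2
print(len(matchups), expected)  # 15 15
```

With six groups the full round-robin therefore involves 15 binary comparisons, and the count grows quadratically as groups are added.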

tournamentForest is set as the default algorithm because it represents a fully automated and data-driven approach to bio-geographic ancestry prediction.

#### Ancestry Prediction

#### Model Information & Accuracy