FamiLinx


Data


General


FamiLinx is a scientific resource of curated genealogical and demographic data on tens of millions of people, most of whom lived within the last 500 years. Unlike traditional studies, this resource is the product of an ultra crowd-sourcing approach: it is based on the collaborative work of genealogy enthusiasts around the world who documented and shared their family stories.

The starting point of FamiLinx was the public information on Geni.com, a genealogy-driven social network operated by MyHeritage. Geni.com allows genealogists to enter their family trees into the website and to create profiles of family members with basic demographic information such as sex, birth date, marital status, and location. The genealogists decide whether the profiles in their trees are public or private. New or modified family tree profiles are constantly compared to all existing profiles; when a profile is highly similar to an existing one, the website offers the users the option to merge the profiles and connect the trees.

With permission from MyHeritage, we downloaded the public profiles of individuals from Geni.com for future scientific studies. We used graph algorithms to clean the data and organize the pedigrees into formats that can be accessed quickly. We also employed natural language processing to tokenize the birth, residence, death, and burial locations of individuals and converted this information into longitude and latitude coordinates. The FamiLinx data is distributed as several text files. We encourage users to load these files into a database for ease of use. Users can create their own local copy with the download package.
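As an illustration of loading the text files into a database, here is a minimal sketch using Python's built-in sqlite3 module. It assumes, without confirmation from the data documentation, that each file is tab-delimited and begins with a header row naming the columns; the helper name `load_table` and the table names are hypothetical. Check the delimiter and field list in the download package before relying on it.

```python
import csv
import sqlite3


def load_table(conn, table, path, delimiter="\t"):
    """Load a delimited text file into a SQLite table and return the row count.

    Assumptions (not confirmed by the FamiLinx documentation):
    the file is tab-delimited and its first line is a header row.
    Every column is stored as TEXT; cast values when querying.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        header = next(reader)
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        placeholders = ", ".join("?" for _ in header)
        # Skip malformed lines whose field count differs from the header.
        rows = (row for row in reader if len(row) == len(header))
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Hypothetical usage once the files have been downloaded:
# conn = sqlite3.connect("familinx.db")
# load_table(conn, "profiles", "profiles-anon.txt")
# load_table(conn, "relations", "relations-anon.txt")
```

Storing everything as TEXT keeps the loader independent of the actual field list; adding indexes on the identifier columns after loading will usually speed up queries across tens of millions of rows.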

For privacy purposes, the resource does not contain any names, and any attempt to re-identify the users is strictly prohibited.


Examples


An example of a (small) FamiLinx pedigree of 6,000 people that spans 7 generations:

Green nodes denote individuals and red nodes denote marriages.



Quantitative analysis of human migration with crowd-sourced genealogy



The Dataset


The dataset has demographic data on 86M individuals and genealogical data on 43M individuals. More details are available in the supplementary material. The data is organized in two main files:
  • profiles-anon.txt: Information about each profile, such as gender and the date and location of birth, death, and burial. A separate page lists all data fields.
  • relations-anon.txt: The parent-child relations between profiles.
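To show how the parent-child relations can be used, the sketch below builds a child-to-parents index from (parent, child) pairs and walks it breadth-first to find a profile's ancestors and how many generations back each one is. The pair order (parent first, child second) and the function names are assumptions made for illustration; this is not the graph processing used to build FamiLinx.

```python
from collections import defaultdict, deque


def build_parent_index(pairs):
    """Map each child identifier to the set of its parents.

    `pairs` is an iterable of (parent_id, child_id) tuples. The column
    order is an assumption; verify it against the data field list.
    """
    parents = defaultdict(set)
    for parent, child in pairs:
        parents[child].add(parent)
    return parents


def ancestors_by_generation(parents, person):
    """Return {ancestor_id: generations_back} using breadth-first search.

    Each ancestor is recorded once, at its shortest distance, so the
    search also terminates if the data contains a cycle.
    """
    depth = {}
    queue = deque([(person, 0)])
    while queue:
        node, distance = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in depth:
                depth[parent] = distance + 1
                queue.append((parent, distance + 1))
    return depth
```

For example, with the pairs ("gp", "p"), ("p", "c"), and ("m", "c"), calling `ancestors_by_generation(build_parent_index(pairs), "c")` maps "p" and "m" to 1 and "gp" to 2.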

Visit the download page to request access to the database.

Identifiers

The downloaded data contains anonymized identifiers.
To overlay other datasets on the FamiLinx data, you will need the dynamic version that includes the Geni profile-id.
Write to yaniv@cs.columbia.edu to obtain this type of data. The dynamic version is only available to academic researchers with a proven academic email address.