AI Genealogy Use Case Guide: How-to Get from Story to Structured Data, 2: Create GEDCOM (family tree) files from obits, articles, & announcements

Introduction:

  • In the world of genealogy research, information is scattered across various sources, including narrative texts such as birth and wedding announcements, obituaries, and newspaper articles. These unstructured narratives can be challenging to manage and analyze. In this blog post, we will explore the specific task of quickly extracting valuable data such as names, relationships, dates, and places from these texts and converting them into structured data formats such as GEDCOM files, the format used to build and exchange family trees and to share genealogical data. This data extraction and storage process is tremendously beneficial for streamlining genealogical research and making connections between family members, ancestors, and historical events.
  • To accomplish this task, we will be utilizing ChatGPT, an advanced AI language model developed by OpenAI. ChatGPT is capable of processing and extracting information from large volumes of text, making it an ideal tool for genealogy enthusiasts seeking to organize and analyze their research data efficiently.

Objectives:

  • Automate the extraction process: Utilize ChatGPT to efficiently extract names, relationships, dates, and places from various sources, such as announcements, obituaries, and newspaper articles, minimizing manual effort and speeding up the research process.
  • Improve data organization: Convert the extracted information into structured data formats such as GEDCOM files to facilitate better organization, storage, and retrieval of genealogical data.
  • Enhance data analysis: Enable genealogy researchers to analyze structured data more effectively, identify patterns, and uncover hidden connections between family members and historical events.
  • Save time and resources: Streamline the research process by reducing the time spent on manually extracting and organizing data, freeing up more time for analysis and interpretation.
  • Increase research accuracy: Minimize human errors and inconsistencies in data extraction and organization by leveraging ChatGPT’s advanced language processing capabilities.

Requirements:

  • Access to AI: Obtain a free or paid subscription to an artificial intelligence service such as OpenAI’s ChatGPT. Other AI options include Google’s Bard, Anthropic’s Claude, or Microsoft’s Bing Chat, but in April 2023, OpenAI’s GPT-4 based ChatGPT is strongest.
  • Input data: Provide ChatGPT with text from sources such as birth or wedding announcements, obituaries, and newspaper articles, containing information about names, relationships, dates, and places relevant to genealogical research.
  • Family tree software to read and use the GEDCOM file created here. Most genealogy applications and websites use GEDCOM files to share family trees and genealogical data; these include desktop applications such as RootsMagic, Family Tree Maker, and Gramps, and online resources such as Ancestry, DNA Painter, MyHeritage, and FindMyPast.
  • Optional: Formatting requirements: Ensure that the input text is free of major errors or inconsistencies. Although ChatGPT can handle some level of noise in the data, better-formatted input will yield more accurate and reliable results. Our earlier Use Case Guide on Cleaning OCR Text quickly steps you through this process.
  • Helpful: Genealogy research resources: Familiarize yourself with various genealogical research methods, repositories, and databases to: (1) know where to find the texts to data mine, and (2) effectively contextualize and validate the information extracted by ChatGPT.

Caveats for the Careful on Large Language Models in Genealogy (April 2023):

  • No live internet access, with data current only up to September 2021
  • Unreliable for fact-based research, relying on statistical language patterns
  • Official ChatGPT warning: may produce inaccurate information
  • Mainly used for information processing, not discovering new data
  • Limited to 1500 words input/output (approx. 4k tokens)
  • Chatbots lack traditional memory, necessitating careful management of conversation
  • See Use Case Guide #1: Cleaning OCR Text for detailed information on each caveat above.
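
The input-size caveat above is easy to check before you paste anything into a chat. What follows is a minimal sketch, not an official tool: the function names are my own, and the 1500-word figure is only this guide's April 2023 rule of thumb, not an exact limit published by any AI service.

```python
# ASSUMPTION: about 1500 words is this guide's rule of thumb for one prompt,
# not an exact limit. Adjust MAX_WORDS for the AI service you use.
MAX_WORDS = 1500


def word_count(text: str) -> int:
    """Count whitespace-separated words in the text."""
    return len(text.split())


def fits_budget(text: str, max_words: int = MAX_WORDS) -> bool:
    """Return True when the text is within the assumed word budget."""
    return word_count(text) <= max_words
```

If fits_budget returns False for your cleaned article, trim the text to the passages that name people and relationships before prompting.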

How To: Methodology:

  • Step 1: Get a free or paid AI account. In April 2023, OpenAI’s GPT-4 based ChatGPT is strongest, but the free version based on GPT-3.5 will also work; you can get a free account at https://chat.openai.com/auth/login.
  • Step 2: Find and prepare your input text. In spring 2023, most publicly accessible AI systems are based on large language models that are untethered to reality or knowledge systems; they work by selecting the next statistically most likely word based on your prompt and previous utterances in your chat. For this reason, fact-based researchers such as genealogists should restrict the AI to working only on the data they input. In this series of Use Case Guides, we have been using texts from publicly available sources such as the Chronicling America newspaper archive from The Library of Congress and the National Endowment for the Humanities. Our Use Case Guide #1: How to Clean Raw and Poor OCR Text breaks this step down in a detailed walk-through; refer to that guide if you would like help with this step. This Guide uses an obituary first published 100 years ago this week in one of my state’s capital newspapers: “Westmoreland Club Honors J. E. Royall,” Richmond Planet (Richmond, VA) 1883-1938, April 21, 1923, Page 8, Image 8; Image and text provided by Library of Virginia; Richmond, VA; < https://chroniclingamerica.loc.gov/lccn/sn84025841/1923-04-21/ed-1/seq-8/ > [accessed: 17 April 2023]. I processed and cleaned the raw (nightmarish) Chronicling America text using the steps in Use Case Guide #1.
  • Step 3: Start a new ChatGPT session. It is important to start a new ChatGPT session when beginning a new genealogical task because (lacking both short-term and long-term memory) the chatbot re-ingests up to the previous 400 lines of your “dialogue” in order to simulate a conversation; this can have the unintended effect of contaminating your chat with information from previous utterances in the current session. See “Don’t Get Burned by Spicy Autocomplete” for more information about this concern.
  • Step 4: Write your prompt. Your prompt will include two parts: (a) your instructions to the AI, and (b) the input text from which you want to extract structured genealogical data. We’ll discuss both parts in turn.
  • Step 4(a): Write your instructions to the AI. The instructional component of a prompt itself has subcomponents. Here you can see that the first part of the instructions directs the AI to assume a role, in this case that of a genealogist; this has the effect of providing a context for the AI’s response. Next come action verbs that direct the AI; in this case “Find,” “Prioritize,” “Create,” “Include,” and “Respond.” These will change depending on your task; if you have trouble crafting this part, ChatGPT can help: write the instructions as best you can, then use ChatGPT to “convert these statements to the imperative mood”; this has the effect of changing your statements to the desired form “[you, the AI] do (verb) this.” Finally, you will see that here we are directing the AI to create a table of data; this is the simplest form of structured data, and perhaps the most meaningful and accessible to the human genealogist! Later, I’ll show you how to transform this table data into forms more suited for genealogical tools such as tree making applications, spreadsheets, and databases. To assist in the verification of our work, the AI is instructed to show its work, that is, to include the evidence it used to make a relationship determination by quoting the passage it relied upon to state a relationship. So far, thus prompted, contained, restrained, and instructed, I have not witnessed a fabrication or hallucination of a relationship. If you do, capture and save your whole session; I’d love to see it.
PROMPT: Assume the role of an expert, professional genealogist. Find below the text of an obituary. Prioritize fidelity to the information below. Create a table of named relatives of the deceased. Include only explicitly named relationships. Respond in the form of a markdown table with the column headings: Deceased | Person 2 | Relationship | Evidence (where evidence is the quoted text from the article used to determine relationship).
  • Step 4(b): Paste your input text below the instructions. Below your instructions, paste the text you would like the AI to process. Remember, as of April 2023, we are limited to about 1500 words input. Folks may tell you that you can upload more, perhaps by asking the AI to accept your input in parts, and it will agree to do that, but if you upload more than about 400 lines or about 1500 words, the AI will drop and ignore parts of your input. (In OpenAI’s technical jargon, we are limited to 4096 “tokens,” more akin to syllables than words, but for simplicity’s sake, about 1500 words input and 1500 words output.) [NOTE: If you need to enter a newline or line break in the ChatGPT edit box, Shift-Enter will give you a new line without submitting the request.]
  • Step 5: Examine your results; adjust as needed. Just as writing means re-writing, so prompt engineering means prompting and re-prompting. I never get the best results on a first attempt, so I expect to refine a prompt until the AI is producing the data in the form I want.
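
If you reuse the same instructions across many obituaries, the two-part prompt from Step 4 can be assembled in a few lines. This is only a sketch of one way to do it: the instruction text is the prompt from Step 4(a), while the build_prompt helper and the blank-line separator between the two parts are my own assumptions. The result is still meant to be pasted into the chat window by hand.

```python
# The instruction text is the Step 4(a) prompt from this guide.
INSTRUCTIONS = (
    "Assume the role of an expert, professional genealogist. "
    "Find below the text of an obituary. "
    "Prioritize fidelity to the information below. "
    "Create a table of named relatives of the deceased. "
    "Include only explicitly named relationships. "
    "Respond in the form of a markdown table with the column headings: "
    "Deceased | Person 2 | Relationship | Evidence "
    "(where evidence is the quoted text from the article used to "
    "determine relationship)."
)


def build_prompt(instructions: str, source_text: str) -> str:
    """Join part (a), the instructions, and part (b), the input text.

    ASSUMPTION: a single blank line separates the two parts; the guide
    only says to paste the text below the instructions.
    """
    return f"{instructions.strip()}\n\n{source_text.strip()}"
```

Keeping the instructions in one place makes the re-prompting in Step 5 easier: change the wording once, rebuild the prompt, and start a fresh session.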

The results here are exactly as expected. This use case is a greater accomplishment than may be apparent to some. The imagination or understanding of how this use case will soon scale (become more powerful) is sometimes the missing piece. Here, eight relationships were extracted from a 700-word obituary; that is admittedly weak tea. But in time we will be able to process book chapters (50-pages announced already), whole books (this year or 2024), and entire archives after that. That’s the big deal that’s coming.
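
Because the prompt asks for quoted evidence, the table can be checked mechanically as well as by eye. The sketch below rests on stated assumptions: it expects the four-column markdown table requested in Step 4(a), the function names are mine, and the sample names are invented placeholders, not data from the Royall obituary. It flags any row whose quoted evidence does not appear in the text you supplied.

```python
def parse_markdown_table(table: str) -> list[dict[str, str]]:
    """Turn a simple markdown table into a list of row dictionaries."""
    header = None
    rows = []
    for line in table.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first non-empty line holds the headings
        elif all(set(cell) <= set("-: ") for cell in cells):
            continue  # skip the |---|---| separator row
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def normalize(text: str) -> str:
    """Collapse whitespace, straighten curly quotes, drop outer quotes."""
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return " ".join(text.split()).strip('"').strip().lower()


def unsupported_rows(rows: list[dict[str, str]], source_text: str) -> list[dict[str, str]]:
    """Return rows whose Evidence quote is missing from the source text."""
    source = normalize(source_text)
    flagged = []
    for row in rows:
        evidence = normalize(row.get("Evidence", ""))
        if not evidence or evidence not in source:
            flagged.append(row)
    return flagged


# HYPOTHETICAL sample data for illustration only.
SAMPLE_SOURCE = "John Doe died Tuesday. He is survived by his wife, Mary Doe."
SAMPLE_TABLE = """
| Deceased | Person 2 | Relationship | Evidence |
|---|---|---|---|
| John Doe | Mary Doe | wife | "survived by his wife, Mary Doe" |
| John Doe | Ann Doe | daughter | "his daughter, Ann Doe" |
"""
```

A flagged row is not proof of a fabrication, since the AI may have trimmed or repunctuated the quote, but it tells you which rows to compare against the article first.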

  • Step 6: Set the context for your GEDCOM request prompt. Set the context of the GEDCOM request by asking the AI about its familiarity with GEDCOM files.
PROMPT: Are you familiar with the GEDCOM file format and standard?
  • Step 7: Prompt for the creation of the GEDCOM file. There are several items to note about this prompt. First, we are directing the AI to transform the table of named relationships created earlier; that table included direct quotations from the source article for ease of validation and verification, but since that is not desired in the GEDCOM file, we instruct the artificial intelligence to omit that column of information. We do want the source of this information included with the GEDCOM file, so we supply that here to ChatGPT.
PROMPT: Create a GEDCOM file from the table of named relatives above. Omit "Evidence" column. Include source information: "Westmoreland Club Honors J. E. Royall," Richmond Planet, Richmond, Va. 1883-1938, April 21, 1923, Image 8, Image and text provided by Library of Virginia; Richmond, VA
Persistent link: https://chroniclingamerica.loc.gov/lccn/sn84025841/1923-04-21/ed-1/seq-8/

GEDCOM FILE:
  • Step 8: Examine your results. I have found that ChatGPT very reliably creates accurate and functional GEDCOM files using this complete method. Other artificial intelligences may not be as reliable. Anthropic’s Claude will create an accurate GEDCOM file, but fail to add the newlines (carriage returns or hard line breaks) needed to create a functional GEDCOM; asking again for those newline characters usually works. I have not been able to successfully create an accurate and functional GEDCOM with Google’s Bard, Perplexity AI, or Microsoft’s Bing Chat.
  • Step 9: Save your results. You need to save your GEDCOM file as a text file. This means finding and using your computer’s text file editor. The basic Windows text editor is Notepad, so Windows users will open Notepad with a new, blank file. Then, in the ChatGPT code window, click the “Copy code” link at the upper right corner of the code window. Switch to the blank text file, paste the GEDCOM data into the text file, and save the file with a name such as “Royall.ged”; the “.ged” extension (last characters of the file name) is important. Remembering to include this file name extension will enable your genealogical apps and sites to recognize this text file as a family tree file.
  • Step 10: Open, test, verify, and confirm your work. At this point, you can open your genealogy application such as RootsMagic, Gramps, or Family Tree Maker and open or import the GEDCOM file (you will need to check that application’s instructions for opening and/or importing a GEDCOM file). You will usually want to open the GEDCOM file as a new tree, as opposed to merging it into an existing tree. Compare the information now in your new family tree to the information stated in the birth or wedding announcement, obituary, or newspaper article.
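
Steps 8 through 10 can be partly automated before you open your genealogy application. The following is a rough sketch under stated assumptions: it checks only the basic GEDCOM line shape (a level number, an optional @ID@, a tag, an optional value) and the opening "0 HEAD" and closing "0 TRLR" records, which is enough to catch the missing-newline problem described in Step 8. The function names are mine, and the sample tree uses invented placeholder names rather than the Royall data. It is no substitute for importing the file and comparing it against the article.

```python
import re
from pathlib import Path

# One GEDCOM line: level, optional @XREF@, tag, optional value.
GEDCOM_LINE = re.compile(r"^\d+ (@[^@ ]+@ )?[A-Za-z0-9_]+( .*)?$")


def check_gedcom(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means none found."""
    lines = [ln.strip() for ln in text.strip().lstrip("\ufeff").split("\n")]
    problems = []
    if lines[0] != "0 HEAD":
        problems.append("first line is not '0 HEAD'")
    if lines[-1] != "0 TRLR":
        problems.append("last line is not '0 TRLR'")
    for number, line in enumerate(lines, start=1):
        if not GEDCOM_LINE.match(line):
            problems.append(f"line {number} is malformed: {line!r}")
    return problems


def ensure_ged_extension(path: str) -> Path:
    """Return the path with a .ged extension, as Step 9 requires."""
    target = Path(path)
    return target if target.suffix.lower() == ".ged" else target.with_suffix(".ged")


def save_gedcom(text: str, path: str) -> Path:
    """Write the GEDCOM text as a plain UTF-8 text file ending in .ged."""
    target = ensure_ged_extension(path)
    target.write_text(text.strip() + "\n", encoding="utf-8")
    return target


# HYPOTHETICAL two-person tree for illustration only.
SAMPLE_GEDCOM = """0 HEAD
1 GEDC
2 VERS 5.5.1
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
0 @I1@ INDI
1 NAME John /Doe/
1 FAMS @F1@
0 @I2@ INDI
1 NAME Mary /Doe/
1 FAMS @F1@
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
0 TRLR
"""
```

For example, check_gedcom(SAMPLE_GEDCOM) returns an empty list, while the same records run together on a single line are reported as problems; save_gedcom(SAMPLE_GEDCOM, "Royall") would write a file named Royall.ged.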

Results and Analysis:

  • Expected Outcomes: By using AI for this genealogy task, you can expect the extraction of key information such as names, relationships, dates, and places from various text sources, subsequently saving the data in a structured format such as GEDCOM files. This will facilitate easier sharing of family trees and the exchange of genealogical information.
  • Accuracy and Reliability: While ChatGPT is a powerful AI model, the accuracy and reliability of the results will depend on the quality of the input data and the clarity of the information present. In most cases, ChatGPT can accurately extract relevant data points, but manual review and validation are required to confirm that the extracted information matches the source text and is consistent with your research goals.

Conclusion:

  • In conclusion, using ChatGPT for extracting structured data from narrative sources like birth and wedding announcements, obituaries, and newspaper articles offers significant benefits and some limitations. The technology has the potential to greatly enhance genealogy research by automating the extraction of names, relationships, dates, and places, saving time and effort for researchers. The ability to convert this information into a structured data format such as GEDCOM files further streamlines the research process and facilitates data organization, the sharing of family trees, and the exchange of genealogical information.
  • However, limitations need to be considered. ChatGPT’s accuracy may vary depending on the quality of the input text, especially if dealing with raw OCR text or handwritten documents. Additionally, ChatGPT might struggle with complex relationships and ambiguous information present in the narratives. To overcome these challenges, AI Genealogists need to manually review and verify the extracted data.
  • For further improvement and exploration of AI in genealogy, researchers should consider integrating ChatGPT with other natural language processing tools or specialized genealogy software to enhance its capabilities. Collaborating with AI developers to create tailored models for genealogy research could further optimize the extraction process and improve overall accuracy. Encouraging users to share their experiences and provide feedback will contribute to the ongoing refinement of AI solutions for genealogy.
  • Ultimately, employing AI tools like ChatGPT for genealogy research has the potential to revolutionize the field, making it more accessible, efficient, and accurate. As AI technology continues to evolve, the possibilities for its application in genealogy will only expand, benefiting researchers and family historians alike.

Call to Action:

  • Try ChatGPT for your genealogy tasks: We encourage you to harness the power of ChatGPT for your genealogy research. Experience firsthand the benefits of using AI to extract structured data from narrative sources like birth and wedding announcements, obituaries, and newspaper articles.
  • Share your experiences: We would love to hear about your experiences using ChatGPT for genealogy tasks. Share your successes, challenges, or any interesting insights you’ve gained through utilizing AI in your research. Your feedback can help improve the technology and benefit the entire genealogy community.
  • Ask questions and seek advice: If you have any questions or need assistance with using ChatGPT for genealogy tasks, feel free to post them in the comments section below or reach out to us on social media. Our community of experts and fellow genealogy enthusiasts will be more than happy to help.
  • Connect with others and expand your knowledge: Join genealogy forums, social media groups, and other online communities where you can connect with others who are using AI for genealogy research. These platforms are excellent resources for sharing tips, tricks, and best practices, as well as staying up-to-date with the latest advancements in AI technology.
  • Explore additional resources: To further enhance your understanding of AI in genealogy and to make the most out of ChatGPT, check out the provided links to tutorials, support forums, and related articles. Continuously learning and staying informed will help you maximize the potential of AI in your genealogy research.