AI Genealogy Use Case Guide: How-to Get from Story to Structured Data, 1: from Text to Table Data, from Stories to CSV files

Introduction:

  • In the world of genealogy research, information is scattered across various sources, including narrative texts such as birth and wedding announcements, obituaries, and newspaper articles. These unstructured narratives can be challenging to manage and analyze. In this blog post, we will explore the specific task of quickly extracting valuable data such as names, relationships, dates, and places from these texts and converting them into structured data formats such as tables, JSON files, and near-universally usable CSV files (great for importing into spreadsheets such as Excel and Google Sheets and into databases such as MySQL and AirTable). This data extraction and storage process is tremendously beneficial for streamlining genealogical research and making connections between family members, ancestors, and historical events.
  • To accomplish this task, we will be utilizing ChatGPT, an advanced AI language model developed by OpenAI. ChatGPT is capable of processing and extracting information from large volumes of text, making it an ideal tool for genealogy enthusiasts seeking to organize and analyze their research data efficiently. Stay tuned as we dive into the objectives, requirements, and methodology of using ChatGPT for genealogy data extraction and organization.

Objectives:

  1. Automate the extraction process: Utilize ChatGPT to efficiently extract names, relationships, dates, and places from various sources, such as announcements, obituaries, and newspaper articles, minimizing manual effort and speeding up the research process.
  2. Improve data organization: Convert the extracted information into structured data formats (e.g., tables, JSON, and CSV files) to facilitate better organization, storage, and retrieval of genealogical data.
  3. Enhance data analysis: Enable genealogy researchers to analyze structured data more effectively, identify patterns, and uncover hidden connections between family members and historical events.
  4. Save time and resources: Streamline the research process by reducing the time spent on manually extracting and organizing data, freeing up more time for analysis and interpretation.
  5. Increase research accuracy: Minimize human errors and inconsistencies in data extraction and organization by leveraging ChatGPT’s advanced language processing capabilities.

Requirements:

  1. Access to AI: Obtain a free or paid subscription to an artificial intelligence service such as OpenAI’s ChatGPT. Other AI options include Google’s Bard, Anthropic’s Claude, or Microsoft’s Bing Chat, but in April 2023, OpenAI’s GPT-4 based ChatGPT is strongest.
  2. Input data: Provide ChatGPT with text from sources such as birth or wedding announcements, obituaries, and newspaper articles, containing information about names, relationships, dates, and places relevant to genealogical research.
  3. Optional: Formatting requirements: Ensure that the input text is free of major errors or inconsistencies. Although ChatGPT can handle some level of noise in the data, better-formatted input will yield more accurate and reliable results. Our earlier Use Case Guide on Cleaning OCR Text quickly steps you through this process.
  4. Optional: Data storage and processing tools: Utilize software and tools like Microsoft Excel, a CSV editor, or a MySQL client to store, manage, and analyze the structured data extracted by ChatGPT.
  5. Helpful: Genealogy research resources: Familiarize yourself with various genealogical research methods, repositories, and databases to: (1) know where to find the texts to data mine, and (2) effectively contextualize and validate the information extracted by ChatGPT.

Caveats for the Careful on Large Language Models in Genealogy (April 2023):

  • No live internet access, with data current only up to September 2021
  • Unreliable for fact-based research, relying on statistical language patterns
  • Official ChatGPT warning: may produce inaccurate information
  • Mainly used for information processing, not discovering new data
  • Limited to 1500 words input/output (approx. 4k tokens)
  • Chatbots lack traditional memory, necessitating careful management of conversation
  • See Use Case Guide #1: Cleaning OCR Text for detailed information on each caveat above.

Caveats for the Bold

  • These initial Use Cases are admittedly weak tea: limited and narrow in function and capacity; these constraints reflect the abilities and token limits of AI systems for fact-based research in April 2023.
  • For now, think “Lego pieces” not “Post-Doc Assistant”; that is, in spring 2023, don’t imagine AI is a magic genie that can do all your work for you, like a post-doc assistant; instead, AI-assisted genealogical tasks are now more like a growing Swiss army knife or set of Lego blocks with which you can build tools to solve larger problems. It doesn’t take too much creativity to imagine how even these modest use cases can today be linked/chained and combined with each other to accomplish larger genealogical goals; soon enough, I imagine, the larger goals will be one-step AI-assisted tasks. But, for now, enjoy playing with the fundamental building blocks of more powerful systems to come.

How To: Methodology:

  • Step 1: Get a free or paid AI account. In April 2023, OpenAI’s GPT-4 based ChatGPT is strongest, but the free version based on GPT-3.5 will also work; you can get a free account at https://chat.openai.com/auth/login.
  • Step 2: Find and prepare your input text. In spring 2023, most publicly accessible AI systems are based on large language models that are untethered to reality or knowledge systems; they work by selecting the next statistically most likely word based on your prompt and previous utterances in your chat. For this reason, fact-based researchers such as genealogists should restrict the AI to working only on the data they input. In this series of Use Case Guides, we have been using texts from publicly available sources such as the Chronicling America newspaper archive from The Library of Congress and the National Endowment for the Humanities. Our Use Case Guide #1: How to Clean Raw and Poor OCR Text breaks this step down in a detailed walk-through; refer to that guide if you would like help with this step. This Guide uses an obituary first published 100 years ago this week in one of my state’s capital newspapers: “Westmoreland Club Honors J. E. Royall,” Richmond Planet (Richmond, VA) 1883-1938, April 21, 1923, Page 8, Image 8; Image and text provided by Library of Virginia; Richmond, VA; < https://chroniclingamerica.loc.gov/lccn/sn84025841/1923-04-21/ed-1/seq-8/ > [accessed: 17 April 2023]. I processed and cleaned the raw (nightmarish) Chronicling America text using the steps in Use Case Guide #1.
  • Step 3: Start a new ChatGPT session. It is important to start a new ChatGPT session when beginning a new genealogical task because (lacking both short-term and long-term memory) the chatbot re-ingests up to the previous 400 lines of your “dialogue” in order to simulate a conversation; this can have the unintended effect of contaminating your chat with information from previous utterances in the current session. See “Don’t Get Burned by Spicy Autocomplete” for more information about this concern.
  • Step 4: Write your prompt. Your prompt will include two parts: (a) your instructions to the AI, and (b) the input text from which you want to extract structured genealogical data. We’ll discuss both parts in turn.
  • Step 4(a): Write your instructions to the AI. The instructional component of a prompt itself has subcomponents. Here you can see that the first part of the instructions directs the AI to assume a role, in this case, that of a genealogist; this has the effect of providing a context for the AI’s response. Next come action verbs to direct the AI; in this case “Find,” “Prioritize,” “Create,” “Include,” and “Respond.” These will change depending on your task; if you have trouble crafting this part, ChatGPT can help: write the instructions as best you can, then use ChatGPT to “convert these statements to the imperative mood”; this has the effect of changing your statements to the desired form “[you, the AI] do (verb) this.” Finally, you will see that here we are directing the AI to create a table of data; this is the simplest form of structured data, and perhaps the most meaningful and accessible to the human genealogist! Later, I’ll show you how to transform this table data into forms more suited for genealogical tools such as tree-making applications, spreadsheets, and databases. To assist in the verification of our work, the AI is instructed to show its work, that is, to include the evidence it used to make a relationship determination by quoting the passage it relied upon to state a relationship. So far, thus prompted, contained, restrained, and instructed, I have not witnessed a fabrication or hallucination of a relationship. If you do, capture and save your whole session; I’d love to see it.
PROMPT: Assume the role of an expert, professional genealogist. Find below the text of an obituary. Prioritize fidelity to the information below. Create a table of named relatives of the deceased. Include only explicitly named relationships. Respond in the form of a markdown table with the column headings: Deceased | Person 2 | Relationship | Evidence (where evidence is the quoted text from the article used to determine relationship).
  • Step 4(b): Paste your input text below the instructions. Below your instructions, paste the text that you would like the AI to process. Remember, as of April 2023, we are limited to about 1500 words input. Folks may tell you that you can upload more, perhaps by asking the AI to accept your input in parts, and it will agree to do that, but if you upload more than about 400 lines or about 1500 words, the AI will drop and ignore parts of your input. (In OpenAI’s technical jargon, we are limited to 4096 “tokens,” more akin to syllables than words, but for simplicity’s sake, about 1500 words input and 1500 words output.) [NOTE: If you need to enter a newline or line break in the ChatGPT edit box, Shift-Enter will give you a new line without submitting the request.]
  • Step 5: Examine your results; adjust as needed. Just as writing means re-writing, so prompt engineering means prompting and re-prompting. I never get the best results on a first attempt, so I expect to refine a prompt until the AI is producing the data in the form I want.
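
For readers who prefer to assemble prompts in a script, Steps 4(a) and 4(b) above can be sketched in a few lines of Python. This is only a minimal sketch, not part of the workflow described in this guide: the names `build_prompt`, `within_word_budget`, and `MAX_WORDS` are hypothetical, the sample text is a placeholder, and the 1500-word figure is the rough rule of thumb from Step 4(b), not an exact token count.

```python
# Instructions from Step 4(a), copied from the PROMPT line above.
INSTRUCTIONS = (
    "Assume the role of an expert, professional genealogist. "
    "Find below the text of an obituary. "
    "Prioritize fidelity to the information below. "
    "Create a table of named relatives of the deceased. "
    "Include only explicitly named relationships. "
    "Respond in the form of a markdown table with the column headings: "
    "Deceased | Person 2 | Relationship | Evidence "
    "(where evidence is the quoted text from the article used to "
    "determine relationship)."
)

# Assumption: a rough word budget standing in for the real token limit.
MAX_WORDS = 1500


def build_prompt(source_text: str, instructions: str = INSTRUCTIONS) -> str:
    """Put the instructions first and the input text below them."""
    return f"{instructions}\n\n{source_text.strip()}"


def within_word_budget(prompt: str, max_words: int = MAX_WORDS) -> bool:
    """Return True if the whole prompt fits the approximate word budget."""
    return len(prompt.split()) <= max_words


if __name__ == "__main__":
    sample_text = "Placeholder obituary text goes here."  # hypothetical input
    prompt = build_prompt(sample_text)
    print(prompt)
    print("Fits the rough word budget:", within_word_budget(prompt))
```

Counting words is a crude stand-in for counting tokens, so treat a result close to the limit as a signal to shorten the input rather than as a guarantee that it fits.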

The results here are exactly as expected. This use case is a greater accomplishment than may be apparent to some. The imagination or understanding of how this use case will soon scale (become more powerful) is sometimes the missing piece. Here, eight relationships were extracted from a 700-word obituary; that is admittedly weak tea. But in time we will be able to process book chapters (50-page capacity announced already by OpenAI), whole books (this year or 2024), and entire archives after that. That’s the big deal that’s coming.

  • Step 6: Save your work. Save your ChatGPT session, and also save the response to a text file. You can now easily download your entire ChatGPT history. You may also want to copy-and-paste the table data to a local file; some of the formatting will be lost if you paste into a plain text file, but pasting into a Word or Google Docs file will preserve the table formatting, if that is important to you.
  • Step 7: Wring further data from the text. Named relationships are not the only data that ChatGPT can extract from a text. ChatGPT excels at FAN processing of a text (finding friends, associates, and neighbors that are mentioned in a text). People (“entities” in AI jargon) are not the only data that can be extracted. ChatGPT will also extract places, events, and dates from a text. For example, after extracting the explicit relationships from the obituary, I instructed ChatGPT to extract all named associates from the text:
PROMPT: Create a table of named associates of the deceased; broaden the meaning of associates as wide as possible to include ALL named people in the obituary if their relationship or function at funeral is stated. Respond in the form of a markdown table with the column headings: Deceased | Person 2 | Relationship.
  • Step 8: Create derivative data structures and formats. You can now instruct ChatGPT to create alternate file types such as CSV (comma-separated values) files, which are nearly universally usable by spreadsheets (Excel, Google Sheets), databases (MySQL, MS Access, AirTable), and word processors (Word, Google Docs). For the technically inclined, GPT-4 is able to convert the table data to JSON files for processing by web applications and custom scripts in languages such as Python. In the next Use Case Guide, I’ll show you step-by-step how to create a GEDCOM file, used widely to create family trees and exchange genealogical data. Here you can see how all the named associates of the deceased may quickly be downloaded as a CSV file:
  • Step 9: As a last step, ask the AI what you forgot. This is always fun, and reveals that while I may be focused on one type or piece of information, the AI may help me uncover the missing piece to solve a brick wall that was under my nose but which I’d overlooked.
PROMPT: What other genealogically relevant information might I also extract as structured data from this obituary?
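
The conversion in Step 8 can also be done locally once you have saved the markdown table from Step 6. The sketch below is one possible approach, not the method used in this guide (which asks ChatGPT itself for the CSV). It assumes the table rows begin with a pipe character and that no cell contains a literal pipe; the function names are hypothetical, and the sample rows are made-up placeholders, not data from the Royall obituary.

```python
import csv
import io
import json

# Made-up placeholder table in the shape requested by the prompts above.
SAMPLE_TABLE = """
| Deceased | Person 2 | Relationship |
|----------|----------|--------------|
| John Doe | Jane Doe | wife |
| John Doe | James Doe | son |
"""


def is_separator(cells):
    """True for the dashed row that follows a markdown table header."""
    return all(cell and set(cell) <= set("-:") for cell in cells)


def parse_markdown_table(markdown):
    """Turn a simple markdown table into a list of row dictionaries."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 2:
        return []
    header = rows[0]
    body = [cells for cells in rows[1:] if not is_separator(cells)]
    return [dict(zip(header, cells)) for cells in body]


def to_csv(records):
    """Serialize the row dictionaries as CSV text with a header line."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    records = parse_markdown_table(SAMPLE_TABLE)
    print(to_csv(records))
    print(json.dumps(records, indent=2))
```

Using the standard `csv` module rather than joining cells with commas means that names or evidence quotes containing commas are quoted correctly when the file is opened in a spreadsheet.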

Results and Analysis:

  • Expected Outcomes: By using AI for this genealogy task, you can expect the extraction of key information such as names, relationships, dates, and places from various text sources, subsequently saving the data in structured formats like table data, JSON, and CSV files. This will facilitate easier data analysis and integration into your genealogical research.
  • Accuracy and Reliability: While ChatGPT is a powerful AI model, the accuracy and reliability of the results will depend on the quality of the input data and the clarity of the information present. In most cases, ChatGPT can accurately extract relevant data points, but manual review and validation are required to confirm that the extracted information matches the source text.
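
Part of that manual review can be given a head start. Because the prompt in Step 4 asks the AI to quote its evidence, a short script can flag any row whose quoted evidence does not appear in the text you supplied. This is a hedged sketch under stated assumptions: the function names are hypothetical, the rows are assumed to be dictionaries with an "Evidence" key (matching the column heading in the prompt), and the sample text and rows are made-up placeholders. It checks only that a quote exists in the source, not that the stated relationship is correct, so it supplements rather than replaces a careful read.

```python
import re


def normalize(text):
    """Lowercase, unify curly quotes, and collapse whitespace."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_rows(records, source_text, evidence_key="Evidence"):
    """Return rows whose evidence quote is missing from the source text."""
    haystack = normalize(source_text)
    flagged = []
    for row in records:
        # Drop any quotation marks the AI wrapped around the evidence.
        quote = normalize(row.get(evidence_key, "")).strip('"').strip()
        if not quote or quote not in haystack:
            flagged.append(row)
    return flagged


if __name__ == "__main__":
    # Made-up placeholder text and rows, for illustration only.
    source = "He is survived by his wife, Mrs. Jane Doe,\nand one son, James Doe."
    rows = [
        {"Person 2": "Jane Doe", "Relationship": "wife",
         "Evidence": '"survived by his wife, Mrs. Jane Doe"'},
        {"Person 2": "Mary Doe", "Relationship": "daughter",
         "Evidence": "his daughter, Mary Doe"},
    ]
    for row in unsupported_rows(rows, source):
        print("Check by hand:", row)
```

Any row the script flags deserves a closer look: the AI may have paraphrased instead of quoting, or the relationship may not be in the text at all.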

Conclusions:

  • In conclusion, using ChatGPT for extracting structured data from narrative sources like birth and wedding announcements, obituaries, and newspaper articles offers significant benefits and some limitations. The technology has the potential to greatly enhance genealogy research by automating the extraction of names, relationships, dates, and places, saving time and effort for researchers. The ability to convert this information into structured data formats such as table data, JSON, and CSV files further streamlines the research process and facilitates data organization.
  • However, limitations need to be considered. ChatGPT’s accuracy may vary depending on the quality of the input text, especially if dealing with raw OCR text or handwritten documents. Additionally, ChatGPT might struggle with complex relationships and ambiguous information present in the narratives. To overcome these challenges, AI Genealogists need to manually review and verify the extracted data.
  • For further improvement and exploration of AI in genealogy, researchers should consider integrating ChatGPT with other natural language processing tools or specialized genealogy software to enhance its capabilities. Collaborating with AI developers to create tailored models for genealogy research could further optimize the extraction process and improve overall accuracy. Encouraging users to share their experiences and provide feedback will contribute to the ongoing refinement of AI solutions for genealogy.
  • Ultimately, employing AI tools like ChatGPT for genealogy research has the potential to revolutionize the field, making it more accessible, efficient, and accurate. As AI technology continues to evolve, the possibilities for its application in genealogy will only expand, benefiting researchers and family historians alike.

Calls to Action:

  • Try ChatGPT for your genealogy tasks: We encourage you to harness the power of ChatGPT for your genealogy research. Experience firsthand the benefits of using AI to extract structured data from narrative sources like birth and wedding announcements, obituaries, and newspaper articles.
  • Share your experiences: We would love to hear about your experiences using ChatGPT for genealogy tasks. Share your successes, challenges, or any interesting insights you’ve gained through utilizing AI in your research. Your feedback can help improve the technology and benefit the entire genealogy community.
  • Ask questions and seek advice: If you have any questions or need assistance with using ChatGPT for genealogy tasks, feel free to post them in the comments section below or reach out to us on social media. Our community of experts and fellow genealogy enthusiasts will be more than happy to help.
  • Connect with others and expand your knowledge: Join genealogy forums, social media groups, and other online communities where you can connect with others who are using AI for genealogy research. These platforms are excellent resources for sharing tips, tricks, and best practices, as well as staying up-to-date with the latest advancements in AI technology.
  • Explore additional resources: To further enhance your understanding of AI in genealogy and to make the most out of ChatGPT, check out the provided links to tutorials, support forums, and related articles. Continuously learning and staying informed will help you maximize the potential of AI in your genealogy research.
