Academic researchers, commercial vendors and volunteer interest groups have produced a vast array of online resources useful to family historians and genealogists. The quality of the websites and data contained in them varies hugely. For this discussion, I will focus on websites that offer access to digital copies of original records.
Quality has nothing to do with the total number of records, or the number of collections or data sets. Quality is unrelated to the cost of a website subscription or the motivation of the provider.
Without documentation of all the processing applied to records and the information in them, a researcher cannot assess the reliability of those records. We can’t change the imperfect state that the original records come to us in. Archivists work hard to preserve both the records themselves and the context of their creation and use, but online presentation is often performed by other parties. Digitisation, indexing, and search are just a few of the processes that happen before an online version of the record is presented. Presentation can have a profound influence on how records are perceived and the conclusions drawn from them. Consequently, transparency is an ethical obligation.
What are the most important website features? How can they be assessed?
The following, in order of importance, are essential:
- Catalogue
- Transcript quality
- Search facilities
- Browsing facilities
- Record quality
The quality of other features is also important, and they are a bonus if included. Examples include analytical tools, user data (e.g. family trees, imported sources, research notes etc.), collaborative tools and social networks. But for now, I will discuss the basic five points above.
Collections with different histories or characteristics should be assessed separately. Only the catalogue can be assessed across a whole website.
Catalogue First
Yes, I really do mean that the catalogue is more important than anything else.
Genealogists use archival material, whether in the form of original records or some kind of derivative. Genealogy websites are really a digital archive of such materials, so a genealogy website’s catalogue should share many of the features of an archival catalogue.
In her blog post ‘The Value of Archival Description, Considered’, archivist Maureen Callaghan recognizes researchers’ needs:
“getting to understand who created records, why they were created, and what they provide evidence of – really gets to the nature of research. These are the questions that historians and journalists and lawyers and all of the communities that use our collections ask – they don’t just see artifacts, they see evidence that can help them make a principled argument about what happened in the past. They want to know about reliability, authenticity, chain of custody, gaps, absences and silences.”
So, a catalogue is more than just a list of collections. Such a list might be the starting point for creating a catalogue, but it falls well short of the sophisticated database that comprises an archival catalogue. A catalogue contains information about the collections, so it serves a quite different purpose to search and browsing facilities.
A good catalogue answers questions about the website’s collections with no fuss:
- Is the catalogue complete, including collections not yet digitised and indexed, with a timescale of expected online availability? This information allows the researcher to make an informed decision between using the database and seeking the records elsewhere.
- What record collections does it contain? You want to know that relevant records are included before paying a subscription or spending precious time searching for records, don’t you?
- Where did the collections come from? Typically records come from originals in an archive, or from a publication. The barest minimum information for archival material is the archive and the archive reference, and for published information, the bibliographic reference. That allows the researcher to check the archive’s own catalogue or a bibliographic catalogue.
- How do the collections relate to one another? Logical groupings of record sets by record type reflect the original function of the records, whilst groupings by creator reflect the history or provenance of the records. Both are important for understanding how the records can be used. Were several types of record created by a particular process? For example, collection of taxes involved assessment of liability, a record of payments, and penalties for late or non-payment.
- What is the structure of the data set? How is the record set arranged? Is it by date, person or something else?
- Is each collection or record set complete?
- What is the extent of each collection and record set? How many sub-sets, how many records in each?
- Does the catalogue entry describe the records? Is a brief history of the original records’ creation and provenance included? Are the digital records images of the originals, or derived from microfilm or a transcript?
- What information do the records typically contain?
- Is a scholarly work on the record type referenced, or a critique on the strengths and weaknesses of the records included?
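The questions above can be read as the fields of a catalogue record. The following is only a rough sketch in Python of how such an entry might be modelled and checked for gaps. The class name, field names and sample values are all invented for illustration and do not come from any real website’s catalogue.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CatalogueEntry:
    """One collection or record set, described by hypothetical fields."""
    title: str
    record_type: str                      # original function, e.g. "baptism register"
    creator: str                          # body that created the original records
    source_archive: Optional[str] = None  # archive holding the originals
    archive_reference: Optional[str] = None
    derived_from: Optional[str] = None    # "original", "microfilm" or "transcript"
    arrangement: Optional[str] = None     # e.g. "by date" or "by person"
    record_count: Optional[int] = None
    complete: Optional[bool] = None       # None means the catalogue does not say
    description: str = ""
    further_reading: list = field(default_factory=list)


def undocumented(entry: CatalogueEntry) -> list:
    """Return the names of the fields that the catalogue entry leaves blank."""
    missing = []
    for name, value in vars(entry).items():
        if value is None or value == "" or value == []:
            missing.append(name)
    return missing


# An invented entry, for illustration only.
example = CatalogueEntry(
    title="Example parish baptisms",
    record_type="baptism register",
    creator="Example parish",
    derived_from="microfilm",
    record_count=1200,
)
print(undocumented(example))
```

A reviewer could fill in one such entry per record set while reading a website’s catalogue; the list of blank fields is then a quick summary of what the catalogue fails to document.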
Transcript Quality
Transcription transforms manuscript and typescript documents into computer-readable text, which is essential for creating searchable records. In evaluating the quality of computerized records, consider whether the website documents the following:
- The completeness of the transcript. A complete transcript captures the most information so is far more useful than an abstract or an index.
- Accuracy of transcription is influenced by how it was produced. Optical character recognition (OCR) is commonly used for typescript. Human data entry of manuscript or handwritten material depends on palaeographic and keyboard skills. Typically, OCR and unskilled data entry yield less accurate transcripts.
- Checking procedures should detect obvious gobbledygook, and common OCR and data entry errors. Double data entry produces a consensus interpretation, but may not avoid common reading errors.
- Have error rates been assessed?
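One way an error rate could be assessed is to compare a sample of the transcript against a carefully checked reference reading and count the character edits needed to reconcile them. The sketch below, in Python, uses the standard Levenshtein edit distance for this and adds a simple comparison of two independent keyings, as in double data entry. The function names and sample strings are assumptions made for illustration, not a description of any supplier’s actual procedure.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, transcript: str) -> float:
    """Edits separating transcript from reference, per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, transcript) / len(reference)


def disagreements(first_keying: list, second_keying: list) -> list:
    """Positions where two independent keyings of the same fields differ."""
    pairs = zip(first_keying, second_keying)
    return [i for i, (x, y) in enumerate(pairs) if x != y]


# Invented sample data, for illustration only.
reference = "William Smyth, labourer"
transcript = "Wiliam Smith, labourer"
print(edit_distance(reference, transcript))
print(round(character_error_rate(reference, transcript), 3))
print(disagreements(["Smyth", "1841", "Leeds"], ["Smith", "1841", "Leeds"]))
```

Fields on which two keyings agree are not thereby proven correct, since both operators may make the same reading error; the comparison only flags where a third look is needed.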
Search Facilities
Good search rests on an accurate transcript, not algorithms or user added ‘corrections’. Repeatable search, essential for confidence in the validity of results, requires a complete data set and stable search methods. Search is not a simple operation, so inexperienced users need coaching and encouragement, not dumbed-down, limited functionality. Consider the website’s documentation and functionality of the following:
- Targeted search on individual data sets and collections as default. Choosing which collection or collections to search first is much more efficient than filtering out irrelevant collections.
- Search on all data items in the record. This requires a complete transcript.
- Full text search, the ability to search everywhere in the record, also requires a complete transcript.
- Complex search, the ability to specify ‘AND’, ‘OR’ and other operators.
- Name matching algorithm choice. Examples include Soundex and Metaphone, which perform phonetic matching for English-language names, and Daitch-Mokotoff, which is adapted for Slavic and German spellings of Jewish names.
- Date ranges. Can start and end dates be specified, or a central date with a tolerance either side?
- Are place searches restricted to place names? Is a proximity search based on distance included?
- Wildcards, replacement characters in the search term that stand in for unknown possibilities e.g. Sm*th returns Smith, Smyth.
- Separation of transcribed values and interpreted values, with options to search on either or a combination. For example, the abbreviation ‘Wm’ or Latinised ‘Gulielmus’ can be interpreted as William. Standardised interpretations are known to librarians and archivists as authority control (http://en.wikipedia.org/wiki/Authority_control). User added ‘corrections’ are another kind of interpreted value.
- Filtering of search results.
- Optional ‘sticky’ settings and well-chosen default settings.
- Result presentation. Is it simple and clear, and does it contain the information important to you? Does it include the search terms used?
- Result sorting on data fields chosen by the user. What does ‘relevance’ mean?
- Result export in a variety of formats, ready for use with software tools of your choice.
- Logged searches that document research activity.
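Several of the search features listed above can be illustrated in a few lines of Python. The sketch below shows a simple implementation of classic American Soundex, wildcard matching using the standard library’s fnmatch module, a tiny authority list that separates transcribed from interpreted forenames, and a date range expressed as a central year with a tolerance. The authority list, the records and the function names are invented for illustration, and no real website is claimed to work this way.

```python
from fnmatch import fnmatchcase

# Soundex digit for each coded consonant; vowels, H, W and Y have no code.
SOUNDEX_CODES = {
    letter: digit
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    for letter in letters
}

# Hypothetical authority list: transcribed form -> standardised form.
AUTHORITY = {"wm": "William", "gulielmus": "William"}


def soundex(name: str) -> str:
    """Classic four-character Soundex code for an unaccented name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]


def wildcard_match(pattern: str, value: str) -> bool:
    """Case-insensitive match where * stands for any run of characters."""
    return fnmatchcase(value.lower(), pattern.lower())


def interpret(transcribed: str) -> str:
    """Standardised form of a forename, or the transcribed form if unknown."""
    return AUTHORITY.get(transcribed.lower().rstrip("."), transcribed)


def in_date_range(year: int, centre: int, tolerance: int) -> bool:
    """True if year lies within tolerance years either side of centre."""
    return centre - tolerance <= year <= centre + tolerance


# Invented records: the forename is stored exactly as transcribed.
records = [
    {"forename": "Wm", "surname": "Smyth", "year": 1841},
    {"forename": "Gulielmus", "surname": "Smith", "year": 1790},
    {"forename": "William", "surname": "Smithson", "year": 1843},
]

hits = [
    r for r in records
    if interpret(r["forename"]) == "William"
    and wildcard_match("Sm*th", r["surname"])
    and in_date_range(r["year"], centre=1840, tolerance=5)
]
print(hits)
print(soundex("Smith"), soundex("Smyth"))
```

Keeping the transcribed forename in the record and applying the interpretation only at search time means the user can still choose to search on exactly what was written.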
Browsing Facilities
Browsing is a tool for examining records in the context of the record set. It should replicate the experience of turning the pages of the original. The order of digital images should exactly follow the order of the original pages. The structure of the record set and relationships between individual records contain subtle information about the creation and use of the original.
Browsing does not replace search. When browsing has to be used as a last resort because search has failed, that is an indicator of poor search or transcription.
Record Quality
Original records are typically presented as digital images. Genealogists use the most original source available so they can be confident that the information is as reliable as possible. A digital image is not the same as the original, but can come acceptably close, provided that the following are present:
- Good image quality that is legible. Sharp focus, resolution, colour accuracy, and contrast all contribute to legibility. Digital image file types vary in the degree of data compression, which influences image quality.
- Information that identifies the record portrayed, included in the image file. That means all the information that you want in your citation, such as the archive and archive reference of the original, page number, record identifier, person of interest etc. A meaningful file name is helpful, as is human-readable text added to the image, but enough detail makes for an unreasonably long file name. Potentially most useful is citation information embedded in the image file metadata, which is computer readable.
- Technical camera or scanner metadata provides provenance of the image, including whether it has been modified.
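To show what computer-readable citation information embedded in an image file can look like, here is a small Python sketch that writes and reads text chunks in a PNG file using only the standard library. PNG is chosen here purely because its text chunks are simple to handle without extra software; images on genealogy websites may well use other formats and metadata schemes such as EXIF, IPTC or XMP. The citation values are invented, and the function names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, then CRC of type and data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def tiny_png_with_text(text_items: dict) -> bytes:
    """Return a 1x1 greyscale PNG carrying the given keyword/text pairs."""
    header = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    pixels = zlib.compress(b"\x00\x00")  # filter byte plus one black pixel
    text_chunks = b"".join(
        chunk(b"tEXt", key.encode("latin-1") + b"\x00" + value.encode("latin-1"))
        for key, value in text_items.items()
    )
    return (PNG_SIGNATURE + chunk(b"IHDR", header) + text_chunks
            + chunk(b"IDAT", pixels) + chunk(b"IEND", b""))


def read_text_chunks(png: bytes) -> dict:
    """Collect the keyword/text pairs from every tEXt chunk in a PNG."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found = {}
    position = len(PNG_SIGNATURE)
    while position < len(png):
        (length,) = struct.unpack(">I", png[position:position + 4])
        kind = png[position + 4:position + 8]
        data = png[position + 8:position + 8 + length]
        if kind == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
        position += 12 + length  # 4 length + 4 type + data + 4 CRC
    return found


# Invented citation details, for illustration only.
citation = {
    "Source": "Example Record Office, ref. EX/1/2, page 14",
    "Description": "Baptism register, Example parish, 1841",
}
image_bytes = tiny_png_with_text(citation)
print(read_text_chunks(image_bytes))
```

Because the citation travels inside the file, it survives renaming and copying in a way that a descriptive file name does not.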
Your Challenge – Review one data set
There is a lot to consider in assessing the quality of genealogy websites and the data they contain. Of course, we want all the features mentioned above in a user-friendly package, but I think there is quite enough to start with above. Have I omitted anything vital? Do you agree with these criteria?
Before we can hold genealogy data suppliers accountable, we need to fairly assess whether what they offer is of sufficiently good quality for our purposes. What constitutes ‘fit for purpose’ is open to debate. I think genealogy data consumers would benefit from setting expectations and demanding quality, and that suppliers would benefit greatly from carefully considered feedback.
In the interest of collaboration between suppliers and consumers, I challenge you to review one data set using these criteria.