Tuesday 22 April 2014

Criteria for Assessing the Quality of Genealogy Websites and Online Data

Academic researchers, commercial vendors and volunteer interest groups have produced a vast array of online resources useful to family historians and genealogists.  The quality of the websites and data contained in them varies hugely.  For this discussion, I will focus on websites that offer access to digital copies of original records.

Quality has nothing to do with the total number of records, or the number of collections or data sets.  Quality is unrelated to the cost of a website subscription or the motivation of the provider.

Without documentation of all the processing of records and the information in them, a researcher cannot assess the reliability of records. We can’t change the imperfect state that the original records come to us in. Archivists work hard to preserve both the records themselves and the context of their creation and use, but online presentation is often performed by other parties. Digitisation, indexing, and search are just a few of the processes that happen before an online version of the record is presented. Presentation can have a profound influence on how records are perceived and the conclusions drawn from them. Consequently, transparency is an ethical obligation.

What are the most important website features? How can they be assessed?

The following, in order of importance, are essential:
  1. Catalogue
  2. Transcript quality
  3. Search facilities
  4. Browsing facilities
  5. Record quality
The quality of other features is also important, and they are a bonus if included.  Examples include analytical tools, user data (e.g. family trees, imported sources, research notes etc.), collaborative tools and social networks. But for now, I will discuss the basic five points above.

Collections with different histories or characteristics should be assessed separately. Only the catalogue can be assessed across a whole website.

Catalogue First

Yes, I really do mean that the catalogue is more important than anything else.

Genealogists use archival material, whether in the form of original records or some kind of derivative.  Genealogy websites are really a digital archive of such materials, so a genealogy website’s catalogue should share many of the features of an archival catalogue.

In her blog post The Value of Archival Description, Considered, archivist Maureen Callaghan recognizes researchers’ needs:
“getting to understand who created records, why they were created, and what they provide evidence of – really gets to the nature of research. These are the questions that historians and journalists and lawyers and all of the communities that use our collections ask – they don’t just see artifacts, they see evidence that can help them make a principled argument about what happened in the past. They want to know about reliability, authenticity, chain of custody, gaps, absences and silences.”

So, a catalogue is more than just a list of collections. Such a list might be the starting point for creating a catalogue, but falls well short of the sophisticated database that comprises an archival catalogue. It contains information about the collections, so serves a quite different purpose to search and browsing facilities.

A good catalogue answers questions about the website’s collections with no fuss:
  1. Is the catalogue complete, including collections not yet digitised and indexed, with a timescale of expected online availability? This information allows the researcher to make informed decisions about using the database or seeking the records elsewhere.
  2. What record collections does it contain? You want to know that relevant records are included before paying a subscription or spending precious time searching for records, don’t you?
  3. Where did the collections come from? Typically records come from originals in an archive, or a publication. The barest minimum information for archival material is the archive and the archive reference, and for published information, the bibliographic reference. That allows the researcher to check the archive’s or bibliographic catalogues.
  4. How do the collections relate to one another? Logical groupings of record sets by record type reflect the original function of the records, whilst groupings by creator reflect the history or provenance of the records. Both are important for understanding how the records can be used. Were several types of record created by a particular process? For example, the collection of taxes involved assessment of liability, a record of payments, and penalties for late payment or non-payment.
  5. What is the structure of the data set?  How is the record set arranged? Is it by date, person or something else?
  6. Is each collection or record set complete?
  7. What is the extent of each collection and record set?  How many sub-sets, how many records in each?
  8. Does the catalogue entry describe the records? Is a brief history of the original records’ creation and provenance included? Are the digital records images of the originals, or derived from microfilm or a transcript?
  9. What information do the records typically contain?
  10. Is a scholarly work on the record type referenced, or a critique on the strengths and weaknesses of the records included?
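To make the checklist above concrete, here is a minimal sketch of how a single catalogue entry might be modelled, with a helper that lists the questions a website leaves unanswered. The class, the field names and the sample values are all hypothetical illustrations; they are not taken from any real website or archival standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class CatalogueEntry:
    """One collection in a (hypothetical) genealogy website catalogue."""
    title: str
    holding_archive: Optional[str] = None    # where the originals are held
    archive_reference: Optional[str] = None  # the archive's own reference
    record_type: Optional[str] = None        # original function of the records
    creator: Optional[str] = None            # provenance: who created them
    arrangement: Optional[str] = None        # by date, person, or something else
    is_complete: Optional[bool] = None       # is the record set complete?
    record_count: Optional[int] = None       # extent of the record set
    image_source: Optional[str] = None       # original, microfilm or transcript
    typical_contents: Optional[str] = None   # what the records usually contain
    further_reading: Optional[str] = None    # scholarly work or critique


def unanswered(entry: CatalogueEntry) -> List[str]:
    """Return the names of the fields the catalogue entry leaves blank."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) is None]


# Made-up example: an entry that names its source but says little else.
entry = CatalogueEntry(
    title="Example parish registers",
    holding_archive="Example Record Office",
    archive_reference="EX/1",
)
print(unanswered(entry))
```

A reviewer could fill in one such entry per collection; the longer the list of blanks, the less the catalogue tells you.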

Transcript Quality

Transcription transforms manuscript and typescript documents into computer readable text, essential for creating searchable records. In evaluating the quality of computerized records consider if the website documents the following:
  1. The completeness of the transcript.  A complete transcript captures the most information so is far more useful than an abstract or an index. 
  2. Accuracy of transcription is influenced by how it was produced.  Optical character recognition (OCR) is commonly used for typescript.  Human data entry of manuscript or handwritten material depends on palaeographic and keyboard skills.  Typically, OCR and unskilled data entry yield less accurate transcripts.
  3. Checking procedures should detect obvious gobbledygook, and common OCR and data entry errors.  Double data entry produces a consensus interpretation, but may not avoid common reading errors.
  4. Have error rates been assessed?
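As a rough illustration of the last two points, the sketch below compares two independent keyings of the same record and measures a character error rate against a checked reference reading. The function names and sample values are my own assumptions, and edit distance is only one of several ways an error rate could be defined.

```python
from typing import Dict, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(transcript: str, reference: str) -> float:
    """Edit distance divided by the length of the checked reference."""
    if not reference:
        raise ValueError("reference reading must not be empty")
    return edit_distance(transcript, reference) / len(reference)


def double_entry_disagreements(
    first: Dict[str, str], second: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Fields where two independent keyings differ and need a third look.
    Agreement is not proof: both keyers can share the same misreading."""
    return {
        field: (first.get(field), second.get(field))
        for field in sorted(set(first) | set(second))
        if first.get(field) != second.get(field)
    }


# Made-up sample: two keyings of one entry.
keying_a = {"surname": "Smith", "year": "1841"}
keying_b = {"surname": "Smyth", "year": "1841"}
print(double_entry_disagreements(keying_a, keying_b))
print(character_error_rate("Smyth", "Smith"))
```

A website that publishes figures of this kind, along with how the reference readings were chosen, gives the researcher something to judge accuracy by.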

Search Facilities

Good search rests on an accurate transcript, not algorithms or user added ‘corrections’.  Repeatable search, essential for confidence in the validity of results, requires a complete data set and stable search methods.  Search is not a simple operation, so inexperienced users need coaching and encouragement, not dumbed-down, limited functionality.  Consider the website’s documentation and functionality of the following:
  1. Targeted search on individual data sets and collections as default. Choosing which collection or collections to search first is much more efficient than filtering out irrelevant collections.
  2. Search on all data items in the record.  This requires a complete transcript.
  3. Full text search, the ability to search everywhere in the record, also requires a complete transcript.
  4. Complex search, the ability to specify ‘AND’, ‘OR’ and other operators.
  5. Name matching algorithm choice.  Examples include soundex and metaphone, which perform phonetic matching for English-language names, and Daitch-Mokotoff, which is adapted for Slavic and German spellings of Jewish names.
  6. Date ranges. Can start and end dates be specified, or a central date with a margin of accuracy?
  7. Are place searches restricted to place names? Is a proximity search based on distance included?
  8. Wildcards, replacement characters in the search term that stand in for unknown possibilities e.g. Sm*th returns Smith, Smyth.
  9. Separation of transcribed values and interpreted values with options to search on either or a combination. For example, the abbreviation ‘Wm’ or Latinised ‘Gulielmus’ can be interpreted as William. Standardised interpretations are known to librarians and archivists as authority control http://en.wikipedia.org/wiki/Authority_control . User added ‘corrections’ are another kind of interpreted value.
  10. Filtering of search results.
  11. Optional ‘sticky’ settings and well-chosen default settings.
  12. Result presentation.  Is it simple and clear, and does it contain the information important to you?  Does it include the search terms used?
  13. Result sorting on data fields chosen by the user. What does ‘relevance’ mean?
  14. Result export in a variety of formats, ready for use with software tools of your choice.
  15. Logged searches that document research activity.
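Three of the features above (phonetic name matching, wildcards, and a central date with a margin) can be sketched in a few lines. This is a minimal illustration under my own assumptions, not a description of how any website implements search: the record layout, the `search` function and the sample data are invented, and the Soundex routine follows the common American Soundex rules.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional

# Soundex digit for each coded consonant; vowels, H, W and Y have no code.
_CODES = {
    letter: digit
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6"))
    for letter in letters
}


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, e.g. S530."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]


def search(records: List[Dict], surname_pattern: Optional[str] = None,
           sounds_like: Optional[str] = None, year: Optional[int] = None,
           plus_minus: int = 0) -> List[Dict]:
    """Return records that satisfy every criterion supplied (logical AND)."""
    hits = []
    for record in records:
        surname = record["surname"]
        if surname_pattern and not fnmatchcase(surname.lower(),
                                               surname_pattern.lower()):
            continue  # wildcard: * stands for any run of characters
        if sounds_like and soundex(surname) != soundex(sounds_like):
            continue
        if year is not None and abs(record["year"] - year) > plus_minus:
            continue  # outside the central year plus or minus the margin
        hits.append(record)
    return hits


# Made-up sample records.
records = [
    {"surname": "Smith", "year": 1944},
    {"surname": "Smyth", "year": 1929},
    {"surname": "Smythe", "year": 1946},
    {"surname": "Jones", "year": 1945},
]
print(search(records, surname_pattern="Sm*th", year=1945, plus_minus=2))
print(search(records, sounds_like="Smith"))
```

Because every criterion is explicit and the data set is fixed, running the same search twice gives the same result, which is the repeatability asked for above.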

Browsing Facilities

Browsing is a tool for examining records in the context of the record set.  It should replicate the experience of turning the pages of the original.  The order of digital images should exactly follow the order of the original pages.  The structure of the record set and relationships between individual records contain subtle information about the creation and use of the original.

Browsing does not replace search. When used as a last resort when search fails, it is an indicator of poor search or transcription.

Record Quality

Original records are typically presented as digital images. Genealogists use the most original source available so they can be confident that the information is as reliable as possible. A digital image is not the same as the original, but can come acceptably close, provided that the following conditions are met:
  1. The image quality is good enough to be legible.  Sharp focus, resolution, colour accuracy, and contrast all contribute to legibility. Digital image file types vary in the degree of data compression, which influences image quality.
  2. Information that identifies the record portrayed is included in the image file. That means all the information that you want in your citation, such as the archive and archive reference of the original, page number, record identifier, person of interest etc. A meaningful file name is helpful, as is human readable text added to the image, but enough detail makes for an unreasonably long file name. Potentially most useful is embedded citation information in the image file metadata, which is computer readable.
  3. Technical camera or scanner metadata provides provenance of the image, including whether it has been modified.
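To show what computer readable citation information embedded in an image file can look like, here is a minimal sketch that writes citation fields into the text chunks of a PNG file and reads them back, using only the Python standard library. The keywords (Archive, Reference, Page) and their values are hypothetical; I am not aware of an agreed genealogical convention for them, and real images from a supplier may use a different file type or metadata scheme altogether.

```python
import struct
import zlib
from typing import Dict

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, then CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def tiny_png_with_citation(citation: Dict[str, str]) -> bytes:
    """Return a 1x1 greyscale PNG with each citation field in a tEXt chunk."""
    # Width 1, height 1, 8-bit greyscale, default compression/filter/interlace.
    header = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    parts = [PNG_SIGNATURE, chunk(b"IHDR", header)]
    for keyword, text in citation.items():
        payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
        parts.append(chunk(b"tEXt", payload))
    parts.append(chunk(b"IDAT", zlib.compress(b"\x00\x00")))  # filter byte + pixel
    parts.append(chunk(b"IEND", b""))
    return b"".join(parts)


def read_citation(png: bytes) -> Dict[str, str]:
    """Collect every tEXt keyword and value found in the PNG bytes."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    found = {}
    position = 8
    while position < len(png):
        (length,) = struct.unpack(">I", png[position:position + 4])
        kind = png[position + 4:position + 8]
        data = png[position + 8:position + 8 + length]
        if kind == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
        position += 12 + length  # 4 length + 4 type + data + 4 CRC
    return found


# Made-up citation details for illustration only.
citation = {"Archive": "Example Record Office", "Reference": "EX/1/2", "Page": "17"}
image_bytes = tiny_png_with_citation(citation)
print(read_citation(image_bytes))
```

The point of the sketch is that the citation travels with the image: rename or move the file and the archive reference is still inside it.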

Your Challenge – Review one data set

There is a lot to consider in assessing the quality of genealogy websites and the data they contain. Of course, we want all the features mentioned above in a user-friendly package, but I think there is quite enough to start with above. Have I omitted anything vital? Do you agree with these criteria?

Before we can hold genealogy data suppliers accountable, we need to fairly assess whether what they offer is of sufficiently good quality for our purposes.  What constitutes ‘fit for purpose’ is open to debate.  I think genealogy data consumers would benefit from setting expectations and demanding quality, and that suppliers would benefit greatly from carefully considered feedback.

In the interest of collaboration between suppliers and consumers, I challenge you to review one data set using these criteria.

12 comments:

  1. I wish certain commercial Web sites were taking this in Sue. The origin and nature of the online data is very important, and I know you've made the point before, but even when the Web site helps you to cite the records (which is rare), how easy is it to start with a citation and find them again? FMP has to be a case-in-point here as some of their records have proved impossible to find following their "upgrade" -- irrespective of what citation you have for them.

    Replies
    1. Tony I was using the images of the Lincolnshire parish records. They have disappeared. However I did read that they are to return and at least one other similar database. They really did not think things through before the move.

    2. They don't appear to know much about real genealogy Hilary. On a general note, though, I feel that many of these commercial sites (I deliberately emphasise that to distinguish this point from 'archives', which in turn muddied the water when Sue posted to the APG Members group) don't appreciate the significance of a record's condition and provenance, or of the importance in being able to cite and reliably return to it. Call me sceptical but it feels like a consequence of their "tree fodder" attitude to commercial "genealogy".

  2. A great post Sue! I will tweet it to share around.

    Replies
    1. Thanks for spreading the word. How about doing a review using the criteria?

  3. Very thorough - how can we get commercial genealogy sites to follow these guidelines? An example is the new FindmyPast site where searching for a bride's birth year in a 1969 marriage I was offered same name but birth in 1929, despite having entered a number of far more reasonable birth ranges. Correct person came up in seconds on Free BMD. Fmp offered to search themselves, and came up with the 1929 person also, so it isn't my ignorance of how to operate their search facility.

  4. I agree. The big providers give too much emphasis to their 'I've got more records than you battle'. They should be concentrating on making the records they do have easily accessible with searches that can be conducted in a way that researchers actually search, not using methods that computer programmers THINK we require. For my thoughts on Internet Genealogy see http://thehistoryinterpreter.wordpress.com/2014/04/10/i-is-for-internet-genealogy-is-this-progress/.

  5. Publish a review using the guidelines yourself. That gives the website feedback and allows you to critically evaluate what you are looking at. Would it help if reviews were collected in one place? What would help you write a review?

  6. A very thoughtful post that should guide our research efforts.

  7. Sue,

    Great post. I agree with you.

    I would add three things as a pre-step one.

    What is my reason for going to that website?
    What is my genealogical Question?
    Does the record I want exist?

    The first two questions are interchangeable, meaning the order isn't as important as the fact that the questions are asked.

    Before I do step one, I look to see if they have a Wiki. I start at the Wiki to start answering that 3rd question, then I can go to the Card Catalog which will help me answer that 3rd question. I MAY not go to step one, if the Wiki tells me that the website doesn't have the records I am looking for. I may look at the Card Catalog for grins and giggles in case I had the wrong record group in mind.

    The best example that comes to mind, for me, is the 1890 US Census for the State of New Jersey. Using the Wiki told me that there was a specific record group that would cover the 1890 time frame, and it's NOT City Directories.

    Only a thought. But I do a little homework before I hit the Catalog

    Russ

  8. An interesting post. I'll review a record set when I have more free time, but for now I've just done a few 'test searches' in FindMyPast's British census records. I found more information (and in much less time) than when I did the searches using the old format.

  9. Interesting post. I came via the A-Z and hadn't given much thought to the quality of genealogy websites, as I've never done a genealogy search. But as a writer, I find it fascinating and was already thinking ...a search could provide a great backstory for one of my characters.


Hello, thanks for leaving a comment on the World Wide Genealogy Blog. All comments are moderated because of pesky spammers!

Best wishes
World Wide Genealogy Team