Topic: Future of recording and recording software

This is a rather grandiose title for what is a relatively simple question about how we record and use recording software.

Working with LRC data as part of the offsetting project I am involved with reminded me of some of the issues of acquiring data from disparate sources, i.e. certain things that could be standardised are actually recorded in different ways. This made me wonder whether there is a case for trying to standardise the recording process a little more. The rationale would be that recorders up and down the country could be offered a standard to conform to, should they wish, that makes it more efficient to combine and compare their data. This would be done by working towards standardising online recording systems, as well as the database software that replaces Recorder 6 (assuming that happens).

Here's an example of what I am talking about.

A record of house sparrows for London is recorded as "about 20" in the abundance field in iRecord.  A similar record for house sparrows for Worcester is recorded by a botanist surveying a local wildlife site, on paper, as simply "17".  In Sunderland, a school recording scheme has decided to record abundances in categories, and the flock of house sparrows was recorded as "20-25".  At St Andrew's University, a researcher recorded some house sparrows as "c.20".

The abundance of house sparrows could be a very important piece of data. However, analysing it could be made more difficult by the fact that in these examples it has been recorded in different ways. Could this be standardised? Would recorders want this? Remember, a standard is an option that people can use to ensure their work conforms, if they want it to, to known parameters. A definition might be "something considered by an authority or by general consent as a basis of comparison; an approved model" (Dictionary.com).

Another simple example might be a blackbird egg recorded as "egg" by one person, "egg shell" by another and "ovum" by a third. This may seem a trivial example, but I believe that there are significant real-world examples out there which have cost people time, or at worst meant that biodiversity data wasn't used for certain purposes.

Perhaps what we are talking about is a "controlled vocabulary", rather than a "standard", which comes with certain connotations. It may be difficult to achieve this, but that doesn't necessarily mean it's not worth trying. Perhaps Darwin Core is also trying to achieve this, in which case do we need to assimilate it more fully into the systems we use?
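To make the "controlled vocabulary" idea a bit more concrete, here is a minimal Python sketch. It assumes a hand-made lookup table in which each as-entered term points to one preferred term. The table contents and the names PREFERRED_TERM and to_preferred are made up for illustration and are not taken from Darwin Core or from any existing recording system.

```python
from typing import Optional

# Hypothetical lookup table: as-entered term -> preferred term.
# Whether "egg shell" really belongs with "egg" is a decision for the
# people maintaining the vocabulary; it is assumed here for illustration.
PREFERRED_TERM = {
    "egg": "egg",
    "egg shell": "egg",
    "eggshell": "egg",
    "ovum": "egg",
}


def to_preferred(term: str) -> Optional[str]:
    """Return the preferred term for an as-entered value, or None if unknown."""
    # Lower-case and collapse runs of whitespace so " Egg  Shell " still matches.
    key = " ".join(term.lower().split())
    return PREFERRED_TERM.get(key)


for entered in ["egg", "Egg shell", "ovum", "nest"]:
    print(entered, "->", to_preferred(entered))
```

Returning None for unknown terms, rather than guessing, keeps unmapped values visible so that someone can decide where they belong.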

As a first step in assessing whether this problem is real or imagined, I would appreciate it if LRCs who have the time could send me spreadsheets of the distinct entries in their "sex / stage" fields. This should be a relatively easy query (something like a "count" for the relevant field). Perhaps it could be done for one taxonomic group (e.g. Lepidoptera) so it's not too huge. Perhaps a similar query could be done for abundances, maybe for birds over the last five years or so. I will then compare the results to see if there are examples where the same thing has been recorded in different ways. We may then want to discuss this further and see whether this is a problem and, if so, whether it can be addressed.
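For anyone working from an exported table rather than a database query, the kind of count I mean could be sketched in Python roughly as follows. The field name "sex_stage" and the sample rows are hypothetical; a real export will have its own column names.

```python
from collections import Counter


def count_field_values(records, field):
    """Count how often each distinct as-entered value appears in `field`."""
    # Values are deliberately left untouched (no lower-casing or trimming)
    # so that variants such as "adult" and "Adult" show up separately.
    return Counter(record.get(field, "") for record in records)


# Hypothetical sample rows standing in for an LRC export.
records = [
    {"taxon": "Pieris rapae", "sex_stage": "adult"},
    {"taxon": "Pieris rapae", "sex_stage": "Adult"},
    {"taxon": "Aglais io", "sex_stage": "larva"},
    {"taxon": "Aglais io", "sex_stage": "caterpillar"},
    {"taxon": "Aglais io", "sex_stage": "adult"},
]

counts = count_field_values(records, "sex_stage")
for value, n in counts.most_common():
    print(f"{value!r}: {n}")
```

The resulting list of distinct values and counts is all that is needed for the comparison; the individual records themselves would not have to be shared.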

Many thanks,

Tom Hunt - ALERC National Coordinator

Re: Future of recording and recording software

Hi Tom

I think this is a very worthwhile discussion to have and I suspect that there are cases where survey designers could adopt standardised attributes to make analysis of collations simpler. However, I'm also aware that any proposed standardisation would need to be more widespread than just the recording systems we use and would have to be adopted at the field recording level. Many Indicia survey forms are designed to copy existing paper forms or survey methodologies and it is quite key to usability that they continue to do this. I suspect that we will always have to deal with data from citizen science surveys, smartphone apps and the like where we can't have much say in what information is captured with each record.

We can also solve some of these issues at a technical level without changing the way people work in the field. A good example where we already do this is map references: we can accept all sorts of different notations and coordinate systems for data entry, but internally we translate them to a common geometry format and can convert that back to any other notation or coordinate system. So, analysis across different datasets is not a problem.

For the handling of differing terms (egg|egg shell|ovum), the answer is to use a thesaurus data model to allow translation between terms. This allows the system to include records of ovum in requests for eggs. It gets more complex when terms only partially overlap or contain each other. Should a query for all records of hives return all records of nests, for example? We humans might say that it should, but only for records of social bees (avoiding technicalities here - but you get the point). These sorts of inferences are pretty easy for humans but pretty difficult for computers, though not impossible. This technology actually exists in the data models used by both Indicia and Recorder 6, so the problem is not really at the technical end. Rather, the difficulty is that to make use of a thesaurus data model you need to populate the thesaurus and all its relationships, which is a pretty huge task.

Handling of numerical values is much easier though. For example, we could quite easily map the attribute values as entered to an internal machine-readable format which defines the lower and upper ends of the range, plus a flag if the values are approximate.
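As a rough sketch of the numerical case (not how Indicia or Recorder 6 actually do it), the Python below turns the four house sparrow entries from Tom's post into a lower bound, an upper bound and an "approximate" flag. The names Abundance and parse_abundance, and the list of prefixes treated as meaning "approximately", are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class Abundance(NamedTuple):
    lower: int         # lowest count consistent with the entry
    upper: int         # highest count consistent with the entry
    approximate: bool  # True if the entry was marked as a rough estimate


# Prefixes treated as "approximately"; this list is an assumption, not a standard.
_APPROX = re.compile(r"^(?:about|approx\.?|circa|ca?\.?|~)\s*", re.IGNORECASE)
# A range such as "20-25" or "20 to 25".
_RANGE = re.compile(r"^(\d+)\s*(?:-|to)\s*(\d+)$", re.IGNORECASE)
# A single whole number such as "17".
_SINGLE = re.compile(r"^(\d+)$")


def parse_abundance(text: str) -> Optional[Abundance]:
    """Parse an as-entered abundance; return None if it is not understood."""
    value = text.strip()

    # Strip a leading "about", "c." and so on, remembering that it was there.
    prefix = _APPROX.match(value)
    approximate = prefix is not None
    if prefix:
        value = value[prefix.end():]

    found = _RANGE.match(value)
    if found:
        a, b = int(found.group(1)), int(found.group(2))
        return Abundance(min(a, b), max(a, b), approximate)

    found = _SINGLE.match(value)
    if found:
        n = int(found.group(1))
        return Abundance(n, n, approximate)

    return None  # e.g. "several" or "present": leave for a human to look at


for entered in ["about 20", "17", "20-25", "c.20"]:
    print(entered, "->", parse_abundance(entered))
```

Note that "about 20" is stored here as 20 to 20 with the approximate flag set, rather than being widened to some guessed range. How wide "about" should be is a judgement best left to whoever analyses the data.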

I guess the likelihood is that the solution is a combination of these things. 

Best wishes

John

John van Breda
Biodiverse IT