Uploading Data to Google Base

There are two basic approaches to uploading data to Google Base (GB):

  • Per item, where an insert request is sent to GB for each item
  • Bulk upload, where a file containing multiple items is uploaded to GB

The per-item approach works well for relatively small collections of records (fewer than a few thousand). The main problem is the overhead of the web connections that must be made (at least one per item, limited to at most 5 per second). Testing indicates that the per-item approach is useful for maintenance tasks, but not effective for uploading thousands of records.
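As a rough illustration of this overhead, the lower bound on per-item upload time follows directly from the rate limit quoted above. The sketch below assumes exactly one request per item unless told otherwise, and ignores network latency and retries; the function name is purely illustrative.

```python
MAX_REQUESTS_PER_SECOND = 5  # rate limit quoted in the text above


def min_upload_seconds(item_count, requests_per_item=1):
    """Lower bound on upload time, ignoring latency and retries."""
    total_requests = item_count * requests_per_item
    return total_requests / MAX_REQUESTS_PER_SECOND


if __name__ == "__main__":
    for n in (500, 5000, 50000):
        minutes = min_upload_seconds(n) / 60
        print(f"{n} items: at least {minutes:.1f} minutes")
```

Actual times will be longer, since each request also pays connection and response overhead.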

The bulk upload approach is fairly straightforward, and several different file formats may be used for the upload. Atom 1.0 is probably the most convenient format for our purposes.

Requirements For Data Upload

The following are the basic requirements of the upload process adopted:

  1. A participant must be able to upload all relevant records to GB.
  2. A participant must be able to remove any of their records from GB.
  3. A participant must be able to update any or all of their records in GB.
  4. The portal must be able to retrieve all records from all participants that have been uploaded to GB.

Requirement 1 is quite straightforward and can be achieved by uploading single records at a time or through the bulk upload process.

Requirement 2 can be achieved most easily using the API, deleting one record at a time, which requires that each record can be uniquely identified. (The same effect can probably be achieved by setting the expiration of each record, but it is not clear whether expired records are deleted or merely inactive.) GB assigns a unique ID to each record, but the bulk upload process does not return this ID. Instead, the uploaded data set must itself contain an identifier. The bulk upload process uses an attribute "id" to locate existing records; however, a recently discovered bug in GB prevents searching on this "id", so a separate attribute that does allow searching must be provided. Hence the following attributes must be available on every record uploaded in the bulk upload process:

id
This is the record ID required by GB. The bulk upload process uses it to identify individual records in a bulk upload file, and in subsequent uploads to determine whether the Google Base record needs to be modified. It cannot currently be used to search for an item. The value must be unique for each record; the value of accession_number should be usable for this.

institution_code
The institution_code from the "Botanic Garden" institutional level data.

accession_number
The accession number (actually an alphanumeric code) of the record. This should be unique among all records for a particular botanic garden. Hence the combination of institution_code and accession_number should be globally unique in the context of the Plant Collections project.
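The three attributes above might appear in an Atom bulk upload file roughly as sketched below. This is a minimal illustration, not a complete or validated feed: the g: and c: namespace URIs, the type="string" attribute on custom fields, and the sample record values are all assumptions, and required Atom elements such as updated are omitted.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Assumed namespaces for GB built-in (g:) and custom (c:) attributes.
G_NS = "http://base.google.com/ns/1.0"
C_NS = "http://base.google.com/cns/1.0"

ET.register_namespace("", ATOM_NS)
ET.register_namespace("g", G_NS)
ET.register_namespace("c", C_NS)


def build_entry(institution_code, accession_number, title):
    """Build one Atom entry carrying id, institution_code, accession_number."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    # The accession number doubles as the bulk upload "id", as suggested above.
    ET.SubElement(entry, f"{{{G_NS}}}id").text = accession_number
    inst = ET.SubElement(entry, f"{{{C_NS}}}institution_code", {"type": "string"})
    inst.text = institution_code
    acc = ET.SubElement(entry, f"{{{C_NS}}}accession_number", {"type": "string"})
    acc.text = accession_number
    return entry


def build_feed(records):
    """Serialize a list of record dicts into a single Atom feed string."""
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    for record in records:
        feed.append(build_entry(**record))
    return ET.tostring(feed, encoding="unicode")


if __name__ == "__main__":
    # Sample values are invented for illustration only.
    print(build_feed([
        {"institution_code": "ABC", "accession_number": "1990-0001",
         "title": "Example record"},
    ]))
```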

Requirement 3 is managed by the bulk upload process. For this to work, it is necessary that the id value is unchanged between uploads as this is the value GB uses to locate an existing record.

Requirement 4 is a little difficult, as GB allows only up to 1250 items to be retrieved anonymously in response to a query. From a presentation point of view this is unlikely to be an issue; however, for the Plant Collections (PC) project it is important that complete data sets are downloadable without restriction. A very simple way around this limitation is to assign a sequential integer value ("indexer") to each record as an identifier. Numeric range queries can then be used to retrieve blocks of records. The combination of "item_type", "institution_code" and "indexer" will be unique for each record, so retrieving all items becomes trivial.
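The block retrieval idea can be sketched as follows. The code only computes the numeric ranges and bundles them with the other two fields; it assumes indexer values start at 1 and are roughly contiguous, and it deliberately leaves out the translation into actual GB query syntax. The function names and the shape of the returned dictionary are hypothetical.

```python
MAX_ITEMS_PER_QUERY = 1250  # anonymous retrieval limit quoted in the text


def indexer_blocks(max_indexer, block_size=MAX_ITEMS_PER_QUERY):
    """Yield inclusive (start, end) ranges covering 1..max_indexer."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    start = 1  # assumption: indexer values begin at 1
    while start <= max_indexer:
        end = min(start + block_size - 1, max_indexer)
        yield start, end
        start = end + 1


def query_params(item_type, institution_code, max_indexer):
    """Return one parameter set per block; hypothetical structure only."""
    return [
        {
            "item_type": item_type,
            "institution_code": institution_code,
            "indexer_min": start,
            "indexer_max": end,
        }
        for start, end in indexer_blocks(max_indexer)
    ]
```

Because each range spans at most 1250 indexer values, no single query can match more items than the anonymous limit, provided the indexer values are not heavily duplicated.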

Associating Records with indexing_ids

Managing the record ids requires that the bulk upload file(s) are generated programmatically and that the association between record identifiers (accession_number) and indexing_ids is maintained. Conceptually this could be a simple lookup table with two columns: one for accession_number and the other for indexing_id. However, it is very important that the indexing_id values are maintained consistently and without duplication (in practice a few duplicates will not matter, but more than a thousand or so will be difficult to resolve).
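One way to keep such a lookup table consistent is sketched below: existing assignments are never changed, new accession numbers receive the next unused integer, and a table that already contains duplicate indexing_ids is rejected. The in-memory dictionary and the function name are assumptions for illustration; a real table would need to be persisted between uploads (for example in a file or database).

```python
def assign_indexing_ids(existing, accession_numbers):
    """Return an updated accession_number -> indexing_id mapping.

    existing: dict of previously assigned ids (left unmodified).
    accession_numbers: iterable of accession numbers in the current upload.
    """
    table = dict(existing)
    if len(set(table.values())) != len(table):
        raise ValueError("existing table contains duplicate indexing_ids")
    # Continue numbering after the highest id already in use.
    next_id = max(table.values(), default=0) + 1
    for accession_number in accession_numbers:
        if accession_number not in table:
            table[accession_number] = next_id
            next_id += 1
    return table


if __name__ == "__main__":
    first = assign_indexing_ids({}, ["A-1", "A-2"])
    second = assign_indexing_ids(first, ["A-2", "A-3"])
    print(second)
```

Records that disappear from a later upload keep their ids in the table, so an id is never handed out twice.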

Since it is very simple to specify an item_type in a query, it does not matter if the indexing_ids are re-used for different types of items (records) uploaded by a participant.