AD·VNVM·DATVM: Down to a single bit of data

A Self-Hosted Sync Solution for Zotero

In short:

Because of my preference not to use the Zotero server to sync my academic library, I have written a small bash (command-line) function to sync Zotero database files over any server of one's choosing (including, e.g., OwnCloud or Dropbox).

Although I previously wrote that I had begun to use a combination of reference managers, centering on JabRef, I have since settled on Zotero, a free, open-source project funded by the Roy Rosenzweig Center for History and New Media at George Mason University. Zotero has an active community of developers and users, and is impressive in its ease of use as well as in the quick response times and enthusiasm of its community volunteers. Zotero started as an extension for the Firefox web browser, but now also features a Standalone version that works with the Chrome and Safari browsers, as well.

Introduction to Zotero Sync

Zotero includes a free service that allows users to sync their reference libraries across multiple computers and to share citations within Zotero with other users. As of this writing, syncing of citations is free, as is syncing of actual content (e.g., PDFs) up to a threshold, beyond which users must pay for the use of the extra space on the Zotero servers. Users can also use their own server space to store and share content; to sync citations, though, the only substantive option Zotero currently allows is the central Zotero Sync server. I don't expect to use the citation-sharing feature with collaborators (I would rather send and receive BibTeX-format text), but I have wanted to be able to sync my library across multiple computers.

While Zotero's free sync option works well and is generously provided, I have been looking for an option that does not require uploading the metadata for one's entire academic library to Zotero's servers. Like many organizations', Zotero's Privacy Policy states, "We will keep your Personal Information private and will not share it with third parties, unless such disclosure is necessary to... comply with the law or legal process served on us... [among other conditions]."

A data ethics issue for some users

This caveat is central to a larger data ethics issue, one that affects some users disproportionately depending on their research or reading habits. US legislation allows library records (which Zotero user data aren't, but which are conceptually related) and digital content more generally (which can include Zotero user data) to be accessed for state-sponsored investigations in broad contexts, potentially for political purposes when the subject of the investigation is not suspected of being involved in criminal activity. To my understanding, this issue is large enough in its implications that it has prompted some libraries to adopt policies like that of the San Francisco Public Library, which routinely deletes records of the type that might eventually be requested:

A borrower's library record includes current information, items currently checked out or on hold, as well as overdue materials and fines. The Library does not maintain a history of what a borrower has previously checked out once books and materials are returned on time. Similarly, the Library's computer search stations are programmed to delete the history of a user's Internet session and all searches once an individual session is completed. The Library treats reference questions, whether in person or online, confidentially. Personal identifying information related to these questions is purged on an ongoing basis.

See also the American Library Association's posts on Privacy and Confidentiality here and here.

This is worth noting, regardless of one's own reading habits, because of its significance for researchers of politically or morally unpopular or sensitive topics. The American Library Association (ALA) has argued that knowledge of the possibility of monitoring of reading habits can be expected to have a chilling effect on inquiry and research in these cases:

Confidentiality of library records is a core value of librarianship.... One cannot exercise the right to read if the possible consequences include damage to one's reputation, ostracism from the community or workplace, or criminal penalties. Choice requires both a varied selection and the assurance that one's choice is not monitored.... The right to privacy is the right to open inquiry without having the subject of one's interest examined or scrutinized by others. Confidentiality relates to the possession of personally identifiable information, including such library-created records as closed-stack call slips, computer sign-up sheets, registration for equipment or facilities, circulation records, Web sites visited, reserve notices, or research notes. [Emphasis added]

Limited alternative syncing solutions

Although Zotero has published the source code for its sync server, potentially allowing users to use the Zotero Sync function with their own servers, the documentation is currently incomplete; similarly, as of this writing, the pre-built, downloadable versions of the Zotero client lack the ability to specify a sync server, further complicating the process of deploying a self-hosted instance. It is with this context in mind that I've been working on a way to easily maintain the sync feature (lacking the collaborative extras) using a self-hosted option like OwnCloud. The solution described below will also work with services like Dropbox, though Dropbox specifically has had similar but further substantiated privacy concerns to those discussed above.

While JabRef uses plaintext BibTeX files as a "database," Zotero currently lacks a bidirectional BibTeX solution: several plugins allow for automatic export of the Zotero database to a BibTeX file, but importing from a BibTeX file is currently a manual process, making it slower and more cumbersome than the existing sync solution for repeated use. Instead of a BibTeX file, Zotero's metadata is stored in a single SQLite file. Zotero's documentation and fora emphatically state that simply storing this .sqlite database file in OwnCloud, Dropbox, etc. can lead to data loss: Zotero's sync page warns about syncing directly through services like Dropbox, noting that "the forums contain several threads about the problems that users are facing with Dropbox-based setups." If a user forgets to close Firefox or Zotero on one system and then opens it on another, the database can become corrupted.

Instead, the Zotero documentation (cf. here) recommends that for a safe and reliable workflow, users transfer the .sqlite database file from one computer to another (e.g., on a flash drive) whenever they switch systems.

An automated approach using bash

I've automated this process with a bash function to be included in an alias file. The function below should work on Linux and Mac OSX systems, as well as on Windows through Cygwin. This function defines three commands:

  1. zotero-sync push will copy the Zotero .sqlite database from one's computer to a central syncing server (e.g., OwnCloud, Dropbox, etc.), first backing up any version that is currently present there.
  2. zotero-sync pull will copy the Zotero .sqlite database from the central syncing server (e.g., OwnCloud, Dropbox, etc.) to one's local computer, first backing up any version that is currently present there.
  3. zotero-sync compare will report when, and from where, the last push and pull commands were made.
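
The backup-then-overwrite step shared by push and pull can be sketched on its own. This is a minimal illustration rather than the function released below; backup_then_copy and its three arguments are hypothetical names introduced here.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of the released function).
# Copies source_file over target_file, first saving any existing
# target_file as "<target_file><backup_suffix>".
backup_then_copy() {
    local source_file="$1"
    local target_file="$2"
    local backup_suffix="$3"

    # Refuse to continue if there is nothing to copy.
    if [ ! -e "$source_file" ]; then
        echo "Error: '$source_file' doesn't exist." >&2
        return 1
    fi

    # Keep the copy that is about to be overwritten, if there is one.
    if [ -e "$target_file" ]; then
        cp "$target_file" "$target_file$backup_suffix" || return 1
    fi

    cp "$source_file" "$target_file"
}
```

Under these assumptions, a push would correspond to calling the helper with the local database as the first argument and the remote copy as the second, and a pull would swap the two.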

For example, running zotero-sync compare in the terminal on my machine produces output similar to this:

Most recent push to '/path/to/OwnCloudDesktopSyncFolder/Academic_Library/zotero_REMOTE_COPY_NOT_TO_EDIT.sqlite': 
Mon Jun 1 11:30:19 PDT 2015 from Home Computer
Most recent pull to '/path/to/Zotero_Database/zotero.sqlite': 
Sat May 30 12:47:47 PDT 2015 from OwnCloud

If, e.g., I had most recently pushed an updated copy from my secondary computer, line 2 above would reflect that. The push and pull commands will also check to make sure that Firefox and Zotero are closed before making copies, in order to prevent database corruption.
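
The running-process check can also be written against grep's exit status rather than its captured output: grep exits with status 0 when it finds a match and 1 when it doesn't. The sketch below assumes the same 'zotero|firefox' pattern used by the function; the two function names are hypothetical and are not part of the function released below.

```bash
#!/usr/bin/env bash

# Hypothetical helper: reads a process listing on standard input and
# succeeds (exit status 0) if any line mentions zotero or firefox.
process_list_mentions_zotero() {
    grep -E -q 'zotero|firefox'
}

# Hypothetical wrapper: applies the check to the live process list.
zotero_or_firefox_is_running() {
    ps -e | process_list_mentions_zotero
}
```

A caller could then write `if zotero_or_firefox_is_running; then ...; fi` to print a warning and skip the copy. Splitting the filter from the call to ps keeps the matching logic easy to try out on a saved process listing.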

Limitations

This is not a perfect solution: it works as long as one remembers to close Firefox/Zotero and use zotero-sync push after working at one's local machine, and to use zotero-sync pull before beginning work at a new machine. In this way, however, it does match the typical workflow for software developers who use decentralized version control software. The difference here is that if two databases need to be merged (because changes have been made to both), the process is manual, following the Zotero documentation for combining two libraries (the process involves exporting a "Zotero RDF" file from one and importing it into the other). Until Zotero itself implements an easier-to-deploy self-hosted sync option, however, it's a fine solution, and it has been working well in my daily workflow.
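
One way to reduce the cost of a forgotten push or pull is to check, before copying, whether the two copies already differ. The following is only a sketch under that assumption: copies_differ is a hypothetical helper, not part of the function released below, and a byte-for-byte comparison says only that the files differ, not which one is newer.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of the released function).
# Succeeds (exit status 0) if the two files differ or either is missing;
# fails (exit status 1) if they are byte-for-byte identical.
copies_differ() {
    local first_file="$1"
    local second_file="$2"

    # cmp -s prints nothing; it exits 0 only when the files are identical.
    if cmp -s "$first_file" "$second_file"; then
        return 1
    fi
    return 0
}
```

Called with the local database and the remote copy as its two arguments, the helper would indicate whether a push or pull would change anything at all.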

One note is that this approach only syncs metadata — PDFs and other attachments can already be synced using one's own server through Zotero's custom data directory feature. If you want to be able to access attached (e.g., PDF) files on a shared server from multiple systems, see the "Linked Attachment Base Directory" subsection here.

The code

I am releasing this code under an AGPL-3.0 license, which means that it is free to use, but that if you make changes and redistribute them, you must make the code for them available. Zotero and many of its plugins and translators use the AGPL license, which is why I've also chosen it here.

The only settings that need to be changed are push_location and pull_location, on lines 11 and 13 of the function. The optional settings (the two ID markers and the two file suffixes) are on lines 17, 19, 21, and 23.

function zotero-sync() {

    #######################
    # SETTINGS:
    #######################

    ###
    # The variables immediately below should be full file paths, including the file name (e.g., '/home/username/zotero.sqlite', instead of just '/home/username/'):
    ###

    push_location="/path/to/OwnCloudDesktopSyncFolder/zotero_REMOTE_COPY_NOT_TO_EDIT.sqlite" # Place to put copy of Zotero database file (e.g., Dropbox)

    pull_location="/path/to/Zotero_Database/zotero.sqlite" # Location of local Zotero sqlite database file (see https://www.zotero.org/support/zotero_data for default locations, or use Zotero's Settings menu to set a custom location).

    ###

    id_of_pull_location="Home Computer" # An ID marker for this machine, to make it easier to understand from where a file was last pushed to Dropbox/OwnCloud/etc. This can be whatever you want, but should be unique to this machine.

    id_of_push_location="OwnCloud" # An ID marker for the push location, to make it easier to understand from where a file was last pulled to the local copy. This can be whatever you want, but should be unique to this push location.

    suffix_for_backup_files="-BACKUP_BEFORE_OVERWRITING_WITH_REMOTE_COPY" # When this function pushes or pulls the database, it will make a backup of the copy that is about to be overwritten.

    suffix_for_date_files="-date_of_last_sync_to_this_location" # When this function pushes or pulls the database, it will make a text file that states when that location's copy was last pushed/pulled.
    #######################

    # Check that we have push and pull locations:
    if [ "$push_location" == "" ] || [ "$pull_location" == "" ] || [ "$id_of_pull_location" == "" ] || [ "$id_of_push_location" == "" ]  || [ "$suffix_for_backup_files" == "" ] || [ "$suffix_for_date_files" == "" ]
    then
        echo "Error: We don't have all necessary settings defined in this function. Exiting so that you can look at this function and set those settings..."
        return 1 # Exit the function with a non-zero status
    fi

    # Check if Firefox or Zotero standalone are running. Following http://ubuntuforums.org/archive/index.php/t-915299.html, grep prints nothing (and exits with a non-zero status) when no match is found, so the variable below is empty unless one of them is running:
    is_firefox_running="$(ps -e | grep --extended-regexp 'zotero|firefox')"
    message_for_if_firefox_is_running="In order to ensure that the Zotero database not become corrupted, please close Firefox / Zotero before running this command."

    # Run commands based on what the first argument to this function ($1) is:
    case "$1" in # If the first argument is... (push, pull, compare, etc.)
    "push")
        if [ "$is_firefox_running" ] # Check whether Firefox or Zotero are running. If they are, print a message and skip the copy.
        then
            echo "$message_for_if_firefox_is_running"
        else
            if [ -e "$pull_location" ] # If the file that we're supposed to copy exists...
            then
                if [ -e "$push_location" ] # If the file that we're supposed to copy TO exists, back it up first...
                then
                    cp "$push_location" "$push_location$suffix_for_backup_files"
                fi
                # Copy the database, and record the date only if the copy succeeded (this also covers the very first push, when there is nothing to back up):
                cp "$pull_location" "$push_location" && echo "$(date) from $id_of_pull_location" > "$push_location$suffix_for_date_files"
            else
                echo "Error: The file we're supposed to copy ('$pull_location') doesn't exist. Exiting so that you can figure out what went wrong..."
                return 1 # Exit the function ('break' is only meaningful inside a loop)
            fi
        fi
        ;;
    "pull")
        if [ "$is_firefox_running" ] # Check whether Firefox or Zotero are running. If they are, print a message and skip the copy.
        then
            echo "$message_for_if_firefox_is_running"
        else
            if [ -e "$push_location" ] # If the file that we're supposed to copy exists...
            then
                if [ -e "$pull_location" ] # If the file that we're supposed to copy TO exists, back it up first...
                then
                    cp "$pull_location" "$pull_location$suffix_for_backup_files"
                fi
                # Copy the database, and record the date only if the copy succeeded (this also covers the very first pull, when there is nothing to back up):
                cp "$push_location" "$pull_location" && echo "$(date) from $id_of_push_location" > "$pull_location$suffix_for_date_files"
            else
                echo "Error: The file we're supposed to copy ('$push_location') doesn't exist. Exiting so that you can figure out what went wrong..."
                return 1 # Exit the function ('break' is only meaningful inside a loop)
            fi
        fi
        ;;
    "compare") # If we have them, print the dates of last push/pull:
        if [ -e "$push_location$suffix_for_date_files" ]
        then
            echo "Most recent push to '$push_location': "
            cat "$push_location$suffix_for_date_files"
        else
            echo "[No date available for last push to '$push_location']"
        fi

        if [ -e "$pull_location$suffix_for_date_files" ]
        then
            echo "Most recent pull to '$pull_location': "
            cat "$pull_location$suffix_for_date_files"
        else
            echo "[No date available for last pull to '$pull_location']"
        fi
        ;;
    *) # If the argument isn't one of the above:
        echo "Please specify either 'push', 'pull', or 'compare' as the first argument to this function."
        ;;
    esac # End of case() statement.
}