Friday, August 29, 2014

For data analysis teamwork: A trick to combine source control with data sharing

For analytical teamwork, we have, primarily, two things to manage to enable accountability and reproducibility: data and code. The fundamental question to answer is what (version of) source and data led to some results.

The problem of managing code is pretty much solved (or at least there are many people working on the problem). See this thing called git for example. As for managing data, it needs more work, but it's being worked on and I only know of dat that can provide some kind of versioning of data akin to git . Keep in mind some databases have snapshotting or timestamping capability. But in the view of dat, a database is just a storage backend since you would version with dat.

Suppose you're not worried about versioning data; perhaps you've got your own way of versioning data or that the data that you're working with is supposed fixed and its appropriate store is a filesystem. Now there are various systems to share the data files but the data would be living in a separate world from the code. Now it's possible to store data in a source control system but it would be generally an inefficient way to deal with the data especially if it's large.

Now wouldn't it be nice to also have source controlled text files such as metadata or documentation interleaved with the data files? To accomplish this, two things need to happen:
  • The data needs to be in the scope of the (source) version control control system but does NOT manage the data files
  • and the file sharing system needs to NOT share the source controlled files.

Let's see how to implement this with git for source control and BitTorrent Sync (BTSync) for file sharing. Your project directory structure will be like:
  • project_folder
    • .gitignore (root) (git)
    • src_dir1
    • src_dir2
    • data_store
      • .SyncIgnore (BTSync, right pane)
      • .gitignore (child) (git, left pane)
      • data_dir
        • readme.txt
        • hugefile.dat
      • data_dir2


Let's explain this setup by examining the *ignore files. By excluding certain files, the *ignore files achieve the two exclusions required.
  • .gitignore (root) is typical and left unchanged.
  • .SyncIgnore tells BTSync to not synchronize, for example, readme files that live in the data directory since they are source controlled (highlighted). (Not essential to the setup but it might be preferable to only sync "input" data files and not files that are generated)
  • .gitignore (child) works with .SyncIgnore. We tell git not to manage files in the data directory as a general rule except the source controlled files (highlighted) that we specified in .SyncIgnore.  We also need to tell git not to manage BTSync's special files and folders. (Not essential to the setup but we can take things further by source controlling .SyncIgnore)
I'm sure there are various ways of dealing with the *ignore files but what I've shown here should apply to many cases.

With this setup, your team achieves better synchronization of project files! Please let me know if I need to explain the system better.