HPC Archiving

This document covers the usage of the Archiving Storage attached to LIMS-HPC.

Purpose

The Australian Code for the Responsible Conduct of Research requires researchers to make use of the archive storage at a research organisation (i.e. La Trobe University) and to keep records about such data. La Trobe University's policies specifically reference the Code.

Relevant Documents:

Glossary

Key terms used within this document

LIMS-HPC key directories

Archive types

The Archive storage on LIMS-HPC is used for storing data that is no longer being actively processed. There are two groups of archived projects:

Published

Data analysis that has been published (or is under review). This data must be cleaned and compressed, and must have a COMPLETE meta-data file. Once the project is ready it will be locked so that no user can add, change or delete files within it. A locked project is read-only for researchers within the labgroup. A supplementary sub-directory for each project is used for adding new files or information after the project is locked. Files placed there will also be locked when ready.

Shelved

For projects that have been put on hold for the time being. They can be brought back to the HPC storage (i.e. /home/group/*) when work resumes. Intermediate files must also be removed prior to archiving.

Archive directory layout

Each labgroup's archive directory contains two sub-directories for the Published and Shelved data.

Meta-data

Items that must be included as a minimum in archived data (metadata.txt)

metadata.txt: a template file to use for your project's metadata.txt

What to keep

Data-wise, the Code says you need to keep at least the raw input files, the output results and anything in your data analysis that cannot be reproduced. You might also want to keep any intermediate results that take a significant time to reproduce. Any dataset that you download from the web to use in your analysis should be archived with your results, as it might be hard to recreate the dataset in the future (if the web resource disappears or changes).

Meta-data wise, you need to keep all meta-data which includes all your job scripts for each step of your analysis.

Don't keep SAM files, particularly if you have BAM files with the same contents. When developing your processing pipeline, consider a method that produces BAM files directly if possible.
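As a rough aid for spotting candidates, the sketch below lists SAM files that have a BAM file of the same base name in the same directory. The function name list_redundant_sam is hypothetical, and a matching name is only an assumption that the contents match, so the function lists files and never deletes them.

```shell
#!/usr/bin/env bash
# List SAM files that have a BAM file with the same base name beside them.
# Assumption: sample.sam and sample.bam in one directory hold the same
# alignments. Verify this yourself before removing anything.
list_redundant_sam() {
    local project_dir="${1:-.}"
    local sam bam
    find "$project_dir" -type f -name '*.sam' | sort | while IFS= read -r sam; do
        bam="${sam%.sam}.bam"
        if [ -f "$bam" ]; then
            printf '%s\n' "$sam"
        fi
    done
}

# Usage:
# list_redundant_sam PROJECT_DIR_NAME
```

Where samtools is available, piping an aligner's output straight into samtools sort -o sample.bam - is one common way to avoid writing the SAM file at all; whether that suits your pipeline is for you to judge.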

Copying data

Preparation

Prior to transferring you need to clean up the project directory. See the Project best practice section below for the tasks that should be completed.

All FASTQ/A files should be compressed, as should any other large text files. When you compress a text file that wasn't compressed when you ran your pipeline, make sure you document this in a README file so that you remember to decompress it if you need to perform the processing again.
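The compress-and-document step could be sketched as follows. This assumes gzip as the compressor, the file extensions shown, and a README at the top of the project directory; none of these are mandated by this document, and the function name compress_and_log is hypothetical.

```shell
#!/usr/bin/env bash
# Compress uncompressed FASTQ/FASTA files and record each one in a README.
# Assumptions: gzip is the compressor, the extensions below cover your
# files, and the README lives at the top of the project directory.
compress_and_log() {
    local project_dir="${1:-.}"
    local readme="$project_dir/README"
    local f
    find "$project_dir" -type f \
        \( -name '*.fastq' -o -name '*.fq' -o -name '*.fasta' -o -name '*.fa' \) |
    while IFS= read -r f; do
        # Only log the file if gzip succeeded.
        gzip -- "$f" &&
            printf '%s: compressed %s with gzip after the pipeline ran; gunzip before reprocessing\n' \
                "$(date +%F)" "${f#"$project_dir"/}" >> "$readme"
    done
}

# Usage:
# compress_and_log PROJECT_DIR_NAME
```

Each README line names the file relative to the project directory, so the note stays valid after the project is moved into the archive.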

Transferring

The Temporary archive folder in your lab-group is on the same file system as your lab-group, so please use the mv command to move the project into your archive directory once it is prepared for archival.

# e.g.
mv PROJECT_DIR_NAME /home/group/LABGROUP/archive/pub/
# Where PROJECT_DIR_NAME is the top level directory for your project
#       LABGROUP is the labgroup you belong to. e.g. smithlab
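If you want to confirm that source and destination really are on the same file system before moving (so that mv is a quick rename rather than a copy), a check along these lines may help. It is a sketch only: the function name same_filesystem is hypothetical, and it compares the device and mount point reported by df -P.

```shell
#!/usr/bin/env bash
# Succeeds when both paths live on the same file system.
# Compares the device name and mount point that df -P reports for each path.
# A path that does not exist yields empty output and therefore a failure.
same_filesystem() {
    local a b
    a=$(df -P -- "$1" | awk 'NR==2 {print $1, $NF}')
    b=$(df -P -- "$2" | awk 'NR==2 {print $1, $NF}')
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (PROJECT_DIR_NAME and LABGROUP are placeholders, as above):
# same_filesystem PROJECT_DIR_NAME /home/group/LABGROUP/archive/pub/ &&
#     mv PROJECT_DIR_NAME /home/group/LABGROUP/archive/pub/
```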

Data disposal

Data must only be disposed of in accordance with the La Trobe Research Data Retention and Disposal Policy and any other relevant policies.

Additionally, when a dataset is disposed of, the project directory and metadata file must remain. A note should be added to the metadata file indicating the date of deletion, who deleted it and the reason. A full directory listing should be stored prior to deletion:

ls -lR >> listing.txt
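The record-keeping part of a disposal could be scripted roughly as below. This is a sketch under assumptions: listing.txt and metadata.txt are taken to sit at the top of the project directory, the wording of the note is illustrative rather than a required format, and the function name record_disposal is hypothetical. It deliberately deletes nothing.

```shell
#!/usr/bin/env bash
# Record a disposal: store a full listing, then note the date, person and
# reason in metadata.txt. Deleting the data is left as a separate manual step.
# Assumptions: listing.txt and metadata.txt sit at the top of the project
# directory; the note format is illustrative only.
record_disposal() {
    local project_dir="$1" who="$2" reason="$3"
    [ -f "$project_dir/metadata.txt" ] || {
        echo "metadata.txt not found in $project_dir" >&2
        return 1
    }
    ( cd "$project_dir" && ls -lR >> listing.txt ) || return 1
    printf 'Disposed: %s by %s. Reason: %s. Listing: listing.txt\n' \
        "$(date +%F)" "$who" "$reason" >> "$project_dir/metadata.txt"
}

# Usage:
# record_disposal PROJECT_DIR_NAME "Your Name" "reason for disposal"
```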

Project best practice

To help produce repeatable science and make your life easier when it comes time to archive, here are a number of suggestions to follow while working on a project.

Each step in its own directory

Within each project there are commonly multiple processing steps. It's best practice to create a sub-directory for each step in your analysis. This reduces confusion about files and makes it easier to clean up and archive. If you add a sequence number to the beginning of each directory name (e.g. 01, 02), ls will list the directories in processing order.
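One way to set up such a layout is sketched below. The step names are hypothetical examples, as is the function name make_step_dirs; the zero-padded prefix is what keeps step 10 from sorting before step 2.

```shell
#!/usr/bin/env bash
# Create one sub-directory per analysis step, prefixed with a zero-padded
# sequence number so that ls lists them in processing order.
# The step names passed in are hypothetical examples.
make_step_dirs() {
    local project_dir="$1"; shift
    local n=1 step
    for step in "$@"; do
        mkdir -p "$project_dir/$(printf '%02d_%s' "$n" "$step")"
        n=$((n + 1))
    done
}

# Usage:
# make_step_dirs myproject raw_data qc alignment
# ls myproject   # lists 01_raw_data, 02_qc, 03_alignment in that order
```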

Clean as you go

Once you have successfully completed a step in your analysis you should remove any files resulting from earlier failed analysis.

Better yet, move all files from this step out of the way (e.g. to a subdirectory called 'old') and repeat the steps you used to achieve a successful result. This makes sure you are able to repeat the analysis. When finished you can remove the old files.

Maintain metadata.txt

Create the metadata.txt file when you start a project and complete it as you go along. As a minimum you should put your contact details in there and document the source of the data (and any approvals for it).

You should resist the urge to create links from one project to another as this will result in dead links if one project is archived before another. If two projects share the same data then you can apply to have the data stored in the genomics platform archive and link to it there.
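To check whether a project already contains dead links (for example after another project was archived), something like the following sketch can be run. The function name find_dead_links is hypothetical; it relies only on standard find and test behaviour.

```shell
#!/usr/bin/env bash
# Print every symbolic link under a directory whose target no longer exists.
# test -e follows the link, so it fails exactly when the link is dead.
find_dead_links() {
    find "${1:-.}" -type l ! -exec test -e {} \; -print
}

# Usage:
# find_dead_links PROJECT_DIR_NAME
```

An empty result means every link under the directory still resolves.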

Background

When using data within your project it's best practice to use symbolic links to the source data in the genomics platform archive instead of copying it into your project. There are two types of symbolic link, (1) relative and (2) absolute, differentiated by whether you use a relative or an absolute path when creating the link.
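To make the difference concrete, the sketch below creates both kinds of link to the same file. All paths in the usage comment are hypothetical (this document does not give the location of the genomics platform archive), and the function name link_both_ways is made up for illustration.

```shell
#!/usr/bin/env bash
# Link a source file into a project step directory twice: once by absolute
# path and once by relative path. All names here are hypothetical.
link_both_ways() {
    local source_file="$1"   # absolute path to the source data file
    local rel_target="$2"    # the same file, as a path relative to step_dir
    local step_dir="$3"      # directory that will hold the links
    local name
    name=$(basename "$source_file")
    # Absolute link: stores the full path exactly as given.
    ln -s "$source_file" "$step_dir/abs_$name"
    # Relative link: the stored path is resolved from the link's own directory.
    ln -s "$rel_target" "$step_dir/rel_$name"
}

# Usage, assuming a hypothetical layout under /home/group/LABGROUP:
# link_both_ways /home/group/LABGROUP/genomics/run1/reads.fastq.gz \
#     ../../genomics/run1/reads.fastq.gz \
#     /home/group/LABGROUP/myproject/01_raw_data
# readlink on each link then shows the path it stores.
```

Because a relative link is resolved from the directory that contains it, it keeps working when the project and the source are moved together, while an absolute link keeps working when only the project moves.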

When should you use each?

Read-only

It's a good idea to make your files read-only once the step that produced them has succeeded, to help prevent data loss if you do a recursive delete in the wrong place. NOTE: this will NOT protect you from rm -f.

# individual file(s)
chmod a-w FILENAME

# whole directory
chmod -R a-w DIRECTORYNAME

Another helpful hint is to create a file named '0' in directories that contain valuable files and make it read-only. This causes rm -r ... to prompt before deleting it, so if you accidentally try to delete the directory you have a chance to stop (by pressing CTRL + C to terminate the command).

touch 0; chmod a-w 0;