PCDC Data Portal
Working with PFB Files
Data for approved PCDC Data Portal research projects will be provided in PFB (Portable Format for Bioinformatics) file format. PFB files include both schema and data in a single compact package. More information on PFB files can be found here.
PFB files are created, explored and modified using the Python PyPFB SDK. The SDK requires an installed Python version between 3.6 and 3.8. More information can be found here.
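Before installing the SDK, it can help to confirm that the interpreter falls inside the stated range. The following is a minimal sketch using only the standard library. The helper name `is_supported` and the two constants are illustrative and not part of PyPFB, and the range is treated as inclusive at both ends.

```python
import sys

# Supported range as stated above: Python 3.6 through 3.8 (assumed inclusive).
MIN_VERSION = (3, 6)
MAX_VERSION = (3, 8)


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version lies within the stated range."""
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported() else "outside the stated 3.6-3.8 range"
    print("Python {}: {}".format(current, status))
```

Run it with the same interpreter that will be used to create the virtual environment.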
By convention, PFB files have a .avro file extension.
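Because the extension is only a convention, a quick sanity check on a downloaded file may be useful. The sketch below assumes that a PFB file is an Apache Avro object container file, which begins with the four bytes `Obj` followed by the byte 1. The function name `looks_like_avro` is illustrative. The check inspects only the file header and does not validate the PFB schema or data.

```python
# Avro object container files start with these four magic bytes.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro(path):
    """Return True if the file at `path` starts with the Avro magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(AVRO_MAGIC)) == AVRO_MAGIC
```

For example, `looks_like_avro("data.avro")` would be expected to return `True` for a file downloaded from the portal, assuming the file is named `data.avro`.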
Set up the environment
1) Create and activate a Python 3 virtual environment. Any Python environment manager can be used (e.g. Python virtualenv, Anaconda, Poetry).
python -m venv env
source env/bin/activate
2) Install Python PyPFB SDK and its dependencies
pip install "pypfb[gen3]"
3) Type "pfb" in the terminal window to verify the installation.
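The same verification can be scripted, for instance in a setup check. This is a minimal sketch that only tests whether an executable named `pfb` is visible on the PATH of the current environment. It does not run the tool, and the helper name `command_available` is illustrative.

```python
import shutil


def command_available(name="pfb"):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if command_available():
        print("pfb is available on the PATH")
    else:
        print("pfb was not found; is the virtual environment activated?")
```

Run it inside the activated virtual environment, since that is where the `pfb` command is installed.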
Convert PFB into TSV
PFB files downloaded from the PCDC Data Portal can be converted to multiple .tsv (tab-separated values) files.
Usage: pfb to [PARENT OPTIONS] tsv [OPTIONS] [OUTPUT]
Convert PFB into TSV files under [OUTPUT] for modification of data in TSV format.
[PARENT OPTIONS]: -i FILENAME (the PFB file)
[OPTIONS]: None
[OUTPUT]: The directory for the TSV files. The default is ./tsvs/.
Example:
pfb to -i data.avro tsv
Note: The PFB file in this example includes data from three nodes (study, person, and subject).
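Once the conversion has run, the resulting files can be inspected with the Python standard library. The sketch below lists every .tsv file in the output directory with its column names and row count. It assumes the default `./tsvs/` output directory and that the first row of each file is a header. The function name `summarize_tsvs` is illustrative, and the actual file names depend on the nodes in the PFB file.

```python
import csv
from pathlib import Path


def summarize_tsvs(output_dir="./tsvs"):
    """Map each .tsv file name to (column names, number of data rows)."""
    summary = {}
    for tsv_path in sorted(Path(output_dir).glob("*.tsv")):
        with tsv_path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle, delimiter="\t")
            header = next(reader, [])  # assumption: first row holds column names
            row_count = sum(1 for _ in reader)
        summary[tsv_path.name] = (header, row_count)
    return summary


if __name__ == "__main__":
    for name, (columns, rows) in summarize_tsvs().items():
        print("{}: {} rows, {} columns".format(name, rows, len(columns)))
```

If the directory is empty or missing, the function returns an empty dictionary instead of raising an error.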