# Data repository description

### `0_pipeline`
Membrane protein dataset scripts and raw tables. This directory contains the incremental steps that process the `csv` table as downloaded from the RCSB PDB (`0_pipeline/csv_downloaded`) into the final non-redundant membrane protein dataset (`0_pipeline/2_35_homology_filter`). The internal readme file explains the substructure in detail and contains the metadata needed to read the tables.

The directory also contains the classification scripts that assign the chains to the trans-membrane/monotopic and peripheral/integral classes, together with a manual classification described in `0_pipeline/3_classification/manually_curated_entities.txt`.

The results of this classification are stored in `0_pipeline/3_classification/classification`.

### `1_contact_maps`
Scripts to compute contact maps for each chain. This computation is done on the original PDB files (as downloaded from RCSB PDB).

### `2_structure_modeling`
Contains a script that splits PDB files into the relevant chains (`2_structure_modeling/0 - pdb2chain`), an instructive notebook that explains in detail how the modeling is done (`2_structure_modeling/modeller_example`), and the script actually used to model the structures (`2_structure_modeling/1 - model_chains`).

### `3_gaussian_entanglement_computation`
Contains the scripts used to compute the Gaussian Entanglement. All scripts use the in-house Python library [pyge](https://github.com/gentangle/pyge). The `indices_ge2pdb.csv` files are used to map between the indices resulting from the G' computation (describing loop and thread extrema) and those in the PDB file.

### `4_ge_results`
The scripts in this directory analyse the results of the G' computation. Each file has a header that describes its aim. The table `4_ge_results/fragmented_domains.csv` lists the domains that are "fragmented", i.e. made of fragments that are not consecutive in the sequence, and that have to be discarded.

Scripts to compute the survivor distribution and the grouping fraction are stored in the `4_ge_results/survivor_grouping_fraction` directory, together with the input files. This directory also holds the list of G' results for the reference set (`4_ge_results/survivor_grouping_fraction/G1C_CvsN_prot_data_revised.dat`), with columns: CATH domain code, G'_n, G'_c and G'_max.
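
As a minimal sketch of how such a table could be read, the snippet below parses rows with the four columns listed above. It assumes whitespace-separated fields and skips blank lines, `#` comments and any row whose numeric fields do not parse (for example a header); the actual delimiter and header layout should be checked against the file itself. The names `GeRecord` and `read_ge_table` are hypothetical and are not part of the repository or of pyge.

```python
import io
from typing import NamedTuple


class GeRecord(NamedTuple):
    """One row of the table (hypothetical container, not from the repository)."""
    domain: str   # CATH domain code
    g_n: float    # G'_n
    g_c: float    # G'_c
    g_max: float  # G'_max


def read_ge_table(handle):
    """Parse rows of the form: domain G'_n G'_c G'_max.

    Assumption: fields are separated by whitespace. Blank lines, lines
    starting with '#', and rows that cannot be parsed are skipped.
    """
    records = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            g_n, g_c, g_max = (float(x) for x in fields[1:4])
        except ValueError:
            # Most likely a header row; ignore it.
            continue
        records.append(GeRecord(fields[0], g_n, g_c, g_max))
    return records


# Toy input with made-up values, only to illustrate the call.
example = io.StringIO("# toy data\n1abcA00 0.5 -1.2 -1.2\n\n2xyzB01 0.1 0.3 0.3\n")
rows = read_ge_table(example)
```

With a real file, the same function can be called as `read_ge_table(open(path))`, where `path` points to the `.dat` file.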

Results for the distributions are contained in `4_ge_results/distribution_files`.

Tables with G' lists, as well as G'_n and G'_c, are located in `4_ge_results/ge_list_files`.

### `CM`
Contains the NumPy arrays of the contact maps computed on the original PDB files. These are all 2D upper-triangular arrays, with non-zero entries where a contact is present in the native configuration.
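
A minimal sketch of how an upper-triangular contact map can be used once loaded: the helpers below list the contacting residue pairs and rebuild the full symmetric matrix. The function names are hypothetical, the toy array stands in for a real file, and the file naming and on-disk format (assumed here to be readable with `np.load`) should be checked against the contents of `CM`.

```python
import numpy as np


def contact_pairs(upper):
    """Return (i, j) index pairs with i <= j where the map is non-zero."""
    upper = np.asarray(upper)
    i, j = np.nonzero(np.triu(upper))
    return list(zip(i.tolist(), j.tolist()))


def symmetrize(upper):
    """Build the full symmetric map from its upper triangle."""
    upper = np.asarray(upper)
    if upper.ndim != 2 or upper.shape[0] != upper.shape[1]:
        raise ValueError("expected a square 2D array")
    # Keep the diagonal once and mirror the strictly upper part.
    return np.triu(upper) + np.triu(upper, k=1).T


# Toy 4-residue map with two contacts (made-up data).
# A real map would be loaded with something like np.load(path),
# where `path` is a file in the CM directory (naming is an assumption).
toy = np.zeros((4, 4), dtype=int)
toy[0, 2] = 1
toy[1, 3] = 1

pairs = contact_pairs(toy)
full = symmetrize(toy)
```

The indices returned here are array indices; relating them to residue numbers in the PDB file requires a separate mapping.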

### `ALIGNMENTS`
Alignment files resulting from the modeling calculation (`2_structure_modeling`) done for chains with missing segments ($<11 \AA$). Hence, not all chains found in `CHAINS_FIX` are present here. For each chain, three files are present: `.fasta` with the sequence as described in the `SEQRES` entry, `seq` listing all `ATOM` entries in the PDB files, and `.ali` with the alignment between the two (used as an input for MODELLER).

### `CHAINS_FIX`
PDB files containing the chains used for the G' calculations. Each file is either a copy of the original chain, if no segments are missing, or the result of the modeling procedure.