# Data repository description

### `0_pipeline`

Membrane protein dataset scripts and raw tables. This directory contains the incremental steps that process the `csv` table as downloaded from the RCSB PDB (`0_pipeline/csv_downloaded`) into the final non-redundant membrane protein dataset (`0_pipeline/2_35_homology_filter`). The internal readme file explains the substructure in detail and contains the metadata needed to read the tables.

The directory also contains the classification scripts that assign the chains to the trans-membrane/monotopic and peripheral/integral classes, together with a manual classification described in `0_pipeline/3_classification/manually_curated_entities.txt`. The results of this classification are stored in `0_pipeline/3_classification/classification`.

### `1_contact_maps`

Scripts to compute the contact map of each chain. This computation is done on the original PDB files (as downloaded from the RCSB PDB).

### `2_structure_modeling`

Directory containing a script to split PDB files into the relevant chains (`2_structure_modeling/0 - pdb2chain`), an instructive notebook with a detailed explanation of how the modeling is done (`2_structure_modeling/modeller_example`), and the actual script used to model the structures (`2_structure_modeling/1 - model_chains`).

### `3_gaussian_entanglement_computation`

Contains the scripts used for the computation of the Gaussian Entanglement. All the scripts make use of the in-house developed Python library [pyge](https://github.com/gentangle/pyge). The `indices_ge2pdb.csv` files are used to map the indices resulting from the G' computation (describing loops and thread extrema) to those in the PDB file.

### `4_ge_results`

The scripts inside this directory are used to analyse the results of the G' computation. Each file has a header that describes its aim. The table `4_ge_results/fragmented_domains.csv` lists the domains that are "fragmented", i.e. made of fragments that are non-consecutive in the sequence, and that have to be discarded.
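
As an illustration of how the fragmented domains can be discarded before further analysis, here is a minimal sketch. It assumes that the domain identifier sits in the first column of `fragmented_domains.csv`, which is not stated in this readme; the function names and the toy identifiers are hypothetical and are not part of the repository scripts.

```python
import csv
import io


def load_fragmented_domains(handle, column=0):
    """Collect the domain identifiers found in one column of a CSV file.

    ASSUMPTION: the identifier is in column `column` (default: the first one).
    A header row, if present, simply ends up in the set and matches no domain.
    """
    domains = set()
    for row in csv.reader(handle):
        if row and row[column].strip():
            domains.add(row[column].strip())
    return domains


def drop_fragmented(results, fragmented):
    """Keep only the (domain, value) pairs whose domain is not fragmented."""
    return [(domain, value) for domain, value in results if domain not in fragmented]


# Toy data standing in for the real table (made-up identifiers and values).
example_csv = io.StringIO("1abcA01\n2xyzB02\n")
fragmented = load_fragmented_domains(example_csv)

results = [("1abcA01", 0.3), ("3defA00", 1.2), ("2xyzB02", -0.7)]
kept = drop_fragmented(results, fragmented)
print(kept)
```

With the real table, the `io.StringIO` object would be replaced by `open("4_ge_results/fragmented_domains.csv", newline="")`.
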
Scripts to compute the survivor distribution and the grouping fraction are stored in the `4_ge_results/survivor_grouping_fraction` directory, together with the input files. The list of G' results for the reference set can also be found here (`4_ge_results/survivor_grouping_fraction/G1C_CvsN_prot_data_revised.dat`), with columns: CATH domain code, G'_n, G'_c and G'_max. Results for the distributions are contained in `4_ge_results/distribution_files`. Tables with G' lists, as well as G'_n and G'_c, are located in `4_ge_results/ge_list_files`.

### `CM`

Contains the NumPy arrays of the contact maps computed on the original PDB files. These are all 2D upper-triangular arrays with non-zero entries where a contact is present in the native configuration.

### `ALIGNMENTS`

Alignment files resulting from the modeling calculation (`2_structure_modeling`), which is done for chains with missing segments ($<11 \AA$). Hence, not all chains found in `CHAINS_FIX` are present here. For each chain, three files are present: `.fasta` with the sequence as described in the `SEQRES` entry, `seq` listing all `ATOM` entries in the PDB files and `.ali` with the alignment between the two (used as an input for MODELLER).

### `CHAINS_FIX`

PDB files containing the chains used for the G' calculations. These are either copied as they are, if no segments are missing, or are the result of the modeling procedure.
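
As a reading aid for the reference table described under `4_ge_results` (`G1C_CvsN_prot_data_revised.dat`), the sketch below parses a four-column table in the column order given there. It assumes a whitespace-delimited layout and `#` as a comment marker, neither of which is confirmed by this readme; the `GERecord` type, the function name and the numbers in the toy input are hypothetical and are not taken from the dataset.

```python
import io
from typing import NamedTuple


class GERecord(NamedTuple):
    """One row of the table: CATH domain code, G'_n, G'_c and G'_max."""
    domain: str
    g_n: float
    g_c: float
    g_max: float


def parse_reference_table(handle):
    """Parse a four-column table into a list of GERecord.

    ASSUMPTIONS: columns are separated by whitespace, and blank lines or
    lines starting with '#' carry no data.
    """
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {line_number}: expected 4 columns, got {len(fields)}")
        domain, g_n, g_c, g_max = fields
        records.append(GERecord(domain, float(g_n), float(g_c), float(g_max)))
    return records


# Toy input with made-up values, only meant to show the column order.
example = io.StringIO(
    "# domain G'_n G'_c G'_max\n"
    "1abcA01 0.12 -0.85 -0.85\n"
    "2xyzB02 1.03 0.40 1.03\n"
)
records = parse_reference_table(example)
for record in records:
    print(record.domain, record.g_max)
```

If the real file uses a different delimiter or has a header without a leading `#`, the splitting and skipping rules above would need to be adapted.
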