StESPT

TODO

IMPORTANT: Also read the file PARAMS.md

0. Related Repositories

Below, we present some other related repositories that may be of interest to you:

HospitalKG_changes: It is also linked to ~~doi: TODO~~.
HospitalEdgeWeigths: It is also linked to ~~doi: TODO~~.
HospitalGeneratorRDF_V2: Code used to generate the input dataset in RDF* for STeMECH based on the output from H-Outbreak.

1. Other sections

TODO

2. Installation

The source code is currently hosted on github.com/LorenaPujante/STeMECH/Code.

The code is written in Python 3.10 and needs the following packages:

matplotlib v3.9.2
networkx v3.2.1
numpy v2.1.2
pandas v1.5.3
scikit_learn v1.5.2
scipy v1.14.1
SPARQLWrapper v2.0.0
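
As a quick check before running the framework, the pinned versions above can be compared with what is installed in the current environment. The following is only a minimal sketch: the names PINNED, installed_version and find_mismatches are hypothetical helpers and are not part of the repository, and the distribution name scikit-learn is assumed to correspond to the scikit_learn entry of the list.

```python
from importlib import metadata

# Versions copied from the list above. "scikit-learn" is assumed to be the
# distribution name behind the "scikit_learn" entry.
PINNED = {
    "matplotlib": "3.9.2",
    "networkx": "3.2.1",
    "numpy": "2.1.2",
    "pandas": "1.5.3",
    "scikit-learn": "1.5.2",
    "scipy": "1.14.1",
    "SPARQLWrapper": "2.0.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs to a (wanted, found) pair."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Running the script prints one line per package that is missing or has a different version, and prints nothing when the environment matches the list.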

3. Input

The code does not read any input files. Instead, it queries the data about patients from a repository in the GraphDB Semantic Graph Database.

This repository must contain an RDF* ontology that follows the data model described in 10.1109/JBHI.2024.3417224 and HospitalKG_changes. HospitalGeneratorRDF_V2 was used to generate the data for the repository.

The RDF* ontology with the dataset for the experiments of ~~doi: TODO~~ can be found in dataset/HospitalGeneratorRDF_V2_output. In addition, the input data to generate the ontology is in dataset/H-Outbreak_output.
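
To confirm that the GraphDB repository is reachable before running the framework, a small query can be sent with SPARQLWrapper. This is a minimal sketch and not the repository's own code: the helper names are hypothetical, the base URL http://localhost:7200 assumes a local GraphDB instance on its default port, and the query only counts triples so that it does not depend on the data model.

```python
def graphdb_endpoint(repository, base_url="http://localhost:7200"):
    """Build the SPARQL endpoint URL of a GraphDB repository.

    The base URL is an assumption (local GraphDB on its default port).
    """
    return f"{base_url.rstrip('/')}/repositories/{repository}"


# Generic query: counts every triple, independent of the ontology.
COUNT_TRIPLES = "SELECT (COUNT(*) AS ?total) WHERE { ?s ?p ?o }"


def count_triples(repository, base_url="http://localhost:7200"):
    """Return the number of triples in the repository (requires SPARQLWrapper)."""
    # Imported here so that the URL helper above works without the package.
    from SPARQLWrapper import JSON, SPARQLWrapper

    client = SPARQLWrapper(graphdb_endpoint(repository, base_url))
    client.setQuery(COUNT_TRIPLES)
    client.setReturnFormat(JSON)
    results = client.query().convert()
    return int(results["results"]["bindings"][0]["total"]["value"])
```

Calling count_triples with the value used for the repository parameter should return a number greater than zero once the dataset has been loaded.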

4. Execution

There are 4 main Python files that execute the different parts of the framework. Each file must be executed separately. Go to the folder containing the file and run: python name_of_file.py. All the parameters for STeMECH are in the file config.py and are described in the next section.

The parts of the framework are:

The file main_NumCases.py can be used to count the positive cases for a microorganism in each week of the dataset. It also counts the cases by week and floor. It gives an approximate idea of the number of patients whose trajectories will be studied, depending on the parameters' values.

5. Configuration params

These are the parameters for the execution of, mainly, main.py, but also of the rest of the main files:

zero: Indicates whether the matrices used to calculate the spatial distance must be requested from the database. If False, they must already be stored in Code/matrixes.
repository: The name of the GraphDB repository with the input dataset.
dateStart: The date and time to start searching for patients with a positive TestMicro for a specific Microorganism.
dateEnd: The date and time to stop the search for patients with a positive TestMicro for a specific Microorganism.
idLoc: The value of the id attribute of the Floor on which to search for infected patients.
idMicroorg: The value of the id attribute of the Microorganism whose infected patients are searched for.
maxDaysTrajForward: Once the patients infected during the period have been found, other events of these patients are also searched for, for at most the indicated number of days.
similarityFunctions: The ids of the Trajectory similarity measurement algorithms to be run. The allowed values are:
- dtw: for Dynamic Time Warping (DTW).
- dtw_st: for Spatiotemporal DTW (ST-DTW).
- lcss: for Spatiotemporal Longest Common Subsequence (ST-LCSS).
- lcss_2: for ST-LCSS With Time Window (ST-LCSS-WTW).
- tsJoin: for Spatiotemporal Linear Combine (STLC).
- tsJoin_2: for Joint Spatiotemporal Linear Combine (JSTLC).
beta: The β parameter of the equation for temporal similarity between sampling points.
alfa: The α parameter of the equation for the spatiotemporal similarity between sampling points.
maxStepsBackwardLCSS: For the LCSS and LCSS_WTW algorithms, the maximum allowed difference in steps between two sampling points. If the distance in steps is greater than this value, the sampling points will not match.
margin: For the LCSS_WTW algorithm, the number of steps forward and backward over which the match check is done.
maxSpDist: For the LCSS and LCSS_WTW algorithms, the maximum spatial distance allowed between two Beds for their sampling points to match.
maxDiffStepsSTLC: For the STLC and JSTLC algorithms, if it is True, the temporal and spatial similarities are divided by the number of steps of our search time. If it is False, they are divided (as in the original STLC) by the number of steps of the compared trajectories.
nameFolder_Matrix: The path to the folder in which to save the matrices used to calculate the spatial similarity. They are saved in CSV files. The headers of the matrices (the ids of the locations) are also saved in this folder.
nameFolder_SimArrays: The path to the folder in which to save the similarity matrices between patients' trajectories. They are saved in CSV files, with the result of each algorithm in a separate file. The arrays with the similarities normalised to the range [0,1] are also saved here, with names ending in "_01".
nameFolder_Figures: The path to the folder in which to save the figures that summarise the results of all the main files.
nameFolder_Outputs: The path to the folder in which to save the matrices used to calculate the spatial similarity.
timeInFile: If it is True, the execution time of each similarity measurement algorithm for each pair of trajectories will be saved in a file.
nameFolder_Time: The path to the folder with the file with the execution time of each similarity measurement algorithm for each pair of trajectories.

These are the parameters used only by some of the main files:

annotated: This parameter is only used in main_Heatmap.py. If it is True, the heatmaps with the trajectory similarities between patients will also show in each cell the value of the trajectory similarity between the pair of patients.
heatColors: This parameter is only used in main_Heatmap.py. The name of the colour scheme for the heatmaps.
maxClustersPats: This parameter is only used in main_Clustering_Ks.py. If it is True, when searching for the best value of K for the K-Means clustering algorithm, the tested Ks will be in the range [2, **numPatients-1**]. If it is False, the Ks will be in the range [2, numPatients/2].
numRows: This parameter is used in main_Clustering_Ks.py and main_Clustering_Plots.py. It determines how many rows the image with the bar charts of the clustering metrics will have, for each value of K or trajectory similarity algorithm.
meshSize: This parameter is only used in main_Clustering_Plots.py. It determines the "definition" of the chart showing the points of each cluster in a bi-dimensional chart.
Ks: This parameter is only used in main_Clustering_Plots.py. It is an array that stores, for each trajectory similarity measurement algorithm, the value of K that returns the clusters with the optimum cohesion and separation. It must be the same size as similarityFunctions and follow its order.
reducedColors: This parameter is only used in main_Clustering_Plots.py. The name of the colour scheme for the bi-dimensional representation of the clusters.
barColors: This parameter is only used in main_Clustering_Plots.py. It is an array with the names of the colours for the bars of the charts that show the clustering metrics for each trajectory similarity algorithm.
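
The ranges selected by maxClustersPats and the size constraint on Ks can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, numPatients/2 is assumed to be rounded down, and the allowed ids are the ones listed for similarityFunctions.

```python
# Ids listed for the similarityFunctions parameter.
ALLOWED_FUNCTIONS = ("dtw", "dtw_st", "lcss", "lcss_2", "tsJoin", "tsJoin_2")


def candidate_ks(num_patients, max_clusters_pats):
    """Values of K to test: [2, numPatients-1] if True, else [2, numPatients/2].

    Assumption: numPatients/2 is rounded down to an integer.
    """
    upper = num_patients - 1 if max_clusters_pats else num_patients // 2
    return list(range(2, upper + 1))


def pair_ks_with_functions(ks, similarity_functions):
    """Pair each similarity function id with its K, checking both constraints."""
    unknown = [name for name in similarity_functions if name not in ALLOWED_FUNCTIONS]
    if unknown:
        raise ValueError(f"unknown similarity function ids: {unknown}")
    if len(ks) != len(similarity_functions):
        raise ValueError("Ks must be the same size as similarityFunctions")
    # Order matters: the i-th K belongs to the i-th similarity function.
    return dict(zip(similarity_functions, ks))


ks_full = candidate_ks(10, True)
ks_half = candidate_ks(10, False)
```

With 10 patients, ks_full covers the values 2 to 9 and ks_half the values 2 to 5.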

The file PARAMS.md lists the values of all these parameters that were used to create the dataset for the work ~~doi: TODO~~.

6. Output

TODO
