Data-TSAR (Data Time-series Satistical AggRegator) helps you to post-process time-series data files in bulk. Features include:
- Easy and readable workflow
- Parallel processing of data files
- Minimizes the number of disk read operations
- Can read HAWC2 and FLEX file formats
- Extraction of parameter values from file names
- Works with nested file structures
Data-TSAR performs data aggregation on a specified file directory on files which match a simplified Regex pattern. Lets look at an example which calculates the mean and damage equivalent loads of some time series files and prints a Pandas DataFrame of the results:
from TSAR import TimeSeriesAggregator
channels = {
"channel1": 1,
"channel2": 2,
}
agg = TimeSeriesAggregator("fictitious_data_dir",
pattern="data_foo{foo}_bar{bar}.sel",
channels=channels)
agg.add("mean", ["channel1", "channel2"])
agg.add("DEL", ["channel2"], m=4, neq=600)
agg.run_par()
df = agg.to_dataframe()
print(df)
First, a TimeSeriesAggregator
object is created. It takes the following parameters:
- the path to the directory containing the data (in this case,
fictitious_data_dir
) - The file pattern to match. Capture groups are indicated as curly brackets. For example, a file with the name
data_foo10_bar123.4.sel
will be matched, and the capture groupsfoo = 10
andbar=123.4
will be captured. - The channel numbers to consider, and their corresponding labels to use. In this case, the label
channel1
corresponds to channel number1
.
Next, the desired time series operations are lazily added to the TimeSeriesAggregator
(no operations are performed until the run
or run_par
functions are called). The first argument is a string with the desired operation. Available operations include mean
, std
, var
, min
, max
, DEL
, count
. The second argument is a list of channel labels to perform the operation. In this example, the mean of channel1 and channel2 are calculated, and the DEL of channel2 is calculated.
Finally, the parallelized calculation is performed by calling agg.run_par()
. Every operation added to the aggregator will be performed on all files which match the file pattern. Data-TSAR is optimized to minimizes the number of disk read operations.
The data is then converted to a Pandas DataFrame and printed:
[1000 rows x 5 columns]
foo bar channel1_mean channel2_mean channel2_DEL
0 10 1234.4 1.484160 2.177548 0.56382
1 20 1234.4 1.411251 2.009189 0.56382
2 30 1234.4 1.745251 2.224972 0.56382
3 10 2345.6 1.442648 2.461884 0.56382
4 20 2345.6 1.731026 2.940186 0.56382
.. ... ... ... ...
995 20 2345.6 1.537893 2.218454 0.56382
996 30 2345.6 1.667881 2.812031 0.56382
997 10 9999.9 1.749854 2.666937 0.56382
998 20 9999.9 1.489747 2.320878 0.56382
999 30 9999.9 1.743557 2.678526 0.56382
As the aggregated data is stored in long-format, further postprocessing can be performed using pivot tables.
Clone the repository and pip install:
git clone git@github.com:jaimeliew1/Data-TSAR.git
cd Data-TSAR
pip install .