
Where can I download the dataset?

Hello, thanks for the wonderful work!

Can you give more details about the dataset? And where can I download the dataset?

Thank you!


You can find a sample of the dataset, along with a brief description, published as an open data challenge in csv format. To use the notebooks, you will have to convert it to npz format or use a custom PyTorch dataset (see the challenge demo repo).


Could you share the npz file? The data structure from the Open Data Challenge seems different from yours.
See the difference:

Hi, I can't share an npz file containing any data other than what was uploaded to the data challenge, as it would go against the rules of the challenge.
The structure of the labels is different, but that shouldn't be an issue if you just want to convert the csv dataset to npz, as the code was written with these possible modifications in mind. Just load the csv with the OzeDataset class, and export R, Z and X using np.savez. You're aiming at this kind of data structure.
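As a rough illustration of the export step described above, here is a minimal sketch of writing R, Z and X with np.savez and reading them back. The array names come from this thread; the sizes are made up, and the exact axis order the notebooks expect is an assumption to check against the repo.

```python
import io

import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_samples, time_steps = 4, 672
R = np.random.rand(n_samples, 3)               # static building characteristics
Z = np.random.rand(n_samples, time_steps, 5)   # input time series
X = np.random.rand(n_samples, time_steps, 2)   # output time series

# np.savez stores each keyword argument as a named array in one .npz archive.
# A file path such as "dataset.npz" can be passed instead of this in-memory buffer.
buffer = io.BytesIO()
np.savez(buffer, R=R, Z=Z, X=X)

# Reload and check that the names and shapes survive the round trip.
buffer.seek(0)
loaded = np.load(buffer)
print(sorted(loaded.files))
print(loaded["Z"].shape)
```

The keyword names (R, Z, X) become the keys of the loaded archive, so whatever reads the file has to use the same names.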

Hi, do you have any code that could convert the csv to npz? I am not sure what we should include in the npz.

Once again, all the needed information is in the challenge benchmark repo, but to prevent further questions on the dataset I have drafted a function to convert csv to npz.

Dear @maxjcohen, I joined challenge 28 and downloaded the following files:

  • x_train_LsAZgHU.csv
  • y_train_EFo1WyE.csv
  • x_test_QK7dVsy.csv

Then I copied the csv2npz script to the utils folder within the project.
Then I created and ran the following Python script at the project's root folder:

from src.utils.csv2npz import csv2npz

csv2npz('datasets/x_train_LsAZgHU.csv', 'datasets/y_train_EFo1WyE.csv')

But unfortunately it errored as can be seen below.

Traceback (most recent call last):
  File "/home/<username>/Workspaces/Python/transformer/", line 3, in <module>
    csv2npz('datasets/x_train_LsAZgHU.csv', 'datasets/y_train_EFo1WyE.csv')
  File "/home/<username>/Workspaces/Python/transformer/src/utils/", line 21, in csv2npz
    R = x[labels["R"]].values
  File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/", line 2806, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/", line 1553, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
  File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/", line 1646, in _validate_read_indexer
    raise KeyError(f"{not_found} not in index")
KeyError: "['initial_temperature', 'roof_1_thickness_3'] not in index"

Hi, this error means that the columns "initial_temperature" and "roof_1_thickness_3" are not present in the challenge dataset. Indeed, the original labels.json does not list these values, because they were not intended to be used in the challenge.

In order to solve your error, I recommend using the original labels file from the benchmark repo.
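To spot this kind of mismatch before pandas raises a KeyError, one option is to compare the label names against the CSV header first. The sketch below is plain Python; the function name and the "some_characteristic" column are hypothetical, and the only thing taken from the traceback is that labels["R"] is a list of column names.

```python
def find_missing_labels(columns, labels):
    """Return, per label group, the names that are absent from `columns`."""
    available = set(columns)
    return {
        group: [name for name in names if name not in available]
        for group, names in labels.items()
    }


# Hypothetical, shortened stand-ins for the CSV header and for labels.json.
csv_columns = ["index", "some_characteristic"]
labels = {
    "R": ["some_characteristic", "initial_temperature", "roof_1_thickness_3"],
}

missing = find_missing_labels(csv_columns, labels)
print(missing)  # {'R': ['initial_temperature', 'roof_1_thickness_3']}
```

With a real DataFrame, the first argument would be its column index, and the second the parsed content of the labels file.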

I created pull request #6 with the improvements I have come up with so far. It might be useful to merge, @maxjcohen; please advise.

I am looking at your project and trying to process a different dataset. If convenient, please describe the data format so that I can process data beyond the challenge dataset. Thanks.

Hi, there is no particular data format to use with the Transformer besides the input shape specified in the documentation.

We currently handle our data using the OzeDataset class, which inherits from PyTorch's Dataset class. As the format here is a bit specific, I encourage you to write your own Dataset subclass fitting your data, and feed it to the Transformer.
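For anyone unsure what such a subclass looks like, here is a minimal sketch. It is not the OzeDataset class; the class name is hypothetical, and the (samples, time steps, features) layout is an assumption that should be checked against the input shape given in the documentation.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ArrayTimeSeriesDataset(Dataset):
    """Hypothetical dataset wrapping input and output arrays held in memory."""

    def __init__(self, inputs, targets):
        # inputs:  (n_samples, time_steps, n_input_features)
        # targets: (n_samples, time_steps, n_output_features)
        if len(inputs) != len(targets):
            raise ValueError("inputs and targets must have the same number of samples")
        self._x = torch.as_tensor(inputs, dtype=torch.float32)
        self._y = torch.as_tensor(targets, dtype=torch.float32)

    def __len__(self):
        return self._x.shape[0]

    def __getitem__(self, idx):
        return self._x[idx], self._y[idx]


# Random data standing in for a real dataset.
dataset = ArrayTimeSeriesDataset(np.random.rand(10, 24, 3), np.random.rand(10, 24, 1))
loader = DataLoader(dataset, batch_size=4, shuffle=False)
x_batch, y_batch = next(iter(loader))
print(x_batch.shape, y_batch.shape)
```

Only __len__ and __getitem__ are needed for a map-style dataset; the DataLoader takes care of batching.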

Hi, thanks for the reference for the helpful data loading function. Just one minor tip here.

The original data loader uses X.values.reshape((m, -1, K)), where m is the number of observations and K is the length of the time series. However, a typical LSTM or Transformer model accepts an input of shape (batch, time series length, num_features), so reshaping to (m, K, -1) is recommended. The same goes for the variable "Z" (I have to point out that the naming is quite confusing at first glance).
X = X.values.reshape((m, K, -1))
Z = Z.values.reshape((m, K, -1))
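One thing worth checking before switching the reshape: if the CSV columns are grouped by variable (all K time steps of one variable, then the next variable), then reshaping straight to (m, K, -1) gives the right shape but mixes variables and time steps, and a reshape followed by a transpose is needed instead. The toy example below assumes that column layout, which may or may not match your file.

```python
import numpy as np

m, d, K = 2, 3, 4  # samples, variables, time steps (toy sizes)

# Assumed column layout: all K time steps of variable 0, then variable 1, etc.
# The value 10 * variable + t makes each entry easy to identify.
row = np.array([10 * v + t for v in range(d) for t in range(K)])
flat = np.tile(row, (m, 1))                     # shape (m, d * K)

by_variable = flat.reshape((m, -1, K))          # shape (m, d, K)
time_major = by_variable.transpose((0, 2, 1))   # shape (m, K, d)
reshaped_only = flat.reshape((m, K, -1))        # shape (m, K, d), different content

# Time step 1 of variables 0, 1 and 2: values 1, 11, 21.
print(time_major[0, 1])
# Three consecutive columns instead: values 3, 10, 11.
print(reshaped_only[0, 1])
```

If the columns are instead grouped by time step, the direct reshape to (m, K, -1) is the correct one, so it is worth looking at the CSV header first.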

For labels.json, I deleted "week" and "light_blabla_mask" (I can't remember the exact name, but the error message alerted me that this index was not found). You can also refer to the data specification on the challenge website to modify your labels.json.

My final input size is (8, 672, 18): 8 samples per batch, 672 time steps, 18 features (ignoring the room parameters). - 2021 / 2 / 25

LSTM in PyTorch accepts, by default, an input of shape (time series length, batch, num_features); see the docs.
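Both layouts are available through the batch_first argument of nn.LSTM. A small sketch, reusing the (8, 672, 18) size mentioned above with an arbitrary hidden size:

```python
import torch
from torch import nn

batch, time_steps, n_features, hidden = 8, 672, 18, 4

x_batch_first = torch.randn(batch, time_steps, n_features)

# Default layout (batch_first=False): input is (time_steps, batch, n_features).
lstm_default = nn.LSTM(input_size=n_features, hidden_size=hidden)
out_default, _ = lstm_default(x_batch_first.transpose(0, 1))
print(out_default.shape)  # torch.Size([672, 8, 4])

# With batch_first=True: input is (batch, time_steps, n_features).
lstm_batch_first = nn.LSTM(input_size=n_features, hidden_size=hidden, batch_first=True)
out_batch_first, _ = lstm_batch_first(x_batch_first)
print(out_batch_first.shape)  # torch.Size([8, 672, 4])
```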

I managed to get a .npz file using the labels.json from and the code from

However, the dataset used in the notebook has (I think) 25k rows, whereas the one downloaded from the ozechallenge has 7500:

$ wc -l dataset/x_train_LsAZgHU.csv 
7501 dataset/x_train_LsAZgHU.csv

If I change the splits to dataset_train, dataset_val, dataset_test = random_split(ozeDataset, (5500, 1000, 1000)), I hit an error in the cell that does the training:

[Epoch   1/30]:   0%|          | 0/5500 [00:00<?, ?it/s]

RuntimeError                              Traceback (most recent call last)
<ipython-input-20-4b3396332a6c> in <module>
     13             # Propagate input
---> 14             netout = net(
     16             # Comupte loss

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/ in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/code/notebooks/transformers/transformer/tst/ in forward(self, x)
    124         # Embeddin module
--> 125         encoding = self._embedding(x)
    127         # Add position encoding

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/ in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/ in forward(self, input)
     93     def forward(self, input: Tensor) -> Tensor:
---> 94         return F.linear(input, self.weight, self.bias)
     96     def extra_repr(self) -> str:

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/ in linear(input, weight, bias)
   1751     if has_torch_function_variadic(input, weight):
   1752         return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1753     return torch._C._nn.linear(input, weight, bias)

RuntimeError: mat1 dim 1 must match mat2 dim 0

What is this 'datasets/dataset_57M.npz'? And what are X, R and Z? Thanks!

Hi, the dataset from the challenge and the one I'm using in this repo are quite different, which is why the dimensions don't match. If you want to use this Transformer for the challenge, you'll have to make a few adjustments.

As for your question about X, R and Z, you can check #28.

Hi! Thanks for answering.

Can you tell me more about the differences? For example, what are the shapes of X, R, and Z in dataset_57M.npz? Also, I'm lost when you say that

If you want to use this Transformer for the challenge, you'll have to make a few adjustments.

Is this not what is going on in this repo? In the readme, you say that the dataset used to train this transformer is the one from the challenge, but that does not seem to be the case. Can you tell me more about which adjustments are needed?

The variables X, R and Z are specific to the challenge dataset and completely independent of the Transformer model. They simply describe the dataset, which has two inputs instead of the usual one:

  • R contains the characteristics of the building, which don't change with time, and are concatenated with Z to serve as input. Shape should be (n_samples, n_characteristics).
  • Z contains the input time series. Shape should be (n_samples, time_steps, n_input_variables).
  • X contains the output time series. Shape should be (n_samples, time_steps, n_output_variables).
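To make the shapes in the list above concrete, here is a minimal sketch of building a single input by repeating R at every time step and concatenating it with Z. The sizes are made up, and the concatenation order (Z first, then R) and the d_input / d_output names are assumptions for illustration.

```python
import numpy as np

n_samples, time_steps = 5, 12  # toy sizes
n_characteristics, n_inputs, n_outputs = 3, 4, 2

R = np.random.rand(n_samples, n_characteristics)
Z = np.random.rand(n_samples, time_steps, n_inputs)
X = np.random.rand(n_samples, time_steps, n_outputs)

# Repeat the static characteristics at every time step:
# (n_samples, n_characteristics) -> (n_samples, time_steps, n_characteristics).
R_repeated = np.repeat(R[:, np.newaxis, :], time_steps, axis=1)

# Concatenate along the feature axis to build the model input.
model_input = np.concatenate([Z, R_repeated], axis=-1)

d_input = model_input.shape[-1]  # n_inputs + n_characteristics
d_output = X.shape[-1]
print(model_input.shape, d_input, d_output)
```

The last dimension of the input has to match the size the model's first linear layer was built for; a mismatch there is one way to end up with an error like "mat1 dim 1 must match mat2 dim 0".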

The original dataset from the challenge has been modified: for instance, some variables were removed from R, some were added to Z, etc. But the content is roughly the same, and should be sufficient for trying out the Transformer. All changes can be found in the labels.json files.

Please keep in mind that the dataset dataset_57M.npz is not available for download.

Thanks to the author for the great intuitions and efforts.

For those who may have issues related to the dataset, you might try this, which I slightly modified according to the author's suggestions.

and dataset

You can check some plots resulting from the code above (I don't know whether they are correct or not).

Hope this helped someone.

The challenge dataset contains two files, named x_train and y_train. Do they complement each other, or is one of them enough?
Also, if my data is already in a csv file, can't I just divide it into train, test and validation sets directly and use those?

Hi, yes, they complement each other: x_train contains the commands (input vectors) while y_train contains the observations (output vectors). You are, of course, free to divide your data however you desire.
In the future, please keep discussions about the challenge in the challenge repo.
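For the split itself, one simple approach is to shuffle the sample indices once and use the same indices for inputs and outputs, so that paired rows stay together. The function name and sizes below are hypothetical.

```python
import numpy as np


def split_indices(n_samples, n_val, n_test, seed=0):
    """Shuffle sample indices once and cut them into train/val/test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# Rows of the input and output files are paired, so the same indices select both.
inputs = np.random.rand(100, 6)   # stand-in for the x_train data
targets = np.random.rand(100, 2)  # stand-in for the y_train data

train_idx, val_idx, test_idx = split_indices(len(inputs), n_val=15, n_test=15)
x_tr, y_tr = inputs[train_idx], targets[train_idx]
print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```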

Thank you for your work!

I am new to Transformer methods. Can the package accept csv files directly instead of .npz files?

In this repo, we define a Transformer model that takes Tensors as input; see the documentation. We present examples that load data from .npz files, but you can load data however you want.


Hi @yyldtc, from what I was able to translate from your message, something is still not working with the dataset. Could you detail the error that you got in a new issue? I'll take a look.