monocongo/climate_indices

Not all data available when running example notebooks

Closed this issue · 2 comments

Daafip commented

I cloned https://github.com/monocongo/climate_indices/ and attempted to run https://github.com/monocongo/climate_indices/blob/master/notebooks/spi_simple.ipynb, but not all of the data was available:

`FileNotFoundError: [Errno 2] No such file or directory: 'C:\data\datasets\nclimgrid\nclimgrid_lowres_prcp.nc'`

I did find https://github.com/monocongo/example_climate_indices, but it would be nice if it were all in one place ;)

Sorry for the confusion @Daafip . If there's a place where we can highlight this better in the docs then please suggest an edit (or do it yourself and submit a PR).

The example data is in a separate repository because it's bad practice to keep big files in a Git repository. ChatGPT summarizes it a lot better than I can:

Including large data files in a Git repository is generally considered bad practice for several reasons:

  1. Repository Size: Large data files can significantly increase the size of the Git repository. This can impact repository cloning, pushing, and pulling operations, making them slower and more resource-intensive. It also consumes additional storage space on the server and the local machines of collaborators.

  2. Performance: Git is optimized for tracking changes to source code, which typically consists of text-based files. Binary data or large files are not handled as efficiently by Git. Operations like merging, branching, and viewing file history can become slower when large files are involved.

  3. Collaboration: Sharing large data files through Git repositories can be challenging, especially when collaborating with a distributed team. It becomes difficult to synchronize changes and keep everyone's local repositories up to date. The frequent transfer of large files over the network can also impact the overall performance of the collaboration.

  4. Version Control: Git is primarily designed for version control of source code, allowing for efficient tracking of file changes. However, large data files often do not change incrementally but rather as a whole, which leads to unnecessary duplication in the repository's history and increases its overall size.

  5. Scalability: As the repository grows larger in size, it becomes more challenging to manage and maintain. Operations that require access to the repository's history, such as checking out previous commits or analyzing file changes, may become slower and less efficient.

To address these challenges, it's recommended to adopt alternative approaches for managing large data files, such as:

  • Use a dedicated data storage system or a version control system specifically designed for handling large files, such as Git LFS (Large File Storage), Git Annex, or DVC (Data Version Control).
  • Store large data files outside the repository and reference them through URLs or relative paths.
  • Utilize cloud storage services or artifact repositories to store and share large files separately from the Git repository.
  • Exclude large files from version control and document their dependencies or provide instructions on how to obtain them.

By separating large data files from the Git repository, you can keep the repository focused on source code, improve performance, facilitate collaboration, and maintain a more manageable and efficient version control system.
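For a notebook like `spi_simple.ipynb`, the last bullet above ("exclude large files and document how to obtain them") can be as simple as a small guard at the top of the notebook that fails with a pointer to the data repository instead of a bare `FileNotFoundError`. A minimal sketch, assuming only the file path from the traceback above and the separate data repository's URL (the `require_dataset` helper name is our own, not part of climate_indices):

```python
from pathlib import Path

# The separate repository holding the example datasets (see discussion above).
EXAMPLE_DATA_REPO = "https://github.com/monocongo/example_climate_indices"


def require_dataset(path: str) -> Path:
    """Return the dataset path, or fail with a pointer to where to get it."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{p} not found -- download it from {EXAMPLE_DATA_REPO} "
            f"and place it at this path before running the notebook."
        )
    return p


# A notebook would call this before opening the NetCDF file, e.g.:
# prcp_file = require_dataset("data/datasets/nclimgrid/nclimgrid_lowres_prcp.nc")
```

The point of the guard is that a user who clones only `climate_indices` gets an actionable message naming the data repository, rather than a raw missing-file traceback.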

Daafip commented

Oh right, yeah, I missed that.