- Clone the repo.
git clone https://gitlab.ethz.ch/usys_lectures/esds_book
- Create the data directory and link files from the shared polybox folder into it.
cd esds_book
mkdir data
ln -s ~/polybox/Shared/Data\ Science\ Lecture\ Planning\ -\ shared\ folder/4\ Datasets/* ./data/
- Build the book.
- This document holds inputs and ideas on how to further structure and format the ESDS bookdown document.
- This Google Sheet holds a visual overview of the book's progress.
- The repository holds all content (text and figures), exercises and solutions. Data is stored on polybox.
- First, install git.
- To download the repository to your local machine, follow these steps:
- Open a terminal on your computer (or use the one within RStudio).
- Check the directory you are in by entering
pwd
and use
ls
to check the directories and files therein. To navigate to your "base" directory, simply enter
cd
- Use
cd name.of.directory
to navigate to the folder where you want to download the repo to.
- Use
git clone https://gitlab.ethz.ch/usys_lectures/esds_book
to download the repo (enter your ETH login credentials when prompted).
- Create a new directory within esds_book called data via
mkdir data
- You are still missing the data needed to build the book. There are two ways to get it:
- Downloading Data: Go to the shared polybox and download the folder 4 Datasets. Once you've downloaded it, rename it to data and put it into the esds_book folder. Important: The data folder saved under 10 ESDS Book does not hold all the data needed to build the book.
- Via Polybox Client (download here): Navigate to your esds_book folder using the terminal and
cd
There, you can create so-called soft links to all the data in the folder of the polybox client. To do so, enter
ln -s ~/polybox/Shared/Data\ Science\ Lecture\ Planning\ -\ shared\ folder/4\ Datasets/* ./data/
(the path may have to be adjusted first).
- To install all packages that are needed in the book, open up the
index.Rmd
and run the code chunk in there.
- Never add any data sets to the git repository! Only add them to the polybox and create softlinks.
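The linking step above can be sketched in a form that is safe to re-run. This is only a minimal sketch, not the project's setup script: a temporary folder stands in for the polybox folder, and SRC_DIR, DEST_DIR and example.csv are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch: link every file of a source folder into a data folder without copying.
# A temporary directory stands in for the polybox folder (assumption for the demo).
set -eu

WORK_DIR="$(mktemp -d)"
SRC_DIR="$WORK_DIR/4 Datasets"        # stand-in for the shared folder (note the space)
DEST_DIR="$WORK_DIR/esds_book/data"   # stand-in for esds_book/data

mkdir -p "$SRC_DIR" "$DEST_DIR"
printf 'x,y\n1,2\n' > "$SRC_DIR/example.csv"   # hypothetical data file

# Quoting the directory avoids escaping each space with a backslash;
# the * stays outside the quotes so that the shell still expands it.
for f in "$SRC_DIR"/*; do
  ln -sf "$f" "$DEST_DIR/"            # -f replaces an existing link, so re-running is safe
done

# Report links whose target no longer exists.
for l in "$DEST_DIR"/*; do
  if [ -L "$l" ] && [ ! -e "$l" ]; then
    echo "broken link: $l"
  fi
done

ls -l "$DEST_DIR"
```

For the real setup, SRC_DIR would presumably point to the 4 Datasets folder of the polybox client and DEST_DIR to the data folder inside esds_book.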
- Follow these steps for a nice collaborative workflow using git (as an alternative to the terminal, you can use the Git interface within RStudio):
- Open the .Rproj file within the esds_book directory
- If any new files have been added to the polybox, new softlinks have to be created. Thus, navigate to your esds_book using the terminal and enter
ln -s ~/polybox/Shared/Data\ Science\ Lecture\ Planning\ -\ shared\ folder/4\ Datasets/* ./data/
- Within RStudio, navigate to the terminal and enter
git status
to check for updates of the repo (run git fetch beforehand so that new commits on the remote are reported).
- If there are updates available, enter
git pull
- If there is a so-called merge conflict, check which file is causing it. It is probably easiest to get in contact with whoever else was working on this file to discuss what changes were made. Then, implement all these changes in one file and add it to the repo.
- If no more changes pop up using
git status
- happy working!
- Once you are done working on your files (make sure they are knittable!), enter
git status
again to check which files need to be uploaded.
- If you want to upload only changes to a certain file, enter
git add name.of.file
If you want to add all updated files at once, you can enter
git add *
- Next, commit your changes by entering
git commit -m "description.of.your.changes"
- End the workflow with
git push
and
git status
to see if your changes have been committed.
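The status / add / commit cycle described above can be tried out without touching the shared repository. The following is a minimal sketch that assumes git is installed. It works in a throwaway local repository, so git pull and git push are left out because there is no remote. The file name chapter.Rmd and the identity values are placeholders.

```shell
#!/bin/sh
# Sketch: the status / add / commit cycle in a throwaway local repository.
set -eu

REPO_DIR="$(mktemp -d)"
cd "$REPO_DIR"
git init -q .

# Repository-local identity so the commit also works without a global git config
# (placeholder values).
git config user.name "Example Author"
git config user.email "author@example.org"

echo "# Chapter draft" > chapter.Rmd   # hypothetical file
git status --short                     # lists chapter.Rmd as untracked (??)

git add chapter.Rmd                    # stage one file; git add -A would stage all changes
git commit -q -m "Add chapter draft"

git status --short                     # prints nothing: the working tree is clean
git log --oneline                      # shows the single new commit
```

In the real repository, the same cycle would be preceded by git pull and followed by git push.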
- The R package used to create this book is called
bookdown
- Have a look at the respective github, website, and documentation to get a better grasp. Check out the Get Started page to get started.
- If you want to build the book locally on your computer, install
bookdown
Then open
esds_book.Rproj
and press
Build Book
under the Build tab in RStudio, or enter
bookdown::render_book("index.Rmd", "bookdown::gitbook")
in your R console (see this documentation). Once the book is built, open the newly created index.html under esds_book/_book.
- As of now (early 2021), only an HTML version of the book is buildable. Making a PDF available requires significant additional work since the rendering is less straightforward and ends up in messy formatting.
- All figures and graphics that are incorporated in the book are located in the figures folder on the repository.
- Separate tutorial and exercises even more strictly. Avoid any duplication of explanations in the tutorial part.
- Theory must be accessible somehow. It is currently provided in videos. Embed the videos via YouTube (see here).
- Add exercises as separate sections into the chapters.
- Consider reducing contents, boiling it down to the essentials and avoid any duplication of explanations. Anything repetitive is to be relegated to exercises.
- Include all library load statements for each chapter at the top of the respective RMarkdown.
- Libraries should be explained with links to respective documentation pages.
bookdown
holds powerful tools like cross-referencing, citing, and adding nice tables and figures (using
kable
and
knitr
). Make sure to get familiar with these by reading the documentation!
- Have a look at this cheatsheet to get to know RMarkdown (.Rmd), which the book is based on (e.g., formatting options, how to add links, how to knit, how to use chunks, etc.).
- Extensive Calculations: If RMarkdowns contain code that takes long (anything above a few seconds) to run, avoid running the respective chunk by setting
eval=FALSE
in the chunk options. This still displays the code in the knitted output, which is what we want. If output from compute-intensive code is required for knitting the RMarkdown file, try to come up with a better solution. For example, figures can be created first as PNG, the figure files added to the repo and included in the RMarkdown as an image.
- Citation: Links to online resources can be added using
[Linkname](https://...)
To do proper citations, check out this bookdown page. Always make sure to add references to the book.bib file in esds_folder.
- Cross-Referencing: Chapters and figures are set up for cross-referencing. Read here how to do so.
- All code should be structured following the tidyverse style guide.
- Please adopt tidyverse grammar wherever possible (for example, see below and the chapter on data wrangling).
# Good
day_01
# Bad
DayOne
day.one
first_day_of_the_month
djm1
# Don'ts
mean <- function(x) min(x)
T <- FALSE
c <- 10
- Remember to use tidyverse functions wherever possible, in particular for the functionalities described in the chapter on data wrangling:
- variable selection with
select()
- filtering/subsetting with
filter()
- variable definition with
mutate()
- merging with the
_join()
family
- dates with the lubridate package
- use
read_csv()
instead of
read.csv()
- use
as_tibble()
instead of
as.data.frame()
- apply functions over elements of a list using the
purrr::map()
family of functions instead of the
apply()
family.
- Later chapters on neural networks use the packages
tensorflow
and
keras
- Installing them locally can be troublesome, and executing the code can take very long or even overwhelm your machine. Thus, the outputs for these chapters were generated on Renku and added as pictures or "fake outputs" in the book.
- If you want to install
tensorflow
and
keras
locally to try out the code, follow this documentation.
- The python environment that you set up should have the following packages and versions installed. Make sure to do this in the python environment that is used by R!
conda activate /Users/name/Library/r-miniconda/envs/r-reticulat
conda install -y tensorflow=1.15.0 keras=2.3.1 h5py==2.10.0 pillow
- Here are some useful stackoverflow posts for troubleshooting:
The list below holds inputs and ideas from the lecture evaluation (see pdf on polybox) and moodle feedback (final feedback, exercise forum, lecture forum). As of now (early 2021), the book is ready for use but not proof-read. There is still a need to proof-read the theory, improve the wording, condense content, improve code style, etc. Have a look at the shared excel to see which tasks are still open and for a rough estimate of how many hours it will take to get a final first version of the book ready. Below is a collection of tasks for further improving the book.
- Recommendation for general structure of chapters
- Introduction: Learning Objectives - Key Points of Lecture
- Tutorial (depends on preference, currently first mentioned approach)
Mixing theory and code: Topic 1 - Topic 2 - Topic ...
Splitting theory and code: Theory - Code
- Exercise: Overview - Task Description with Pseudocodes and Outputs to be generated
- Solutions are not provided within the book directly but are stored on the repository
- Referencing other resources is currently (early 2021) implemented as links and not as inline references (see here for proper inline citing). However, all mentioned books are listed in the references chapter. The latter does not apply to linked papers, blog posts and YouTube videos.
- A PDF of the book has been requested by students but has not been finished yet due to knitting issues.
- Consistent language is missing; the current text is a mixture of British and American English.
- Usage of bold and italic face (and other formats) is not consistent yet.
- Inline referencing to figures is underused and could be improved.
- Naming of the files could be improved (how to via git).
- The organisation of files in the data directory is somewhat messy, without a naming convention or sub-folder structure. All data needed in the book and exercises is gathered there. Note that changing the structure will require respective changes in the RMarkdown files where files are loaded.
- The overall workload of the course has been criticized; shortening and condensing content was suggested.
- Book and lecture content are not synchronized yet.
- Important topics that have not been added yet:
- Include content on outlier detection, influential points
- Residual analysis, autocorrelation: add example for what a memory effect could be.
- How to use the debugger in RStudio
- The chapter Prerequisites needs a proper introduction; it only holds bullet points from ETH VVZ.
- Chapter 1
- Still holds explanations of how to use Jupyter Notebooks and git therein. This should be reformulated as an introduction to RStudio on Renku and to using git in the terminal and GUI therein.
- Implement an introduction to Renku: how to fork, create an environment, and navigate within the environment.
- Give an introduction to environmental data (e.g. NASA, Copernicus, Google Earth Engine, Envidat, Pangaea, etc.). Where to find it, how to access it, what is open data, etc.
- Chapter 2 holds too much new content, and the exercise takes a lot of time to solve. Maybe split into two chapters?
- Chapter 5 is only revised regarding the tidyverse code style until the start of the case study. This tutorial is rather long and feedback suggested skipping this case study.
- Chapter 7 and following:
- The distinction between testing and validation data sets is not consistent. Also, the figures used might be confusing. E.g., in Chapter 7, the figure says "test" for the test fold, which we refer to as validation data. An alternative figure could be this one from this blog post.
- Recommendations: Revisit the content to never use "testing set/data" when referring to the validation set, or specifically call the validation set "test fold". The part on "repeated CV" in Chapter 7 has been improved but needs proof-reading.
- All exercises are added to the end of each chapter. The respective solutions as RMarkdown files can be found on the repository.
- There is no consistent style and formatting of the exercises defined. For now, exercises are only added as text with some pseudo-code provided. It would be nice to have a consistent style of structuring the task description (using headings, bold face, horizontal lines, etc.) and to have the same done for the format of the solutions.
- Most exercises and solutions are in tidyverse style but not all yet (e.g. 04)
- Solution 04 is not yet in tidyverse but can easily be rewritten using the tutorial code.
- Solutions 07 and later are not knitted with outputs due to keras/tensorflow issues. To get providable html, run via Renku or see files on polybox under 6 Exercises/solutions.
- Regarding the length of exercises and applications: Generally, exercises were criticized as being too vaguely written and without clear outcomes to be produced. Revisiting the wording and implementing a consistent approach could be helpful here. Have a look at how R for Data Science implements exercises; this could be done here as well (clear list of tasks, outputs to aim for, etc.)
- Both applications have been perceived as too complex to solve within the deadline; either add more hints or extend the deadline. However, content-wise students called them "challenging but interesting".
- Chapter 05 is a collection of code examples to get the hang of different functionalities of R. There is no coherent structure and little explanatory text included.
- Structure outputs more clearly to improve readability.
- Most of the code is in tidyverse but there are still deviations with wrong spacing, naming, function usage, etc.
- To save space, printing of variables can be done by enclosing a line of code in parentheses instead of printing it separately:
(x <- 1:10)
instead of
x <- 1:10
and
x
to print.