Getting started
- Docs/first stop for getting started: https://quarto.org/
- Extensions: https://quarto.org/docs/extensions/listing-revealjs.html
- Presentations (in revealjs): https://quarto.org/docs/presentations/revealjs/

Misc
- A Quarto tip a day by Mine Çetinkaya-Rundel: https://mine-cetinkaya-rundel.github.io/quarto-tip-a-day/
Extensions (e.g. `quarto-ext/fontawesome` for icons) have to be installed with every new project (at least for now). Quick tutorial for this: https://www.youtube.com/watch?v=u8EOVOjX13Y
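If memory serves, installing an extension from the project root looks something like this (a sketch; check the extension's README for the exact command):

```
quarto add quarto-ext/fontawesome
```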
Sections: https://quarto.org/docs/authoring/cross-references.html#sections

To reference a section, add a `#sec-` identifier to any heading. For example:

```
## Introduction {#sec-introduction}

See @sec-introduction for additional context.
```

Note that when using section cross-references, you will also need to enable the `number-sections` option (so that section numbering is visible to readers). For example:

```
---
title: "My Document"
number-sections: true
---
```
Codes for adding emojis in Markdown here: https://github.com/markdown-templates/markdown-emojis
Preventing scientific notation: https://stackoverflow.com/questions/25946047/how-to-prevent-scientific-notation-in-r
My current preferred package management workflow involves creating virtual environments with `renv`.

My previously preferred package for package management is `pacman`. Before loading in dependencies, put this at the top of the script:

```
if (!require("pacman")) install.packages("pacman")
```
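From there, `pacman::p_load()` installs anything that's missing and loads everything else (package names below are just examples):

```
# install if needed, then load
pacman::p_load(dplyr, ggplot2, stringr)
```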
I use the `{here}` package for file/path management. To reset the home path (tidyverse equivalent of `setwd()`): `set_here()`. It's a superseded function, but I don't really like the replacement.
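Typical usage is building paths from the project root, e.g. (file names here are hypothetical):

```
library(here)

# resolves to <project root>/data/raw/survey.csv regardless of the current wd
df <- readr::read_csv(here("data", "raw", "survey.csv"))
```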
Combine with `purrr::map` to read multiple csvs into one data frame: https://www.mjandrews.org/blog/readmultifile/
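A minimal sketch of that pattern, assuming a `data/` folder of csvs with identical columns:

```
library(purrr)
library(readr)

csv_files <- list.files(here::here("data"), pattern = "\\.csv$", full.names = TRUE)
# read each file and row-bind the results into one data frame
all_data <- map_dfr(csv_files, read_csv)
```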
Useful functions that I am constantly forgetting: `na_if()` and `rowwise()` (`group_by()` for rows); see the sketch after the note. NOTE: Don't get stuck in the trap of doing row-wise operations if pivoting makes more sense!
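A made-up example of both (the columns `code` and `q1:q3` are hypothetical):

```
df %>%
  mutate(code = na_if(code, "")) %>%           # recode empty strings to NA
  rowwise() %>%
  mutate(row_total = sum(c_across(q1:q3))) %>% # row-wise sum across columns
  ungroup()
```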
`slice(1L)` for getting the max value of each group:

```
grouped_data <- data %>%
  group_by(variable, group_vars) %>%
  summarize(values = sum(values)) %>%
  mutate(grp = cur_group_id()) %>%
  arrange(desc(values)) %>%
  slice(1L)
```
`recode()` values in variables; `replace_na()` for recoding NA values in variables.
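A small example of both together (column names are made up):

```
df %>%
  mutate(
    party = recode(party, "dem" = "Democrat", "rep" = "Republican"),
    votes = replace_na(votes, 0)
  )
```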
Do you want counts of variables in groups without deleting all the other variables? Use `mutate()` after `group_by()` instead of `summarize()`. Then subset accordingly, e.g.:

```
df <- df %>%
  group_by(country_person) %>%
  mutate(
    n_articles_total = n(),
    n_articles_before = sum(before_appoint == 1),
    n_articles_after = n_articles_total - n_articles_before,
    n_lang_en = sum(lang_en == 1),
    n_lang_other = n_articles_total - n_lang_en,
    av_text_length = mean(length)
  ) %>%
  ungroup()

df_tidy_subset <- df %>%
  select(
    id, country_person, n_articles_total, n_articles_before,
    n_articles_after, n_lang_en, n_lang_other, av_text_length
  ) %>%
  unique() # rm duplicates
```
Assign id numbers within groups:

```
df %>% group_by(cat) %>% mutate(id = row_number())
```
My general philosophy: avoid pure regex whenever possible 😅
Remove all characters that are non-numeric: STRING <- str_remove_all(STRING, "\\D+")
Extract substring between two strings: qdapRegex::ex_between()
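For instance, a toy sketch:

```
library(qdapRegex)

# returns a list containing "important detail"
ex_between("flagged [important detail] in text", "[", "]")
```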
On reading in multiple files and combining result of a function into a data frame: https://clauswilke.com/blog/2016/06/13/reading-and-combining-many-tidy-data-files-in-r/
Create a date object from year and month columns with the `ym()` function (goes for a bunch of different ymd combinations as well), e.g.:

```
df %>%
  mutate(
    date = ym(paste(Year, Month))
  )
```
I collect my data using the `twarc` Python package, but work with my data in R. See the code below as an example for wrangling the JSON strings from `entities` variables. You might have to make things more complex if you want to also add where tweets came from, but hopefully the snippet below provides a good starting point!

```
library(dplyr)
library(purrr)
library(jsonlite)

tweets_entities <- tweets %>%
  filter(entities.annotations != "") %>% # for some reason drop_na not working
  mutate(entities.annotations = gsub("\"\"", "\"", entities.annotations))

entities <- map(tweets_entities$entities.annotations, fromJSON) %>%
  bind_rows() %>%
  select("type", "normalized_text") %>%
  distinct()

people <- entities %>%
  filter(type == "Person")
```
If you're a Python user, stick to the Grammar of Graphics and use the plotnine library for visualization :D
```
library(viridis) # for scale_fill_viridis()

fig_df |>
  ggplot(aes(x = country, y = account_type, fill = n)) +
  geom_tile(color = "white") +
  geom_text(aes(label = n), color = "white", size = 15) +
  coord_fixed() +
  scale_fill_viridis(end = 0.7)
```
Different categorical x-axes https://stackoverflow.com/questions/45019839/ggplot2-different-facet-width-for-categorical-x-axis
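The gist of that answer, if I recall correctly, is `space = "free_x"` in `facet_grid()` (columns below are made up):

```
ggplot(df, aes(x = category, y = value)) +
  geom_col() +
  # panel widths scale with how many x categories each group has
  facet_grid(~ group, scales = "free_x", space = "free_x")
```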
This is the `theme_set()` that I might use for now.

```
# add fonts (this might not be a necessary step)
showtext::font_add_google(name = "Fira Sans", family = "fira")
showtext::font_add_google(name = "Roboto", family = "roboto")

# themes and text defaults
theme_set(
  theme_minimal() +
    theme(
      legend.position = "bottom",
      plot.title = element_text(family = "fira"),
      text = element_text(family = "roboto")
    )
)
```
Use `str_wrap()` around different graphic elements to automatically wrap captions/text/legend labels. Sample code below (`geom_bump()` comes from the `{ggbump}` package):

```
top_df %>%
  ggplot(aes(x = date, y = as.numeric(rank), color = str_wrap(game, 20))) +
  geom_point() +
  geom_bump() +
  scale_y_reverse(limits = c(10, 1), n.breaks = 10) +
  labs(
    title = "Top Games Streamed on Twitch",
    subtitle = str_wrap("Games shown are a subset of data with the top 200 ranked games over time. Each of these games has consistently ranked in the top 200, but not necessarily top 10, throughout the years.")
  ) +
  guides(col = guide_legend(ncol = 3))
```
If you want to wrap legend labels but keep factor levels, use the following helper function (thanks Hadley Wickham!):

```
# for wrapping legend labels while keeping original factor levels
# https://github.com/tidyverse/stringr/issues/107
str_wrap_factor <- function(x, ...) {
  levels(x) <- str_wrap(levels(x), ...)
  x
}
```
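Hypothetical usage, with `long_label` standing in for a factor column:

```
df %>%
  ggplot(aes(x = date, y = value, color = str_wrap_factor(long_label, 20))) +
  geom_line()
```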
How to customize which legends are shown based on aesthetic: `guides()`. Example:

```
data %>%
  ggplot(aes(x = type, y = fct_rev(abb), size = n, color = n)) +
  geom_point() +
  labs(
    title = "TITLE",
    x = "",
    y = "",
    color = "",
    caption = "Data Source: DATA_SOURCE\nVisualization: Allison Koh"
  ) +
  guides(size = "none")
```
`{extrafont}` and `{showtext}` are useful for adding different fonts to viz. The former is for loading in existing fonts from the system; the latter is for making sure your text shows up in all graphics (and for loading in fonts from Google and other places).

LIFE HACK (or more likely, common sense thing that I often forget): Make sure to include font families in `theme_set()` at the top of a script instead of in individual graphics.
Useful lines of code for `{extrafont}` are as follows:

```
# import system fonts into the extrafont database (only needs to be run once)
extrafont::font_import()
# register the imported fonts with R's graphics devices
extrafont::loadfonts()
# show font names
fonts()
# show a data frame of all fonts available
fonttable()
```
Useful lines of code for `{showtext}` are as follows:

```
# put at the beginning of a script to automatically show text in new graphics devices
showtext_auto()
```
Add Bernie Sanders to your plots :D because why not.

Install

```
remotes::install_github("R-CoderDotCom/ggbernie@main")
```

Geom

```
geom_bernie(aes(x = 1930, y = 20100), bernie = "sitting")
```
Helper Function

```
# helper function for writing alt text
# https://twitter.com/thomas_mock/status/1375853258145734660
write_alt_text <- function(
    chart_type,
    type_of_data,
    reason,
    misc,
    source
) {
  glue::glue(
    "{chart_type} of {type_of_data} where {reason}. \n\n{misc}\n\nData source from {source}"
  )
}
```
Examples
The `{TidyTuesdayAltText}` package contains examples of alt text from #TidyTuesday posts between 2019 and 2021. A future version of this package will include an annotated dataset of alt text + ratings according to feature: https://twitter.com/spcanelon/status/1405488036989870080. Until it is integrated into the package, the data can be found here: https://github.com/spcanelon/csvConf2021/blob/main/data/annotatedRubric1.csv

```
devtools::install_github("spcanelon/TidyTuesdayAltText")
```
https://datascienceplus.com/how-to-use-paletter-to-automagically-build-palettes-from-pictures/
devtools::install_github("andreacirilloac/paletter")
create_palette(image_path = "~/Desktop/410px-Piero_della_Francesca_046.jpg",
number_of_colors =20,
type_of_variable = “categorical")
pie(rep(1, 13), col=pal)
https://stackoverflow.com/questions/17552917/merging-existing-pdf-files-using-r
install.packages("qpdf")
qpdf::pdf_combine(input = c("file.pdf", "file2.pdf", "file3.pdf"),
output = "output.pdf")
Reset port (for the error message `Selenium server signals port = 4444 is already in use.`): https://stackoverflow.com/questions/74708282/rselenium-is-not-working-when-creating-servers

```
# clear a busy port in Windows
library(qdapRegex) # for rm_white()
library(stringr)   # for word()

port <- 4444L
tintern <- system("netstat -a -n -o", intern = TRUE)
irow1 <- grep(as.character(port), tintern)
if (length(irow1) > 0) {
  trow <- trimws(rm_white(tintern[irow1[1]]))
  tpid <- word(trow, -1, -1) # the last column is the PID
  system(paste0("taskkill /pid ", tpid, " /F"))
}
```
```
conda create --name ENVNAME python=3.9.7
conda activate ENVNAME
conda deactivate
```
Every time you make a new environment, don't forget to install git! For my CLI (Anaconda Powershell Prompt for Windows 11):

```
conda install git
```
Set the default format for displaying numbers, e.g. rounding to the nearest whole number:

```
pd.options.display.float_format = '{:.0f}'.format
```
- Jupyter Notebook is normally my go-to, especially if collaborators are comfortable working with GitHub.
- If browser-based IDEs aren't your favorite, Jupyter Ascending seems like a good option for using a text editor of your choice to generate Jupyter notebooks.
- Colab is another common tool; it's not my preference, given that the workflow for file management etc. is very different.
Useful resource comparing old torchtext (legacy) to new torchtext: https://lightrun.com/answers/pytorch-text-overview-of-issues-in-torchtext-and-the-plan-for-revamping
Use `pathlib` for relative paths in Python. Docs: https://docs.python.org/3/library/pathlib.html

Stuff to import at the top of the script/NB:

```
import pathlib
from pathlib import Path
```

The following lines of code identify the working directory and print the directory name/parent directory:

```
path = Path.cwd()
print(path)
print(path.parent)
```
Joining paths: In my workflow, I normally keep the working directory as my code folder and use relative paths to specify where collected data is stored. The following code stores the current working directory and builds the file path for the `data` subdirectory of a project:

```
code_path = Path.cwd()
data_path = Path(code_path.parent, 'data')
```
twarc is a command line tool and Python library for collecting and archiving Twitter JSON data via the Twitter API. It has separate commands (`twarc` and `twarc2`) for working with the older v1.1 API and the newer v2 API and Academic Access, respectively.

Docs: https://github.com/DocNow/twarc

More docs: https://scholarslab.github.io/learn-twarc/06-twarc-command-basics
You have to separately install `twarc-csv` to convert jsonl output to csv in the CLI. (What I use: Anaconda Powershell CLI.)

Workflow for extracting tweets using this Python library and converting the resulting file to CSV:

```
cd "C:\datadir\path"
twarc2 search --archive "search term" tweets.jsonl
twarc2 csv tweets.jsonl output.csv
```
How to stop tweet collection: Ctrl + C
GitHub repo: https://github.com/twintproject/twint

There have been some issues with twint lately; the biggest is only being able to scrape a sample of tweets at a time. There are some fixes for it, depending on your OS and CLI.
For adjusting vertical alignment of text in the template:

```
\documentclass[10pt,professionalfonts,t]{beamer}
```

Get rid of the `t` to revert to vertically centering text.
The site for creating images of source code is https://carbon.now.sh/.
Add figure to slide
```{r figure-name, echo = F, out.width = '100%', fig.cap = "Source: SOURCE HERE"}
knitr::include_graphics(here("figures", "Figure.jpeg"))
```