Commit 0c0d5bbc authored by Francois Pelletier

Convert to an R project

parent 759a984f
@@ -2,37 +2,28 @@
# History files
.Rhistory
.Rapp.history
# Session Data files
.RData
# Example code in package build process
*-Ex.R
# Output files from R CMD build
/*.tar.gz
# Output files from R CMD check
/*.Rcheck/
# RStudio files
.Rproj.user/
# produced vignettes
vignettes/*.html
vignettes/*.pdf
# OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
.httr-oauth
# knitr and R markdown default cache directories
/*_cache/
/cache/
# Temporary files created by R markdown
*.utf8.md
*.knit.md
# Shiny token, see https://shiny.rstudio.com/articles/shinyapps.html
rsconnect/
.Rproj.user
relational-database-training-r-sql-sas.Rproj
# Working with relational data using dplyr
In this short training session, we introduce some data concepts that will help you grasp the principles that underlie the dplyr package and its implementation.
## The relational data model
The relational model is a way to manage data using first-order predicate logic, introduced by Edgar F. Codd in 1969.
In this model, you specify data and queries in a declarative way, which is similar to natural language. You let the data management system deal with the mechanism to store and retrieve data.
The most common language for working with the relational data model is SQL. In R, the dplyr package lets you work with data frames in a functional or declarative way.
### Concepts of the relational model
- A record is a single tuple of values identified by a key. It is also known as a row, both in a database and in R.
- A relation is a set of records that share the same attributes, each record being identified by a key. It is also known as a table in a database and as a data frame in R.
- A key links records from different tables together through a join operator.
- A primary key uniquely identifies a record in a relation.
- A foreign key identifies a record in another relation by referring to that relation's primary key.
- A surrogate key is a key you create when you don't have a primary key. In `dplyr`, the `row_number` function provides one.
- An attribute is a set of values associated with a key through an attribute name. It is known as a column in a database and as a column (or variable) in R.
![](Relational_model_concepts.png)
[1]
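The key concepts above can be tried out on two small data frames. This is a minimal sketch: the `customers` and `orders` tables and their column names are made up for illustration and do not come from the training material.

```r
library(dplyr)

# Hypothetical relation: customer_id is meant to be the primary key
customers <- data.frame(
  customer_id = c(1L, 2L, 3L),
  name = c("Ana", "Ben", "Chloe")
)

# Hypothetical relation: customer_id is a foreign key pointing to customers
orders <- data.frame(
  customer_id = c(1L, 1L, 3L),
  amount = c(20, 35, 12)
)

# A primary key must be unique: count each key value and keep the repeats
duplicated_keys <- customers %>%
  count(customer_id) %>%
  filter(n > 1)

# orders has no primary key of its own, so add a surrogate key
orders_with_key <- orders %>%
  mutate(order_id = row_number())
```

An empty `duplicated_keys` suggests that `customer_id` can serve as a primary key for `customers`.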
### Operations in the relational model
There are four basic operations that can be performed on data, whether in a database or in flat files. Together they are known as CRUD.
- **Create**: In SQL, you would use the `INSERT` statement. In `dplyr`, you would use the `bind_rows` function, which behaves more like the `OUTER UNION` operator found in some SQL dialects (for example SAS `PROC SQL`).
- **Read**: In SQL, you read data using a `SELECT` statement. In `dplyr`, you use the `select` function.
- **Update**: In SQL, you use an `UPDATE` statement to change the values of attributes. In `dplyr`, you use the `mutate` function both to update a record and to create new variables. It is better practice to treat each data frame as immutable and create a new one from the existing one, so you always end up creating a new object in R. You can delete unnecessary data frames with `rm`, which is the equivalent of `DROP TABLE` in SQL.
- **Delete**: In SQL, you use a `DELETE` statement. In `dplyr`, you would use the `filter` function to create a new, subsetted data frame. It is similar to the `WHERE` clause.
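The four operations can be sketched on a small data frame as follows. The `employees` table and its values are hypothetical; each step returns a new data frame and leaves its input untouched.

```r
library(dplyr)

# Hypothetical table used only for illustration
employees <- data.frame(
  id = c(1L, 2L, 3L),
  name = c("Ana", "Ben", "Chloe"),
  salary = c(50000, 60000, 55000)
)

# Create: append a new record (compare with INSERT)
new_hire <- data.frame(id = 4L, name = "Dev", salary = 52000)
employees_created <- bind_rows(employees, new_hire)

# Read: keep only some attributes (compare with SELECT name, salary)
employees_read <- employees_created %>%
  select(name, salary)

# Update: change one record's value (compare with UPDATE ... SET ... WHERE id = 2)
employees_updated <- employees_created %>%
  mutate(salary = if_else(id == 2L, 65000, salary))

# Delete: keep every record except id 3 (compare with DELETE ... WHERE id = 3)
employees_deleted <- employees_updated %>%
  filter(id != 3L)
```

Because every step produces a new object, `employees` still holds its original three rows at the end.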
## The grammar of data manipulation
`dplyr` is based on a grammar of data manipulation, which is a different way to approach our work compared to the CRUD principles of databases.
You can still easily map the `dplyr` verbs to SQL clauses:
- `arrange` relates to `ORDER BY`
- `group_by` relates to `GROUP BY`
- `filter` replaces both `WHERE` and `HAVING`
- `summarise` replaces the aggregation functions in `SELECT` statements
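As an illustration of this mapping, here is a short pipeline next to a roughly equivalent SQL query. The `sales` table is hypothetical and the SQL in the comment is only a sketch.

```r
library(dplyr)

# Hypothetical table used only for illustration
sales <- data.frame(
  region = c("East", "East", "West", "West", "North"),
  amount = c(100, 250, 80, 40, 300)
)

# Roughly equivalent SQL:
#   SELECT region, SUM(amount) AS total
#   FROM sales
#   WHERE amount > 50
#   GROUP BY region
#   HAVING SUM(amount) > 200
#   ORDER BY total DESC
region_totals <- sales %>%
  filter(amount > 50) %>%             # WHERE: filter rows before grouping
  group_by(region) %>%                # GROUP BY
  summarise(total = sum(amount)) %>%  # aggregation function in SELECT
  filter(total > 200) %>%             # HAVING: filter on the aggregate
  arrange(desc(total))                # ORDER BY ... DESC
```

Note that `filter` appears twice: before `summarise` it plays the role of `WHERE`, and after it the role of `HAVING`.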
In a similar fashion, you can join tables using the `dplyr` two-table verbs. You define the keys of each table with the `by` argument, which has the following syntax:
```{r eval=FALSE}
by = c(
"variable_in_table_1"="variable_in_table_2",
"variable_in_table_1"="variable_in_table_2",
...)
```
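A concrete use of the `by` argument might look like the sketch below. The `orders` and `customers` tables, and their differently named key columns, are assumptions made for the example.

```r
library(dplyr)

# Hypothetical tables whose key columns have different names
orders <- data.frame(
  order_id = 1:3,
  cust = c(10L, 20L, 40L)
)
customers <- data.frame(
  customer_id = c(10L, 20L, 30L),
  name = c("Ana", "Ben", "Chloe")
)

# The name on the left is the key in the first table,
# the value on the right is the key in the second table
matched <- inner_join(orders, customers, by = c("cust" = "customer_id"))

# left_join keeps every order and fills name with NA when nothing matches
all_orders <- left_join(orders, customers, by = c("cust" = "customer_id"))
```

The joined result keeps the key name from the first table, here `cust`.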
Regular join operators (affect the variables)
![join Venn diagram](join-venn.png)
Filtering join operators (affect the records)
- `semi_join(x, y)` keeps all observations in `x` that have a match in `y`.
- `anti_join(x, y)` drops all observations in `x` that have a match in `y`.
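The two filtering joins can be compared side by side on a small example. The tables below are hypothetical; note that neither join adds columns from `y`.

```r
library(dplyr)

# Hypothetical tables used only for illustration
customers <- data.frame(
  customer_id = c(1L, 2L, 3L),
  name = c("Ana", "Ben", "Chloe")
)
orders <- data.frame(
  order_id = 1:3,
  customer_id = c(1L, 1L, 3L)
)

# Customers with at least one order; rows are never duplicated
with_orders <- semi_join(customers, orders, by = "customer_id")

# Customers with no order at all
without_orders <- anti_join(customers, orders, by = "customer_id")
```

Customer 1 has two orders but appears only once in `with_orders`, which is the main difference from an `inner_join`.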
## Set operators
In relational algebra, there are three set operators available:
- Intersection: `INTERSECT` in SQL, `intersect` in `dplyr`
- Union: `UNION` in SQL, `union` in `dplyr`
- Set difference: `EXCEPT` in standard SQL (`MINUS` in Oracle), `setdiff` in `dplyr`
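A minimal sketch of the three set operators on two hypothetical data frames with the same columns. The functions are called with the `dplyr::` prefix to make it explicit that the data frame versions are used rather than the base R ones.

```r
library(dplyr)

# Hypothetical tables with identical columns
a <- data.frame(id = c(1L, 2L, 3L))
b <- data.frame(id = c(2L, 3L, 4L))

# Rows present in both tables
common <- dplyr::intersect(a, b)

# Rows present in either table, without duplicates
either <- dplyr::union(a, b)

# Rows present in a but not in b
only_a <- dplyr::setdiff(a, b)
```

These operators compare whole rows, so both inputs need the same variables.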
# References
## Websites
- [dplyr, a part of the tidyverse](https://dplyr.tidyverse.org/)
- [Chapter 13 - Relational data, in R for Data Science](https://r4ds.had.co.nz/relational-data.html)
## Images
[1] By User:AutumnSnow - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1313684