class: center, middle, inverse, title-slide # Introduction --- layout: true <div class="dk-footer"> <span> <a href="https://rfortherestofus.com/" target="_blank">R for the Rest of Us </a> </span> </div> --- class: center, middle, dk-section-title background-image:url("images/gawn-australia-_OUvt8kLf0s-unsplash.jpg") background-size: cover # What is Data Cleaning? ??? Hello and welcome to Data Cleaning with R. Before we begin, let's define Data Cleaning. Let's look at some common definitions. --- class: center, middle >.large["Mechanistic, time consuming process of identifying and fixing incomplete, inaccurate, irrelevant, or duplicated data"] ??? One popular definition describes data cleaning as a mechanistic, time consuming process of identifying and fixing incomplete, inaccurate, irrelevant, or duplicated data -- >.large["Tedious tasks that get in the way of extracting true value and insights from data"] ??? Others see it as an obstacle to extracting true insights from data -- > .large["Not difficult, not _real_ work. Done (grudgingly) before the main analyses"] ??? and some say it is not difficult and not real work. that it is done grudgingly before the actual analyses. --- class: center, middle >.large["~~.gray[Mechanistic, time consuming]~~ process of identifying and _fixing_ incomplete, inaccurate, irrelevant, or duplicated data"] >.large["~~.gray[Tedious tasks that get in the way of]~~ extracting true value and insights from data"] > .large["~~.gray[Not difficult, not _real_ work. Done (grudgingly) before the main analyses"]~~] ??? I'd like to refute some of that. yes, data cleaning means fixing issues with our data, but it is real work, and it is not trivial. --- ## Data cleaning: .large[ A spectrum of **reusable data transformations** to make data **usable** and to obtain more meaningful results from our analysis or visualization methods.] ??? Let's try again. We can define data cleaning as a spectrum of data transformation steps to make data usable and to get more meaningful results from our workflows. -- .large[ Has no single end product, because different systems and methods consume different input data formats and structures.] ??? There is no one particular end product when we're cleaning data, because different programs take different inputs, and we need to adjust our approach for each one. Still, clean data has certan things in common that we will be discussing throughout this course. --- ## Cleaning data involves: ??? When we need to clean our data, these are the three main steps in the process. -- .large[ **1\.** Identifying specific **issues** that hinder an analysis, visualization, or reporting workflow.] ??? First, we identify the issues that hinder our workflows -- .large[ **2\.** Determining which approach or tool is needed to address each issue.] ??? Then we determine which approach or tool we will need -- .large[ **3\.** Transforming the data, making them more appropriate and valuable for a variety of downstream purposes. ] ??? Finally, we transform the data, making them more useful for various different purposes. --- ## Common Issues .fl.w-50[ - Non-rectangular data - Unusable variable names - Whitespace - Inconsistent letter case - Missing/implicit/misplaced grouping variables ] .fl.w-50[ - Compound values - Duplicated values - Broken values - Empty rows and columns - Numbers stored as text ] ??? While I was preparing this class, I noticed that most definitions of data cleaning mention fixing issues with the data, but these issues are often mentioned without descriptions or context. I collected the most common ones here. Data in the real world usually have these issues. This list here is for reference,but we will deal with all of them at a later section of this course. --- ## Why data cleaning ??? We often need to clean our data simply to get our software to import the data, and behave as we expect it to. -- .large[ - Enables downstream tasks by making data readable & usable ] ??? Clean data makes downstream tasks possible. -- .large[ - Saves time (rows and columns in a data object ready for use) ] ??? Having clean data saves time, for example, by having tables with rows and columns on which we can immediately perform operations. -- .large[ - Avoid potentially costly delays, or misinterpretations ] ??? After that, clean data is important to avoid delays, biases, or misinterpretations. --- ## Why data cleaning in R ??? Why should we use R for data cleaning. Because, by working with code and not interactively with pointing and clicking, we leave a record of our actions which we can revisit whenever we want. -- .large[ - Document and reproduce the process ] -- .large[ - Less time editing manually ] ??? with R, we can work on large files or with many files at once, which saves us time and might not even be feasible if done by hand. -- .large[ - Purpose-built tools available and well documented ] ??? I also like R so much for data cleaning because there are a number of tools and packages built specifically for this, which have good documentation and a large community of users ready to help us if we get stuck. -- .large[ - Code can be reproduced and repurposed ] ??? Finally, the code we write can be run many times and repurposed for new projects. --- class: center, middle # Examples ??? Let me show you some examples of what can happen when we don't clean our data. --- class: middle, center <img src="images/drugs.png" width="96%" /> ??? This graph is fictional, but it is based on a real used by a pharmaceutical company. It shows drug sales in a year. If, for example, an executive wants to know which was the top selling drug that year, this plot can be misleading. Why is that? Look at the labels on the x-axis. Each drug should only appear once, but some appear twice. This is because there are small variations in how the drug names were stored. The plotting software does not realize that underscores should be the same as spaces, or that a name in upper or lower case refers to the same drug. In this situation, the top-selling drug is not the one the chart is showing, so this is misleading. This looks obvious, but these kinds of issues are common and they can go unnoticed if we are working, for example with many years worth of data for hundreds of drugs. ---
Faculty Productivity Bonuses
2020 - School of Engineering
Dr. Deavan Smith
Dr. Jane Allen
??? This other example is also simplified and fictional, but it is based on a report used at a real university. Each year, the most productive faculty member receives a cash bonus, based on a report that is generated automatically. This is a good example of the consequences of not cleaning data before making calculations. Here, one faculty members appears twice in the data, because for whatever reason some entries include academic titles while the rest of the entries do not. This is also a simple fix, which could have avoided some embarrassment. --- class: middle, center <img src="images/fishlandings.png" width="100%" /> .small[Source: [Kenya Ministry of Agriculture, Livestock and Fisheries](https://africaopendata.org/organization/kenya-ministry-of-agriculture-livestock-and-fisheries)] ??? This next example is real, describing fisheries data for a lake in Kenya in 2016. There are no mistakes in the values, but the data are structured in a way that makes sense to humans but is harder to feed into any analysis software. If we had to restructure multiple files with this complicated layout, say, for many lakes and many years, the process can become time consuming. Fortunately, we can do this more efficiently in R, and will do so in a later lesson. --- </br> .left[ <img src="images/tymp.png" width="100%" /> ] .small[Source: Ojeda _et al._ [(1999)](https://doi.org/10.1006/jare.1999.0496)] ??? This next example is a table from a scientific publication. A lot of information is being presented here, and let's assume the only version of these data is in a PDF file. If wanted to work with the contents of this table, we would most likely run into some trouble. Nothing is inherently wrong with this table, but sometimes we might be forced to import and work with these kinds of data sources ---
.small[source: `gt:gtcars` Luxury Automobiles Dataset] ??? This last example describes some fancy cars. Things are OK for the most part, but notice these rows that group the cars by the country of their maker. This is informative and readable, but not ideal from an analysis perspective, and we will see why later on in the course. With these few examples, you hopefully were able to recognize some common issues and relate them to things you may have seen in your own work. --- class: inverse ## Your turn 1. Read this short blog [post](https://www.johndcook.com/blog/2021/01/12/data-pathology/) by John D. Cook. .center[https://www.johndcook.com/blog/2021/01/12/data-pathology/] 2. Reflect on the amount (if any) of data cleaning you perform for your day-to-day work. ??? Before we get to any code, please complete these short tasks. First, read a short text on why we should not consider data cleaning an obstacle to our work. Then, reflect on how much time you spent cleaning data in your own work, do you enjoy it? do you find it tedious? --- class: center, middle, dk-section-title background-image:url("images/willian-justen-de-vasconcellos-JJEQ1SQhkEg-unsplash.jpg") background-size: cover # Data Cleaning is Analysis ??? Before moving on, I want to really emphasize that data cleaning is the first step in an analysis. Let me elaborate. --- ### When we clean data, we make judgments and interpretations ??? When we clean data, we make judgments and interpretations -- - Are the data in a usable structure? ??? Can we use the data in their current state? -- - What constitutes 'unwanted' variation? ??? Also, what constitutes unwanted variation? As we saw earlier in an example, are spaces between words interchangeable with underscores? do we want to keep this kind of variation? It will always depend on the context. -- - Is there any implicit information that needs to made explicit? ??? Sometimes we need to figure out if there is any information that is only implied but not actually recorded within a dataset. Consider the example from earlier with the cars and the country of their makers. The country names show up in the data but it was up to us to interpret what this means. -- - What to do about missing data. ??? Finally, what to do about missing data. We will go over this in detail later, but it is up to us to interpret what missing values mean before moving on with a workflow. --- class: center, middle .large[After identifying issues in our data, we decide which tool is best suited to address them.] -- ## This is also data analysis. -- >.large[Data cleaning is an important stage in most data analyses, not a one-off task of lesser importance.] ??? Most of the time, diagnosing issues with our data and doing something about them is the first step in data analysis. --- .pull-left[ <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">I will let the data speak for itself when it cleans itself</p>— Allison Reichel (@AllisonReichel) <a href="https://twitter.com/AllisonReichel/status/1350107719236247558?ref_src=twsrc%5Etfw">January 15, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> ] .pull-right[ <blockquote class="twitter-tweet" data-dnt="true"><p lang="en" dir="ltr">I was recently asked, “are you sure teaching Python & R is right? Isn’t the future all AI and ML?”<br><br>Munging data is going to be step 1 of analysis for the foreseeable future. <a href="https://t.co/n2tKDwI0wn">https://t.co/n2tKDwI0wn</a></p>— JD Long (@CMastication) <a href="https://twitter.com/CMastication/status/1361050957962952705?ref_src=twsrc%5Etfw">February 14, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> ] ??? these two tweets summarize the importance of cleaning data before doing pretty much anything else. Being able to clean data is just as valuable as learning the most novel and complicated analytical methods. --- class: center, middle, dk-section-title background-image:url("images/endri-killo-tvWqbeOaJnw-unsplash.jpg") background-size: cover # The Data Cleaning Process ??? Now that we defined data cleaning, and that we recognize its importance in most workflows, lets go over the whole process. --- class: middle, center <img src="images/workflow.png" width="110%" /> ??? We start the inputs. R can import pretty much every kind of file from our computer or from a cloud service. We can import the files into R and keep the raw data intact, there is no need to modify it in place. With our data loaded into R, we can explore, restructure, and fix any issues before going on to models and plots, or to export a clean version of the data. All the while, by working with code and scripts, we are creating a record of what we did at each step. --- class: center, middle, dk-section-title background-image:url("images/bence-boros-l3_9j916sh0-unsplash.jpg") background-size: cover # Course Logistics ??? We are almost ready to start cleaning data but before we do, lets spell out the general logistics for this course. --- ## Requirements and Prerequisites ### You: - are familiar with basic data types and objects in R - can use functions and arguments - have R and RStudio running - can install and load packages ??? Before you move on, you might want to check if you meet these basic prerequisites, like having R and Rstudio installed and ready. If you don't feel comfortable working with data types, objects, and functions, I suggest the free Fundamentals of R course. --- ## Course conventions ??? Now, some conventions that I use throughout this course. -- - All packages available on **CRAN** (can be installed with .orange[`install.packages()`]) ??? All the packages we will be using are available on CRAN, which should make for easier installation. -- - All slides are available in html format ??? All the slides are available in html format, which you can open with any web browser. -- - Package names in .b.rrured[red], preceded by 📦 ??? All package names appear in red with a package emoji -- - Functions and important operators in .b.orange[orange] or .b.purple[purple] ??? Important functions and operators appear in orange or purple. -- - Relevant example data is provided in slides (see '**data-setup**' panels) ??? Finally, there is code for creating the example datasets that appear in the slides whenever a tabbed panel says data-setup. --- # My setup - R v4.0.4 - RStudio v1.4.1106 - Default theme and pane layout - Windows 10 ??? I'm running recent versions of R and Rstudio in Windows 10, and other than making the font slighly larger I made no additional customization. --- ## Course setup **1\.** Create a new RStudio project on your computer **2\.** Download course materials (slides and data needed for "your turn" exercises) with 📦 .b.rrured[`usethis`] ```r install.packages("usethis") # Install first if necessary usethis::use_course("rfortherestofus/data-cleaning-course") # Download course materials ``` **3\.** Copy the downloaded materials to the folder containing the RStudio project you created for this course. ??? For this course, we'll need to download some files. First, create a new RStudio project somewhere on your computer where you will be able to find it. Then, install the usethis package and call the use course function to download the course materials. Finally, copy contents of the new folder into the location of the Rstudio project you created for this course. I'll demonstrate these steps now.