class: center, middle, inverse, title-slide

.title[
# Best Practice in Data Analysis
]
.author[
### John Thompson
]
.date[
### 29th March 2023
]

---

# A Data Analysis Workflow
---

# Notes on the Workflow

<br>

- People rarely make such a diagram .. instead, they have a folder of data files and R scripts

<br>

- They will have many R scripts - `read_local.R`, `read_remote.R` etc.

<br>

- The scripts will have been through many iterations - `read_local.R`, `read_local2.R`, `read_local3.R`, `read_local3_update.R` etc.

<br>

- Often, several people work on the same data

---

# A Typical Scenario

- You prepare `Analysis One` and find an association of `1.37` between gene A and Disease X, which you present at a conference in Newcastle

<br>

- A colleague prepares `Analysis Two` and presents their results in Paris

<br>

- Your supervisor decides to take the credit by writing an article about `Analysis Three` (combining One and Two)

<br>

- The supervisor asks for your results and discovers that Analysis One is based on `n=4112` and Analysis Two is based on `n=4114`. They tell you to sort it out

<br>

- The problem lies in the exclusions, so you agree to include only patients accepted in both analyses, `n=4107`

<br>

- You re-run Analysis One and get an updated association of `1.35`, which you email to the supervisor

---

# Your Analysis Unravels

<br>

- In the middle of the night it dawns on you that you excluded the extra patients in `analysis_one_v8_for_prof.R`, but they were not excluded from the pre-processing

<br>

- You remove the new exclusions to create `analysis_one_v8_for_prof2.R` and add them to `local_clean3.R` to create `local_clean3_strict.R`, which you run

<br>

- You now re-run the pre-processing scripts `local_pca_v5.R` and `local_normalise_v3.R`, but you cannot recall whether you previously normalised before PCA, or vice versa. You are not sure whether it matters, so you try both; the association is either `1.32` or `1.41`

<br>

- You avoid your supervisor while you try to sort it out

---

# Things Get Worse

<br>

- You get an email from someone who was at your Newcastle talk. Based on that work, they are planning some experiments on mice. They want to know what happens to 1.37 when you adjust for sex. The analysis will be quick and it will inform their experimental design.

<br>

- You try to reconstruct the analysis based on n=4112, but when you re-run it, you now get `1.31` and not `1.37`. You have no idea why. Sex adjustment of this new analysis changes the result to `1.16`, which could be important. You wonder whether 1.37 was a typo for 1.31 in your slides, or whether one of the R packages has been updated.

<br>

- You work for days trying to reconstruct the Newcastle analysis, but are unsuccessful. A second email questions the delay. You are still avoiding your supervisor.

<br>

- You would like to ask for help, but fear that if this became public, you would never get a job. You are sure that nobody else is this disorganised.

---

# Reasons for Adopting Best Practice

### Error avoidance

**Errors are inevitable. Adopt practices that minimise them or make them easier to spot**

### Efficiency

**Data analysis is time-consuming. Adopt practices that save time and effort**

### Clarity & Reproducibility

**Your methods must be clear. If you cannot reproduce a result exactly, how do you know whether it is correct?**

### Communication

**Science is worthless unless it is shared. Cover-ups are unethical**

---

# Best Practice in Data Analysis

### Reasons

- Error Avoidance
- Efficiency
- Clarity & Reproducibility
- Communication

### Ways to Achieve Good Practice

- Standardisation
- Archiving
- Documentation
- Modularisation
- Testing and checking
- Openness

---

# Methods

.panelset[
.panel[.panel-name[Standardise]
### Standardisation

- work in a consistent, structured way
  - standard folder structure
  - standard file naming convention
  - same archiving method
- follow an R style guide
- reuse the same skills (packages)
]
.panel[.panel-name[Archive]
### Archive

- archive your code frequently
- archive the raw data when it changes
- cache results (sketched on a later slide)
- archive results if the compute time is very long

<br>
<br>

- cache .. short-term storage for current results files
- archive .. long-term storage for current and previous versions of the results
]
.panel[.panel-name[Document]
### Documentation

- documentation is tiresome but essential. Document as you go along; you will forget very quickly
- do NOT use undocumented shortcuts
  - click-based software or manual data editing
  - cut/paste into a report/slides
- record the methods that you use (sketched on a later slide)
  - the order in which scripts need to be run
  - the version numbers of any software or packages
  - seeds used for simulations and Bayesian computations
- make a README and keep it up to date
- place comments within your scripts
- document the data as well as your code
- write internal reports on all intermediate steps and keep an analysis diary
]
.panel[.panel-name[Modules]
### Modularisation

- modularise your code with a different script for each task
- DRY => Don't Repeat Yourself ... avoid repetitive code
- write functions for common tasks (sketched on a later slide)
- place your functions in a package
- share your functions
- use existing functions whenever you can
- adopt the functional programming style
]
.panel[.panel-name[Testing]
### Testing & Checking

- start simple .. test .. add to the complexity
- use a smaller dataset when developing the analysis
  - quicker and easier to check
- know your data. Prepare a dashboard for data exploration
- always print/plot/tabulate to inspect intermediate results
- try to reproduce known findings
- test each script/function before using it
- check whether results change whenever you "improve" your code
]
.panel[.panel-name[Openness]
### Openness

- openness encourages higher standards
- collaborate, share the load
- make your code public
  - secrecy is anti-scientific
  - shared code might help others
  - someone might suggest an improvement
- start a conversation with other researchers
  - write a blog
  - use social media to publicise your work
]
]
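---

# Example: Caching a Slow Result

A minimal sketch of the caching idea from the Archive panel. The file names and the PCA step are hypothetical; only the pattern matters: reuse a saved result when it exists, recompute and save it when it does not.

```r
# Hypothetical caching sketch: reuse the cached PCA scores if present,
# otherwise compute them once and store them for the next run
cache_file <- "cache/pca_scores.rds"

if (file.exists(cache_file)) {
  pca_scores <- readRDS(cache_file)            # short-term cache hit
} else {
  raw <- readRDS("data/expression.rds")        # hypothetical raw data file
  pca_scores <- prcomp(raw, scale. = TRUE)$x   # the slow computation
  dir.create("cache", showWarnings = FALSE)
  saveRDS(pca_scores, cache_file)              # cache for the next run
}
```

Deleting the cache file (or the whole `cache` folder) forces a clean recomputation.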
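---

# Example: Documenting a Run

A sketch of the Documentation advice on seeds and version numbers. The simulation is a stand-in and the paths are illustrative; the point is that the seed and the `sessionInfo()` output are saved next to the results, so the run can be reproduced exactly.

```r
# Record the seed and the software versions alongside the results.
# The simulation and the file paths are illustrative only.
set.seed(7613)                                # document every seed you use
simMeans <- replicate(1000, mean(rnorm(50)))  # stand-in for a real simulation

dir.create("results", showWarnings = FALSE)
saveRDS(simMeans, "results/sim_means.rds")
writeLines(capture.output(sessionInfo()),     # R and package version numbers
           "results/session_info.txt")
```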
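---

# Example: One Tested Function, Not Repeated Code

A sketch of the DRY and testing advice: a common calculation lives in one function, with checks on its input and a small test before it is used. The odds-ratio function and the test table are illustrative, not part of the analyses in the scenario.

```r
# A reusable function for a 2x2 table, with input checks (DRY)
odds_ratio <- function(tab) {
  stopifnot(is.matrix(tab), all(dim(tab) == c(2, 2)), all(tab > 0))
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}

# test the function before using it: this table has a known OR of 4
testTab <- matrix(c(2, 1, 1, 2), nrow = 2)
stopifnot(all.equal(odds_ratio(testTab), 4))
```

In a real project the function would sit in its own script (or package) and the checks would move to a `testthat` test file.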
---

# Resources

<br>
<br>

- **Git**: professional archiving tool for code/scripts (version control)
- **GitHub**: a cloud-based repository that uses Git
- **rmarkdown**: creates reproducible documentation: reports, slides, websites, articles, theses
- **quarto**: the next-generation successor to rmarkdown
- **renv**: R package for recording/controlling package version numbers
- **repo** & **archivist**: R packages for archiving data and results
- **makepipe**: R package for ordering scripts
- **targets**: R package for ordering scripts and caching intermediate results (cf archivist + makepipe); a minimal sketch follows

There are many more R workflow packages, see https://github.com/jdblischak/r-project-workflows
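---

# Example: A Minimal targets Pipeline

A sketch of a `_targets.R` file, assuming hypothetical functions `read_data()` and `run_pca()` kept in `R/functions.R`. `targets` records the order of the steps and caches every intermediate result.

```r
# _targets.R -- minimal pipeline sketch; read_data() and run_pca()
# are hypothetical functions defined in R/functions.R
library(targets)
tar_source("R/functions.R")    # load your own functions

list(
  tar_target(rawFile, "data/raw.csv", format = "file"),  # watch the raw data
  tar_target(rawData, read_data(rawFile)),               # re-read when it changes
  tar_target(pcaRes,  run_pca(rawData))                  # re-run only if needed
)
```

`targets::tar_make()` runs the pipeline, skipping any step whose code and inputs are unchanged.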