class: center, middle, inverse, title-slide .title[ # What Next? ] .subtitle[ ## more skills to improve your data analysis workflow ] .author[ ### John Thompson ] .date[ ### 29th March 2023 ] --- # Rmarkdown to Quarto <br> - quarto extends rmarkdown to allow code chunks in R, Python, Julia and Javascript <br> - Developed by **Posit** as part of their move to support other languages <br> - At present rmarkdown files can be rendered with quarto without any adaptation <br> - In future, development efforts will be directed at quarto --- # Archiving Data <br> - archive scripts and the raw data and you do not need to archive data files or results files unless compute time is a issue - create a local data repo with the R Packages such as `repo` and `archivist` - small amounts of data can be saved to GitHub (50Mb limit) - data repositories (see https://www.nature.com/sdata/policies/repositories) will host data for sharing, e.g. as part of a publication - cloud computing providers will host your data. Amazon AWS, Microsoft Azure, Google cloud etc. (all offer a free service to get you started) --- # Make Files <br> - for reproducibility, a project should be controlled by a make file (a master script that runs everything else). - the R package `makepipe` creates a master script - the R package `targets` creates a master script and caches intermediate results - for maximum efficiency, `makepipe` and `targets` monitor changes to the code and only re-run code that is affected by the changes --- # Package Versions <br> - packages are updated regularly, with bug fixes, new features etc. <br> - code that works with one version of a package, may not work with another version <br> - packages are only guaranteed to work together when their versions are in sync <br> - code that works for you, might fail for someone else if they use a different version of a package <br> - R sets up a single library of packages that is accessed by all of your projects - updating might be good for one project and disastrous for another <br> - `renv` allows separate libraries for every project and keeps track of the version numbers that you are using so that you can pass that information to a collaborator --- # Your Own Packages <br> - a personal package is a convenient way to store your functions <br> - functions in a package can be documented with `roxygen2` to provide help on the syntax <br> - functions in a personal package can be used across all of your projects <br> - you can upload your package to GitHub and share it with colleagues or even the whole world <br> - do not go public without much thought. Maintaining a package is a big undertaking --- # Functional programming <br> - Good Practice says that - modularisation of project code makes it easier to write and maintain - using functions for repeated tasks makes scripts easier to write and maintain - well-named functions act like comments <br> - the `purrr` package provides ways of applying a function multiple times with different inputs, so called functional programming (FP) - coding without loops - it takes a while to adjust to full FP but eventually your code will be easier to follow and less likely to contain errors --- # Learn html and latex <br> - markdown only offers basic formatting. At some point, you will need a format that markdown does not provide <br> - it is good to have a house-style for your documents. You could start with a markdown theme found on the internet, but eventually you will want to design your own <br> - to have full control over an html document you need to know the html language and to control a pdf you need to know latex --- # Learn Shiny <br> - shiny provides interactive visualisations <br> - shiny dashboards are excellent for - exploring large datasets - communicating your results <br> - visit https://shiny.rstudio.com/gallery/ to see what shiny can do --- # Make Better Reports <br> - learn a tables package so that you can produce pretty tables - `gt` and `flextable` are the best - `kable` and `kableExtra` are simpler alternatives <br> - improve your ggplot2 skills. e.g. use `patchwork` <br> - learn how to produce maps in R <br> - learn how to make diagrams in R, e.g. R package `diagrammeR`, latex package **tikz** <br> - handle BibTeX citation with an R package. see https://ropensci.org/blog/2020/05/07/rmd-citations/ --- # Use Machine Learning Methods <br> - any fool can run fit an ML model, very few people understand those models - understand boasting, bagging, regularization etc. <br> - Machine Learning models often out-perform statistical models - try tree-based methods, such as random forests or XGBoost <br> - learn to analyse textual and image data <br> - learn the basics of deep learning - how do you fit a neural network? - why are large neural networks so good at prediction? - what are the problems of a model with a billion parameters? - what is the difference between a transformer and a GAN? --- # Integrate AI into Your Workflow <br> - deep learning will dominate data science, at least in the medium term <br> - explore ways of using ChatGPT or GPT-4 in your daily work - to explain a difficult concept - to review a topic - to write your next grant application - to write R code - to add comments to existing R code - to find errors in your code <br> - consider the ethics of using AI in research. - What about the ethics of not using AI? - Do you need a group policy? --- # Learn a Second Language <br> - many R packages use C++ in the background to provide speed - learn C/C++ and the Rcpp package <br> - learn python if you want to get deeper into machine learning <br> - learn Julia or another of the modern languages - nearly as fast a C but without the pain - some good ML and Bayesian packages