This package was designed with the aim of distributing educational resources for statistics courses targeted at students.
Once the teaching materials have been downloaded, the primary functions of this package are ghopen, ghload, and ghsource, which open, load, and execute files from the downloaded repository.
With these functions, students can access educational materials such as interactive apps, R code, data files, and other resources that help in learning statistical concepts. By providing easy access to these materials, the package aims to make learning more interactive and engaging.
GitHub allows you to download the repository as a ZIP file. You can find the option to download under the Code button (Download ZIP) in the repository. mmstat4 works with this ZIP file, but you can also use one of your own ZIP files.
In my courses, I assume that all R programs run in a freshly started R environment, meaning there are no path dependencies, and all necessary libraries are loaded within the R program. My repositories contain not only the example programs for the students but also the programs I use to create images and tables, as well as the Shiny Apps I demonstrate.
You can install mmstat4 from CRAN using:
install.packages("mmstat4")
Alternatively, you can install the development version from GitHub using devtools:
devtools::install_github("sigbertklinke/mmstat4")
The package ships with a small ZIP file containing educational materials. First, we instruct mmstat4 to use this ZIP file instead of the larger ZIP file for my Data Analysis I and II courses.
ghget('local')
ghopen("example_mcnemar.R") # open an R example file
ghget returns the key (local) associated with the currently active ZIP file. ghopen launches the example file in RStudio. To access the equivalent Python script, use:
ghopen("example_mcnemar.py") # open a Python example file
Note: Running Python scripts requires a local Python installation. Scripts execute within a virtual environment named mmstat4.xxxx, which is created the first time a script is run or opened; creating it requires the user's approval. When the environment is set up, mmstat4 checks for a file init_py.R in the ZIP file. If found, it is executed; it typically installs Python modules with reticulate::py_install('module name').
Shiny apps can also be launched in RStudio and run locally.
ghopen("pca_best_line/app.R") # open a Shiny app
Data files can be loaded with:
x <- ghload("TelefonDaten.csv") # load a data set
head(x)
#> V1 V2
#> 1 1876 2600
#> 2 1877 9300
#> 3 1878 26300
#> 4 1879 30900
#> 5 1880 47900
#> 6 1881 71400
HTML and PDF files will open in the default application:
ghopen("Formelsammlung.pdf") # open a PDF file
ghget
A ZIP file or repository can be stored locally or on the internet. A key-value approach can be used to determine the location of the source ZIP file. If no key is defined, then ghget uses the base name of the source ZIP file as the key.
ghget(dummy="https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")
Three repository keys are predefined: hu.data, hu.stat and dummy. You can retrieve them via
ghget('dummy')
ghget('hu.stat')
ghget('hu.data')
If you do not use a key, the programme will create one and return it as the result.
ghget(system.file("zip/mmstat4.dummy.zip", package = "mmstat4"))
#> [1] "mmstat4.dummy"
ghget("https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")
#> [1] "main"
# tries https://github.com/my/github_repo/archive/refs/heads/[main|master].zip
ghget("my/github_repo") # will fail
#> my/github_repo
#> https://github.com/my/github_repo/archive/refs/heads/main.zip
#> https://github.com/my/github_repo/archive/refs/heads/master.zip
#> Error in ghget("my/github_repo"): None of the previously displayed possible ZIP files were found!
#
ghget() # uses 'hu.data'
#> [1] "hu.data"
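The naming rules shown above can be mimicked in a few lines of base R. This is only a sketch of the idea, not the code used by ghget; the helper names default_key and candidate_urls are hypothetical and chosen for illustration.

```r
# Hypothetical helpers illustrating the naming rules; not part of mmstat4.

# Default key: base name of the ZIP file without its extension
default_key <- function(src) tools::file_path_sans_ext(basename(src))

# Candidate ZIP URLs that are tried for a GitHub "user/repo" shorthand
candidate_urls <- function(repo, branches = c("main", "master")) {
  sprintf("https://github.com/%s/archive/refs/heads/%s.zip", repo, branches)
}

default_key("zip/mmstat4.dummy.zip")
default_key("https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip")
candidate_urls("my/github_repo")
```

The first two calls give the keys mmstat4.dummy and main, matching the ghget results above; the last call lists the two URLs that are tried before the error is raised.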
ghget downloads the ZIP file, saves it to a temporary location and unpacks it. For non-temporary locations, see the FAQ.
After unpacking, unique short names for the files in the ZIP file are generated from their path components.
ghget('dummy')
#> [1] "dummy"
gd <- ghdecompose(ghlist(full.names=TRUE))
head(gd)
#> outpath inpath minpath filename
#> 1 /tmp/RtmpGONbxJ/mmstat4.dummy-main LICENSE
#> 2 /tmp/RtmpGONbxJ/mmstat4.dummy-main README.md
#> 3 /tmp/RtmpGONbxJ/mmstat4.dummy-main data 12411-0006.csv
#> 4 /tmp/RtmpGONbxJ/mmstat4.dummy-main data ArbeitsloseBerlin.csv
#> 5 /tmp/RtmpGONbxJ/mmstat4.dummy-main data BANK2.sav
#> 6 /tmp/RtmpGONbxJ/mmstat4.dummy-main data Preisindex.csv
#> source
#> 1 /tmp/RtmpGONbxJ/mmstat4.dummy-main/LICENSE
#> 2 /tmp/RtmpGONbxJ/mmstat4.dummy-main/README.md
#> 3 /tmp/RtmpGONbxJ/mmstat4.dummy-main/data/12411-0006.csv
#> 4 /tmp/RtmpGONbxJ/mmstat4.dummy-main/data/ArbeitsloseBerlin.csv
#> 5 /tmp/RtmpGONbxJ/mmstat4.dummy-main/data/BANK2.sav
#> 6 /tmp/RtmpGONbxJ/mmstat4.dummy-main/data/Preisindex.csv
The file name is split into four parts. The last two parts, minpath and filename, are used to create short names:
- The short name of /tmp/RtmpXXXXXX/mmstat4.dummy-main/LICENSE is LICENSE. There is no other file named LICENSE in the ZIP file. Therefore, the file name alone is sufficient to address this file in the ZIP file.
- The short name of /tmp/RtmpXXXXXX/mmstat4.dummy-main/data/BANK2.sav is data/BANK2.sav. There is another file called BANK2.sav in the ZIP file, so the file name alone is ambiguous, but data/BANK2.sav is sufficient to address this file uniquely (the other is dbscan/BANK2.sav). Currently, no check is made whether two files with identical basenames are also identical in content.
ghlist("BANK2", full.names=TRUE) # full names
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/BANK2.sav"
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/examples/data/cluster/dbscan/BANK2.sav"
ghlist("BANK2") # short names
#> [1] "data/BANK2.sav" "dbscan/BANK2.sav"
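How such short names can be derived may be easier to see in code. The following base R sketch takes, for each file, the shortest trailing part of its path that no other file shares. It is an illustration under that assumption, not the implementation used by mmstat4, and short_names is a hypothetical helper.

```r
# Hypothetical helper, not part of mmstat4:
# shortest trailing path part that identifies each file uniquely
short_names <- function(paths) {
  parts <- strsplit(paths, "/", fixed = TRUE)
  vapply(seq_along(parts), function(i) {
    p <- parts[[i]]
    for (k in seq_along(p)) {
      cand <- paste(tail(p, k), collapse = "/")
      # which files end with the same k path components?
      same <- vapply(parts, function(q) {
        length(q) >= k && identical(paste(tail(q, k), collapse = "/"), cand)
      }, logical(1))
      if (sum(same) == 1) return(cand)
    }
    paths[i]  # fall back to the full path
  }, character(1))
}

files <- c("mmstat4.dummy-main/LICENSE",
           "mmstat4.dummy-main/data/BANK2.sav",
           "mmstat4.dummy-main/examples/data/cluster/dbscan/BANK2.sav")
short_names(files)
```

For these three paths the sketch yields LICENSE, data/BANK2.sav and dbscan/BANK2.sav, the same short names as in the ghlist output above.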
ghopen, ghload, ghsource
The short names (or the full names) can be used to work with the files:
## x <- ghload("data/BANK2.sav") # load data via rio::import
## ghopen("univariate/example_ecdf.R") # open file in RStudio editor
## ghsource("univariate/example_ecdf.R") # execute file via source
ghlist("example_ecdf") # "univariate/" was unnecessary
#> [1] "example_ecdf.R"
ghlist, ghquery
With ghlist you can get a list of unique (short) names for all files in the repository, or for a subset selected by a regular expression pattern:
str(ghlist()) # get all short names
#> chr [1:473] "LICENSE" "README.md" "12411-0006.csv" "ArbeitsloseBerlin.csv" ...
ghlist("\\.pdf$") # get all short names of PDF files
#> [1] "Aufgaben.pdf" "Formelsammlung.pdf" "Loesungen.pdf"
With ghquery you can query the list of unique (short) names for all files based on the overlap distance.
ghlist("bnk") # pattern = "bnk"
#> character(0)
ghquery("bnk") # nearest string matching to "bnk"
#> [1] "data/BANK2.sav" "dbscan/BANK2.sav" "AverageGroupLinkage.R"
#> [4] "AverageLinkage.R" "CentroidLinkage.R" "CompleteLinkage.R"
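The difference between the two lookups can be illustrated with base R alone. Note the swap: the sketch uses utils::adist (a generalized Levenshtein distance) as a stand-in for approximate matching, whereas ghquery is based on the overlap distance, so the ranking will not be identical. The file names are a small made-up subset.

```r
nm <- c("data/BANK2.sav", "dbscan/BANK2.sav",
        "example_ecdf.R", "Formelsammlung.pdf")

# Regular expression matching: "bnk" occurs literally in none of the names
grep("bnk", nm, value = TRUE)

# Approximate matching: edit distance to the best-matching substring,
# ignoring case (stand-in for the overlap distance used by ghquery)
d <- drop(adist("bnk", nm, partial = TRUE, ignore.case = TRUE))
names(d) <- nm
sort(d)
```

The regular expression finds nothing, while the distance-based ranking puts the two BANK2.sav files first, because "bnk" is only one edit away from "bank".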
ghfile, ghpath, ghdecompose
ghfile tries to find a unique match for a given file and returns the full path. If there is no unique match, an error is raised and some possible matches are shown.
ghdecompose builds a data frame and decomposes the full names of the files into
- outpath: the path part which is the same for all files (basically the place where the ZIP file is extracted to),
- inpath: the path part that is not used in minpath, but in the ZIP file,
- minpath: the minimum path part, so that all files are uniquely addressable,
- filename: the base name of the file, and
- source: the input for shortpath.
The short names for the files are built from the components minpath and filename.
ghpath builds up the short name with various path components from a ghdecompose object.
ghfile('data/BANK2.sav')
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/BANK2.sav"
ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
#> [1] "local"
fnf <- ghlist(full.names=TRUE)
dfn <- ghdecompose(fnf)
head(dfn)
#> outpath inpath minpath filename
#> 1 /tmp/RtmpGONbxJ/mmstat4.dummy data hhberlin.csv
#> 2 /tmp/RtmpGONbxJ/mmstat4.dummy data Preisindex.csv
#> 3 /tmp/RtmpGONbxJ/mmstat4.dummy data BANK2.sav
#> 4 /tmp/RtmpGONbxJ/mmstat4.dummy data 12411-0006.csv
#> 5 /tmp/RtmpGONbxJ/mmstat4.dummy data child_data.sav
#> 6 /tmp/RtmpGONbxJ/mmstat4.dummy data hhD.rda
#> source
#> 1 /tmp/RtmpGONbxJ/mmstat4.dummy/data/hhberlin.csv
#> 2 /tmp/RtmpGONbxJ/mmstat4.dummy/data/Preisindex.csv
#> 3 /tmp/RtmpGONbxJ/mmstat4.dummy/data/BANK2.sav
#> 4 /tmp/RtmpGONbxJ/mmstat4.dummy/data/12411-0006.csv
#> 5 /tmp/RtmpGONbxJ/mmstat4.dummy/data/child_data.sav
#> 6 /tmp/RtmpGONbxJ/mmstat4.dummy/data/hhD.rda
head(ghpath(dfn))
#> 1
#> "/tmp/RtmpGONbxJ/mmstat4.dummy/data/hhberlin.csv"
#> 2
#> "/tmp/RtmpGONbxJ/mmstat4.dummy/data/Preisindex.csv"
#> 3
#> "/tmp/RtmpGONbxJ/mmstat4.dummy/data/BANK2.sav"
#> 4
#> "/tmp/RtmpGONbxJ/mmstat4.dummy/data/12411-0006.csv"
#> 5
#> "/tmp/RtmpGONbxJ/mmstat4.dummy/data/child_data.sav"
#> 6
#> "/tmp/RtmpGONbxJ/mmstat4.dummy/data/hhD.rda"
The package comes with two RStudio addins (see under Addins -> MMSTAT4):
- Open a file from a ZIP file (ghopenAddin), which gives access to the unzipped ZIP file and opens the selected file in an RStudio editor window.
- Execute a Shiny app from a ZIP file (ghappAddin), which extracts all directories containing Shiny apps and opens the selected app in a web browser (using the default browser).
Currently there are the following routines to support R/Python code snippets:
pkglist or modlist, which extract all library/require/import calls from code snippets and return a frequency table of the packages or modules called.
ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
#> [1] "local"
files <- ghlist(pattern="\\.R$", full.names = TRUE)
cat(head(pkglist(files, repos="https://cloud.r-project.org"), 12))
#> if(!require("Amelia")) install.packages("Amelia", repos="https://cloud.r-project.org/src/contrib")
#> # if(!require("CHAID")) install.packages("CHAID")
#> if(!require("DescTools")) install.packages("DescTools", repos="https://cloud.r-project.org/src/contrib")
#> if(!require("GGally")) install.packages("GGally", repos="https://cloud.r-project.org/src/contrib")
#> if(!require("HMMpa")) install.packages("HMMpa", repos="https://cloud.r-project.org/src/contrib")
#> if(!require("Hmisc")) install.packages("Hmisc", repos="https://cloud.r-project.org/src/contrib")
#> # if(!require("MASS")) install.packages("MASS")
#> # if(!require("MissingDataGUI")) install.packages("MissingDataGUI")
#> if(!require("NbClust")) install.packages("NbClust", repos="https://cloud.r-project.org/src/contrib")
#> if(!require("QuantPsyc")) install.packages("QuantPsyc", repos="https://cloud.r-project.org/src/contrib")
#> if(!require("RColorBrewer")) install.packages("RColorBrewer", repos="https://cloud.r-project.org/src/contrib")
#> if(!require("TeachingDemos")) install.packages("TeachingDemos", repos="https://cloud.r-project.org/src/contrib")
Note that the line for CHAID is commented out. The package cannot be found on CRAN, but you can install it from R-Forge.
cat(head(pkglist(files, repos=c("https://cloud.r-project.org", "http://R-Forge.R-project.org")), 12))
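To make the idea behind pkglist concrete, here is a rough base R sketch that extracts package names from library/require calls with a regular expression and tabulates them. It is a simplification under stated assumptions (plain library(pkg) or require(pkg) calls only, no pkg:: references, no Python imports) and not the package's own implementation; the code lines are made up.

```r
code <- c('library("MASS")',
          "require(ggplot2)",
          "x <- 1:10",
          "library(MASS); library(Hmisc)")

# Match library(...) or require(...) and capture the package name,
# with or without surrounding quotes
pattern <- "(?:library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)"
calls <- unlist(regmatches(code, gregexpr(pattern, code, perl = TRUE)))
pkgs  <- sub(pattern, "\\1", calls, perl = TRUE)

# Frequency table of the packages called
tab <- table(pkgs)
tab

# Install lines in the style of the pkglist output above
sprintf('if(!require("%s")) install.packages("%s")', names(tab), names(tab))
```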
You can add a file init_R.R or init_py.R to your ZIP file, which installs the necessary R packages or Python modules.
checkFiles checks whether each R code snippet runs smoothly in a freshly started R session.
# just check the last files from the list
# Note that the R console will show more output (warnings etc.)
checkFiles(files, start=435) # alternatively: Rsolo
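What a parse check of an R snippet amounts to can be sketched in base R. This is a minimal illustration, assuming that a file passes when parse() does not raise an error; parses_ok is a hypothetical helper and not part of mmstat4.

```r
# Hypothetical helper: TRUE if the file parses as R code, FALSE otherwise
parses_ok <- function(file) {
  tryCatch({
    parse(file = file)
    TRUE
  }, error = function(e) FALSE)
}

good <- tempfile(fileext = ".R")
bad  <- tempfile(fileext = ".R")
writeLines("x <- mean(c(1, 2, 3))", good)
writeLines("x <- mean(c(1, 2, 3)", bad)   # missing closing parenthesis

parses_ok(good)
parses_ok(bad)
```

A parse check only catches syntax errors. Whether a snippet really runs on its own (all libraries loaded, no path dependencies) can only be seen by executing it in a fresh R session.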
Three modes are available for checking a file:
- exist: Does the source file exist?
- parse: Is parse(file) or python -m "file" successful? (default)
- run: Is Rscript "file" or python3 "file" successful?
dupFiles uses checksums to detect files that occur more than once.
files <- ghlist(full.names = TRUE)
head(dupFiles(files)) # alternatively: Rdups
#> $c300e8fe6f0bc562256e81670c23d8c0
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/data/BANK2.sav"
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/cluster/dbscan/BANK2.sav"
#>
#> $`4efddb6dc6c7ed743221295d55133817`
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/nnet/mincer_nnet3.R"
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/nnet/mincer_nnet5.R"
#>
#> $`9f9fe7603aa82f33bbc85a9d32e39d03`
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/cluster/dbscan/app.tmpl"
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/mgraphics/scagnostics/app.tmpl"
#>
#> $`0b74b824367df429803599708daf2e2e`
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/subgroup/example_mosaic.R"
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/mgraphics/example_mosaic.R"
#>
#> $`8eaa4f89e233ba69fcda053d238699aa`
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/subgroup/example_mosaic_cotabplot.R"
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/mgraphics/example_mosaic_cotabplot.R"
#>
#> $`8ed6128aab796148df5e71cbeab547da`
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/subgroup/example_mosaic_graphics.R"
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/mgraphics/example_mosaic_graphics.R"
Note: there is also an error message if the necessary libraries are not installed!
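The checksum approach behind dupFiles can be reproduced with tools::md5sum. The sketch below is an assumption about the general technique, not the package's code: it writes three small files to a temporary directory, two of them with identical content, and groups the paths by checksum.

```r
dir <- tempfile("dup_demo")
dir.create(dir)
writeLines("a,b\n1,2", file.path(dir, "one.csv"))
writeLines("a,b\n1,2", file.path(dir, "copy_of_one.csv"))  # same content
writeLines("a,b\n3,4", file.path(dir, "other.csv"))

demo_files <- list.files(dir, full.names = TRUE)
md5    <- tools::md5sum(demo_files)           # named vector: path -> checksum
groups <- split(names(md5), unname(md5))      # group paths by checksum
dups   <- Filter(function(g) length(g) > 1, groups)
dups
```

As in the dupFiles output above, the result is a list named by checksum, with one entry per group of identical files.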
Once you have created your ZIP file, you need to know under which names a specific file can be accessed. In the example we use a ZIP file which comes with the package mmstat4:
ghget(local=system.file("zip", "mmstat4.dummy.zip", package="mmstat4"))
#> [1] "local"
ghnames <- ghdecompose(ghlist(full.names=TRUE))
ghnames[58,]
#> outpath inpath minpath filename
#> 58 /tmp/RtmpGONbxJ/mmstat4.dummy examples/data/cluster pcplot.R
#> source
#> 58 /tmp/RtmpGONbxJ/mmstat4.dummy/examples/data/cluster/pcplot.R
The shortest possible name is determined by minpath and filename. All longer names that prepend trailing parts of inpath to minpath and filename should also work.
For the file BANK2.sav in examples/data/cluster/dbscan, the following access names are possible:
- BANK2.sav will not work since there is more than one file named BANK2.sav in the ZIP file.
- dbscan/BANK2.sav will work since this is the shortest possible name.
- cluster/dbscan/BANK2.sav, data/cluster/dbscan/BANK2.sav, and examples/data/cluster/dbscan/BANK2.sav will work.
x1 <- ghload("BANK2.sav")
#> Best matches:
#> ghload(x = "data/BANK2.sav")
#> ghload(x = "dbscan/BANK2.sav")
#> Error in ghfile(x, msg = msg): No (unique) file 'BANK2.sav' found, check matches!
x2 <- ghload("dbscan/BANK2.sav")
x3 <- ghload("cluster/dbscan/BANK2.sav")
x4 <- ghload("data/cluster/dbscan/BANK2.sav")
x5 <- ghload("examples/data/cluster/dbscan/BANK2.sav")
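The set of working access names can be enumerated from the ghdecompose components. The following sketch assumes that every trailing part of inpath may be prepended to the short name, as in the calls above; access_names is a hypothetical helper and not part of mmstat4.

```r
# Hypothetical helper: all access names from the shortest to the longest
access_names <- function(inpath, minpath, filename) {
  short <- if (nzchar(minpath)) file.path(minpath, filename) else filename
  dirs  <- if (nzchar(inpath)) strsplit(inpath, "/", fixed = TRUE)[[1]] else character(0)
  out <- short
  for (k in seq_along(dirs)) {
    # prepend the last k directories of inpath
    out <- c(out, paste(c(tail(dirs, k), short), collapse = "/"))
  }
  out
}

access_names("examples/data/cluster", "dbscan", "BANK2.sav")
```

For these components the sketch returns the four names used in the ghload calls for x2 to x5.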
Please email me at sigbert@hu-berlin.de. You can also try the current development version of the package from GitHub:
# install.packages("devtools")
devtools::install_github("sigbertklinke/mmstat4")
No, this is not supported.
ghget("dummy", .force=TRUE)
ghget("dummy", .tempdir=FALSE) # install non-temporarily
ghget("dummy", .tempdir="~/mmstat4") # install non-temporarily to ~/mmstat4
ghget("dummy", .tempdir=TRUE) # install again temporarily
Note: If a repository was installed permanently and you switch back to temporary storage, the downloaded files will not be deleted.
ghget("dummy", .tempdir=TRUE)
ghlist(pattern="/(app|server)\\.R$")
ghopen("dbscan") # open the app
csv data files?
ghget("dummy", .tempdir=TRUE)
#> [1] "dummy"
ghlist(pattern="\\.csv$", ignore.case=TRUE, full.names=TRUE)
#> [1] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/12411-0006.csv"
#> [2] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/ArbeitsloseBerlin.csv"
#> [3] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/Preisindex.csv"
#> [4] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/TelefonDaten.csv"
#> [5] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/haushalte.csv"
#> [6] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/haushalte_berlin.csv"
#> [7] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/hhberlin.csv"
#> [8] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/hhberlin_2017.csv"
#> [9] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/pechstein.csv"
#> [10] "/tmp/RtmpGONbxJ/mmstat4.dummy-main/data/rentcap.csv"
# use mmstat4::ghload for importing
ghlist(pattern="\\.csv$")
#> [1] "12411-0006.csv" "ArbeitsloseBerlin.csv" "Preisindex.csv"
#> [4] "TelefonDaten.csv" "haushalte.csv" "haushalte_berlin.csv"
#> [7] "hhberlin.csv" "hhberlin_2017.csv" "pechstein.csv"
#> [10] "rentcap.csv"
pechstein <- ghload("pechstein.csv")
str(pechstein)
#> 'data.frame': 29 obs. of 3 variables:
#> $ Datum : chr "04.02.00" "01.02.01" "10.11.01" "06.02.02" ...
#> $ Tag : int 34 397 679 767 771 783 1043 1160 1166 1421 ...
#> $ Retikulozyten: chr "2,3" "2,5" "2,45" "2,1" ...
For Ubuntu (Linux) install:
sudo apt-get install python3 python3-dev python3-pip python3-venv libbz2-dev
Note: mmstat4 installs the Python modules numpy, scipy, statsmodels, pandas, scikit-learn, matplotlib, and seaborn by default.
init_py.R is only called if the virtual environment is created. Can I force a new call?
Yes, delete the virtual environment and recreate it:
reticulate::virtualenv_remove('mmstat4')
ghinstall('py', force=TRUE)
The package recognises three standard repositories: dummy, hu.stat, and hu.data.
Repository | Size | ZIP file location |
---|---|---|
dummy | 3 MB | https://github.com/sigbertklinke/mmstat4.dummy/archive/refs/heads/main.zip |
hu.data | 29 MB | https://github.com/sigbertklinke/mmstat4.data/archive/refs/heads/main.zip |
hu.stat | 31 MB | https://github.com/sigbertklinke/mmstat4.stat/archive/refs/heads/main.zip |
dummy is a small subsample of hu.stat and hu.data, intended for examples and test purposes.
Mathematical foundations - Introduction - Basic concepts - Univariate distributions - Parameters of univariate distributions - Bivariate distributions - Parameters of bivariate distributions - Regression analysis - Time series analysis - Index numbers - Probability theory - Random variables - How to lie with statistics - Important distribution models - Sampling theory - Statistical estimation methods - Regression model - Confidence intervals - Statistical tests - Parametric tests - Nonparametric tests
ghget("hu.stat")
ghopen("Statistik.pdf")
ghopen("Aufgaben.pdf")
ghopen("Loesungen.pdf")
ghopen("Formelsammlung.pdf")
General - R - Basics and data generation - Test and estimation theory - Parameter of distributions - Distribution - Transformations - Robust statistics - Missing values - Subgroup analysis - Correlation and association - Multivariate graphics - Principal component analysis - Exploratory factor analysis - Reliability - Cluster analysis - Regression analysis - Linear regression - Nonparametric regression - Classification and regression trees - Neural networks
ghget("hu.data")
ghopen("dataanalysis.pdf")
Introduction - Detection and identification of outliers - Testing the distributional form of variables - Parameter comparisons for independent samples - Appendices A-D, bibliography, index
ghget("hu.data")
ghopen("cs1_roenz.pdf")
Preface - Testing relationships - Regression analysis - Reliability and homogeneity analysis of constructs - Appendices A-H, bibliography, subject index
ghget("hu.data")
ghopen("cs2_roenz.pdf")
Introduction - Generalized linear models (GLM) - Modelling binary data - The multinomial logit model - Modelling multinomial data (log-linear models) - Bibliography, index
ghget("hu.data")
ghopen("glm_roenz.pdf")