
Commit 1e9b421

Update some examples to use new arviz (#822)
* update to use new arviz
* add more examples
* update BEST example
* update per comments
* update BF example
* update model averaging
* update model averaging
* update and modify example
* update data container example
* update data container example
* update more examples
* remove unused code
* rerun examples
* rerun examples
* update rugby example
* update itr nba example
* allow links from EABM online book
* add more examples

---------

Co-authored-by: Oriol Abril-Pla <oriol.abril.pla@gmail.com>
1 parent f637b9f commit 1e9b421

36 files changed (+7146, -10615 lines)

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -84,7 +84,7 @@ repos:
       examples/survival_analysis/weibull_aft.ipynb|
       examples/howto/custom_distribution.ipynb)
     entry: >
-      (?x)(arviz-devs.github.io|
+      (?x)(arviz-devs.github.io/arviz|
       pymc-experimental.readthedocs.io|
       docs.pymc.io|
       numpy.org/doc|

examples/bart/bart_categorical_hawks.ipynb

Lines changed: 652 additions & 1733 deletions
Large diffs are not rendered by default.

examples/bart/bart_categorical_hawks.myst.md

Lines changed: 19 additions & 51 deletions
@@ -5,9 +5,9 @@ jupytext:
     format_name: myst
     format_version: 0.13
 kernelspec:
-  display_name: pymc-examples
+  display_name: pymc
   language: python
-  name: pymc-examples
+  name: python3
 myst:
   substitutions:
     conda_dependencies: pymc-bart
@@ -35,7 +35,7 @@ In this example, we will model outcomes with more than two categories.
 import os
 import warnings
 
-import arviz as az
+import arviz.preview as az
 import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd
@@ -51,7 +51,7 @@ warnings.simplefilter(action="ignore", category=FutureWarning)
 ```{code-cell} ipython3
 # set formats
 RANDOM_SEED = 8457
-az.style.use("arviz-darkgrid")
+az.style.use("arviz-variat")
 ```
 
 ## Hawks dataset
@@ -134,7 +134,11 @@ with model_hawks:
 
 ### Variable Importance
 
-It may be that some of the input variables are not informative for classifying by species, so in the interest of parsimony and in reducing the computational cost of model estimation, it is useful to quantify the importance of each variable in the dataset. PyMC-BART provides the function {func}`~pymc_bart.plot_variable_importance()`, which generates a plot that shows on his x-axis the number of covariables and on the y-axis the R$^2$ (the square of the Pearson correlation coefficient) between the predictions made for the full model (all variables included) and the restricted models, those with only a subset of the variables. The error bars represent the 94 % HDI from the posterior predictive distribution.
+It may be that some of the input variables are not informative for classifying by species, so in the interest of parsimony, and to reduce the computational cost of model estimation, it is useful to quantify the importance of each variable in the dataset. PyMC-BART provides the functions {func}`~pymc_bart.compute_variable_importance()` and {func}`~pymc_bart.plot_variable_importance()`, which work together to generate a plot showing on its x-axis the number of covariables and on its y-axis the R$^2$ between the predictions made for the full model (all variables included) and the restricted models (those with only a subset of the variables). The error bars represent the 94% HDI from the posterior predictive distribution.
+
+```{code-cell} ipython3
+idata.posterior_predictive
+```
 
 ```{code-cell} ipython3
 ---
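The new text above describes a two-step API. Here is a rough, self-contained sketch of how the two functions are typically combined (not part of this diff; the toy data, the BART variable name `mu`, and the sampler settings are illustrative assumptions, and the exact keyword arguments of the pymc-bart functions may differ between versions):

```python
import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb

rng = np.random.default_rng(0)
# toy data: only the first covariable carries signal
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y = rng.normal(loc=2 * X["a"].to_numpy(), scale=1.0)

with pm.Model() as toy_model:
    mu = pmb.BART("mu", X, y, m=20)  # BART prior over the mean
    pm.Normal("y_obs", mu=mu, sigma=1.0, observed=y)
    idata_toy = pm.sample(draws=300, tune=300, random_seed=0)

# step 1: R^2 between predictions of the full model and the restricted models
vi_results = pmb.compute_variable_importance(idata_toy, mu, X)
# step 2: number of covariables (x-axis) vs R^2 (y-axis), with HDI error bars
pmb.plot_variable_importance(vi_results);
```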
@@ -190,37 +194,19 @@ Hawks.groupby("Species").count()
 
 ### Predicted vs Observed
 
-Now we are going to compare the predicted data with the observed data to evaluate the fit of the model, we do this with the Arviz function `az.plot_ppc()`.
+We are going to evaluate the model by comparing the predicted data against the observed data. This can be tricky with categorical data (and with binary or ordinal data as well). For this reason we use PAV-adjusted calibration plots, as described by {cite:t}`dimitriadis2021` and, in a Bayesian context, by {cite:t}`säilynoja2025`.
 
-```{code-cell} ipython3
-ax = az.plot_ppc(idata, kind="kde", num_pp_samples=200, random_seed=123)
-# plot aesthetics
-ax.set_ylim(0, 0.7)
-ax.set_yticks([0, 0.2, 0.4, 0.6])
-ax.set_ylabel("Probability")
-ax.set_xticks([0.5, 1.5, 2.5])
-ax.set_xticklabels(["CH", "RT", "SS"])
-ax.set_xlabel("Species");
-```
+In these plots the observed events are replaced with conditional event probabilities (CEPs): the probability that a certain event occurs given that the classifier has assigned a specific predicted probability.
 
-We can see a good agreement between the observed data (black line) and those predicted by the model (blue and orange lines). As we mentioned before, the difference in the values between species is influenced by the amount of data for each one. Here there is no observed dispersion in the predicted data as we saw in the previous two plots.
-
-+++
-
-Below we see that the in-sample predictions provide very good agreement with the observations.
+We can easily generate this type of plot with the ArviZ function `az.plot_ppc_pava()`; because this function also works for binary and ordinal data, we need to specify `data_type="categorical"`. You can read more about these plots [here](https://arviz-devs.github.io/EABM/Chapters/Prior_posterior_predictive_checks.html#posterior-predictive-checks-for-discrete-data).
 
 ```{code-cell} ipython3
-np.mean((idata.posterior_predictive["y"] - y_0) == 0) * 100
+az.plot_ppc_pava(idata, data_type="categorical");
 ```
 
-```{code-cell} ipython3
-all = 0
-for i in range(3):
-    perct_per_class = np.mean(idata.posterior_predictive["y"].where(y_0 == i) == i) * 100
-    all += perct_per_class
-    print(perct_per_class)
-all
-```
+Each subplot shows one category vs the rest. The ideal calibration plot is the diagonal, represented by the gray dashed line, where the predicted probabilities equal the observed frequencies. If the line is above the diagonal the model is underestimating the probabilities, and if it is below the diagonal the model is overestimating them.
+
++++
 
 So far we have a very good result for classifying the species based on the 5 covariables. However, if we want to select a subset of covariables for future classifications, it is not clear which of them to choose. What does seem safe is that `Tail` could be eliminated. At the beginning, when we plotted the distribution of each covariable, we said that the most important variables for the classification might be `Wing`, `Weight`, and `Culmen`; nevertheless, after running the model we saw that `Hallux`, `Culmen`, and `Wing` proved to be the most important ones.
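The PAV adjustment mentioned in the new text is isotonic regression (the pool-adjacent-violators algorithm) of the outcomes on the predicted probabilities. A rough, self-contained illustration of the idea behind a single "category vs the rest" panel, using toy data and scikit-learn (which is not used in the notebook):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(123)
# toy predicted probabilities for "this category vs the rest" ...
p_pred = rng.uniform(0, 1, size=500)
# ... and outcomes drawn from a deliberately miscalibrated process
y_obs = rng.binomial(1, 0.8 * p_pred + 0.1)

# PAV: monotone (isotonic) fit of outcomes against predicted probabilities,
# giving the conditional event probabilities (CEPs) shown on the y-axis
cep = IsotonicRegression(y_min=0, y_max=1).fit_transform(p_pred, y_obs)

# A well-calibrated model gives cep close to p_pred (the diagonal); systematic
# gaps indicate under- or over-estimated probabilities.
```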

@@ -270,35 +256,17 @@ With all these together, we can select `Hallux`, `Culmen`, and, `Wing` as covari
 
 +++
 
-Concerning the comparison between observed and predicted data, we obtain the same good result with less uncertainty for the predicted values (blue lines). And the same counts for the in-sample comparison.
-
-```{code-cell} ipython3
-ax = az.plot_ppc(idata_t, kind="kde", num_pp_samples=100, random_seed=123)
-ax.set_ylim(0, 0.7)
-ax.set_yticks([0, 0.2, 0.4, 0.6])
-ax.set_ylabel("Probability")
-ax.set_xticks([0.5, 1.5, 2.5])
-ax.set_xticklabels(["CH", "RT", "SS"])
-ax.set_xlabel("Species");
-```
-
-```{code-cell} ipython3
-np.mean((idata_t.posterior_predictive["y"] - y_0) == 0) * 100
-```
+Concerning the comparison between observed and predicted data, we can see that the model shows better calibration: the lines are closer to the diagonal and the bands are generally narrower.
 
 ```{code-cell} ipython3
-all = 0
-for i in range(3):
-    perct_per_class = np.mean(idata_t.posterior_predictive["y"].where(y_0 == i) == i) * 100
-    all += perct_per_class
-    print(perct_per_class)
-all
+az.plot_ppc_pava(idata_t, data_type="categorical");
 ```
 
 ## Authors
 - Authored by [Pablo Garay](https://github.com/PabloGGaray) and [Osvaldo Martin](https://aloctavodia.github.io/) in May, 2024
 - Updated by Osvaldo Martin in Dec, 2024
 - Expanded by [Alex Andorra](https://github.com/AlexAndorra) in Feb, 2025
+- Updated by Osvaldo Martin in Dec, 2025
 
 +++

examples/bart/bart_heteroscedasticity.ipynb

Lines changed: 1458 additions & 118 deletions
Large diffs are not rendered by default.

examples/bart/bart_heteroscedasticity.myst.md

Lines changed: 18 additions & 33 deletions
@@ -6,7 +6,7 @@ jupytext:
     format_name: myst
     format_version: 0.13
 kernelspec:
-  display_name: Python 3 (ipykernel)
+  display_name: pymc
   language: python
   name: python3
 ---
@@ -27,7 +27,7 @@ In this notebook we show how to use BART to model heteroscedasticity as describe
 ```{code-cell} ipython3
 import os
 
-import arviz as az
+import arviz.preview as az
 import matplotlib.pyplot as plt
 import numpy as np
 import pandas as pd
@@ -37,7 +37,7 @@ import pymc_bart as pmb
 
 ```{code-cell} ipython3
 %config InlineBackend.figure_format = "retina"
-az.style.use("arviz-darkgrid")
+az.style.use("arviz-variat")
 plt.rcParams["figure.figsize"] = [10, 6]
 rng = np.random.default_rng(42)
 ```
@@ -103,40 +103,25 @@ with model_marketing_full:
 We can now visualize the posterior predictive distribution of the mean and the likelihood.
 
 ```{code-cell} ipython3
-posterior_mean = idata_marketing_full.posterior["w"].mean(dim=("chain", "draw"))[0]
-
-w_hdi = az.hdi(ary=idata_marketing_full, group="posterior", var_names=["w"], hdi_prob=0.5)
-
-pps = az.extract(
-    posterior_predictive_marketing_full, group="posterior_predictive", var_names=["y"]
-).T
+posterior_predictive_marketing_full
 ```
 
 ```{code-cell} ipython3
-idx = np.argsort(X[:, 0])
-
-
-fig, ax = plt.subplots()
-az.plot_hdi(
-    x=X[:, 0],
-    y=pps,
-    ax=ax,
-    hdi_prob=0.90,
-    fill_kwargs={"alpha": 0.3, "label": r"Observations $90\%$ HDI"},
-)
-az.plot_hdi(
-    x=X[:, 0],
-    hdi_data=np.exp(w_hdi["w"].sel(w_dim_0=0)),
-    ax=ax,
-    fill_kwargs={"alpha": 0.6, "label": r"Mean $50\%$ HDI"},
+dt_plot = az.from_dict(
+    {
+        "posterior_predictive": {
+            "y": posterior_predictive_marketing_full.posterior_predictive["y"]
+        },
+        "observed_data": {"y": df["sales"].values},
+        "constant_data": {"budget": X[:, 0]},
+    },
+    dims={
+        "y": ["budget_dim"],
+        "budget": ["budget_dim"],
+    },
 )
-ax.plot(df["youtube"], df["sales"], "o", c="C0", label="Raw Data")
-ax.legend(loc="upper left")
-ax.set(
-    title="Sales as a function of Youtube budget - Posterior Predictive",
-    xlabel="budget",
-    ylabel="sales",
-);
+
+az.plot_lm(dt_plot, x="budget", y="y");
 ```
 
 The fit looks good! In fact, we see that the mean and variance increase as a function of the budget.
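As a side note on the pattern used in this hunk: `az.plot_lm` needs the posterior predictive draws, the observations, and the predictor packaged into separate groups, which is what `az.from_dict` does above. Below is a toy sketch of the same pattern with fake data (the shapes and names are assumptions, in particular that `from_dict` accepts (chain, draw, observation)-shaped arrays, as in older ArviZ):

```python
import numpy as np
import arviz.preview as az

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y_obs = rng.normal(loc=2 * x, scale=0.1)
# fake posterior predictive draws: (chains, draws, observations)
y_pp = rng.normal(loc=2 * x, scale=0.1, size=(4, 250, 50))

dt = az.from_dict(
    {
        "posterior_predictive": {"y": y_pp},
        "observed_data": {"y": y_obs},
        "constant_data": {"x": x},
    },
    dims={"y": ["obs_dim"], "x": ["obs_dim"]},
)
az.plot_lm(dt, x="x", y="y");
```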

examples/bart/bart_introduction.ipynb

Lines changed: 125 additions & 502 deletions
Large diffs are not rendered by default.
