NOTE: For the codebase that will be used in our upcoming Learning Analytics and Knowledge Pre-Conference Workshop, please see this repo: https://github.com/CAHLR/pyBKT-examples
Python implementation of the Bayesian Knowledge Tracing algorithm and variants, estimating student cognitive mastery from problem solving sequences.
pip install pyBKT
Quick-start example in Colab notebook
Based on the work of Zachary A. Pardos (zp@berkeley.edu) and Matthew J. Johnson (mattjj@csail.mit.edu) @ https://github.com/CAHLR/xBKT. Python boost adaptation by Cristian Garay (c.garay@berkeley.edu). All-platform python adaptation and optimizations by Anirudhan Badrinath (abadrinath@berkeley.edu). For formulas and technical implementation details, please refer to section 4.3 of Xu, Johnson, & Pardos (2015) ICML workshop paper.
pyBKT examples can be found in pyBKT-examples repo.
Python >= 3.5
Supported OS: All platforms! (Yes, Windows too)
Libboost >= 1.58 (optional - will enable fast inference if installed)
pyBKT can be used to define and fit many BKT variants, including these from the literature:
- Individual student priors, learn rate, guess, and slip [1,2]
- Individual item guess and slip [3,4,5]
- Individual item or resource learn rate [4,5]
-
Pardos, Z. A., Heffernan, N. T. (2010) Modeling Individualization in a Bayesian Networks Implementation of Knowledge Tracing. In P. De Bra, A. Kobsa, D. Chin (Eds.) Proceedings of the 18th International Conference on User Modeling, Adaptation and Personalization (UMAP). Big Island of Hawaii. Pages. Springer. Pages 255-266. [doi]
-
Pardos, Z. A., Heffernan, N. T. (2010) Using HMMs and bagged decision trees to leverage rich features of user and skill from an intelligent tutoring system dataset. In J. Stamper & A. Niculescu-Mizil (Eds.) Proceedings of the KDD Cup Workshop at the 16th ACM Conference on Knowledge Discovery and Data Mining (SIGKDD). Washington, D.C. ACM. Pages 24-35. [kdd cup]
-
Pardos, Z. & Heffernan, N. (2011) KT-IDEM: Introducing Item Difficulty to the Knowledge Tracing Model. In Konstant et al. (eds.) Proceedings of the 20th International Conference on User Modeling, Adaptation and Personalization (UMAP). Girona, Spain. Springer. Pages 243-254. [doi]
-
Pardos, Z. A., Bergner, Y., Seaton, D., Pritchard, D.E. (2013) Adapting Bayesian Knowledge Tracing to a Massive Open Online College Course in edX. In S.K. D’Mello, R.A. Calvo, & A. Olney (Eds.) Proceedings of the 6th International Conference on Educational Data Mining (EDM). Memphis, TN. Pages 137-144. [edm]
-
Xu, Y., Johnson, M. J., Pardos, Z. A. (2015) Scaling cognitive modeling to massive open environments. In Proceedings of the Workshop on Machine Learning for Education at the 32nd International Conference on Machine Learning (ICML). Lille, France. [icml ml4ed]
This is intended as a quick overview of steps to install and setup and to run pyBKT locally. While both a pure Python port and a Cython version of pyBKT are offered, the former does not fit models or scale as quickly or efficiently as the latter (due to nested for loops needed for DP). Here are a few speed comparisons - both on the same machine - that may be useful in deciding which version is more appropriate given the usage (e.g. model fitting is far more demanding than prediction):
Test Description | pyBKT (Python) | pyBKT (Cython) |
---|---|---|
synthetic data, model fit (500 students) | ~1m55s | ~1.5s |
synthetic data, model fit (5000 students) | ~1h30m | ~45s |
cross validated cognitive tutor data | ~4m10s | ~3s |
synthetic data, predict onestep (500 students) | ~2s | ~0.8s |
synthetic data, predict onestep (5000 students) | ~2m15s | ~35s |
If you have Boost already installed, pip will install pyBKT with fast C++ inferencing. Boost is already installed on most recent Ubuntu distributions. If it is not installed on your machine, type sudo apt install libboost-all-dev
if using Debian based distributions. Otherwise, whichever package manager is appropriately suited to your distribution (dnf
, pacman
, etc.). Without Boost, pip will install pyBKT without C++ speed optimizations.
You can check if libboost has been installed properly with ldconfig -p | grep libboost_python
, which should yield an output on Linux machines. Note that the version on the dynamic library should match the Python installation version.
Python 3.8 is necessary for OS X. If homebrew is installed, run the following commands to download the necessary dependencies:
brew install boost
brew install boost-python3
brew install libomp
You can simply run:
pip install pyBKT
Alternatively, if pip
poses some problems, you can clone the repository as such and then run the setup.py
script manually.
git clone https://github.com/CAHLR/pyBKT.git
cd pyBKT
python3 setup.py install
pyBKT models student mastery of a skills as they progress through series of learning resources and checks for understanding. Mastery is modelled as a latent variable has two states - "knowing" and "not knowing". At each checkpoint, students may be given a learning resource (i.e. watch a video) and/or question(s) to check for understanding. The model finds the probability of learning, forgetting, slipping and guessing that maximizes the likelihood of observed student responses to questions.
To run the pyBKT model, define the following variables:
num_subparts
: The number of unique questions used to check understanding. Each subpart has a unique set of emission probabilities.num_resources
: The number of unique learning resources available to students.num_fit_initialization
: The number of iterations in the EM step.
Next, create an input object Data
, containing the following attributes:
-
data
: a matrix containing sequential checkpoints for all students, with their responses. Each row represents a different subpart, and each column a checkpoint for a student. There are three potential values: {0 = no response or no question asked, 1 = wrong response, 2 = correct response}. If at a checkpoint, a resource was given but no question asked, the associated column would have0
values in all rows. For example, to set up data containing 5 subparts given to two students over 2-3 checkpoints, the matrix would look as follows:| 0 0 0 0 2 | | 0 1 0 0 0 | | 0 0 0 0 0 | | 0 0 0 0 0 | | 0 0 2 0 0 |
In the above example, the first student starts out with just a learning resource, and no checks for understanding. In subsequent checkpoints, this student also responds to subpart 2 and 5, and gets the first wrong and the second correct.
-
starts
: defines each student's starting column on thedata
matrix. For the above matrix,starts
would be defined as:| 1 4 |
-
lengths
: defines the number of check point for each student. For the above matrix,lengths
would be defined as:| 3 2 |
-
resources
: defines the sequential id of the resources at each checkpoint. Each position in the vector corresponds to the column in thedata
matrix. For the above matrix, the learningresources
at each checkpoint would be structured as:| 1 2 1 1 3 |
-
stateseqs
: this attribute is the true knowledge state for above data and should be left undefined before running thepyBKT
model.
The output of the model can will be stored in a fitmodel
object, containing the following probabilities as attributes:
As
: the transition probability between the "knowing" and "not knowing" state. Includes both thelearns
andforgets
probabilities, and their inverse.As
creates a separate transition probability for each resource.learns
: the probability of transitioning to the "knowing" state given "not known".forgets
: the probability of transitioning to the "not knowing" state given "known".prior
: the prior probability of "knowing".
The fitmodel
also includes the following emission probabilities:
guesses
: the probability of guessing correctly, given "not knowing" state.slips
: the probability of picking incorrect answer, given "knowing" state.
A basic BKT parameter fitting example using synthesized data is can found in this (Google Colab link) notebook, and in the code below:
import sys
import numpy as np
from pyBKT.generate import synthetic_data, random_model_uni
from pyBKT.fit import EM_fit
from copy import deepcopy
#parameters classes
num_gs = 1 #number of guess/slip classes
num_learns = 1 #number of learning rates
num_fit_initializations = 20
#true params used for synthetic data generation
p_T = 0.30
p_F = 0.00
p_G = 0.10
p_S = 0.03
p_L0 = 0.10
#generate synthetic model and data.
truemodel = {}
truemodel["As"] = np.zeros((num_learns,2,2), dtype=np.float_)
for i in range(num_learns):
truemodel["As"][i] = np.transpose([[1-p_T, p_T], [p_F, 1-p_F]])
truemodel["learns"] = truemodel["As"][:,1, 0,]
truemodel["forgets"] = truemodel["As"][:,0, 1]
truemodel["pi_0"] = np.array([[1-p_L0], [p_L0]])
truemodel["prior"] = truemodel["pi_0"][1][0]
truemodel["guesses"] = np.full(num_gs, p_G, dtype=np.float_)
truemodel["slips"] = np.full(num_gs, p_S, dtype=np.float_)
#can optionally set learn class sequence - set randomly by synthetic_data if not included
#truemodel["resources"] = np.random.randint(1, high = num_resources, size = sum(observation_sequence_lengths))
#data!
print("generating data...")
observation_sequence_lengths = np.full(500, 100, dtype=np.int) #specifies 500 students with 100 observations for synthetic data
data = synthetic_data.synthetic_data(truemodel, observation_sequence_lengths)
#fit models, starting with random initializations
print('fitting! each dot is a new EM initialization')
num_fit_initializations = 5
best_likelihood = float("-inf")
for i in range(num_fit_initializations):
fitmodel = random_model_uni.random_model_uni(num_learns, num_gs) # include this line to randomly set initial param values
(fitmodel, log_likelihoods) = EM_fit.EM_fit(fitmodel, data)
if(log_likelihoods[-1] > best_likelihood):
best_likelihood = log_likelihoods[-1]
best_model = fitmodel
# compare the fit model to the true model
print('')
print('\ttruth\tlearned')
print('prior\t%.4f\t%.4f' % (truemodel['prior'], best_model["pi_0"][1][0]))
for r in range(num_learns):
print('learn%d\t%.4f\t%.4f' % (r+1, truemodel['As'][r, 1, 0].squeeze(), best_model['As'][r, 1, 0].squeeze()))
for r in range(num_learns):
print('forget%d\t%.4f\t%.4f' % (r+1, truemodel['As'][r, 0, 1].squeeze(), best_model['As'][r, 0, 1].squeeze()))
for s in range(num_gs):
print('guess%d\t%.4f\t%.4f' % (s+1, truemodel['guesses'][s], best_model['guesses'][s]))
for s in range(num_gs):
print('slip%d\t%.4f\t%.4f' % (s+1, truemodel['slips'][s], best_model['slips'][s]))
- Support for parameter tieing and fixing
- Boostless Cython