THIS README FILE IS A WORK IN PROGRESS! I suggest checking the PDF file (final_paper.pdf) and the bibliographic references for a better understanding of the code.

The main file is active.py, which contains a class that does all the active learning work. This is a work in progress, but for the purposes of the final project it has all implementations ready to go. Since providing these files wasn't required, I haven't added detailed explanations. However, looking at the paper's diagram, the main file, and the active class should give a good idea of what's happening in the background.

There are also implementations in this file that go beyond the scope of the project. For instance, there is a function called clustering, which is a preliminary test of using data structuring, bagging and a stochastic method to select new queries (please disregard it, though it is working).

[1] class active():

    def __init__(self, X=[], y=[], X_=[], y_=[], method='random', density_method='none', sparse=False):

X, y: train data and labels
X_, y_: test data and labels
method: type of active learning approach (see final_paper.pdf),
    e.g. 'random', 'least_confident', 'margin' or 'entropy'

density_method: 'none', 'euclidean' or 'cosine' (cosine similarity)

sparse: True or False, indicates whether the input data is sparse.
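
For illustration, here is a minimal sketch of constructing the class on synthetic data (the make_classification dataset and the train/test split are my own assumptions for the example, not part of the project):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from active import active

    # hypothetical toy data; any (X, y) pair works
    X_all, y_all = make_classification(n_samples=500, n_features=20, random_state=0)
    X, X_, y, y_ = train_test_split(X_all, y_all, test_size=0.3, random_state=0)

    # uncertainty sampling by smallest margin, no density weighting
    obj = active(X, y, X_, y_, method='margin', density_method='none', sparse=False)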

[2] relevant attributes:

self.clf = LogisticRegression(penalty='l2', C=1):
  the default classifier; it can be replaced by any classifier that implements the methods fit and predict_proba (see the sketch after this list).

self.sample_size=10
  Number of samples to select from the pool at each query

self.min_pool=20
  minimum number of datapoints in the pool (starting point)

self.set_pool()
  initialize the pool

self.update()
  update the pool

self.qbc_method
  criterion used by the committee: 'least_confident', 'margin' or 'entropy'

self.qbc_sampling
  fraction of the data sampled at each iteration (bootstrapping)

self.n_splits=3
  number of committee members formed using the bootstrapped data.
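
Because self.clf only needs fit and predict_proba, other probabilistic classifiers can be dropped in. A minimal sketch of configuring these attributes (RandomForestClassifier is my own choice for illustration, not something the project prescribes):

    from sklearn.ensemble import RandomForestClassifier

    obj = active(X, y, X_, y_, method='entropy')
    obj.clf = RandomForestClassifier(n_estimators=100)  # any fit/predict_proba classifier works
    obj.sample_size = 10   # points queried from the pool per iteration
    obj.min_pool = 20      # size of the initial labeled pool
    obj.set_pool()         # initialize the pool before fitting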

[3] Some possible scenarios (each entry is [method, density_method, qbc_method]):

    scenario_data = {'uncertainty': {0: ['least_confident', 'none', 'none'],
                                     1: ['margin', 'none', 'none'],
                                     2: ['entropy', 'none', 'none'],
                                     3: ['least_confident', 'euclidean', 'none'],
                                     4: ['margin', 'euclidean', 'none'],
                                     5: ['entropy', 'euclidean', 'none']},
                     'boosting': {0: ['boosting', 'euclidean', 'margin']},
                     'bagging': {0: ['bagging', 'euclidean', 'margin']}}
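
For completeness, a small sketch (my own illustration, not part of the project) of how one scenario entry can be unpacked into the case variable used in the example below:

    for scenario, cases in scenario_data.items():
        for i, case in cases.items():
            # case = [method, density_method, qbc_method]
            print(scenario, i, case)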

[4] Example:

    import numpy as np

    # n, case, min_pool, sample_size, iter_ and qbc_method come from the scenario setup above
    result = {}
    for j in np.arange(n):                        # n independent runs
        obj = active(X, y, X_, y_, method=case[0], density_method=case[1])
        obj.qbc_sampling = 0.4                    # fraction of data bootstrapped per iteration
        obj.n_splits = 1000                       # number of committee members
        obj.min_pool = min_pool
        obj.sample_size = sample_size
        obj.set_pool()                            # initialize the labeled pool
        obj.set_density()                         # precompute density weights
        obj.iter = iter_                          # number of query iterations
        obj.qbc_method = qbc_method
        obj.fit()                                 # run the active learning loop
        result[j*sample_size] = obj.accuracy_bin
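
To inspect the outcome, one could average the stored accuracies across runs and plot a learning curve. A sketch, under my own assumption that accuracy_bin is a sequence of per-iteration accuracies of equal length across runs:

    import numpy as np
    import matplotlib.pyplot as plt

    curves = np.array(list(result.values()))  # one row per run
    plt.plot(curves.mean(axis=0))             # mean accuracy per query iteration
    plt.xlabel('query iteration')
    plt.ylabel('accuracy')
    plt.show()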
