common-ml provides a Python library for Machine Learning.
If you find a bug or have a question, please file an issue.
$ pip install common-ml
CustomDictVectorizer transforms nested dictionaries, such as JSON documents, into vectors. Unlike DictVectorizer, each specified property is transformed with its own vectorizer or function, and the results are then combined.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

from commonml.text import CustomDictVectorizer

# `analyzer` is any tokenizer callable, e.g. one built with es.build_analyzer
vect = CustomDictVectorizer(vect_rules=[
    {'name': 'title',
     'vectorizer': CountVectorizer(tokenizer=analyzer,
                                   max_df=0.8,
                                   min_df=10,
                                   dtype=np.float32)},
    {'name': 'description',
     'vectorizer': CountVectorizer(tokenizer=analyzer,
                                   max_df=0.8,
                                   min_df=10,
                                   dtype=np.float32)}
])

X = vect.fit_transform([
    {'title': 'Test 1', 'description': 'Aaa'},
    {'title': 'Test 2', 'description': 'Bbb'}
])
See notebook/text/custom_dict_vectorizer.ipynb.
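Conceptually, this is equivalent to vectorizing each field separately and stacking the resulting matrices side by side. A minimal sketch using only scikit-learn (not common-ml's implementation, and with default tokenization instead of a custom `analyzer`):

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    {'title': 'Test 1', 'description': 'Aaa'},
    {'title': 'Test 2', 'description': 'Bbb'},
]

# One vectorizer per field; each produces a (n_docs, vocab_size) matrix.
matrices = []
for field in ('title', 'description'):
    texts = [d[field] for d in docs]
    vec = CountVectorizer(dtype=np.float32)
    matrices.append(vec.fit_transform(texts))

# Combine the per-field matrices into one feature matrix.
X = hstack(matrices)
print(X.shape)
```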
ElasticsearchAnalyzer and ElasticsearchTextAnalyzer analyze text with Elasticsearch's analysis feature, so any text analyzer available in Elasticsearch can be used from Python.
First of all, you need to set up Elasticsearch:
$ wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/zip/elasticsearch/2.3.1/elasticsearch-2.3.1.zip
$ unzip elasticsearch-2.3.1.zip
$ cd elasticsearch-2.3.1
$ echo 'cluster.name: es-ml' >> config/elasticsearch.yml
$ echo 'network.host: "0"' >> config/elasticsearch.yml
$ ./bin/plugin install org.codelibs/elasticsearch-analyze-api/2.3.0
install the analysis plugins you need:
$ ./bin/plugin install analysis-kuromoji
$ ./bin/plugin install analysis-icu
$ ./bin/plugin install org.codelibs/elasticsearch-analysis-synonym/2.3.0 -b
$ ./bin/plugin install org.codelibs/elasticsearch-analysis-ja/2.3.0 -b
$ ./bin/plugin install org.codelibs/elasticsearch-analysis-kuromoji-neologd/2.3.0 -b
and then start Elasticsearch:
$ ./bin/elasticsearch &
To analyze texts, create an Elasticsearch index with your analyzers:
$ curl -XPUT localhost:9200/.analyzer -d '
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_neologd_tokenizer": {
            "discard_punctuation": "false",
            "type": "kuromoji_neologd_tokenizer",
            "mode": "normal"
          }
        },
        "analyzer": {
          "kuromoji_neologd_analyzer": {
            "tokenizer": "kuromoji_neologd_tokenizer",
            "type": "custom"
          }
        }
      },
      "number_of_replicas": "0",
      "number_of_shards": "10",
      "refresh_interval": "60s"
    }
  }
}'
To check that the _analyze_api endpoint works, send the following request:
$ curl -XPOST "localhost:9200/.analyzer/_analyze_api?pretty&analyzer=kuromoji_neologd_analyzer&part_of_speech=true" -d'
{
  "data": {
    "text": "今日の天気は晴れです。"
  }
}'
If the above request succeeds, you can analyze texts with ElasticsearchTextAnalyzer in Python:
from commonml import es

analyzer_url = 'es://localhost:9200/.analyzer/kuromoji_neologd_analyzer'
es_analyzer = es.build_analyzer(analyzer_url)
for term in es_analyzer('今日の天気は晴れです。'):
    print(term)
See notebook/elasticsearch/analyzer.ipynb.
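CountVectorizer accepts any callable as its tokenizer, which is how an analyzer built by es.build_analyzer plugs into CustomDictVectorizer above. A minimal sketch with a plain whitespace tokenizer standing in for es_analyzer, so it runs without Elasticsearch:

```python
from sklearn.feature_extraction.text import CountVectorizer

def my_tokenizer(text):
    # Stand-in for es.build_analyzer(...): any callable that
    # returns a list of tokens works as CountVectorizer's tokenizer.
    return text.lower().split()

vect = CountVectorizer(tokenizer=my_tokenizer)
X = vect.fit_transform(['Sunny day today', 'Rainy day tomorrow'])
print(sorted(vect.vocabulary_))
```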
ElasticsearchReader runs an Elasticsearch query and returns a list of dictionaries (JSON documents).
from commonml import es
docs = es.reader(hosts=['localhost:9200'],
                 index='test_index',
                 source={"query": {"match_all": {}}})
# docs is a list of dicts (JSON), one per matching document
See notebook/elasticsearch/reader.ipynb.
ChainerEstimator provides scikit-learn's fit/predict interface for Chainer models, which keeps your code simple. For example, Chainer's MNIST sample can be rewritten as below:
from commonml.sklearn import ChainerEstimator, SoftmaxCrossEntropyClassifier
...
model = net.MnistMLP(784, n_units, 10)
if gpu >= 0:
    cuda.get_device(gpu).use()
    model.to_gpu()
xp = np if gpu < 0 else cuda.cupy

clf = ChainerEstimator(model=SoftmaxCrossEntropyClassifier(model),
                       optimizer=optimizers.Adam(),
                       batch_size=batchsize,
                       gpu=gpu,
                       n_epoch=n_epoch)
clf.fit(x_train, y_train)
preds = clf.predict(x_test).argmax(axis=1)  # [7, 2, 1, ..., 4, 5, 6]
See notebook/sklearn.
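What ChainerEstimator supplies is scikit-learn's estimator protocol around a trainable model. A toy sketch of that pattern, with a hypothetical nearest-mean "model" standing in for the neural network (not common-ml's implementation):

```python
import numpy as np

class MiniEstimator:
    """Minimal fit/predict wrapper in the scikit-learn style.

    Stands in for ChainerEstimator: fit() learns parameters from
    (X, y); predict() returns per-class scores, so labels come from
    argmax(axis=1), mirroring clf.predict(x_test).argmax(axis=1).
    """
    def __init__(self, n_classes):
        self.n_classes = n_classes

    def fit(self, X, y):
        # "Training": one prototype (mean vector) per class.
        self.means_ = np.stack([X[y == c].mean(axis=0)
                                for c in range(self.n_classes)])
        return self

    def predict(self, X):
        # Negative squared distance to each prototype: higher = closer.
        d = ((X[:, None, :] - self.means_[None]) ** 2).sum(axis=2)
        return -d

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
y = np.array([0, 0, 1, 1])
clf = MiniEstimator(n_classes=2).fit(X, y)
preds = clf.predict(X).argmax(axis=1)
print(preds)  # [0 0 1 1]
```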
AutoEncoder is implemented as a scikit-learn vectorizer (transformer), so you can put it into a Pipeline:
from sklearn.pipeline import Pipeline
from commonml.sklearn import (AutoEncoder, ChainerEstimator,
                              MeanSquaredErrorRegressor,
                              SoftmaxCrossEntropyClassifier)

# 784 -> 1000 -> 1000 -> 1000 -> 1000 -> 10
clf = Pipeline([
    ('ae1', AutoEncoder(784, 1000, MeanSquaredErrorRegressor, dropout_ratio=0,
                        batch_size=batchsize, n_epoch=n_epoch, gpu=gpu)),
    ('ae2', AutoEncoder(1000, 1000, MeanSquaredErrorRegressor, dropout_ratio=0,
                        batch_size=batchsize, n_epoch=n_epoch, gpu=gpu)),
    ('nn', ChainerEstimator(model=SoftmaxCrossEntropyClassifier(net.MnistMLP(1000, n_units, 10)),
                            optimizer=optimizers.Adam(),
                            batch_size=batchsize,
                            gpu=gpu,
                            n_epoch=n_epoch))
])

clf.fit(x_train, y_train)
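The pipeline above is scikit-learn's standard transformers-then-estimator composition. The same structure with stock scikit-learn parts, with PCA standing in for the AutoEncoder steps and logistic regression for the network (an illustration, not common-ml code):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

# Each transform step re-encodes its input before the final classifier,
# just as each AutoEncoder step does in the pipeline above.
clf = Pipeline([
    ('ae1', PCA(n_components=32)),
    ('ae2', PCA(n_components=16)),
    ('nn', LogisticRegression(max_iter=1000)),
])
clf.fit(X, y)
print(clf.score(X, y))
```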