✈ Pilota: SCUD generator

Name	Input Utterance	Output SCUD
Agent	今回の旅行はどういったご旅行でしょうか?	-
User	家族で一泊して、USJに行こうと思ってます。	今回の旅行は家族で一泊して、USJに行く。
Agent	なるほど、ホテルはもうお決まりですか?	-
User	まだです。	ホテルはまだ決まっていない。
	ただ、近くが良いなとは思ってて。	ホテルはUSJの近くが良い。
	景色が良くて食事も美味しいところが良いです	景色が良いホテルが良い。食事が美味しいホテルが良い。

Quick start

Install

pip install -U 'pilota[ja-line] @ git+https://github.com/megagonlabs/pilota'

If you need a torch build that matches your GPU, install the appropriate package explicitly, as in the following example. See https://pytorch.org/ for details.

pip install -U torch --extra-index-url https://download.pytorch.org/whl/cu118

Run

Prepare inputs (Input Format and plain2request)

Command

echo -e 'ご要望をお知らせください\tはい。部屋から富士山が見えて、夜景を見ながら食事のできるホテルがいいな。\nこんにちは\tこんにちは' | python -m pilota.convert.plain2request | tee input.jsonl

Output

{"context": [{"name": "agent", "text": "ご要望をお知らせください"}], "utterance": "はい。部屋から富士山が見えて、夜景を見ながら食事のできるホテルがいいな。", "sentences": null, "meta": {}}
{"context": [{"name": "agent", "text": "こんにちは"}], "utterance": "こんにちは", "sentences": null, "meta": {}}

Feed it to Pilota

Command

pilota -m megagonlabs/pilota_dialog --batch_size 1 --outlen 60 --nbest 1 --beam 5 < input.jsonl

Output

[{"scuds_nbest": [[]], "original_ranks": [0], "scores": [0.9911208689212798], "scores_detail": [{"OK": 0.9704028964042664, "incorrect_none": 0.04205145686864853, "lack": 0.0007874675211496651, "limited": 0.0003119863977190107, "non_fluent": 0.0002362923405598849, "untruth": 0.0013080810895189643}], "sentence": "はい。"}, {"scuds_nbest": [["部屋から富士山が見えるホテルが良い。", "夜景を見ながら食事のできるホテルが良い。"]], "original_ranks": [0], "scores": [0.9952289938926696], "scores_detail": [{"OK": 0.9840966463088989, "incorrect_none": 0.010280555114150047, "lack": 0.0032871251460164785, "limited": 0.00041511686868034303, "non_fluent": 0.0002954243100248277, "untruth": 0.003289491171017289}], "sentence": "部屋から富士山が見えて、夜景を見ながら食事のできるホテルがいいな。"}]
[{"scuds_nbest": [[]], "original_ranks": [0], "scores": [0.9831213414669036], "scores_detail": [{"OK": 0.9704028964042664, "incorrect_none": 0.04205145686864853, "lack": 0.0007874675211496651, "limited": 0.0003119863977190107, "non_fluent": 0.0002362923405598849, "untruth": 0.0013080810895189643}], "sentence": "こんにちは"}]

The -m option also accepts the path to a local model.

pilota -m /path/to/model --batch_size 1 --ol 60 < input.jsonl

Run pilota -h to see the other options.

Models

Models are available on https://huggingface.co/megagonlabs/.

Model	Input Context	Input Utterance	Output
megagonlabs/pilota_dialog	Dialog between a user looking for accommodation and an agent	User's last utterance	SCUDs
megagonlabs/pilota_scud2query	(Not required)	Users' SCUDs	Queries for accommodation search
megagonlabs/pilota_hotel_review	(Not required)	Text of an accommodation review	SCUDs

A model is downloaded only once and is reused on later runs. If you cancelled the download partway through the first start-up, or if you need to update the model to the latest version, run pilota with --check_model_update.

You can check the local paths of downloaded models with the following command.

huggingface-cli scan-cache | grep ^megagonlabs

Documents

References

Yuta Hayashibe. Self-Contained Utterance Description Corpus for Japanese Dialog. Proc. of LREC, pp.1249-1255. (LREC 2022) [PDF]
林部祐太．要約付き宿検索対話コーパス．言語処理学会第27回年次大会論文集，pp.340-344. 2021. (NLP 2021) [PDF]
林部祐太．発話とレビューに対する解釈文生成とトピック分類．言語処理学会第29回年次大会論文集，pp.2013-2017. 2023. (NLP 2023) [PDF]

License

Apache License 2.0