Custom data format: how do I do the next step?
lpfy opened this issue · 25 comments
For example, if I prepare my data as a txt file, e.g. mydata.txt, like this:
lovely peaceful
place
Positive
lovely peaceful place. went on school excursion with
kids
Positive
lovely peaceful place. went on school excursion with kids and had lots of fun despite rain.
Staff
Positive
Very disappointing. Extremely costly given that we were on a time restriction due to covid but at the same
admission price
Negative
Also there was no
Koala holding
Negative
Also there was no Koala holding due to covid restrictions which is understandable but then I think the
admission price
Negative
On a positive note, there is ample
parking
Positive
On a positive note, there is ample parking and the
staff
Positive
Based on the readme, does that mean I should run:

```
from pyabsa.functional import ABSADatasetList
from pyabsa.utils.file_utils import generate_inference_set_for_apc

generate_inference_set_for_apc(dataset_path=r"c:\mydata.txt")
```
If I send my data file to you, would you be able to help us train it? These are all public reviews crawled from Google or TripAdvisor for a wildlife park in Adelaide. For the best results, roughly how many reviews should we prepare? 100? 200? At the moment we have crawled around 5000 reviews.
Maybe I can help you, provided that you have both a train and a test set. If possible, I would also ask you to share your dataset in PyABSA.
The data should be formatted with the aspect term replaced by a `$T$` placeholder, each sample taking three lines (sentence, aspect, polarity):

lovely peaceful $T$ . went on school excursion with kids and had lots of fun despite rain. staff is great, very knowledgeable and friendly.
place
Positive
lovely peaceful place. went on school excursion with $T$ and had lots of fun despite rain. staff is great, very knowledgeable and friendly.
kids
Positive
lovely peaceful place. went on school excursion with kids and had lots of fun despite rain. $T$ is great, very knowledgeable and friendly.
Staff
Positive
Very disappointing. Extremely costly given that we were on a time restriction due to covid but at the same $T$ as pre restrictions.
admission price
Negative
Also there was no $T$ due to covid restrictions which is understandable but then I think the admission price should be reduced.
Koala holding
Negative
Also there was no Koala holding due to covid restrictions which is understandable but then I think the $T$ should be reduced.
admission price
Negative
On a positive note, there is ample $T$ and the staff at front desk were very friendly.
parking
Positive
On a positive note, there is ample parking and the $T$ at front desk were very friendly.
staff
Positive
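Since the reviews are already crawled, the three-line samples can be produced mechanically. A minimal sketch, assuming you hold `(sentence, aspect, polarity)` records; the helper name `to_apc_lines` and the file name `mydata.txt` are illustrative, not part of PyABSA:

```python
def to_apc_lines(sentence, aspect, polarity):
    """Replace the first occurrence of the aspect term with the $T$
    placeholder and return the three lines of one APC sample."""
    if aspect not in sentence:
        raise ValueError(f"aspect {aspect!r} not found in sentence")
    return [sentence.replace(aspect, "$T$", 1), aspect, polarity]

# Illustrative records: (full sentence, aspect term, polarity)
records = [
    ("On a positive note, there is ample parking and the staff at front desk were very friendly.",
     "parking", "Positive"),
    ("On a positive note, there is ample parking and the staff at front desk were very friendly.",
     "staff", "Positive"),
]

with open("mydata.txt", "w", encoding="utf-8") as f:
    for sentence, aspect, polarity in records:
        f.write("\n".join(to_apc_lines(sentence, aspect, polarity)) + "\n")
```

Note that the same sentence appears once per aspect, each time with a different term masked, which matches the examples above.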
The inference dataset is only used for sentiment inference; the training process itself reports accuracy and F1 score on the test dataset.
If you have an email address, I can also send you an email. I originally wanted to email you, but I wasn't sure which address to use; one seems to be a University of Exeter address and the other a South China Normal University one. Sorry, I don't have Chinese input on my office PC. I would think the dataset is OK to share, as it is public data; nobody owns it. I don't think Google or TripAdvisor have a policy banning the use of their data.
Either email address is fine.
Great, thanks for your help. We will prepare the data files (both train and test) and email them to you in the next couple of days.
Best regards,
Sasha
A quick question: do we need to do lemmatization when we prepare the training set, or can we just keep the original format? e.g. change had/has -> have, animals -> animal.
No need for this, please keep the original format
@yangheng95 Hello, I have formatted my own data according to the format you provided. The formatted data looks like this:
Went here on Sunday for first time. Plenty of parking which was nice. The jalapeno cornbread was very good! Mac n chz also good. Ordered the moist brisket because I thought it would be better since the lean is generally dry at times..the brisket was super moist, fatty, and very greasy..had alot of oil on the plate..most of what I got was fat...next time ill try the lean..the sausage was so so..im still a bigger fan of pokejoes.. There was a guy who looked homeless playing guitar and singing very loud inside..went to sit outside in the heat bc this guy was so loud and annoying..he seemed kinda drunk too..not sure why they let him do this bc ruined the
atmosphere
Negative
Went here on Sunday for first time. Plenty of parking which was nice. The jalapeno cornbread was very good! Mac n chz also good. Ordered the moist brisket because I thought it would be better since the lean is generally dry at times..the brisket was super moist, fatty, and very greasy..had alot of oil on the plate..most of what I got was fat...next time ill try the lean..the sausage was so so..im still a bigger fan of pokejoes.. There was a guy who looked homeless playing guitar and singing very loud inside..went to sit outside in the heat bc this guy was so loud and annoying..he seemed kinda drunk too..not sure why they let him do this bc ruined the atmosphere for us.. I'll give it another shot but it is
pricey
Negative
Still an exceptional restaurant with several new menu items including a gimmicky (and tasty) sorta-Asian fried chicken and waffles. I'm a spicy food fan with a fire resistant pallete, but the hot sauce with the otherwise-excellent chicken wings was WAY over applied and actually ruined my whole dinner.
Service
Negative
Still an exceptional restaurant with several new
menu
Positive
And following your instructions, I ran the inference-set conversion:
```
from pyabsa.functional import ABSADatasetList
from pyabsa.utils.file_utils import generate_inference_set_for_apc

generate_inference_set_for_apc(dataset_path="C:/Users/Li Wei/integrated_datasets/apc_datasets/SemEval/yelp restaurant")
```
But after running the conversion, no inference dataset was generated in that folder. Am I missing a step?
Does your dataset name contain the word "train" or "test"?
Please show me the console output, thanks!
Is your dataset located under the os.getcwd() directory?
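The questions above amount to checking the dataset's location and file naming. A quick sketch for checking this yourself (the directory path is illustrative, and the exact substrings PyABSA searches for may differ by version):

```python
import os

def check_dataset_dir(dataset_dir):
    """Return, for each file in the dataset directory, whether its name
    looks like a train or a test split (the loader keys on substrings)."""
    if not os.path.isdir(dataset_dir):
        return {}
    return {name: ("train" in name.lower(), "test" in name.lower())
            for name in os.listdir(dataset_dir)}

# Illustrative usage: the dataset is expected under os.getcwd()
print(check_dataset_dir(os.path.join(os.getcwd(), "yelp_restaurant")))
```

If the directory does not exist or no file name flags as train/test, that would explain the conversion silently producing nothing.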
Please install the 1.2.13a0 test version and try again:
pip install pyabsa==1.2.13a0
OK, I will check both of the issues you mentioned and get back to you. Thanks for your reply!
After installing the new version, the problem was solved. Thank you so much for your help!
One more small question: I saw that you recently uploaded DPT, and I would also like to upload my own csv file to convert its format. But clicking "load" on the web page does not open an upload window. Is this feature still under development?
That feature was developed by another contributor; you can try cloning the repo locally and opening it in Chrome.
OK, thanks for the reply!
DPT definitely works in both Chrome and Edge; after downloading it locally, just double-click the HTML file. IE is not supported because it uses some JS ES6 syntax. Loading saved work is not supported yet: the save button does save a JSON file of unfinished work, but I haven't written the code to assign the JSON back to the Vue data. For now I assign it manually in the console: ABSADPT.PreABSAData = JSON.parse(jsonstring)
So try to process a small batch of data each time; I do 50 items per batch and save when done :P
OK, I saw your advice just as I was labelling the data; very timely, thank you very much!
@yangheng95 Hello! After trying the solution you gave last time, another problem appeared: aspect_extractor.extract_aspect sometimes produces results normally, and sometimes raises "RuntimeError: ['C:/Users/Li Wei/integrated_datasets/apc_datasets/SemEval/yelp restaurant'] is not an integrated dataset or not downloaded automatically, or it is not a path containing datasets!"
Could this be something I am doing wrong?
Here is some of the traceback and my folder structure:
@yangheng95 The problem was solved after renaming the folder. It turns out the folder name had better not contain the word "inference".
Just a thought, and it may be wrong: I suspect the problem lies in the space. If your folder were renamed restaurant_inference, it would probably also run successfully. As programmers, we avoid spaces in names when coding, whether variable names, data column names, or folder names.
This comes from an early design mistake: during training and testing, files whose names contain "infer" are filtered out. In the 1.2.13a1 test version I changed the filter string to ".inference", which should avoid this kind of problem in most cases. Storing the training/test sets and the inference set in separate directories should completely avoid relying on filename substring matching to exclude files. If you have a better idea, feel free to modify the code and open a PR.
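The change described above can be sketched as follows; the helper names are hypothetical, and the real PyABSA filtering code may look different:

```python
def is_inference_file_old(filename):
    """Pre-1.2.13a1 behaviour: any name containing 'infer' was
    excluded from train/test file discovery."""
    return "infer" in filename

def is_inference_file_new(filename):
    """1.2.13a1 behaviour: only names containing '.inference' are excluded."""
    return ".inference" in filename

# A dataset folder that merely mentions 'inference' is no longer
# wrongly excluded, while real inference files still are:
print(is_inference_file_old("yelp restaurant inference"))  # True  (wrongly excluded)
print(is_inference_file_new("yelp restaurant inference"))  # False (kept)
print(is_inference_file_new("restaurant.test.inference"))  # True  (still excluded)
```

This also shows why the earlier rename fixed the RuntimeError: the folder name matched the old "infer" substring and the whole dataset was filtered away.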