Dcard post data for building corpus

Dcard post data

This repo hosts post data retrieved from the Dcard API, collected for the purpose of building a small corpus. The posts come from the 100 most popular forums on Dcard, and each post is at least 100 characters long.

The posts were word-segmented and PoS-tagged using ckiplab/ckiptagger.
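As a rough illustration of that step, the sketch below segments and tags a list of post strings with ckiptagger's `WS` and `POS` classes. The function names `pair_tokens` and `tag_posts` and the `./data` model directory are assumptions made for this example, not taken from this repo, and the actual preprocessing script may differ.

```python
def pair_tokens(words, tags):
    """Zip parallel word and tag lists into (token, pos) pairs, post by post."""
    return [list(zip(w, t)) for w, t in zip(words, tags)]


def tag_posts(posts, model_dir="./data"):
    """Segment and PoS-tag a list of post strings (hypothetical helper).

    model_dir is assumed to hold the downloaded ckiptagger model files.
    """
    # Imported lazily because ckiptagger needs TensorFlow and the model files.
    from ckiptagger import WS, POS

    ws = WS(model_dir)    # word segmenter
    pos = POS(model_dir)  # part-of-speech tagger
    words = ws(posts)     # one list of words per post
    tags = pos(words)     # one list of tags per post
    return pair_tokens(words, tags)
```

Calling `tag_posts(["..."])` requires the ckiptagger model files to be present locally; `pair_tokens` can be used on its own with any pre-tagged data.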

Files

Concordancer

The quickest way to query KWIC concordances in this corpus is to run the concordancer with Docker.

Download image:

docker pull liao961120/dcard

Run server:

docker run -it -p 127.0.0.1:1420:80 liao961120/dcard

When you see Corpus Loaded printed on the command line, you can visit https://kwic.yongfu.name to use the app.

The source code of the concordancer can be found in liao961120/kwic and liao961120/kwic-backend. Read more about the concordancer in this post.
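For readers unfamiliar with KWIC (keyword in context), the idea can be sketched in a few lines of Python. This is a simplified illustration over a flat token list, not the implementation in liao961120/kwic-backend; the function name `kwic` and the `window` parameter are assumptions for this example.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every exact match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]       # up to `window` tokens before
            right = tokens[i + 1:i + 1 + window]      # up to `window` tokens after
            hits.append((left, tok, right))
    return hits


tokens = ["我", "真的", "覺得", "這樣", "比較", "好"]
for left, key, right in kwic(tokens, "覺得", window=2):
    print(" ".join(left), f"[{key}]", " ".join(right))
```

A real concordancer would typically also match on PoS tags and search across all posts, which this sketch leaves out.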

Corpus Stats

  • Number of tokens: 5292615
  • Number of posts: 19224
    • Female author: 12007 (62.46%)
    • Male author: 7217 (37.54%)
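The percentages above follow directly from the author counts. A short check, using only the figures in the list:

```python
# Author counts copied from the Corpus Stats list above.
female_posts = 12007
male_posts = 7217
total_posts = female_posts + male_posts  # 19224, the reported number of posts


def share(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


print(share(female_posts, total_posts))  # 62.46
print(share(male_posts, total_posts))    # 37.54
```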

Word List (Top 100 Most Frequent)

rank token pos count
1 DE 219170
2 COMMACATEGORY 214385
3 Nh 110994
4 SHI 87591
5 V_2 57263
6 WHITESPACE 54959
7 PERIODCATEGORY 53661
8 D 49773
9 <URL> FW 46867
10 Neu 46811
11 Di 45339
12 . PERIODCATEGORY 43362
13 P 42562
14 D 40836
15 Nf 38905
16 D 36364
17 Nep 34835
18 D 34730
19 Dfa 31142
20 PAUSECATEGORY 28877
21 D 28162
22 EXCLAMATIONCATEGORY 26118
23 Nh 25930
24 Na 23626
25 QUESTIONCATEGORY 22487
26 Nh 22130
27 COLONCATEGORY 21278
28 VE 20998
29 Cbb 20839
30 D 20698
31 VE 19552
32 T 17949
33 PARENTHESISCATEGORY 17945
34 PARENTHESISCATEGORY 17902
35 PARENTHESISCATEGORY 17717
36 D 17090
37 自己 Nh 16816
38 可以 D 16785
39 ( PARENTHESISCATEGORY 16666
40 DASHCATEGORY 16400
41 PARENTHESISCATEGORY 16037
42 ) PARENTHESISCATEGORY 15383
43 P 14860
44 Di 14313
45 Nh 14053
46 因為 Cbb 13466
47 Nf 13462
48 大家 Nh 13311
49 VH 13173
50 真的 D 12655
51 VC 12612
52 T 12334
53 Nep 11579
54 Ncd 11502
55 知道 VK 11115
56 覺得 VK 11043
57 所以 Cbb 11017
58 P 10927
59 我們 Nh 10806
60 T 10511
61 VL 10505
62 D 10353
63 D 9750
64 什麼 Nep 9458
65 Ng 9090
66 D 8849
67 D 8535
68 Neu 8447
69 Di 8369
70 Nf 8249
71 Da 8117
72 D 8076
73 Da 8011
74 喜歡 VK 8001
75 D 7873
76 Nf 7870
77 還是 D 7831
78 Dfa 7646
79 VC 7582
80 Nes 7542
81 時候 Na 7460
82 ETCCATEGORY 7374
83 VC 7307
84 P 7278
85 如果 Cbb 7265
86 P 7013
87 這樣 VH 6938
88 VH 6930
89 P 6923
90 看到 VE 6879
91 沒有 VJ 6842
92 T 6571
93 Dfa 6539
94 時間 Na 6467
95 P 6467
96 VH 6439
97 比較 Dfa 6409
98 一下 Nd 6376
99 然後 D 6307
100 Caa 6291
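A frequency list of this shape can be built by counting (token, pos) pairs over the tagged posts. The sketch below assumes each post is a list of (token, pos) tuples; that data layout and the function name `top_tokens` are assumptions for illustration and may not match how this list was actually generated.

```python
from collections import Counter


def top_tokens(tagged_posts, n=100):
    """Return (rank, token, pos, count) rows for the n most frequent pairs."""
    counts = Counter(
        (token, pos) for post in tagged_posts for token, pos in post
    )
    return [
        (rank, token, pos, count)
        for rank, ((token, pos), count) in enumerate(counts.most_common(n), start=1)
    ]


# Toy input: two short tagged posts.
sample = [
    [("大家", "Nh"), ("覺得", "VK"), ("大家", "Nh")],
    [("大家", "Nh"), ("覺得", "VK"), ("時間", "Na")],
]
for row in top_tokens(sample, n=3):
    print(*row)
```

Counting pairs rather than bare tokens keeps the same word form separate when it receives different tags.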