ko-nlp/Korpora

[Corpus] Open subtitles 2018

lovit opened this issue · 2 comments

lovit commented

(snapshot)

<?xml version="1.0" encoding="UTF-8" ?>
<tmx version="1.4">
<header creationdate="Thu Oct 12 19:47:49 2017"
          srclang="en"
          adminlang="en"
          o-tmf="unknown"
          segtype="sentence"
          creationtool="Uplug"
          creationtoolversion="unknown"
          datatype="PlainText" />
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Through the snow and sleet and hail, through the blizzard, through the gales, through the wind and through the rain, over mountain, over plain, through the blinding lightning flash, and the mighty thunder crash,</seg></tuv>
      <tuv xml:lang="ko"><seg>폭설이 내리고 우박, 진눈깨비가 퍼부어도 눈보라가 몰아쳐도 강풍이 불고 비바람이 휘몰아쳐도</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>ever faithful, ever true, nothing stops him, he'll get through.</seg></tuv>
      <tuv xml:lang="ko"><seg>우리의 한결같은 심부름꾼 황새 아저씨 가는 길을 그 누가 막으랴!</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Look out for Mr Stork That persevering chap</seg></tuv>
      <tuv xml:lang="ko"><seg>황새 아저씨를 기다리세요</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>He'll come along and drop a bundle in your lap</seg></tuv>
      <tuv xml:lang="ko"><seg>찾아와 선물을 주실 거예요</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>You may be poor or rich It doesn't matter which</seg></tuv>
      <tuv xml:lang="ko"><seg>가난하든 부자이든 상관이 없답니다</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Millionaires, they get theirs like the butcher and the baker</seg></tuv>
      <tuv xml:lang="ko"><seg>백만장자도 하나 가난뱅이도 하나</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>So look out for Mr Stork and let me tell you, friend</seg></tuv>
      <tuv xml:lang="ko"><seg>황새 아저씨를 기다리세요</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Don't try to get away He'll find you in the end</seg></tuv>
      <tuv xml:lang="ko"><seg>도망쳐도 소용없어요 반드시 찾아내니까요</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>He'll spot you out in China or he'll fly to County Cork</seg></tuv>
      <tuv xml:lang="ko"><seg>세상 끝에 있어도 하늘 꼭대기에 있어도</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>So, you better look out for Mr Stork</seg></tuv>
      <tuv xml:lang="ko"><seg>황새 아저씨는 찾아간답니다</seg></tuv>

To process the file line by line, I use the following parser:

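import os
import re
from tqdm import tqdm


def parse_document(path):
    """Read a TMX file line by line and return parallel lists (sources, targets)."""
    pattern = re.compile(r'<seg>[\S ]+</seg>')

    def parse_segment(line):
        # Raises IndexError when the line holds no <seg>...</seg> element
        seg = pattern.findall(line)[0]
        return seg[5:-6]  # strip the leading '<seg>' and trailing '</seg>'

    sources = []
    targets = []
    # mode: 0 = outside <tu>, 1 = next line is the source, 2 = next line is the target
    mode = 0

    source, target = None, None
    with open(path, encoding='utf-8') as f:
        for line in tqdm(f, desc=f'Loading {os.path.basename(path)}'):
            line = line.strip()
            if line.startswith('<tu>'):
                mode = 1
                continue
            elif line.startswith('</tu>'):
                mode = 0
                # keep the pair only when both segments were parsed
                if source is not None and target is not None:
                    sources.append(source)
                    targets.append(target)
                source, target = None, None
                continue
            try:
                if mode == 1:
                    source = parse_segment(line)
                    mode = 2
                    continue
                if mode == 2:
                    target = parse_segment(line)
                    continue
            except IndexError:
                # malformed unit: skip its remaining lines until the next <tu>
                mode = 0
    return sources, targets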

sources, targets = parse_document(path)  # path: location of the decompressed TMX file
len(sources), len(targets)  # (1269683, 1269683)

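
For comparison, the same pairs can probably be extracted with the standard library's streaming XML parser instead of a regular expression. The following is only a minimal sketch, not part of Korpora: `iter_tmx_pairs` and its `source_lang` / `target_lang` parameters are hypothetical names, and it assumes each `<tu>` holds one `<tuv>` per language, as in the snapshot above. Unlike the line-based parser, it does not depend on every `<seg>` sitting on its own line, and it decodes XML entities such as `&amp;`.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the xml:lang attribute under its full namespace name.
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'


def iter_tmx_pairs(path, source_lang='en', target_lang='ko'):
    """Yield (source, target) text pairs from a TMX file, one <tu> at a time.

    Hypothetical helper for illustration; not an existing Korpora function.
    """
    for _, elem in ET.iterparse(path, events=('end',)):
        if elem.tag != 'tu':
            continue
        segs = {}
        for tuv in elem.iter('tuv'):
            seg = tuv.find('seg')
            if seg is not None:
                # itertext() also collects text nested inside inline tags
                segs[tuv.get(XML_LANG)] = ''.join(seg.itertext())
        # skip units that lack either language
        if source_lang in segs and target_lang in segs:
            yield segs[source_lang], segs[target_lang]
        # drop the children of the finished <tu> to limit memory use
        elem.clear()
```

Because it is a generator, the two lists can be rebuilt with `sources, targets = zip(*iter_tmx_pairs(path))` when the whole corpus is needed in memory.
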
lovit commented

(unzip gz file)

import gzip
import os
import shutil


def web_download_ungzip(url, gzip_path, corpus_name='', force_download=False):
    # web_download is assumed to be the project's existing download helper
    web_download(url, gzip_path, corpus_name, force_download)
    # assume that path/to/abc.gz decompresses to path/to/abc
    data_path = gzip_path[:-3]
    # skip decompression when the output already exists and no re-download is forced
    if (not force_download) and os.path.exists(data_path):
        return None
    with gzip.open(gzip_path, 'rb') as fi:
        with open(data_path, 'wb') as fo:
            shutil.copyfileobj(fi, fo)
    print(f'Decompressed {gzip_path}')
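
If the decompressed copy does not need to stay on disk, the archive could also be read directly in text mode. This is a rough sketch under assumptions, not what Korpora does: `iter_segments_from_gzip` is a hypothetical name, and it relies on each `<seg>` element occupying a single line, as in the snapshot above.

```python
import gzip
import re

# Captures the text between <seg> and </seg> on a single line.
SEG_PATTERN = re.compile(r'<seg>(.*)</seg>')


def iter_segments_from_gzip(gzip_path, encoding='utf-8'):
    """Yield the text of every <seg> line in a gzip-compressed TMX file.

    Hypothetical helper for illustration; not an existing Korpora function.
    """
    # 'rt' opens the archive in text mode, so lines are decoded while reading
    with gzip.open(gzip_path, 'rt', encoding=encoding) as f:
        for line in f:
            match = SEG_PATTERN.search(line)
            if match:
                yield match.group(1)
```

The trade-off is that every pass over the corpus pays the decompression cost again, whereas `web_download_ungzip` pays it once and leaves a plain file behind.
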