ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus

Separate documents?

fsimonjetz opened this issue · 3 comments

The readme says "All data have been processed automatically so that it is not possible to reconstruct the original source texts." I'm considering using the German-Korean data for my PhD project; however, for what I have in mind it would be helpful to have the documents separated. Is this information available?
Even stand-off indices would be nice.
I hope you can keep this project going, it looks like a promising resource!

@fsimonjetz Thank you for your interest in this project.

I would suggest using the following script to generate your own pairs of parallel data:
https://github.com/ajinkyakulkarni14/How-I-Extracted-TED-talks-for-parallel-Corpus-

If you are still not able to extract it, let me know.

@ajinkyakulkarni14, I used https://github.com/ajinkyakulkarni14/How-I-Extracted-TED-talks-for-parallel-Corpus- to extract data for en-ja, but it raises an error:
```
Traceback (most recent call last):
  File "extractTEDtalk.py", line 25, in <module>
    all_talk_names=enlist_talk_names(path,all_talk_names)
  File "extractTEDtalk.py", line 13, in enlist_talk_names
    r = urllib.request.urlopen(path).read()
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Rate Limited too many requests.
```

Can you help me to solve the error, please?
Thank you!
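For what it's worth, HTTP 429 means the server is rate-limiting the scraper, so the usual workaround is to slow down and retry with exponential backoff rather than calling `urllib.request.urlopen` in a tight loop. Below is a minimal sketch of such a wrapper; the function name `fetch_with_backoff` and its parameters are my own invention, not part of the repo's `extractTEDtalk.py`, and the `opener` argument exists only so the retry logic can be exercised without network access.

```python
import time
import urllib.request
import urllib.error


def fetch_with_backoff(url, max_retries=5, base_delay=1.0, opener=None):
    """Fetch `url`, retrying with exponential backoff on HTTP 429.

    `opener` defaults to urllib.request.urlopen; it is injectable so the
    backoff logic can be tested with a fake that never touches the network.
    """
    opener = opener or urllib.request.urlopen
    for attempt in range(max_retries):
        try:
            return opener(url).read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise  # only retry when the server says "too many requests"
            # Wait 1s, 2s, 4s, ... before trying again.
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("still rate-limited after %d retries" % max_retries)
```

In the repo script this would replace the bare `urllib.request.urlopen(path).read()` call inside `enlist_talk_names`; adding a small fixed delay between consecutive talk downloads helps as well.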

I am reopening the project and going to update the corpus soon.