Incorrect extraction of Korean characters

Question

Closed this issue 10 years ago · 5 comments

GoogleCodeExporter commented 10 years ago

What steps will reproduce the problem?
1. I don't know exactly how to explain this, so please see the attached file.

What is the expected output? What do you see instead?
Expected: correct characters.

Instead, only half of them come out correctly. I'm not familiar with Perl.

What version of the product are you using? On what operating system?


Please provide any additional information below.

Original issue reported on code.google.com by son...@gmail.com on 13 Oct 2008 at 7:01

Attachments:

Answer 1 · 2015-03-12T21:22:35.000Z

The file you attached claims to be XML (based on its file extension), but it
is not. It looks like SGML to me, and the encoding looks like UTF-8. The
SGML-to-XML conversion tool I use is sx, available from
http://www.jclark.com/sp/howtoget.htm. You'll need to configure that tool to
handle UTF-8 in order to convert your TEI file to XML, after which you should
be able to apply the transform from TEI to HTML.

Original comment by jhellingman on 15 Oct 2008 at 6:09

Answer 2 · 2015-03-12T21:22:35.000Z

Original comment by jhellingman on 15 Oct 2008 at 7:05

Changed state: Accepted
Added labels: Type-Enhancement
Removed labels: Type-Defect

Answer 3 · 2015-03-12T21:22:35.000Z

Original comment by jhellingman on 15 Oct 2008 at 7:07

Answer 4 · 2015-03-12T21:22:35.000Z

Hi,

I found that the file is encoded in 16-bit Unicode, little endian (UTF-16LE),
which is why I got the wrong result.
I corrected the script as shown below.
Thank you very much.

#!/usr/local/bin/perl
use strict;
use warnings;
use utf8;
use open ":utf8";
binmode(STDIN,  ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

## First, open the current directory
opendir(my $dh, "./") || die "cannot open ./: $!\n";

## Load the names of all files in the current directory into @dir
my @dir = readdir($dh);

closedir($dh);

## Keep only the files with the extension .txt in @dictmain
my @dictmain = grep { /\.txt$/ } @dir;

foreach my $file (@dictmain) {      ## Process the files one by one

   ## Decode the input as UTF-16LE (the layers apply from left to right)
   open(my $in, "<:raw:encoding(UTF-16LE):crlf", $file)
      or die "cannot read $file: $!";
   open(my $out, ">:utf8", "$file.utf8")
      or die "cannot write $file.utf8: $!";

   ## Read the whole file at once so patterns can match across line breaks
   local $/ = undef;

   while (my $text = <$in>) {
      ## Remove everything from the first "<!" through the last "]>"
      ## (for example a DOCTYPE declaration with an internal subset)
      $text =~ s/<!.*\]>//gs;

      ## Remove the header. To keep the text in the header, comment out
      ## this line. (The stray /e modifier has been dropped.)
      $text =~ s/<[Tt]ei[Hh]eader.*\/[Tt]ei[Hh]eader>//gs;

      ## Remove leading and trailing whitespace
      $text =~ s/^\s+//;
      $text =~ s/\s+$//;

      ## Remove the remaining tags, including tags that span several lines
      $text =~ s/<[^>]*>//g;

      ## Use print, not printf, so a "%" in the text is not read as a format
      print {$out} $text, "\n";
   }
   close $in;
   close $out;

}
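
The script above only works when the input encoding is known in advance. When it is not, the byte order mark (BOM) at the start of a file, if there is one, gives a hint. Here is a minimal sketch of that check. The helpers guess_encoding and decode_text are hypothetical names, not part of tei2html. The fallback to UTF-8 when no BOM is found is an assumption rather than a detection, and UTF-32 is ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: guess an encoding name from the leading bytes.
# Falls back to UTF-8 when there is no BOM (an assumption, not a detection).
sub guess_encoding {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-8';
}

# Hypothetical helper: decode raw bytes to characters and drop the BOM.
sub decode_text {
    my ($bytes) = @_;
    my $text = decode(guess_encoding($bytes), $bytes);
    $text =~ s/^\x{FEFF}//;    # the BOM decodes to U+FEFF; remove it
    return $text;
}

# Demo on in-memory bytes: a BOM followed by two Hangul syllables
# (U+D55C, U+AE00) encoded as UTF-16LE.
my $sample = "\xFF\xFE\x5C\xD5\x00\xAE";
printf "%s, %d characters\n",
    guess_encoding($sample), length(decode_text($sample));
```

For a real file, the bytes would be read through a filehandle opened with the ":raw" layer and then passed to decode_text.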

Original comment by son...@gmail.com on 19 Oct 2008 at 8:27

Answer 5 · 2015-03-12T21:22:36.000Z

Issue resolved outside tei2html.

Original comment by jhellingman on 3 Jan 2010 at 9:36

Changed state: Invalid