Incorrect extraction of Korean characters
Closed this issue · 5 comments
GoogleCodeExporter commented
What steps will reproduce the problem?
1. I don't know exactly how to explain this, so please see the attached file.
What is the expected output? What do you see instead?
Expected: correct characters.
Instead, only half of them come out correctly. I'm not familiar with Perl.
What version of the product are you using? On what operating system?
Please provide any additional information below.
Original issue reported on code.google.com by son...@gmail.com
on 13 Oct 2008 at 7:01
Attachments:
GoogleCodeExporter commented
The file you've added claims it is XML (based on its file-extension), but it is
not. It looks like SGML to me. The encoding looks like UTF-8. The SGML to XML
conversion tool I use is sx, available from http://www.jclark.com/sp/howtoget.htm.
You'll need to configure those tools to handle UTF-8 to convert your TEI file to
XML, after which you should be able to apply the transform from HTML to TEI.
Original comment by jhellingman
on 15 Oct 2008 at 6:09
GoogleCodeExporter commented
Original comment by jhellingman
on 15 Oct 2008 at 7:05
- Changed state: Accepted
- Added labels: Type-Enhancement
- Removed labels: Type-Defect
GoogleCodeExporter commented
Original comment by jhellingman
on 15 Oct 2008 at 7:07
GoogleCodeExporter commented
Hi,
I found that the file is encoded in 16-bit Unicode (UTF-16), little endian,
which is why I got the wrong result.
I corrected the script as shown below.
Thank you very much.
#!/usr/local/bin/perl
use strict;
use warnings;
use utf8;
use open ":utf8";
binmode(STDIN,  ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

## Open the current directory and keep only the files with extension .txt
opendir(my $dh, "./") or die "cannot open ./: $!\n";
my @dictmain = grep { /\.txt$/ } readdir($dh);
closedir($dh);

foreach my $file (@dictmain) {    ## Process the files one by one
    ## The input is UTF-16 little endian; the layers apply from left to right
    open(my $in, "<:raw:encoding(UTF-16LE):crlf", $file) or die $!;
    open(my $out, ">:utf8", "$file.utf8") or die $!;
    local $/ = undef;    # Read the whole file at once so patterns can span lines
    while (my $text = <$in>) {
        $text =~ s/^\x{FEFF}//;       # Drop the byte order mark, if present
        $text =~ s/<!.*]>//gs;        # Remove the <! ... ]> block (comment)
        # Remove the header. To keep the header text, comment out this line.
        $text =~ s/<[Tt]ei[Hh]eader.*\/[Tt]ei[Hh]eader>//gs;
        $text =~ s/^\s+//;            # Remove leading and trailing space
        $text =~ s/\s+$//;
        $text =~ s/<.*?>//gs;         # Remove other tags, even across lines
        print $out $text . "\n";      # print, not printf: the text may contain %
    }
    close $in;
    close $out;
}
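
The script above hard-codes UTF-16LE as the input encoding. One possible way to check that before converting is to look at the byte order mark at the start of each file. The following is only a sketch: it assumes the files start with a BOM, the helper names guess_encoding_from_bom and read_head are made up for illustration, and it does not tell UTF-32LE apart from UTF-16LE.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of tei2html): guess an encoding from the
# byte order mark at the start of a byte string. Returns undef if there
# is no BOM. Note: a UTF-32LE BOM also starts with FF FE and would be
# reported as UTF-16LE here.
sub guess_encoding_from_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return;
}

# Hypothetical helper: read the first three bytes of a file, undecoded.
sub read_head {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "cannot open $path: $!";
    my $head = '';
    read($fh, $head, 3);
    close $fh;
    return $head;
}

# Report the guessed encoding for every file named on the command line.
foreach my $path (@ARGV) {
    my $enc = guess_encoding_from_bom(read_head($path));
    print "$path: ", (defined $enc ? $enc : 'no BOM found'), "\n";
}
```

If a file reports no BOM, the encoding has to be determined some other way, for example by opening it in an editor that shows the encoding.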
Original comment by son...@gmail.com
on 19 Oct 2008 at 8:27
GoogleCodeExporter commented
Issue resolved outside tei2html.
Original comment by jhellingman
on 3 Jan 2010 at 9:36
- Changed state: Invalid