File convert.pl from phpCEDICT

This script assumes that you have converted all the UTF-8 text to HTML escapes and intend to strip the “&#” from those escapes and replace each “;” with a space before loading the database. If your database supports UTF-8, simply comment out the lines containing:

        $chineseline[0] =~ s/&#//g;
        $chineseline[0] =~ s/;/ /g;

Make sure to comment out these lines for both $chineseline[0] and $chineseline[1].
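
To make the effect of those two substitutions concrete, here is a minimal sketch that applies them to one escaped headword. The helper name strip_escapes and the sample string are illustrative assumptions, not part of convert.pl; the final whitespace cleanup stands in for the script's trim().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in convert.pl): applies the same two
# substitutions the script performs on each headword.
sub strip_escapes {
    my ($text) = @_;
    $text =~ s/&#//g;     # drop the "&#" that opens each escape
    $text =~ s/;/ /g;     # turn the closing ";" into a space
    $text =~ s/\s+$//;    # remove the trailing space, as trim() would
    return $text;
}

# "&#20013;&#22283;" is the escaped form of U+4E2D U+570B.
print strip_escapes('&#20013;&#22283;'), "\n";    # prints "20013 22283"
```

The database then holds space-separated decimal code points rather than the characters themselves, which is why the substitutions are unnecessary when the database can store UTF-8 directly.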

If you want to use the HTML escapes, you will first need to convert cedict_ts.u8 from UTF-8 to HTML escapes. Use Unicode2Html for that.
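
If Unicode2Html is not at hand, the same kind of conversion can be sketched in Perl with the core Encode module. This is only a sketch under assumptions: it assumes decimal numeric escapes of the form “&#NNNNN;” are wanted, which is what the substitutions in convert.pl expect, and the helper name utf8_to_html_escapes is made up for illustration. It has not been checked against the output of Unicode2Html.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes a UTF-8 encoded byte string and returns
# ASCII text in which every non-ASCII character is written as a
# decimal HTML escape.
sub utf8_to_html_escapes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);                  # bytes -> characters
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;   # escape non-ASCII
    return $text;
}

# The bytes E4 B8 AD E5 9C 8B are the UTF-8 encoding of U+4E2D U+570B.
print utf8_to_html_escapes("\xE4\xB8\xAD\xE5\x9C\x8B zhong1"), "\n";
# prints "&#20013;&#22283; zhong1"
```

To convert the whole dictionary, the helper could be applied to each line read from cedict_ts.u8 and the result written to a new file for convert.pl to read.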

I simply run this script from the command line using:

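  perl convert.pl

#!/usr/local/bin/perl

use strict;
use warnings;

# Strip leading and trailing whitespace from the argument (in place) and return it.
sub trim { $_[0] =~ s/^\s+//; $_[0] =~ s/\s+$//; $_[0]; }

open my $in,  '<', 'cedict_ts.u8'      or die "cedict_ts.u8: $!";
open my $out, '>', 'cedict_u8.sql.txt' or die "cedict_u8.sql.txt: $!";

# Note: this marker line and the DONE line at the end are written into the output file itself.
print $out "Processing...\n\n";

while (<$in>) {

    # ignore lines starting with #
    if (index($_, "#") != 0) {
        # split "traditional simplified [pinyin] /english/" at the square brackets
        my @line = split(/[\[\]]/, $_, 3);
        print $out "INSERT INTO cedict (traditional, simplified, pinyin, english) values (\"";

        # the first field holds the traditional and simplified forms, separated by a space
        my @chineseline = split(/ /, $line[0], 2);
        $chineseline[0] =~ s/&#//g;
        $chineseline[0] =~ s/;/ /g;
        print $out trim($chineseline[0]);
        print $out "\" , \"";
        $chineseline[1] =~ s/&#//g;
        $chineseline[1] =~ s/;/ /g;
        print $out trim($chineseline[1]);

        print $out "\" , \"";
        print $out trim($line[1]);
        print $out "\" , \"";
        # escape double quotes so they cannot terminate the SQL string early
        $line[2] =~ s/"/&quot;/g;
        print $out trim($line[2]);
        print $out "\");\n";
    }

}
print $out "\n\nDONE\n";

close $in;
close $out or die "cedict_u8.sql.txt: $!";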
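
To see how the two split calls carve up a dictionary entry, the sketch below runs them on a single sample line. The helper name parse_entry and the sample entry are assumptions made for illustration; only the split patterns and limits are taken from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper mirroring the two split calls in convert.pl.
sub parse_entry {
    my ($entry) = @_;
    # Split at "[" and "]": headwords, pinyin, definitions.
    my ($heads, $pinyin, $english) = split /[\[\]]/, $entry, 3;
    # The headword field holds the traditional and simplified forms.
    my ($traditional, $simplified) = split / /, $heads, 2;
    # Trim surrounding whitespace from every field.
    for ($traditional, $simplified, $pinyin, $english) {
        s/^\s+//;
        s/\s+$//;
    }
    return ($traditional, $simplified, $pinyin, $english);
}

my @fields = parse_entry("&#20013;&#22283; &#20013;&#22269; [Zhong1 guo2] /China/\n");
print join(' | ', @fields), "\n";
# prints "&#20013;&#22283; | &#20013;&#22269; | Zhong1 guo2 | /China/"
```

The limit of 3 on the first split keeps any later square brackets inside the definitions from producing extra fields, and the limit of 2 on the second keeps everything after the first space together as the simplified form.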
 
  code/cedictconvert.txt · Last modified: 2005/01/06 11:37
 