File convert.pl from phpCEDICT
This file assumes that you first convert all the UTF-8 text to HTML escapes, and then intend to strip the “&#” and “;” from those escapes before loading the database. If you have a DB that supports UTF-8, simply comment out the lines containing:
$chineseline[0] =~ s/&#//g; $chineseline[0] =~ s/;/ /g;
Make sure to comment these lines out for both $chineseline[0] and $chineseline[1].
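To make the effect of those two substitutions concrete, here is a minimal sketch. The helper name strip_escapes is hypothetical and not part of convert.pl, and the sample input assumes decimal escapes of the form &#NNNNN;.

```perl
use strict;
use warnings;

# Hypothetical helper (not in convert.pl): applies the same two
# substitutions the script runs on each headword field, then trims.
sub strip_escapes {
    my ($field) = @_;
    $field =~ s/&#//g;    # drop the "&#" that opens each numeric escape
    $field =~ s/;/ /g;    # turn each closing ";" into a space separator
    $field =~ s/^\s+//;   # trim, as the script's trim() does
    $field =~ s/\s+$//;
    return $field;
}

# U+4E2D is 20013 and U+570B is 22283 in decimal.
print strip_escapes("&#20013;&#22283;"), "\n";   # prints "20013 22283"
```

With the substitutions commented out, the field would be written to the SQL file unchanged.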
If you want to use the HTML escapes, you will need to convert cedict_ts.u8 from UTF-8 to HTML escapes first. Use Unicode2Html for that.
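If Unicode2Html is not at hand, the conversion itself can be sketched in Perl with the core Encode module. This is an illustration only: the helper name utf8_to_html_escapes is hypothetical, and it assumes decimal escapes are wanted, which may not match exactly what Unicode2Html produces.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes a string of raw UTF-8 bytes and returns it
# with every non-ASCII character replaced by a decimal HTML escape.
sub utf8_to_html_escapes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);                  # bytes -> characters
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;   # escape non-ASCII
    return $text;
}

# "\xE4\xB8\xAD" is the UTF-8 encoding of U+4E2D.
print utf8_to_html_escapes("\xE4\xB8\xAD [zhong1] /middle/"), "\n";
# prints "&#20013; [zhong1] /middle/"
```

ASCII text such as the pinyin and the English definitions passes through unchanged.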
I simply run this script from the command line using:
perl convert.pl
#!/usr/local/bin/perl
#use warnings;
#use strict;

# Strip leading and trailing whitespace from the argument and return it.
sub trim {
    $_[0] =~ s/^\s+//o;
    $_[0] =~ s/\s+$//o;
    $_[0];
}

open IN, "cedict_ts.u8" or die $!;
open OUT, ">cedict_u8.sql.txt" or die $!;

print OUT "Processing...\n\n";

while (<IN>) {
    # ignore lines starting with #
    if (index($_, "#") != 0) {
        # split "trad simp [pinyin] /english/" at the square brackets
        my @line = split(/[\[\]]/, $_, 3);
        print OUT "INSERT INTO cedict (traditional, simplified, pinyin, english) values (\"";

        # split the headword part into traditional and simplified
        my @chineseline = split(/ /, $line[0], 2);

        $chineseline[0] =~ s/&#//g;
        $chineseline[0] =~ s/;/ /g;
        print OUT trim($chineseline[0]);
        print OUT "\" , \"";

        $chineseline[1] =~ s/&#//g;
        $chineseline[1] =~ s/;/ /g;
        print OUT trim($chineseline[1]);
        print OUT "\" , \"";

        print OUT trim($line[1]);
        print OUT "\" , \"";

        #$line[2] =~ s/\s+$//;
        # keep double quotes from ending the SQL string early
        # (the replacement text &quot; is an assumption)
        $line[2] =~ s/"/&quot;/g;
        print OUT trim($line[2]);
        print OUT "\");\n";
    }
}

print OUT "\n\nDONE\n";

close IN;
close OUT;
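The way the script takes a dictionary line apart can be tried on a single entry. This is a small sketch: the helper name parse_line and the sample entry are made up for illustration, and the entry is assumed to be already converted to HTML escapes.

```perl
use strict;
use warnings;

# Hypothetical helper: splits one dictionary line the same way the script
# does, first at the square brackets, then at the first space.
sub parse_line {
    my ($entry) = @_;
    my ($headwords, $pinyin, $english) = split /[\[\]]/, $entry, 3;
    my ($traditional, $simplified)     = split / /, $headwords, 2;
    # trim leading and trailing whitespace from every field
    s/^\s+|\s+$//g for $traditional, $simplified, $pinyin, $english;
    return ($traditional, $simplified, $pinyin, $english);
}

my @fields = parse_line("&#20013;&#22283; &#20013;&#22269; [Zhong1 guo2] /China/\n");
print join(" | ", @fields), "\n";
# prints "&#20013;&#22283; | &#20013;&#22269; | Zhong1 guo2 | /China/"
```

The limit of 3 on the first split keeps any further square brackets inside the English definition from cutting it short.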
