Contents

Converting between UTF-8 and Windows-1251 in PHP with iconv
How do I convert Windows-1252 to UTF-8?
3 answers
Changing a string's encoding from UTF-8 to Windows-1251
utf8_encode
Description
Parameters
Return Values
Changelog
See Also
User Contributed Notes 23 notes

Converting between UTF-8 and Windows-1251 in PHP with iconv

A PHP page or site usually has one base encoding. I work only with UTF-8, but sometimes I have to use a PHP file saved in Windows-1251 while the values passed into it arrive in UTF-8. In that case the script outputs garbled characters (mojibake).

There are many such situations: sending mail with mail(), generating PDFs, or various database operations. Ideally you should get rid of these leftovers and keep conversion functions out of your code, but when that is not possible, the iconv function helps.

Function syntax: $string = iconv('source encoding', 'target encoding', $string);

Here $string is the string whose encoding we are changing.

So, to convert a string from UTF-8 to Windows-1251, write: $string = iconv('utf-8', 'windows-1251', $string);

From Windows-1251 to UTF-8: $string = iconv('windows-1251', 'utf-8', $string);
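A minimal round-trip sketch of the two calls above, assuming the source file itself is saved as UTF-8 and the iconv extension is enabled. The variable names are only illustrative.

```php
<?php
// Assumes this source file is saved as UTF-8.
$utf8 = 'Привет';

// UTF-8 -> Windows-1251: every Cyrillic letter becomes a single byte.
$cp1251 = iconv('UTF-8', 'Windows-1251', $utf8);

// Windows-1251 -> UTF-8: the round trip restores the original string.
$back = iconv('Windows-1251', 'UTF-8', $cp1251);

echo strlen($utf8), ' bytes in UTF-8, ', strlen($cp1251), " bytes in Windows-1251\n";
echo $back === $utf8 ? "round trip OK\n" : "round trip failed\n";
```

iconv() returns false when the input cannot be converted, so in real code it is worth checking the result before using it.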

Note: you can declare the encoding of a PHP page by sending a document header. For example, if your page is UTF-8 without a BOM, send a Content-Type header with charset=utf-8 at the very start of the document. If it is Windows-1251, the file itself must be saved as ANSI, and the first line can send the same header with charset=windows-1251. Sometimes this also helps with emails that arrive garbled because of a wrong encoding.
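As a sketch of what such a header call might look like (the exact snippet is missing from the text, so this is an assumption), the hypothetical helper content_type_header() below builds the header line and header() sends it before any output.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for a charset.
function content_type_header(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Headers must be sent before any output. The call is skipped on the
// command line, where there is no HTTP response to attach a header to.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header(content_type_header('utf-8'));
}
```

For a page saved in Windows-1251 the same call would pass 'windows-1251' instead.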

How do I convert Windows-1252 to UTF-8?

I scraped an order and am writing it to the database. If I write it as is, the database contains: ÐžÑ„Ð¾Ñ€Ð¼Ð»ÐµÐ½

I ran this text through a decoder, and it shows: CP1252 → UTF-8 = Оформлен

Russian text is not written to the database (I need the word "Оформлен"); gibberish is written instead.

1) I try to convert:

Same result. I also read about another way:

2) I try it differently:

Result: Îôîðìëåí. I check which encoding this is; the result:

So the encoding does not change? How do I change the encoding?

3 answers

This is ordinary 1251, not 1252. Automatic detectors very often confuse these two encodings, and real 1252 occurs much less often.

Your question should be split into two. First, convert the text to the right encoding correctly (iconv is the right tool, as you wrote yourself); second, insert it into the database correctly. For that you need to check the encoding of the connection, the database, and the server, and there can be many nuances. After all, you keep writing "database" and "into the database", right?

First make sure the text is output correctly to a page (or a file, or a debugger). Keep in mind that the source file itself may be saved in cp1251 rather than utf-8. When outputting to a page, the headers sent by the server and those set in the page matter as well.

Then check the database connection settings. Sometimes you can simply run SET NAMES utf8 if you do not feel like digging into encoding issues.
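To make the two garbled forms from the question concrete, here is a sketch under the assumption that the first one is UTF-8 text whose bytes were read as Windows-1252, and the second one is raw Windows-1251 bytes. The sample is built programmatically, and all variable names are illustrative.

```php
<?php
// Assumes this source file is saved as UTF-8.
$original = 'Оформлен';

// Simulate the first symptom: UTF-8 bytes misread as Windows-1252 and
// stored as UTF-8 again. This yields the "Ðž..." form from the question.
$garbled = iconv('Windows-1252', 'UTF-8', $original);

// Undo it by converting "back" to Windows-1252, which restores the
// original UTF-8 byte sequence.
$fixed = iconv('UTF-8', 'Windows-1252', $garbled);

// Second symptom: the same word as raw Windows-1251 bytes, which a
// Latin-script viewer displays as "Îôîðìëåí".
$cp1251Bytes = "\xCE\xF4\xEE\xF0\xEC\xEB\xE5\xED";
$fromCp1251 = iconv('Windows-1251', 'UTF-8', $cp1251Bytes);

echo $fixed, "\n", $fromCp1251, "\n";
```

Once the string is correct UTF-8 in PHP, the remaining step is the one the answers mention: making sure the database connection uses a matching character set.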

Changing a string's encoding from UTF-8 to Windows-1251

I can't change the encoding from utf-8.
I save the files in PHP Expert Editor and choose ANSI; the template contains line 6

convert_cyr_string
(PHP 3>= 3.0.6, PHP 4 )

convert_cyr_string — Convert from one Cyrillic character set to another
Description
string convert_cyr_string ( string str, string from, string to)

This function returns the given string converted from one Cyrillic character set to another. The from and to arguments are single characters that represent the source and target Cyrillic character sets. The supported types are:

k - koi8-r
w - windows-1251
i - iso8859-5
a - x-cp866
d - x-cp866
m - x-mac-cyrillic
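convert_cyr_string() was deprecated in PHP 7.4 and removed in PHP 8.0, so on current versions the same kind of conversion is usually done with iconv(). The sketch below assumes a UTF-8 source file; it first builds a KOI8-R sample and then converts it to Windows-1251, which roughly corresponds to the old call with 'k' and 'w'.

```php
<?php
// Assumes this source file is saved as UTF-8.
$utf8 = 'Привет';

// Build a KOI8-R sample to work with.
$koi8 = iconv('UTF-8', 'KOI8-R', $utf8);

// KOI8-R -> Windows-1251 directly, without going through UTF-8.
$cp1251 = iconv('KOI8-R', 'Windows-1251', $koi8);

echo bin2hex($koi8), "\n";
echo bin2hex($cp1251), "\n";
```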


utf8_encode

(PHP 4, PHP 5, PHP 7, PHP 8)

utf8_encode - Encodes an ISO-8859-1 string to UTF-8

Description

This function converts the string string from the ISO-8859-1 encoding to UTF-8.

Many web pages marked as using ISO-8859-1 actually use the similar Windows-1252 encoding, and web browsers interpret ISO-8859-1 pages as Windows-1252. However, Windows-1252 contains additional printable characters, such as the Euro sign (€) and curly quotes (“ ”), in place of certain ISO-8859-1 control codes. This function does not convert such Windows-1252 characters correctly. Use a different function if you need to convert from Windows-1252.
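A small sketch of that difference for the single byte 0x80. It uses iconv('ISO-8859-1', ...) as a stand-in for utf8_encode(), which performs the same ISO-8859-1 to UTF-8 conversion but is deprecated as of PHP 8.2.

```php
<?php
// Byte 0x80 is the Euro sign in Windows-1252 but a control code in ISO-8859-1.
$byte = "\x80";

$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);   // U+0080, a control character
$asCp1252 = iconv('Windows-1252', 'UTF-8', $byte); // U+20AC, the Euro sign

echo bin2hex($asLatin1), "\n"; // c280
echo bin2hex($asCp1252), "\n"; // e282ac
```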

Parameters

string: An ISO-8859-1 string.

Return Values

Returns the UTF-8 translation of string.

Changelog

Version	Description
7.2.0	This function was moved into the PHP core, removing the requirement for the XML extension in order to use it.

See Also

utf8_encode() (contains a description of UTF-8)
mb_convert_encoding() - Convert a string from one character encoding to another - Converts between many different encodings, including UTF-8, ISO-8859-1 and Windows-1252
iconv() - Convert a string to the requested character encoding - Converts between many different encodings
recode_string() - Recode a string according to a recode request - Converts between many different encodings

User Contributed Notes 23 notes

Please note that utf8_encode only converts a string encoded in ISO-8859-1 to UTF-8. A more appropriate name for it would be "iso88591_to_utf8". If your text is not encoded in ISO-8859-1, you do not need this function. If your text is already in UTF-8, you do not need this function. In fact, applying this function to text that is not encoded in ISO-8859-1 will most likely simply garble that text.

If you need to convert text from any encoding to any other encoding, look at iconv() instead.

My version of utf8_encode_deep,
In case you need one that returns a value without changing the original.

/**
 * Convert Anything To UTF-8
 * @param mixed $var The variable you want to convert.
 * @param boolean $deep Deep conversion? (*Default: TRUE).
 * @return mixed
 */
function anything_to_utf8($var, $deep = TRUE) {
    if (is_array($var)) {
        foreach ($var as $key => $value) {
            if ($deep) {
                $var[$key] = anything_to_utf8($value, $deep);
            } elseif (!is_array($value) && !is_object($value) && !mb_detect_encoding($value, 'utf-8', true)) {
                // Encode the element itself, not the whole container.
                $var[$key] = utf8_encode($value);
            }
        }
        return $var;
    } elseif (is_object($var)) {
        foreach ($var as $key => $value) {
            if ($deep) {
                $var->$key = anything_to_utf8($value, $deep);
            } elseif (!is_array($value) && !is_object($value) && !mb_detect_encoding($value, 'utf-8', true)) {
                // Encode the property value, not the whole object.
                $var->$key = utf8_encode($value);
            }
        }
        return $var;
    } else {
        return (!mb_detect_encoding($var, 'utf-8', true)) ? utf8_encode($var) : $var;
    }
}

Here’s some code that addresses the issue that Steven describes in the previous comment;

/* This structure encodes the difference between ISO-8859-1 and Windows-1252,
as a map from the UTF-8 encoding of some ISO-8859-1 control characters to
the UTF-8 encoding of the non-control characters that Windows-1252 places
at the equivalent code points. */

$cp1252_map = array(
    "\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
    "\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
    "\xc2\x83" => "\xc6\x92",     /* LATIN SMALL LETTER F WITH HOOK */
    "\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
    "\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
    "\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
    "\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
    "\xc2\x88" => "\xcb\x86",     /* MODIFIER LETTER CIRCUMFLEX ACCENT */
    "\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
    "\xc2\x8a" => "\xc5\xa0",     /* LATIN CAPITAL LETTER S WITH CARON */
    "\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
    "\xc2\x8c" => "\xc5\x92",     /* LATIN CAPITAL LIGATURE OE */
    "\xc2\x8e" => "\xc5\xbd",     /* LATIN CAPITAL LETTER Z WITH CARON */
    "\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
    "\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
    "\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
    "\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
    "\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
    "\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
    "\xc2\x97" => "\xe2\x80\x94", /* EM DASH */

    "\xc2\x98" => "\xcb\x9c",     /* SMALL TILDE */
    "\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
    "\xc2\x9a" => "\xc5\xa1",     /* LATIN SMALL LETTER S WITH CARON */
    "\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION */
    "\xc2\x9c" => "\xc5\x93",     /* LATIN SMALL LIGATURE OE */
    "\xc2\x9e" => "\xc5\xbe",     /* LATIN SMALL LETTER Z WITH CARON */
    "\xc2\x9f" => "\xc5\xb8"      /* LATIN CAPITAL LETTER Y WITH DIAERESIS */
);

function cp1252_to_utf8($str) {
    global $cp1252_map;
    return strtr(utf8_encode($str), $cp1252_map);
}

Walk through nested arrays/objects and utf8 encode all strings.

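<?php
// Usage
class Foo {
    public $somevar = 'whoop whoop';
}

$structure = array(
    'object' => (object) array(
        'entry' => 'hello wörld',
        'another_array' => array(
            'string',
            1234,
            'another string'
        )
    ),
    'string' => 'foo',
    'foo_object' => new Foo
);

utf8_encode_deep($structure);

// $structure is now utf8 encoded
print_r($structure);

// The function. Its opening lines are missing from the text above; the
// by-reference signature and the string/array branches are assumed from
// the recursive calls and the unset($value) that survived.
function utf8_encode_deep(&$input) {
    if (is_string($input)) {
        $input = utf8_encode($input);
    } else if (is_array($input)) {
        foreach ($input as &$value) {
            utf8_encode_deep($value);
        }

        unset($value);
    } else if (is_object($input)) {
        $vars = array_keys(get_object_vars($input));

        foreach ($vars as $var) {
            utf8_encode_deep($input->$var);
        }
    }
}
?>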

If you need a function which converts a string array into a utf8 encoded string array then this function might be useful for you:

This function may be useful to encode array keys and values (it first checks whether the data is already in UTF-8):

<?php
function to_utf8($in)
{
    if (is_array($in)) {
        // Start from an empty array so an empty input still returns an array.
        $out = array();
        foreach ($in as $key => $value) {
            $out[to_utf8($key)] = to_utf8($value);
        }
        return $out;
    } elseif (is_string($in)) {
        if (mb_detect_encoding($in) != "UTF-8")
            return utf8_encode($in);
        else
            return $in;
    } else {
        return $in;
    }
}
?>

Hope this may help.

[NOTE BY danbrown AT php DOT net: Original function written by (cmyk777 AT gmail DOT com) on 28-JAN-09.]

/**
 * Convert all values of an array to utf8_encode
 * @author Marcelo Ratton
 * @version 1.0
 *
 * @param array $arr array to encode values
 * @param bool $keys true to convert keys to UTF8
 * @return array same array but with all values encoded to UTF8
 */
function arrayEncodeToUTF8(array $arr, bool $keys = false): array {
    $ret = [];
    foreach ($arr as $k => $v) {
        if ($keys) {
            $k = utf8_encode((string)$k);
        }
        if (is_array($v)) {
            // Pass $keys down so nested arrays are treated the same way.
            $ret[$k] = arrayEncodeToUTF8($v, $keys);
        } else {
            $ret[$k] = utf8_encode((string)$v);
        }
    }
    return $ret;
}

I tried a lot of things, but this seems to be the final fail-safe method to convert any string to proper UTF-8.

<?php
function _convert($content) {
    // Convert only if the string is not valid UTF-8 or does not survive
    // a UTF-8 -> UTF-32 -> UTF-8 round trip unchanged.
    if (!mb_check_encoding($content, 'UTF-8')
        OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32'))) {

        $content = mb_convert_encoding($content, 'UTF-8');

        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not be converted to UTF-8');
        }
    }
    return $content;
}
?>

For reference, it may be insightful to point out that:
utf8_encode($s)
is actually identical to:
recode_string('latin1..utf8', $s)
and:
iconv('iso-8859-1', 'utf-8', $s)
That is, utf8_encode is a specialized case of character set conversions.

If your string to be converted to utf-8 is something other than iso-8859-1 (such as iso-8859-2 (Polish/Croatian)), you should use recode_string() or iconv() instead of trying to devise complex str_replace statements.
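To illustrate why the source encoding matters, here is a short sketch with one byte that means different things in ISO-8859-2 and ISO-8859-1. The byte value is chosen for illustration only.

```php
<?php
// "\xB3" is the letter "ł" in ISO-8859-2 but the sign "³" in ISO-8859-1.
$byte = "\xB3";

$asLatin2 = iconv('ISO-8859-2', 'UTF-8', $byte); // "ł" (U+0142)
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte); // "³" (U+00B3), the utf8_encode() result

echo bin2hex($asLatin2), "\n"; // c582
echo bin2hex($asLatin1), "\n"; // c2b3
```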

If you are looking for a function to replace special characters with the hex UTF-8 value (e.g. for Webservice-Security/WSS4J compliance) you might use this:

$textstart = "Größe";
$utf8 = '';
$max = strlen($txt);

$utf8 .= $neu;
} // for $i

In this example $textnew will be "GrÃ¶Ãe"

I recommend using this alternative for every language:

Don't forget to set all your pages to "utf-8" encoding, otherwise just use HTML entities.

I was searching for a function similar to Javascript's unescape(). In most cases it is OK to use the urldecode() function, but not if you've got UTF characters in the strings. They are converted into %uXXXX entities that urldecode() cannot handle.
I googled the net and found a function which actually converts these entities into HTML entities (&#xxx;) that your browser can show correctly. If you're OK with that, the function can be found here: http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps

But it was not OK with me because I needed a string in my charset to make some comparisons and other stuff. So I have modified the above function and, in conjunction with the code2utf() function mentioned in some other note here, I have managed to achieve my goal:

<?php
/**
 * Converts a Javascript-escaped string back into a string with the specified charset (default is UTF-8).
 * Modified function from http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps
 *
 * @param string $source string escaped with Javascript's escape() function
 * @param string $iconv_to destination character set, used as the second parameter of the iconv function. Default is UTF-8.
 * @return string
 */
function unescape($source, $iconv_to = 'UTF-8') {
    $decodedStr = '';
    $pos = 0;
    $len = strlen($source);
    while ($pos < $len) {
        $charAt = substr($source, $pos, 1);
        if ($charAt == '%') {
            $pos++;
            $charAt = substr($source, $pos, 1);
            if ($charAt == 'u') {
                // we got a unicode character
                $pos++;
                $unicodeHexVal = substr($source, $pos, 4);
                $unicode = hexdec($unicodeHexVal);
                $decodedStr .= code2utf($unicode);
                $pos += 4;
            } else {
                // we have an escaped single-byte character (%XX); go through
                // code2utf() so values above 0x7F also become valid UTF-8
                $hexVal = substr($source, $pos, 2);
                $decodedStr .= code2utf(hexdec($hexVal));
                $pos += 2;
            }
        } else {
            $decodedStr .= $charAt;
            $pos++;
        }
    }

    if ($iconv_to != "UTF-8") {
        $decodedStr = iconv("UTF-8", $iconv_to, $decodedStr);
    }

    return $decodedStr;
}

/**
 * Converts a Unicode code point number into the corresponding UTF-8 character.
 * Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
 *
 * @param int $num
 * @return utf8char
 */
function code2utf($num) {
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}
?>

Avoiding use of preg_match to detect if utf8_encode is needed:

<?php
$string = $string_input; // avoid being destructive

// Strip every well-formed UTF-8 sequence; anything left over is invalid.
$string = preg_replace("#[\x09\x0A\x0D\x20-\x7E]#", "", $string);            // ASCII
$string = preg_replace("#[\xC2-\xDF][\x80-\xBF]#", "", $string);             // non-overlong 2-byte
$string = preg_replace("#\xE0[\xA0-\xBF][\x80-\xBF]#", "", $string);         // excluding overlongs
$string = preg_replace("#[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}#", "", $string);  // straight 3-byte
$string = preg_replace("#\xED[\x80-\x9F][\x80-\xBF]#", "", $string);         // excluding surrogates
$string = preg_replace("#\xF0[\x90-\xBF][\x80-\xBF]{2}#", "", $string);      // planes 1-3
$string = preg_replace("#[\xF1-\xF3][\x80-\xBF]{3}#", "", $string);          // planes 4-15
$string = preg_replace("#\xF4[\x80-\x8F][\x80-\xBF]{2}#", "", $string);      // plane 16

$rc = ($string == "" ? true : false);
?>
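A shorter alternative sketch, not taken from the note above: PCRE itself rejects malformed UTF-8 when the u modifier is set, so matching an empty pattern can serve as a validity check. The helper name looks_like_utf8() is hypothetical.

```php
<?php
// Hypothetical helper: returns true when $s is well-formed UTF-8.
function looks_like_utf8(string $s): bool
{
    // With the /u modifier preg_match() returns false (not 0) when the
    // subject is malformed UTF-8; an empty pattern matches any valid string.
    return preg_match('//u', $s) === 1;
}

var_dump(looks_like_utf8('Привет'));       // UTF-8 text (file saved as UTF-8)
var_dump(looks_like_utf8("\xCF\xF0\xE8")); // Windows-1251 bytes
```

A check like this only says whether the bytes form valid UTF-8. It cannot tell which single-byte encoding an invalid string actually uses.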

/**
 * Encodes an ISO-8859-1 mixed variable to UTF-8 (PHP 4, PHP 5 compat)
 * @param mixed $input An array, associative or simple
 * @param boolean $encode_keys optional
 * @return mixed ( utf-8 encoded $input)
 */

function utf8_encode_mix($input, $encode_keys = false)
{
    if (is_array($input))
    {
        $result = array();
        foreach ($input as $k => $v)
        {
            $key = ($encode_keys) ? utf8_encode($k) : $k;
            $result[$key] = utf8_encode_mix($v, $encode_keys);
        }
    }
    else
    {
        $result = utf8_encode($input);
    }

    return $result;
}

// Reads a file story.txt ascii (as typed on keyboard)
// converts it to Georgian character using utf8 encoding
// if I am correct(?) just as it should be when typed on Georgian computer
// it outputs it as an html file
//
// http://www.comweb.nl/keys_to_georgian.html
// http://www.comweb.nl/keys_to_georgian.php
// http://www.comweb.nl/story.txt

keys to unicode code

// this meta tag is needed

// note the Sylfaen font seems to be installed by default on Windows XP
// It supports Georgian

I use this function to convert Thai text (iso-8859-11) to UTF-8. In my case it works properly. Try it if you have a problem converting the iso-8859-11 charset to UTF-8.

// The function header is missing from the text; the name below is assumed.
function iso8859_11_to_utf8($string) {

if ( ! preg_match('/[\xA1-\xFF]/', $string) ) // ereg() is not available since PHP 7
return $string;

$iso8859_11 = array(
"\xa1" => "\xe0\xb8\x81",
"\xa2" => "\xe0\xb8\x82",
"\xa3" => "\xe0\xb8\x83",
"\xa4" => "\xe0\xb8\x84",
"\xa5" => "\xe0\xb8\x85",
"\xa6" => "\xe0\xb8\x86",
"\xa7" => "\xe0\xb8\x87",
"\xa8" => "\xe0\xb8\x88",
"\xa9" => "\xe0\xb8\x89",
"\xaa" => "\xe0\xb8\x8a",
"\xab" => "\xe0\xb8\x8b",
"\xac" => "\xe0\xb8\x8c",
"\xad" => "\xe0\xb8\x8d",
"\xae" => "\xe0\xb8\x8e",
"\xaf" => "\xe0\xb8\x8f",
"\xb0" => "\xe0\xb8\x90",
"\xb1" => "\xe0\xb8\x91",
"\xb2" => "\xe0\xb8\x92",
"\xb3" => "\xe0\xb8\x93",
"\xb4" => "\xe0\xb8\x94",
"\xb5" => "\xe0\xb8\x95",
"\xb6" => "\xe0\xb8\x96",
"\xb7" => "\xe0\xb8\x97",
"\xb8" => "\xe0\xb8\x98",
"\xb9" => "\xe0\xb8\x99",
"\xba" => "\xe0\xb8\x9a",
"\xbb" => "\xe0\xb8\x9b",
"\xbc" => "\xe0\xb8\x9c",
"\xbd" => "\xe0\xb8\x9d",
"\xbe" => "\xe0\xb8\x9e",
"\xbf" => "\xe0\xb8\x9f",
"\xc0" => "\xe0\xb8\xa0",
"\xc1" => "\xe0\xb8\xa1",
"\xc2" => "\xe0\xb8\xa2",
"\xc3" => "\xe0\xb8\xa3",
"\xc4" => "\xe0\xb8\xa4",
"\xc5" => "\xe0\xb8\xa5",
"\xc6" => "\xe0\xb8\xa6",
"\xc7" => "\xe0\xb8\xa7",
"\xc8" => "\xe0\xb8\xa8",
"\xc9" => "\xe0\xb8\xa9",
"\xca" => "\xe0\xb8\xaa",
"\xcb" => "\xe0\xb8\xab",
"\xcc" => "\xe0\xb8\xac",
"\xcd" => "\xe0\xb8\xad",
"\xce" => "\xe0\xb8\xae",
"\xcf" => "\xe0\xb8\xaf",
"\xd0" => "\xe0\xb8\xb0",
"\xd1" => "\xe0\xb8\xb1",
"\xd2" => "\xe0\xb8\xb2",
"\xd3" => "\xe0\xb8\xb3",
"\xd4" => "\xe0\xb8\xb4",
"\xd5" => "\xe0\xb8\xb5",
"\xd6" => "\xe0\xb8\xb6",
"\xd7" => "\xe0\xb8\xb7",
"\xd8" => "\xe0\xb8\xb8",
"\xd9" => "\xe0\xb8\xb9",
"\xda" => "\xe0\xb8\xba",
"\xdf" => "\xe0\xb8\xbf",
"\xe0" => "\xe0\xb9\x80",
"\xe1" => "\xe0\xb9\x81",
"\xe2" => "\xe0\xb9\x82",
"\xe3" => "\xe0\xb9\x83",
"\xe4" => "\xe0\xb9\x84",
"\xe5" => "\xe0\xb9\x85",
"\xe6" => "\xe0\xb9\x86",
"\xe7" => "\xe0\xb9\x87",
"\xe8" => "\xe0\xb9\x88",
"\xe9" => "\xe0\xb9\x89",
"\xea" => "\xe0\xb9\x8a",
"\xeb" => "\xe0\xb9\x8b",
"\xec" => "\xe0\xb9\x8c",
"\xed" => "\xe0\xb9\x8d",
"\xee" => "\xe0\xb9\x8e",
"\xef" => "\xe0\xb9\x8f",
"\xf0" => "\xe0\xb9\x90",
"\xf1" => "\xe0\xb9\x91",
"\xf2" => "\xe0\xb9\x92",
"\xf3" => "\xe0\xb9\x93",
"\xf4" => "\xe0\xb9\x94",
"\xf5" => "\xe0\xb9\x95",
"\xf6" => "\xe0\xb9\x96",
"\xf7" => "\xe0\xb9\x97",
"\xf8" => "\xe0\xb9\x98",
"\xf9" => "\xe0\xb9\x99",
"\xfa" => "\xe0\xb9\x9a",
"\xfb" => "\xe0\xb9\x9b"
);

$string=strtr($string,$iso8859_11);
return $string;
}

Re the previous post about converting GB2312 code to Unicode code which displayed the following function:

I found that a small change was needed in the code to properly handle latin characters embedded in the middle of gb2312 text, as when the text includes a URL or email address. Just reverse the two lines in the part of the statement above that handles ord vals !>127.

In the original function, the first latin character was dropped and the first non-latin character after the latin text was not converted (everything was shifted one character too far to the right). Reversing those two lines makes it work correctly in every example I have tried.

Also, the source of the gb2312.txt file needed for this to work has changed. You can find it a couple places:

/*
Every function seen so far is incomplete or resource-consuming. Here are two: integer to UTF sequence (i3u) and UTF sequence to integer (u3i). Below is a code snippet that checks their behavior at the range boundaries.

Someday they might be hardcoded into PHP.
*/

function u3i($s,$strict=1) { // returns integer on valid UTF-8 seq, NULL on empty, else FALSE
// NOT strict: takes only DATA bits, present or not; strict: length and bits checking
if ($s=='') return NULL;
$l=strlen($s); $o=ord($s[0]);
if ($o 6 && $strict) return false;
if ($strict) for ($i=1;$i 0xbf || ord($s[$i]) [" . u3i($o) . "]\n";
}

// Validate Unicode UTF-8 Version 4
// This function takes as reference the table 3.6 found at http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
// It also flags overlong bytes as error

function is_validUTF8($str)
{
// values of -1 represent disallowed values for the first bytes in current UTF-8
static $trailing_bytes = array (
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
-1,-1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
);

$ups = unpack('C*', $str);
if (!($aCnt = count($ups))) return true; // Empty string *is* valid UTF-8
for ($i = 1; $i 0 && $i 0x9F) return false;
break;
case 0xF0:
if ($cbyte 0x8F) return false;
break;
default:
break;
}
$first = false;
}
$tbytes--;
}
if ($tbytes) return false; // incomplete sequence at EOS
}
return true;
}

/**
* takes a string of unicode entities and converts it to a utf-8 encoded string
* each unicode entity has the form &#nnn(nn); n = {0..9} and can be displayed by utf-8 supporting
* browsers. Ascii will not be modified.
* @param $source string of unicode entities [STRING]
* @return a utf-8 encoded string [STRING]
* @access public
*/
function utf8Encode ($source) {
$utf8Str = '';
$entityArray = explode ("&#", $source);
$size = count ($entityArray);
for ($i = 0; $i = 128 && $unicode = 2048 && $unicode 1)
$nonEntity = substr ($nonEntity, 1); // chop the first char (';')
else
$nonEntity = '';

$utf8Str .= $utf8Substring . $nonEntity;
}
else {
$utf8Str .= $subStr;
}
}
