wip
overtrue committed Apr 21, 2022
1 parent 4857c86 commit e40e2c7
Showing 24 changed files with 712,935 additions and 326 deletions.
36 changes: 13 additions & 23 deletions README.md
@@ -1,12 +1,12 @@
# Pinyin

[![Build Status](https://travis-ci.org/overtrue/pinyin.svg?branch=master)](https://travis-ci.org/overtrue/pinyin)
[![Latest Stable Version](https://poser.pugx.org/overtrue/pinyin/v/stable.svg)](https://packagist.org/packages/overtrue/pinyin) [![Total Downloads](https://poser.pugx.org/overtrue/pinyin/downloads.svg)](https://packagist.org/packages/overtrue/pinyin) [![Latest Unstable Version](https://poser.pugx.org/overtrue/pinyin/v/unstable.svg)](https://packagist.org/packages/overtrue/pinyin) [![License](https://poser.pugx.org/overtrue/pinyin/license.svg)](https://packagist.org/packages/overtrue/pinyin)

:cn: A Chinese-to-pinyin conversion tool based on the [CC-CEDICT](http://cc-cedict.org/wiki/) dictionary, with more accurate handling of polyphonic characters.

[![Sponsor me](https://github.com/overtrue/overtrue/blob/master/sponsor-me-button-s.svg?raw=true)](https://github.com/sponsors/overtrue)


## Installation

Install with Composer:
@@ -21,31 +21,26 @@ $ composer require "overtrue/pinyin:~4.0"

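- Memory type: for servers with plenty of spare memory. Advantage: fast conversion.
- Small-memory type (default): for environments where memory is tight. Advantage: low memory use. Conversion is slower than with the memory type.
- I/O type: for virtual machines and environments with strict memory limits. Advantage: very small memory footprint. Disadvantage: slower conversion than the memory type. Requires PHP >= 5.5.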
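
The trade-off between the memory-resident and I/O strategies listed above can be sketched as follows. This is only an illustration, not the library's loader code: the tab-separated `word<TAB>pinyin` line format and the function names `readEntries` and `lookup` are assumptions made for the example.

```php
<?php

/**
 * Hypothetical I/O-style reader: yields one dictionary entry at a time,
 * so only a single line is held in memory.
 * Assumed line format: "<word>\t<pinyin>".
 */
function readEntries(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        while (($line = fgets($handle)) !== false) {
            $line = rtrim($line, "\r\n");
            if ($line === '') {
                continue;
            }
            [$word, $pinyin] = explode("\t", $line, 2);
            yield $word => $pinyin;
        }
    } finally {
        fclose($handle);
    }
}

/** Linear scan over the file: slow, but memory use stays constant. */
function lookup(string $path, string $word): ?string
{
    foreach (readEntries($path) as $entry => $pinyin) {
        if ($entry === $word) {
            return $pinyin;
        }
    }

    return null;
}

// Demo with a throwaway dictionary file.
$path = tempnam(sys_get_temp_dir(), 'dict');
file_put_contents($path, "美好\tmei hao\n旅行\tlyu xing\n");

echo lookup($path, '旅行'), "\n"; // lyu xing

// Memory-resident variant: read everything once, then use plain array access.
$all = iterator_to_array(readEntries($path));
echo $all['美好'], "\n"; // mei hao

unlink($path);
```

The generator version pays for its small footprint with a full file scan per lookup, which is the "slower conversion" cost named in the list.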

## Available options

| Option                    | Description                                                  |
| ------------------------- | ------------------------------------------------------------ |
| `PINYIN_TONE`             | Unicode tone marks: `měi hǎo`                                |
| `PINYIN_ASCII_TONE`       | Numeric tones: `mei3 hao3`                                   |
| `PINYIN_NO_TONE`          | No tones: `mei hao`                                          |
| `PINYIN_KEEP_NUMBER`      | Keep digits                                                  |
| `PINYIN_KEEP_ENGLISH`     | Keep English letters                                         |
| `PINYIN_KEEP_PUNCTUATION` | Keep punctuation                                             |
| `PINYIN_UMLAUT_V`         | Use `v` in place of `yu`; for example, 吕 `lyu` becomes `lv` |
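
To make the three tone styles concrete, here is a small sketch that derives the numeric and toneless forms from a tone-marked syllable. The table `TONE_MARKS` and the helpers `toAsciiTone` and `toNoTone` are hypothetical names for this example and are not the library's implementation; the sketch assumes precomposed (NFC) tone-mark characters and leaves `ü` in place rather than applying the `yu`/`v` substitution.

```php
<?php

// Tone-marked vowel => [plain vowel, tone number]. Illustrative table only.
const TONE_MARKS = [
    'ā' => ['a', 1], 'á' => ['a', 2], 'ǎ' => ['a', 3], 'à' => ['a', 4],
    'ē' => ['e', 1], 'é' => ['e', 2], 'ě' => ['e', 3], 'è' => ['e', 4],
    'ī' => ['i', 1], 'í' => ['i', 2], 'ǐ' => ['i', 3], 'ì' => ['i', 4],
    'ō' => ['o', 1], 'ó' => ['o', 2], 'ǒ' => ['o', 3], 'ò' => ['o', 4],
    'ū' => ['u', 1], 'ú' => ['u', 2], 'ǔ' => ['u', 3], 'ù' => ['u', 4],
    'ǖ' => ['ü', 1], 'ǘ' => ['ü', 2], 'ǚ' => ['ü', 3], 'ǜ' => ['ü', 4],
];

/** Convert one syllable from tone marks to a trailing tone digit. */
function toAsciiTone(string $syllable): string
{
    foreach (TONE_MARKS as $marked => [$plain, $tone]) {
        if (strpos($syllable, $marked) !== false) {
            return str_replace($marked, $plain, $syllable) . $tone;
        }
    }

    // Neutral tone: no mark found, so no digit is appended.
    return $syllable;
}

/** Drop the tone entirely. */
function toNoTone(string $syllable): string
{
    return rtrim(toAsciiTone($syllable), '1234');
}

$syllables = ['měi', 'hǎo'];
echo implode(' ', array_map('toAsciiTone', $syllables)), "\n"; // mei3 hao3
echo implode(' ', array_map('toNoTone', $syllables)), "\n";    // mei hao
```

Neutral-tone syllables such as `zhe` pass through unchanged, which matches the numeric-tone output shown in this README.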

### Pinyin array

```php
use Overtrue\Pinyin\Pinyin;

// Small-memory type
$pinyin = new Pinyin(); // default
// Memory type
// $pinyin = new Pinyin('\\Overtrue\\Pinyin\\MemoryFileDictLoader');
// I/O type
// $pinyin = new Pinyin('\\Overtrue\\Pinyin\\GeneratorFileDictLoader');

$pinyin->convert('带着希望去旅行,比到达终点更美好');
// ["dai", "zhe", "xi", "wang", "qu", "lyu", "xing", "bi", "dao", "da", "zhong", "dian", "geng", "mei", "hao"]
@@ -57,11 +52,6 @@ $pinyin->convert('带着希望去旅行,比到达终点更美好', PINYIN_ASCI
//["dai4","zhe","xi1","wang4","qu4","lyu3","xing2","bi3","dao4","da2","zhong1","dian3","geng4","mei3","hao3"]
```

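- Small-memory type: loads the dictionary into memory in chunks
- Memory type: preloads all dictionaries into memory
- I/O type: does not load the dictionary into memory; it opens the dictionary as a file stream, iterates over it line by line, and uses PHP 5.5 generators (`yield`) so that only a single line is held in memory at a time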


### Generating pinyin strings for links

```php
@@ -107,13 +97,14 @@ $pinyin->name('单某某', PINYIN_TONE); // ["shàn","mǒu","mǒu"]
The standalone package is here: [overtrue/laravel-pinyin](https://github.com/overtrue/laravel-pinyin)

## Contribution

Suggestions and additions to the dictionary are welcome: [`overtrue/pinyin-dictionary-maker`](https://github.com/overtrue/pinyin-dictionary-maker/tree/master/patches) :kiss:

## References

- [Detailed reference material](https://github.com/overtrue/pinyin-resources)

## :heart: Sponsor me

[![Sponsor me](https://github.com/overtrue/overtrue/blob/master/sponsor-me.svg?raw=true)](https://github.com/sponsors/overtrue)

@@ -125,7 +116,6 @@ Many thanks to Jetbrains for kindly providing a license for me to work on this a

[![](https://resources.jetbrains.com/storage/products/company/brand/logos/jb_beam.svg)](https://www.jetbrains.com/?from=https://github.com/overtrue)


## PHP package development

> Want to know how to build a PHP package from scratch?
100 changes: 100 additions & 0 deletions bin/build
@@ -0,0 +1,100 @@
#!/usr/bin/env php
<?php

$polyphones = explode(',', file_get_contents(__DIR__ . '/../sources/polyphones.txt'));
$charsSouce = __DIR__ . '/../sources/chars.txt';
$wordsSouce = __DIR__ . '/../sources/words.txt';
$surnamesSource = file(__DIR__.'/../sources/surnames.txt');

// Download the upstream data files if they are not present locally
if (!file_exists($charsSouce)) {
    file_put_contents($charsSouce, file_get_contents('https://raw.githubusercontent.com/mozillazg/pinyin-data/master/pinyin.txt'));
}

if (!file_exists($wordsSouce)) {
    file_put_contents($wordsSouce, file_get_contents('https://raw.githubusercontent.com/mozillazg/phrase-pinyin-data/master/large_pinyin.txt'));
}

// ------------------------------------------------
$surnames = [];
foreach ($surnamesSource as $line) {
    [$surname, $pinyin] = explode(',', trim($line));

    $surnames[trim($surname)] = preg_split('/\s+/', trim($pinyin));
}

// ------------------------------------------------

$chars = [];
foreach (file($charsSouce) as $line) {
    // U+4E2D: zhōng,zhòng # 中
    preg_match('/^U\+(?<code>[0-9A-Z]+):\s+(?<pinyin>\S+)\s+#\s*(?<char>\S+)/', $line, $matched);

    if ($matched && !empty($matched['pinyin'])) {
        $pinyin = explode(',', $matched['pinyin']);
        $chars[$matched['char']] = $pinyin;
    } elseif (!str_starts_with($line, '#')) {
        throw new Exception("Failed to parse line: $line");
    }
}

// ------------------------------------------------
$words = [];
foreach (file($wordsSouce) as $line) {
    // 㞎㞎: bǎ ba # comment
    preg_match('/^(?<word>[^#\s]+):\s+(?<pinyin>[\p{L} ]+)#?/u', $line, $matched);

    if ($matched && !empty($matched['pinyin'])) {
        $pinyin = explode(' ', trim($matched['pinyin']));

        $wordChars = preg_split('//u', $matched['word'], -1, PREG_SPLIT_NO_EMPTY);

        try {
            $pinyinSegments = array_combine($wordChars, $pinyin);
        } catch (Throwable $e) {
            throw new Exception("Failed to parse line: $line");
        }

        // Handle polyphonic characters: select them by value
        // (array_intersect_key would compare the integer indexes instead)
        $polyphoneChars = array_intersect($wordChars, $polyphones);

        foreach ($polyphoneChars as $char) {
            // If any polyphonic character in the word is read differently from its
            // most common reading, add the word to the dictionary; otherwise discard the word
            if (isset($chars[$char]) && $pinyinSegments[$char] != $chars[$char][0]) {
                $words[$matched['word']] = join("\t", ["", ...$pinyin, ""]);
                break;
            }
        }
    }
}

// Clean up
exec('rm -rf ' . __DIR__ . '/../data/*');

// Surnames
file_put_contents(__DIR__ . '/../data/surnames.php', "<?php\nreturn ".var_export($surnames, true).";\n");
echo count($surnames)." surnames saved.\n";

// Single characters: with polyphonic readings
file_put_contents(__DIR__ . '/../data/char-with-polyphones.php', "<?php\nreturn ".var_export($chars, true).";\n");
echo count($chars)." chars with polyphones saved.\n";

// Single characters: without polyphonic readings
$charsNoPolyphones = [];
foreach ($chars as $char => $pinyin) {
    $charsNoPolyphones[$char] = "\t{$pinyin[0]}\t";
}

file_put_contents(__DIR__ . '/../data/chars.php', "<?php\nreturn ".var_export($charsNoPolyphones, true).";\n");
echo count($charsNoPolyphones)." chars saved.\n";

// Words: longest to shortest, plus single characters
$words = array_merge($words, $charsNoPolyphones);
uksort($words, fn ($a, $b) => strlen($b) <=> strlen($a));

foreach (array_chunk($words, 8000, true) as $index => $group) {
    file_put_contents(__DIR__ . "/../data/words-{$index}.php", "<?php\nreturn ".var_export($group, true).";\n");
    echo count($group)." words saved in ".__DIR__ . "/../data/words-{$index}.php \n";
}
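
The word-filtering rule described in the script's comments (a word is stored only when one of its polyphonic characters is read differently from that character's most common reading) can be tried in isolation. This is a minimal sketch on made-up sample data; the helper name `needsWordEntry` is hypothetical, and it compares readings by position instead of building a character-keyed map.

```php
<?php

/**
 * Decide whether a word needs its own dictionary entry.
 *
 * @param string[]                $wordChars  characters of the word, in order
 * @param string[]                $pinyin     one reading per character, same order
 * @param array<string, string[]> $chars      character => readings, most common first
 * @param string[]                $polyphones list of polyphonic characters
 */
function needsWordEntry(array $wordChars, array $pinyin, array $chars, array $polyphones): bool
{
    foreach ($wordChars as $i => $char) {
        // Only polyphonic characters with a known reading list matter.
        if (!in_array($char, $polyphones, true) || !isset($chars[$char])) {
            continue;
        }

        // The reading in this word differs from the default reading.
        if ($pinyin[$i] !== $chars[$char][0]) {
            return true;
        }
    }

    return false;
}

// Sample data for illustration only.
$chars = ['中' => ['zhōng', 'zhòng'], '心' => ['xīn'], '奖' => ['jiǎng']];
$polyphones = ['中'];

var_dump(needsWordEntry(['中', '心'], ['zhōng', 'xīn'], $chars, $polyphones));   // bool(false)
var_dump(needsWordEntry(['中', '奖'], ['zhòng', 'jiǎng'], $chars, $polyphones)); // bool(true)
```

Comparing by position keeps the check correct for words that repeat a character, since a map keyed by character would keep only the last reading of a repeated key.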
128 changes: 64 additions & 64 deletions composer.json
@@ -1,68 +1,68 @@
 {
-    "name": "overtrue/pinyin",
-    "description": "Chinese to pinyin translator.",
-    "keywords": [
-        "chinese",
-        "pinyin",
-        "cn2pinyin"
+    "name": "overtrue/pinyin",
+    "description": "Chinese to pinyin translator.",
+    "keywords": [
+        "chinese",
+        "pinyin",
+        "cn2pinyin"
     ],
+    "homepage": "https://github.com/overtrue/pinyin",
+    "license": "MIT",
+    "authors": [
+        {
+            "name": "overtrue",
+            "homepage": "http://github.com/overtrue",
+            "email": "[email protected]"
+        }
+    ],
+    "autoload": {
+        "psr-4": {
+            "Overtrue\\Pinyin\\": "src/"
+        }
+    },
+    "autoload-dev": {
+        "psr-4": {
+            "Overtrue\\Pinyin\\Tests\\": "tests/"
+        }
+    },
+    "require": {
+        "php": ">=7.1"
+    },
+    "require-dev": {
+        "phpunit/phpunit": "~9.5",
+        "brainmaestro/composer-git-hooks": "^2.7",
+        "friendsofphp/php-cs-fixer": "^3.2"
+    },
+    "extra": {
+        "hooks": {
+            "pre-commit": [
+                "composer test",
+                "composer fix-style"
+            ],
+            "pre-push": [
+                "composer test",
+                "composer check-style"
+            ]
+        }
+    },
+    "scripts": {
+        "post-update-cmd": [
+            "cghooks update"
+        ],
-        "homepage": "https://github.com/overtrue/pinyin",
-        "license": "MIT",
-        "authors": [
-            {
-                "name": "overtrue",
-                "homepage": "http://github.com/overtrue",
-                "email": "[email protected]"
-            }
+        "post-merge": "composer install",
+        "post-install-cmd": [
+            "cghooks add --ignore-lock",
+            "cghooks update"
         ],
-        "autoload": {
-            "psr-4": {
-                "Overtrue\\Pinyin\\": "src/"
-            },
-            "files": ["src/const.php"]
-        },
-        "autoload-dev": {
-            "psr-4": {
-                "Overtrue\\Pinyin\\Test\\": "tests/"
-            }
-        },
-        "require": {
-            "php":">=7.1"
-        },
-        "require-dev": {
-            "phpunit/phpunit": "~9.5",
-            "brainmaestro/composer-git-hooks": "^2.7",
-            "friendsofphp/php-cs-fixer": "^3.2"
-        },
-        "extra": {
-            "hooks": {
-                "pre-commit": [
-                    "composer test",
-                    "composer fix-style"
-                ],
-                "pre-push": [
-                    "composer test",
-                    "composer check-style"
-                ]
-            }
-        },
-        "scripts": {
-            "post-update-cmd": [
-                "cghooks update"
-            ],
-            "post-merge": "composer install",
-            "post-install-cmd": [
-                "cghooks add --ignore-lock",
-                "cghooks update"
-            ],
-            "cghooks": "vendor/bin/cghooks",
-            "check-style": "php-cs-fixer fix --using-cache=no --diff --config=.php_cs --dry-run --ansi",
-            "fix-style": "php-cs-fixer fix --using-cache=no --config=.php_cs --ansi",
-            "test": "vendor/bin/phpunit --colors=always"
-        },
-        "scripts-descriptions": {
-            "test": "Run all tests.",
-            "check-style": "Run style checks (only dry run - no fixing!).",
-            "fix-style": "Run style checks and fix violations."
-        }
+        "cghooks": "vendor/bin/cghooks",
+        "check-style": "php-cs-fixer fix --using-cache=no --diff --dry-run --ansi",
+        "fix-style": "php-cs-fixer fix --using-cache=no --ansi",
+        "test": "vendor/bin/phpunit --colors=always",
+        "build": "php ./bin/build"
+    },
+    "scripts-descriptions": {
+        "test": "Run all tests.",
+        "check-style": "Run style checks (only dry run - no fixing!).",
+        "fix-style": "Run style checks and fix violations."
+    }
 }