feat: add ftchinese parser #264
I've been looking into character encoding. I think cheerio encodes all entities when calling
In order to not encode all entities, the return statement would change from this:
return normalizeSpaces($.html(node));
To this:
return normalizeSpaces($.html(node, { decodeEntities: false }));
In my use of mercury, I sanitize the output from mercury, so this would be fine, but I'm guessing mercury does not want to pass on executable JavaScript? I saw that another project switched to JSDOM to avoid this. I'm not sure if that would be a suitable replacement for mercury. /cc @adampash
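To illustrate the symptom being discussed, here is a minimal sketch of what "encoding all entities" does to Chinese text. It does not use cheerio; `encodeNonAscii` is a hypothetical helper written only for this illustration, and the assumption is that the serializer emits hexadecimal character references for every non-ASCII character.

```javascript
// Hypothetical helper, not part of cheerio or mercury: mimics a serializer
// that escapes every non-ASCII character as a hexadecimal character reference.
function encodeNonAscii(text) {
  return Array.from(text)
    .map(ch => {
      const codePoint = ch.codePointAt(0);
      // ASCII passes through unchanged; everything else becomes &#xHHHH;
      return codePoint > 0x7f
        ? `&#x${codePoint.toString(16).toUpperCase()};`
        : ch;
    })
    .join('');
}

const encoded = encodeNonAscii('<p>中文</p>');
console.log(encoded); // <p>&#x4E2D;&#x6587;</p>
```

A browser would render this markup correctly, but anyone reading the raw string (for example in a JSON preview or a string comparison in a test) sees character references instead of the characters themselves.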
@benubois thanks for the input, this reminds me that I used this method in my other project and it works flawlessly. Unfortunately, in my past experience JSDOM is less performant than cheerio.
Actually, I think this is probably fine. Mercury sanitizes tags and attributes, and it looks like you're right, this could fix a lot of these problems. If someone wants to submit a PR with a test that adds the I should also follow up to say that if you do happen to see any significant security vulnerabilities in doing so, I'm all ears.
@adampash adding https://github.com/postlight/mercury-parser/blob/e033835c7287904371371f922c487e6d0d7d7db8/src/resource/index.js#L63-L79 will skip the decoding of
other than
I think we want to keep that
I did change that, but it has no effect on Chinese characters.
@adampash I've come to believe that Here's my pull request addressing this.
There is another solution for Chinese, see https://github.com/HenryQW/OpenCC.henry.wang/blob/master/route.js#L7-L21 I'm not sure how this would merge into mercury-parser, though; probably as middleware?
🤖 Automated Parsing Preview 🤖
Commit:
Original Article | HTML Fixture | Parsed Content Preview
Parsed JSON
{
"title": "英国认为华为风险可控",
"content": "<div><div id=\"story-body-container\"> <p>英国政府得出结论认为,它能够缓解在5G网络中使用华为(Huawei)设备的风险。这个结论沉重打击了美国说服盟国把华为挡在高速电信系统门外的努力。</p><p>两位知情人士将这一尚未公开的结论告诉英国《金融时报》,称英国国家网络安全中心(National Cyber Security Centre, NCSC)认定,有办法限制在未来5G超高速网络中使用华为设备的风险。</p><p>这一结论出炉之际,美国正加紧努力说服盟国禁止华为参与电信网络建设,理由是这家中国供应商可能帮助中国政府从事间谍活动或网络破坏。</p><p>美国国家安全局(NSA)近来与盟友和合作伙伴分享更多信息以强调相关风险,但是数个欧洲国家(包括英国和德国)并未被说服需要实施禁令。</p><p>一位熟悉相关争论的人士表示,英国的结论对欧洲各国领导人将“有很大分量”,因为英国是五眼联盟(Five Eyes)情报分享网络的成员,可以获得敏感的美国情报。</p><p>“其他国家可以提出这样的论点,即如果英国人有信心缓解国家安全威胁,那么他们也可以向国内公众和美国行政当局保证,只要他们采取了英国人推荐的各种预防措施,他们继续允许国内电信服务提供商使用中国组件就是审慎的,”此人表示。</p><p>美国提出,5G的速度将如此之快——并且有如此多的军事用途——以至于使用任何中国电信设备都带有太高的风险。美国官员还提出,虽然到目前为止可能没有恶意活动的证据,但华为可能会使用恶意软件更新为间谍活动创造条件。</p><p>英国信号情报机构——政府通信总部(GCHQ)前主任罗伯特•汉尼根(Robert Hannigan)最近在英国《金融时报》撰文,称NCSC“从未发现任何证据证明中国政府通过华为进行任何恶意网络活动”,而且“有关在5G网络的任何部分采用任何中国技术都代表着不可接受的风险的断言是无稽之谈”。</p><p>英国的结论与同为五眼成员的澳大利亚和新西兰形成鲜明对比,后两国去年就已禁止本国电信提供商在5G网络中使用华为设备。</p><p>与此同时,唐纳德•特朗普(Donald Trump)正在考虑发布实际上将禁止美国公司使用华为设备的行政命令。熟悉这道命令的一名人士表示,它将以“不对公司和国家点名”的方式写成。</p><p>美国副总统迈克•彭斯(Mike Pence)上周六在慕尼黑安全会议(Munich Security Conference)上发表演讲时指出,由于中国法律要求电信公司与中国政府共享数据,因此华为构成威胁。</p><p>在同一个论坛上,北约(NATO)秘书长延斯•斯托尔滕贝格(Jens Stoltenberg)告诉英国《金融时报》,北约联盟“非常认真地”对待围绕华为的担忧,数个盟国希望拿出协调一致的回应。</p><p>“我们必须看一看我们需要达到的回应协调水平。我们作为一个联盟还没有得出结论,但这表明了需要应对这个问题,”他表示。</p><p>英国秘密情报局(SIS,通称“军情六处”,即MI6)局长亚历克斯•扬格(Alex Younger)上周五表示,英国可能会对华为采取比美国更为温和的态度,称这个问题过于复杂,不宜简单地封杀该公司。他表示,这是“一个比‘进来还是出去’更加复杂的问题”,而且各国拥有“找到所有这些问题的答案的主权权利”。</p><p>对于已经认定使用华为设备的风险可控的说法,NCSC没有表示异议。</p></div></div>",
"author": null,
"date_published": "2018-01-02T07:17:00.000Z",
"lead_image_url": "http://i.ftimg.net/picture/3/000082493_piclink.jpg",
"dek": "英国政府得出结论认为,能够缓解在5G网络中使用华为设备的风险。这个尚未公开的结论沉重打击美国的游说努力。",
"next_page_url": null,
"url": "http://www.ftchinese.com/story/001081496?full=y",
"domain": "www.ftchinese.com",
"word_count": 13,
"direction": "ltr",
"total_pages": 1,
"rendered_pages": 1
}
3 failed tests 😱

WwwFtchineseComExtractor initial test case returns the author
AssertionError [ERR_ASSERTION]: null == '英国《金融时报》 迪米 华盛顿 , 戴维•邦德 慕尼黑报道'
at Object.equal (/home/circleci/project/src/extractors/custom/www.ftchinese.com/index.test.js:48:14)
at tryCatch (/home/circleci/project/node_modules/regenerator-runtime/runtime.js:62:40)
at Generator.invoke [as _invoke] (/home/circleci/project/node_modules/regenerator-runtime/runtime.js:288:22)
at Generator.prototype.(anonymous function) [as next] (/home/circleci/project/node_modules/regenerator-runtime/runtime.js:114:21)
at asyncGeneratorStep (/home/circleci/project/src/extractors/custom/www.ftchinese.com/index.test.js:17:103)
at _next (/home/circleci/project/src/extractors/custom/www.ftchinese.com/index.test.js:19:194)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:188:7)

WwwFtchineseComExtractor initial test case returns the date_published
AssertionError [ERR_ASSERTION]: null == '2018-01-02T07:17:00.000Z'
at Object.equal (/home/circleci/project/src/extractors/custom/www.ftchinese.com/index.test.js:61:14)
at tryCatch (/home/circleci/project/node_modules/regenerator-runtime/runtime.js:62:40)
at Generator.invoke [as _invoke] (/home/circleci/project/node_modules/regenerator-runtime/runtime.js:288:22)
at Generator.prototype.(anonymous function) [as next] (/home/circleci/project/node_modules/regenerator-runtime/runtime.js:114:21)
at asyncGeneratorStep (/home/circleci/project/src/extractors/custom/www.ftchinese.com/index.test.js:17:103)
at _next (/home/circleci/project/src/extractors/custom/www.ftchinese.com/index.test.js:19:194)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:188:7)

WwwFtchineseComExtractor initial test case returns the content
AssertionError [ERR_ASSERTION]: '' == '英国政府得出结论认为,它能够缓解在5G网络中使用华为(Huawei)设备的风险。这个结论沉重打击了美国说服盟国把华为挡在高速电信系统门外的努力。两位知情人士将这一尚未公开的结论告诉英国《金融时报》,称英国国家网络安全中心(National'
at Object.equal (/home/circleci/project/src/extractors/custom/www.ftchinese.com/index.test.js:105:14)
at tryCatch (/home/circleci/project/node_modules/regenerator-runtime/runtime.js:62:40)
at Generator.invoke [as _invoke] (/home/circleci/project/node_modules/regenerator-runtime/runtime.js:288:22)
at Generator.prototype.(anonymous function) [as next] (/home/circleci/project/node_modules/regenerator-runtime/runtime.js:114:21)
at asyncGeneratorStep (/home/circleci/project/src/extractors/custom/www.ftchinese.com/index.test.js:17:103)
at _next (/home/circleci/project/src/extractors/custom/www.ftchinese.com/index.test.js:19:194)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:188:7)
It seems the parser can't handle Chinese characters in content, although title, author and dek work.
The preview shows escaped Unicode sequences rather than the actual Chinese characters.
This can also be reproduced with the qdaily extractor.
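As a stopgap on the consumer side, the escaped output could be post-processed. The sketch below assumes the "unicode" in the preview is numeric character references (such as `&#x4E2D;`); `decodeNumericEntities` is a hypothetical helper, not part of mercury-parser. It only decodes code points above 0x7F, so escaped markup characters such as `&#60;` stay escaped and the HTML structure is not altered.

```javascript
// Hypothetical post-processing helper (not part of mercury-parser): turns
// non-ASCII numeric character references such as &#x4E2D; or &#20013; back
// into characters, leaving ASCII references (e.g. &#60;) untouched.
function decodeNumericEntities(html) {
  return html.replace(/&#([xX][0-9a-fA-F]+|[0-9]+);/g, (match, body) => {
    const isHex = body[0] === 'x' || body[0] === 'X';
    const codePoint = parseInt(isHex ? body.slice(1) : body, isHex ? 16 : 10);
    // Keep ASCII and out-of-range references exactly as they were.
    if (codePoint <= 0x7f || codePoint > 0x10ffff) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericEntities('<p>&#x4E2D;&#25991;</p>')); // <p>中文</p>
```

This treats the symptom rather than the cause; avoiding the encoding in the serializer, as discussed in the conversation, would make it unnecessary.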