Metadata-Version: 2.1
Name: charset-normalizer
Version: 2.0.4
Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
Home-page: https://github.com/ousret/charset_normalizer
Author: Ahmed TAHRI @Ousret
Author-email: ahmed.tahri@cloudnursery.dev
License: MIT
Keywords: encoding,i18n,txt,text,charset,charset-detector,normalization,unicode,chardet
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.5.0
Description-Content-Type: text/markdown
Provides-Extra: unicode_backport
Requires-Dist: unicodedata2 ; extra == 'unicode_backport'

<h1 align="center">Charset Detection, for Everyone 👋 <a href="https://twitter.com/intent/tweet?text=The%20Real%20First%20Universal%20Charset%20%26%20Language%20Detector&url=https://www.github.com/Ousret/charset_normalizer&hashtags=python,encoding,chardet,developers"><img src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social"/></a></h1>

<p align="center">
  <sup>The Real First Universal Charset Detector</sup><br>
  <a href="https://travis-ci.org/Ousret/charset_normalizer">
    <img src="https://travis-ci.org/Ousret/charset_normalizer.svg?branch=master"/>
  </a>
  <a href="https://pypi.org/project/charset-normalizer">
    <img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" />
  </a>
  <a href="https://app.codacy.com/project/Ousret/charset_normalizer/dashboard">
    <img alt="Code Quality Badge" src="https://api.codacy.com/project/badge/Grade/a0c85b7f56dd4f628dc022763f82762c"/>
  </a>
  <a href="https://codecov.io/gh/Ousret/charset_normalizer">
    <img src="https://codecov.io/gh/Ousret/charset_normalizer/branch/master/graph/badge.svg" />
  </a>
  <a href='https://charset-normalizer.readthedocs.io/en/latest/?badge=latest'>
    <img src='https://readthedocs.org/projects/charset-normalizer/badge/?version=latest' alt='Documentation Status' />
  </a>
  <a href="https://pepy.tech/project/charset-normalizer/">
    <img alt="Download Count Total" src="https://pepy.tech/badge/charset-normalizer" />
  </a>
</p>

> A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
> I'm trying to resolve the issue by taking a new approach.
> All IANA character set names for which the Python core library provides codecs are supported.

<p align="center">
  >>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
</p>

This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.

| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
| ------------- | :-------------: | :------------------: | :------------------: |
| `Fast` | ❌ | ✅ | ✅ |
| `Universal**` | ❌ | ✅ | ❌ |
| `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ |
| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
| `Free & Open` | ✅ | ✅ | ✅ |
| `License` | LGPL-2.1 | MIT | MPL-1.1 |
| `Native Python` | ✅ | ✅ | ❌ |
| `Detect spoken language` | ❌ | ✅ | N/A |
| `Supported Encoding` | 30 | :tada: [93](https://charset-normalizer.readthedocs.io/en/latest/support.html) | 40 |

<p align="center">
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
</p>

*\*\* : They clearly use dedicated code for specific encodings, even if that covers most of the encodings in common use.*<br>

## ⚡ Performance

This package offers better performance than its counterpart Chardet. Here are some numbers.

| Package | Accuracy | Mean per file (ms) | File per sec (est) |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet](https://github.com/chardet/chardet) | 93.0 % | 150 ms | 7 file/sec |
| charset-normalizer | **95.0 %** | **36 ms** | 28 file/sec |

| Package | 99th percentile | 95th percentile | 50th percentile |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet](https://github.com/chardet/chardet) | 647 ms | 250 ms | 24 ms |
| charset-normalizer | 354 ms | 202 ms | 16 ms |

Chardet's performance on larger files (1 MB+) is very poor. Expect a huge difference on large payloads.

> Stats are generated using 400+ files using default parameters. For more details on the files used, see the GHA workflows.
> And yes, these results might change at any time. The dataset can be updated to include more files.

[cchardet](https://github.com/PyYoshi/cChardet) is a faster, non-native (C++ binding) alternative. If speed is the most important factor,
you should try it.

## Your support

Please ⭐ this repository if this project helped you!

## ✨ Installation

Using pip, from PyPI, for the latest stable release:

```sh
pip install charset-normalizer
```

Or directly from dev-master for the latest preview:

```sh
pip install git+https://github.com/Ousret/charset_normalizer.git
```

If you want a more up-to-date `unicodedata` than the one shipped with your Python setup:

```sh
pip install charset-normalizer[unicode_backport]
```

## 🚀 Basic Usage

### CLI

This package comes with a CLI.

```
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
                  file [file ...]

The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.

positional arguments:
  files                 File(s) to be analysed

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display complementary information about file if any.
                        Stdout will contain logs about the detection process.
  -a, --with-alternative
                        Output complementary possibilities if any. Top-level
                        JSON WILL be a list.
  -n, --normalize       Permit to normalize input file. If not set, program
                        does not write anything.
  -m, --minimal         Only output the charset detected to STDOUT. Disabling
                        JSON output.
  -r, --replace         Replace file when trying to normalize it instead of
                        creating a new one.
  -f, --force           Replace file without asking if you are sure, use this
                        flag with caution.
  -t THRESHOLD, --threshold THRESHOLD
                        Define a custom maximum amount of chaos allowed in
                        decoded content. 0. <= chaos <= 1.
  --version             Show version information and exit.
```

```bash
normalizer ./data/sample.1.fr.srt
```

:tada: Since version 1.4.0 the CLI produces easily usable stdout results in JSON format.

```json
{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}
```
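
If you script around the CLI, that JSON on stdout can be consumed directly. A minimal sketch, assuming `normalizer` is on your PATH and `sample.srt` is a placeholder input file:

```python
import json
import subprocess

# Run the CLI on a hypothetical input file and capture its JSON report.
result = subprocess.run(
    ["normalizer", "sample.srt"],  # "sample.srt" is a placeholder path
    capture_output=True,
    text=True,
    check=True,
)

report = json.loads(result.stdout)
print(report["encoding"], report["language"])
```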

### Python

*Just print out normalized text*

```python
from charset_normalizer import from_path

results = from_path('./my_subtitle.srt')

print(str(results.best()))
```
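
The match returned by `best()` also carries details about the detection, not just the decoded text. A short sketch; note that `best()` can return `None` when nothing fits:

```python
from charset_normalizer import from_path

best_guess = from_path('./my_subtitle.srt').best()

if best_guess is not None:
    # Inspect the winning match rather than only its decoded text.
    print(best_guess.encoding)  # e.g. 'cp1252'
    print(best_guess.language)  # e.g. 'French'
```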

*Normalize any text file*

```python
from charset_normalizer import normalize

try:
    normalize('./my_subtitle.srt')  # should write to disk my_subtitle-***.srt
except IOError as e:
    print('Sadly, we are unable to perform charset normalization.', str(e))
```

*Upgrade your code without effort*

```python
from charset_normalizer import detect
```

The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) BC result possible.
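
As a quick illustration of that drop-in behaviour: like chardet, `detect` takes raw bytes and returns a dictionary. A hedged sketch:

```python
from charset_normalizer import detect

# Works on raw bytes, just like chardet.detect().
raw = 'Bonjour, le café est prêt.'.encode('cp1252')

result = detect(raw)
print(result['encoding'])  # the detected charset; short inputs may be less reliable
```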

See the docs for advanced usage: [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)

## 😇 Why

When I started using Chardet, I noticed that it did not meet my expectations, and I wanted to propose a
reliable alternative using a completely different method. Also, I never back down from a good challenge!

I **don't care** about the **originating charset** encoding, because **two different tables** can
produce **two identical files.**
What I want is to get readable text, the best I can.

In a way, **I'm brute forcing text decoding.** How cool is that? 😎

Don't confuse the **ftfy** package with charset-normalizer or chardet. ftfy's goal is to repair Unicode strings, whereas charset-normalizer converts a raw file in an unknown encoding to Unicode.
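
To make that distinction concrete, here is a small illustrative sketch; ftfy is a separate project, shown only for contrast, and `fix_text` is its usual entry point:

```python
import ftfy  # separate project (pip install ftfy), shown only for contrast
from charset_normalizer import from_bytes

# ftfy repairs a str that was already decoded with the wrong table (mojibake).
print(ftfy.fix_text('schÃ¶n'))  # 'schön'

# charset-normalizer starts one step earlier, from undecoded bytes.
raw = 'schön, die Sonne scheint heute wieder'.encode('cp1252')
print(str(from_bytes(raw).best()))  # decoded back to readable text
```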

## 🍰 How

- Discard all charset encoding tables that could not fit the binary content.
- Measure chaos, or the mess, once opened (by chunks) with a corresponding charset encoding.
- Extract the matches with the lowest mess detected.
- Finally, if there are too many matches left, we measure coherence.

**Wait a minute**, what are chaos/mess and coherence according to **YOU?**

*Chaos:* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
**I established** some ground rules about **what is obvious** when **it seems like** a mess.
I know that my interpretation of what is chaotic is very subjective; feel free to contribute in order to
improve or rewrite it.

*Coherence:* For each language there is on Earth, we have computed ranked letter-appearance frequencies (the best we can). So I thought
that this intel is worth something here. I use those records against decoded text to check if I can detect intelligent design.
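
To make those steps concrete, here is a deliberately simplified toy sketch of the idea; this is NOT the library's actual internals, and the `mess_ratio` below is a stand-in for the real chaos rules:

```python
# A toy sketch of the detection pipeline above, not charset-normalizer's code.

CANDIDATES = ['utf_8', 'cp1252', 'latin_1', 'iso8859_15']

def mess_ratio(text: str) -> float:
    """Toy chaos measure: fraction of unprintable characters."""
    bad = sum(1 for ch in text if not ch.isprintable() and ch not in '\r\n\t')
    return bad / max(len(text), 1)

def detect_toy(payload: bytes) -> str:
    scored = []
    for codec in CANDIDATES:
        try:
            text = payload.decode(codec)  # step 1: discard tables that cannot fit
        except UnicodeDecodeError:
            continue
        scored.append((mess_ratio(text), codec))  # step 2: measure chaos
    # step 3: keep the lowest mess; a real coherence pass would break ties
    return min(scored)[1]

print(detect_toy('Très bien.'.encode('cp1252')))
```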

## ⚡ Known limitations

- Language detection is unreliable when the text contains two or more languages sharing identical letters (e.g. HTML (English tags) + Turkish content (sharing Latin characters)).
- Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very small content.

## 👤 Contributing

Contributions, issues and feature requests are very much welcome.<br />
Feel free to check the [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.

## 📝 License

Copyright © 2019 [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.

Character frequencies used in this project © 2012 [Denny Vrandečić](http://denny.vrandecic.de)