
Talk:UTF-8

This is an old revision of this page, as edited by Un1Gfn (talk | contribs) at 02:58, 5 April 2021 (→‎Microsoft script dead link: new section). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.


Latest comment: 3 years ago by AnonMoos in topic Adoption and non-adoption


Table should not only use color to encode information (but formatting like bold and underline)

As in a previous comment https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_1#Colour_in_example_table? this has been done before, and is *better*, so that everyone can clearly see the different parts of the code. Relying on color alone is not good, due to color vision deficiencies and varying color rendition on devices.

Runes

The Google-developed programming language Go defines a datatype called rune. A rune is "an int32 containing a Unicode character of 1,2,3, or 4 bytes". It is not clear from the documentation whether a rune contains the Unicode character number (code point) or the usual UTF-8 encoding used in Go. Testing reveals that a rune appears to be the Unicode character number.

I found a good reference to confirm this at https://blog.golang.org/strings, so this information should be added prominently to this article and similar articles that are missing it. It can be quite frustrating to read about runes in Go and not have this information. David Spector (talk) 00:42, 4 September 2020 (UTC)
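For what it's worth, the behavior described above is easy to confirm with a few lines of Go; this is a minimal sketch showing that a rune holds the code point, while indexing a string yields the raw UTF-8 bytes:

```go
package main

import "fmt"

func main() {
	// '€' is U+20AC; its UTF-8 encoding is the three bytes E2 82 AC.
	r := '€'
	fmt.Printf("rune value: U+%04X (%d)\n", r, r) // the code point, not bytes

	// len() counts UTF-8 bytes; ranging over a string decodes runes.
	s := "€"
	fmt.Printf("len in bytes: %d\n", len(s)) // 3, the UTF-8 length
	for _, c := range s {
		fmt.Printf("decoded rune: U+%04X\n", c) // U+20AC, the code point
	}
}
```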

It sounds like that belongs in the page for Go, not here. This is about UTF-8, not datatypes specific to one language. Tarl N. (discuss) 01:26, 4 September 2020 (UTC)
Furthermore, this isn't a software reference manual. It shouldn't be added here at all, let alone "prominently", precisely because it is an obscure implementation feature of a relatively new programming language. Chris Cunningham (user:thumperward) (talk) 17:10, 4 September 2020 (UTC)
I believe the Plan9 documentation also called Unicode code points "runes", so it might be relevant here, though it really does not sound very important. Spitzak (talk) 18:12, 4 September 2020 (UTC)
Given the shared heritage of all three systems it's unsurprising that they share idiosyncrasies in nomenclature, but this is probably something more pertinent to the biographies of the creators than to the individual systems. Chris Cunningham (user:thumperward) (talk) 18:32, 4 September 2020 (UTC)

Byte order mark trivia

This article has seen significant work recently to try to elevate the important aspects of the subject and reduce the amount of coverage on trivia. One such change has been reverted on the grounds that "BOM killing usage of UTF-8 is very well documented". Of course the material in question has been unreferenced ever since it was added. I don't dispute that people using e.g. Windows Notepad in the middle of the decade were very annoyed by this, but it truly isn't an important enough aspect of the subject today to warrant its own subheading. All that we need to do is note that historically some software insisted on adding BOMs to UTF-8 files and that this caused interoperability issues, with a good reference. We currently lack the latter entirely, but we should at least restore the reduced version of the content such that we aren't inflating what is basically a historic bug that has no impact on the vast majority of uses of the spec. Chris Cunningham (user:thumperward) (talk) 17:08, 4 September 2020 (UTC)

Probably not very important nowadays, and continued watering-down by people trying to whitewash bad behavior by certain companies is making it unreadable. The problem was not actually programs adding the BOM, it was software that refused to recognize UTF-8 without the BOM, which *forced* software to write it and destroyed the ASCII-compatibility, as well as basically introducing magic bytes to a file format that is intended to be the most generic text with no structure at all, and complicates even the most trivial operations such as concatenating files. I agree that an awful lot of software has been patched to ignore a leading BOM and the only real bad result is the programming time wasted making these modifications. It actually appears that now there is an inverse problem and some Microsoft compilers work better with UTF-8 if the BOM is *missing*; the reason is that they leave the bytes in quoted string constants alone, while if the BOM is there they perform a translation to UTF-16 and back again, which introduces a lot of annoyances such as mangling any invalid byte sequences.
My main concern with the article here though was to move the description of the BOM out of the "description" section, since it is strongly discouraged by the Unicode consortium, and a thing that should not exist has no right to be in the introductory description. It could be reduced a lot further. I also don't think there is much software that will show legacy letters any more. Spitzak (talk) 18:20, 4 September 2020 (UTC)
If you're accusing me of somehow having some pro-Microsoft agenda then I'd encourage you to go and have a walk or pet a dog or something. My only concern here is making the article as accessible as possible, which means minimising the amount of material in it which exists primarily to air editors' grudges against historic implementation bugs.
This material is still unsourced, and warrants a paragraph at best (and no subheader). It should be obvious to any reader that there is no actual need for a marker indicating byte order in a single-byte encoding, and without any (referenced!) context which shows this is a true and notable problem (as opposed to a historic quibble) then the reader is left wondering why the hell such a big deal is being made of it. Chris Cunningham (user:thumperward) (talk) 18:30, 4 September 2020 (UTC)
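Purely to make the interoperability point above concrete for readers: the UTF-8 BOM is the three bytes EF BB BF (U+FEFF), and naive concatenation of two BOM-prefixed files leaves a stray U+FEFF embedded mid-text. A minimal Go sketch; `stripBOM` here is an illustrative helper, not any particular program's API:

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is U+FEFF encoded in UTF-8.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM removes a leading UTF-8 BOM if present -- the kind of patch
// much software had to add to tolerate BOM-prefixed files.
func stripBOM(b []byte) []byte {
	return bytes.TrimPrefix(b, utf8BOM)
}

func main() {
	a := append(append([]byte{}, utf8BOM...), "hello"...)
	b := append(append([]byte{}, utf8BOM...), " world"...)

	// Naive concatenation (what `cat` does) leaves a second BOM
	// buried in the middle of the combined text.
	naive := append(append([]byte{}, a...), b...)
	fmt.Printf("naive concat: % X\n", naive)

	// Stripping the BOM first restores plain, concatenatable text.
	clean := append(stripBOM(a), stripBOM(b)...)
	fmt.Println(string(clean)) // hello world
}
```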

How is utf8mb3 exactly the same as CESU-8?

Spitzak, you've repeatedly asserted in the edit comments that MySQL utf8mb3 and CESU-8 are exactly the same. I believe you, but I can't follow you, because the source materials seem to say otherwise, and the citations seem insufficient.

In Unicode Technical Report #26, CESU-8 is explicitly defined to support supplemental characters: "In CESU-8, supplementary characters are represented as six-byte sequences". Whereas the MySQL 8.0 Reference Manual explicitly states that supplemental characters are not supported: "Supports BMP characters only (no support for supplementary characters)". And the MySQL 3.23, 4.0, 4.1 Reference Manual (when utf8mb3 first appears, as "utf8") says the same: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP."

How do you reconcile these conflicting definitions of CESU-8 and utf8mb3? Is one of them wrong, or do they require further interpretation? If so, is that cited somewhere? I checked the citations, but I'm not seeing how they back up what you're saying -- they only seem to note that utf8mb3 doesn't support supplemental characters. If what you're saying is in fact true, I think further explication is needed beyond saying it is so, because the MySQL docs and UTR#26 seem to suggest that utf8mb3 and CESU-8 are definitionally different, at least when perused by a non-expert like myself trying to learn about the subject.

While I think the introductory paragraph is trying to shed some light, "many programs" is vague and not cited; nor is it cited that MySQL is definitively one of those many programs, nor that MySQL "transforms UCS-2 codes to three bytes or fewer" for utf8mb3. Does it? How do we know?

If what you're trying to say is that when UTF-16 supplemental characters are converted to UTF-8 as though they are UCS-2 (and not UTF-16), the result is what came to be called CESU-8, then I think you also need to say that while utf8mb3 is not intended to support supplemental characters at all, it functionally operates as CESU-8 if they are present. And ideally that should be backed up with a citation, or an example sufficient to demonstrate that this article is not the only place where one will find this assertion.

And, even if you're right that utf8mb3 and CESU-8 (and Oracle UTF8) are technically identical, it's still not correct to say that "MySQL calls [UTF-16 supplemental characters converted to UTF-8 as though they were UCS-2 characters] utf8mb3", because MySQL quite clearly defines utf8mb3 as being BMP-only; so MySQL is not "calling" anything involving supplemental characters utf8mb3.

Having now been trying to understand this for hours, I think this Oracle document explains it pretty well: "The UTF8 character set encodes characters in one, two, or three bytes...If supplementary characters are inserted into a UTF8 database...the supplementary characters are treated as two separate, user-defined characters that occupy 6 bytes." If what you're saying is correct (and I don't know that it is, because I don't have anything authoritative saying so), then it sounds like this could be equally applicable to utf8mb3. The article could make that clear, if properly cited or demonstrated.

TL;DR: It's not accurate to describe utf8mb3 as having any representation of supplemental characters, even if it technically can do so as described by CESU-8, because it is defined otherwise. Further, claiming utf8mb3 is technically identical to CESU-8 warrants citation or demonstration, and the claim would benefit from greater clarity. Ivanxqz (talk) 00:45, 15 September 2020 (UTC)

Both of them translate a UTF-16 supplemental pair into exactly the same 6 bytes, and unpaired surrogate halves into exactly the same 3 bytes; therefore they are identical. Spitzak (talk) 21:20, 15 September 2020 (UTC)
Can you cite this anywhere? No original research, etc. The only source for your information is you. (And you haven't responded to anything that I wrote above, not even the TLDR -- even if technically identical, which you have only asserted and not cited, MySQL does not "call" CESU-8 "utf8mb3" as you state -- utf8mb3 explicitly does not support supplemental characters, and therefore any handling of them in the style of CESU-8 is an accident, not a design.) Ivanxqz (talk) 04:55, 16 September 2020 (UTC)

I decided to rewrite the CESU-8 section for what I think is greater clarity and accuracy. I included that CESU-8 in utf8mb3 is possible (though unsupported), on the basis of Spitzak's claim that it's the case. I noted that it needs a citation. I think it's not actually true, though, on the basis of Bernt's counter-demonstration at Talk:CESU-8#Comments, which I also just verified myself, and also the original references regarding utf8mb3 in the previous version, but I'll leave it for now. (Spitzak? Can you show somewhere why your claim that utf8mb3 can support supplemental characters via CESU-8 is accurate?)

I also gave utf8mb3 its own section again, since it is definitionally not CESU-8, even if technically it's the same thing (which, again, I don't think it is). It's like saying that Mountain Standard Time and Pacific Daylight Time are the same thing; they represent the exact same time of day in California and Arizona in the winter, but they're not the same thing, because they have different definitions. Ivanxqz (talk) 10:53, 16 September 2020 (UTC)
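For readers trying to follow the dispute above, the mechanical difference is easy to demonstrate. UTF-8 encodes a supplementary character directly in four bytes; CESU-8 first converts it to a UTF-16 surrogate pair and then encodes each 16-bit half as if it were a BMP character, giving six bytes. A Go sketch (the `cesu8` helper here is illustrative, not a standard library function):

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encode3 encodes a 16-bit value as a three-byte UTF-8-style sequence.
func encode3(v uint16) []byte {
	return []byte{
		0xE0 | byte(v>>12),
		0x80 | byte(v>>6)&0x3F,
		0x80 | byte(v)&0x3F,
	}
}

// cesu8 encodes a supplementary code point the CESU-8 way: convert it to
// a UTF-16 surrogate pair, then encode each half as if it were an
// ordinary BMP character.
func cesu8(r rune) []byte {
	hi, lo := utf16.EncodeRune(r)
	return append(encode3(uint16(hi)), encode3(uint16(lo))...)
}

func main() {
	r := rune(0x10400) // U+10400 DESERET CAPITAL LETTER LONG I
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	fmt.Printf("UTF-8:  % X (%d bytes)\n", buf[:n], n)             // F0 90 90 80 (4 bytes)
	fmt.Printf("CESU-8: % X (%d bytes)\n", cesu8(r), len(cesu8(r))) // ED A0 81 ED B0 80 (6 bytes)
}
```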

Adoption and non-adoption

https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful

Under "Adoption": "Internally in software usage is even lower, with UCS-2, UTF-16, and UTF-32 in use, particularly in the Windows API, but also by Python". What I don't like about this is that the Windows API only has a Unicode API for one encoding (plus legacy codepages). It used to be UCS-2 (in Windows versions that are now discontinued, I believe they all are), but it's now UTF-16. And it doesn't have direct indexing to Unicode characters, so what follows isn't too helpful (it's outdated, from the UCS-2 era): "This is due to a belief that direct indexing of code points is more important than 8-bit compatibility". I think we should concentrate first on the main alternative to UTF-8 in use, UTF-16, then possibly explain programming languages. Since there are many, and that text misrepresents Python (it also stores Latin-1 internally), maybe just leave it out? Just as text on other encodings such as GB 18030 was moved to another page, possibly we need not mention all UTF-8 alternatives, or what all programming languages do, e.g. Python, as it's not strictly about adoption, rather non-adoption? comp.arch (talk) 12:34, 26 March 2021 (UTC)

In the work I do, the #1 impediment to using UTF-8 is that Qt uses UTF-16. The #2 impediment is that Python does not use UTF-8; in our code it uses UTF-32, though you are correct that they are trying to improve this to a selection among 8-, 16-, and 32-bit storage of the code points based on the highest code point value, and also by caching a UTF-8 version, as they have finally realized the cost of conversion. It is also quite likely the underlying reason Python and Qt don't use UTF-8 is because of the Windows API using UTF-16, so for me that is the #3 reason (though for Windows programmers it probably is #1). In any case Python and Qt are extremely similar in their guilt in preventing adoption of UTF-8 and should and must be mentioned together. Spitzak (talk) 19:06, 26 March 2021 (UTC)
Microsoft first developed its "Multi-Byte Character Set" APIs for Windows NT in the early 1990s, before UTF-8 had achieved much usage, and when Japanese Kanji character sets were more practically important than Unicode. UTF-8, if Microsoft programmers even knew about it at that time, would not have helped them deal with Shift_JIS or whatever... AnonMoos (talk) 22:51, 26 March 2021 (UTC)
Microsoft was pretty far ahead of everybody in figuring out multi-byte character encodings, and thus was in a better position to start using UTF-8. I was working for Sun, and they were way behind and convinced that 16-bit characters were necessary, and they were even incapable of handling 8-bit non-ASCII in any intelligent way, often insisting on converting it to 3 octal digits. Microsoft really blew it when they decided to scrap all that work and use UCS-2. Some of this may have been misguided political correctness; there was certainly sentiment that Americans should not get the "better" 1-byte codes. The end result is that ASCII-only software still exists even today! Spitzak (talk) 23:46, 26 March 2021 (UTC)
Microsoft was part of the initial alliance that launched Unicode, of course, but I find it difficult to imagine how it could have done a bunch of UTF-8 software implementation work in the early 1990s, which was then pulled out and replaced by 16-bit wide character interfaces. UTF-8 apparently didn't even exist until September 1992, at a time when Microsoft's MBCS people had to be focused mainly on making Japanese character sets work on the forthcoming Windows NT operating system (there was certainly more money to be made from that than from Unicode in 1992-1993). UTF-8 wasn't even introduced as a formal proposal until 1993, the same year that the first version of Windows NT was released, so the dates don't really seem to align... AnonMoos (talk) 16:14, 29 March 2021 (UTC)
What I meant is that the multi-byte Japanese encodings you are talking about were much more similar to UTF-8 and should have provided a method of transitioning to it, and Microsoft was doing far more to support these transparently than others. Instead Microsoft abandoned all the progress they had made with multibyte encodings to try to use UCS-2, and we are all paying the price even today. Spitzak (talk) 19:03, 29 March 2021 (UTC)
OK -- I'm skeptical as to whether Shift_JIS could prepare the way for UTF-8 in any practical or concretely-useful way, but now I understand what you're saying... AnonMoos (talk) 13:34, 31 March 2021 (UTC)
   and Microsoft has a script for Windows 10, to enable it by default for its program Microsoft Notepad
   "Script How to set default encoding to UTF-8 for notepad by PowerShell". gallery.technet.microsoft.com. Retrieved 2018-01-30.
   https://gallery.technet.microsoft.com/scriptcenter/How-to-set-default-2d9669ae?ranMID=24542&ranEAID=TnL5HPStwNw&ranSiteID=TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w&tduid=(1f29517b2ebdfe80772bf649d4c144b1)(256380)(2459594)(TnL5HPStwNw-1ayuyj6iLWwQHN_gI6Np_w)()

This link is dead. How to fix it?