Today we will talk about where krakozyabrs (unreadable character salad) come from on websites and in programs, what text encodings exist, and which ones you should use. We will take a closer look at the history of their development, starting with basic ASCII and its extended versions CP866, KOI8-R and Windows 1251, and ending with the modern encodings of the Unicode consortium, UTF-16 and UTF-8. To some, this information may seem unnecessary, but if you only knew how many questions I receive specifically about those creeping krakozyabrs. Now I will be able to refer everyone to the text of this article. Well, get ready to absorb the information, and try to follow the flow of the story.

ASCII - basic text encoding for the Latin alphabet

The development of text encodings went hand in hand with the formation of the IT industry, and over that time they managed to undergo quite a few changes. Historically, it all started with EBCDIC, which made it possible to encode the letters of the Latin alphabet, Arabic numerals, punctuation marks and control characters. But the real starting point for the development of modern text encodings is the famous ASCII (American Standard Code for Information Interchange). It describes the first 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks. These 128 characters also include some service characters like brackets, hash marks, asterisks, etc. In fact, you can see them for yourself:
It is these 128 characters from the original version of ASCII that became the standard; in any other encoding you will find them, and they will sit in exactly this order. But one byte of information can encode not 128 but as many as 256 different values (two to the eighth power equals 256), so after the basic version of ASCII, a whole series of extended ASCII encodings appeared, in which, alongside the 128 basic characters, it was also possible to encode the characters of a national alphabet (for example, Russian). Here it is probably worth saying a little more about the number systems used in these descriptions. First, as you all know, a computer works only with numbers in the binary system, that is, with zeros and ones ("Boolean algebra", if anyone took it at a university or school). One byte consists of eight bits, each of which represents a power of two, starting from two to the zeroth and ending with two to the seventh:
It is not difficult to see that there can be only 256 possible combinations of zeros and ones in such a design. Converting a number from binary to decimal is quite simple: you just add up all the powers of two that have ones above them. In our example, this comes to 1 (two to the power of zero) plus 8 (two to the power of three), plus 32 (two to the fifth power), plus 64 (two to the sixth), plus 128 (two to the seventh), for a total of 233 in decimal notation. As you can see, everything is very simple. But if you look closely at the table of ASCII characters, you will see that they are given in hexadecimal. For example, the asterisk corresponds to the hexadecimal number 2A in ASCII. You probably know that the hexadecimal number system uses, in addition to the Arabic numerals, the Latin letters from A (meaning ten) to F (meaning fifteen). Converting a binary number to hexadecimal is done by the following simple and obvious method: each byte of information is split into two halves of four bits, as shown in the screenshot above. Each half-byte can encode only sixteen values (two to the fourth power), so it can easily be written as a single hexadecimal digit. Note that in the left half of the byte the powers are counted again starting from zero, not as shown in the screenshot. As a result, through simple calculations, we get that the screenshot encodes the number E9. I hope the course of my reasoning and the solution to this little puzzle were clear to you. Well, now let's continue talking about text encodings.
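The arithmetic above can be re-done in a couple of lines of Python; this sketch simply repeats the worked example from the text, nothing in it depends on the screenshots:

```python
# The byte from the worked example: 1110 1001
b = 0b11101001

# Decimal value = sum of the powers of two under the 1 bits
assert b == 1 + 8 + 32 + 64 + 128 == 233

# Hexadecimal form: split the byte into two half-bytes (nibbles) of four bits
high, low = b >> 4, b & 0x0F          # 14 (hex digit E) and 9
print(f"{b:08b} = {b} = {b:02X}")     # 11101001 = 233 = E9
```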

Extended versions of ASCII - CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, in a sense, the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF-8). Initially, it contained only 128 characters: the Latin alphabet, Arabic numerals and a few others, but extended versions make it possible to use all 256 values that fit in one byte of information. In other words, it became possible to add the letters of your own language to ASCII. Here we need to digress once more to explain why we need text encodings at all and why this is so important. The characters on your computer screen are formed on the basis of two things: sets of vector shapes (glyphs) of all kinds of characters, which live in the font files installed on your computer, and a code that allows you to pull out of that set exactly the glyph that needs to be inserted in the right place. It is clear that the fonts are responsible for the vector shapes, while the operating system and the programs running in it are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes a single character of that text. The program that displays this text on the screen (a text editor, a browser, etc.), while parsing the code, reads the encoding of the next character and looks up the corresponding vector shape in the font file connected to display this document. Everything is simple and banal. This means that in order to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector shape of that character must exist in the font used, and the character must be encodable in one byte in an extended ASCII encoding. That is why a whole bunch of such variants exist; for encoding Russian characters alone, there are several varieties of extended ASCII.
For example, CP866 appeared first: an extended version of ASCII with support for the characters of the Russian alphabet. Its first half coincided completely with basic ASCII (128 Latin characters, numerals and so on), shown in the screenshot just above, but the second half of the CP866 table had the form shown in the screenshot just below and allowed encoding another 128 characters (Russian letters and all sorts of pseudographics):
Notice that in the right column the numbers start with 8, because the numbers from 0 to 7 belong to the basic part of ASCII (see the first screenshot). Thus, the Russian letter "М" in CP866 has the code 8C (it sits at the intersection of row 8 and column C in the hexadecimal number system), which can be written in one byte of information; given a suitable font with Russian glyphs, this letter will appear in the text without problems. Where did such a quantity of pseudographics in CP866 come from? The point is that this encoding for Russian text was developed back in those shaggy years when graphical operating systems were nowhere near as widespread as they are now. In DOS and similar text-mode operating systems, pseudographics made it possible to at least somehow diversify the look of texts, so CP866 and all its peers from the extended-ASCII category abound in it. CP866 was distributed by IBM, but besides it, a number of other encodings were developed for Russian characters; for example, KOI8-R belongs to the same type (extended ASCII):
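Assuming Python's bundled cp866 codec matches the table in the screenshots, both claims (the shared ASCII half and the one-byte Cyrillic codes) can be checked directly:

```python
# The first 128 codes of CP866 coincide with basic ASCII
assert 'A*7'.encode('cp866') == 'A*7'.encode('ascii')

# The Cyrillic letter М fits in a single byte of the extended half
assert 'М'.encode('cp866') == b'\x8c'

# A pseudographic character from the lower part of the table
assert '█'.encode('cp866') == b'\xdb'

print('all CP866 checks passed')
```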
The principle of its operation remains the same as that of CP866, described a little earlier: each character of text is encoded by a single byte. The screenshot shows the second half of the KOI8-R table, because the first half fully matches basic ASCII, shown in the first screenshot in this article. Among the features of the KOI8-R encoding, one can note that the Russian letters in its table do not go in alphabetical order, as they did in CP866. If you look at the very first screenshot (of the basic part, which is included in all extended encodings), you will notice that in KOI8-R the Russian letters sit in the same table cells as the phonetically corresponding Latin letters from the first part of the table. This was done for the convenience of switching between Russian and Latin characters by dropping just one bit (two to the seventh power, i.e. 128).
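This bit trick is easy to demonstrate with Python's built-in koi8_r codec (a sketch; the function name is mine):

```python
def koi8r_translit(text: str) -> str:
    """Drop the eighth bit of every KOI8-R byte: each Cyrillic letter
    turns into its rough Latin phonetic counterpart from the lower
    half of ASCII (with case flipped, a quirk of the layout)."""
    return bytes(b & 0x7F for b in text.encode('koi8_r')).decode('ascii')

print(koi8r_translit('привет'))  # -> PRIWET
```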

Windows 1251 - the modern version of extended ASCII, and where the krakozyabrs come from

The further development of text encodings was driven by the growing popularity of graphical operating systems, in which the need for pseudographics gradually disappeared. As a result, a whole group of encodings arose that were, in essence, still extended versions of ASCII (one character of text is encoded with one byte of information) but without the pseudographic symbols. They belonged to the so-called ANSI encodings, developed by the American National Standards Institute. In common parlance, the name Cyrillic was also used for the version with Russian language support. An example of this is Windows 1251. It differed favorably from the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the missing symbols of Russian typography (except for the accent mark), as well as characters used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):
Because of this abundance of Russian-language encodings, font and software manufacturers constantly had headaches, and we, dear readers, often got those same notorious krakozyabrs whenever the version used in a text got confused. They appeared very often when sending and receiving messages via e-mail, which led to the creation of very complex conversion tables that, in fact, could not solve the problem at all; to avoid the notorious gibberish, users often simply corresponded in transliterated Latin, sidestepping Russian encodings like CP866, KOI8-R or Windows 1251 altogether. In fact, the krakozyabrs appearing instead of Russian text were the result of reading the text with the wrong encoding, one that did not match the encoding in which the message had originally been written. Say, if you try to display characters encoded in CP866 using the Windows 1251 code table, you will get exactly this gibberish (a meaningless set of characters), completely replacing the text of the message.
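The CP866-versus-Windows-1251 mix-up described above is reproducible in a few lines (purely illustrative; the sample string is mine):

```python
original = 'Привет'

# Text written in CP866 but read with the Windows 1251 code table
garbled = original.encode('cp866').decode('cp1251')
print(garbled)  # krakozyabrs instead of the message

# The damage is reversible as long as no bytes were lost along the way:
restored = garbled.encode('cp1251').decode('cp866')
assert restored == original
```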
A similar situation very often arises when creating and setting up websites, forums or blogs, when text with Russian characters is mistakenly saved in an encoding different from the one used on the site by default, or in the wrong text editor, which adds garbage invisible to the naked eye to the code. In the end, many people got tired of this situation with its multitude of encodings and ever-creeping krakozyabrs, and the prerequisites appeared for creating a new universal variation that would replace all the existing ones and finally solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.

Unicode - the universal encodings UTF-8, UTF-16 and UTF-32

The thousands of characters of the Southeast Asian language group could not possibly be described in the single byte of information allocated for encoding characters in extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created, with the collaboration of many IT industry leaders (those who produce software, those who build hardware, those who create fonts), who were interested in the emergence of a universal text encoding. The first variation released under the auspices of the Unicode Consortium was UTF-32. The number in the encoding's name is the number of bits used to encode one character: 32 bits equal 4 bytes of information for one single character in the new universal UTF encoding. As a result, the same file encoded in extended ASCII and in UTF-32 will, in the latter case, be four times larger. That is bad, but in exchange we gained the ability to encode two to the thirty-second power distinct values (billions of characters, covering any really necessary value with a colossal margin). But many countries with European-group languages did not need such a huge number of characters at all; yet with UTF-32 they received, for no good reason, a fourfold increase in the weight of text documents, and consequently in the volume of Internet traffic and stored data. This is a lot, and no one could afford such waste. As a result of further Unicode development, UTF-16 appeared, which turned out so successful that it was adopted as the base space for all the characters we use. It uses two bytes to encode one character. Let's see how this thing looks. In the Windows operating system, you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map".
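The fourfold blow-up is trivial to verify (a sketch; any ASCII-only string works):

```python
s = 'hello'

# One byte per character in ASCII / extended ASCII...
assert len(s.encode('ascii')) == 5

# ...versus four bytes per character in UTF-32
# (the -le suffix picks a byte order explicitly, so no BOM is added)
assert len(s.encode('utf-32-le')) == 20

print('UTF-32 is 4x the size of ASCII for this text')
```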
A table will open with the vector shapes of all the fonts installed on your system. If you select the Unicode character set in the "Advanced options", you will be able to see, for each font separately, the entire range of characters included in it. By the way, by clicking on any of them, you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits:
How many characters can be encoded in UTF-16 using 16 bits? 65,536 (two to the power of sixteen), and it is this number that was adopted as the base space of Unicode. In addition, there are ways to encode characters beyond it (using surrogate pairs), which extends the total space to a little over a million code points. But even this successful version of Unicode did not bring much satisfaction to those who wrote, say, programs only in English, because after the transition from extended ASCII to UTF-16, the weight of documents doubled (one byte per character in ASCII versus two bytes for the same character in UTF-16). It was precisely to satisfy everyone that the Unicode consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in the name, it really is variable-length: each character of text can be encoded as a sequence of one to four bytes (the original design allowed up to six, but the standard was later limited to four). All Latin characters are encoded in one byte, just like in good old ASCII. What is noteworthy is that in the case of pure Latin text, even programs that do not understand Unicode will still read what is encoded in UTF-8; the basic part of ASCII simply carried over into this creation of the Unicode consortium. Cyrillic characters in UTF-8 are encoded in two bytes, and Georgian ones, for example, in three. The Unicode Consortium, having created UTF-16 and UTF-8, solved the main problem: fonts now have a single code space, and their manufacturers can only fill it with vector shapes of characters according to their strengths and capabilities. In the "Character Map" above you can see that different fonts support different numbers of characters.
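The byte counts from this paragraph are easy to confirm in Python (the sample characters are my choice):

```python
samples = {'A': 1,   # Latin: one byte, exactly as in ASCII
           'ё': 2,   # Cyrillic: two bytes
           'ა': 3}   # Georgian: three bytes
for ch, n in samples.items():
    assert len(ch.encode('utf-8')) == n

# The same characters each take a fixed two bytes in UTF-16
assert all(len(ch.encode('utf-16-le')) == 2 for ch in samples)

print('UTF-8 byte lengths match the article')
```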
Some Unicode-rich fonts can be quite heavy. But now fonts differ not in having been created for different encodings, but in how completely the manufacturer has filled the single code space with vector shapes.

Krakozyabrs instead of Russian letters - how to fix it

Now let's see how krakozyabrs appear instead of text or, in other words, how the correct encoding for Russian text is chosen. Actually, it is set in the program in which you create or edit the text, or code that uses text fragments. For editing and creating text files, I personally use the, in my opinion, very good Html and PHP editor Notepad++. It can highlight the syntax of hundreds of other programming and markup languages, and can also be extended with plugins. In the top menu of Notepad++ there is an "Encodings" item, where you can convert an existing variant to the one used by default on your site:
In the case of a site on Joomla 1.5 and higher, as well as a blog on WordPress, you should choose UTF-8 without BOM to avoid the appearance of krakozyabrs. What is this BOM prefix? When the UTF-16 encoding was being developed, it was decided, for some reason, to allow writing a character's code both in direct byte order (for example, 0A15) and in reverse (150A). And so that programs would understand in which order to read the codes, the BOM (Byte Order Mark or, in other words, signature) was invented: a couple of extra bytes added to the very beginning of a document (in UTF-8, the corresponding signature takes three bytes). In the UTF-8 encoding, no BOM is needed, so adding the signature (those notorious extra bytes at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we must always select the option without BOM (without signature). This way you protect yourself in advance from creeping krakozyabrs. What is noteworthy is that some programs in Windows cannot do this (cannot save text in UTF-8 without a BOM), for example, the same notorious Windows Notepad. It saves the document in UTF-8, but still adds the signature to the beginning of it, and these bytes are always the same. On servers, this little thing can cause a problem: krakozyabrs will come out. Therefore, under no circumstances use regular Windows Notepad to edit documents on your site if you don't want krakozyabrs to appear. I consider the best and simplest option to be the already mentioned Notepad++ editor, which has practically no disadvantages and consists only of advantages. In Notepad++, when selecting an encoding, you will also have the option of converting text to UCS-2, which is very close in nature to the Unicode standard.
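The exact bytes of these signatures can be inspected via Python's standard codecs module (a quick check, not tied to any particular editor):

```python
import codecs

# UTF-16 needs a byte-order mark: two bytes, FF FE or FE FF
assert codecs.BOM_UTF16_LE == b'\xff\xfe'
assert codecs.BOM_UTF16_BE == b'\xfe\xff'

# The optional UTF-8 "signature" is three bytes long
assert codecs.BOM_UTF8 == b'\xef\xbb\xbf'

# 'utf-8-sig' mimics Windows Notepad (BOM added); plain 'utf-8' adds none
assert 'hi'.encode('utf-8-sig') == b'\xef\xbb\xbfhi'
assert 'hi'.encode('utf-8') == b'hi'

print('BOM sizes: UTF-16 -> 2 bytes, UTF-8 signature -> 3 bytes')
```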
Also in Notepad++ you can encode text in ANSI, which, in relation to the Russian language, means the Windows 1251 already described just above. Where does this information come from? It is recorded in the registry of your Windows operating system: which encoding to choose in the case of ANSI, and which in the case of OEM (for the Russian language it will be CP866). If you set a different default language on your computer, these encodings will be replaced with similar ones from the ANSI or OEM category for that language. After you save a document in Notepad++ in the encoding you need, or open a document from the site for editing, you can see its name in the lower right corner of the editor. To avoid krakozyabrs, in addition to the actions described above, it is also useful to write information about the encoding into the header of the source code of every page of the site, so that there is no confusion on the server or the local host. In general, all hypertext markup languages except Html use a special xml declaration, which specifies the text encoding:

<?xml version="1.0" encoding="windows-1251"?>

Before parsing the code, the browser then knows which version is being used and exactly how it needs to interpret the character codes of that language. But what is noteworthy is that if you save the document in the default Unicode, this xml declaration can be omitted (the encoding will be assumed to be UTF-8 if there is no BOM, or UTF-16 if there is one). In the case of an HTML document, a Meta element is used to indicate the encoding, written between the opening and closing Head tags:

<head>
...
<meta charset="utf-8">
...
</head>

This notation is quite different from the one adopted in the Html 4.01 standard, but it fully complies with the new Html 5 standard that is gradually being introduced, and it will be understood correctly by any browser currently in use.
In theory, the Meta element indicating the encoding of an HTML document should be placed as high as possible in the document header, so that by the time the first character outside basic ASCII (those are always read correctly, in any variation) is encountered in the text, the browser already has the information on how to interpret the codes of the following characters.


How does KOI8-R work?

KOI8-R is an eight-bit code page designed for encoding the letters of the Cyrillic alphabet. The developers placed the characters of the Russian alphabet in such a way that the positions of the Cyrillic characters corresponded to their phonetic counterparts in the English alphabet in the lower half of the table. So if, in a text written in this encoding, the eighth bit of each character is removed, the result is a text similar to a transliteration in Latin letters.

This information interchange code was used in the seventies on computers of the ES EVM series, and from the mid-eighties it began to be used in the first Russified versions of the UNIX operating system.

The encoding consisted of assigning each character a unique code, from 00000000 to 11111111. Thus, a person distinguished characters by their outline, and a computer by their code.

Is this encoding currently used?

No. It was relevant for old eight-bit computers; now Unicode in various formats is mainly used.

KOI-8 became the first Russian standardized encoding on the Internet.

The IETF has approved several RFCs on KOI-8 encoding options:

  • RFC 1489 - KOI8-R (letters of the Russian alphabet);
  • RFC 2319 - KOI8-U (letters of the Ukrainian alphabet);
  • RFC 1345 - ISO-IR-111 (with an error in the definition of the main range).

In the tables below, the numbers below the letters indicate the hexadecimal Unicode code for the letter.

Encoding KOI8-R (Russian)

     .0   .1   .2   .3   .4   .5   .6   .7   .8   .9   .A   .B   .C   .D   .E   .F

8.    ─    │    ┌    ┐    └    ┘    ├    ┤    ┬    ┴    ┼    ▀    ▄    █    ▌    ▐
   2500 2502 250C 2510 2514 2518 251C 2524 252C 2534 253C 2580 2584 2588 258C 2590

9.    ░    ▒    ▓    ⌠    ■    ∙    √    ≈    ≤    ≥ NBSP    ⌡    °    ²    ·    ÷
   2591 2592 2593 2320 25A0 2219 221A 2248 2264 2265   A0 2321   B0   B2   B7   F7

A.    ═    ║    ╒    ё    ╓    ╔    ╕    ╖    ╗    ╘    ╙    ╚    ╛    ╜    ╝    ╞
   2550 2551 2552  451 2553 2554 2555 2556 2557 2558 2559 255A 255B 255C 255D 255E

B.    ╟    ╠    ╡    Ё    ╢    ╣    ╤    ╥    ╦    ╧    ╨    ╩    ╪    ╫    ╬    ©
   255F 2560 2561  401 2562 2563 2564 2565 2566 2567 2568 2569 256A 256B 256C   A9

C.    ю    а    б    ц    д    е    ф    г    х    и    й    к    л    м    н    о
    44E  430  431  446  434  435  444  433  445  438  439  43A  43B  43C  43D  43E

D.    п    я    р    с    т    у    ж    в    ь    ы    з    ш    э    щ    ч    ъ
    43F  44F  440  441  442  443  436  432  44C  44B  437  448  44D  449  447  44A

E.    Ю    А    Б    Ц    Д    Е    Ф    Г    Х    И    Й    К    Л    М    Н    О
    42E  410  411  426  414  415  424  413  425  418  419  41A  41B  41C  41D  41E

F.    П    Я    Р    С    Т    У    Ж    В    Ь    Ы    З    Ш    Э    Щ    Ч    Ъ
    41F  42F  420  421  422  423  416  412  42C  42B  417  428  42D  429  427  42A

Other options

Only the table rows that differ from KOI8-R are shown, since everything else matches.

Encoding KOI8-U (Russian-Ukrainian)

     .0   .1   .2   .3   .4   .5   .6   .7   .8   .9   .A   .B   .C   .D   .E   .F

A.    ═    ║    ╒    ё    є    ╔    і    ї    ╗    ╘    ╙    ╚    ╛    ґ    ╝    ╞
   2550 2551 2552  451  454 2554  456  457 2557 2558 2559 255A 255B  491 255D 255E

B.    ╟    ╠    ╡    Ё    Є    ╣    І    Ї    ╦    ╧    ╨    ╩    ╪    Ґ    ╬    ©
   255F 2560 2561  401  404 2563  406  407 2566 2567 2568 2569 256A  490 256C   A9

Encoding KOI8-RU (Russian-Belarusian-Ukrainian)

     .0   .1   .2   .3   .4   .5   .6   .7   .8   .9   .A   .B   .C   .D   .E   .F

A.    ═    ║    ╒    ё    є    ╔    і    ї    ╗    ╘    ╙    ╚    ╛    ґ    ў    ╞
   2550 2551 2552  451  454 2554  456  457 2557 2558 2559 255A 255B  491  45E 255E

B.    ╟    ╠    ╡    Ё    Є    ╣    І    Ї    ╦    ╧    ╨    ╩    ╪    Ґ    Ў    ©
   255F 2560 2561  401  404 2563  406  407 2566 2567 2568 2569 256A  490  40E   A9

Encoding KOI8-C (Central Asia)

     .0   .1   .2   .3   .4   .5   .6   .7   .8   .9   .A   .B   .C   .D   .E   .F

8.    ғ    җ    қ    ҝ    ң    ү    ұ    ҳ    ҷ    ҹ    һ    ▀    ә    ӣ    ө    ӯ
    493  497  49B  49D  4A3  4AF  4B1  4B3  4B7  4B9  4BB 2580  4D9  4E3  4E9  4EF

9.    Ғ    Җ    Қ    Ҝ    Ң    Ү    Ұ    Ҳ    Ҷ    Ҹ    Һ    ⌡    Ә    Ӣ    Ө    Ӯ
    492  496  49A  49C  4A2  4AE  4B0  4B2  4B6  4B8  4BA 2321  4D8  4E2  4E8  4EE

A. NBSP    ђ    ѓ    ё    є    ѕ    і    ї    ј    љ    њ    ћ    ќ    ґ    ў    џ
     A0  452  453  451  454  455  456  457  458  459  45A  45B  45C  491  45E  45F

B.    №    Ђ    Ѓ    Ё    Є    Ѕ    І    Ї    Ј    Љ    Њ    Ћ    Ќ    Ґ    Ў    Џ
   2116  402  403  401  404  405  406  407  408  409  40A  40B  40C  490  40E  40F
A.
A0
ђ
452
ѓ
453
e
451
є
454
ѕ
455
і
456
ї
457
ј
458
љ
459
њ
45A
ћ
45B
ќ
45C
ґ
491
ў
45E
џ
45F
B.
2116
Ђ
402
Ѓ
403
Yo
401
Є
404
Ѕ
405
І
406
Ї
407
Ј
408
Љ
409
Њ
40A
Ћ
40B
Ќ
40C
Ґ
490
Ў
40E
Џ
40F

Encoding KOI8-T (Tajik)

     .0   .1   .2   .3   .4   .5   .6   .7   .8   .9   .A   .B   .C   .D   .E   .F

8.    қ    ғ    ‚    Ғ    „    …    †    ‡    -    ‰    ҳ    ‹    Ҳ    ҷ    Ҷ    -
    49B  493 201A  492 201E 2026 2020 2021    - 2030  4B3 2039  4B2  4B7  4B6    -

9.    Қ    ‘    ’    “    ”    •    –    —    -    ™    -    ›    -    -    -    -
    49A 2018 2019 201C 201D 2022 2013 2014    - 2122    - 203A    -    -    -    -

A.    -    ӯ    Ӯ    ё    ¤    ӣ    ¦    §    -    -    -    «    ¬  SHY    ®    -
      -  4EF  4EE  451   A4  4E3   A6   A7    -    -    -   AB   AC   AD   AE    -

B.    °    ±    ²    Ё    -    Ӣ    ¶    ·    -    №    -    »    -    -    -    ©
     B0   B1   B2  401    -  4E2   B6   B7    - 2116    -   BB    -    -    -   A9

(A hyphen marks an unassigned code position; SHY is the soft hyphen.)

Encoding KOI8-O, KOI8-S (Slavic, old spelling)

     .0   .1   .2   .3   .4   .5   .6   .7   .8   .9   .A   .B   .C   .D   .E   .F

8.    Ђ    Ѓ    ¸    ѓ    „    …    †    §    €    ¨    Љ    ‹    Њ    Ќ    Ћ    Џ
   0402 0403 00B8 0453 201E 2026 2020 00A7 20AC 00A8 0409 2039 040A 040C 040B 040F

9.    ђ    ‘    ’    “    ”    •    –    —    £    ·    љ    ›    њ    ќ    ћ    џ
   0452 2018 2019 201C 201D 2022 2013 2014 00A3 00B7 0459 203A 045A 045C 045B 045F

A. NBSP    ѵ    ѣ    ё    є    ѕ    і    ї    ј    ®    ™    «    ѳ    ґ    ў    ´
   00A0 0475 0463 0451 0454 0455 0456 0457 0458 00AE 2122 00AB 0473 0491 045E 00B4

B.    °    Ѵ    Ѣ    Ё    Є    Ѕ    І    Ї    Ј    №    ¢    »    Ѳ    Ґ    Ў    ©
   00B0 0474 0462 0401 0404 0405 0406 0407 0408 2116 00A2 00BB 0472 0490 040E 00A9

Encoding ISO-IR-111, KOI8-E

     .0   .1   .2   .3   .4   .5   .6   .7   .8   .9   .A   .B   .C   .D   .E   .F

A. NBSP    ђ    ѓ    ё    є    ѕ    і    ї    ј    љ    њ    ћ    ќ  SHY    ў    џ
   00A0 0452 0453 0451 0454 0455 0456 0457 0458 0459 045A 045B 045C 00AD 045E 045F

B.    №    Ђ    Ѓ    Ё    Є    Ѕ    І    Ї    Ј    Љ    Њ    Ћ    Ќ    ¤    Ў    Џ
   2116 0402 0403 0401 0404 0405 0406 0407 0408 0409 040A 040B 040C 00A4 040E 040F

Encoding KOI8-Unified, KOI8-F

The KOI8-Unified (KOI8-F) encoding was proposed by Fingertip Software.

     .0   .1   .2   .3   .4   .5   .6   .7   .8   .9   .A   .B   .C   .D   .E   .F

8.    ─    │    ┌    ┐    └    ┘    ├    ┤    ┬    ┴    ┼    ▀    ▄    █    ▌    ▐
   2500 2502 250C 2510 2514 2518 251C 2524 252C 2534 253C 2580 2584 2588 258C 2590

9.    ░    ‘    ’    “    ”    •    –    —    ©    ™ NBSP    »    ®    «    ·    ¤
   2591 2018 2019 201C 201D 2022 2013 2014 00A9 2122 00A0 00BB 00AE 00AB 00B7 00A4

A. NBSP    ђ    ѓ    ё    є    ѕ    і    ї    ј    љ    њ    ћ    ќ    ґ    ў    џ
   00A0 0452 0453 0451 0454 0455 0456 0457 0458 0459 045A 045B 045C 0491 045E 045F

B.    №    Ђ    Ѓ    Ё    Є    Ѕ    І    Ї    Ј    Љ    Њ    Ћ    Ќ    Ґ    Ў    Џ
   2116 0402 0403 0401 0404 0405 0406 0407 0408 0409 040A 040B 040C 0490 040E 040F

Non-Cyrillic variants of KOI-8

In some CMEA countries, modifications of KOI-8 were created for national variants of the Latin alphabet. The basic idea was the same - when the eighth bit is “cut off,” the text should remain more or less understandable.

“Anatole went to her to borrow money from her and kissed her bare shoulders. She didn't give him money, but she allowed him to kiss her. Her father, jokingly, aroused her jealousy; she said with a calm smile that she was not so stupid as to be jealous: let her do what she wants, she said about me. I asked her one day if she felt any signs of pregnancy. She laughed contemptuously and said that she was not a fool to want to have children, and that she would not have children from me.”
Then he remembered the rudeness, the clarity of her thoughts and the vulgarity of expressions characteristic of her, despite her upbringing in the highest aristocratic circle. “I’m not some kind of fool... go try it yourself... allez vous promener,” she said. Often, looking at her success in the eyes of old and young men and women, Pierre could not understand why he did not love her. Yes, I never loved her, Pierre told himself; I knew that she was a depraved woman, he repeated to himself, but he did not dare admit it.

Hello, dear readers of this blog. Today we will talk about where krakozyabrs come from on websites and in programs, what text encodings exist, and which ones should be used. Let's take a close look at the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows 1251, and ending with the modern encodings of the Unicode consortium, UTF-16 and UTF-8.

To some, this information may seem unnecessary, but you would not believe how many questions I receive specifically about those creeping krakozyabrs (an unreadable jumble of characters). Now I will have the opportunity to refer everyone to the text of this article and track down my own mistakes. Well, get ready to absorb the information and try to follow the flow of the story.

ASCII - basic text encoding for the Latin alphabet

The development of text encodings went hand in hand with the formation of the IT industry, and over that time they managed to undergo quite a few changes. Historically, it all started with EBCDIC (rather dissonant to the Russian ear), which made it possible to encode letters of the Latin alphabet, Arabic numerals and punctuation marks together with control characters.

But still, the starting point for the development of modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, which in Russian is usually pronounced "aski"). It describes the first 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks.

These 128 characters described by ASCII also include service characters like brackets, hash marks, asterisks, and so on. In fact, you can see them yourself:

It is these 128 characters from the original version of ASCII that have become the standard, and in any other encoding you will definitely find them and they will appear in this order.

But the fact is that with one byte of information you can encode not 128 but as many as 256 different values (two to the power of eight equals 256), so after the basic version of ASCII a whole series of extended ASCII encodings appeared, in which, in addition to the 128 basic characters, it was also possible to encode characters of a national alphabet (for example, Russian).

Here, it's probably worth saying a little more about the number systems used in the description. Firstly, as you all know, a computer works only with numbers in the binary system, namely with zeros and ones ("Boolean algebra", if anyone took it at an institute or school). One byte consists of eight bits, each of which represents a power of two, starting from zero and ending with two to the seventh power:

It is not difficult to see that there can be only 256 such combinations of zeros and ones. Converting a number from binary to decimal is quite simple: you just add up all the powers of two that have ones standing over them.

In our example, this works out to 1 (two to the zero power) plus 8 (two to the third power), plus 32 (two to the fifth power), plus 64 (two to the sixth power), plus 128 (two to the seventh power), for a total of 233 in decimal notation. As you can see, everything is very simple.
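This arithmetic is easy to check in a couple of lines of Python (the code is just a quick illustration, not part of the original article):

```python
# The byte from the example: ones over powers 0, 3, 5, 6 and 7.
bits = "11101001"

# Sum the powers of two that have a one standing over them.
value = sum(2 ** i for i, bit in enumerate(reversed(bits)) if bit == "1")
print(value)         # 233
print(int(bits, 2))  # 233: the built-in base-2 conversion agrees
```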

But if you look closely at the table of ASCII characters, you will see that they are given in hexadecimal notation. For example, the asterisk corresponds to the hexadecimal number 2A in ASCII. You probably know that the hexadecimal number system uses, in addition to Arabic numerals, the Latin letters from A (meaning ten) to F (meaning fifteen).

To convert a binary number to hexadecimal, the following simple and obvious method is used: each byte of information is divided into two halves of four bits each, as shown in the screenshot above. Each half-byte can encode only sixteen values (two to the fourth power) in binary, so it can easily be written as a single hexadecimal digit.

Moreover, in the left half of the byte the powers have to be counted again starting from zero, and not as shown in the screenshot. As a result, through simple calculations, we get that the number E9 is encoded in the screenshot. I hope the course of my reasoning and the solution to this puzzle were clear to you. Well, now let's continue talking about text encodings.
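The nibble-splitting trick can be sketched like this (again, plain Python used only to verify the arithmetic):

```python
bits = "11101001"
high, low = bits[:4], bits[4:]        # split the byte into two half-bytes
print(int(high, 2), int(low, 2))      # 14 and 9, i.e. hex digits E and 9
print(format(int(bits, 2), "02X"))    # E9
```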

Extended versions of ASCII - the CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, in effect, the starting point for the development of all modern encodings (Windows 1251, Unicode, UTF-8).

Initially, it contained only 128 characters: Latin letters, Arabic numerals and a few others. But the extended version made it possible to use all 256 values that can be encoded in one byte of information. That is, it became possible to add the letters of your own language to ASCII.

Here we need to digress again to explain why we need text encodings at all and why this is so important. The characters on your computer screen are formed on the basis of two things: sets of vector shapes (representations) of the various characters, which are located in font files, and code that allows you to pull out of this set of vector shapes exactly the character that needs to be inserted in the right place.

It is clear that the fonts themselves are responsible for the vector shapes, while the operating system and the programs running in it are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes one single character of that text.

The program that displays this text on the screen (a text editor, browser, etc.), while parsing the code, reads the code of the next character and looks up the corresponding vector shape in the font file connected to display this text document. Everything is simple and banal.

This means that to encode any character we need (for example, one from a national alphabet), two conditions must be met: the vector shape of the character must exist in the font being used, and the character must be encodable in one byte, as in the extended ASCII encodings. That is why a whole bunch of such variants exist; for encoding characters of the Russian language alone there are several varieties of extended ASCII.

For example, the first to appear was CP866, which could use characters of the Russian alphabet and was an extended version of ASCII.

That is, its first half completely coincided with the basic version of ASCII (128 Latin characters, numerals and the like), shown in the screenshot just above, while the second half of the CP866 table had the appearance shown in the screenshot just below and allowed another 128 characters to be encoded (Russian letters and assorted pseudographics):

You see, the row labels start with 8, because rows 0 through 7 belong to the basic part of ASCII (see the first screenshot). Thus the Russian letter "М" in CP866 has the code 8C (it sits at the intersection of row 8 and column C in the hexadecimal table), which fits into one byte of information; given a suitable font with Russian characters, this letter will appear in the text without any problems.
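By the way, you can check the code of any character yourself with Python's built-in cp866 codec (a quick sketch for verification, not part of the original article):

```python
# Encode a Russian letter with the one-byte CP866 code page.
ch = "М"
encoded = ch.encode("cp866")
print(encoded.hex().upper())      # 8C: one byte per character
print(b"\x8c".decode("cp866"))    # М: the reverse lookup
```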

Where did all this pseudographics in CP866 come from? The whole point is that this encoding for Russian text was developed back in those shaggy years when graphical operating systems were not as widespread as they are now. In DOS and similar text-mode operating systems, pseudographics made it possible to at least somehow diversify the appearance of text, which is why CP866 and all its peers from the category of extended ASCII versions abound in it.

CP866 was distributed by IBM, but besides it a number of other encodings were developed for Russian characters; KOI8-R, for example, belongs to the same type (extended ASCII):

The principle of its operation is the same as that of CP866 described a little earlier: each character of text is encoded with one single byte. The screenshot shows the second half of the KOI8-R table, because its first half completely matches basic ASCII, shown in the first screenshot of this article.

A notable feature of the KOI8-R encoding is that the Russian letters in its table are not in alphabetical order, unlike, for example, CP866.

If you look at the very first screenshot (of the basic part, which is included in all extended encodings), you will notice that in KOI8-R the Russian letters occupy the same table cells as the corresponding letters of the Latin alphabet in the first half of the table. This was done for the convenience of switching from Russian to Latin characters by discarding just one bit (two to the seventh power, or 128).
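This bit trick can be sketched in Python, assuming the standard koi8-r codec:

```python
# KOI8-R places each Cyrillic letter exactly 128 positions above its
# Latin phonetic counterpart, so clearing bit 7 "transliterates" it.
b = "м".encode("koi8-r")[0]   # the byte for Cyrillic 'м'
print(hex(b))                  # 0xcd
print(chr(b & 0x7F))           # M: drop bit 7 (value 128) and get Latin M
```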

Windows 1251 - a modern version of extended ASCII, and why krakozyabrs come out

The further development of text encodings was driven by the growing popularity of graphical operating systems, in which the need for pseudographics gradually disappeared. As a result, a whole group of encodings arose that were, in essence, still extended versions of ASCII (one character of text is encoded with just one byte of information), but without pseudographic symbols.

They belonged to the so-called ANSI encodings, developed by the American National Standards Institute. In common parlance, the name Cyrillic was also used for the version with Russian language support. An example of this is Windows 1251.

It differed favorably from the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the missing symbols of Russian typography (apart from the accent mark), as well as symbols used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):

Because of this abundance of Russian-language encodings, font manufacturers and software makers had constant headaches, while you and I, dear readers, often got those same notorious krakozyabrs whenever there was confusion about which version was used in a text.

Very often they came out when sending and receiving e-mail messages, which led to the creation of very complex conversion tables that, in fact, could not solve the problem fundamentally; to avoid the notorious krakozyabrs, users often resorted to Latin transliteration for correspondence instead of Russian encodings like CP866, KOI8-R or Windows 1251.

In fact, the krakozyabrs that appeared instead of Russian text were the result of using the wrong encoding for this language, one that did not match the encoding in which the text message was originally created.

Say, if you try to display characters encoded with CP866 using the Windows 1251 code table, you will get that very gibberish (a meaningless set of characters) completely replacing the text of the message.
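A minimal sketch of exactly this mishap, using Python's standard cp866 and cp1251 codecs:

```python
# Text encoded as CP866 but decoded as Windows-1251 turns into krakozyabrs.
original = "Привет"
garbled = original.encode("cp866").decode("cp1251")
print(garbled)    # a meaningless jumble of characters

# Decoding the same bytes with the right code page recovers the text.
restored = garbled.encode("cp1251").decode("cp866")
print(restored)   # Привет
```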

A similar situation very often arises on forums and blogs, when text with Russian characters is mistakenly saved in an encoding other than the one used on the site by default, or in the wrong text editor, which adds artifacts to the code that are invisible to the naked eye.

In the end, many people got tired of this situation with numerous encodings and constantly creeping krakozyabrs, and the prerequisites appeared for creating a new universal variation that would replace all the existing ones and finally solve the problem of unreadable texts. On top of that there was the problem of languages like Chinese, which have far more than 256 characters.

Unicode - universal encodings UTF 8, 16 and 32

The thousands of characters of the Southeast Asian language group could not possibly be described in the single byte of information allocated for encoding characters in the extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created with the collaboration of many IT industry leaders (makers of software, hardware and fonts) who were interested in the emergence of a universal text encoding.

The first variation released under the auspices of the Unicode Consortium was UTF-32. The number in the encoding's name is the number of bits used to encode one character. 32 bits equal 4 bytes of information, which are needed to encode one single character in the new universal UTF encoding.

As a result, the same text file encoded in the extended version of ASCII and in UTF-32 will, in the latter case, have a size (weight) four times larger. This is bad, but now we have the opportunity to encode, using UTF, a number of characters equal to two to the thirty-second power (billions of characters, which covers any really necessary value with a colossal margin).
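The fourfold overhead is easy to see with Python's utf-32 codec (little-endian here, so no BOM gets in the way):

```python
# Every character costs four bytes in UTF-32, regardless of the alphabet.
text = "Hi!"
encoded = text.encode("utf-32-le")
print(len(encoded))   # 12 bytes for 3 characters
```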

But many countries with languages of the European group did not need to use such a huge number of characters at all; when using UTF-32, however, they received for no good reason a fourfold increase in the weight of text documents and, as a consequence, an increase in Internet traffic and the volume of stored data. That is a lot, and nobody could afford such waste.

As a result of the further development of Unicode, UTF-16 appeared. It turned out so successful that it was adopted by default as the base space for all the characters we use. It uses two bytes to encode one character. Let's see how this thing looks.

In the Windows operating system, you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". As a result, a table will open with the vector shapes of all the fonts installed on your system. If you select the Unicode character set in the "Advanced options", you will be able to see, for each font separately, the entire range of characters included in it.

By the way, by clicking on any of them, you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits:
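Those four hexadecimal digits are simply the character's code point, which for characters of the base space coincides with its two-byte UTF-16 code. A quick check in Python:

```python
# The four hex digits shown in the character table are the code point,
# which equals the UTF-16 code unit for base-space characters.
ch = "Ж"
print(format(ord(ch), "04X"))         # 0416: the character's code point
print(ch.encode("utf-16-be").hex())   # 0416: two bytes in big-endian UTF-16
```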

How many characters can be encoded with the 16 bits of UTF-16? 65,536 (two to the sixteenth power), and this is the number that was adopted as the base space in Unicode. In addition, there are ways (surrogate pairs) to encode characters beyond it, but the extended space was limited to a little over a million characters of text.

But even this successful version of the Unicode encoding did not bring much satisfaction to those who wrote, say, programs only in English, because for them, after the transition from the extended version of ASCII to UTF-16, the weight of documents doubled (one byte per character in ASCII versus two bytes for the same character in UTF-16).

It was precisely to satisfy everyone and everything that the Unicode consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in its name, it really has a variable length: each character of text can be encoded in a sequence from one to six bytes long.

In practice, UTF-8 uses only the range from one to four bytes, because four bytes of code are already enough to represent the entire Unicode space. All Latin characters are encoded in one byte, just as in good old ASCII.

What is noteworthy is that if only the Latin alphabet is encoded, even programs that do not understand Unicode will still read text encoded in UTF-8. That is, the core part of ASCII was simply carried over into this creation of the Unicode consortium.

Cyrillic characters in UTF-8 are encoded in two bytes, and Georgian characters, for example, in three bytes. By creating UTF-16 and UTF-8, the Unicode Consortium solved the main problem: fonts now have a single code space. Their manufacturers can now fill it with vector shapes of text characters according to their own strengths and capabilities. Nowadays fonts even come in whole sets.
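The variable byte cost is easy to demonstrate (Latin, Cyrillic and Georgian sample letters below are my own picks for illustration):

```python
# UTF-8 spends a different number of bytes depending on the character.
for ch in "A", "я", "ა":   # Latin, Cyrillic, Georgian
    print(ch, len(ch.encode("utf-8")))
# A 1
# я 2
# ა 3
```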

In the "Character Map" mentioned above you can see that different fonts support different numbers of characters. Some Unicode-rich fonts can be quite heavy. But now they differ not in being created for different encodings, but in how completely the font manufacturer has filled the single code space with particular vector shapes.

Krakozyabrs instead of Russian letters - how to fix it

Let's now see how krakozyabrs appear instead of text or, in other words, how the correct encoding for Russian text is selected. Actually, it is set in the program in which you create or edit this very text, or code using text fragments.

To edit and create text files I personally use what is, in my opinion, a very good editor, Notepad++. It can highlight the syntax of hundreds of programming and markup languages and can also be extended with plugins. Read a detailed review of this wonderful program at the link provided.

In the top menu of Notepad++ there is an item “Encodings”, where you will have the opportunity to convert an existing option to the one used by default on your site:

For a site on Joomla 1.5 and higher, as well as for a blog on WordPress, you should choose the UTF 8 without BOM option to avoid the appearance of krakozyabrs. And what is this BOM prefix?

The fact is that when the UTF-16 encoding was being developed, it was for some reason decided to allow the character code to be written both in direct byte order (for example, 0A15) and in reverse (150A). So that programs would understand in exactly what order to read the codes, the BOM (Byte Order Mark or, in other words, signature) was invented, expressed in adding extra bytes to the very beginning of documents.

In the UTF-8 encoding, no BOM is required by the Unicode consortium, so adding a signature (those notorious three extra bytes at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we must always select the option without BOM (without signature). This way you protect yourself in advance from crawling krakozyabrs.
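You can see those three signature bytes with Python's utf-8-sig codec, which writes the UTF-8 BOM:

```python
# The optional UTF-8 signature is the three bytes EF BB BF at the start.
with_bom = "текст".encode("utf-8-sig")
without_bom = "текст".encode("utf-8")
print(with_bom[:3].hex())                 # efbbbf
print(len(with_bom) - len(without_bom))   # 3 extra bytes
```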

What is noteworthy is that some programs in Windows cannot do this (cannot save text in UTF-8 without a BOM); for example, the notorious Windows Notepad. It saves the document in UTF-8 but still adds the signature (three extra bytes) to the beginning. Moreover, these bytes are always the same. On servers, however, this little thing can cause a problem: krakozyabrs will come out.

Therefore, never use the regular Windows Notepad to edit documents on your site if you don't want krakozyabrs to appear. I consider the already mentioned Notepad++ editor the best and simplest option; it has practically no drawbacks and consists only of advantages.

In Notepad++, when you select an encoding, you will also have the option to convert text to UCS-2, which is very close in nature to the Unicode standard. Notepad++ can also encode text in ANSI, which for the Russian language means Windows 1251, already described just above. Where does this information come from?

It is recorded in the registry of your Windows operating system: which encoding to choose in the case of ANSI and which in the case of OEM (for the Russian language it will be CP866). If you set a different default language on your computer, these encodings will be replaced with the corresponding ANSI or OEM encodings for that language.

After you save the document in Notepad++ in the encoding you need or open the document from the site for editing, you can see its name in the lower right corner of the editor:

To avoid krakozyabrs, in addition to the actions described above, it is useful to write information about the encoding into the header of the source code of all pages of the site, so that there is no confusion on the server or the local host.

In general, all hypertext markup languages ​​except Html use a special xml declaration, which specifies the text encoding.

Before parsing the code, the browser thus knows which version is being used and exactly how to interpret the character codes of that language. Notably, if you save the document in the default Unicode, the xml declaration can be omitted (the encoding will be assumed to be UTF-8 if there is no BOM, or UTF-16 if there is one).

In the case of an Html document, a Meta element is used to specify the encoding; it is written between the opening and closing Head tags:

<head> ... <meta charset="utf-8"> ... </head>

This entry differs quite a bit from the one adopted in the Html 4.01 standard, but it fully complies with the Html 5 standard that is gradually being introduced, and it is correctly understood by any browser currently in use.

In theory, it is better to place the Meta element indicating the Html document's encoding as high as possible in the document header, so that by the time the browser encounters the first character in the text not from basic ASCII (those are always read correctly, in any variation), it already has information on how to interpret the codes of these characters.

Good luck to you! See you soon on the pages of the blog site


KOI-8 ("information interchange code", 8 bits), or KOI8, is an eight-bit character encoding standard in computer science, designed for encoding the letters of Cyrillic alphabets. There is also a seven-bit version of the encoding, KOI-7. KOI-7 and KOI-8 are described in GOST 19768-74 (now invalid).

The developers of KOI-8 placed the characters of the Russian alphabet in the upper half of the extended ASCII table in such a way that the positions of the Cyrillic characters correspond to their phonetic counterparts in the English alphabet in the lower half of the table. This means that if the eighth bit of each character is removed from a text written in KOI-8, the result is still "readable" text, albeit written in Latin characters: for example, "Русский Текст" becomes "rUSSKIJ tEKST". As a side consequence, the Cyrillic characters ended up out of alphabetical order.
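This property can be checked directly with the standard koi8-r codec (a verification sketch, not part of the original article):

```python
# Strip the eighth bit from KOI-8 bytes: Cyrillic becomes case-swapped
# Latin "transliteration" that is still readable.
data = "Русский Текст".encode("koi8-r")
stripped = bytes(b & 0x7F for b in data).decode("ascii")
print(stripped)   # rUSSKIJ tEKST
```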

The KOI8-R encoding

Upper half of the table (codes 80-FF; position 9A is a no-break space):

      .0 .1 .2 .3 .4 .5 .6 .7 .8 .9 .A .B .C .D .E .F
  8.   ─  │  ┌  ┐  └  ┘  ├  ┤  ┬  ┴  ┼  ▀  ▄  █  ▌  ▐
  9.   ░  ▒  ▓  ⌠  ■  ∙  √  ≈  ≤  ≥     ⌡  °  ²  ·  ÷
  A.   ═  ║  ╒  ё  ╓  ╔  ╕  ╖  ╗  ╘  ╙  ╚  ╛  ╜  ╝  ╞
  B.   ╟  ╠  ╡  Ё  ╢  ╣  ╤  ╥  ╦  ╧  ╨  ╩  ╪  ╫  ╬  ©
  C.   ю  а  б  ц  д  е  ф  г  х  и  й  к  л  м  н  о
  D.   п  я  р  с  т  у  ж  в  ь  ы  з  ш  э  щ  ч  ъ
  E.   Ю  А  Б  Ц  Д  Е  Ф  Г  Х  И  Й  К  Л  М  Н  О
  F.   П  Я  Р  С  Т  У  Ж  В  Ь  Ы  З  Ш  Э  Щ  Ч  Ъ

Corresponding Unicode code points (hexadecimal), row by row:

  8.: 2500 2502 250C 2510 2514 2518 251C 2524 252C 2534 253C 2580 2584 2588 258C 2590
  9.: 2591 2592 2593 2320 25A0 2219 221A 2248 2264 2265 00A0 2321 00B0 00B2 00B7 00F7
  A.: 2550 2551 2552 0451 2553 2554 2555 2556 2557 2558 2559 255A 255B 255C 255D 255E
  B.: 255F 2560 2561 0401 2562 2563 2564 2565 2566 2567 2568 2569 256A 256B 256C 00A9
  C.: 044E 0430 0431 0446 0434 0435 0444 0433 0445 0438 0439 043A 043B 043C 043D 043E
  D.: 043F 044F 0440 0441 0442 0443 0436 0432 044C 044B 0437 0448 044D 0449 0447 044A
  E.: 042E 0410 0411 0426 0414 0415 0424 0413 0425 0418 0419 041A 041B 041C 041D 041E
  F.: 041F 042F 0420 0421 0422 0423 0416 0412 042C 042B 0417 0428 042D 0429 0427 042A

The KOI8-U encoding (Ukrainian)

KOI8-U differs from KOI8-R only in rows A and B, where some box-drawing characters are replaced by the Ukrainian letters є, і, ї, ґ and their capitals:

      .0 .1 .2 .3 .4 .5 .6 .7 .8 .9 .A .B .C .D .E .F
  A.   ═  ║  ╒  ё  є  ╔  і  ї  ╗  ╘  ╙  ╚  ╛  ґ  ╝  ╞
  B.   ╟  ╠  ╡  Ё  Є  ╣  І  Ї  ╦  ╧  ╨  ╩  ╪  Ґ  ╬  ©

Corresponding Unicode code points (hexadecimal):

  A.: 2550 2551 2552 0451 0454 2554 0456 0457 2557 2558 2559 255A 255B 0491 255D 255E
  B.: 255F 2560 2561 0401 0404 2563 0406 0407 2566 2567 2568 2569 256A 0490 256C 00A9