Unicode

Type: Text encoding standard
Developer: Unicode Consortium
First release: 1991
Open format: Yes
Free format: Yes

 

Unicode is a standard character set: an assignment of numeric values to characters. A huge number of characters from various writing systems (modern or ancient), as well as special symbols of many types, are each given a number. On Blu-ray, Unicode is used for text-based subtitles and fonts. It is also the character set used by Java, the language of BD-J applications.

Unicode is an international standard and is the dominant text encoding format. It was first published in 1991. Subsequent revisions have continually expanded its character repertoire. Unicode was developed in reaction to the unwieldy multiplicity of character sets that had arisen to include various subsets of the many characters left out of the English-centric ASCII set.

The standard way to denote a Unicode code point is to prefix it with "U+" and write the number in hexadecimal, using at least four hex digits. For example, code point 42 is written as U+002A, and code point 1,114,109 is written as U+10FFFD. Code points are the numbers assigned by the Unicode Consortium to every character in every writing system. Another example: the interrobang "‽" is U+203D.
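The notation can be produced mechanically. The following is a minimal sketch in desktop Java (Java 5 or later is assumed; a BD-J runtime may not offer the same API), and the class and method names are made up for illustration:

```java
final class CodePointNotation {
    // Formats a code point in "U+" notation: uppercase hexadecimal,
    // zero-padded to at least four digits.
    static String toUPlus(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(toUPlus(42));                        // U+002A
        System.out.println(toUPlus(1114109));                   // U+10FFFD
        System.out.println(toUPlus("\u203D".codePointAt(0)));   // U+203D (interrobang)
    }
}
```

The `%04X` format pads short values with zeros but does not truncate longer ones, which is why U+10FFFD keeps all six digits.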

Each code point is also assigned a human-readable name, which may be written after the "U+" notation. For example, you might see "U+002A ASTERISK" or "U+03A9 GREEK CAPITAL LETTER OMEGA".
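On a desktop JDK (Java 7 or later), the name can be looked up with `Character.getName`. This is only a sketch: the method is not assumed to exist on BD-J players, the names come from the Unicode version bundled with the JDK rather than from Unicode 2.0, and `CodePointNames` is a hypothetical class name.

```java
final class CodePointNames {
    // Returns "U+XXXX NAME", or just "U+XXXX" when the runtime has no
    // name for the code point (Character.getName returns null if unassigned).
    static String describe(int codePoint) {
        String notation = String.format("U+%04X", codePoint);
        String name = Character.getName(codePoint);
        return name == null ? notation : notation + " " + name;
    }

    public static void main(String[] args) {
        System.out.println(describe(0x002A)); // U+002A ASTERISK
        System.out.println(describe(0x03A9)); // U+03A9 GREEK CAPITAL LETTER OMEGA
    }
}
```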

In the Blu-ray technical specification, all text encoding (both for coding and for displaying text) uses Unicode 2.0, encoded as UTF-8 or UTF-16BE, as defined in ISO/IEC 10646-1:1993. Unicode 2.0, released in July 1996, was a significant update to the standard. It contains 38,885 assigned characters (excluding private-use characters, control characters, noncharacters, and surrogate code points), covering Basic Latin, Cyrillic, Greek, CJK ideographs (including Japanese kanji), and many other scripts. The characters are organized into blocks by script, symbol type, or usage, and the blocks reflect the state of text encoding standardization at that time. This version may be outdated by today's standards, but it remains the relevant one for BD development.
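To see how the two encodings differ, the sketch below encodes one character both ways and prints the bytes. It assumes desktop Java 7 or later for `StandardCharsets`; `EncodingDemo` and `hex` are illustrative names, not part of any Blu-ray API.

```java
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String omega = "\u03A9"; // GREEK CAPITAL LETTER OMEGA
        System.out.println(hex(omega.getBytes(StandardCharsets.UTF_8)));    // CE A9
        System.out.println(hex(omega.getBytes(StandardCharsets.UTF_16BE))); // 03 A9
    }
}
```

UTF-16BE stores every BMP code point as two bytes, most significant first, while UTF-8 uses one to three bytes for the same range.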

Unicode 2.0 was likely chosen because it was a well-established standard by the early 2000s. During Blu-ray's development (starting around 2000, with specifications finalized between 2004 and 2006), the BDA likely prioritized a mature standard with broad compatibility over a newer, less-tested version such as Unicode 4.1. Newer Unicode versions introduce additional characters and complexity, which could have required more extensive validation and risked bugs or incompatibilities in a consumer product aiming for a global launch. Although already dated by 2005, Unicode 2.0 was a proven choice that met Blu-ray's needs without overcomplicating the specification.

In BD applications, Unicode text can be rendered with bitmap fonts (PNG) or vector-based fonts (OpenType). Most BD-J titles use bitmap fonts, along with Java classes such as StringBuffer, BufferedReader, and FileInputStream, to load the text and display its code points as bitmap glyphs. Vector-based OpenType fonts are rarely used in BD-J; when they are, the fonts and text are stored in the Blu-ray player's 4 MB text cache and drawn by the player's font rendering engine. A BD-J application should include a font file (OpenType) that is compatible with Unicode 2.0; otherwise, the player uses its own default font. However, the majority of players may not include fonts of their own, so it is best to include font files with the application.
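As a rough illustration of how the classes named above fit together, the sketch below reads UTF-8 text through a BufferedReader into a StringBuffer. It is not taken from any BD-J title: `SubtitleTextLoader`, `readUtf8`, and `decode` are hypothetical names, and an in-memory stream stands in for the FileInputStream a real application would probably open.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

final class SubtitleTextLoader {
    // Reads UTF-8 text from a stream and returns it with lines joined by '\n'.
    static String readUtf8(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        StringBuffer text = new StringBuffer();
        boolean first = true;
        String line;
        while ((line = reader.readLine()) != null) {
            if (!first) {
                text.append('\n');
            }
            text.append(line);
            first = false;
        }
        return text.toString();
    }

    // Convenience wrapper for in-memory bytes (assumption: real code would
    // pass a FileInputStream to readUtf8 instead).
    static String decode(byte[] utf8Bytes) {
        try {
            return readUtf8(new ByteArrayInputStream(utf8Bytes));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] omega = {(byte) 0xCE, (byte) 0xA9}; // UTF-8 bytes of U+03A9
        String text = decode(omega);
        System.out.println(Integer.toHexString(text.charAt(0))); // 3a9
    }
}
```

Naming the charset explicitly matters here: without it, InputStreamReader falls back to the platform default encoding, which may not be UTF-8.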

Unicode 2.0 defines a total of 38,885 assigned characters, all within the range U+0000 to U+FFFF (the Basic Multilingual Plane, BMP). The count comes from tallying the named entries in each of the blocks listed below.

List of Unicode Blocks in Unicode 2.0

 

Range | Block Name | Assigned Characters | Notes
U+0000–U+007F | Basic Latin | 128 | ASCII characters (letters, digits, punctuation, controls).
U+0080–U+00FF | Latin-1 Supplement | 128 | Additional Latin characters, symbols, and controls (e.g., £, ©).
U+0100–U+017F | Latin Extended-A | 128 | Extended Latin for European languages (e.g., Œ, Š).
U+0180–U+024F | Latin Extended-B | 113 | More Latin letters for African and Native American languages (e.g., Ɓ, ƒ).
U+0250–U+02AF | IPA Extensions | 89 | Phonetic symbols for the International Phonetic Alphabet (e.g., ɐ, ʃ).
U+02B0–U+02FF | Spacing Modifier Letters | 80 | Modifiers for phonetics/typography (e.g., ʰ, ː).
U+0300–U+036F | Combining Diacritical Marks | 112 | Marks combining with base characters (e.g., ◌̀, ◌̈).
U+0370–U+03FF | Greek | 135 | Greek letters and symbols (e.g., α, Ω).
U+0400–U+04FF | Cyrillic | 256 | Cyrillic script for Slavic languages (e.g., А, Я).
U+0530–U+058F | Armenian | 85 | Armenian script (e.g., Ա, Ֆ).
U+0590–U+05FF | Hebrew | 87 | Hebrew script (e.g., א, ת).
U+0600–U+06FF | Arabic | 237 | Arabic script and symbols (e.g., ا, ى).
U+0900–U+097F | Devanagari | 114 | Script for Hindi, Sanskrit (e.g., अ, ह).
U+0980–U+09FF | Bengali | 92 | Bengali script (e.g., অ, হ).
U+0A00–U+0A7F | Gurmukhi | 79 | Script for Punjabi (e.g., ਅ, ਹ).
U+0A80–U+0AFF | Gujarati | 83 | Gujarati script (e.g., અ, હ).
U+0B00–U+0B7F | Oriya | 81 | Oriya script (e.g., ଅ, ହ).
U+0B80–U+0BFF | Tamil | 72 | Tamil script (e.g., அ, ஹ).
U+0C00–U+0C7F | Telugu | 88 | Telugu script (e.g., అ, హ).
U+0C80–U+0CFF | Kannada | 86 | Kannada script (e.g., ಅ, ಹ).
U+0D00–U+0D7F | Malayalam | 89 | Malayalam script (e.g., അ, ഹ).
U+0E00–U+0E7F | Thai | 87 | Thai script (e.g., ก, ๏).
U+0E80–U+0EFF | Lao | 65 | Lao script (e.g., ກ, ຳ).
U+0F00–U+0FFF | Tibetan | 168 | Tibetan script (e.g., ཀ, ྼ).
U+10A0–U+10FF | Georgian | 83 | Georgian script (e.g., Ⴀ, ჶ).
U+1100–U+11FF | Hangul Jamo | 240 | Korean Hangul components (e.g., ᄀ, ᇿ).
U+1E00–U+1EFF | Latin Extended Additional | 185 | More Latin extensions (e.g., Ḁ, ỿ).
U+1F00–U+1FFF | Greek Extended | 233 | Precomposed Greek with diacritics (e.g., ἀ, ῼ).
U+2000–U+206F | General Punctuation | 71 | Punctuation marks (e.g., —, ‘).
U+2070–U+209F | Superscripts and Subscripts | 34 | Superscript/subscript digits and letters (e.g., ⁰, ₓ).
U+20A0–U+20CF | Currency Symbols | 12 | Currency signs (e.g., ₧).
U+20D0–U+20FF | Combining Diacritical Marks for Symbols | 33 | Combining marks for symbols (e.g., ◌⃐, ◌⃡).
U+2100–U+214F | Letterlike Symbols | 55 | Symbols resembling letters (e.g., ℂ, ℏ).
U+2150–U+218F | Number Forms | 50 | Fractions, Roman numerals (e.g., ½, Ⅻ).
U+2190–U+21FF | Arrows | 91 | Arrow symbols (e.g., ←).
U+2200–U+22FF | Mathematical Operators | 256 | Math symbols (e.g., ∀, √).
U+2300–U+23FF | Miscellaneous Technical | 126 | Technical symbols (e.g., ⌈).
U+2400–U+243F | Control Pictures | 39 | Graphical representations of control codes (e.g., ␀, ␣).
U+2440–U+245F | Optical Character Recognition | 11 | OCR-specific symbols (e.g., ⑀, ⑊).
U+2460–U+24FF | Enclosed Alphanumerics | 160 | Circled numbers/letters (e.g., ①, ⓿).
U+2500–U+257F | Box Drawing | 128 | Line-drawing characters (e.g., ─, ┼).
U+2580–U+259F | Block Elements | 32 | Block graphic characters (e.g., ▀, █).
U+25A0–U+25FF | Geometric Shapes | 96 | Shapes (e.g., ■, ◯).
U+2600–U+26FF | Miscellaneous Symbols | 171 | Various symbols.
U+2700–U+27BF | Dingbats | 174 | Decorative symbols (e.g., ✁, ❏).
U+3000–U+303F | CJK Symbols and Punctuation | 63 | CJK-specific punctuation (e.g., 、, 〿).
U+3040–U+309F | Hiragana | 93 | Japanese Hiragana (e.g., ぁ, ん).
U+30A0–U+30FF | Katakana | 96 | Japanese Katakana (e.g., ァ, ヿ).
U+3100–U+312F | Bopomofo | 27 | Chinese phonetic script (e.g., ㄅ, ㄩ).
U+3130–U+318F | Hangul Compatibility Jamo | 94 | Legacy Korean Jamo (e.g., ㄱ, ㅿ).
U+3200–U+32FF | Enclosed CJK Letters and Months | 191 | Enclosed CJK characters (e.g., ㈀, ㋿).
U+3300–U+33FF | CJK Compatibility | 256 | Compatibility CJK variants (e.g., ㌀, ㏿).
U+4E00–U+9FFF | CJK Unified Ideographs | 20,902 | Core Chinese/Japanese/Korean characters (e.g., 一, 龥).
U+AC00–U+D7A3 | Hangul Syllables | 11,172 | Precomposed Korean syllables (e.g., 가, 힣).
U+E000–U+F8FF | Private Use Area | 0 (reserved) | No predefined characters; for custom use (e.g., the Apple logo).
U+F900–U+FAFF | CJK Compatibility Ideographs | 302 | Compatibility variants of CJK ideographs (e.g., 豈).
U+FB00–U+FB4F | Alphabetic Presentation Forms | 58 | Precomposed ligatures (e.g., ﬀ, ﬅ).
U+FB50–U+FDFF | Arabic Presentation Forms-A | 611 | Arabic contextual forms (e.g., ﭐ, ﷿).
U+FE20–U+FE2F | Combining Half Marks | 16 | Half-width combining marks (e.g., ◌︠, ◌︯).
U+FE30–U+FE4F | CJK Compatibility Forms | 32 | Vertical CJK punctuation variants (e.g., ︐, ︴).
U+FE50–U+FE6F | Small Form Variants | 26 | Small CJK punctuation (e.g., ﹐, ﹯).
U+FE70–U+FEFF | Arabic Presentation Forms-B | 141 | More Arabic forms; includes U+FEFF, the zero-width no-break space used as the byte order mark (e.g., ﹰ).
U+FF00–U+FFEF | Halfwidth and Fullwidth Forms | 225 | Fullwidth ASCII, halfwidth Katakana/Hangul (e.g., ！, ｦ).
U+FFF0–U+FFFF | Specials | 6 | Special-purpose characters (e.g., �).
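A table like the one above maps naturally onto a range lookup. The sketch below is a hedged illustration that hard-codes only a handful of the rows; `BlockLookup` and `blockOf` are invented names, and a real implementation would carry the full list. Java's built-in `Character.UnicodeBlock.of` is deliberately not used, since it reflects the Unicode version of the running JDK rather than Unicode 2.0.

```java
final class BlockLookup {
    // Excerpt of the block table: {first, last} code point per block.
    private static final int[][] RANGES = {
        {0x0000, 0x007F}, {0x0080, 0x00FF}, {0x0370, 0x03FF},
        {0x0400, 0x04FF}, {0x3040, 0x309F}, {0x4E00, 0x9FFF},
        {0xE000, 0xF8FF},
    };
    private static final String[] NAMES = {
        "Basic Latin", "Latin-1 Supplement", "Greek",
        "Cyrillic", "Hiragana", "CJK Unified Ideographs",
        "Private Use Area",
    };

    // Returns the block name, or null if the code point falls outside
    // the ranges in this excerpt.
    static String blockOf(int codePoint) {
        for (int i = 0; i < RANGES.length; i++) {
            if (codePoint >= RANGES[i][0] && codePoint <= RANGES[i][1]) {
                return NAMES[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(blockOf(0x03A9));  // Greek
        System.out.println(blockOf('A'));     // Basic Latin
        System.out.println(blockOf(0x1F600)); // null (outside the BMP)
    }
}
```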

 

 

Missing Characters

Because Unicode 2.0 was released in 1996, it lacks several key symbols that emerged later, most notably the Euro sign (€, U+20AC) for the currency introduced in 1999, which was added in Unicode 2.1. Other absences include newer currency symbols such as the Indian Rupee sign (₹, U+20B9, Unicode 6.0), the extensive emoji sets (Unicode 6.0 and later), and newer scripts such as Cherokee (added in Unicode 3.0). These gaps reflect Unicode 2.0's 1996 scope, limited to 38,885 characters in the BMP. The Private Use Area (PUA, U+E000–U+F8FF), with 6,400 unassigned code points, offers a workaround: developers can assign custom glyphs, such as the Euro sign or proprietary icons, to PUA code points and pair them with a custom font.
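The PUA workaround amounts to a character substitution applied before the text reaches the renderer. In the sketch below, the slot U+E0AC is an arbitrary, hypothetical choice, and the custom font is assumed to contain a Euro glyph at that code point; `PuaRemap` and `remap` are illustrative names.

```java
final class PuaRemap {
    static final char EURO = '\u20AC';          // not in Unicode 2.0
    static final char EURO_PUA_SLOT = '\uE0AC'; // hypothetical PUA slot in a custom font

    // Replaces every Euro sign with the PUA code point that the
    // custom font is assumed to map to a Euro glyph.
    static String remap(String text) {
        return text.replace(EURO, EURO_PUA_SLOT);
    }

    public static void main(String[] args) {
        String remapped = remap("Price: " + EURO + "5");
        System.out.println(String.format("U+%04X", (int) remapped.charAt(7))); // U+E0AC
    }
}
```

The same approach extends to other missing characters by keeping a small table of source code points and their PUA slots.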

For the Euro sign, another option is to use the older Euro-currency sign ₠ (U+20A0) as a substitute: a custom font can draw the € glyph at that code point.

If a character from a newer Unicode version is used, it will appear as the replacement character "�", as an empty box "□", or not at all.


Special Symbols & Emoji Support

Because Blu-ray uses Unicode 2.0, emoji-like support is limited to basic symbols such as ☺ (U+263A) or ♥ (U+2665), found in the Arrows, Miscellaneous Technical, Miscellaneous Symbols, Dingbats, Geometric Shapes, and CJK Symbols and Punctuation blocks. By default, these characters are drawn as scalable vector symbols from a font (e.g., an OpenType font such as Noto Emoji). The player does not display rasterized bitmap images or layered graphics for them by default; a developer who wants that must do it manually in the BD-J application, using small PNG graphics. For example, ⌂ (U+2302) represents a house and could be drawn as a small bitmap image instead of a font glyph.


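‼ U+203C | ↔ U+2194 | ↕ U+2195 | ↖ U+2196 | ↗ U+2197 | ↘ U+2198 | ↙ U+2199 | ↩ U+21A9 | ↪ U+21AA | ⌚ U+231A
⌛ U+231B | ⌨ U+2328 | ⎗ U+2397 | ⎘ U+2398 | ⎙ U+2399 | ⎚ U+239A | Ⓜ U+24C2 | ▶ U+25B6 | ◀ U+25C0 | ▪ U+25AA
▫ U+25AB | ⌂ U+2302 | ☀ U+2600 | ☁ U+2601 | ☂ U+2602 | ☃ U+2603 | ☄ U+2604 | ★ U+2605
☆ U+2606 | ☎ U+260E | ☏ U+260F | ☐ U+2610 | ☒ U+2612 | ☚ U+261A | ☛ U+261B | ☜ U+261C | ☝ U+261D | ☞ U+261E
☟ U+261F | ☠ U+2620 | ☡ U+2621 | ☢ U+2622 | ☣ U+2623 | ☦ U+2626 | ☪ U+262A | ☮ U+262E | ☯ U+262F | ☸ U+2638
☹ U+2639 | ☺ U+263A | ☻ U+263B | ☼ U+263C | ☽ U+263D | ☾ U+263E | ♀ U+2640 | ♂ U+2642 | ♈ U+2648 | ♉ U+2649
♊ U+264A | ♋ U+264B | ♌ U+264C | ♍ U+264D | ♎ U+264E | ♏ U+264F | ♐ U+2650 | ♑ U+2651 | ♒ U+2652 | ♓ U+2653
♔ U+2654 | ♕ U+2655 | ♖ U+2656 | ♗ U+2657 | ♘ U+2658 | ♙ U+2659 | ♚ U+265A | ♛ U+265B | ♜ U+265C | ♝ U+265D
♞ U+265E | ♟ U+265F | ♠ U+2660 | ♡ U+2661 | ♢ U+2662 | ♣ U+2663 | ♤ U+2664 | ♥ U+2665 | ♦ U+2666 | ♧ U+2667
♨ U+2668 | ♩ U+2669 | ♪ U+266A | ♫ U+266B | ♬ U+266C | ♭ U+266D | ♮ U+266E | ♯ U+266F | ✁ U+2701 | ✂ U+2702
✃ U+2703 | ✄ U+2704 | ✆ U+2706 | ✇ U+2707 | ✈ U+2708 | ✉ U+2709 | ✌ U+270C | ✍ U+270D | ✏ U+270F | ✒ U+2712
✔ U+2714 | ✖ U+2716 | ✚ U+271A | ✝ U+271D | ✞ U+271E | ✠ U+2720 | ✡ U+2721 | ✤ U+2724 | ✧ U+2727 | ✩ U+2729
✪ U+272A | ✳ U+2733 | ❀ U+2740 | ❄ U+2744 | ❇ U+2747 | ❖ U+2756 | ❣ U+2763 | ❤ U+2764 | ❥ U+2765 | ❦ U+2766
❧ U+2767 | ➡ U+27A1 | 〄 U+3004 | 〠 U+3020 | 〰 U+3030 | 〶 U+3036 | ㉿ U+327F | ㊗ U+3297 | ㊙ U+3299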

This is not a full list, but these are the top suggestions for text in BD applications such as subtitles, menus, or video games. They are useful for showing how to operate a feature with the remote: for example, ← ↑ → ↓ can represent the directional keys, and ⓇⒼⒷⓎ can represent the colored buttons.
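Such hint strings can be built from code points rather than typed by hand. The sketch below relies on the circled Latin capital letters occupying the contiguous range U+24B6 (Ⓐ) to U+24CF (Ⓩ); the class name `RemoteSymbols` and its members are hypothetical.

```java
final class RemoteSymbols {
    // Left, up, right, and down arrows (U+2190..U+2193).
    static final String ARROWS = "\u2190 \u2191 \u2192 \u2193";

    // Maps 'A'..'Z' to CIRCLED LATIN CAPITAL LETTER A..Z (U+24B6..U+24CF).
    static char circled(char letter) {
        if (letter < 'A' || letter > 'Z') {
            throw new IllegalArgumentException("Expected A-Z, got: " + letter);
        }
        return (char) (0x24B6 + (letter - 'A'));
    }

    // Circled R, G, B, Y for the red, green, blue, and yellow buttons.
    static String colorButtons() {
        StringBuilder sb = new StringBuilder();
        for (char c : "RGBY".toCharArray()) {
            sb.append(circled(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(ARROWS + "  " + colorButtons());
    }
}
```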

 

 

Private Use Area

The Private Use Area (U+E000–U+F8FF) consists of 6,400 code points that are reserved for private use. It can hold characters missing from Unicode 2.0, custom symbols, characters from constructed scripts such as Klingon or Tengwar, or corporate symbols such as the Apple logo.

Examples, based on commonly used PUA code points: Apple logo (U+F8FF), Blu-ray logo (U+F79C), PlayStation logo (U+F895), Twitter logo (U+EA00), X logo (U+EA01), and the Linux Tux logo (U+E000). These assignments are conventions of particular fonts, not part of the Unicode standard, so they only display correctly with a font that defines them.
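A simple range check identifies PUA code points, and the arithmetic confirms the figure of 6,400. This is a minimal sketch with hypothetical names (`PrivateUse`, `isBmpPrivateUse`); it covers only the BMP Private Use Area discussed here.

```java
final class PrivateUse {
    static final int PUA_FIRST = 0xE000;
    static final int PUA_LAST = 0xF8FF;

    // True if the code point lies in the BMP Private Use Area.
    static boolean isBmpPrivateUse(int codePoint) {
        return codePoint >= PUA_FIRST && codePoint <= PUA_LAST;
    }

    // Number of code points in the range, both ends inclusive.
    static int size() {
        return PUA_LAST - PUA_FIRST + 1;
    }

    public static void main(String[] args) {
        System.out.println(size());                  // 6400
        System.out.println(isBmpPrivateUse(0xF8FF)); // true
        System.out.println(isBmpPrivateUse(0xF900)); // false
    }
}
```

On a desktop JDK the same classification is available as `Character.getType(codePoint) == Character.PRIVATE_USE`.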





 

Author(s) : Æ Firestone

on Sunday, August 25, 2024