The mystery of lost values: Understanding ASCII vs. UTF-8 in Database Queries

by Joche Ojeda | Jun 21, 2024 | Database

Understanding ASCII vs. UTF-8 in Database Queries: A Practical Guide

When dealing with databases, understanding how different character encodings impact queries is crucial. Two common encoding standards are ASCII and UTF-8. This blog post explains how they differ, shows how they affect case-sensitive queries, and walks through practical examples to illustrate these concepts.

ASCII vs. UTF-8: What’s the Difference?

ASCII (American Standard Code for Information Interchange)

Description: A character encoding standard using 7 bits to represent each character, allowing for 128 unique symbols. These include control characters (like newline), digits, uppercase and lowercase English letters, and some special symbols.
Range: 0 to 127.

UTF-8 (8-bit Unicode Transformation Format)

Description: A variable-width character encoding capable of encoding all 1,112,064 valid character code points in Unicode using one to four 8-bit bytes. UTF-8 is backward compatible with ASCII.
Range: Unicode code points U+0000 to U+10FFFF. The first 128 code points are encoded as the same single bytes as ASCII; all other characters take two to four bytes.
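
To make the variable width concrete, here is a minimal Python sketch. The helper name utf8_bytes and the sample characters are chosen only for illustration; they are not part of the original article.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of one character as a list of byte values."""
    return list(ch.encode("utf-8"))


# One sample character for each possible UTF-8 length (1 to 4 bytes).
for ch in ["A", "ÿ", "€", "😀"]:
    encoded = utf8_bytes(ch)
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded}")

# Pure ASCII text produces identical bytes under both encodings.
print("Alice".encode("ascii") == "Alice".encode("utf-8"))  # True
```

The last line is the backward compatibility mentioned above: text made only of ASCII characters is byte-for-byte identical in UTF-8.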

ASCII and UTF-8 Position Examples

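Let’s compare the positions of some characters in both ASCII and UTF-8. For the ASCII characters, the UTF-8 encoding is a single byte with the same value; for ÿ, the table lists the values of its two UTF-8 bytes: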

Character	ASCII Position	UTF-8 Position
A	65	65
B	66	66
…	…	…
Y	89	89
Z	90	90
[	91	91
\	92	92
]	93	93
^	94	94
_	95	95
`	96	96
a	97	97
b	98	98
…	…	…
y	121	121
z	122	122
Last ASCII (DEL)	127	127
ÿ	Not present	195 191 (2 bytes)
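
The values in the table can be checked with a few lines of Python. This is only a sketch; the describe helper is a hypothetical name introduced for illustration. It also shows that the Unicode code point of ÿ (255) is not the same thing as its two UTF-8 byte values.

```python
def describe(ch):
    """Return (code point, ASCII byte or None, UTF-8 byte values) for one character."""
    try:
        ascii_byte = ch.encode("ascii")[0]
    except UnicodeEncodeError:
        ascii_byte = None  # the character does not exist in ASCII
    return ord(ch), ascii_byte, list(ch.encode("utf-8"))


for ch in ["A", "Z", "[", "a", "z", "\x7f", "ÿ"]:
    print(repr(ch), describe(ch))

print(describe("ÿ"))  # (255, None, [195, 191])
```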

Case Sensitivity in Database Queries

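Case sensitivity can significantly impact database queries. Whether a comparison is case-sensitive depends on the collation in use rather than on the encoding alone; under a case-sensitive (for example, binary) collation, strings are compared by their encoded values, so uppercase and lowercase letters never match each other.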

ASCII Example

-- Query against an ASCII-encoded column with a case-sensitive collation
SELECT * FROM users WHERE username = 'Alice';
-- This will not return rows with 'alice', 'ALICE', etc.

UTF-8 Example

-- Query against a UTF-8 encoded column with a case-sensitive collation
SELECT * FROM users WHERE username = 'Ålice';
-- This will not return rows with 'ålice', 'ÅLICE', etc.
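
The role of the collation can be tried out without a database server. The sketch below assumes SQLite through Python's built-in sqlite3 module, which is not the engine used in this article, so treat it as an illustration rather than a description of MySQL. The equal helper is a hypothetical name. SQLite's built-in NOCASE collation folds only the 26 ASCII letters, which makes the difference between ASCII and non-ASCII characters easy to see.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def equal(a, b, collation):
    """Compare two strings inside SQLite under one of its built-in collations."""
    if collation not in ("BINARY", "NOCASE"):
        raise ValueError("unsupported collation")
    row = conn.execute(f"SELECT ? = ? COLLATE {collation}", (a, b)).fetchone()
    return bool(row[0])


print(equal("Alice", "alice", "BINARY"))  # False: 65 differs from 97
print(equal("Alice", "alice", "NOCASE"))  # True: ASCII letters are case-folded
print(equal("Ålice", "ålice", "NOCASE"))  # False: Å is outside ASCII, so it is not folded
```

Other engines ship collations that do fold non-ASCII letters, so the third result is specific to SQLite's NOCASE.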

Practical Example with Positions

In ASCII, uppercase and lowercase letters occupy different positions. For example:

A has a position of 65.
a has a position of 97.

In a case-sensitive search, these positions are distinct, so A is not equal to a.

In UTF-8, these letters have exactly the same values, since UTF-8 is backward compatible with ASCII for positions 0 to 127.

Query Example

Let’s demonstrate a query example for usernames within the range >= 'A' and <= 'z'.

-- Query for usernames in the range 'A' to 'z'
SELECT * FROM users WHERE username >= 'A' AND username <= 'z';

Included Characters

Based on the ASCII positions, the single characters that fall in the range >= 'A' and <= 'z' are:

All uppercase letters: A to Z (positions 65 to 90)
Special characters: [, \, ], ^, _, and ` (positions 91 to 96)
All lowercase letters: a to z (positions 97 to 122)
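
The list above can be reproduced with a short Python sketch (variable names are illustrative only):

```python
import string

# Every character whose position lies between 'A' (65) and 'z' (122), inclusive.
in_range = [chr(code) for code in range(ord("A"), ord("z") + 1)]

# The characters in that range that are not letters.
non_letters = "".join(ch for ch in in_range if ch not in string.ascii_letters)

print(len(in_range))  # 58
print(non_letters)    # [\]^_`
```

The six non-letter characters sit between Z and a, which is why a range that looks like "all letters" also matches names such as _special.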

Practical Example with Positions

Given the following table:

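-- Create a table (MySQL syntax); the _bin collation compares strings by code point, so it is case-sensitive
CREATE TABLE users (
    id INT PRIMARY KEY,
    -- utf8mb4 is MySQL's full UTF-8 character set; plain "utf8" is a deprecated alias for the 3-byte utf8mb3
    username VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin
);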

-- Insert some users
INSERT INTO users (id, username) VALUES (1, 'Alice');   -- A = 65, l = 108, i = 105, c = 99, e = 101
INSERT INTO users (id, username) VALUES (2, 'alice');   -- a = 97, l = 108, i = 105, c = 99, e = 101
INSERT INTO users (id, username) VALUES (3, 'Ålice');   -- Å = 195 133, l = 108, i = 105, c = 99, e = 101
INSERT INTO users (id, username) VALUES (4, 'ålice');   -- å = 195 165, l = 108, i = 105, c = 99, e = 101
INSERT INTO users (id, username) VALUES (5, 'Z');       -- Z = 90
INSERT INTO users (id, username) VALUES (6, 'z');       -- z = 122
INSERT INTO users (id, username) VALUES (7, 'ÿ');       -- ÿ = 195 191
INSERT INTO users (id, username) VALUES (8, '_special'); -- _ = 95, s = 115, p = 112, e = 101, c = 99, i = 105, a = 97, l = 108
INSERT INTO users (id, username) VALUES (9, 'example'); -- e = 101, x = 120, a = 97, m = 109, p = 112, l = 108, e = 101

Query Execution

-- Execute the query
SELECT * FROM users WHERE username >= 'A' AND username <= 'z';

Query Result

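With the binary collation, whole strings are compared value by value from the first character onward, so a username matches when it sorts between 'A' and 'z'. In practice that means any name whose first character has a position from 65 to 121, plus the exact value 'z'; a longer name starting with z, such as 'zebra', would sort after 'z' and be left out. This query will include the following usernames: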

Alice (A = 65, l = 108, i = 105, c = 99, e = 101)
Z (Z = 90)
example (e = 101, x = 120, a = 97, m = 109, p = 112, l = 108, e = 101)
_special (_ = 95, s = 115, p = 112, e = 101, c = 99, i = 105, a = 97, l = 108)
alice (a = 97, l = 108, i = 105, c = 99, e = 101)
z (z = 122)

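However, it will not include the following usernames, because the first byte of their UTF-8 encoding (195) is greater than the position of z (122):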

Ålice (Å = 195 133, l = 108, i = 105, c = 99, e = 101, outside the specified range)
ålice (å = 195 165, l = 108, i = 105, c = 99, e = 101, outside the specified range)
ÿ (ÿ = 195 191, outside the specified range)
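
The same outcome can be reproduced outside the database. This Python sketch mimics a binary collation by comparing the UTF-8 bytes of whole strings; it is an approximation for illustration, not the database's actual implementation. The in_binary_range helper is a hypothetical name, and 'zebra' is an extra value that is not in the table above.

```python
usernames = ["Alice", "alice", "Ålice", "ålice", "Z", "z", "ÿ", "_special", "example"]


def in_binary_range(name, low="A", high="z"):
    """Return True if the UTF-8 bytes of name sort between low and high, inclusive."""
    key = name.encode("utf-8")
    return low.encode("utf-8") <= key <= high.encode("utf-8")


matched = sorted((n for n in usernames if in_binary_range(n)),
                 key=lambda n: n.encode("utf-8"))
skipped = [n for n in usernames if not in_binary_range(n)]

print(matched)  # ['Alice', 'Z', '_special', 'alice', 'example', 'z']
print(skipped)  # ['Ålice', 'ålice', 'ÿ']

# A longer string starting with 'z' sorts after 'z' itself, so it is out of range.
print(in_binary_range("zebra"))  # False
```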

Conclusion

Understanding the differences between ASCII and UTF-8 character positions and ranges is crucial when performing case-sensitive queries in databases. For example, querying for usernames within the range >= 'A' and <= 'z' will include a specific set of characters based on their ASCII positions, impacting which rows are returned in your query results.

By grasping these concepts, you can make sure your range queries return the rows you expect, especially when the data contains characters outside the ASCII range.

Design Patterns for Library Creators in Dotnet

by Joche Ojeda | May 14, 2024 | C#, Data Synchronization, dotnet

Hello there! Today, we’re going to delve into the fascinating world of design patterns. Don’t worry if you’re not a tech whiz – we’ll keep things simple and relatable. We’ll use the SyncFramework as an example, but our main focus will be on the design patterns themselves. So, let’s get started!

What are Design Patterns?

Design patterns are like blueprints – they provide solutions to common problems that occur in software design. They’re not ready-made code that you can directly insert into your program. Instead, they’re guidelines you can follow to solve a particular problem in a specific context.

SOLID Design Principles

One of the most popular sets of design principles is SOLID. It’s an acronym that stands for five principles that help make software designs more understandable, flexible, and maintainable. Let’s break it down:

Single Responsibility Principle: A class should have only one reason to change. In other words, it should have only one job.
Open-Closed Principle: Software entities should be open for extension but closed for modification. This means we should be able to add new features or functionality without changing the existing code.
Liskov Substitution Principle: Subtypes must be substitutable for their base types. In other words, code written against a base class should keep working when it is handed an instance of any derived class.
Interface Segregation Principle: Clients should not be forced to depend on interfaces they do not use. This principle is about reducing the side effects and frequency of required changes by splitting large interfaces into smaller, more specific ones.
Dependency Inversion Principle: High-level modules should not depend on low-level modules. Both should depend on abstractions. This principle allows for decoupling.
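
To see how a few of these principles look in code, here is a small language-neutral sketch written in Python. Every name in it (ChangeStore, InMemoryChangeStore, ChangeTracker) is hypothetical and invented for illustration; none of them is taken from the SyncFramework source code.

```python
from abc import ABC, abstractmethod
from typing import List


class ChangeStore(ABC):
    """Abstraction that both layers depend on (Dependency Inversion)."""

    @abstractmethod
    def save(self, change: str) -> None:
        ...

    @abstractmethod
    def load_all(self) -> List[str]:
        ...


class InMemoryChangeStore(ChangeStore):
    """One interchangeable implementation (Liskov Substitution)."""

    def __init__(self) -> None:
        self._changes: List[str] = []

    def save(self, change: str) -> None:
        self._changes.append(change)

    def load_all(self) -> List[str]:
        return list(self._changes)


class ChangeTracker:
    """Has one job, recording changes (Single Responsibility)."""

    def __init__(self, store: ChangeStore) -> None:
        # The store is injected, so new store types need no edits here (Open-Closed).
        self._store = store

    def track(self, change: str) -> None:
        self._store.save(change)


store = InMemoryChangeStore()
tracker = ChangeTracker(store)
tracker.track("row 1 updated")
print(store.load_all())  # ['row 1 updated']
```

Adding, say, a file-based store would mean writing one new class that implements ChangeStore; ChangeTracker itself would stay untouched.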

Applying SOLID Principles in SyncFramework

The SyncFramework is a great example of how these principles can be applied. Here’s how:

Single Responsibility Principle: Each component of the SyncFramework has a specific role. For instance, one component is responsible for tracking changes, while another handles conflict resolution.
Open-Closed Principle: The SyncFramework is designed to be extensible. You can add new data sources or change the way data is synchronized without modifying the core framework.
Liskov Substitution Principle: The SyncFramework uses base classes and interfaces that allow for substitutable components. This means you can replace or modify components without affecting the overall functionality.
Interface Segregation Principle: The SyncFramework provides a range of interfaces, allowing you to choose the ones you need and ignore the ones you don’t.
Dependency Inversion Principle: The SyncFramework depends on abstractions, not on concrete classes. This makes it more flexible and adaptable to changes.

And that’s a wrap for today! But don’t worry, this is just the beginning. In the upcoming series of articles, we’ll dive deeper into each of these principles. We’ll explore how they’re applied in the source code of the SyncFramework, providing real-world examples to help you understand these concepts better. So, stay tuned for more exciting insights into the world of design patterns! See you in the next article!

If you want to learn more about data synchronization, you can check out the following blog posts:

Data synchronization in a few words – https://www.jocheojeda.com/2021/10/10/data-synchronization-in-a-few-words/
Parts of a Synchronization Framework – https://www.jocheojeda.com/2021/10/10/parts-of-a-synchronization-framework/
Let’s write a Synchronization Framework in C# – https://www.jocheojeda.com/2021/10/11/lets-write-a-synchronization-framework-in-c/
Synchronization Framework Base Classes – https://www.jocheojeda.com/2021/10/12/synchronization-framework-base-classes/
Planning the first implementation – https://www.jocheojeda.com/2021/10/12/planning-the-first-implementation/
Testing the first implementation – https://youtu.be/l2-yPlExSrg
Adding network support – https://www.jocheojeda.com/2021/10/17/syncframework-adding-network-support/
