UTF-8 is a variable-length encoding of Unicode characters (here we only care about code points that fit in 16 bits). Some characters (ASCII 0-127) are encoded as a single byte, but the others are split into 2 or 3 bytes. E.g. the character "é" in latin-1 is Unicode character 0x00e9. Using the 2-byte encoding, its bits are split as yyy.yyxx.xxxx == 000.1110.1001, so the encoded bytes 110y.yyyy 10xx.xxxx get the value 1100.0011 1010.1001 (which is 0xC3 0xA9).
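For example, here is a minimal C sketch of that split (not from the original post), done with shifts and masks:
Code:
unsigned code = 0x00e9;                  /* Unicode code point for 'é' */
unsigned char b1 = 0xC0 | (code >> 6);   /* 110y.yyyy: the top 5 bits  */
unsigned char b2 = 0x80 | (code & 0x3F); /* 10xx.xxxx: the low 6 bits  */
/* b1 == 0xC3, b2 == 0xA9 */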
Now when decoding the UTF-8 string, you want to get back the value 0x00e9 by calling
Code:
unsigned char utf8[] = {0xc3, 0xa9};
unsigned char *p = utf8;                   /* pointer that the decoder will advance */
unsigned unicode = get_utf_char(&p);       /* should return 0xe9 */
The decoder looks at the high bits of the current byte in the stream. If the top nibble matches "1110", you have a 3-byte encoding (that's the test [tt]if ((c&0xF0)==0xE0)[/tt]).
If the top bits match "110*", you have a 2-byte encoding (that's our case here; the corresponding test is [tt]if ((c&0xE0)==0xC0)[/tt]). You then extract the "interesting bits" from each byte of the sequence and combine them.
If they match "10**", it's no good: it means you are trying to decode the middle of a multi-byte character...
If they match "0***", it's a single-byte character which you can return without any extra work.
In any other case, it's not a valid UTF-8 stream... (All of these tests are combined in the sketch below.)
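Putting those tests together, a decoder along these lines would do the job (just a sketch: the name [tt]get_utf_char[/tt] and the pointer-to-pointer argument are assumptions made to match the call above, not necessarily your exact function):
Code:
/* Decode one UTF-8 sequence of 1 to 3 bytes and advance the pointer
   past the bytes that were consumed. Returns the code point, or
   0xFFFFFFFF if the stream is not valid UTF-8. */
unsigned get_utf_char(unsigned char **p)
{
    unsigned char c = **p;

    if ((c & 0x80) == 0x00) {                /* 0xxx.xxxx: single byte */
        *p += 1;
        return c;
    }
    if ((c & 0xE0) == 0xC0) {                /* 110y.yyyy 10xx.xxxx */
        unsigned char c2 = (*p)[1];
        if ((c2 & 0xC0) != 0x80) return 0xFFFFFFFFu;
        *p += 2;
        return ((c & 0x1F) << 6) | (c2 & 0x3F);
    }
    if ((c & 0xF0) == 0xE0) {                /* 1110.zzzz 10yy.yyyy 10xx.xxxx */
        unsigned char c2 = (*p)[1], c3 = (*p)[2];
        if ((c2 & 0xC0) != 0x80 || (c3 & 0xC0) != 0x80) return 0xFFFFFFFFu;
        *p += 3;
        return ((c & 0x0F) << 12) | ((c2 & 0x3F) << 6) | (c3 & 0x3F);
    }
    /* 10xx.xxxx (middle of a sequence) or anything else: invalid here */
    return 0xFFFFFFFFu;
}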
If that still doesn't help, I suggest you spend some time figuring out how masking and shifting work...