Prototyping ASM in a higher level language

JohnpaulTH · **Joined:** Sat May 23, 2020 6:51 pm **Posts:** 20

Hi All,
Instead of "help me" I thought I would make a thread on prototyping asm in higher level languages. I write in lua and nasm. My nasm skills may not be the best, but I enjoy writing in assembly, that is until it doesn't work. Unfortuantely, with asm anything that goes wrong is my fault, so I have used a certain measure of prototype code in lua that does the same thing in a similar way. I wrote a routine to get a hexadecimal number from the user and store the result in rax.
Since the function complained of invalid input given valid input, I decided to rewrite it in lua.
The problem is my lua code works but my asm doesn't. (I am not trying to use lua for osdev, just for debugging.) This shows me that my lua prototype isn't acurrite enough. I am sure the problem with my NASM is simple, and after some good debugging, I should be able to get to the root of the problem, like with the rest of the problems I have fixed.

Here are both snippets of code. Take a deep breath, my ASM is not the best.

Code:

GETHEX:
push rsi
push rbx
.HEXREAD:
mov rbx, 0x200000
mov rsi, rbx
call READLN
xor rbx, rbx
.loadchar: ;while i < #m200000 do
        lodsb
        xor al, 0
        jz .end
        cmp al, '0'
        jl .fail
        cmp al, '9'
        jg .net1
        sub al, '0'
        jmp .save
.net1:
        cmp al, 'A'
        jl .fail
        cmp al, 'F'
        jg  .net2
        sub al, 'A'
        add al, 10
        jmp .save
.net2:
        cmp al, 80h
        jl .fail
        cmp al, 85h
        jg .fail
        sub al, 80h
        add al, 10
        jmp .save
.end:
mov rax, rbx
pop rbx
pop rsi
ret

.save:
shl rbx, 4
or bl, al
jmp .loadchar

.fail:
;print something
mov rsi, TEXT.FAIL
call WRITELN
jmp .HEXREAD

TEXT:
.FAIL: "FAILED TO PARSE HEX",0

And here is my Lua prototype:

Code:

function hexread()
-- lua equivalent for debugging.
io.write("0x")
local m200000 = io.read()
-- now convert to hexadecimal. If it isn't a valid hexadecimal char, clear it and start over.
-- QUARRY
i = 1
local rbx = 0
while i <= #m200000 do
        al = m200000:byte(i)
        if al < string.byte("0") then
                -- fail
                print("FAILED TO PARSE HEX. INVALID CHARS")
                return hexread()
        else
                if al > string.byte("9") then
-- NET 1
                        if al < string.byte("A",1) then
                                print("FAILED TO READ HEX")
                                return hexread()
                        else
                                if al > string.byte("F") then
                                        -- NET TWO WOULD GO HERE, BUT IS BEYOND SCOPE
                                        print("FAILED TO READ HEX")
return hexread()
                                else
                                        -- subtract 'A' from al
                                        -- and then add ten
                                        al = al - string.byte("A")
                                        al = al + 10
                                end
                        end
                else
                        -- subtract '0'
                        al = al - string.byte("0")
                end
        end
        rbx = rbx << 4
        rbx = rbx | al
        i = i+1
end
return rbx
end

When I type valid HEX in, I get the INVALID HEX message.
My function doesn't get past the jl directive at the cmp al, '0'.

For my assembly, I should mention that READLN and WRITELN are both tested functions that work. READLN gets keys from the user, and stores the resulting string at address rbx placing a NULL at the end, when the user presses enter. (the LF is not saved in memory) This has been confirmed working.
WRITELN also works, it takes a rsi at a specified address, and prints the string to video memory.
Finally, I should mention that characters 80h to 85h are special characters that I also want to be able to use with hexadecimal.

So here is my question, Since my Lua code works, and my ASM does not, I am not accuritely representing asm in lua.
What are some tips to better prototype asm?
I see several problems with my current approach:

Using m200000 to represent that slot in memory is pretty dumb. I have tested and the asm function is fetching the right characters from memory, so it is not a memleak causing it. EDIT: It totally was, but in a different way. However this strengthens the question. That said, are there better strategies for prototyping with memory slots. (also, I probably should have READLN place the value on the stack.)
Since my lua "registers" are actually ordinary variables, I can see register related problems arising,
lua uses a more structured approach to conditional execution that x86, so label related problems could go undetected. In some ways I think there is an advantage to it (like being able to express your idea in a basic fashion, and then adjust for assembly) but obviously without a GOTO that could be problematic.

Obviously there are better languages to prototype routines in, but I am not fluent in them, and would rather spend the time talking to the hardware than learning a new language.

TL;DR:

My ASM was okay, but there was a memory problem, that the prototype code didn't detect.
What are some tips for prototyping asm in a higher level language?
When, if ever should this be done?
And finally, how useful is prototyping in debugging?

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5137

JohnpaulTH wrote:

What are some tips for prototyping asm in a higher level language?

Pick a language like C and use a compiler to turn it into assembly. :wink:

JohnpaulTH wrote:

When, if ever should this be done?

When you don't know how to lay out the program flow, or need to know if the algorithm will work before you implement it.

Your prototype proves that your algorithm is correct, and you already figured out the program flow before you wrote the prototype. The problems you're having now are best solved using a debugger and a CPU manual, not a prototype.

For example, using your debugger and CPU manual you might spot that the LODSB instruction behaves differently depending on the state of the direction flag, or that JL and JG perform signed comparisons instead of unsigned comparisons.

JohnpaulTH · **Joined:** Sat May 23, 2020 6:51 pm **Posts:** 20

Octocontrabass wrote:

Pick a language like C and use a compiler to turn it into assembly. :wink:

Ah, but that takes all the fun out :grin:

Octocontrabass wrote:

or that JL and JG perform signed comparisons instead of unsigned comparisons.

This is interesting. In my case, a debugger solved my problem, which was that my read function was not placing a zero at the end, which I hadn't detected before because I was reading into zeroed territory. However jl has not worked for me before. perhaps jb is better? I hadn't thought about the implications of that.
So it seems like prototype for testing algorithm, debugger for testing implementation.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1604

JohnpaulTH wrote:

Since the function complained of invalid input given valid input, I decided to rewrite it in lua.

That means you weren't actually prototyping. Prototyping would mean you write it in the higher level language first, then translate that to assembler. So basically, what a compiler is doing.

This can be a good thing, but can also be a bad thing. In assembler you can do a lot of things different from the higher level languages. You can have cases flow into each other, you can branch out of comparisons to different points, etc.

For example, I once wrote a heap sort. And to heapify its internal heap, I wrote the standard algorithm:

Code:

for (;;) {
  size_t left = 2 * this + 1;
  size_t right = left + 1;
  if (right < nmemb) {
    if (cmp(this, left) < 0 || cmp(this, right) < 0) {
      if (cmp(left, right) < 0) {
        swap(this, right);
        this = right;
      } else {
        swap(this, left);
        this = left;
      }
    } else break;
  } else break;
}

However, when you look at it, you will notice that it isn't optimal. It performs too much work. In the case where the left child is smaller or equal to the head node, but the right child is larger, evaluating the first comparison already gives you exactly that data. Two comparisons ought to be enough in that case. But this code will perform an unnecessary comparison anyway, because it is easier to write this way.

Trying to implement that idea in C, however, I noticed that I had really no idea how to do it without either duplicating code or using a goto. Luckily, in assembler, goto is the only way, anyway.

JohnpaulTH wrote:

However jl has not worked for me before. perhaps jb is better? I hadn't thought about the implications of that.

JL and JG (etc.) are for signed numbers, JB and JA (etc.) are for unsigned numbers. Also, to test for ranges, I really like to use the following trick: Suppose you want to know if x is between a and b. So the inequation is "a <= x <= b". You can subtract a from all inequations and get "0 <= x - a <= b - a". And now you can cast the expression to unsigned to make negative numbers wrap around, and thereby become much bigger than your comparison. This means, however, that instead of two comparisons, you now make one subtract and one comparison, thereby simplifying the control flow massively.

In context, I would write it like this:

Code:

  sub al, '0'
  cmp al, 10
  jb .save
  sub al, 'A' - '0'
  cmp al, 7
  jb .save1
  sub al, 'a' - 'A'
  cmp al, 7
  jb .save1
  ; failure
  mov rsi, TEXT.FAIL
  call WRITELN
  jmp .HEXREAD
.save1:
  add al, 10
.save:
  shl rbx, 4
  or bl, al
  jmp .loadchar

JohnpaulTH · **Joined:** Sat May 23, 2020 6:51 pm **Posts:** 20

nullplan wrote:

That means you weren't actually prototyping.

Good to know, what do you call what I was doing?

nullplan wrote:

This can be a good thing, but can also be a bad thing. In assembler you can do a lot of things different from the higher level languages.

Good points.

Quote:

or using a goto.

Goto is not rpesent in lua, thus making it an even poorer choice for prototyping.

Quote:

I really like to use the following trick:

[/quote]
That is so cool! Thanks for sharing. I get so narrow minded that I forget about those other variants of jmp.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5137

nullplan wrote:

Trying to implement that idea in C, however, I noticed that I had really no idea how to do it without either duplicating code or using a goto.

GCC allows you to declare functions like cmp() as pure so the optimizer can remove redundant calls. For it to work in this case, you'd have to rewrite one of the comparisons in terms of the results of the other two.

It doesn't help as much in this case, but C boolean operators skip evaluating the right operand if the result is already determined by the left operand.

MichaelFarthing · **Posted:** Wed May 12, 2021 2:54 am

You made a comment on jb and jl earlier in the thread. This can indeed be a confusing matter. I'm not quite sure of the exact place you were referring to but I did note at one point you had:

cmp al 0
jl ....

This jump occurs when al has a value between 0x80 and 0xff because jl works on the basis that al will be signed and these are the negative values

jb ..

A jump would never occur using this instruction in this situation. jb works on the assumption that al is unsigned and al can never be below zero.

I haven't studied your code in detail but it might be part of your problem. In any case I hope it sorts out the difference between jl and jb for you.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1604

JohnpaulTH wrote:

Good to know, what do you call what I was doing?

I'm not sure it has a name. I have sometimes done the same to debug an assembler issue. That was when I realized that compilers don't get tired and are very efficient at checking types, and my long assembler listings contained lots of errors caused by fatigue and forgetting types. So then I switched to only using assembler where it was unavoidable, and have been a much happier person since then.

Octocontrabass wrote:

GCC allows you to declare functions like cmp() as pure so the optimizer can remove redundant calls.

Ah, but the fact that they are redundant is only given in this case because `cmp' establishes an order on its inputs. Which GCC doesn't know about and, to my knowledge, can't be told about. If you don't know that, you only see three function calls with different arguments. Also, I can't use that in the particular case here because `cmp' was actually a function pointer, and the code was abbreviated for readability.

The point I was trying to make is, as I was turning that into assembler for fun (don't ask), I noticed that I would have to add a conditional jump after each cmp call anyway, and therefore it is no issue to jump to "use the right child" after the second cmp(). And there is no equivalent way to write that in structured C without copying code, at least not that I could find.

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5137

nullplan wrote:

Ah, but the fact that they are redundant is only given in this case because `cmp' establishes an order on its inputs. Which GCC doesn't know about and, to my knowledge, can't be told about. If you don't know that, you only see three function calls with different arguments.

Right. You'd have to explicitly redo the first two comparisons to tell GCC that it can reuse their return values instead of performing the third comparison.

nullplan wrote:

Also, I can't use that in the particular case here because `cmp' was actually a function pointer, and the code was abbreviated for readability.

Function attributes can be applied to function pointers.

Fantasy94 · **Joined:** Fri Apr 30, 2021 9:38 am **Posts:** 5

JohnpaulTH wrote:

Here are both snippets of code. Take a deep breath, my ASM is not the best.

I know that what I am going to say does not go with the subject in particular.

But haven't you thought about using the SSE2 suite? It would give more quality to your code, since the current one is full of comparisons and jumps.

Regards.

JohnpaulTH · **Joined:** Sat May 23, 2020 6:51 pm **Posts:** 20

Fantasy94 wrote:

But haven't you thought about using the SSE2 suite? It would give more quality to your code,

Thanks for the tip! If I understand correctly, SSE2 is an extension to the x86 instruction set.
If I am correct, than looking over it I see things that could come in handy, but most of them seem to be about floating point. There are a few that could be useful, especially the packed ones. Can you give a specific example of how it might improve control flow?

Also bonus question, do these need to be taken into consideration when prototyping?

Octocontrabass · **Joined:** Mon Mar 25, 2013 7:01 pm **Posts:** 5137

JohnpaulTH wrote:

Also bonus question, do these need to be taken into consideration when prototyping?

Instruction set extensions tend to have very specific functionality, so you have to choose an algorithm where they'll be useful. Also, depending on your target, you might need to consider what will happen if a particular extension isn't available.

Fantasy94 · **Joined:** Fri Apr 30, 2021 9:38 am **Posts:** 5

JohnpaulTH wrote:

Fantasy94 wrote:

But haven't you thought about using the SSE2 suite? It would give more quality to your code,

Thanks for the tip! If I understand correctly, SSE2 is an extension to the x86 instruction set.
If I am correct, than looking over it I see things that could come in handy, but most of them seem to be about floating point. There are a few that could be useful, especially the packed ones. Can you give a specific example of how it might improve control flow?

Also bonus question, do these need to be taken into consideration when prototyping?

That is not entirely true, you have instructions to make purchases and create bit masks.

I show you the function that I usually use in my programs, in this case it is a strlen function. But you can modify it for your case, if it is the same you are making comparisons and discarding.

Code:

.section .text
.globl _start
.align 16

_start:

pushq %r8
pushq %r9
pushq %r12

movq %r8, %r9
pxor %xmm1, %xmm1

_x0:

movdqu (%r8), %xmm0
pcmpeqb %xmm0, %xmm1
pmovmskb %xmm1, %r12
addq $16, %r8
andq $0xffffffffffffffff, %r12
jz _x0

bsf %r12, %r12
subq $16, %r8
subq %r9, %r8
addq %r12, %r8

popq %r12
popq %r12

In this case the register xmm1 is at 0, I pass the first group of the chain in xmm0.

If there are any 0s in the string, that would denote its end. PCMPEQB would fill that byte of 1s, it would stay at 0xff. Otherwise, it would remain at 0, pmovmskb would pass a 0 to register r12, and the conditional jump jz would not be fulfilled.

Since the AND would give a zero to the ZF flag.

But when finding a 0 in the string PCMPEQB it would give a 0xff to xmm1, and PMOVMSKB would have a value other than 0 and the conditional jump would not be fulfilled.

BSF is in charge of finding the most significant bit, in r12 you have a 16-bit mask, which refers to the 16 bytes of the xmm0 register, in this case having a 0xfff8 mask, it would become a 3 byte.

Sorry, I know my explanation is not perfect, but you can take this code for reference.

You can discard the bytes that do not belong to the hexadecimal set and others, it would gain a lot of quality in your routines, since you calculate 16 bytes at a time, not byte by byte.

Sorry, if I have strayed a lot, I hope you find it interesting.

Regards.

P.S:

The pointer of the chain is in r8, the result is the length in r8.

nullplan · **Joined:** Wed Aug 30, 2017 8:24 am **Posts:** 1604

Fantasy94 wrote:

But haven't you thought about using the SSE2 suite? It would give more quality to your code, since the current one is full of comparisons and jumps.

SSE2 in a kernel (which I presumed this would end up in) is a bad idea, because it necessitates a full FPU register save/restore on every kernel entry/exit, rather than just on task switch. Besides, replacing jumps with calculations can be a fun exercise, but is not usually conducive to readability. Also, only a few algorithms even can be parallelized enough for SSE to be useful. If you have a data-dependent algorithm where the result for byte n depends on the result for byte n-1, you have lost. Unless you can somehow break that dependency, you must tackle these things in sequence. For example, a CRC32 algorithm can't really be parallelized. The CRC after byte n always depends on the CRC after byte n-1, you cannot combine CRCs (at least, not that I know), and so the most you can do is calculate the CRC in word size steps at a time. But there is no SSE instruction to make the CPU go and index a table with each byte simultaneously. Which means the look-up table access becomes the bottle neck, at which point SSE is not really useful. And if you forego the look-up table, you have to perform the bitwise algorithm. Good luck parallelizing that one. You can get rid of the jumps in it, but you really can't parallelize it.

Your strlen() is unsafe if access becomes invalid between the zero-byte you are looking for and the end of the unaligned 16-byte read. Which is easy to do:

Code:

  char* p = mmap(0, 8192, PROT_NONE, MAP_ANONYMOUS, -1, 0); /* two pages with no access. */
  mprotect(p, 4096, PROT_READ | PROT_WRITE); /* make the first of these accessible */
  strcpy(p + 4092, "bad"); /* write a C string right at the end of the first page */
  strlen(p + 4092); /* BOOM! */

To prevent that from being a problem, you have to align the pointer in a bytewise loop before doing the word-at-a-time loop. That ensures that either all sixteen bytes can be read, or none of them can (since the page size is definitely a multiple of the word size).

Of course, this is a general problem with strlen(), so maybe in a kernel, don't have strlen() at all, but only strnlen()? I'm learning a lot about the foibles of the C standard library by writing a kernel, and since I have the freedom to do so, I fix what I can. I don't have strchr(), for instance, only strindex() which returns (who'd have thought?) an index, not a pointer. If the character is not found, it returns the string length. This avoids the problem of removing const-ness from the pointer that the standard interface has.

eekee · **Posted:** Sun Sep 05, 2021 1:40 am

nullplan wrote:

For example, I once wrote a heap sort. And to heapify its internal heap, I wrote the standard algorithm:

Code:

for (;;) {
  size_t left = 2 * this + 1;
  size_t right = left + 1;
  if (right < nmemb) {
    if (cmp(this, left) < 0 || cmp(this, right) < 0) {
      if (cmp(left, right) < 0) {
        swap(this, right);
        this = right;
      } else {
        swap(this, left);
        this = left;
      }
    } else break;
  } else break;
}

However, when you look at it, you will notice that it isn't optimal. It performs too much work. In the case where the left child is smaller or equal to the head node, but the right child is larger, evaluating the first comparison already gives you exactly that data. Two comparisons ought to be enough in that case. But this code will perform an unnecessary comparison anyway, because it is easier to write this way.

Trying to implement that idea in C, however, I noticed that I had really no idea how to do it without either duplicating code or using a goto. Luckily, in assembler, goto is the only way, anyway.

C has goto.

It can't jump out of a function, but it's there.

I was thinking of suggesting Forth as the prototyping language. It's much like assembler because it's just sequences of routine/macro calls, but makes it easier to build high-level constructs, it's partially interactive which is useful for testing, and displaying the stack is easy. (So it's almost like using a debugger.) On the other hand, many modern Forth's lack the goto-like branch and ?branch primitives. Where they are present, there may not be any way to define labels. (They're used for building higher-level conditionals and loops.) But now I think about it, coding a label system would be pretty easy.

OSDev.org

Prototyping ASM in a higher level language

Who is online