bzt wrote:
Ethin wrote:
It uses `goto`, which makes it a bit troublesome to port to other languages.
First, `goto` is a valid ANSI C keyword, and it is perfectly fine to use it (just do not overuse it, which is also true for any other language feature). Second, it's mostly used in sprintf to make it compact, so if your preferred language supports sprintf there's no reason to port that function in the first place. The one and only "goto dis" outside of sprintf could be avoided easily by duplicating the "disassemble bytecode" block (lines 347 and 355 to 365). That's no more than 11 additional SLoC.
Under no circumstances would I call this "troublesome".
Did I ever say that it wasn't a perfectly legitimate usage of ANSI C? I've used it before -- I know exactly what goto does and I'm okay with its use. It just makes it a bit more troublesome to port to other languages, because it requires violating the DRY principle when those languages don't possess that keyword.
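To make the porting point concrete, here is a minimal sketch of the pattern being discussed. It is not the code from the project: `cmd`, `list_from`, `run_goto` and `run_helper` are hypothetical names invented for illustration. The first function shares one block between two paths through a label, the way a "goto dis" would. The second shows one way to get the same behaviour in a language without `goto`, by hoisting the shared block into a helper call instead of duplicating it.

```c
/* Hypothetical commands; not taken from the original source. */
enum cmd { CMD_STEP, CMD_DISASM, CMD_OTHER };

/* Hypothetical stand-in for the shared "disassemble bytecode" block:
 * pretends to list 16 bytes and returns the end address of the listing. */
static int list_from(int pc)
{
    return pc + 16;
}

/* Variant with goto: CMD_DISASM jumps straight into the shared block,
 * CMD_STEP does its own work first and then falls into it. */
int run_goto(enum cmd c, int pc, int *listed)
{
    if (c == CMD_DISASM)
        goto dis;
    if (c == CMD_STEP) {
        pc += 4;                 /* step-only work */
dis:
        *listed = list_from(pc); /* shared block */
    }
    return pc;
}

/* Variant without goto: the shared block is a function call guarded by
 * a combined condition, so nothing has to be written twice. */
int run_helper(enum cmd c, int pc, int *listed)
{
    if (c == CMD_STEP)
        pc += 4;                 /* step-only work */
    if (c == CMD_STEP || c == CMD_DISASM)
        *listed = list_from(pc); /* shared block */
    return pc;
}
```

Whether the real block can be hoisted this cleanly depends on how many local variables it touches, which is where the porting effort would actually go.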
bzt wrote:
Ethin wrote:
it does have an awesome and blazingly fast x86 disassembler that I'd use instead of the handwritten one.
I've just checked, iced has dependencies, even some Google code, and it's over 1 Mbyte in size. My implementation works for AArch64 too, and it is not handwritten: it is generated by a script for speed and compact size, and is just ca. 40 Kbytes. (Feel the difference: 1024K vs. 40K; even the x86 and ARM disassemblers combined are no more than 187K.)
The difference is that iced is a single x86 disassembler and it supports a lot more functionality than yours does. Yours might support more architectures, but that one supports more functionality specific to x86: all the ISA changes that have accumulated over the years, all the output formats, etc. The compiler is smart enough to eliminate dead code, and that crate's size is controllable via crate features. The various features specify what's included:
- decoder: enables instruction decoding/disassembly
- encoder: enables instruction encoding/reassembly
- block_encoder: enables the block encoder, which also enables the encoder
- op_code_info: enables the retrieval and examination of instruction opcodes
- instr_info: enables retrieval and examination of full instructions
- gas, intel, masm, nasm: enables an instruction disassembly format
- fast_fmt: enables the fast formatting routines, speeding up formatting by at least 3.3x
- std: enables depending on the standard library
- exhaustive_enums: enables exhaustive enumerations (covering all possible values). Definitely increases code size
- no_vex, no_evex, no_xop, no_d3now: disables various instruction subsets
So, as you can see, the disassembler is highly customizable and you can configure precisely what you want. The compiler will take care of the rest. Toggling a crate feature determines whether the associated code is emitted at all. It is a direct analog of the C preprocessor's conditional compilation directives. The reason your disassembler is smaller is that it is highly tuned for your use case. The iced disassembler, on the other hand, is not only more generalized, but is written with a cross-language architecture in mind.
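For anyone who hasn't used crate features, the C side of that analogy looks roughly like the sketch below. The `FEATURE_DECODER` and `FEATURE_ENCODER` macros and the functions are hypothetical names made up for this example (they are not part of iced or of any real project), and a real build would normally pass them as `-D` flags rather than define them in the file.

```c
#include <stddef.h>

/* Hypothetical feature switches. FEATURE_ENCODER is deliberately left
 * undefined, so the encoder code below is never compiled. */
#define FEATURE_DECODER 1

#ifdef FEATURE_DECODER
/* Compiled only when the decoder feature is on.
 * Toy decoder: recognises a one-byte NOP (0x90) and nothing else. */
size_t decode_len(const unsigned char *bytes)
{
    return bytes[0] == 0x90 ? 1 : 0;
}
#endif

#ifdef FEATURE_ENCODER
/* Compiled only when the encoder feature is on. */
size_t encode_nop(unsigned char *out)
{
    out[0] = 0x90;
    return 1;
}
#endif

/* Reports which of the hypothetical features this build contains. */
const char *enabled_features(void)
{
#if defined(FEATURE_DECODER) && defined(FEATURE_ENCODER)
    return "decoder+encoder";
#elif defined(FEATURE_DECODER)
    return "decoder";
#elif defined(FEATURE_ENCODER)
    return "encoder";
#else
    return "none";
#endif
}
```

With the encoder switch off, `encode_nop` does not exist in the object file at all, which is the same effect a disabled crate feature has.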
bzt wrote:
Ethin wrote:
Additionally looking at the code it also defines the uint8_t, uint16_t, ..., types manually instead of using stdint.h.
Yeah, because not all bare metal projects have stdint.h. For userspace code you can rely on stdint (either as a header file or as a compiler built-in, but it must exist), but in freestanding mode it depends on the compiler: there might be no include files at all unless you build a cross-compiler with sysroot support, and gcc might have a built-in version of that header while other compilers might not. If this bothers you so much, just replace the typedefs with an include. I've used the standard names, so this should be no problem; this is hardly a roadblock for porting.
Have you ever seen a compiler that does not include or generate stdint.h? I would consider it a defect in the compiler if it didn't, for cross-compilation purposes. The C standard mandates that uint[8/16/32/64]_t, where provided, have exactly the stated bit widths, but it gives [signed/unsigned] char, short, long, and long long only minimum widths. Though it is unlikely to go wrong, it is still a gamble to depend on those types having the bit widths that you expect. (E.g., on a RISC-V system, unsigned long long may be 128 bits, not 64.) See section 6.2.5 of C18 for more info, as well as footnotes 38, 39, and 40.
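If the manual typedefs stay, the gamble can at least be turned into a compile error. A minimal sketch, assuming a C11 compiler: the `my_u8` ... `my_u64` names and the `value_bits` helper are hypothetical (chosen so they don't clash with the standard names), and the checks use the `_Static_assert` keyword so no header is needed.

```c
/* Hypothetical hand-written typedefs, as a freestanding project might
 * define them when it does not want to rely on stdint.h. */
typedef unsigned char      my_u8;
typedef unsigned short     my_u16;
typedef unsigned int       my_u32;
typedef unsigned long long my_u64;

/* Converting -1 to an unsigned type yields that type's maximum value,
 * so each check fails at compile time if the width is not as expected. */
_Static_assert((my_u8)-1  == 0xFFu,                  "my_u8 must be 8 bits");
_Static_assert((my_u16)-1 == 0xFFFFu,                "my_u16 must be 16 bits");
_Static_assert((my_u32)-1 == 0xFFFFFFFFu,            "my_u32 must be 32 bits");
_Static_assert((my_u64)-1 == 0xFFFFFFFFFFFFFFFFull,  "my_u64 must be 64 bits");

/* Run-time helper: counts the value bits needed to represent max. */
unsigned value_bits(unsigned long long max)
{
    unsigned n = 0;
    while (max != 0) {
        n++;
        max >>= 1;
    }
    return n;
}
```

On a target where one of the assumptions does not hold, the build stops with the corresponding message instead of silently miscomputing.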
bzt wrote:
Ethin wrote:
it's just not very portable across other languages without a lot of back-bending, so to speak.
I'm not so sure about that, but granted, being easily portable to other languages was never its goal, being usable without dependency in any C project was.
Cheers,
bzt
Fair enough.
Edit: to clarify: the op_code_info and instr_info features explicitly enable retrieving and examining extra information about the instructions that are decoded. (And, yes, there are *a lot* of functions that one can use on an instruction -- see the InstructionInfo and the OpCodeInfo structs for the info that these features enable, and see the Instruction struct for a list of all the functions that are available on individual instructions.) I could very easily see the iced-x86 crate being used in, say, professional disassemblers or debuggers (or, hell, hobby disassemblers/debuggers, even), if only because it allows a deep-dive look at instructions, as well as in-place modification of them and moving them around. So another reason it's so large is that it's an assembler and disassembler in one package. Granted, other disassemblers offer the same level of analysis, but you have to admit that this is definitely a neat project.