OSDev.org

The Place to Start for Operating System Developers

All times are UTC - 6 hours




 Post subject: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 10:04 am 
Offline
Member
Member

Joined: Sun Jun 23, 2019 5:36 pm
Posts: 618
Location: North Dakota, United States
So, I'm posting this here because this is behavior that I've never seen exhibited by any libstdc++ ever. As you guys know, I'm taking a course in compiler design. I haven't touched my lexer in months -- this semester has almost entirely focused on the parser side of things -- and my lexer has worked up until this point (and of course it had to fail now because, of course, it's finals week). The strange problem is, it's not failing on my computer, but on my professor's when he tries to validate my code. I'm on Linux and he's on Windows, which may have something to do with it, but on the previous assignments he never experienced this problem, and we're both stumped. (For reference, the parser/compiler is for a miniature version of Pascal.)

The problem occurs when my lexer goes to read in a token:
Code:
    while (true) {
        auto pos = in.tellg();
        std::uint8_t c = 0;
        in >> std::noskipws >> c; // problem occurs on this line...
        // Figure out what we've got
        if (STATE_TBL[c][static_cast<std::uint64_t>(state)] ==
                DfaState::Accept ||
            !c) {
            in.seekg(pos);
            if (!trim(str).empty()) {
                // Store token...
            }
            if (c == 0) {
                break;
            }
            str.clear();
            prev_state = DfaState::Whitespace;
            state = DfaState::Whitespace;
            continue;
        }
        if (STATE_TBL[c][static_cast<std::uint64_t>(state)] ==
            DfaState::Error) {
            std::stringstream ss;
            ss << "Invalid token at position " << in.tellg()
               << ": was parsing char " << unsigned(c) << " in state "
               << unsigned(state) << "; got " << str
               << "\nTransitional state: " << unsigned(c)
               << ", transitions to state "
               << unsigned(STATE_TBL[c][static_cast<std::uint64_t>(state)])
               << " from state " << unsigned(prev_state) << " and "
               << unsigned(state);
            throw std::runtime_error(ss.str());
        } else {
            str += static_cast<unsigned char>(c);
            prev_state = state;
            state = STATE_TBL[c][static_cast<std::uint64_t>(state)];
        }
    }

Specifically, the problem he's having is that this line (pascal):
Code:
program code;

Breaks after it parses the space after "program". It reads "program", a space, and then immediately hits end-of-file and refuses to parse anything else. The file though contains much more; my copy, for example, contains this test code:
Code:
program code;

var x,y,sum:integer;
var A:integer;

procedure SumAvg(P1,P2:integer; var Avg:integer);
var s:integer;
begin
    s:=P1+P2;
    sum:=s;
    Avg:=sum/2;
end;

begin
    x:=5;
    y:=7;
    SumAvg(x,y,A);
end.

I'm completely baffled because this doesn't happen on my machine and it's never happened before with any prior assignment. I did update the zip file that I was going to submit for code-formatting reasons, so maybe that did something, but both of our editors are showing that the test code does in fact contain a lot more than just "program " and so the lexer should see the same. Is there something really weird going on between my libstdc++ and his? Is this just a weird Windows thing?


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 10:14 am 
Offline
Member
Member

Joined: Mon Mar 25, 2013 7:01 pm
Posts: 5099
Ethin wrote:
Is this just a weird Windows thing?

It might be. Did you open the file in binary mode?


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 10:32 am 
Offline
Member
Member

Joined: Sun Jun 23, 2019 5:36 pm
Posts: 618
Location: North Dakota, United States
Octocontrabass wrote:
Ethin wrote:
Is this just a weird Windows thing?

It might be. Did you open the file in binary mode?

No. I left out the mode parameter and just let the library select the default (which presumably is "text mode").
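For anyone following along, here is a minimal sketch of what opening the file in binary mode looks like. The helper name `read_all_bytes` and the demo file name are made up for illustration and are not from the lexer above; the only relevant part is the `std::ios::binary` flag, which asks the library not to apply text-mode translation.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file as raw bytes. std::ios::binary
// turns off text-mode translation, so the bytes arrive exactly as stored.
// Returns an empty string if the file cannot be opened.
std::string read_all_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) {
        return std::string();
    }
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Round-trip check: write LF-only text in binary mode, then read it back.
// The file name is arbitrary.
void demo_read_all_bytes() {
    const std::string text = "program code;\n\nvar x,y,sum:integer;\n";
    {
        std::ofstream out("binary_mode_demo.txt",
                          std::ios::out | std::ios::binary);
        out << text;
    }  // leaving the scope closes the stream and flushes it to disk
    assert(read_all_bytes("binary_mode_demo.txt") == text);
}
```

Running the same helper on both machines and comparing the returned sizes would be one way to see whether the two runtimes are looking at the same bytes.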


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 10:46 am 
Offline
Member
Member

Joined: Fri Oct 04, 2019 10:10 am
Posts: 48
disclaimer: I don't use iostreams much, so some of these guesses may not be accurate. cppreference would be a decent gold standard to verify with.

Default should be text, though if the source file you're feeding it was made on linux, there might be questions as to how iostreams on windows in text mode handles lone linefeeds.

If the professor opens the Pascal source in Notepad++, and sets it to show all symbols, does anything odd pop up? Not sure how it would happen accidentally, but a Windows newline is 0x0D 0x0A, vs the Unix plain 0x0A... but ^Z/Ctrl-Z (0x1A) is the Windows EOF character in text mode...


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 11:08 am 
Offline
Member
Member

Joined: Sat Mar 31, 2012 3:07 am
Posts: 4591
Location: Chichester, UK
Have you looked at the input file with a hex editor? It’s possible that, somehow, a non-printing character has slipped in which is upsetting things.
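If a hex editor isn't handy on the machine where it fails, a few lines of C++ can do a rough version of the same check. This is only a sketch: `find_suspicious_bytes` is a hypothetical name, and the choice of what counts as "suspicious" (anything other than tab, line feed, and printable ASCII) is an assumption that suits a plain-ASCII Pascal source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: return the offsets of bytes that are unusual in a
// plain-ASCII source file. Tab (0x09), line feed (0x0A) and printable ASCII
// (0x20..0x7E) are treated as ordinary; everything else is reported,
// including carriage return (0x0D), Ctrl-Z (0x1A), NUL and bytes >= 0x80.
std::vector<std::size_t> find_suspicious_bytes(const std::string& data) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < data.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(data[i]);
        const bool ordinary =
            (b == '\t') || (b == '\n') || (b >= 0x20 && b < 0x7F);
        if (!ordinary) {
            offsets.push_back(i);
        }
    }
    return offsets;
}

// Usage example on an in-memory buffer.
void demo_find_suspicious_bytes() {
    // 'a' 'b' CR LF Ctrl-Z: offsets 2 and 4 are flagged.
    const std::string data("ab\r\n\x1A", 5);
    const std::vector<std::size_t> hits = find_suspicious_bytes(data);
    assert(hits.size() == 2);
    assert(hits[0] == 2 && hits[1] == 4);
}
```

The input should be read in binary mode before it is handed to this function, otherwise the runtime may already have altered the bytes being inspected.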


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 11:57 am 
Offline
Member
Member

Joined: Sun Jun 23, 2019 5:36 pm
Posts: 618
Location: North Dakota, United States
iansjack wrote:
Have you looked at the input file with a hex editor? It’s possible that, somehow, a non-printing character has slipped in which is upsetting things.

I did. There doesn't seem to be anything wrong with it (on my end anyway):
Code:
$ xxd code.txt

Code:
00000000: 7072 6f67 7261 6d20 636f 6465 3b0a 0a76  program code;..v
00000010: 6172 2078 2c79 2c73 756d 3a69 6e74 6567  ar x,y,sum:integ
00000020: 6572 3b0a 7661 7220 413a 696e 7465 6765  er;.var A:intege
00000030: 723b 0a0a 7072 6f63 6564 7572 6520 5375  r;..procedure Su
00000040: 6d41 7667 2850 312c 5032 3a69 6e74 6567  mAvg(P1,P2:integ
00000050: 6572 3b20 7661 7220 4176 673a 696e 7465  er; var Avg:inte
00000060: 6765 7229 3b0a 7661 7220 733a 696e 7465  ger);.var s:inte
00000070: 6765 723b 0a62 6567 696e 0a20 2020 2073  ger;.begin.    s
00000080: 3a3d 5031 2b50 323b 0a20 2020 2073 756d  :=P1+P2;.    sum
00000090: 3a3d 733b 0a20 2020 2041 7667 3a3d 7375  :=s;.    Avg:=su
000000a0: 6d2f 323b 0a65 6e64 3b0a 0a62 6567 696e  m/2;.end;..begin
000000b0: 0a20 2020 2078 3a3d 353b 0a20 2020 2079  .    x:=5;.    y
000000c0: 3a3d 373b 0a20 2020 2053 756d 4176 6728  :=7;.    SumAvg(
000000d0: 782c 792c 4129 3b0a 656e 642e 0a         x,y,A);.end..

For some reason, that view has characters that don't seem to actually exist:
Code:
$ xxd -i code.txt

Code:
unsigned char code_txt[] = {
  0x70, 0x72, 0x6f, 0x67, 0x72, 0x61, 0x6d, 0x20, 0x63, 0x6f, 0x64, 0x65,
  0x3b, 0x0a, 0x0a, 0x76, 0x61, 0x72, 0x20, 0x78, 0x2c, 0x79, 0x2c, 0x73,
  0x75, 0x6d, 0x3a, 0x69, 0x6e, 0x74, 0x65, 0x67, 0x65, 0x72, 0x3b, 0x0a,
  0x76, 0x61, 0x72, 0x20, 0x41, 0x3a, 0x69, 0x6e, 0x74, 0x65, 0x67, 0x65,
  0x72, 0x3b, 0x0a, 0x0a, 0x70, 0x72, 0x6f, 0x63, 0x65, 0x64, 0x75, 0x72,
  0x65, 0x20, 0x53, 0x75, 0x6d, 0x41, 0x76, 0x67, 0x28, 0x50, 0x31, 0x2c,
  0x50, 0x32, 0x3a, 0x69, 0x6e, 0x74, 0x65, 0x67, 0x65, 0x72, 0x3b, 0x20,
  0x76, 0x61, 0x72, 0x20, 0x41, 0x76, 0x67, 0x3a, 0x69, 0x6e, 0x74, 0x65,
  0x67, 0x65, 0x72, 0x29, 0x3b, 0x0a, 0x76, 0x61, 0x72, 0x20, 0x73, 0x3a,
  0x69, 0x6e, 0x74, 0x65, 0x67, 0x65, 0x72, 0x3b, 0x0a, 0x62, 0x65, 0x67,
  0x69, 0x6e, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x73, 0x3a, 0x3d, 0x50, 0x31,
  0x2b, 0x50, 0x32, 0x3b, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x73, 0x75, 0x6d,
  0x3a, 0x3d, 0x73, 0x3b, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x41, 0x76, 0x67,
  0x3a, 0x3d, 0x73, 0x75, 0x6d, 0x2f, 0x32, 0x3b, 0x0a, 0x65, 0x6e, 0x64,
  0x3b, 0x0a, 0x0a, 0x62, 0x65, 0x67, 0x69, 0x6e, 0x0a, 0x20, 0x20, 0x20,
  0x20, 0x78, 0x3a, 0x3d, 0x35, 0x3b, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x79,
  0x3a, 0x3d, 0x37, 0x3b, 0x0a, 0x20, 0x20, 0x20, 0x20, 0x53, 0x75, 0x6d,
  0x41, 0x76, 0x67, 0x28, 0x78, 0x2c, 0x79, 0x2c, 0x41, 0x29, 0x3b, 0x0a,
  0x65, 0x6e, 0x64, 0x2e, 0x0a
};
unsigned int code_txt_len = 221;

Maybe I'm misinterpreting the hex dump though...


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 11:59 am 
Offline
Member
Member

Joined: Sun Jun 23, 2019 5:36 pm
Posts: 618
Location: North Dakota, United States
reapersms wrote:
disclaimer: I don't use iostreams much, so some of these guesses may not be accurate. cppreference would be a decent gold standard to verify with.

Default should be text, though if the source file you're feeding it was made on linux, there might be questions as to how iostreams on windows in text mode handles lone linefeeds.

If the professor opens the Pascal source in Notepad++, and sets it to show all symbols, does anything odd pop up? Not sure how it would happen accidentally, but a Windows newline is 0x0D 0x0A, vs the Unix plain 0x0A... but ^Z/Ctrl-Z (0x1A) is the Windows EOF character in text mode...

I've always submitted my code in the unix format with EOL being \n and not \r\n. On every assignment... I don't think that's the problem since I'm pretty sure Windows handles \n EOLs fine...


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 12:24 pm 
Offline
Member
Member

Joined: Wed Aug 30, 2017 8:24 am
Posts: 1593
Is it possible the tellg() is failing? In that case, you would try to seekg(-1) after reading the first keyword, which I don't know if it is defined. If the failure were occurring on Linux I would tell you to try strace, but since it is on Windows, you will have to make do with what your professor is willing to go along with.

Also, cppreference warns me that seekg(n) is not necessarily the same as seekg(n, ios::beg). Might be a thing to consider.
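A rough sketch of how that check could look. `save_position` and `restore_position` are hypothetical helper names; the calls they wrap (`tellg`, `seekg`, `fail`) are standard. `tellg()` returns `pos_type(-1)` on failure, so that value is tested before it can ever be passed back to `seekg()`.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical helper: remember the current read position.
// tellg() returns pos_type(-1) on failure; report that as false
// instead of handing the bogus position to seekg() later.
bool save_position(std::istream& in, std::istream::pos_type& pos) {
    const std::istream::pos_type here = in.tellg();
    if (here == std::istream::pos_type(-1)) {
        return false;
    }
    pos = here;
    return true;
}

// Hypothetical helper: seek back to a saved position.
// seekg() sets failbit when the seek fails, so check the stream afterwards.
bool restore_position(std::istream& in, std::istream::pos_type pos) {
    in.seekg(pos);
    return !in.fail();
}

// Usage example on an in-memory stream.
void demo_positions() {
    std::istringstream in("program code;");
    std::istream::pos_type pos;
    assert(save_position(in, pos));

    char c = 0;
    in.get(c);
    assert(c == 'p');

    assert(restore_position(in, pos));  // back to the start
    in.get(c);
    assert(c == 'p');

    in.setstate(std::ios::failbit);     // simulate a stream already in error
    assert(!save_position(in, pos));    // tellg() now reports failure
}
```

In the lexer loop this would mean bailing out with a diagnostic when `save_position` returns false, which would at least tell you whether tellg() is the culprit on the Windows side.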

_________________
Carpe diem!


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 1:51 pm 
Offline
Member
Member

Joined: Sun Jun 23, 2019 5:36 pm
Posts: 618
Location: North Dakota, United States
nullplan wrote:
Is it possible the tellg() is failing? In that case, you would try to seekg(-1) after reading the first keyword, which I don't know if it is defined. If the failure were occurring on Linux I would tell you to try strace, but since it is on Windows, you will have to make do with what your professor is willing to go along with.

Also, cppreference warns me that seekg(n) is not necessarily the same as seekg(n, ios::beg). Might be a thing to consider.

Maybe... According to this page on cppreference, seekg does call setstate to set the stream state to failbit if seeking fails. tellg apparently does similarly, and I do call in.exceptions to set an exception to be thrown in case of badbit being set, but I don't call this in case of failbit being set. I'll suggest that to him (or resubmit the assignment -- that might just solve the problem). That does raise the question of why tellg() is failing though, since other programs are able to read the file fine.
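As a sketch of what adding failbit to the exception mask would do: the helper below (`seek_throws`, a hypothetical name) enables exceptions for both badbit and failbit and then attempts a seek. It is only meant to show the mechanism on an in-memory stream, not to reproduce the Windows failure.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>

// Hypothetical helper: returns true if seeking to `pos` made the stream
// throw std::ios_base::failure once failbit is part of the exception mask.
bool seek_throws(std::istream& in, std::istream::pos_type pos) {
    try {
        // exceptions() itself throws if a selected bit is already set,
        // so it lives inside the try block as well.
        in.exceptions(std::ios::badbit | std::ios::failbit);
        in.seekg(pos);
    } catch (const std::ios_base::failure&) {
        return true;
    }
    return false;
}

// Usage example on an in-memory stream.
void demo_seek_throws() {
    std::istringstream in("program code;");
    assert(!seek_throws(in, std::istream::pos_type(0)));   // valid position
    assert(seek_throws(in, std::istream::pos_type(-1)));   // invalid position
}
```

One caveat with this approach: with failbit in the mask, an ordinary failed extraction at end-of-file also throws, so the read loop would have to treat that exception as the normal way of finishing.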


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 3:55 pm 
Offline
Member
Member

Joined: Tue Apr 03, 2018 2:44 am
Posts: 401
Ethin wrote:

Code:
        std::uint8_t c = 0;
        in >> std::noskipws >> c; // problem occurs on this line...

Specifically, the problem he's having is that this line (pascal):
Code:
program code;

Breaks after it parses the space after "program". It reads "program", a space, and then immediately hits end-of-file and refuses to parse anything else.

I'm completely baffled because this doesn't happen on my machine and it's never happened before with any prior assignment. I did update the zip file that I was going to submit for code-formatting reasons, so maybe that did something, but both of our editors are showing that the test code does in fact contain a lot more than just "program " and so the lexer should see the same. Is there something really weird going on between my libstdc++ and his? Is this just a weird Windows thing?


You should at least mention what compiler he is using. libstdc++ will be part of the compiler, so it's unlikely to be Windows per se.

Quoting from: https://www.cplusplus.com/reference/ios/noskipws/

Quote:
Notice that many extraction operations consider the whitespaces themselves as the terminating character, therefore, with the skipws flag disabled, some extraction operations may extract no characters at all from the stream.


Which sounds exactly like what your professor is hitting.


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 5:30 pm 
Offline
Member
Member

Joined: Sun Jun 23, 2019 5:36 pm
Posts: 618
Location: North Dakota, United States
thewrongchristian wrote:
Ethin wrote:

Code:
        std::uint8_t c = 0;
        in >> std::noskipws >> c; // problem occurs on this line...

Specifically, the problem he's having is that this line (pascal):
Code:
program code;

Breaks after it parses the space after "program". It reads "program", a space, and then immediately hits end-of-file and refuses to parse anything else.

I'm completely baffled because this doesn't happen on my machine and it's never happened before with any prior assignment. I did update the zip file that I was going to submit for code-formatting reasons, so maybe that did something, but both of our editors are showing that the test code does in fact contain a lot more than just "program " and so the lexer should see the same. Is there something really weird going on between my libstdc++ and his? Is this just a weird Windows thing?


You should at least mention what compiler he is using. libstdc++ will be part of the compiler, so it's unlikely to be Windows per se.

Quoting from: https://www.cplusplus.com/reference/ios/noskipws/

Quote:
Notice that many extraction operations consider the whitespaces themselves as the terminating character, therefore, with the skipws flag disabled, some extraction operations may extract no characters at all from the stream.


Which sounds exactly like what your professor is hitting.

He's using MSVC (unsure on what version, but I suspect 2017 or 2019). He may be hitting that but that doesn't make any sense because he didn't hit this on any other assignments that I used the noskipws specifier on. And I need to be able to read whitespace characters without going past the EOF, and my research indicated that using the formatted IO operations was the best way to do this. It doesn't help that the way I'm lexing is so sensitive and it breaks if I change it even in the slightest way. My experiments with using something like in.get read tokens that didn't exist, so I got parsing errors later on because my parser is written so that once it hits the final '.', that's the program terminator, and there should not be any tokens after that. (To clarify: it would read the '.', then it would read whitespace (or EOF, or some other character), which would confuse my parser.) I thought I might try using something like in.eof as the condition of the loop, but there are at least a few stack overflow answers that say that using
Code:
!feof(f)
or
Code:
!in.eof()
is bad practice.


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Wed May 11, 2022 9:59 pm 
Offline
Member
Member

Joined: Wed Aug 30, 2017 8:24 am
Posts: 1593
Ethin wrote:
I thought I might try using something like in.eof as the condition of the loop, but there are at least a few stack overflow answers that say that using
Code:
!feof(f)
or
Code:
!in.eof()
is bad practice.
True, the preferred way to read a file in C (and I guess C++) is to call the read function you want until it throws an error. So in C you would do something like
Code:
int c;
...
while ((c = fgetc(in)) != EOF)
A common mistake in this pattern would be to declare c as char, but char cannot hold EOF. You also would use ungetc() to push the first character past a token back into the stream, rather than positioning functions. ungetc() only changes the read buffer. If used in moderation, it cannot fail (and only ungetting one character is the pinnacle of moderation). The reason for this is that feof() doesn't work like it does in Pascal, it returns true after a read function failed for hitting end of the file.

That reminds me, you aren't checking if your read function succeeded. I don't know how you are supposed to do that in C++, though.

_________________
Carpe diem!


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Thu May 12, 2022 12:22 am 
Offline
Member
Member

Joined: Sun Feb 18, 2007 7:28 pm
Posts: 1564
Hi,

So... I would typically have a function like int FileGet() and FileUnget(int) that returns or ungets a respective character. This way, FileGet can handle EOL translation and process line continuation characters. It also facilitates debugging as you can independently test the file stream on its own given the input file.

This is basically what I suggest here. If you believe there is an issue with reading the input file, then just read the input file taking the scanner out of the picture. If you believe the issue is at the iostream level then the scanner code should be a nonfactor. Test it independently at the file level and have the scanner call those functions.
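A minimal sketch of that idea, assuming a single character of pushback is enough. `FileGet` and `FileUnget` are the names used above; the class name `CharSource`, the use of -1 as the end-of-input value, and the decision to map both CR LF and a lone CR to '\n' are assumptions made for the example. Line-continuation handling is left out.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical wrapper that owns all contact with the stream, so it can be
// tested on its own without the scanner.
class CharSource {
public:
    explicit CharSource(std::istream& in) : in_(in) {}

    // Returns the next character as a value in [0, 255], or -1 at end of
    // input. CR LF and a lone CR are both reported as '\n'.
    int FileGet() {
        if (has_pushback_) {
            has_pushback_ = false;
            return pushback_;
        }
        const int c = in_.get();
        if (c == std::char_traits<char>::eof()) {
            return -1;
        }
        if (c == '\r') {
            if (in_.peek() == '\n') {
                in_.get();  // swallow the LF of a CR LF pair
            }
            return '\n';
        }
        return c;
    }

    // Pushes one character back; only a single level is supported.
    void FileUnget(int c) {
        assert(!has_pushback_);
        pushback_ = c;
        has_pushback_ = true;
    }

private:
    std::istream& in_;
    int pushback_ = -1;
    bool has_pushback_ = false;
};

// Usage example: the wrapper is exercised without any scanner involved.
void demo_char_source() {
    std::istringstream in("a\r\nb");
    CharSource src(in);
    assert(src.FileGet() == 'a');
    const int nl = src.FileGet();
    assert(nl == '\n');
    src.FileUnget(nl);
    assert(src.FileGet() == '\n');
    assert(src.FileGet() == 'b');
    assert(src.FileGet() == -1);
}
```

Because the pushback lives in the wrapper rather than in the stream, the scanner never needs tellg() or seekg() at all.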

_________________
OS Development Series | Wiki | os | ncc
char c[2]={"\x90\xC3"};int main(){void(*f)()=(void(__cdecl*)(void))(void*)&c;f();}


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Thu May 12, 2022 8:40 am 
Offline
Member
Member

Joined: Sun Jun 23, 2019 5:36 pm
Posts: 618
Location: North Dakota, United States
nullplan wrote:
Ethin wrote:
I thought I might try using something like in.eof as the condition of the loop, but there are at least a few stack overflow answers that say that using
Code:
!feof(f)
or
Code:
!in.eof()
is bad practice.
True, the preferred way to read a file in C (and I guess C++) is to call the read function you want until it throws an error. So in C you would do something like
Code:
int c;
...
while ((c = fgetc(in)) != EOF)
A common mistake in this pattern would be to declare c as char, but char cannot hold EOF. You also would use ungetc() to push the first character past a token back into the stream, rather than positioning functions. ungetc() only changes the read buffer. If used in moderation, it cannot fail (and only ungetting one character is the pinnacle of moderation). The reason for this is that feof() doesn't work like it does in Pascal, it returns true after a read function failed for hitting end of the file.

That reminds me, you aren't checking if your read function succeeded. I don't know how you are supposed to do that in C++, though.

I don't really need to check if it succeeded -- that's what the condition below it is for. But if I did need to check that, I'd do something like
Code:
if ((in >> std::noskipws >> c))
or something. But the check to see if c != 0 or c == 0 below that negates that need.


 
 Post subject: Re: The strangest behavior with C++ iostream I've ever seen
PostPosted: Thu May 12, 2022 12:57 pm 
Offline
Member
Member

Joined: Wed Aug 30, 2017 8:24 am
Posts: 1593
Ethin wrote:
But the check to see if c != 0 or c == 0 below that negates that need.
Well no. According to cppreference:
https://en.cppreference.com/w/cpp/io/basic_istream/operator_gtgt2 wrote:
Behaves as an FormattedInputFunction. After constructing and checking the sentry object, which may skip leading whitespace, extracts a character and stores it to ch. If no character is available, sets failbit (in addition to eofbit that is set as required of a FormattedInputFunction).
If I read that right, the value of c is undefined on failure. In any case, it appears that operator>> is overkill anyway, since you only want to extract single characters from the stream. So one possible solution would be to ditch it and instead use get() and unget(). get() without arguments returns the next character or EOF, but alternatively you can pass a character variable as reference, and then the return value can be converted to bool to get the state, so in effect:
Code:
char c;
while (in.get(c))
That ought to do it.
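To make that a bit more concrete, here is a deliberately simplistic sketch of a get()/unget() loop. It is not the table-driven DFA from the first post: `split_tokens` is a hypothetical function that only groups alphanumeric runs into words and emits every other non-space character as its own token (so ":=" comes out as two tokens). The point is the control flow: the stream state decides when to stop, and unget() replaces the tellg()/seekg() pair.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tokenizer: words are runs of letters/digits, every other
// non-whitespace character is a one-character token.
std::vector<std::string> split_tokens(std::istream& in) {
    std::vector<std::string> tokens;
    char c;
    while (in.get(c)) {  // false once no character could be read
        const unsigned char uc = static_cast<unsigned char>(c);
        if (std::isspace(uc)) {
            continue;
        }
        std::string tok(1, c);
        if (std::isalnum(uc)) {
            while (in.get(c)) {
                if (!std::isalnum(static_cast<unsigned char>(c))) {
                    in.unget();  // first char past the word starts the next token
                    break;
                }
                tok += c;
            }
        }
        tokens.push_back(tok);
    }
    return tokens;
}

// Usage example on an in-memory stream.
void demo_split_tokens() {
    std::istringstream in("program code;\nx:=5;\nend.\n");
    const std::vector<std::string> expected = {
        "program", "code", ";", "x", ":", "=", "5", ";", "end", "."};
    assert(split_tokens(in) == expected);
}
```

Note that nothing is produced after the final '.', because the trailing newline is skipped and the next get() simply fails, which ends the loop.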

_________________
Carpe diem!


 