Discussion:
Toybox test image / fuzzing
Andy Chu
2016-03-11 23:12:11 UTC
Permalink
What is the best way to run the toybox tests? If I just run "make test", I
get a lot of failures, some of which are probably because I'm not running
as root, while some I don't understand, like:

PASS: pgrep -o pattern
pgrep: bad -s '0'
^
FAIL: pgrep -s

I'm on an Ubuntu 14.04 machine, running against the master branch. I
didn't try running as root since it seems like there is a non-zero chance
that it will mess up my machine.

I saw in the ELC YouTube talk that test infrastructure is a TODO.

http://landley.net/talks/celf-2015.txt

Is this something I can help with? I guess if you can tell me what
environment you use to get all tests to pass, it shouldn't be too hard to
make a shell script to create that environment, probably with Aboriginal
Linux. I have built Aboriginal Linux before (like a year ago).

One of the reasons I ran into this was because I wanted to distill a test
corpus for fuzzing from the shell test cases. afl-fuzz has a utility to
minimize a test corpus based on code path coverage. So getting a stable
test environment seems like a prerequisite for that.

FWIW, I had a different approach for fuzzing each arg:

https://github.com/andychu/toybox/commit/ff937e97881bfdf4b1221618c38857b75c9534e0

This seems to be a little laborious, because I have to manually write shell
scripts to fuzz individual inputs (and I didn't find anything beyond that
one crash yet). I think the mass fuzzing thing might work better, but I'm
not sure.

thanks,
Andy
Rob Landley
2016-03-12 17:51:25 UTC
Permalink
Post by Andy Chu
What is the best way to run the toybox tests? If I just run "make
test", I get a lot of failures, some of which are probably because I'm
PASS: pgrep -o pattern
pgrep: bad -s '0'
^
FAIL: pgrep -s
Unfortunately, the test suite needs as much work as the command
implementations do. :(

Ok, backstory!

Where I started toybox I had a big todo list, and was filling it in.
Some got completed and some (like sh.c, mdev.c, or mke2fs.c) got
partially finished and put on hold for a long time.

Then I started getting contributions of new commands from other
developers, some of which were easy to verify, polish up, and declare
done, and some of which required extensive review (and in several cases
an outright rewrite). I used to just merge stuff and track the state of
it in a local text file, but that didn't scale, and I got overwhelmed.

So I created toys/pending and moved all the unfinished command
implementations there. (And a lib/pending.c for shared infrastructure
used by toys/pending which needs its own review/cleanup pass.) After a
while I wrote a page (http://landley.net/toybox/cleanup.html) explaining
about the "pending" directory and the work I do to promote stuff _out_
of the pending directory, in hopes other people would be interested in
doing some of the cleanup for me.

But people kept asking how they could help other than implementing new
commands that would go into the giant toys/pending pile, or doing
cleanup, and the next logical thing for me was "test suite". So I
suggested that.

And got a lot of test suite entries full of tests that don't pass, tests
that don't actually test anything interesting in toybox (some test the
kernel, most don't test the interesting edge cases, none of them were
written with a thorough reading of the relevant standards document and/or
man page...)

Really, I need a tests/pending. :(

There's a missing layer of test suite infrastructure, which isn't just
"this has to be tested as root" but "this has to be tested on a known
system with a known environment". Preferably a synthetic one running
under an emulator, which makes it a good fit for my aboriginal linux
project with its build control images:

http://landley.net/aboriginal/about.html
http://landley.net/aboriginal/control-images

Unfortunately, when I tried to do this, the first one I did was "ps", and
making the process list "ps -a" sees reproducible is hard, because the
kernel launches a bunch of kernel threads based on driver configuration
and kernel version, so getting stable behavior out of that was enough of
a head-scratcher that it went back on the todo list. I should try again with
"mount" or something...

Anyway, I've done a few minor cleanup passes over the test suite, but an
awful lot of it is still tests that fail because the test is wrong, or
lack of test coverage.

One example of a test I did some cleanup on was tests/chmod.test, a "git
log" of that might be instructive? That said, the result isn't remotely
_complete_. (Endless cut and paste of "u+r" checks against ls output,
without even being a loop, but no tests for the sticky bit? Nothing sets
the executable bit on a script and then tests we can run it? Nothing
removes exec permission from a directory and checks we can't ls it? Or
removes read permission from a file and checks we can't read it? No, all
it tests is ls output over
and over...)
Post by Andy Chu
I'm on an Ubuntu 14.04 machine, running against the master branch. I
didn't try running as root since it seems like there is a non-zero
chance that it will mess up my machine.
Very much so!

That's why I need to do an aboriginal linux test harness that boots
under qemu and runs tests in a known chroot.
Post by Andy Chu
I saw in the ELC YouTube talk that test infrastructure is a TODO.
http://landley.net/talks/celf-2015.txt
Is this something I can help with?
If you could just triage the test suite and tell me the status of the
tests, that would be great. (I've been meaning to do that forever, but
every time I try I get distracted by fixing up a specific test and the
related command...)

First pass, you could sort the tests into:

1) This command is hard to test due to butterfly effects (run it twice,
get different output, so even a known emulated environment won't help;
top, ps, bootchartd, vmstat...)

2) This command could produce reliable output under an emulated
environment. This includes everything requiring root access. (Properly
testing oneit probably requires containers _within_ an emulator, but
let's burn that bridge when we come to it.)

3) This command can have a good test now. (Whether it _does_ is separate.)

Then let's put #1 and #2 aside for the moment and concentrate on filling
out #3.
Post by Andy Chu
I guess if you can tell me what
environment you use to get all tests to pass, it shouldn't be too hard
to make a shell script to create that environment, probably with
Aboriginal Linux.
Unfortunately, there isn't one. The test suite's bit rotted ever since I
started getting significant contributions to it without having a
"pending" directory to separate curated from wild tests. :(
Post by Andy Chu
I have built Aboriginal Linux before (like a year ago).
One of the reasons I ran into this was because I wanted to distill a
test corpus for fuzzing from the shell test cases. afl-fuzz has a
utility to minimize a test corpus based on code path coverage. So
getting a stable test environment seems like a prerequisite for that.
Looking at the tests, I suspect my recent changes to the dirtree
infrastructure broke "mv". (Something did, anyway...)

There's also the issue that "make test_mv" and "make tests" actually
test slightly different things. The first builds the command standalone,
and not all commands build correctly standalone. (That might be why
"make test_mv" didn't work, if it's not building standalone...)

Sometimes the command needs fixing, sometimes the build infrastructure
needs fixing, sometimes the test needs fixing...
Post by Andy Chu
https://github.com/andychu/toybox/commit/ff937e97881bfdf4b1221618c38857b75c9534e0
This seems to be a little laborious, because I have to manually write
shell scripts to fuzz individual inputs (and I didn't find anything
beyond that one crash yet). I think the mass fuzzing thing might work
better, but I'm not sure.
Building scripts to test each individual input is what the test suite is
all about. Figuring out what those inputs should _be_ (and the results
to expect) is, alas, work.

There's also the fact that either the correct output or the input to use
is non-obvious. It's really easy for me to test things like grep by
going "grep -r xopen toys/pending". There's a lot of data for it to bite
on, and I can test ubuntu's version vs mine trivially and see where they
diverge.

But putting that in the test suite, I need to come up with a set of test
files (the source changes each commit, source changes shouldn't cause
test case regressions). I've done a start of tests/files with some utf8
code in there, but it hasn't got nearly enough complexity yet, and
there's "standard test load that doesn't change" vs "I thought of a new
utf8 torture test and added it, but that broke the ls -lR test."

Or with testing "top", the output is based on the current system load.
Even in a controlled environment, it's butterfly effects all the way
down. I can look at the source files under /proc I calculated the values
from, but A) hugely complex, B) giant race condition, C) is implementing
two parallel code paths that do the same thing a valid test? If I'm
calculating the wrong value because I didn't understand what that field
should mean, my test would also be wrong...

In theory testing "ps" is easier, but in theory "ps" with no arguments
is the same as "ps -o pid,tty,time,cmd". But if you run it twice, the
pid of the "ps" binary changes, and the "TIME" of the shell might tick
over to the next second. You can't "head -n 2" it because it's
sorted by pid, which wraps, so if your ps pid is lower than your bash
pid it would come first. Oh, and there's no guarantee the shell you're
running is "bash" unless you're in a controlled environment... And that's
just testing the output with no arguments.
Post by Andy Chu
thanks,
Andy
Rob
Andy Chu
2016-03-13 08:34:46 UTC
Permalink
Post by Rob Landley
Unfortunately, the test suite needs as much work as the command
implementations do. :(
Ok, backstory!
OK, thanks a lot for all the information! That helps. I will work on
this. I think a good initial goal is just to triage the tests that
pass and make sure they don't regress (i.e. make it easy to run the
tests, keep them green, and perhaps have a simple buildbot). For
example, the factor bug is trivial but it's a lot easier to fix if you
get feedback in an hour or so rather than a month later, when you have
to load it back into your head.
Post by Rob Landley
Really, I need a tests/pending. :(
Yeah I have some ideas about this. I will try them out and send a
patch. I think there does need to be more than 2 categories as you
say though, and perhaps more than one kind of categorization.
Post by Rob Landley
Building scripts to test each individual input is what the test suite is
all about. Figuring out what those inputs should _be_ (and the results
to expect) is, alas, work.
Right, it is work that the fuzzing should be able to piggy back on...
so I was trying to find a way to leverage the existing test cases,
pretty much like this:

http://lcamtuf.blogspot.com/2015/04/finding-bugs-in-sqlite-easy-way.html

But the difference is that unlike sqlite, fuzzing toybox could do
arbitrarily bad things to your system, so it really needs to be
sandboxed. It gives really nasty inputs -- I wouldn't be surprised if
it can crash the kernel too.

Parsers in C are definitely the most likely successful targets for a
fuzzer, and sed seems like the most complex parser in toybox so far.
The regex parsing seems to be handled by libraries, and I don't think
those are instrumented (because they are in a shared library not
compiled with afl-gcc). I'm sure we can find a few more bugs though.
Post by Rob Landley
There's also the fact that either the correct output or the input to use
is non-obvious. It's really easy for me to test things like grep by
going "grep -r xopen toys/pending". There's a lot of data for it to bite
on, and I can test ubuntu's version vs mine trivially and see where they
diverge.
Yeah there are definitely a lot of inputs besides the argv values, like
the file system state and kernel state. Those are harder to test, but
I like that you are testing with Aboriginal Linux and LFS. That is
already a great torture test.

FWIW I think the test harness is missing a few concepts:

- exit code
- stderr
- file system state -- the current method of putting setup at the
beginning of foo.test *might* be good enough for some commands, but
probably not all

But this doesn't need to be addressed initially.

By the way, is there a target language/style for shell and make? It
looks like POSIX shell, and I'm not sure about the Makefile -- is it
just GNU make or something more restrictive? I like how you put most
stuff in scripts/make.sh -- that's also how I like to do it.

What about C? Clang is flagging a lot of warnings that GCC doesn't,
mainly -Wuninitialized.
Post by Rob Landley
But putting that in the test suite, I need to come up with a set of test
files (the source changes each commit, source changes shouldn't cause
test case regressions). I've done a start of tests/files with some utf8
code in there, but it hasn't got nearly enough complexity yet, and
there's "standard test load that doesn't change" vs "I thought of a new
utf8 torture test and added it, but that broke the ls -lR test."
Some code coverage stats might help? I can probably set that up, as
it's similar to making an ASAN build. (Perhaps HTML output along the
lines of http://llvm.org/docs/CoverageMappingFormat.html)

The build patch I sent yesterday will help with that as well since you
need to set CFLAGS.
Post by Rob Landley
Or with testing "top", the output is based on the current system load.
Even in a controlled environment, it's butterfly effects all the way
down. I can look at the source files under /proc I calculated the values
from, but A) hugely complex, B) giant race condition, C) is implementing
two parallel code paths that do the same thing a valid test? If I'm
calculating the wrong value because I didn't understand what that field
should mean, my test would also be wrong...
In theory testing "ps" is easier, but in theory "ps" with no arguments
is the same as "ps -o pid,tty,time,cmd". But if you run it twice, the
pid of the "ps" binary changes, and the "TIME" of the shell might tick
over to the next second. You can't "head -n 2" it because it's
sorted by pid, which wraps, so if your ps pid is lower than your bash
pid it would come first. Oh, and there's no guarantee the shell you're
running is "bash" unless you're in a controlled environment... And that's
just testing the output with no arguments.
Those are definitely hard ones... I agree with the strategy of
classifying the tests, and then we can see how many of the hard cases
there are. I think detecting trivial breakages will be an easy first step,
and it should allow others to contribute more easily.

thanks,
Andy
enh
2016-03-13 18:06:26 UTC
Permalink
#include <cwhyyoushouldbedoingunittestinginstead>

only having integration tests is why it's so hard to test toybox ps
and why it's going to be hard to fuzz the code: we're missing the
boundaries that let us test individual pieces. it's one of the major
problems with the toybox design/coding style. sure, it's something all
the existing competition in this space gets wrong too, but it's the
most obvious argument for the creation of the _next_ generation
tool...
_______________________________________________
Toybox mailing list
http://lists.landley.net/listinfo.cgi/toybox-landley.net
--
Elliott Hughes - http://who/enh - http://jessies.org/~enh/
Android native code/tools questions? Mail me/drop by/add me as a reviewer.
Rob Landley
2016-03-13 18:55:05 UTC
Permalink
Post by enh
#include <cwhyyoushouldbedoingunittestinginstead>
only having integration tests is why it's so hard to test toybox ps
and why it's going to be hard to fuzz the code: we're missing the
boundaries that let us test individual pieces. it's one of the major
problems with the toybox design/coding style. sure, it's something all
the existing competition in this space gets wrong too, but it's the
most obvious argument for the creation of the _next_ generation
tool...
I started adding test_blah commands to the toys/example directory. I
plan to move the central ps plumbing to lib/proc.c and untangle the 5
commands in there into separate files, we can add test_proc commands if
you can think of good individual pieces to test.

I'm open to this category of test, and have the start of a mechanism.
I'm just spread a bit thin, and it's possible I don't understand the
kind of test harness you want?

Rob
Andy Chu
2016-03-13 19:52:56 UTC
Permalink
Post by Rob Landley
Post by enh
#include <cwhyyoushouldbedoingunittestinginstead>
only having integration tests is why it's so hard to test toybox ps
and why it's going to be hard to fuzz the code: we're missing the
boundaries that let us test individual pieces. it's one of the major
problems with the toybox design/coding style. sure, it's something all
the existing competition in this space gets wrong too, but it's the
most obvious argument for the creation of the _next_ generation
tool...
I started adding test_blah commands to the toys/example directory. I
plan to move the central ps plumbing to lib/proc.c and untangle the 5
commands in there into separate files, we can add test_proc commands if
you can think of good individual pieces to test.
I'm open to this category of test, and have the start of a mechanism.
I'm just spread a bit thin, and it's possible I don't understand the
kind of test harness you want?
The toys/example/test_*.c files seem to print to stdout, so I guess
they still need a shell wrapper to test correctness. That's
technically still an integration test rather than a unit test --
roughly I would say integration tests involve more than one
process (e.g. for a system of servers) whereas unit tests are run
entirely within the language using a unit test framework in that
language.

Google uses gunit/googletest for testing, and I guess Android does too:

https://github.com/google/googletest

Example: https://android.googlesource.com/platform/system/core.git/+/master/libziparchive/zip_archive_test.cc

You basically write a bunch of functions wrapped in TEST_ macros and
they are linked into a binary with a harness and run.

I guess toybox technically could use it if the tests were in C++ but
the code is in C, though it seems like it clashes with the style of
the project pretty badly.

I think the main issue that Elliott is pointing to is that there are no
internal interfaces to test against or mock out so you don't hose your
system while running tests (i.e. you can "reify" the file system state
and kernel state, and then substitute them with fake values in tests).

I agree it would be nicer if there were such interfaces, but it's
fairly big surgery, and somewhat annoying to do in C. I think you
would have to get rid of most global vars, and use a strategy like Lua
or sqlite, where they pass around a context struct everywhere, which
can have system functions like open()/read()/write()/malloc()/etc.
sqlite has a virtual file system (VFS) abstraction for extensive tests
and Lua lets you plug in malloc/free at least. They are libraries and
not programs so I guess that is more natural.

I think this is worth keeping in mind perhaps, but it seems like there
is a lot of other low hanging fruit to address beforehand.

Andy
Rob Landley
2016-03-13 21:56:52 UTC
Permalink
Post by Andy Chu
Post by Rob Landley
Post by enh
#include <cwhyyoushouldbedoingunittestinginstead>
only having integration tests is why it's so hard to test toybox ps
and why it's going to be hard to fuzz the code: we're missing the
boundaries that let us test individual pieces. it's one of the major
problems with the toybox design/coding style. sure, it's something all
the existing competition in this space gets wrong too, but it's the
most obvious argument for the creation of the _next_ generation
tool...
I started adding test_blah commands to the toys/example directory. I
plan to move the central ps plumbing to lib/proc.c and untangle the 5
commands in there into separate files, we can add test_proc commands if
you can think of good individual pieces to test.
I'm open to this category of test, and have the start of a mechanism.
I'm just spread a bit thin, and it's possible I don't understand the
kind of test harness you want?
The toys/example/test_*.c files seem to print to stdout, so I guess
they still need a shell wrapper to test correctness.
cat tests/test_human_readable.test

scripts/test.sh test_human_readable

(I didn't hook up the scripts/examples directory in the script that
makes the "make test_blah" targets. I should add that, although "make
test_test_human_readable" is an awkward name...)
Post by Andy Chu
That's technically still an integration test rather than a unit test --
roughly I would say integration tests involve one more than one
process (e.g. for a system of servers) whereas unit tests are run
entirely within the language using a unit test framework in that
language.
/me goes to look up the definitions of integration test and unit test...

https://en.wikipedia.org/wiki/Unit_testing
https://en.wikipedia.org/wiki/Integration_testing

And the second of those links to "validation testing" which redirects to
https://en.wikipedia.org/wiki/Software_verification_and_validation which
implies that testing (like documentation) is something done badly by a
third party team in Bangalore after the original team scatters to the
four winds, so no.

I do "unit testing" while developing, but then I repeatedly refactor
that code as I go. I just split xexit() into _xexit() and xexit() while
redoing sigatexit() to replace atexit(). (Because the toys.rebound
longjmp stuff needs to happen after "atexit" but the standard C
functions don't give you a way of triggering the list early, nor of
removing things from it short of actually exiting.)

If I had a unit test suite for xexit(), I would have made more work for
myself updating them. I'm still trying to get toys/code.html to have
decent coverage of lib, so that other people can use these tools. A test
suite that tests things that have no external visibility in a running
program proves what exactly?

My test suite is _deeply_ unfinished, and testing a moving target, but
its eventual goals are:

1) Regression testing.
2) Standards compliance.
3) Coverage of all code paths.

#3 is non-obvious: how does signal delivery work in here, or disk full?
If sed -i receives a kill signal while saving it should leave the old
file in place, which means write a new .file.tmp and then mv it atomically
over the old one, but kill -9 means when re-run it needs to cleanup
.file.tmp but sed -i doesn't get re-run a lot (like vi would) which
means what we WANT to do is open our tempfile, delete it, write to it,
and then hardlink the /proc/$$/fd/fileno into the new location taking
advantage of proc's special case behavior, but is that portable enough
(sed should work if /proc isn't mounted) and other things may want to
use that so that code should live in /lib and have a fallback path with
atexit() stuff (see lib/lib.c copy_tempfile())...

So if we _do_ make this plumbing, do we test it in sed or do we have a
test_copy_tempfile in toys/example that specifically tests this part of
the plumbing, and then it's just a question of whether sed uses it? But
if sed _didn't_ use it, we wouldn't notice unless we tested it...

Another coverage vs duplication issue is the fact that every command
should be calling xexit() at the end (including return from main) which
means it does a fflush(0) and checks ferror() and does a perror_exit()
if there was a problem. (Which is why I've been pruning back use of
xprintf() and similar, those cause the program to exit early rather than
producing endless output when writing to a full disk or closed socket,
but the fflush() affects performance and the exit path should notice.)

Possibly what I need is a shell function a test can call that says "this
command line modifies/replaces file $BLAH, make sure it handles disk
full and being interrupted and so on sanely", and it can run it in its
own directory and make sure there are no leftover files if it gets a
kill signal while running (have it read from a fifo, once we're sure
it's blocked send it a non -9 kill signal and then read the directory to
make sure there's only one file in there...)

This is the kind of thing I'm worried about in future. My idea of "full
coverage" is full of that sort of thing. Things which are externally
visible from the command can be tested by running the command in the
right environment.
Post by Andy Chu
https://github.com/google/googletest
Example: https://android.googlesource.com/platform/system/core.git/+/master/libziparchive/zip_archive_test.cc
You basically write a bunch of functions wrapped in TEST_ macros and
they are linked into a binary with a harness and run.
This is a set of command line utilities, not a C library.

If, after the 1.0 release, somebody wants to make a C library out of it,
have fun. But until then, infrastructure bits are subject to change
without notice. (I'm currently banging on dirtree to try to get that
infinite depth thing rm wants, for example. Commit 8d95074b7d03 changed
some of the semantics, adjusted the callers, and updated the
documentation. What would altering a test suite at that level
accomplish? Either the behavior is visible to the outside world when the
command runs, or it isn't. I can make a test_dirtree wrapper to check
specific dirtree corner cases, but we should also have _users_ of all
those corner cases, and should be testing the visible behavior of those
users...)
Post by Andy Chu
I guess toybox technically could use it if the tests were in C++ but
the code is in C, though it seems like it clashes with the style of
the project pretty badly.
The toybox shared C infrastructure isn't exported to the outside world
for use outside of toybox. If its semantics change, we adjust the users
in-tree.

Instrumenting the build to show that in allyesconfig this function is
never used from anywhere is interesting (and can probably be done with
readelf and sed).
Post by Andy Chu
I think the main issue that Elliott is pointing to is that there are no
internal interfaces to test against or mock out so you don't hose your
system while running tests (i.e. you can "reify" the file system state
and kernel state, and then substitute them with fake values in tests).
I've always planned to test these commands under an emulator in a
virtual system. (Aboriginal Linux is a much older project than toybox.)

Heck, back under busybox I was using User Mode Linux as my emulator
(qemu wasn't available yet):

https://git.busybox.net/busybox/tree/testsuite/umlwrapper.sh?h=1_2_1&id=f86a5ba510ef

Before that, I added a chroot mode to busybox tests:

https://git.busybox.net/busybox/tree/testsuite/testing.sh?h=1_2_1&id=f86a5ba510ef#n95

I am aware of that problem, but rather than dissecting the code and
sticking pins in it, I prefer to run the tests under an emulator in an
environment it can trash without repercussions. I just haven't finished
implementing it yet because it doesn't solve the "butterfly effect"
tests. (It's on the todo list!)

Note: solving the butterfly effect tests _is_ possible by providing a
fake /proc instead of a real one, --bind mounting a directory of known
data over /proc for the duration of the test so it produces consistent
results. It's all solvable, it's just a can of worms I haven't opened
yet because I've got six cans of worms going in parallel already.
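
A minimal sketch of the fake-/proc idea, without the actual mount (a real harness would `mount --bind "$FAKEPROC" /proc` for the test's duration, which needs root; reading from a substitutable directory shows the same effect without privileges, and `$PROC` here is a hypothetical knob, not a toybox feature):

```shell
# Canned /proc data gives reproducible results no matter what the host
# system is doing.
PROC=$(mktemp -d)
printf '12345.67 23456.78\n' > "$PROC/uptime"
# "command under test": report whole seconds of uptime from the fake file
secs=$(cut -d. -f1 "$PROC/uptime")
echo "$secs"
rm -rf "$PROC"
```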
Post by Andy Chu
I agree it would be nicer if there were such interfaces, but it's
fairly big surgery, and somewhat annoying to do in C. I think you
would have to get rid of most global vars,
I mostly have. All command-specific global variables should go in
GLOBALS() (which mean they go in "this", which a union of structs), and
everything else should be in the global "toys" union except for toybuf
and libbuf.

Let's see

nm --size-sort toybox_unstripped | sort -k2,2

and ignoring the "r", "t", and "T" entries gives us:

0000000000000001 b completed.6973

grep -r is not finding "completed" as a variable name? Odd...

0000000000000008 b tempfile2zap

In lib/lib.c so copy_tempfile() can let tempfile_handler() know what
file to delete atexit(). Bit of a hack, largely because there's only
_one_ file at a time it can store (not a list, but no users need it to
be a list yet). I think code.html mentions this? (If not it should.)

0000000000000028 b chattr

Blah, that's garbage I missed when cleaning up this contribution. That
should go in GLOBALS(), you can have a union of structs in there to have
per-command variables when sharing a file. (But why do they share a
file? I'd have to dig...)

0000000000000004 B __daylight@@GLIBC_2.2.5
0000000000000008 B __environ@@GLIBC_2.2.5
0000000000000008 B stderr@@GLIBC_2.2.5
0000000000000008 B stdin@@GLIBC_2.2.5
0000000000000008 B stdout@@GLIBC_2.2.5
0000000000000008 B __timezone@@GLIBC_2.2.5

glibc vomited forth these for no apparent reason.

0000000000000048 B toys
0000000000001000 B libbuf
0000000000001000 B toybuf
0000000000002028 B this

The ones I mentioned above, these are _expected_.

0000000000000150 d e2attrs

More lsattr stuff. You'll note that 2013 was before I had "pending",
looks like I missed some cleanup in this command.

0000000000001600 D toy_list

That could probably be "r" with a little more work, although I vaguely
recall adding it made lots of spurious "a const was passed to a
non-const thing! Alas and alack, woe betide! Did you know that string
constants are in the read-only section and segfault if you try to write
to them but the compiler doesn't complain if you pass them to a
non-const argument yet it all works out fine? Oh doom and gloom!"

That's probably why I didn't.

0000000000000004 V daylight@@GLIBC_2.2.5
0000000000000008 V environ@@GLIBC_2.2.5
0000000000000008 V timezone@@GLIBC_2.2.5

glibc again.

000000000000001f W mknod

What is a "W" type symbol?

And of course there's buckets of violations needing to be fixed in
pending, this is just defconfig...
Post by Andy Chu
and use a strategy like Lua
or sqlite, where they pass around a context struct everywhere, which
can have system functions like open()/read()/write()/malloc()/etc.
A) On nommu systems you have a limited stack size.

B) I looked into rewriting this in Lua back around 2009 or so. I chose
to stick with C. If you'd like to write a version in Lua, feel free.

If you're proposing that I extensively reengineer the project so you can
use a different style of test architecture, could you please explain
what those tests could test that the way I'm doing it couldn't?
Post by Andy Chu
sqlite has a virtual file system (VFS) abstraction for extensive tests
Linux has --bind and union mounts, containers, and I can run the entire
system under QEMU.
Post by Andy Chu
and Lua lets you plug in malloc/free at least. They are libraries and
not programs so I guess that is more natural.
The first time I wrote my own malloc/free wrapper to intercept and track
all allocations in a program was under OS/2 in 1996. I expect all
programmers do that at some point. I can only think of one person who
lists such a wrapper as one of his major life accomplishments on his
resume, and I have longstanding disagreements with that man.

I've been thinking for a long time about making generic recovery
infrastructure so you can nofork() any command and clean up after it.
And when I say "a long time" I mean a decade now:

http://lists.busybox.net/pipermail/busybox/2006-March/053270.html

And the conclusion I came to is "let the OS do it". I'm sure I blogged
about this in like 2011, but if you look at "man execve" and scroll down
to "All process attributes are preserved during an execve(), except the
following:" there's a GIANT list of things we'd need to clean up, and
that's just a _start_. It could mess with environment variables, or mess
with umask... it's just not worth it.

So I've since wandered to "fork a child process, have the child recurse
into the other_main() function to avoid an exec if there's enough stack
space left, and then have it exit with the parent waiting for it" as the
standard way of dealing with stuff that's not already easy. A command
can refrain from altering the process's state and be marked
TOYBOX_NOFORK or it can be a child process the OS cleans up after. I'm
not going for a case in between.

That said, it's important for long-lived processes not to leak. Init or
httpd can't leak, grep and sed can't leak per-file or per-line because
they can have unbounded input size... But again, people are looking for
that with valgrind, and I can make a general test memory and open
filehandles and such in xexit() under TOYBOX_DEBUG.
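
A shell-level sketch of that sort of exit-time check (Linux-only, since it relies on /proc/$$/fd; a TOYBOX_DEBUG xexit() would do the C equivalent with the process's own bookkeeping):

```shell
# Count this shell's open descriptors before and after a step that
# deliberately leaks one.
before=$(ls /proc/$$/fd | wc -l)
exec 9>/dev/null                 # deliberately "leak" a descriptor
after=$(ls /proc/$$/fd | wc -l)
[ "$after" -gt "$before" ] && echo "leaked filehandle detected"
exec 9>&-                        # close it again
```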
Post by Andy Chu
I think this is worth keeping in mind perhaps, but it seems like there
is a lot of other low hanging fruit to address beforehand.
If we need to test C functions in ways that aren't easily allowed by the
users of those C functions, we can write a toys/example command that
calls those C functions in the way we want to check. But if the behavior
isn't already accessible from an existing user of that function in one
of the commands, why do we care again?
Post by Andy Chu
Andy
Rob
Rob Landley
2016-03-13 18:18:58 UTC
Post by Andy Chu
Post by Rob Landley
Unfortunately, the test suite needs as much work as the command
implementations do. :(
Ok, backstory!
OK, thanks a lot for all the information! That helps. I will work on
this. I think a good initial goal is just to triage the tests that
pass and make sure they don't regress (i.e. make it easy to run the
tests, keep them green, and perhaps have a simple buildbot).
I fixed "make test_mv" last night. The problem is that
scripts/singleconfig.sh was creating a "mv" that acted like "cp". (I
should write up a blog entry explaining the plumbing. This may fix one
or two other tests, I haven't checked. It should change the "make tests"
build which tests the multiplexer version, which depends on make
menuconfig to tell it what to test.)
Post by Andy Chu
For
example, the factor bug is trivial but it's a lot easier to fix if you
get feedback in an hour or so rather than a month later, when you have
to load it back into your head.
Indeed, but I did most of the fix yesterday and can check it in today.

(I special cased "-" is the first character, to print out a -1 and skip
it, then the rest of the math is unsigned for the larger range. This
means that "-" by itself is treated as -1, I'm not sure how to catch
that without an ugly special case test for that...)
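
A hypothetical shell analogue of that special case (the real fix is in C; this just shows the shape): peel a leading "-" off as a factor of -1, then hand the rest to the unsigned math. Note the "-?*" pattern requires at least one character after the dash, so a bare "-" falls through unchanged in this sketch.

```shell
split_sign() {
  n=$1
  case $n in
    -?*) printf -- '-1 '; n=${n#-} ;;
  esac
  printf '%s\n' "$n"   # real code would factor $n from this point
}
split_sign -12   # prints: -1 12
split_sign 12    # prints: 12
```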

I also switched it to long long, which should make no difference on 64
bit plaforms (with current compilers, anyway; there's nothing STOPPING
128 bit long long the way they wrote LP64, but nobody does it). On 32
bit platforms, it slows it down up to 50%.
Post by Andy Chu
Post by Rob Landley
Really, I need a tests/pending. :(
Yeah I have some ideas about this. I will try them out and send a
patch. I think there does need to be more than 2 categories as you
say though, and perhaps more than kind of categorization.
Eventually it should all collapse back into one category, but there's a
lot of work to do between now and then. But tests/posix and tests/lsb
and such make a certain amount of sense, and that would both get us
tests/pending and not have to be undone later.
Post by Andy Chu
Post by Rob Landley
Building scripts to test each individual input is what the test suite is
all about. Figuring out what those inputs should _be_ (and the results
to expect) is, alas, work.
Right, it is work that the fuzzing should be able to piggy back on...
so I was trying to find a way to leverage the existing test cases,
http://lcamtuf.blogspot.com/2015/04/finding-bugs-in-sqlite-easy-way.html
But the difference is that unlike sqlite, fuzzing toybox could do
arbitrarily bad things to your system, so it really needs to be
sandboxed. It gives really nasty inputs -- I wouldn't be surprised if
it can crash the kernel too.
I have plans to sandbox it using
http://landley.net/aboriginal/about.html but haven't finished that yet
because Giant TODO List.

(If I go off to my corner and focus on my todo list, I vanish for
months. Things like sed and ps can easily soak up a couple months each.
If I prioritize interrupts, I jump from topic to topic and wind up with
giant heaps of half-finished stuff, but at least other people can sort
of follow along. :)
Post by Andy Chu
Parsers in C are definitely the most likely successful targets for a
fuzzer, and sed seems like the most complex parser in toybox so far.
lib/args.c is a pretty complicated parser, and toys/*/find.c is also
moderately horrid in that regard (because it can't leverage lib/args to
do anything in a common way.)

I want to genericize find.c plumbing to have expr.c and maybe test.c do
parenthesization and prioritization and such the same way, but despite
sitting down to think it through more than once haven't come up with a
clean way to factor out the common code yet. I should just do expr.c and
then try to cleanup common code (if any) afterwards. (Yes there's an
expr.c in pending, and when I sat down to try to clean it up I hit
http://landley.net/notes-2014.html#02-12-2014 and then
http://landley.net/notes-2015.html#30-01-2015 and it's on the todo
list.)

This might help:

ls -loSr toys/{android,example,other,lsb,posix}/*.c

The size of ps is partly illusory, I implemented "ps", "top", "iotop",
"pgrep", and "pkill" in the same command because I hadn't cleaned out
the common infrastructure to move it to lib/proc.c yet. (I should do
that. It can't use any of the GLOBALS/TT stuff and can't use any FLAG_
macros, because neither are available in lib. Oh, and it also shouldn't
ever check toys.which->name to see which command is running. I've got
that mostly cleaned out, need to factor it out into lib. It's on the
todo list.)
Post by Andy Chu
The regex parsing seem to be handled by libraries, and I don't think
those are instrumented (because they are in a shared library not
compiled with afl-gcc). I'm sure we can find a few more bugs though.
I'd prioritize musl and bionic. As far as I'm concerned uClibc is dead
(and uClibc-ng is necromancy, not a fresh start), and glibc is big iron
along with the rest of the GNU/nonsense.
Post by Andy Chu
Post by Rob Landley
There's also the fact that either the correct output or the input to use
is non-obvious. It's really easy for me to test things like grep by
going "grep -r xopen toys/pending". There's a lot of data for it to bite
on, and I can test ubuntu's version vs mine trivially and see where they
diverge.
Yeah there are definitely a lot of inputs beside the argv values, like
the file system state and kernel state.
I'm working on tests/files. I need directory traversal weirdness with
some symlinks and different permissions and fifos and such, but I
suspect I need a tarball and/or script to set those up because trying to
check intentional filesystem corner cases into git is not a happy thought.
Post by Andy Chu
Those are harder to test, but
I like that you are testing with Aboriginal Linux and LFS. That is
already a great torture test.
Indeed, and ~2 weeks ago I was churning through LFS 7.8 packages until I
got distracted. I should get back to that. It's on the todo list.
Post by Andy Chu
- exit code
blah; echo $?
Post by Andy Chu
- stderr
2>&1
Post by Andy Chu
- file system state -- the current method of putting setup at the
beginning of foo.test *might* be good enough for some commands, but
probably not all
I mentioned the need for a standard directory of files everything can
assume is there, and tests/files being a start of that. For testing by
hand I just use the toybox source du jour, but that's obviously
unsuitable for automated testing.

That said, these test scripts are shell scripts. You can do any
setup/teardown you need to. The automated stuff is a convenience.

That said, right now the tests are run by sourcing them, which means
there's potential leftover crap if you define shell functions and such.
I need to make sure there's an appropriate ( ) subshell at the right
places. (When I first wrote this, I knew the answers to that sort of
thing off the top of my head. That was in... 2005? Now I have to go back
and confirm and add comments, but that's what other people have to do
looking at my code so probably a net win...)
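
A minimal sketch of the subshell fix, sourcing a throwaway .test file inside ( ) so any functions or variables it defines die with the subshell (run_one is a hypothetical helper, not something in the current testing.sh):

```shell
run_one() {
  ( . "$1" )   # leftover definitions can't escape the subshell
}
t=$(mktemp)
cat > "$t" <<'EOF'
leftover() { echo "inside the test"; }
leftover
EOF
run_one "$t"
type leftover >/dev/null 2>&1 || echo "no leftover definitions"
rm -f "$t"
```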
Post by Andy Chu
But this doesn't need to be addressed initially.
By the way, is there a target language/style for shell and make?
I'm targeting bash (but older bash, like bash 2 with only a couple bash
3 features like ~=), because toybox's shell should be a proper bash
replacement, and toybox building itself is an obvious smoketest.

That said, there's a bootstrapping problem on weird systems. If I could
carve out the toysh.c and sed.c standalone builds so they can be run on
systems that haven't got acceptable versions of those commands, I'd
increase the portability of toybox a lot. (It still mucks about in /proc
and /sys looking for stuff, and calls some linux-only syscalls and
ioctls, but everybody and their dog has a linux emulation layer these
days. Large chunks of posix is still stuck in the 1970's, and they
always chickened out about standardizing things like "mount" or "init"
so you can't _boot_ a system that doesn't go beyond posix.)
Post by Andy Chu
It looks like POSIX shell, and I'm not sure about the Makefile -- is it
just GNU make or something more restrictive? I like how you put most
stuff in scripts/make.sh -- that's also how I like to do it.
In theory make is only there to provide the expected API. In practice,
the kconfig subdirectory was copied from Linux 2.6.12 and I need to
write a new one from scratch. (It's on the todo list! Note we only use
the generated .config file which is produced from our Config.in source,
so washing data through that plumbing doesn't affect the copyright and
thus license of the resulting binary. But it's an ugliness that really
should go bye-bye, and now that I've broken open the
lib/interestingtimes.c and lib/linestack.c can of worms... It's on the
todo list.)
Post by Andy Chu
What about C? Clang is flagging a lot of warnings that GCC doesn't,
mainly -Wuninitialized.
The Android guys build with clang against bionic. I need to set up a
local clang toolchain, but my netbook is still ubuntu 12.04 and AOSP's
moved on to 14.04. It's on the todo list.

That said, gcc produces buckets of _spurious_ "may be used uninitialized
but never actually is" warnings, which I sometimes silence with "int
a=a;" in the declarations. (Generates no code but shuts up the warning.)

Are these _real_ uninitialized warnings? I'm very interested in those,
but find wading through large quantities of false positives tiresome.
(That's why I'm not a big fan of static analysis either. False positives
as far as the eye can see.)
Post by Andy Chu
Post by Rob Landley
But putting that in the test suite, I need to come up with a set of test
files (the source changes each commit, source changes shouldn't cause
test case regressions). I've done a start of tests/files with some utf8
code in there, but it hasn't got nearly enough complexity yet, and
there's "standard test load that doesn't change" vs "I thought of a new
utf8 torture test and added it, but that broke the ls -lR test."
Some code coverage stats might help? I can probably set that up as
it's similar to making an ASAN build. (Perhaps something like this
HTML http://llvm.org/docs/CoverageMappingFormat.html)
Ooh, that sounds interesting.
Post by Andy Chu
The build patch I sent yesterday will help with that as well since you
need to set CFLAGS.
I lost it in the noise, I need to do a pass over the mailing list web
archive again today and see what's fallen through the cracks...
Post by Andy Chu
Post by Rob Landley
Or with testing "top", the output is based on the current system load.
Even in a controlled environment, it's butterfly effects all the way
down. I can look at the source files under /proc that I calculated the values
from, but A) hugely complex, B) giant race condition, C) is implementing
two parallel code paths that do the same thing a valid test? If I'm
calculating the wrong value because I didn't understand what that field
should mean, my test would also be wrong...
In theory testing "ps" is easier, but in theory "ps" with no arguments
is the same as "ps -o pid,tty,time,cmd". But if you run it twice, the
pid of the "ps" binary changes, and the "TIME" of the shell might tick
over to the next second. You can't "head -n 2" it because it's
sorted by pid, which wraps, so if your ps pid is lower than your bash
pid it would come first. Oh, and there's no guarantee the shell you're
running is "bash" unless you're in a controlled environment... That's
just testing the output with no arguments.
Those are definitely hard ones... I agree with the strategy of
classifying the tests, and then we can see how many of the hard cases
are. I think detecting trivial breakages will be an easy first step,
and it should allow others to contribute more easily.
Initially I was only adding tests that either passed or showed something
interesting I needed to fix. This left large holes in the test suite
that I didn't know how to fill in yet, and when other people filled them
in I don't necessarily know how to fix them yet.

I'm glad somebody's taking a look. :)
Post by Andy Chu
thanks,
Andy
No, thank _you_,

Rob
Samuel Holland
2016-03-13 19:13:29 UTC
Post by Rob Landley
Post by Andy Chu
- exit code
blah; echo $?
Post by Andy Chu
- stderr
2>&1
I think the idea here was the importance of differentiating between
stdout and stderr, and between text output and return code. This is as
simple as having a separate output variable for each type of output.

Granted, it will usually be unambiguous as to the correctness of the
program, but having the return code in the output string can be
confusing to the human looking at the test case. Plus, why would you not
want to verify the exit code for every test? It's a lot of duplication
to write "echo $?" in all of the test cases.

As for stdout/stderr, it helps make sure diagnostic messages are going
to the right stream when not using the helper functions.

--
Regards,
Samuel Holland <***@sholland.org>
Andy Chu
2016-03-13 19:54:04 UTC
Post by Samuel Holland
I think the idea here was the importance of differentiating between
stdout and stderr, and between text output and return code. This is as
simple as having a separate output variable for each type of output.
Granted, it will usually be unambiguous as to the correctness of the
program, but having the return code in the output string can be
confusing to the human looking at the test case. Plus, why would you not
want to verify the exit code for every test? It's a lot of duplication
to write "echo $?" in all of the test cases.
As for stdout/stderr, it helps make sure diagnostic messages are going
to the right stream when not using the helper functions.
Yes, that is exactly what I was getting at. Instead of "testing",
there could be another function "testing-errors" or something. But
it's not super important right now.

Andy
Rob Landley
2016-03-13 20:32:48 UTC
Post by Samuel Holland
Post by Rob Landley
Post by Andy Chu
- exit code
blah; echo $?
Post by Andy Chu
- stderr
2>&1
I think the idea here was the importance of differentiating between
stdout and stderr, and between text output and return code.
You can do this now. "2>/dev/null" gives you only stdout, "2>&1
>/dev/null" gives you only stderr, or 2>file... There are many options.
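
A quick sketch of those redirections against a command that writes to both streams, capturing each one separately (`both` is a made-up stand-in for a command under test):

```shell
both() { echo "to stdout"; echo "to stderr" >&2; }
out=$(both 2>/dev/null)      # stderr discarded, stdout captured
err=$(both 2>&1 >/dev/null)  # fd 2 follows the old stdout (the capture),
                             # then fd 1 is pointed at /dev/null
echo "out=$out"
echo "err=$err"
```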
Post by Samuel Holland
simple as having a separate output variable for each type of output.
Granted, it will usually be unambiguous as to the correctness of the
program,
Yes, adding complexity to every test that isn't usually needed.
Post by Samuel Holland
but having the return code in the output string can be
confusing to the human looking at the test case. Plus, why would you not
want to verify the exit code for every test?
Because science is about reducing variables and isolating to test
specific things? Because "we can so clearly we should" is the gnu
approach to things? Because you're arguing for adding complexity to the
test suite to do things it can already do, and in many cases is already
doing? Because we've already got 5 required arguments for each test and
you're proposing adding a couple more, and clearly that's going to make
the test suite easier to maintain and encourage additional contributions?

I'm not saying it's _not_ a good extension, but you did ask. Complexity
is a cost, spend it wisely. You're saying each test should test more
things, which means potential false positives and more investigation
about why a test failed (and/or more complex reporting format).

Also, "the return code" implies none of the tests are pipelines, or
multi-stage "do thing && examine thing" (which _already_ fails if do
thing returned failure, and with the error_msg() stuff would have said
why to stderr already). Yesterday I was poking at mv tests which have a
lot of "mv one two && [ -e two ] && [ ! -e one ] && echo yes" sort of
constructs. What is "the exit code" from that?
Post by Samuel Holland
It's a lot of duplication
to write "echo $?" in all of the test cases.
I don't. Sometimes I go "blah && yes" or "blah || yes", when the return
code is specifically what I'm testing, and sometimes checking the return
code and checking the output are two separate tests.

Keep in mind that error_msg() and friends produce output, and the tests
don't catch stderr by default but pass it through. If we catch stderr by
default and a test DOESN'T check it, then it's ignored instead of
visible to the caller.

Also, keep in mind I want the host version to pass most of these tests
too, and if there are gratuitous differences in behavior I don't WANT
the test to fail based on something I don't care about and wasn't trying
to test. You're arguing for a tighter sieve with smaller holes when I've
already received a bunch of failing tests that were written against gnu
and never _tried_ against the toybox version, and failed for reasons
that aren't real failures.
Post by Samuel Holland
As for stdout/stderr, it helps make sure diagnostic messages are going
to the right stream when not using the helper functions.
Right now diagnostic messages are visible in the output when running the
test. There shouldn't be any by default, when there is it's pretty
obvious because those lines aren't colored.

I'm all for improving the test suite, but "what I think the test suite
should be trying to do differs from what you think the test suite should
be trying to do, therefore I am right" is missing some steps.

Rob

- All syllogisms have three parts, therefore this is not a syllogism.
Samuel Holland
2016-03-13 22:04:07 UTC
Post by Rob Landley
Post by Samuel Holland
Post by Rob Landley
Post by Andy Chu
- exit code
blah; echo $?
Post by Andy Chu
- stderr
2>&1
I think the idea here was the importance of differentiating between
stdout and stderr, and between text output and return code.
You can do this now. "2>/dev/null" gives you only stdout, "2>&1
>/dev/null" gives you only stderr, or 2>file... There are many options.
Post by Samuel Holland
simple as having a separate output variable for each type of
output.
Granted, it will usually be unambiguous as to the correctness of
the program,
Yes, adding complexity to every test that isn't usually needed.
Post by Samuel Holland
but having the return code in the output string can be confusing
to the human looking at the test case. Plus, why would you not want
to verify the exit code for every test?
Because science is about reducing variables and isolating to test
specific things?
If you want to reduce variables, see the suggestion about unit testing.
Post by Rob Landley
Because "we can so clearly we should" is the gnu approach to things?
Because you're arguing for adding complexity to the test suite to do
things it can already do, and in many cases is already doing?
find tests -name '*.test' -print0 | xargs -0 grep 'echo.*yes' | wc -l
182

Considering how many times this pattern is already used, I don't see it
adding much complexity. It's trading an ad hoc pattern used in ~17% of
the tests for something more consistent and well-defined. Never mind the
fact that using &&/|| doesn't tell you _what_ the return code was, only
a binary success or failure.

I have seen a couple of tests that pass because they expect failure, but
the command is failing for the wrong reason.
Post by Rob Landley
Because we've already got 5 required arguments for each test and
you're proposing adding a couple more, and clearly that's going to
make the test suite easier to maintain and encourage additional
contributions?
Personally, yes, I think so. If everything is explicit, there is less
"hmmm, how do I test that? I guess I could throw something together
using the shell. Now I have to sift through the existing tests to see if
there is precedent, and what that is." Instead, you put the inputs here,
the outputs there, and you're done.

I admit that filesystem operations are a whole new can of worms, and I
do not have a good answer to those.
Post by Rob Landley
I'm not saying it's _not_ a good extension, but you did ask.
Complexity is a cost, spend it wisely. You're saying each test
should test more things, which means potential false positives and
more investigation about why a test failed (and/or more complex
reporting format).
On the other hand, splitting up the outputs you check gives you _more_
information to help you investigate the problem. Instead of
"FAIL: foobar" you get something like "FAIL (stdout mismatch): foobar".

(As a side note, the test harness I've written recently even gives you a
diff of the expected and actual outputs when the test fails.)
Post by Rob Landley
Also, "the return code" implies none of the tests are pipelines, or
multi-stage "do thing && examine thing" (which _already_ fails if do
thing returned failure, and with the error_msg() stuff would have
said why to stderr already). Yesterday I was poking at mv tests
which have a lot of "mv one two && [ -e two ] && [ ! -e one ] && echo
yes" sort of constructs. What is "the exit code" from that?
Well, if we are testing mv, then the exit code is the exit code of mv.

The rest is just checking filesystem state after the fact. I claimed to
not know what to do about it, but in the interest of avoiding punting
the question, here's the answer off the top of my head (even though I
know you aren't going to like it much):

Add an argument to the testing command that contains a predicate to eval
after running the test. If the predicate returns true, the test
succeeded; if the predicate returns false, the test failed. That way,
the only command that is ever in the "command" argument is the toy to test.
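
A minimal sketch of that suggestion (a hypothetical API, not the current testing.sh): the "command" argument holds only the toy under test, and a separate predicate string is eval'ed afterward to judge filesystem state.

```shell
testing_pred() {
  name=$1; cmd=$2; predicate=$3
  $cmd
  rc=$?
  if [ "$rc" -eq 0 ] && eval "$predicate"; then
    echo "PASS: $name"
  else
    echo "FAIL: $name (exit $rc)"
  fi
}
cd "$(mktemp -d)" || exit 1
touch one
testing_pred "mv basic" "mv one two" "[ -e two ] && [ ! -e one ]"
```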
Post by Rob Landley
Post by Samuel Holland
It's a lot of duplication to write "echo $?" in all of the test
cases.
I don't. Sometimes I go "blah && yes" or "blah || yes", when the
return code is specifically what I'm testing, and sometimes checking
the return code and checking the output are two separate tests.
Okay, so it's a lot of duplication to write "&& yes" all over the place. :)
Post by Rob Landley
Keep in mind that error_msg() and friends produce output, and the
tests don't catch stderr by default but pass it through. If we catch
stderr by default and a test DOESN'T check it, then it's ignored
instead of visible to the caller.
I'm not sure how you could _not_ check stderr. The test case has a
string, the command generates a string, you compare the strings. If you
want to pass it through, nothing prevents that.
Post by Rob Landley
Also, keep in mind I want the host version to pass most of these
tests too, and if there are gratuitous differences in behavior I
don't WANT the test to fail based on something I don't care about
and wasn't trying to test. You're arguing for a tighter sieve with
smaller holes when I've already received a bunch of failing tests
that were written against gnu and never _tried_ against the toybox
version, and failed for reasons that aren't real failures.
If you want to do that, then yes, you definitely need a looser sieve. I
get 60 more failures here with TEST_HOST than I do with allyesconfig. I
agree that checking stderr on toybox vs. busybox or GNU is going to be
impossible because of differing error messages. Possible solutions
include not checking stderr by default (only formalizing the exit code
check), or simply not checking stderr when TEST_HOST=1.
Post by Rob Landley
Post by Samuel Holland
As for stdout/stderr, it helps make sure diagnostic messages are
going to the right stream when not using the helper functions.
Right now diagnostic messages are visible in the output when running
the test. There shouldn't be any by default, when there is it's
pretty obvious because those lines aren't colored.
I'm all for improving the test suite, but "what I think the test
suite should be trying to do differs from what you think the test
suite should be trying to do, therefore I am right" is missing some
steps.
Then I guess it's not exactly clear to me what you are trying to do with
the test suite. My interpretation of the purpose was to verify
correctness (mathematically, string transformations, etc.) and standards
compliance. In that sense, for each set of inputs to a command
(arguments, stdin, filesystem state), there is exactly one set of
correct outputs (exit code, stdout/stderr, filesystem state), and
therefore the goal of the test suite is to compare the actual output to
the correct output and ensure they match. If you don't check the exit
code, you are missing part of the output.



I won't tell you that you have to do it any one way. It's your project.
Of course, complexity is to some extent a value judgment. If you think
it adds too much complexity/strictness for your taste, that's fine. I
was just trying to explain the reasoning behind the suggestion, and why
I think it's a reasonable suggestion.
Post by Rob Landley
Rob
P.S. Your other reply came in just as I had finished typing. Sorry if
some of this is already addressed.

--
Regards,
Samuel Holland <***@sholland.org>
Rob Landley
2016-03-14 05:52:39 UTC
Post by Samuel Holland
Post by Rob Landley
Post by Samuel Holland
but having the return code in the output string can be confusing
to the human looking at the test case. Plus, why would you not want
to verify the exit code for every test?
Because science is about reducing variables and isolating to test
specific things?
If you want to reduce variables, see the suggestion about unit testing.
I want the any complexity to justify itself. Complexity is a cost and I
want to get a reasonable return for it.

That said, what specifically was the suggestion about unit testing. "We
should have some?" We should export a second C interface to something
that isn't a shell command for the purpose of telling us... what,
exactly?
Post by Samuel Holland
Post by Rob Landley
Because "we can so clearly we should" is the gnu approach to things?
Because you're arguing for adding complexity to the test suite to do
things it can already do, and in many cases is already doing?
find tests -name '*.test' -print0 | xargs -0 grep 'echo.*yes' | wc -l
182
Considering how many times this pattern is already used, I don't see it
adding much complexity. It's trading an ad hoc pattern used in ~17% of
the tests for something more consistent and well-defined.
Because 17% of the tests use it, 100% of the tests should get an extra
argument?
Post by Samuel Holland
Never mind the
fact that using &&/|| doesn't tell you _what_ the return code was, only
a binary success or failure.
In most cases I don't CARE what the return code was.

The specification of
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/false.html
says it returns "a non-zero error code". It never specifies what that
error code should _be_. If I return 3 I'm conformant. If the test suite
FAILS because it was expecting 1 and got 2, it's a bad test.
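The "false" point can be shown concretely: a portable test checks the
boolean sense of the exit status rather than comparing against an exact
value the spec never promises.

```shell
# POSIX only requires false(1) to return "a non-zero error code", so a
# portable test checks the boolean sense rather than an exact value:
rc=0; false || rc=$?
[ "$rc" -ne 0 ] && echo "PASS: false returned non-zero ($rc)"
# Comparing against an exact value, e.g. [ "$rc" -eq 1 ], would wrongly
# FAIL a conformant implementation that happened to return 3.
```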
Post by Samuel Holland
I have seen a couple of tests that pass because they expect failure, but
the command is failing for the wrong reason.
Point them out please?
Post by Samuel Holland
Post by Rob Landley
Because we've already got 5 required arguments for each test and
you're proposing adding a couple more, and clearly that's going to
make the test suite easier to maintain and encourage additional
contributions?
Personally, yes, I think so.
I don't.
Post by Samuel Holland
If everything is explicit, there is less
"hmmm, how do I test that? I guess I could throw something together
using the shell. Now I have to sift through the existing tests to see if
there is precedent, and what that is."
The test is a shell script with some convenience functions. It's always
been a shell script with some convenience functions, because it's
testing commands most commonly called from the shell.
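The shape of those convenience functions can be sketched roughly like
this. This is a simplified stand-in for the real scripts/runtest.sh, not
its actual implementation, using the five-argument form mentioned
earlier (name, command, expected stdout, file contents, stdin):

```shell
# Simplified sketch of a testing() helper in the spirit of
# scripts/runtest.sh (not the real implementation). Arguments:
# name / command / expected stdout / file contents / stdin.
testing()
{
  name="$1" cmd="$2"
  expected="$(printf '%b' "$3")" file="$4" stdin="$5"
  if [ -n "$file" ]; then printf '%s' "$file" > testfile; fi
  actual="$(printf '%s' "$stdin" | eval "$cmd")"
  if [ "$actual" = "$expected" ]; then echo "PASS: $name"
  else echo "FAIL: $name"; fi
  rm -f testfile
}

# The mv construct from the discussion, as one test: the exit-code
# check lives inside the command string via && and "echo yes".
touch one
testing "mv" "mv one two && [ -e two ] && [ ! -e one ] && echo yes" \
  "yes\n" "" ""
rm -f two
```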
Post by Samuel Holland
Instead, you put the inputs here,
the outputs there, and you're done.
You want the test not to be obviously a shell script.

One of the failure cases I've seen in contributed tests is they're
testing what toybox does, not what the command is expected to do. The
command could produce different conformant behavior, but the test would
break.
Post by Samuel Holland
I admit that filesystem operations are a whole new can of worms, and I
do not have a good answer to those.
I could list testing cans of worms here for a long time, but I've been
listing cans of worms all day and am tired. The more rigid your
infrastructure is, the smaller the range of situations it copes with.
Post by Samuel Holland
Post by Rob Landley
I'm not saying it's _not_ a good extension, but you did ask.
Complexity is a cost, spend it wisely. You're saying each test
should test more things, which means potential false positives and
more investigation about why a test failed (and/or more complex
reporting format).
On the other hand, splitting up the outputs you check gives you _more_
information to help you investigate the problem. Instead of
"FAIL: foobar" you get something like "FAIL (stdout mismatch): foobar".
Right now every failure is a stdout mismatch, since that's the only
thing it's checking, and you can go:

VERBOSE=fail make test_ls

And have it not only stop at the first failure, but show you the diff
between actual and expected, plus show you the command line it ran.

Your solution is to load more information into each test and write more
infrastructure to output which information failed. I.e., make everything
bigger and more complicated without actually adding new capabilities.
Post by Samuel Holland
(As a side note, the test harness I've written recently even gives you a
diff of the expected and actual outputs when the test fails.)
So does this one, VERBOSE=1 shows the diff for all of them, VERBOSE=fail
stops after the first failure. It's not the DEFAULT output because it's
chatty.

Type "make help" and look at the "test" target. I think it's some of the
web documentation too, and it's also in the big comment block at the
start of scripts/runtest.sh.
Post by Samuel Holland
Post by Rob Landley
Also, "the return code" implies none of the tests are pipelines, or
multi-stage "do thing && examine thing" (which _already_ fails if do
thing returned failure, and with the error_msg() stuff would have
said why to stderr already). Yesterday I was poking at mv tests
which have a lot of "mv one two && [ -e two ] && [ ! -e one ] && echo
yes" sort of constructs. What is "the exit code" from that?
Well, if we are testing mv, then the exit code is the exit code of mv.
Not in the above test it isn't. "mv" isn't necessarily the first thing
we run, or the last thing we run, in a given pipeline.

We have a test for "xargs", which is difficult to run _not_ in a
pipeline. When you test "nice" or "chroot" or "time", the command has an
exit code and its child could have an exit code. It's NOT THAT SIMPLE.
Post by Samuel Holland
The rest is just checking filesystem state after the fact. I claimed to
not know what to do about it, but in the interest of avoiding punting
the question, here's the answer off the top of my head (even though I
Add an argument to the testing command that contains a predicate to eval
after running the test. If the predicate returns true, the test
succeeded; if the predicate returns false, the test failed. That way,
the only command that is ever in the "command" argument is the toy to test.
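The predicate idea being suggested here could look something like the
following hypothetical variant (this function does not exist in toybox;
it's a sketch of the proposal):

```shell
# Hypothetical variant of testing() with a post-run predicate argument,
# so the "command" argument only ever contains the toy under test:
testing_pred()
{
  name="$1" cmd="$2" predicate="$3"
  eval "$cmd"
  if eval "$predicate"; then echo "PASS: $name"
  else echo "FAIL: $name"; fi
}

# The mv example again: the command is just "mv", and the filesystem
# checks move into the predicate.
touch one
testing_pred "mv" "mv one two" "[ -e two ] && [ ! -e one ]"
rm -f two
```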
If you would like to write a completely different test suite from the
one I've done, feel free. I'm not stopping you.
Post by Samuel Holland
Post by Rob Landley
Post by Samuel Holland
It's a lot of duplication to write "echo $?" in all of the test
cases.
I don't. Sometimes I go "blah && yes" or "blah || yes", when the
return code is specifically what I'm testing, and sometimes checking
the return code and checking the output are two separate tests.
Okay, so it's a lot of duplication to write "&& yes" all over the place. :)
"&& echo yes", and no it isn't really. Compared to having an extra
argument in the other 3/4 of the tests that don't currently care about
the exit code, plus an exception mechanism for "we don't actually care
what this exit code is, just that it's nonzero"...
Post by Samuel Holland
Post by Rob Landley
Keep in mind that error_msg() and friends produce output, and the
tests don't catch stderr by default but pass it through. If we catch
stderr by default and a test DOESN'T check it, then it's ignored
instead of visible to the caller.
I'm not sure how you could _not_ check stderr. The test case has a
string, the command generates a string, you compare the strings.
By default it intercepts stdout and stderr goes to the terminal. The
shell won't care what gets produced on stderr if the resulting exit code
is then 0 either.
Post by Samuel Holland
If you want to pass it through, nothing prevents that.
I don't understand what you're saying here. I already pointed out you
can redirect and intercept it and make it part of your test. That said,
perror_msg appends a translated error string so exact matches on english
will fail in other locales. Plus kernel version changes have been known
to change what errno a given syscall failure returns. Heck, different
filesystem types sometimes do that too. (Reiserfs was notorious for that.)
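One way to fold stderr into the comparison while sidestepping the locale
problem is to pin LC_ALL; the remaining caveat, as noted above, is that
the exact wording still differs between implementations and kernel
versions. A sketch:

```shell
# Fold stderr into the captured output and pin the locale so translated
# error strings come out in English. The exact wording still differs
# between toybox/busybox/coreutils, which is why exact stderr matches
# stay fragile:
err="$(LC_ALL=C cat /nonexistent 2>&1)" || true
case "$err" in
  *"No such file or directory"*) echo "got ENOENT message" ;;
  *) echo "unexpected: $err" ;;
esac
```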
Post by Samuel Holland
Post by Rob Landley
Also, keep in mind I want the host version to pass most of these
tests too, and if there are gratuitous differences in behavior I
don't WANT the test to fail based on something I don't care about
and wasn't trying to test. You're arguing for a tighter sieve with
smaller holes when I've already received a bunch of failing tests
that were written against gnu and never _tried_ against the toybox
version, and failed for reasons that aren't real failures.
If you want to do that, then yes, you definitely need a looser sieve. I
get 60 more failures here with TEST_HOST than I do with allyesconfig. I
agree that checking stderr on toybox vs. busybox or GNU is going to be
impossible because of differing error messages.
I've annotated a few tests with 'expected to fail with non-toybox', grep
for SKIP_HOST=1 in tests/*.test.
Post by Samuel Holland
Possible solutions
include not checking stderr by default (only formalizing the exit code
check), or simply not checking stderr when TEST_HOST=1.
I.e., add another special-case context with different behavior.
Post by Samuel Holland
Post by Rob Landley
Post by Samuel Holland
As for stdout/stderr, it helps make sure diagnostic messages are
going to the right stream when not using the helper functions.
Right now diagnostic messages are visible in the output when running
the test. There shouldn't be any by default, when there is it's
pretty obvious because those lines aren't colored.
I'm all for improving the test suite, but "what I think the test
suite should be trying to do differs from what you think the test
suite should be trying to do, therefore I am right" is missing some
steps.
Then I guess it's not exactly clear to me what you are trying to do with
the test suite.
I had a list of 3 reasons in a previous email.

I run a bunch of tests manually when developing a command to determine
whether or not I'm happy with the command's behavior. My rule of thumb
is if I run a test command line during development, I should have an
equivalent test in the test suite for regression testing purposes. (I
had to run this test to check the behavior of the command, therefore it
is a necessary test and should be in the regression test suite. Usually
I cut and paste the command lines I ran to a file, along with the
output, and throw it on the todo list. Running "find" or "sed" tests
against the toybox source isn't stable for reasons listed earlier, or
grep -ABC against the README with all the weird possible -- placements,
so translating the tests into the format I need isn't always obvious.)

Then I want to do a second pass reading the specs closely (posix, man
page, whatever the spec I'm implementing from is) and do tests checking
every constraint in the spec. (That's one of my "run up to the 1.0
release" todo items.)

This is why so much of the test suite is still on the todo list. There's
a lot of work in doing it _right_...
Post by Samuel Holland
My interpretation of the purpose was to verify
correctness (mathematically, string transformations, etc.) and standards
compliance.
Alas, as I know from implementing a lot of this stuff, determining what
"correctness" _means_ is often non-obvious and completely undocumented.
Post by Samuel Holland
In that sense, for each set of inputs to a command
(arguments, stdin, filesystem state), there is exactly one set of
correct outputs (exit code, stdout/stderr, filesystem state)
This isn't how reality works.

See "false" above. 3 is an acceptable return code, and that's a TRIVIAL
case. The most interesting things to test are the error paths, and the
error output is almost never rigidly specified, and in the toybox case
perror_exit output is partly translated. And then you get into "toybox,
busybox, and ubuntu produce different output but I want at least _most_
of the tests to pass in common"...
Post by Samuel Holland
, and
therefore the goal of the test suite is to compare the actual output to
the correct output and ensure they match. If you don't check the exit
code, you are missing part of the output.
Remember the difference between android and toybox uptime output? Or how
about pmap, what should its output show? The only nonzero return code
base64 can do is failure to write to stdout, but I recently added tests
to check that === was being wrapped by -w properly (because previously
it wasn't). Is error return code the defining characteristic of an
nbd-client test? (How _do_ you test that?)
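The base64 wrapping test mentioned above can be exercised directly,
assuming an implementation that supports -w (GNU coreutils and toybox
both do): the padding characters must wrap along with the rest of the
output.

```shell
# "base64 -w" wraps output (including the trailing '=' padding) at the
# requested column; 4 input bytes encode to 8 characters, so -w 4
# should produce two lines:
printf 'aaaa' | base64 -w 4
# YWFh
# YQ==
```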

Here is a cut and paste of the _entire_ man page of setsid:

SETSID(1) User Commands SETSID(1)

NAME
setsid - run a program in a new session

SYNOPSIS
setsid program [arg...]

DESCRIPTION
setsid runs a program in a new session.

SEE ALSO
setsid(2)

AUTHOR
Rick Sladkey <***@world.std.com>

AVAILABILITY
The setsid command is part of the util-linux package and is
available
from ftp://ftp.kernel.org/pub/linux/utils/util-linux/.

util-linux November 1993 SETSID(1)

Now tell me: what error return codes should it produce, and under what
circumstances? Are the error codes the man page doesn't bother to
mention an important part of testing this command, or is figuring out
how to distinguish a session leader (possibly with some sort of pty
wrapper plumbing to signal it through) more important to testing this
command?
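One plausible way to test the behavior that actually matters here,
rather than unspecified error codes, is to check that the child really
is a session leader: in a new session, the child's session ID equals its
own PID. This assumes ps(1) understands the "sid" output key (procps
does; exact toybox support may vary):

```shell
# In a new session the child's SID equals its PID, so the two columns
# printed here should match:
setsid sh -c 'ps -o pid= -o sid= -p $$'
```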
Post by Samuel Holland
I won't tell you that you have to do it any one way. It's your project.
Of course, complexity is to some extent a value judgment. If you think
it adds too much complexity/strictness for your taste, that's fine. I
was just trying to explain the reasoning behind the suggestion, and why
I think it's a reasonable suggestion.
I'd like to figure out how to test the commands we've got so that if
they break in a way we care about, the test suite tells us rather than
us having to find it out. I don't care if false returns 3 and nothing
will ever notice. I _do_ care that the perl build broke because sed
wasn't doing a crazy thing I didn't know it had to do, which is why
commit 32b3587af261 added a test. If somebody has to implement a new sed
in future, that test shows them a thing it needs to do to handle that
crazy situation. (Unless perl gets fixed, which seems unlikely. But if
so, git annotate on the test suite shows why the test was added,
assuming the comment itself before the test isn't enough.)

Part of what the test suite does is make me re-think through what the
correct behavior _is_ in various corner cases, and I'm not sure setsid's
current behavior is remotely correct. I always meant to revisit it when
doing the shell...
Post by Samuel Holland
Post by Rob Landley
Rob
P.S. Your other reply came in just as I had finished typing. Sorry if
some of this is already addressed.
It's fine. Figuring out the right thing to do is often hard.

Rob
Samuel Holland
2016-03-15 01:58:55 UTC
Permalink
Your previous email definitely clarified how you want the test suite to
work, thank you.

I tried to answer your questions while avoiding duplication. I realize
this thread is getting towards bikeshedding territory, so I've attempted
to focus on the more factual/neutral/useful parts.
Post by Rob Landley
Post by Samuel Holland
Post by Rob Landley
Because science is about reducing variables and isolating to test
specific things?
If you want to reduce variables, see the suggestion about unit testing.
That said, what specifically was the suggestion about unit testing.
"We should have some?" We should export a second C interface to
something that isn't isn't a shell command for the purpose of
telling us... what, exactly?
only having integration tests is why it's so hard to test toybox ps
and why it's going to be hard to fuzz the code: we're missing the
boundaries that let us test individual pieces. it's one of the major
problems with the toybox design/coding style. sure, it's something
all the existing competition in this space gets wrong too, but it's
the most obvious argument for the creation of the _next_ generation
tool...
There is only so much variable-reduction you can do if you test the
whole program at once. If you want to, as you suggested, "test specific
things", like the command infrastructure, thoroughly, they have to be
tested apart from the limits of the commands they are used in.
Post by Rob Landley
If we need to test C functions in ways that aren't easily allowed by
the users of those C functions, we can write a toys/example command
that calls those C functions in the way we want to check.
I think we actually agree with each other here.
Post by Rob Landley
Post by Samuel Holland
Considering how many times this pattern is already used, I don't
see it adding much complexity. It's trading an ad hoc pattern used
in ~17% of the tests for something more consistent and
well-defined.
Because 17% of the tests use it, 100% of the tests should get an
extra argument?
It's not adding any more features, just refactoring the existing
behavior behind a common function instead of repeating it throughout the
test suite. For how to avoid adding complexity where it's not used, I'll
refer back to an earlier suggestion:
Post by Rob Landley
Yes, that is exactly what I was getting at. Instead of "testing",
there could be another function "testing-errors" or something. But
it's not super important right now.
Post by Samuel Holland
I have seen a couple of tests that pass because they expect
failure, but the command is failing for the wrong reason.
Point them out please?
I don't remember specifics at this point. I haven't looked at the test
suite in much detail (other than reading the mailing list) since the end
of 2014 or so when I was working on using it in a toy distro.

http://thread.gmane.org/gmane.linux.toybox/1709
https://github.com/smaeul/escapist/commits/master

If I remember correctly, one of them failed because it got a SIGSEGV,
but to a shell that's just false. The other one was not crashing, but
failing for another reason than expected. If I had to guess, one of them
was cp, but that's because it's the one I spent the most time on. I'm
positive they are both fixed now.
Post by Rob Landley
you can go
VERBOSE=fail make test_ls
And have it not only stop at the first failure, but show you the diff
between actual and expected, plus show you the command line it ran.
<snip>
Post by Samuel Holland
(As a side note, the test harness I've written recently even gives
you a diff of the expected and actual outputs when the test
fails.)
So does this one, VERBOSE=1 shows the diff for all of them,
VERBOSE=fail stops after the first failure. It's not the DEFAULT
output because it's chatty.
Type "make help" and look at the "test" target. I think it's some of
the web documentation too, and it's also in the big comment block at
the start of scripts/runtest.sh.
Okay, to some extent, I actually like that way better than mine. It
gives you an overview of how close you are to conformance (you can count
the failing tests, instead of quitting at the first failure), yet lets
you drill down when desired. Like I said, I haven't studied the test
infrastructure recently; I should go do that.
Post by Rob Landley
Post by Samuel Holland
Post by Rob Landley
Also, "the return code" implies none of the tests are pipelines,
or multi-stage "do thing && examine thing" (which _already_ fails
if do thing returned failure, and with the error_msg() stuff
would have said why to stderr already). Yesterday I was poking at
mv tests which have a lot of "mv one two && [ -e two ] && [ ! -e
one ] && echo yes" sort of constructs. What is "the exit code"
from that?
Well, if we are testing mv, then the exit code is the exit code of mv.
Not in the above test it isn't. "mv" isn't necessarily the first
thing we run, or the last thing we run, in a given pipeline.
Right. The whole point was that (in my ideal test suite) mv (or any
other program being tested) should never _be_ in a pipeline. That way
you don't have to even consider how the pipeline works in xyz shell.
Post by Rob Landley
We have a test for "xargs", which is difficult to run _not_ in a
pipeline.
Redirecting stdin from a file (which could be temporary) doesn't do
weird things with return values like a shell pipeline does.
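The difference being pointed at is easy to show: a pipeline's $? only
reflects its last command, so a failing producer is invisible, while a
redirect leaves a single unambiguous status.

```shell
# In a pipeline, $? only reflects the LAST command, so a failing
# producer goes unnoticed:
false | cat
echo "pipeline status: $?"        # prints 0 even though false failed

# Feeding xargs from a file instead leaves one unambiguous status:
printf 'one two\n' > args.txt
xargs echo < args.txt
echo "redirect status: $?"
rm -f args.txt
```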
Post by Rob Landley
When you test "nice" or "chroot" or "time", the command has an exit
code and its child could have an exit code. It's NOT THAT SIMPLE.
"nice true", "xargs echo", "chroot . true", etc. I'm not sure how "true"
or "echo" would have any other exit code than 0. (If it does, your
shell/echo/true is majorly broken, and you might as well give up.)
Post by Rob Landley
Post by Samuel Holland
Post by Rob Landley
Keep in mind that error_msg() and friends produce output, and the
tests don't catch stderr by default but pass it through. If we
catch stderr by default and a test DOESN'T check it, then it's
ignored instead of visible to the caller.
I'm not sure how you could _not_ check stderr. The test case has a
string, the command generates a string, you compare the strings.
By default it intercepts stdout and stderr goes to the terminal. The
shell won't care what gets produced on stderr if the resulting exit
code is then 0 either.
Post by Samuel Holland
If you want to pass it through, nothing prevents that.
I don't understand what you're saying here. I already pointed out you
can redirect and intercept it and make it part of your test.
I should have been more clear: I was confused by why you were
considering "If we catch stderr by default and a test DOESN'T check
it..." If stderr is caught by the test infrastructure and the test
doesn't specify anything for it, it would be compared against the empty
string.
The test would have to actively throw it away (2>/dev/null or something)
for it to not be checked.

I am aware you can pass stderr through to the terminal without checking
it, and that that's what the toybox test suite currently does. "If you
want to pass it through, nothing prevents that." was meant to point out
that, even if stderr was caught (for checking) by default, it could
_also_ be sent to the terminal if you wanted to.

(I think a lot of my writing suffers from "it makes sense to me...".)
Post by Rob Landley
That said, perror_msg appends a translated error string so exact
matches on english will fail in other locales.
Set LC_MESSAGES=C in the test infrastructure? By this time, I've
realized that checking stderr for an expected value is often going to be
impossible...
Post by Rob Landley
Plus kernel version changes have been known to change what errno a
given syscall failure returns. Heck, different filesystem types
sometimes do that too. (Reiserfs was notorious for that.)
...and apparently errno isn't reliable either. I thought the kernel
didn't break userspace? I guess that contract doesn't include "why you
can't do that."

Okay, point taken. I wasn't aware that return codes were so loosely
specified. I was under the impression that programs would generally just
exit with the last errno (or 1 for some other error), and that errno
values were well-specified at the libc/kernel level.
Post by Rob Landley
Post by Samuel Holland
and therefore the goal of the test suite is to compare the actual
output to the correct output and ensure they match. If you don't
check the exit code, you are missing part of the output.
Remember the difference between android and toybox uptime output? Or
how about pmap, what should its output show? The only nonzero return
code base64 can do is failure to write to stdout, but I recently
added tests to check that === was being wrapped by -w properly
(because previously it wasn't). Is error return code the defining
characteristic of an nbd-client test? (How _do_ you test that?)
SETSID(1) User Commands SETSID(1)
NAME setsid - run a program in a new session
SYNOPSIS setsid program [arg...]
DESCRIPTION setsid runs a program in a new session.
SEE ALSO setsid(2)
AVAILABILITY The setsid command is part of the util-linux package
and is available from
ftp://ftp.kernel.org/pub/linux/utils/util-linux/.
util-linux November 1993 SETSID(1)
Now tell me: what error return codes should it produce, and under
what circumstances? Are the error codes the man page doesn't bother
to mention an important part of testing this command, or is figuring
out how to distinguish a session leader (possibly with some sort of
pty wrapper plumbing to signal it through) more important to testing
this command?
Of course I don't claim that return codes are the most important, by any
means. I just think^Wthought they were a relatively low-overhead thing
to test _in_addition_ to the important stuff, that might catch some
additional corner cases. As for "setsid", in my opinion, it should
return the errno from setsid() or exec*() if either fails. After it
execs, it doesn't really have a say.

Amusingly, my setsid (from util-linux 2.26.2, which has two real
options!) manages to fail rather spectacularly:

setsid: child 8695 did not exit normally: Success
setsid: failed to execute htop: Invalid or incomplete multibyte or wide
character
Post by Rob Landley
I'd like to figure out how to test the commands we've got so that if
they break in a way we care about, the test suite tells us rather
than us having to find it out. I don't care if false returns 3 and
nothing will ever notice.
Hmmm, difference of viewpoint. I see the command line interface of these
programs as an API, just like any other. You mention their use in shell
scripts. It would be a regression to gratuitously change the output of a
command, even if it is still within the relevant standard. The argument
against that is that shell scripts should follow the standard, not a
specific implementation; but as you often bring up, the standards are
often incomplete anyway.
Post by Rob Landley
One of the failure cases I've seen in contributed tests is they're
testing what toybox does, not what the command is expected to do.
and (as I see it) such is the difference between regression testing and
testing for conformance. And both are useful. I'm all for continuous
refactoring of internal logic, but externally-visible behavior makes
more promises. Of course, toybox isn't 1.0 yet, so users should expect
changes in behavior... I give up.

What would be nice is if there was a POSIX test suite for commands and
utilities... Apparently there was at one point, some website mentioned
it, but it's not listed on the downloads page:

http://www.opengroup.org/testing/downloads.html

Doing some URL fiddling got me here:

http://www.opengroup.org/testing/downloads/vsclite.html

It's gone. Of course you can't just use the new one, you have to try to
get certified to even download it.
Post by Rob Landley
If you would like to write a completely different test suite from the
one I've done, feel free. I'm not stopping you.
I will probably end up trying that, at least for POSIX, because a freely
available test suite is generally useful (and I'm young enough to enjoy
writing something for the educational value even if it gets all thrown
away). How? grep, regexec(), I'll figure something out.

Again, thank you for setting me straight.

--
Regards,
Samuel Holland
