Opened 15 years ago
Closed 14 years ago
#36 closed defect (fixed)
file-position broken for utf16 and utf32
Reported by: | Raymond Toy | Owned by: | somebody |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | Core | Version: | 2010-01 |
Keywords: | Cc: |
Description
Consider this code:
(defun bug (&optional (format :utf16))
  (with-open-file (s "/tmp/bom.txt"
                     :direction :output
                     :if-exists :supersede
                     :external-format format)
    (format s "Hello~%"))
  (with-open-file (s "/tmp/bom.txt"
                     :direction :input
                     :external-format format)
    (print (read-char s))
    (print (file-position s)))
  (values))
Running (bug :utf16)
produces
#\H 2
(bug :utf32)
produces
#\H 4
In both cases, the reported position is wrong. For utf16, the position should be 4; for utf32, 8. The BOM has been ignored.
This is caused by STRING-ENCODE outputting the BOM for these formats. STRING-ENCODE is used to figure out how many octets have been read from the file but not yet processed. If the BOM were not output, the position would be correct.
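The arithmetic behind the wrong values can be sketched in portable Common Lisp. This is only a model and not CMUCL's stream code: the names encoded-octet-count and estimated-position are hypothetical, each character is assumed to fit in one code unit, and the whole file is assumed to have been read into the buffer already.

```lisp
;; Hypothetical model: octets needed to encode STRING in a fixed-width
;; format, optionally preceded by a BOM of the same width.
;; Assumes every character fits in one code unit (true for "Hello").
(defun encoded-octet-count (string unit-size &key bom)
  (+ (if bom unit-size 0)
     (* unit-size (length string))))

;; Position = octets consumed from the file minus the octets that the
;; still-unread characters would occupy when re-encoded.
(defun estimated-position (octets-read unread unit-size &key bom)
  (- octets-read (encoded-octet-count unread unit-size :bom bom)))

(defparameter *contents* (format nil "Hello~%"))
(defparameter *unread* (subseq *contents* 1)) ; state after (read-char s)

;; UTF-16: the file is 14 octets (2 for the BOM, 12 for the text).
(estimated-position 14 *unread* 2 :bom t)   ; => 2, the reported value
(estimated-position 14 *unread* 2 :bom nil) ; => 4, the expected value

;; UTF-32: the file is 28 octets (4 for the BOM, 24 for the text).
(estimated-position 28 *unread* 4 :bom t)   ; => 4, the reported value
(estimated-position 28 *unread* 4 :bom nil) ; => 8, the expected value
```

Counting a BOM in the re-encoded unread text overstates the unread octets by exactly the BOM size, which matches the shortfall seen in both cases.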
This bug (will) occur in the 2010-02 snapshot and later.
Change History (4)
comment:1 Changed 15 years ago by
Version: | 19f → 2010-01 |
---|
comment:2 Changed 14 years ago by
comment:3 Changed 14 years ago by
Keeping track of the octets is probably the only "correct" solution. There's no guarantee that the input (octet-to-code) state has any relationship to the output (code-to-octet) state, so there may be no consistent way to run string-encode correctly.
Some tests with keeping track of the char lengths indicate that the cost is fairly low, at least when reading characters one at a time (but the conversion is still done a block at a time and doled out one character at a time).
comment:4 Changed 14 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Fixed by using an array to hold the octet length of each character. Tests show a very small change in speed (about a 1% increase in time).
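A minimal sketch of the per-character bookkeeping, in portable Common Lisp. It does not reproduce the actual CMUCL change: position-after and *utf16-lengths* are hypothetical names, and charging the BOM's octets to the first character is an assumption made here for illustration.

```lisp
;; Hypothetical sketch: OCTET-LENGTHS holds, for each decoded character
;; in the buffer, the number of octets that produced it.
(defun position-after (buffer-start octet-lengths chars-consumed)
  "File position after CHARS-CONSUMED characters of the buffer are read.
BUFFER-START is the file offset of the first octet in the buffer."
  (+ buffer-start
     (reduce #'+ octet-lengths :end chars-consumed)))

;; "Hello~%" in UTF-16 with a BOM. Assumption: the BOM's 2 octets are
;; charged to the first character, so H accounts for 4 octets and each
;; remaining character for 2.
(defparameter *utf16-lengths*
  (make-array 6 :element-type '(unsigned-byte 8)
                :initial-contents '(4 2 2 2 2 2)))

(position-after 0 *utf16-lengths* 1) ; => 4
(position-after 0 *utf16-lengths* 6) ; => 14
```

With the lengths recorded at decode time, file-position only has to sum the entries for the characters already consumed, and string-encode is never consulted.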
One possible solution is to keep track of the number of octets used to create each character. This has a relatively high cost because we need to save this for each character, for all inputs, but the data is only used for file-position. This seems really wasteful of MIPS and memory since file-position probably occurs much less often than reading characters.
Another alternative would be to modify string-encode so that the BOM is not included. But that's a bit tricky too. Either we need a new method for each external format (that needs it) or we need to add an extra parameter to the external format method to say we don't want a BOM. Not too hard to do, but some work to modify every format for this.
Or maybe string-encode can take a new argument specifying the ef state. But then we would need a new ef function to give us the ef state that will guarantee no BOM.
Or, the most hackish but workable solution: look at the output of string-encode and, if the first two octets are the BOM, adjust for that. A bit hackish, but seems doable.
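The last idea could look roughly like the following for UTF-16. The helper names utf16-bom-length and unread-octet-count are hypothetical, and the octet vector stands in for whatever string-encode returns; UTF-32, with its four-octet BOM, would need its own check.

```lisp
;; Hypothetical helper: number of leading octets in OCTETS that form a
;; UTF-16 BOM (FE FF or FF FE), or 0 if there is none.
(defun utf16-bom-length (octets)
  (if (and (>= (length octets) 2)
           (let ((a (aref octets 0))
                 (b (aref octets 1)))
             (or (and (= a #xFE) (= b #xFF))
                 (and (= a #xFF) (= b #xFE)))))
      2
      0))

;; Octets that the unread characters really occupy in the file,
;; ignoring a BOM that the encoder prepended.
(defun unread-octet-count (octets)
  (- (length octets) (utf16-bom-length octets)))

;; "ello" plus newline, encoded big-endian with a leading BOM.
(unread-octet-count
 (vector #xFE #xFF 0 #x65 0 #x6C 0 #x6C 0 #x6F 0 #x0A)) ; => 10
```

One caveat with this approach: if the unread text itself begins with the character U+FEFF, its encoding is indistinguishable from a BOM, so the adjustment would be applied when it should not be.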