
Opened 14 years ago

Closed 14 years ago

#36 closed defect (fixed)

file-position broken for utf16 and utf32

Reported by: Raymond Toy
Owned by: somebody
Priority: minor
Milestone:
Component: Core
Version: 2010-01
Keywords:
Cc:

Description

Consider this code:

(defun bug (&optional (format :utf16))
  (with-open-file (s "/tmp/bom.txt"
                     :direction :output
                     :if-exists :supersede
                     :external-format format)
    (format s "Hello~%"))
  (with-open-file (s "/tmp/bom.txt"
                     :direction :input
                     :external-format format)
    (print (read-char s))
    (print (file-position s)))
  (values))

Running (bug :utf16) produces

#\H
2

(bug :utf32) produces

#\H
4

In both cases the reported position is wrong: it should be 4 for utf16 and 8 for utf32. The BOM has been ignored.

This is caused by STRING-ENCODE outputting the BOM for these formats. STRING-ENCODE is used to figure out how many octets have been read from the file but not yet processed. If the BOM were not output, the position would be correct.
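The arithmetic behind the wrong answer can be sketched with a toy model. This is not the actual stream code: TOY-UTF16-OCTET-COUNT and TOY-FILE-POSITION are hypothetical names, the encoder handles only characters that fit in one 16-bit unit, and it assumes the whole 14-octet file has already been read into the buffer.

```lisp
;; Toy model only; these functions are hypothetical, not part of the
;; implementation discussed in this ticket.
(defun toy-utf16-octet-count (string &key (bom t))
  "Octets a UTF-16 encoder would emit for STRING (BMP characters only)."
  (+ (if bom 2 0) (* 2 (length string))))

(defun toy-file-position (octets-consumed unread-chars &key (bom t))
  "Octets consumed from the file minus the octets that the unread
characters re-encode to."
  (- octets-consumed (toy-utf16-octet-count unread-chars :bom bom)))

;; "Hello" plus newline written as utf16 with a BOM is 2 + 6*2 = 14 octets.
;; After reading #\H, five characters remain unread.
(let ((unread (format nil "ello~%")))
  (list (toy-file-position 14 unread :bom t)     ; BOM counted a second time
        (toy-file-position 14 unread :bom nil))) ; BOM left out
;; => (2 4)
```

Re-encoding the unread characters with a BOM overstates the unprocessed octets by the size of the BOM, which gives the 2 reported above instead of the expected 4.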

This bug (will) occur in the 2010-02 snapshot and later.

Change History (4)

comment:1 Changed 14 years ago by Raymond Toy

Version: 19f → 2010-01

comment:2 Changed 14 years ago by Raymond Toy

One possible solution is to keep track of the number of octets used to create each character. This has a relatively high cost because we need to save this for each character, for all inputs, but the data is only used for file-position. This seems really wasteful of MIPS and memory since file-position probably occurs much less often than reading characters.

Another alternative would be to modify string-encode so that the BOM is not included. But that's a bit tricky too. Either we need a new method for each external format (that needs it) or we need to add an extra parameter to the external format method to say we don't want a BOM. Not too hard to do, but some work to modify every format for this.

Or maybe string-encode can take a new argument specifying the ef state. But then we would need a new ef function to give us the ef state that will guarantee no BOM.

Or, the most hackish but workable solution is to look at the output of string-encode. If it starts with the BOM (two octets for utf16, four for utf32), adjust for that. Not elegant, but it seems doable.
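A minimal sketch of that last idea for utf16, assuming the encoder's output is available as a vector of octets. OCTETS-WITHOUT-UTF16-BOM is a hypothetical helper, not an existing function.

```lisp
;; Hypothetical helper illustrating the "inspect the output" approach.
(defun octets-without-utf16-bom (octets)
  "Length of OCTETS, not counting a leading UTF-16 BOM (FE FF or FF FE)."
  (let ((n (length octets)))
    (if (and (>= n 2)
             (or (and (= (aref octets 0) #xFE) (= (aref octets 1) #xFF))
                 (and (= (aref octets 0) #xFF) (= (aref octets 1) #xFE))))
        (- n 2)
        n)))

;; "H" encoded big-endian with a BOM is FE FF 00 48: two octets of payload.
(octets-without-utf16-bom (vector #xFE #xFF #x00 #x48))
```

One weakness of this check is that it cannot tell an encoder-generated BOM from text that genuinely begins with U+FEFF.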

comment:3 Changed 14 years ago by Raymond Toy

Keeping track of the octets is probably the only "correct" solution. There's no guarantee that the input (octet-to-code) state has any relationship to the output (code-to-octet) state, so there may be no consistent way to run string-encode correctly.

Some tests with keeping track of the char lengths indicate that the cost is fairly low, at least when reading characters one at a time (but the conversion is still done a block at a time and doled out one character at a time).

comment:4 Changed 14 years ago by Raymond Toy

Resolution: fixed
Status: new → closed

Fixed by using an array to hold the octet length of each character. Tests show a very small change in speed (about a 1% increase in time).
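The idea can be illustrated with a toy decoder. This is only a sketch, not the committed change: TOY-DECODE-UTF16BE and TOY-POSITION-AFTER are hypothetical names, only big-endian input with single-unit characters is handled, and charging the BOM to the first character is an assumption made here for simplicity.

```lisp
;; Toy illustration of per-character octet lengths; all names hypothetical.
(defun toy-decode-utf16be (octets)
  "Decode big-endian UTF-16 OCTETS (BMP only, optional leading BOM).
Returns the string and a vector of octets consumed per character.
A leading BOM is charged to the first character."
  (let* ((start (if (and (>= (length octets) 2)
                         (= (aref octets 0) #xFE)
                         (= (aref octets 1) #xFF))
                    2
                    0))
         (nchars (floor (- (length octets) start) 2))
         (string (make-string nchars))
         (lengths (make-array nchars :element-type 'fixnum)))
    (dotimes (i nchars)
      (let ((pos (+ start (* 2 i))))
        (setf (char string i)
              (code-char (+ (* 256 (aref octets pos))
                            (aref octets (1+ pos)))))
        (setf (aref lengths i) (if (zerop i) (+ 2 start) 2))))
    (values string lengths)))

(defun toy-position-after (lengths chars-read)
  "Octet offset once CHARS-READ characters have been handed out."
  (reduce #'+ lengths :end chars-read))

;; "Hi" with a BOM: FE FF 00 48 00 69
(multiple-value-bind (string lengths)
    (toy-decode-utf16be (vector #xFE #xFF #x00 #x48 #x00 #x69))
  (list string (toy-position-after lengths 1)))
;; => ("Hi" 4)
```

Because the lengths are recorded while decoding, the position never depends on re-encoding the unread characters, so the encoder's BOM behaviour no longer matters.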
