8bfe8a6bfdd45de43c626a12fd176750486a0759 - binutils-gdb

commit: 8bfe8a6bfdd45de43c626a12fd176750486a0759
author: Andrew Burgess <aburgess@redhat.com>
Fri Feb 14 11:51:41 2025 +0000
committer: Andrew Burgess <aburgess@redhat.com>
Sat Mar 15 12:36:46 2025 +0000
tree: f009f658a4f681d144a0c390a7ec3f23524f11a6
parent: c7d973ab6189290fc894529cec5db7f585074ab4

gdb/python: handle non-utf-8 character from gdb.execute()

I noticed that gdb.execute() could not return a string containing
non-UTF-8 characters.  For example, using the binary from the
gdb.python/py-source-styling.exp test:

  (gdb) file ./gdb/testsuite/outputs/gdb.python/py-source-styling/py-source-styling
  Reading symbols from ./gdb/testsuite/outputs/gdb.python/py-source-styling/py-source-styling...
  (gdb) set style enabled off
  (gdb) list 26
  21	  int some_variable = 1234;
  22
  23	  /* The following line contains a character that is non-utf-8.  This is a
  24	     critical part of the test as Python 3 can't convert this into a string
  25	     using its default mechanism.  */
  26	  char c[] = "�";		/* List this line.  */
  27
  28	  return 0;
  29	}
  (gdb) python print(gdb.execute('list 26', to_string=True))
  Python Exception <class 'UnicodeDecodeError'>: 'utf-8' codec can't decode byte 0xc0 in position 250: invalid start byte
  Error occurred in Python: 'utf-8' codec can't decode byte 0xc0 in position 250: invalid start byte

Styling must be disabled before the initial 'list 26'.  Otherwise the
source is passed through GNU source highlight, which seems to be smart
enough to figure out the character encoding and convert the text to
UTF-8.  This conversion is then cached in the source cache, and the
later Python gdb.execute call gets back a pure UTF-8 string.

If source styling is disabled, then GDB caches the string without the
conversion to UTF-8.  The gdb.execute call then gets back a string
with a non-UTF-8 character within it, and Python throws an error
during its attempt to create a string object.
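
The failure can be reproduced outside GDB.  The following is a minimal
sketch, assuming the offending byte is 0xc0 as reported in the error
message above; the variable names are illustrative and are not part of
GDB:

```python
# Raw bytes as they might sit in the source cache when styling is off.
# The 0xc0 byte is taken from the error message; the surrounding text
# is an illustrative stand-in for line 26 of the test source.
raw_line = b'char c[] = "\xc0";'

try:
    raw_line.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # 0xc0 can never start a valid UTF-8 sequence, so decoding fails.
    error = exc

print(error.reason)
```

The reported position differs from the 250 seen in GDB because this
sketch decodes a single line rather than the full 'list 26' output.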

I'm not, at this point, proposing a solution that tries to guess the
source file encoding, though I suppose such a thing could be done.
Instead, I think we should use host_charset(), as set by the user
with 'set host-charset ....', when creating the Python string.

To do this, in execute_gdb_command, we should switch from
PyUnicode_FromString, which requires the input be a UTF-8 string, to
using PyUnicode_Decode, which allows GDB to specify the string
encoding.  We will use host_charset().
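
In Python terms, the change is roughly the difference between decoding
as UTF-8 and decoding with a named charset.  This is only an
illustration of the idea, not GDB's C++ code: to_python_string is a
hypothetical helper, and its host_charset argument stands in for the
value that GDB's host_charset() returns.

```python
def to_python_string(raw: bytes, host_charset: str) -> str:
    """Decode command output using an explicitly named charset.

    Hypothetical helper: it mirrors the idea of passing an encoding
    name to PyUnicode_Decode instead of assuming UTF-8.
    """
    return raw.decode(host_charset)


# Illustrative stand-in for line 26 of the test source.
raw_line = b'char c[] = "\xc0";'

# With ISO-8859-1 every byte maps to a code point, so 0xc0 becomes U+00C0.
decoded = to_python_string(raw_line, "ISO-8859-1")
```

Decoding the same bytes with "utf-8" instead would raise
UnicodeDecodeError, which is the behaviour the commit moves away from.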

With this done, it is now possible to list the file contents using
gdb.execute(), with the contents passing through a string:

  (gdb) set host-charset ISO-8859-1
  (gdb) python print(gdb.execute('list 26', to_string=True), end='')
  21	  int some_variable = 1234;
  22
  23	  /* The following line contains a character that is non-utf-8.  This is a
  24	     critical part of the test as Python 3 can't convert this into a string
  25	     using its default mechanism.  */
  26	  char c[] = "À";		/* List this line.  */
  27
  28	  return 0;
  29	}
  (gdb)

There are already plenty of other places in GDB's Python code where we
use PyUnicode_Decode to create a string from something that might
contain user-generated content, so I believe this is the correct
approach.

2 files changed