What's wrong with this address: 28 St. Someone’s Close?
The single quote, that's what. It's not a single quote, which is ANSI 0x27, it's actually an ANSI 0x92. This is causing me problems. The address is supplied in a text box on a web page, processed in a StringBuilder to make an XML fragment, and passed via ADO.Net to a SQL Server stored procedure, which is attempting to use sp_xml_preparedocument to read it. However, we're getting an error thrown out of the stored proc:
The XML parse error 0xc00ce508 occurred on line number 1, near the XML text "
Now, working out the problem isn't exactly difficult: we're passing the value in as a text field, so it's (in the words of SQL's books on-line), "Variable-length non-Unicode data in the code page of the server". A change to a unicode field and we should be fine. I probably should make the observation here that there's a note in books on-line to the effect that they're planning to deprecate the text and ntext data types:
Note: ntext, text, and image data types will be removed in a future version of Microsoft SQL Server. Avoid using these data types in new development work, and plan to modify applications that currently use them. Use nvarchar(max), varchar(max), and varbinary(max) instead.
So, I copied the relevant code into SQL Management Studio's query window to have a play, and changed the data type to nvarchar(max), and the sp_xml_preparedocument executed correctly:
DECLARE @hDoc INTEGER, @RailSegmentRemarks nvarchar(max)
SELECT @RailSegmentRemarks = '<RailSegmentRemarks><Remark RemarkID="1" RemarkText="28 St. Someone’s Close, Sometown, Oxfordshire, " /><Remark RemarkID="2" RemarkText="OX12 3AB" /><Remark RemarkID="3" RemarkText="" /><Remark RemarkID="4" RemarkText="" /></RailSegmentRemarks>'
EXEC sp_xml_preparedocument @hDoc OUTPUT, @RailSegmentRemarks
DECLARE @Remarks TABLE (RemarkID VARCHAR(50), RemarkText VARCHAR(50))
INSERT INTO @Remarks
SELECT RemarkID, RemarkText
FROM OPENXML(@hDoc, N'RailSegmentRemarks/Remark')
WITH (RemarkID INTEGER '@RemarkID', RemarkText VARCHAR(50) '@RemarkText') AS XML
WHERE COALESCE(RemarkText, '') <> ''
SELECT * FROM @Remarks
EXEC sp_xml_removedocument @hDoc
Now, the odd thing here is that all the code works: I get a set of fields returned from the final SELECT statement which contain the single quote character. This seems odd to me because while I've declared the @RailSegmentRemarks variable as an nvarchar, I haven't yet changed the rest of the code, which is still just varchar. But everything seems to work. Very strange.
Anyway, all that's left (ha!) for me to do is to change all the relevant variables, table defs, stored procs, views, and everything else that may be affected, to nvarchar. or I could just prevent these characters from being entered in the first place, I suppose...
There are some articles by Michael Rys here and here, which discuss some of these issues.
Update: Everything works ok if I pass in the XML data prefixed with an XML declaration specifying the encoding. This is, obviously, a lot less work than changing everything to handle unicode properly, and is sufficient for our system:
<?xml version=""1.0"" encoding=""windows-1252""?>
I needed an encoding that would match against the Latin1_General-CI-AS that our database uses, and the Windows-1252 (Western Europe Latin character set) does this.