Difference between revisions of "Dev:fileConventions"

From wiki.railML.org

Revision as of 22:07, 16 August 2015

Conventions for railML files

Which file name extension shall a railML file get?

*.xml or *.railml, that is the question. The answer is: Both. It was not a convention from the beginning, but best practice should be:

  • *.railml shall only be used for files / exports from tools which are certified by the railML.org committee. As a registered trademark, railml® should serve as a kind of quality mark. Since certification has been mandatory from the release of railML 2.2 onwards, all tools in productive use shall use *.railml for files / exports.
  • *.xml is always allowed since railML files are XML files. It shall be used for non-certified or experimental files. Since the certification process started in 2013, all railML version 1.x files / exports must use this extension. railML version 2.0 or 2.1 files / exports shall use the *.xml extension if the programme was delivered before July 1st 2013 and has not been certified since.
  • *.railmlx: compressed railML files shall use this extension if they are compressed implicitly by certified software. Manually compressed files, of course, may have the extension *.zip.

Technical description

Compressed railML files

railML files tend to be large because they are text files, with all the inefficiencies this implies, so there is a demand to reduce their size. Compression of railML files was therefore suggested in 2012 (see railML's trac ticket #181). In August 2015 the standard for compressing railML files was settled as follows:

  1. railML files shall be compressed using the Deflate compression algorithm and packed into a ZIP file archive conforming to the .ZIP File Format Specification.
  2. The ability of Deflate and ZIP to compress and decompress ‘on the fly’ shall be fully supported.
  3. The railML (xml) file shall be the first file of the ZIP archive. Generally admitted railmlx files (neutral, with no predefined 'destination') should contain one railML file only - compression is the only aim of the ZIP archive.
  4. To be discussed in railML's misc forum: There may be railmlx files with more than one railML file packed. Please read and participate in the discussion at railML's forum!
  5. The original file name of the ZIP archive (railmlx file) shall be identical to the file name of the railML file in the ZIP archive - only the file extensions shall be different. The file extension of the packed railML file shall be *.xml or *.railml as stated above, the file extension of the ZIP archive shall always be *.railmlx.
  6. If the railML file uses non-railML namespaces with their schema files (XSD files) not available at an Internet URL, these schema files shall be packed into the ZIP file 'behind' the railML file. All files in the archive should validate without any further files other than:
  • railML XSD schema files
  • Dublin Core XSD schema files
  • (MathML XSD schema files)
If this approach is used, the attribute xsi:schemaLocation shall contain the XSD file name as packed into the ZIP archive, without any path or prefix - that is, without 'http://', 'urn:' or the like, and without any delimiter /. See also Defining namespaces and validating railML files and Namespace handling.
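
The packing rules above can be sketched with Python's standard zipfile module. This is a minimal illustration, not a reference implementation: the function names pack_railmlx and railmlx_name and the file name "timetable.railml" are assumptions made for the example, and the sketch relies on zipfile writing the Central Directory entries in the same order as the files were added.

```python
import io
import zipfile


def pack_railmlx(railml_name: str, railml_bytes: bytes, extra_schemas=None) -> bytes:
    """Return the bytes of a railmlx archive holding one railML file.

    railml_name   -- file name with extension .xml or .railml (hypothetical)
    extra_schemas -- optional {file name: bytes} of local XSD files (rule 6)
    """
    buffer = io.BytesIO()
    # Rule 1: Deflate compression inside a ZIP archive.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Rule 3: the railML file is added first.
        archive.writestr(railml_name, railml_bytes)
        # Rule 6: local schema files follow 'behind' the railML file,
        # stored under their plain file names without any path.
        for xsd_name, xsd_bytes in (extra_schemas or {}).items():
            archive.writestr(xsd_name, xsd_bytes)
    return buffer.getvalue()


def railmlx_name(railml_name: str) -> str:
    """Rule 5: same base name, only the extension becomes .railmlx."""
    base, _dot, _ext = railml_name.rpartition(".")
    return (base or railml_name) + ".railmlx"
```

For example, pack_railmlx("timetable.railml", data) would be saved under the name returned by railmlx_name("timetable.railml").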

Please note the following further remarks:

  • The file names (and comments) of the files in a ZIP archive are always encoded 8 bit, either Single Byte Character Set using IBM Code Page 437 or Multiple Byte Character Set using UTF-8. To distinguish between both, bit 11 of General Purpose Bit Flag of the file’s header is used: If this bit is set (=1), the file name (and comment) must be UTF-8 encoded. UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM). If General Purpose Bit 11 is unset (=0), the file name (and comment) must be encoded with code page 437. Note that no other code page is valid - independently from the system code page of the corresponding operating system or regional settings! That’s why UTF-8 is recommended for railmlx files.
  • For the railML file to be the first in the ZIP archive (rule 3), it is necessary to be the first Central File Header in the Central Directory which is at the end of the ZIP archive. It does not need to be the first file entry in the data stream itself - that is, its value Relative offset of local file header does not need to be zero. (In a ZIP file, the Central Directory at the end of the file is read first.)
  • The railmlx file must start with a Local File Header Signature (4 Bytes h50 h4B h03 h04) - no arbitrary data before the first Local File Header is allowed. (An End of Central Directory Signature h50 h4B h05 h06 would also be possible in case of an empty ZIP file but does not make sense here.) This is to allow detection whether a railML file is compressed or not - no valid uncompressed railML file can start with these bytes (see also below).
  • railmlx files must not use a ZIP File Comment in the End of Central Directory Record (EOCD) nor contain any arbitrary data after the EOCD. To correctly read a ZIP file from its end, the reading software has to find the start of the EOCD first. With a ZIP File Comment, this could only be done by scanning for the EOCD Signature backwards from the end. However, nothing guarantees that the ZIP File Comment itself does not contain an EOCD Signature. This seems to be a weakness of the ZIP file format. To provide a comment in a railmlx file, use the File Comment of a Central Directory File Header instead of a ZIP File Comment.
  • railmlx files must not be spread over several ‘disks’ - the values Number of this disk and Disk where central directory starts of EOCD always have to be zero.
  • To fully allow compressing and uncompressing ‘on the fly’, the values CRC32 and CompressedSize of the Local File Headers in the ZIP archive do not need to be used. Reading (uncompressing) software shall ignore these values (since they may be unknown to the writing software when it writes the Local File Headers on the fly). However, the corresponding values CRC32 and CompressedSize of the Central Directory File Headers must be set correctly.
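
Several of the remarks above can be checked directly on the bytes of a file. The following is a rough sketch under stated assumptions: the function name check_railmlx and the wording of its messages are made up for this example, ZIP64 archives are not considered, and only the first Central Directory entry is inspected.

```python
import struct

EOCD = struct.Struct("<4s4H2LH")     # End of Central Directory Record, 22 bytes
CDFH = struct.Struct("<4s6H3L5H2L")  # Central Directory File Header, 46 bytes


def check_railmlx(data: bytes) -> list:
    """Return a list of rule violations found in the bytes of a railmlx file."""
    problems = []
    if data[:4] != b"PK\x03\x04":
        problems.append("does not start with a Local File Header Signature")
    if len(data) < EOCD.size:
        return problems + ["too short to contain an EOCD"]
    # Without ZIP File Comment and trailing data, the EOCD is exactly
    # the last 22 bytes, so no backward scan is needed.
    (sig, disk, cd_disk, _here, total,
     _cd_size, cd_offset, comment_len) = EOCD.unpack(data[-EOCD.size:])
    if sig != b"PK\x05\x06":
        return problems + ["no EOCD in the last 22 bytes (comment or trailing data?)"]
    if comment_len != 0:
        problems.append("ZIP File Comment length is not zero")
    if disk != 0 or cd_disk != 0:
        problems.append("archive is spread over several disks")
    if total == 0 or cd_offset + CDFH.size > len(data):
        return problems + ["no readable Central Directory entry"]
    fields = CDFH.unpack_from(data, cd_offset)
    if fields[0] != b"PK\x01\x02":
        return problems + ["Central Directory does not start with a file header"]
    flags, method, name_len = fields[3], fields[4], fields[10]
    start = cd_offset + CDFH.size
    # Bit 11 of the General Purpose Bit Flag selects UTF-8, otherwise code page 437.
    codec = "utf-8" if flags & 0x0800 else "cp437"
    name = data[start:start + name_len].decode(codec, errors="replace")
    if not name.lower().endswith((".xml", ".railml")):
        problems.append("first Central Directory entry is not a railML file")
    if method != 8:  # 8 = Deflate
        problems.append("first entry is not Deflate-compressed")
    return problems
```

An empty result only means that none of these particular checks failed; it is not a full validation of the archive.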

Encoding of railML files

RailML files shall be encoded UTF-8. This means, the attribute encoding of the XML declaration shall be set to encoding='UTF-8':

<?xml version="1.0" encoding="UTF-8" ?>

Please note that this only defines the encoding of the XML (railML) file. It need not be identical to the original encoding of the data (contents) of the file. The term data here (concerning railML files) often applies to proper names of stations and such.

If the data does not already originate from UTF-8, it has to be re-coded into UTF-8 to be stored in the railML file. In most cases nowadays, data will come from UTF-16 or any other Unicode format which can be re-coded into UTF-8 and vice-versa without loss.

Any 8-bit code page (Single Byte Character Set, SBCS) can also be re-coded into UTF-8, but this is where the only potential problem arises: If data from a railML file has to be stored in an SBCS, this may not be possible without loss as long as the original code page is not known. Therefore, it is recommended to note the origin of the data whenever possible using the field <dc:language>.

For most proper names of stations, software will not know the language of the station name (since station names normally are international and do not belong to a certain language). To state the character set anyway - without a language - use the element <dc:language> as follows, with the language sub-tag set to und for undefined but the script sub-tag set to the original character set:

<dc:language>und-Latn</dc:language> for Latin character set
<dc:language>und-Grek</dc:language> for Greek character set
<dc:language>und-Cyrl</dc:language> for Cyrillic character set
<dc:language>und-Arab</dc:language> for Arabic character set
<dc:language>und-Hebr</dc:language> for Hebrew character set
<dc:language>und-Jpan</dc:language> for Japanese character set

and so on.
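
The re-coding problem described above can be illustrated in a few lines of Python. This is only a sketch: the helper names can_store_in and dc_language and the sample station name are assumptions made for this example, and the code page names are the ones used by Python's codec registry.

```python
def can_store_in(text: str, code_page: str) -> bool:
    """Tell whether text can be re-coded into the given code page without loss."""
    try:
        text.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


def dc_language(script: str) -> str:
    """Build a <dc:language> element with undefined language and a given script."""
    return "<dc:language>und-" + script + "</dc:language>"


station = "Αθήνα"  # sample station name in Greek script

# UTF-8 and back is always lossless.
assert station.encode("utf-8").decode("utf-8") == station

# Storing the name in an SBCS only works if the code page covers the script.
fits_greek = can_store_in(station, "iso8859_7")  # Greek code page
fits_latin = can_store_in(station, "cp1252")     # Western European code page

hint = dc_language("Grek")
```

Here the script sub-tag Grek tells a reading software that a Greek code page is a suitable target, which it could not derive from the UTF-8 bytes alone without inspecting every character.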

XML declaration

Since encoding='UTF-8' is the default value for encoding, and the XML declaration itself is optional, too, one might think that it is OK to skip the entire XML declaration. This would mean a valid railML file starting with <railml version… rather than <?xml version='1.0' encoding='UTF-8' ?>.

It is strongly recommended, and best practice, not to omit the XML declaration. Always remember that file names and their extensions may be altered, so it should be easy to recognise a file as being in XML format. Also, no arbitrary data is allowed before the XML declaration (except BOMs, see below).

(Please note that this article applies to railML files. It may be OK to omit the XML declaration if railML is encapsulated by, or embedded in, a higher protocol level and thus is part of a data stream which is not a file. Apart from such special use cases, railML is a file format for general data exchange. In this general sense - with no precise data source and destination - the XML declaration is not to be omitted.)

However, you can omit the attribute encoding if it is UTF-8 (but not if the file is from any other code page). If you declare encoding='UTF-8', please note that UTF-8 is to be written in upper-case.

Byte Order Mark

A railML file may start with a Byte Order Mark (BOM) but since no special byte order applies to UTF-8, it is not necessary and therefore not recommended.

So, a valid railML file may start with the following byte sequences:

  • <?xml - no BOM, valid XML declaration: the attribute encoding shall be interpreted; if no encoding is given, the file must be encoded UTF-8
  • hEF hBB hBF - BOM for UTF-8: encoding must be set to UTF-8 or omitted
  • h50 h4B h03 h04 - compressed railML file: decompress and unpack the file and go on reading BOM and/or encoding of the decompressed data

And possibly:

  • <railml - no BOM, no XML declaration: file must be encoded UTF-8. Since this is against best practice, you may refuse such files.

The following byte sequences may be valid XML files but unsupported by railML:

  • h00 h00 hFE hFF - UCS-4, big-endian (1234 order)
  • hFF hFE - UCS-4, little-endian (4321 order) or UTF-16, little-endian
  • h00 h00 hFF hFE - UCS-4, unusual octet order (2143)
  • hFE hFF - UCS-4, unusual octet order (3412) or UTF-16, big-endian
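
The byte sequences listed above lend themselves to a simple detection routine. The sketch below is one possible way to do it, not a prescribed algorithm; the function name classify_start and the returned labels are assumptions made for this example.

```python
def classify_start(head: bytes) -> str:
    """Classify a file by its first bytes (at least the first 7 should be passed)."""
    if head.startswith(b"PK\x03\x04"):
        return "compressed"        # railmlx: unpack, then classify the content
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf8-bom"          # encoding must be UTF-8 or omitted
    if head.startswith(b"<?xml"):
        return "xml-declaration"   # interpret the encoding attribute
    if head.startswith(b"<railml"):
        return "no-declaration"    # UTF-8 assumed; may be refused
    unsupported = (
        b"\x00\x00\xfe\xff",       # UCS-4, big-endian
        b"\x00\x00\xff\xfe",       # UCS-4, unusual octet order (2143)
        b"\xff\xfe",               # UCS-4 or UTF-16, little-endian
        b"\xfe\xff",               # UCS-4 (3412) or UTF-16, big-endian
    )
    if head.startswith(unsupported):
        return "unsupported"
    return "unknown"
```

A reader would typically call this on the first few bytes of a file and, for "compressed", call it again on the start of the decompressed railML data.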

Further reading

See also: XML Syntax issues (Attribute delimiters, Character references)