Difference between revisions of "Dev:fileConventions"
Revision as of 21:51, 13 August 2015
Conventions for railML files
Which file name extension shall a railML file get?
*.xml or *.railml, that is the question. The answer is: Both. It was not a convention from the beginning, but best practice should be:
- *.railml shall only be used for certified files / exports. Since railML is a registered trademark, the extension shall serve as a kind of quality mark.
- *.xml is always allowed since railML files are XML files. It shall be used for older (non-certified) or experimental files. Since the certification process only started with railML version 2.2, no file of an older version should have the extension *.railml; such files should always use *.xml.
- Compressed railML files shall have the extension *.railmlx if compressed implicitly by a certified software. Manually compressed files, of course, may have the extension *.zip.
Compressed railML files
railML files tend to be large, since they are text files with all their inefficiencies, so there is a demand to reduce their size. A way of compressing railML files was therefore suggested in 2012. At the time of writing, the convention for compressing railML files has settled on the following:
- railML files shall be compressed using the Deflate compression algorithm and packed into a ZIP archive conforming to the .ZIP File Format Specification.
- The advantage of Deflate and ZIP to allow compression and decompression ‘on the fly’ shall fully be supported.
- The railML (xml) file shall be the first file of the ZIP archive. Generally admitted railmlx files (neutral, with no predefined 'destination') shall contain one railML file only; compression is the only purpose of the ZIP archive. (There may be railmlx files with more than one railML file packed in for special use cases, when both sides have agreed on this. But with no special use case in mind, a reader can expect a railmlx file to contain exactly one railML file.)
- The original file name of the ZIP archive (railmlx file) shall be identical to the file name of the railML file in the ZIP archive - only the file extensions shall be different. The file extension of the packed railML file shall be *.xml or *.railml as stated above, the file extension of the ZIP archive shall always be *.railmlx.
- If the railML file uses non-railML namespaces with their schema files (XSD files) not available at an Internet URL, these schema files shall be packed into the ZIP file 'behind' the railML file. All files in the archive should validate without any further files other than:
  - railML XSD schema files
  - Dublin Core XSD schema files
  - (MathML XSD schema files)
- If schema files are packed this way, the attribute xsi:schemaLocation shall contain the XSD file name as packed into the ZIP archive without any path or prefix, that is, without 'http://', 'urn:' or the like, and without any / delimiter. See also Defining namespaces and validating railML files and Namespace handling.
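The packing rules above can be sketched with Python's standard zipfile module. This is only a minimal illustration, not part of any railML tooling: the function name write_railmlx is hypothetical, and the sketch packs a single railML file without any additional schema files.

```python
import zipfile
from pathlib import Path


def write_railmlx(railml_path: Path) -> Path:
    """Pack one railML file into a .railmlx archive next to it (sketch)."""
    if railml_path.suffix.lower() not in (".xml", ".railml"):
        raise ValueError("expected a *.xml or *.railml file")
    # Same base name as the packed file, only the extension differs.
    archive_path = railml_path.with_suffix(".railmlx")
    # ZIP_DEFLATED selects the Deflate compression algorithm.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store the file name without any directory part; the railML file is
        # the first (and here the only) entry of the archive.
        archive.write(railml_path, arcname=railml_path.name)
    return archive_path
```

Calling write_railmlx(Path("timetable.xml")) would produce timetable.railmlx in the same directory, containing timetable.xml as its only entry.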
Please note the following further remarks:
- The file names (and comments) of the files in a ZIP archive are always encoded 8 bit, either Single Byte Character Set using IBM Code Page 437 or Multiple Byte Character Set using UTF-8. To distinguish between both, bit 11 of General Purpose Bit Flag of the file’s header is used: If this bit is set (=1), the file name (and comment) must be UTF-8 encoded. UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM). If General Purpose Bit 11 is unset (=0), the file name (and comment) must be encoded with code page 437. Note that no other code page is valid - independently from the system code page of the corresponding operating system or regional settings! That’s why UTF-8 is recommended for railmlx files.
- For the railML file to be the first in the ZIP archive (rule 3), it is necessary to be the first Central File Header in the Central Directory which is at the end of the ZIP archive. It does not need to be the first file entry in the data stream itself - that is, its value Relative offset of local file header does not need to be zero. (In a ZIP file, the Central Directory at the end of the file is read first.)
- The railmlx file must start with a Local File Header Signature (4 Bytes h50 h4B h03 h04) - no arbitrary data before the first Local File Header is allowed. (An End of Central Directory Signature h50 h4B h05 h06 would also be possible in case of an empty ZIP file but does not make sense here.) This is to allow detection whether a railML file is compressed or not - no valid uncompressed railML file can start with these bytes (see also below).
- railmlx files must not use a ZIP File Comment in the End of Central Directory Record (EOCD) nor any arbitrary data behind the EOCD. To correctly read a ZIP file from its end, the reading software has to find the start of the EOCD first. With a ZIP File Comment, this could only be done by scanning backwards from the end for the EOCD Signature. But strangely, there is no guarantee that the ZIP File Comment itself does not contain an EOCD Signature. This seems to be a weakness of the ZIP file format. To provide a comment in a railmlx file, use the File Comment of a Central Directory File Header instead of a ZIP File Comment.
- railmlx files must not be spread over several ‘disks’ - the values Number of this disk and Disk where central directory starts of EOCD always have to be zero.
- To fully allow compressing and uncompressing ‘on the fly’, the values CRC32 and CompressedSize of the Local File Headers in the ZIP archive do not need to be used. A reading (uncompressing) software shall ignore these values (since they may be unknown to the writing software when writing the Local File Headers on the fly). However, the corresponding values CRC32 and CompressedSize of the Central Directory File Headers must be correctly set.
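Several of the remarks above can be checked mechanically. The following Python sketch is one possible way to do so, under stated assumptions: the function name check_railmlx and its messages are hypothetical, the whole file is read into memory, and ZIP64 archives are not considered. It relies on the fact that, without a ZIP File Comment or trailing data, the EOCD record is exactly the last 22 bytes of the file.

```python
import struct
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"  # h50 h4B h03 h04
EOCD_SIG = b"PK\x05\x06"          # h50 h4B h05 h06
EOCD_FORMAT = "<4s4H2LH"          # fixed-size part of the EOCD record (22 bytes)


def check_railmlx(path):
    """Return a list of rule violations found in the file (empty if none)."""
    with open(path, "rb") as f:
        data = f.read()
    problems = []
    if not data.startswith(LOCAL_HEADER_SIG):
        problems.append("file does not start with a Local File Header Signature")
    eocd_size = struct.calcsize(EOCD_FORMAT)
    if len(data) < eocd_size:
        problems.append("file is too short to hold an EOCD record")
        return problems
    # Interpret the last 22 bytes as the EOCD record; no backward scan needed.
    fields = struct.unpack(EOCD_FORMAT, data[-eocd_size:])
    signature, this_disk, cd_start_disk = fields[0], fields[1], fields[2]
    comment_length = fields[7]
    if signature != EOCD_SIG or comment_length != 0:
        problems.append("ZIP File Comment or data behind the EOCD record")
        return problems
    if this_disk != 0 or cd_start_disk != 0:
        problems.append("archive is spread over several disks")
    try:
        with zipfile.ZipFile(path) as archive:
            entries = archive.infolist()  # listed in Central Directory order
    except zipfile.BadZipFile:
        problems.append("not a readable ZIP archive")
        return problems
    if not entries or not entries[0].filename.lower().endswith((".xml", ".railml")):
        problems.append("first Central Directory entry is not a railML file")
    return problems
```

An empty result only means that none of these particular checks failed; it does not validate the railML content itself.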
Encoding of railML files
railML files shall be encoded in UTF-8. This means that the encoding attribute of the XML declaration shall be set as follows:
<?xml version="1.0" encoding="UTF-8" ?>
Please note that this only defines the encoding of the XML (railML) file. It is not necessarily identical to the original encoding of the data (contents) of the file. The term data here (concerning railML files) often applies to proper names of stations and such.
If the data is not already encoded in UTF-8 (by coincidence?), it has to be re-coded into UTF-8 to be stored in the railML file. In most cases nowadays, data will come from UTF-16 or another Unicode format, which can be re-coded into UTF-8 and vice versa without loss.
Also, any 8 bit code page (Single Byte Character Set, SBCS) can be re-coded into UTF-8, but here comes the only problem which may happen: If data from a railML file has to be stored in an SBCS, this may not be possible without loss as long as the original code page is not known. Therefore, it is recommended to note the origin of the data whenever possible using the field dc:language.
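The following Python snippet illustrates the point with an arbitrarily chosen sample station name (an assumption for illustration only): Unicode formats convert into each other without loss, whereas a single-byte code page only works if it is the right one.

```python
name = "Łódź Fabryczna"  # sample station name, chosen for illustration

# Unicode encodings can be converted into each other without loss.
utf8_bytes = name.encode("utf-8")
assert utf8_bytes.decode("utf-8").encode("utf-16").decode("utf-16") == name

# A single-byte code page works only if it contains every character used.
latin2_bytes = name.encode("iso-8859-2")  # succeeds: Latin-2 covers these letters
try:
    name.encode("iso-8859-1")             # Latin-1 has no 'Ł' and no 'ź'
    lossless_in_latin1 = True
except UnicodeEncodeError:
    lossless_in_latin1 = False

# Reading the bytes back with the wrong code page silently changes the text.
misread = latin2_bytes.decode("iso-8859-1")
```

Here lossless_in_latin1 ends up False, and misread differs from name, which is why knowing the original code page (or at least the script) matters.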
In most cases of proper names of stations, a software will not know the language of station names (since station names normally are international and do not belong to a certain language). To tell the character set anyway - without a language - use the element <dc:language> as follows, with the language sub-tag set to und for undefined but the script sub-tag set to the original character set:
<dc:language>und-Latn</dc:language> for Latin character set
<dc:language>und-Grek</dc:language> for Greek character set
<dc:language>und-Cyrl</dc:language> for Cyrillic character set
<dc:language>und-Arab</dc:language> for Arabic character set
<dc:language>und-Hebr</dc:language> for Hebrew character set
<dc:language>und-Japn</dc:language> for Japanese character set
and so on.
Since encoding='UTF-8' is the default value for encoding, and the XML declaration itself is optional too, one might think that it is acceptable to skip the entire XML declaration. This would mean a valid railML file starting directly with its root element instead of with
<?xml version='1.0' encoding='UTF-8' ?>.
It is strongly recommended, and best practice, not to omit the XML declaration. Always remember that file names and their extensions may be altered, so it should be easy to verify that a file is in XML format. Also, no arbitrary data is allowed before the XML declaration (except BOMs, see below).
(Please note that this article applies to railML files. It may be ok to omit the XML declaration if railML is encapsulated by, or embedded in, a higher protocol level and thus is part of a data stream which is not a file. Apart from such special use cases, railML is a file format for general data exchange. In this general meaning - with no precise data source and destination - the XML declaration is not to be omitted.)
However, you can omit the attribute encoding if the file is encoded in UTF-8 (but not if the file uses any other encoding). If you declare encoding='UTF-8', please note that UTF-8 is to be written in upper-case.
A railML file may start with a Byte Order Mark (BOM) but since no special byte order applies to UTF-8, it is not necessary and therefore not recommended.
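As a minimal sketch of these recommendations, Python's standard xml.etree.ElementTree module can write UTF-8 output with an explicit XML declaration and without a BOM. The function name serialize_railml is hypothetical, and the root element used here is a bare placeholder without namespace or content, not a valid railML document. The xml_declaration argument of ET.tostring requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

UTF8_BOM = b"\xef\xbb\xbf"  # hEF hBB hBF


def serialize_railml(root: ET.Element) -> bytes:
    """Serialize an element tree as UTF-8 with an XML declaration, no BOM."""
    # xml_declaration=True forces the declaration although UTF-8 is the default.
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


root = ET.Element("railml")  # placeholder root element, for illustration only
data = serialize_railml(root)
```

The resulting bytes start with the XML declaration naming UTF-8 in upper-case and contain no Byte Order Mark.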
So, a valid railML file may start with the following byte sequences:
- <?xml - no BOM, valid XML declaration: the encoding attribute shall be interpreted; if no encoding is given, the file must be encoded in UTF-8
- hEF hBB hBF - BOM for UTF-8: encoding must be set to UTF-8 or omitted
- h50 h4B h03 h04 - compressed railML file: decompress and unpack the file and go on reading BOM and/or encoding of the decompressed data
- <railML - no BOM, no XML declaration: file must be encoded UTF-8. Since this is against best practice, you may refuse such files.
The following byte sequences may start valid XML files but are not supported by railML:
- h00 h00 hFE hFF - UCS-4, big-endian (1234 order)
- hFF hFE - UCS-4, little-endian (4321 order) or UTF-16, little-endian
- h00 h00 hFF hFE - UCS-4, unusual octet order (2143)
- hFE hFF - UCS-4, unusual octet order (3412) or UTF-16, big-endian
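The byte sequences listed above lend themselves to a simple check of the first bytes of a file. The following Python sketch shows one way this could look; the function name classify_start and the returned labels are hypothetical, and the root element name is compared case-insensitively as a simplification.

```python
UTF8_BOM = b"\xef\xbb\xbf"        # hEF hBB hBF
ZIP_LOCAL_HEADER = b"PK\x03\x04"  # h50 h4B h03 h04
UNSUPPORTED_PREFIXES = (
    b"\x00\x00\xfe\xff",  # UCS-4, big-endian (1234 order)
    b"\x00\x00\xff\xfe",  # UCS-4, unusual octet order (2143)
    b"\xff\xfe",          # UCS-4 little-endian or UTF-16 little-endian
    b"\xfe\xff",          # UCS-4 unusual octet order (3412) or UTF-16 big-endian
)


def classify_start(head: bytes) -> str:
    """Classify a file by its first bytes (pass at least the first 7 bytes)."""
    if head.startswith(ZIP_LOCAL_HEADER):
        return "compressed"            # railmlx: unpack, then classify again
    if head.startswith(UTF8_BOM):
        return "utf8-bom"              # encoding must be UTF-8 or omitted
    if head.startswith(b"<?xml"):
        return "xml-declaration"       # interpret the encoding attribute
    if head.startswith(UNSUPPORTED_PREFIXES):
        return "unsupported-encoding"  # UCS-4 / UTF-16, not supported
    if head[:7].lower() == b"<railml":
        return "no-declaration"        # against best practice, may be refused
    return "unknown"
```

For a file on disk, the first few bytes can be read in binary mode and passed to classify_start; a result of "compressed" means the content has to be unpacked and its own start classified in turn.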