But GFE needs to display information about the currently selected game (from a collection of 10,000 items).
Such data exists and is usually available as (big) XML files... They're generated by Mame.exe, or available from the Net (google 'dat file emulator').
Hence the idea of parsing some 40 megs of XML data on the fly (at each run), and indexing/mapping the document to avoid keeping the whole file in memory.
After two weeks studying the topic, it's time for conclusions:
- Microsoft's XML pull parser is fast. It's written in .NET, but *correctly* written, and it's generally a good tool for your XML needs. It's a pull system, easier to manipulate than SAX, and probably faster (look at XmlTextReader for reference). It takes 750 ms to walk 40 megs of data (8,800 records) on my computer.
- Unfortunately, you can't bookmark the interesting parts of your document with it. We could rely on the line/character pair this parser returns, but then another text parser would have to extract text blocks from those line/character pairs. That may be tricky to implement, and we'd lose some horsepower in the process.
- Note that parsing XML files is easy as long as we don't want document validation or XPath support: the state machine only has to handle 10-12 token types, and C# silently handles the various character encodings.
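For reference, the XmlTextReader pull loop looks roughly like this. This is a minimal sketch (the file name and the <game> element name follow the MAME .dat convention, but are assumptions here), counting records in a pull style: we ask the reader for the next token in a loop, instead of receiving callbacks as with SAX.

```csharp
using System.Xml;

static class GameCounter
{
    // Count <game> elements in a MAME-style .dat file using the pull API.
    // Each Read() call advances the reader to the next token; we only
    // react to element-start tokens, ignoring text, comments, etc.
    public static int CountGames(string path)
    {
        int games = 0;
        using (XmlTextReader reader = new XmlTextReader(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "game")
                    games++;
            }
        }
        return games;
    }
}
```

Note how the pull style keeps the traversal logic in ordinary control flow, which is exactly what makes it easier to manipulate than SAX callbacks.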
Some results after a dozen hours of coding a custom pull-parsing implementation:
- It's now as fast as MS XmlTextReader. The first version wasn't :) Too many function calls hit C# performance very hard (8 times slower than the original!); the solution was to pack the whole state machine into a single function. Now it easily indexes 50 megs of XML per second.
- It can report the file stream position of each element's start and end: during an initialization pass, an indexer stores each record's key and location values in a dictionary, for fast access later.
- When GFE needs a particular record's info, the dictionary is queried for the location of the fragment, which is read from the huge file and loaded into memory to be thoroughly parsed. Apparently it's fast enough for 100 random requests per second. That's 100 times more than enough :)
- There is no 4.
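The index-then-seek scheme described above can be sketched as follows. All names here are hypothetical (the real parser records the offsets while tokenizing; this sketch just shows the dictionary and the random-access read):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Sketch of the offset index: an initialization pass records the byte
// range of each record, and later lookups seek straight to that range
// instead of re-parsing the whole file.
class FragmentIndex
{
    private readonly Dictionary<string, long[]> _offsets = new Dictionary<string, long[]>();
    private readonly string _path;

    public FragmentIndex(string path) { _path = path; }

    // Called by the indexing pass for each record: key plus [start, end) byte offsets.
    public void Add(string key, long start, long end)
    {
        _offsets[key] = new long[] { start, end };
    }

    // Random access: seek to the stored range and read just that fragment,
    // which can then be handed to the parser for a thorough pass.
    public string ReadFragment(string key)
    {
        long[] range = _offsets[key];
        using (FileStream fs = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(range[0], SeekOrigin.Begin);
            byte[] buf = new byte[range[1] - range[0]];
            int read = 0;
            while (read < buf.Length)
                read += fs.Read(buf, read, buf.Length - read);
            return Encoding.UTF8.GetString(buf);
        }
    }
}
```

Each lookup costs one seek plus a small read, which is why a hundred random requests per second is no problem even on a 40-meg file.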
But it definitely looks like an achievable way to go if you feel somewhat embarrassed by the existing implementations.
Below is the source code of the main state machine function; contact me for more implementation details.
public TokenType Next(TokenType notifs)
{
_insideElement:
    if (_currentType < TokenType.END_ELEMENT)
    {
        for (; ; )
        {
            // attribute
            switch (_ioBuffer[_bufferIndex])
            {
                case ' ':
                case '\t':
                case '\r':
                case '\n':
                    _bufferIndex++;
                    continue;
                case '>': // end of start element
                    _bufferIndex++;
                    _depth++;
                    goto _beyondElement;
                case '/': // empty element
                    _bufferIndex++;
                    if (_bufferIndex + 1 > _dataLen) PrefetchIO(1);
                    if (_ioBuffer[_bufferIndex] == '>')
                    {
                        _bufferIndex++;
                        _tokenEndOffset = _bytesParsed + _bufferIndex;
                        // EMPTY ELEMENT! it starts and ends on the same token
                        _currentType = TokenType.END_ELEMENT;
                        if ((notifs & TokenType.END_ELEMENT) != 0) return _currentType;
                        goto _beyondElement;
                    }
                    return DoExpected(">");
                case '\0':
                    if (!FillIOBuffer()) return TokenType.END_OF_STREAM;
                    continue;
                default:
                    _currentType = ProcessAttribute((notifs & TokenType.ATTRIBUTE) != 0);
                    if ((notifs & TokenType.ATTRIBUTE) != 0) return _currentType;
                    continue;
            }
        }
    }

_beyondElement: // data, or comment, or pi, cdata
    for (; ; )
    {
    _start:
        switch (_ioBuffer[_bufferIndex])
        {
            case ' ':
            case '\t':
            case '\r':
            case '\n':
                ++_bufferIndex;
                continue;
            case '\0':
                if (!FillIOBuffer()) return (_currentType = DoEndOfStream());
                continue;
            case '<':
                _tokenStartOffset = _bytesParsed + _bufferIndex;
                ++_bufferIndex;
                for (; ; )
                {
                    switch (_ioBuffer[_bufferIndex])
                    {
                        case '!': // comment, cdata, doctype
                            _bufferIndex++;
                            if (_bufferIndex + 7 > _dataLen) PrefetchIO(7);
                            if ((_ioBuffer[_bufferIndex] == '-') && (_ioBuffer[_bufferIndex + 1] == '-'))
                            {
                                _bufferIndex += 2;
                                ProcessComment();
                                goto _start;
                            }
                            if ((_ioBuffer[_bufferIndex] == '[') && (_ioBuffer[_bufferIndex + 1] == 'C') &&
                                (_ioBuffer[_bufferIndex + 2] == 'D') && (_ioBuffer[_bufferIndex + 3] == 'A') &&
                                (_ioBuffer[_bufferIndex + 4] == 'T') && (_ioBuffer[_bufferIndex + 5] == 'A') &&
                                (_ioBuffer[_bufferIndex + 6] == '['))
                            {
                                _bufferIndex += 7;
                                ProcessCData();
                                goto _start;
                            }
                            if ((_ioBuffer[_bufferIndex] == 'D') && (_ioBuffer[_bufferIndex + 1] == 'O') &&
                                (_ioBuffer[_bufferIndex + 2] == 'C') && (_ioBuffer[_bufferIndex + 3] == 'T') &&
                                (_ioBuffer[_bufferIndex + 4] == 'Y') && (_ioBuffer[_bufferIndex + 5] == 'P') &&
                                (_ioBuffer[_bufferIndex + 6] == 'E'))
                            {
                                _bufferIndex += 7;
                                ProcessDocType();
                                goto _start;
                            }
                            return DoUnexpected("-");
                        case '?': // pi
                            _bufferIndex++;
                            ProcessPI();
                            goto _start;
                        case '\0': // --- fill data
                            if (!FillIOBuffer()) return (_currentType = TokenType.END_OF_STREAM);
                            continue;
                        case '/': // end element
                            _bufferIndex++;
                            _currentType = ProcessEndElement((notifs & TokenType.END_ELEMENT) != 0);
                            if ((notifs & TokenType.END_ELEMENT) != 0) return _currentType;
                            goto _start;
                        default: // new element
                            // always parse in full, as the element name is extremely useful to get
                            _currentType = ProcessStartElement(true);
                            if ((notifs & TokenType.START_ELEMENT) != 0) return _currentType;
                            goto _insideElement;
                    }
                }
            default: // element data. (probably.)
                _currentType = ProcessData((notifs & TokenType.DATA) != 0);
                if ((notifs & TokenType.DATA) != 0) return _currentType;
                goto _start;
        }
    }
}