But GFE needs to display information about the currently selected game (from a collection of 10,000 items).
Such data exists and is usually available as (big) XML files... They're generated by Mame.exe, or available from the Net (google 'dat file emulator').
Hence the idea of parsing some 40 megs of XML data on the fly (at each run), and indexing/mapping the document to avoid keeping the whole file in memory.
After two weeks studying the topic, it's time for conclusions:
- Microsoft's XML pull parser is fast. It's written in .NET, but *correctly* written, and it's generally a good tool for your XML needs. It's a pull system, easier to manipulate than SAX, and probably faster (look at XmlTextReader for reference). It takes 750 ms to walk 40 megs of data (8,800 records) on my computer.
- Unfortunately, you can't bookmark the interesting parts of your document with it. We could rely on the line/character pair this parser returns, but then another text parser would have to extract text blocks from those line/character pairs. That may be tricky to implement, and we'd lose some horsepower in the process.
- Note that parsing XML files is easy as long as we don't want document validation or XPath support: the state machine only has to handle 10-12 token types, and C# silently handles the various character encodings.
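For reference, the XmlTextReader pull loop looks roughly like this. This is a minimal sketch (the file name and the <game> element name follow the MAME .dat convention, but are assumptions here), counting records in a pull style: we ask the reader for the next token in a loop, instead of receiving callbacks as with SAX.

```csharp
using System.Xml;

static class GameCounter
{
    // Count <game> elements in a MAME-style .dat file using the pull API.
    // Each Read() call advances the reader to the next token; we only
    // react to element-start tokens, ignoring text, comments, etc.
    public static int CountGames(string path)
    {
        int games = 0;
        using (XmlTextReader reader = new XmlTextReader(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "game")
                    games++;
            }
        }
        return games;
    }
}
```

Note how the pull style keeps the traversal logic in ordinary control flow, which is exactly what makes it easier to manipulate than SAX callbacks.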
Some results after a dozen hours of coding a custom pull-parsing implementation:
- It's now as fast as MS XmlTextReader. The first version wasn't :) Too many function calls hit C# performance very hard (8 times slower than the original!); the solution was to pack the whole state machine into a single function. Now it easily indexes 50 megs of XML per second.
- It can report the file stream position of each element's start and end: during an initialization pass, an indexer stores each record's key and location values in a dictionary, for fast access later.
- When GFE needs a particular record's info, the dictionary is queried for the location of the fragment, which is read from the huge file and loaded into memory to be thoroughly parsed. Apparently it's fast enough for 100 random requests per second. That's 100 times more than enough :)
- There is no 4.
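The index-then-seek scheme described above can be sketched as follows. All names here are hypothetical (the real parser records the offsets while tokenizing; this sketch just shows the dictionary and the random-access read):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Sketch of the offset index: an initialization pass records the byte
// range of each record, and later lookups seek straight to that range
// instead of re-parsing the whole file.
class FragmentIndex
{
    private readonly Dictionary<string, long[]> _offsets = new Dictionary<string, long[]>();
    private readonly string _path;

    public FragmentIndex(string path) { _path = path; }

    // Called by the indexing pass for each record: key plus [start, end) byte offsets.
    public void Add(string key, long start, long end)
    {
        _offsets[key] = new long[] { start, end };
    }

    // Random access: seek to the stored range and read just that fragment,
    // which can then be handed to the parser for a thorough pass.
    public string ReadFragment(string key)
    {
        long[] range = _offsets[key];
        using (FileStream fs = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(range[0], SeekOrigin.Begin);
            byte[] buf = new byte[range[1] - range[0]];
            int read = 0;
            while (read < buf.Length)
                read += fs.Read(buf, read, buf.Length - read);
            return Encoding.UTF8.GetString(buf);
        }
    }
}
```

Each lookup costs one seek plus a small read, which is why a hundred random requests per second is no problem even on a 40-meg file.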
But it definitely looks like an achievable way to go if you feel somewhat embarrassed by the existing implementations.
Below is the source code of the main state machine function; contact me for more implementation details.
public TokenType Next(TokenType notifs)
{
_insideElement:
    if (_currentType < TokenType.END_ELEMENT)
    {
        for (; ; )
        {
            // attribute
            switch (_ioBuffer[_bufferIndex])
            {
                case ' ':
                case '\t':
                case '\r':
                case '\n':
                    _bufferIndex++;
                    continue;
                case '>': // end of start element
                    _bufferIndex++;
                    _depth++;
                    goto _beyondElement;
                case '/': // empty element
                    _bufferIndex++;
                    if (_bufferIndex + 1 > _dataLen) PrefetchIO(1);
                    if (_ioBuffer[_bufferIndex] == '>')
                    {
                        _bufferIndex++;
                        _tokenEndOffset = _bytesParsed + _bufferIndex;
                        // EMPTY ELEMENT! it starts and ends on the same token
                        _currentType = TokenType.END_ELEMENT;
                        if ((notifs & TokenType.END_ELEMENT) != 0) return _currentType;
                        goto _beyondElement;
                    }
                    return DoExpected(">");
                case '\0':
                    if (!FillIOBuffer()) return TokenType.END_OF_STREAM;
                    continue;
                default:
                    _currentType = ProcessAttribute((notifs & TokenType.ATTRIBUTE) != 0);
                    if ((notifs & TokenType.ATTRIBUTE) != 0) return _currentType;
                    continue;
            }
        }
    }

_beyondElement: // data, or comment, or pi, cdata
    for (; ; )
    {
    _start:
        switch (_ioBuffer[_bufferIndex])
        {
            case ' ':
            case '\t':
            case '\r':
            case '\n':
                ++_bufferIndex;
                continue;
            case '\0':
                if (!FillIOBuffer()) return (_currentType = DoEndOfStream());
                continue;
            case '<':
                _tokenStartOffset = _bytesParsed + _bufferIndex;
                ++_bufferIndex;
                for (; ; )
                {
                    switch (_ioBuffer[_bufferIndex])
                    {
                        case '!': // comment, cdata, doctype
                            _bufferIndex++;
                            if (_bufferIndex + 7 > _dataLen) PrefetchIO(7);
                            if ((_ioBuffer[_bufferIndex] == '-') && (_ioBuffer[_bufferIndex + 1] == '-'))
                            {
                                _bufferIndex += 2;
                                ProcessComment();
                                goto _start;
                            }
                            if ((_ioBuffer[_bufferIndex] == '[') && (_ioBuffer[_bufferIndex + 1] == 'C') &&
                                (_ioBuffer[_bufferIndex + 2] == 'D') && (_ioBuffer[_bufferIndex + 3] == 'A') &&
                                (_ioBuffer[_bufferIndex + 4] == 'T') && (_ioBuffer[_bufferIndex + 5] == 'A') &&
                                (_ioBuffer[_bufferIndex + 6] == '['))
                            {
                                _bufferIndex += 7;
                                ProcessCData();
                                goto _start;
                            }
                            if ((_ioBuffer[_bufferIndex] == 'D') && (_ioBuffer[_bufferIndex + 1] == 'O') &&
                                (_ioBuffer[_bufferIndex + 2] == 'C') && (_ioBuffer[_bufferIndex + 3] == 'T') &&
                                (_ioBuffer[_bufferIndex + 4] == 'Y') && (_ioBuffer[_bufferIndex + 5] == 'P') &&
                                (_ioBuffer[_bufferIndex + 6] == 'E'))
                            {
                                _bufferIndex += 7;
                                ProcessDocType();
                                goto _start;
                            }
                            return DoUnexpected("-");
                        case '?': // pi
                            _bufferIndex++;
                            ProcessPI();
                            goto _start;
                        case '\0': // --- fill data
                            if (!FillIOBuffer()) return (_currentType = TokenType.END_OF_STREAM);
                            continue;
                        case '/': // end element
                            _bufferIndex++;
                            _currentType = ProcessEndElement((notifs & TokenType.END_ELEMENT) != 0);
                            if ((notifs & TokenType.END_ELEMENT) != 0) return _currentType;
                            goto _start;
                        default: // new element
                            // always parse in full, as the element name is extremely useful to get
                            _currentType = ProcessStartElement(true);
                            if ((notifs & TokenType.START_ELEMENT) != 0) return _currentType;
                            goto _insideElement;
                    }
                }
            default: // element data. (probably.)
                _currentType = ProcessData((notifs & TokenType.DATA) != 0);
                if ((notifs & TokenType.DATA) != 0) return _currentType;
                goto _start;
        }
    }
}