WoW-GPS Dev Blog

WoW-GPS 2.0 - Links?

In a previous post, I outlined some plans for how to fetch/store/manage/interact with raw data: http://stormspire.net/blogs/kathroma...api-first.html

Well, in my spare time (of which there admittedly hasn't been much lately) I've been mulling over some implementation strategies for this back-end data.

I previously outlined a plan for storing the raw auction data only once, with an index file in the seller or item path holding a reference to the auction ID. I figured this would be light on storage while maintaining a decent level of functionality for data retrieval. Unfortunately, there are 2 very important data components that won't interact well with the index system: removal of old data (i.e. beyond 14 days, etc.) and separation of active/current auctions from historic ones.

I had thought about again using a centralized index - something like active.json for an index of auctionIDs for current auctions, and maybe cleanup-{timestamp}.json for an index of old data to be removed. Cleanup would then only involve running through the list of cleanup files and comparing timestamps, rather than running through every single auction and checking its time. The problem, however, is that while this system would make removal/maintenance of the individual auction files simpler, the process of going back and updating every single seller and item index affected by the deletions would become extremely tedious - not to mention inefficient.
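To make that pain point concrete, here's a rough sketch of what the cleanup pass might have looked like (the paths and file names are hypothetical, just for illustration):

```php
<?php
// Hypothetical sketch of the cleanup pass described above.
$cutoff = time() - 14 * 24 * 60 * 60; // drop anything older than 14 days

foreach (glob('data/cleanup-*.json') as $cleanupFile) {
    // The scan timestamp lives in the file name, so we can compare it
    // without ever opening the individual auction files.
    sscanf(basename($cleanupFile), 'cleanup-%d.json', $scanTime);
    if ($scanTime >= $cutoff) {
        continue; // still within the retention window
    }

    $oldIds = json_decode(file_get_contents($cleanupFile), true);
    foreach ($oldIds as $id) {
        unlink("data/auctions/{$id}.json"); // the easy part
        // The tedious part: every seller/*.json and item/*.json index
        // that references $id would also have to be rewritten here.
    }
    unlink($cleanupFile);
}
```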

After battling with different strategies to try to make those indexes work, a new thought crossed my mind: symbolic links.

I must confess that my knowledge of symlinks wasn't all that deep, but I understood the concept, which was enough to convince me it was worth a closer look. So, I did some digging. I started by searching for information about the potential memory hit with symlinks, and ended up looking into both symlinks (or softlinks) AND hardlinks. There's a nice overview here, if you're interested: http://techthrob.com/2010/09/softlin...k-explanation/ - but I'll just summarize the differences and potential benefits as they relate to the WoW-GPS project.

Assuming a source file of auctionID.json:

A softlink would essentially create a new file containing redirect information that forces access through to the auctionID.json file itself. Softlinks have the advantage of being able to point across filesystems (and at directories), but one caveat worth noting is that if I ever needed to MOVE the original files around on the disk, the softlinks would dangle - it's actually hardlinks that survive a move within the same filesystem. Not that moving files is a use-case I've envisioned for this project anyway. Also, some of the research I did mentioned that the symlink files run about 4KB each (a full filesystem block), which is likely quite a bit more than what auctionID.json would be in the first place, and definitely more than the ~20 bytes the ID number would occupy in an index file - so definitely not as efficient, storage-wise. So, is the benefit of not having to crawl through each of those index files with every scan worth the extra disk space?

Instead of attempting to answer that question, my attention shifted over to hardlinks. A hardlink basically creates a second (or third, etc.) directory entry pointing at the same data on disk as the original file. That means slightly faster read performance, since access goes straight to the data rather than being redirected through the original path first. It also means that any change made through ANY of the hardlinked names is automatically visible through the rest, since there's only one copy of the data. I'm not sure that will be necessary, but it's nice to know either way.
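For illustration, here's how both approaches look in PHP, which has built-ins for each. The directory layout here is hypothetical:

```php
<?php
// Hypothetical layout: one "real" auction file, linked from the seller
// and item paths.
@mkdir('data/auctions', 0755, true);
@mkdir('data/sellers/Kathroman', 0755, true);
@mkdir('data/items/82800', 0755, true);

$auction = 'data/auctions/12345678.json';
file_put_contents($auction, '{"item":82800,"buyout":149999}');

// Softlink: a small file holding a redirect to the target path. Reads
// resolve through that path, and the link dangles if the path changes.
symlink('../../auctions/12345678.json', 'data/sellers/Kathroman/12345678.json');

// Hardlink: a second directory entry for the same data on disk - no
// redirect step, no extra copy, and every name sees writes immediately.
link($auction, 'data/items/82800/12345678.json');

// Deleting one name leaves the data reachable until the last hardlink goes.
unlink($auction);
var_dump(file_exists('data/items/82800/12345678.json')); // bool(true)
```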

To me, it really seems as though the hardlinks would be a better option in terms of storage, performance, and ease of access/maintenance, but I'm willing to hear arguments on all sides. Any dev/sysadmin types out there following along who have experience, opinions, or benchmarks for any of the proposed methods and would care to weigh in? Feedback is always appreciated.

Comments

  1. Kathroman
    Well, this was a colossal failure.

    The overhead for read/write operations seems to be insurmountable with such small amounts of data - it's too inefficient to proceed down this route, so we're back to the drawing board.

    I benchmarked 10,000 individual file writes against a single write of one file with 10,000x as much data, and the results were not good: roughly 20-30x longer for the many small writes.
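
    The comparison looked roughly like this (record size and paths are illustrative, not the actual test harness):

    ```php
    <?php
    // Rough shape of the benchmark: 10,000 small writes vs 1 big write.
    if (!is_dir('bench')) {
        mkdir('bench');
    }
    $record = str_repeat('x', 200); // stand-in for one auction's JSON

    $start = microtime(true);
    for ($i = 0; $i < 10000; $i++) {
        file_put_contents("bench/auction-{$i}.json", $record);
    }
    $small = microtime(true) - $start;

    $start = microtime(true);
    file_put_contents('bench/all.json', str_repeat($record, 10000));
    $big = microtime(true) - $start;

    printf("10,000 small writes: %.3fs | 1 big write: %.3fs (%.0fx slower)\n",
        $small, $big, $small / $big);
    ```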

    I guess my next test will involve indexing benchmarks within a LARGE dataset... wish me luck
  2. Kathroman
    @Ord - looks like sqlite IS an option with my shared host, so perhaps I'll have to give that another look. It certainly seems more viable than reading one HUGE file into memory for each realm and trying to search through indexes every time I go looking for some data...
  3. Ord
    I haven't worked with SQLite specifically but if that is an option it should be immensely more efficient than parsing raw data files. It looks like SQLite allows concurrent read access but only one write at a time. I don't think that will be a problem for this use/implementation but you also might not know what issues arise until you start building it.
  4. Kathroman
    Quote Originally Posted by Ord
    I haven't worked with SQLite specifically but if that is an option it should be immensely more efficient than parsing raw data files. It looks like SQLite allows concurrent read access but only one write at a time. I don't think that will be a problem for this use/implementation but you also might not know what issues arise until you start building it.
    True - I think the biggest issue will be with setup. Since I want this to be as efficient as possible, I'll have to compare the performance impact of, say:

    Having 1 DB per realm, with multiple tables (i.e. items, sellers, etc.) vs. having 1 DB per TABLE, per realm (i.e. a realm items DB, a realm sellers DB, etc.)

    My thinking here is that the tables themselves will become exceptionally large, and since the entire DB lives in a single file, it might not make sense to pull down ALL the seller records along with the items when I just need to grab some item data.

    What I'll probably do is combine this with my original approach: store the bulk of the data in the Auctions table, and include the auction ID in the other tables as a foreign key. Either that, or I'll store only the raw data, and no calculated data whatsoever.
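
    A rough sketch of what that hybrid schema might look like (table and column names are just placeholders):

    ```php
    <?php
    // Placeholder schema: raw data lives in one auctions table, and the
    // other tables reference it by auction ID only.
    $db = new PDO('sqlite:data/realm-alliance.sqlite');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $db->exec('CREATE TABLE IF NOT EXISTS auctions (
        id        INTEGER PRIMARY KEY,   -- auction ID from the API dump
        item      INTEGER NOT NULL,
        seller    TEXT    NOT NULL,
        buyout    INTEGER,
        last_seen INTEGER NOT NULL       -- timestamp of the latest scan
    )');

    // Seller lookups only carry the foreign key back to auctions.
    $db->exec('CREATE TABLE IF NOT EXISTS seller_auctions (
        seller     TEXT    NOT NULL,
        auction_id INTEGER NOT NULL REFERENCES auctions(id)
    )');
    $db->exec('CREATE INDEX IF NOT EXISTS idx_seller ON seller_auctions (seller)');
    ```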

    At the end of the day, this might not be the worst news - at least using an sqlite solution will encourage future portability, should the DB situation change down the road.
  5. Kathroman
    After looking into sqlite further, it seems like it might actually end up working better than expected. Some of my initial reservations were based on the fact that it doesn't seem to scale very well (making it more suitable for smaller DBs), and on the write limitations that @Ord mentioned.

    In the case of the WoW-GPS data, those issues might not be as bad as I'd originally thought.

    1) It appears I can easily create new sqlite DBs with PDO initialization, so I'll probably do 1 DB per faction, per realm, keeping each DB down to about 100MB or so. I expect this will drastically help with performance. Apparently, the "large" that people are referring to when they talk about sqlite performance issues is closer to 50-100GB, so we should be fine.
    2) It looks like I can make use of ATTACH DATABASE in order to join 2 DBs together and cross-query - this will help with any cross-faction or cross-realm data, so having everything split up into ~1000 DBs might not be as limiting as it seems (see the sketch after this list).
    3) With the raw data, and given that each realm will have its OWN DB, I'm not sure that I'd even need more than 1 write connection at any given time. As long as I bundle all write transactions into the same script (i.e. insert new auctions, update existing ones, delete outdated ones), I shouldn't ever have any conflicting write processes.
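
    To make 1) and 2) concrete, a minimal sketch (the file names and item ID are made up):

    ```php
    <?php
    // 1) PDO creates the DB file on first connect, so spinning up a new
    //    faction/realm DB is just a matter of opening it.
    $db = new PDO('sqlite:data/proudmoore-horde.sqlite');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // 2) ATTACH DATABASE bolts a second file onto the same connection,
    //    so a cross-faction query can read from both.
    $db->exec("ATTACH DATABASE 'data/proudmoore-alliance.sqlite' AS alliance");

    // e.g. cheapest buyout for one item across both factions
    $cheapest = $db->query(
        'SELECT MIN(buyout) FROM (
             SELECT buyout FROM auctions WHERE item = 82800
             UNION ALL
             SELECT buyout FROM alliance.auctions WHERE item = 82800
         )'
    )->fetchColumn();
    ```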

    All in all, I think things are looking far less bleak than they were last night, so I'm calling this a successful afternoon.