  1. #1
    elfchief (Apprentice)

    Willing to Share DB Schema?

    So, asking as someone who set up an auction-stats gatherer waaaaay back in the day (think Burning Crusade timeframe... I had to hand-parse the logs from Auctioneer, which I ran scans with once a day on my realm): any chance y'all would be willing to share the database schemas used by The Undermine Journal (publicly, privately, whichever)? Mostly a curiosity thing, to compare what you did to what I did way back when.

    No interest in putting up my own service or anything, just interest in the database geekery.

    Thanks!

    -j

  2. #2
    Erorus (Developer - The Undermine Journal)

    Eh, I'm not particularly inclined to post the table create statements for even a subset of the 80+ tables involved in TUJ, but I did give it some serious thought.

    How about you show me some tables, and I can tell you how I've done things differently (and maybe how I'd do 'em today if I had to create TUJ all over again).

  3. #3
    vetchlin (Entrepreneur)

    Quote Originally Posted by Erorus View Post
    How about you show me some tables, and I can tell you how I've done things differently (and maybe how I'd do 'em today if I had to create TUJ all over again).
    ok, I'll bite:

    CREATE TABLE activeauctions (realm text, item int, data blob, PRIMARY KEY(realm, item));

    which is actually CQL3, because I'm using cassandra 1.2 rather than mySQL for auction data. The 'realm' column is geo/server/market, e.g. "eu/bloodhoof/horde", and the blob is a custom binary packing of the set of auctions for a given item. It's (a) incredibly space efficient, since the realm string is stored only once, all the data is compressed by cassandra, and it's essentially an index-organized table, so there's no index space overhead; and (b) access efficient (as in, O(1) disk seeks) for building item-centric web pages. It's still O(1) seeks for seller-centric pages too, but that requires the app to grab all the current auctions for the realm using a range fetch on item ID and then discard any not posted by the required seller. Since disk bandwidth is cheap that's not a major problem, but it's possible to denormalize and do

    CREATE TABLE activeauctions_byseller (realm text, seller text, data blob, PRIMARY KEY(realm, seller));

    also to provide for that use case. It doubles the disk space usage, but considering it covers only the active auctions, that's a trivial amount of space. The downside, of course, is that the entire blob has to be written back even if only a single auction in the set changed in a given update. However, since those updates come only once an hour, it's a more efficient approach than maintaining an auction event log table and recalculating the current state from it at read time.
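
    To make the access patterns concrete, the reads against those two tables look roughly like this (a sketch; the item ID and seller name are just example values):

    -- item-centric page: one partition/clustering key lookup, O(1) seeks
    SELECT data FROM activeauctions
    WHERE realm = 'eu/bloodhoof/horde' AND item = 72092;

    -- seller-centric page without the denormalized table:
    -- fetch the whole realm partition, filter by seller in the app
    SELECT item, data FROM activeauctions
    WHERE realm = 'eu/bloodhoof/horde';

    -- seller-centric page with the denormalized table
    SELECT data FROM activeauctions_byseller
    WHERE realm = 'eu/bloodhoof/horde' AND seller = 'Somegoblin';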

    So, what's your view on NoSQL databases? Would you stick with mySQL if you started over from scratch today?

  4. #4
    Erorus (Developer - The Undermine Journal)

    I thought about nosql, but between my inexperience with nosql and no clear advantage for what I would do with it, I stuck with mysql.

    It's important to recognize the constraints I imposed on the project: Must search by item, must search by seller, must have at least 2 weeks of detailed history, must support all US realms (around 500 auction houses). I noticed that you aren't storing any history, just stats on the current state of the AH. That makes things considerably easier.

    If I'm updating 500 auction houses per hour, that means a realm must finish its updates in about 7 seconds on average. You get a little wiggle room by updating more than one house at a time, but not too much because otherwise you'll end up disk thrashing.

    So, all that in mind, you don't really want to use more disk space than necessary. You don't want to perform more writes than necessary, even if your disks have enough bandwidth to handle large writes (i.e. even if seek latency is more of a concern than throughput).

    Main auctions table looks like this:

    CREATE TABLE IF NOT EXISTS `tblAuctions` (
    `realmid` smallint(5) unsigned NOT NULL,
    `auctionid` int(10) unsigned NOT NULL,
    `itemid` mediumint(8) unsigned NOT NULL,
    `quantity` smallint(5) unsigned NOT NULL,
    `bidamt` int(10) unsigned NOT NULL,
    `buyamt` int(10) unsigned NOT NULL,
    `sellerid` mediumint(8) unsigned NOT NULL,
    `firstseen` datetime NOT NULL,
    `hoursafter` tinyint(3) unsigned NOT NULL,
    `timeleft` tinyint(3) unsigned NOT NULL,
    `attributes` set('Live','Sold','LargeBid','LargeBuy') COLLATE utf8_unicode_ci NOT NULL,
    PRIMARY KEY (`realmid`,`auctionid`),
    KEY `itemsort` (`itemid`,`realmid`),
    KEY `sellersort` (`sellerid`,`realmid`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

    Realmid is an internal ID for realms. Quicker to store and compare 2 bytes in a numeric ID than to use strings. The table of realms matches the realmid to region, realm name, faction.

    AuctionID is blizzard's internal auction ID that they expose in the API. AuctionID is unique only to a realm, so that's why auctionid+realmid is the PK.

    ItemID, quantity - obvious

    bidamt and buyamt go up to the 32-bit unsigned max, which is the infamous 429,496g. With patch 4.0(?), auctions could be posted for more than that. Instead of chewing up 8 more bytes per record by using 64-bit numbers, I just divide the value by 100, store that, and set the LargeBid and/or LargeBuy attributes in the record. It makes queries a little messier, and on large auctions we lose the copper value, but it gives us a new max of 42,949,672g and keeps the table slim.
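
    The "messier queries" part just means un-scaling the value whenever the attribute is set, something along these lines (the realm and item IDs are example values):

    SELECT
      IF(FIND_IN_SET('LargeBuy', attributes), buyamt * 100, buyamt) AS buy_copper,
      IF(FIND_IN_SET('LargeBid', attributes), bidamt * 100, bidamt) AS bid_copper
    FROM tblAuctions
    WHERE realmid = 42 AND itemid = 72092;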

    sellerid goes to a lookup table for seller names. I could've designed that a bit nicer to keep the seller table small, but the point is I don't even store seller names as strings in the main auction table; they'd take up too much space. Here, all sellers take only 3 bytes per record.
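
    In spirit, those two lookup tables are shaped something like this (paraphrased; the table and column names here are stand-ins, not the exact DDL):

    CREATE TABLE IF NOT EXISTS `tblRealm` (
    `realmid` smallint(5) unsigned NOT NULL,
    `region` char(2) NOT NULL,
    `realmname` varchar(32) NOT NULL,
    `faction` enum('Alliance','Horde') NOT NULL,
    PRIMARY KEY (`realmid`)
    ) ENGINE=InnoDB;

    CREATE TABLE IF NOT EXISTS `tblSeller` (
    `sellerid` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
    `realmid` smallint(5) unsigned NOT NULL,
    `name` varchar(12) NOT NULL,
    PRIMARY KEY (`sellerid`)
    ) ENGINE=InnoDB;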

    Firstseen is the date and time of the first AH pull that returned that auction. It used to be when I pulled it from the web interface, but with the battle.net API it's simply the timestamp of the AH file. It's 8 bytes to store this date as a datetime type, which is unfortunate. I could've probably used timestamp (4 bytes), but mysql does weird things with time zones with the timestamp type, and I didn't want to figure it out. I'd change it, but it's not worth it at this point.

    hoursafter is how many hours after firstseen I continued to see this auction. I used to have a lastseen column, also a datetime type, which chewed up 8 bytes per record. Then, when I wanted to support EU realms, I had to find places to trim the fat out of this table. I figured that, with an hourly resolution and auctions that last 48 hours max, I could instead use 1 byte to store the number of hours past firstseen, and calculate lastseen that way. Saved 7 bytes.
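
    So lastseen is just derived on the fly, something like (example IDs again):

    SELECT auctionid, firstseen,
           firstseen + INTERVAL hoursafter HOUR AS lastseen
    FROM tblAuctions
    WHERE realmid = 42 AND itemid = 72092;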

    timeleft is what blizzard reports as the auction's time left (0=short, 1=medium, 2=long, 3=verylong). Could've probably squeezed that into the attributes field, but attributes was added later.

    attributes is self-explanatory. I mentioned largebid and largebuy above; live and sold are marked appropriately during parses.

    The itemsort key is "itemid,realmid" instead of "realmid,itemid" so I can search for an item across all realms, if I so choose.

    Every time I do a parse, I remove all live attributes from the live auctions on the given realm. Then, I do the classic mysql "INSERT ... ON DUPLICATE KEY UPDATE" for each auction. When an auction already exists, I calculate a new hoursafter value (latest timestamp - firstseen), set bidamt and timeleft, and re-set the live attribute. That "on duplicate key update" ability is so convenient. I even explored postgresql but didn't bother further once I saw it couldn't do that.
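
    Simplified, the per-auction upsert looks something like this (a sketch, not the verbatim statement; the bind parameters come from the parsed AH snapshot):

    INSERT INTO tblAuctions
      (realmid, auctionid, itemid, quantity, bidamt, buyamt, sellerid,
       firstseen, hoursafter, timeleft, attributes)
    VALUES (?, ?, ?, ?, ?, ?, ?, ?, 0, ?, 'Live')
    ON DUPLICATE KEY UPDATE
      hoursafter = TIMESTAMPDIFF(HOUR, firstseen, VALUES(firstseen)),
      bidamt     = VALUES(bidamt),
      timeleft   = VALUES(timeleft),
      attributes = attributes | 1;  -- 1 = the bit value of 'Live', the first SET member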

    Parse log output for a realm:
    2013-01-06 19:24:31 - Starting scan from 1 minute ago. EU-Ragnaros-Alliance. (98kb in 0.29 seconds, 341.14kbps)
    2013-01-06 19:24:32 - Old live auctions read after 1 seconds.
    2013-01-06 19:24:32 - live attribute removal complete for 2146 auctions after 1 seconds.
    2013-01-06 19:24:33 - tblAuctions update complete after 2 seconds. (2431 new, 13857 already seen)
    2013-01-06 19:24:34 - Skipped auction info complete after 3 seconds.
    2013-01-06 19:24:35 - tblAuctionsLive update complete after 4 seconds.
    2013-01-06 19:24:35 - tblAuctionsRare (10 rows) update complete after 4 seconds.
    2013-01-06 19:24:35 - tblAuctionMarketLive update complete after 4 seconds.
    2013-01-06 19:24:35 - Sold update (1086 auctions) complete after 4 seconds.
    2013-01-06 19:24:36 - Complete scan with 37781 Auctions done in 4.46 seconds on EU-Ragnaros-Alliance
    My auctions database is hosted at home, but the server that queries battle.net is on a hosted machine in a datacenter. So, my first server packages the data in such a way as to minimize the amount I have to download at home; it's mostly just changed data that gets sent home. That's how I updated 37781 auctions while downloading only 100,000 bytes.

    "13857 already seen" means that many auctions had data sent over but aren't technically new.. so they're "long", "medium" or "short" time left. For those auctions, I always send over the bid amounts in case they changed.
    tblAuctionsLive is a simple table with realmid and auctionid for all currently live auctions.. quicker than querying for the "live" attribute in the main auctions table.
    tblAuctionsRare is another simple table to support the "unusual auctions" page.
    tblAuctionMarketLive is a table with realmid/itemid primary key, for supporting category pages, the addon and market xml.. basically summary info for every item.
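
    For the shape of those support tables, think along these lines (paraphrased, not the exact DDL):

    CREATE TABLE IF NOT EXISTS `tblAuctionsLive` (
    `realmid` smallint(5) unsigned NOT NULL,
    `auctionid` int(10) unsigned NOT NULL,
    PRIMARY KEY (`realmid`,`auctionid`)
    ) ENGINE=InnoDB;

    -- tblAuctionMarketLive is keyed per item instead:
    -- PRIMARY KEY (`realmid`,`itemid`), plus whatever summary columns the pages need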

    tblAuctions is on an Intel SLC SSD, in a server with 16GB of RAM. It has about 250-300 million rows. The data takes up 21GB, and the indexes are another 16GB.

  5. #5
    vetchlin (Entrepreneur)

    Quote Originally Posted by Erorus View Post
    I noticed that you aren't storing any history, just stats on the current state of the AH. That makes things considerably easier.
    Umm, no. It's more a case of my telling you about only part of the schema. The historical archive is a separate table because it has different use cases. For building item/seller detail pages I need only the current auction details and historical stat summaries. For building history pages I need details of expired auctions too. Expired auctions are held indefinitely - disk is cheap, and it's impossible to determine when I may be overcome by an urge to do some long-term trend analysis.

    Quote Originally Posted by Erorus View Post
    If I'm updating 500 auction houses per hour, that means a realm must finish its updates in about 7 seconds on average.
    I'm currently running at well under one second per realm. The biggest cost is decompressing and parsing the JSON file, which unfortunately is a serial operation. However, retrieval of the previous period's data from the db can be done in parallel with that. Once both sources of data are ready the updating of the three markets can be done in parallel. It's CPU bound rather than disk bound on an SSD.

    Quote Originally Posted by Erorus View Post
    My auctions database is hosted at home, but the server that queries battle.net is on a hosted machine in a datacenter.
    Is that the reason for the excruciatingly high latency on uncached page requests? I'm assuming most of that time is multiple sequential queries to the db server.

    Quote Originally Posted by Erorus View Post
    tblAuctions is on an Intel SLC SSD, in a server with 16GB of RAM. It has about 250-300 million rows. The data takes up 21GB, and the indexes are another 16GB.
    What's the query load and response time like? Should be pretty good with that kind of hardware.

  6. #6
    Erorus (Developer - The Undermine Journal)

    Quote Originally Posted by vetchlin View Post
    I'm currently running at well under one second per realm. The biggest cost is decompressing and parsing the JSON file, which unfortunately is a serial operation. However, retrieval of the previous period's data from the db can be done in parallel with that. Once both sources of data are ready the updating of the three markets can be done in parallel. It's CPU bound rather than disk bound on an SSD.
    I run up to 3 concurrent parsers when inserting into the auction DB, so when they're all chugging I can get it down to maybe 2-3 seconds per alliance/horde house, but normally it doesn't need to run at full tilt. Your "well under a second" is pretty good when maintaining history and calculating summary stats. Does that time include writing to your historical table or just updating live and summary data? How many realms? (I assume all US?)

    Quote Originally Posted by vetchlin View Post
    Is that the reason for the excruciatingly high latency on uncached page requests? I'm assuming most of that time is multiple sequential queries to the db server.
    Can you better describe the "excruciatingly high latency"? Could be a few things going on here.

    First, yes, uncached item pages do need to hit the auction DB which is about 70ms by ping.

    Next, the script won't return the progress bar on uncached pages unless 1.5 seconds have already passed. This is because some pages would load in less than that time, and I thought it would be pointless to flash a full progress bar then force the client to reload to get the data. So that's why you might observe a "lag" for a second or so before the bar shows up, but no worries, it wasn't time wasted.

    Finally, what takes the most time to load is traversing the tblAuctions table I described earlier. I loaded Ghost Iron Ore on a random realm, and to display that page (which includes summary data, global data, checks to shun web bots, component items, preloading some tooltips, etc.) required 33 queries and took 2.46 seconds.

    Quote Originally Posted by vetchlin View Post
    What's the query load and response time like? Should be pretty good with that kind of hardware.
    The whole database isn't on the SSD; most of the tables that aren't updated constantly are still on spinning rust. I have no complaints, given how much it updates.

  7. #7
    vetchlin (Entrepreneur)

    The summary stats calc is all in memory, so that's cheap. It's the additional bytes of output to the db that hurt most, as I essentially denormalize the data for item and seller indexing in both the current and history tables. I'm still tinkering with the setup, but I think I can probably keep it under one second even with the history indexed both ways. Of course in a real cassandra installation it would simply be load balanced over multiple servers and disks, but it's kinda fun seeing how far you can push a single-node installation. I'll probably do US+EU servers to start and see how it goes. For the purposes of load testing you can actually just take one realm's file and duplicate it to simulate as many additional realms as desired.

    High latency is 4-5 seconds to get the final html down (on a 120ms trans-atlantic server ping), plus page render time. That's at least 3 seconds more than I'd like, but then again I'm used to corporate datacentres that have rather less than 70ms db connection latency :-) 33 queries in series on that kind of link is painful, especially given the parallelism an SSD-backed db is capable of.

    One of the things I like about the new cassandra release is the async wire protocol and driver. I can send out all the queries for a page in parallel and then render page fragments as I get data for each back, which keeps the latency down nicely. I'm also tinkering with writing the complete html page back to cassandra, essentially using it as a cache. The problem is that a cache miss adds further latency. I think I'll need to keep a bitmask cache index in RAM. Something more to tinker with at the weekend... after getting paid to do much the same sort of nerdy things the other five days of the week :-)

  8. #8
    Erorus (Developer - The Undermine Journal)

    Quote Originally Posted by vetchlin View Post
    Of course in a real cassandra installation it would simply be load balanced over multiple servers and disks
    I do miss that advantage of modern nosql servers, sorta. Adding capacity is supposed to be a bit easier. Back before I had the DB server hosted here, I had multiple VPSes hosting the auction tables. Sharding is pretty obvious when you have 500+ AHs all doing the same thing but not interacting. I still ran into performance issues from sharing the hosted disk with other clients, so that's when I consolidated the realms on a dedicated box at home.

    Quote Originally Posted by vetchlin View Post
    High latency is 4-5 seconds to get the final html down (on a 120ms trans-atlantic server ping), plus page render time. That's at least 3 seconds more than I'd like, but then again I'm used to corporate datacentres that have rather less than 70ms db connection latency :-) 33 queries in series on that kind of link is painful, especially given the parallelism an SSD-backed db is capable of.
    Yeah, that sounds about right, and I agree, I'm not thrilled with it. It's the price I pay for hourly resolution. I never thought too much about the latency; I figured once the query is sent over, results would be sent back in bulk with minimal overhead anyway, but your comment made me run another test. I ran a query in my test environment (the box literally sitting next to the auction DB) pointed at live auction data, and compared the speed against what I get in production. The public box is hosting other clients and doing other things that my test box isn't, and it's not dedicated hardware so it'll probably be a bit slower, but even with those caveats there's a difference in page generation time. In test (pointed at the prod auction DB) I render a different medium-pop realm's Ghost Iron Ore page in 4.7 seconds, but production takes about 6.3 seconds. This is with the records fresh in the database's cache, so the times are pretty consistent. Interesting.

    Quote Originally Posted by vetchlin View Post
    One of the things I like about the new cassandra release is the async wire protocol and driver. I can send out all the queries for a page in parallel and then render page fragments as I get data for each back, which keeps the latency down nicely.
    That's pretty cool. I'm using some older methods with php and mysql, and should learn this newer stuff to make things even more parallel.

    Quote Originally Posted by vetchlin View Post
    I'm also tinkering with writing the complete html page back to cassandra, essentially using it as a cache. The problem is that a cache miss adds further latency.
    I store the cached page data as compressed XML in the web server's mysql database, then upon cache hit, run that XML through XSLT and give it to the user, so it never hits the auction DB. If things get ugly I think I'll look at memcached and see if I can do proper page caching.
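
    Conceptually, the page cache is just a key pointing at a compressed blob; the shape is something like this (paraphrased, with made-up names, not the actual table):

    CREATE TABLE IF NOT EXISTS `tblPageCache` (
    `cachekey` varchar(64) NOT NULL,   -- however the page is identified (realm + item, etc.)
    `updated` datetime NOT NULL,
    `cachedxml` mediumblob NOT NULL,   -- compressed XML, run through XSLT on a hit
    PRIMARY KEY (`cachekey`)
    ) ENGINE=InnoDB;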

    Quote Originally Posted by vetchlin View Post
    Something more to tinker with at the weekend... after getting paid to do much the same sort of nerdy things the other five days of the week :-)
    Yep. Nice to have a job that's much like one's hobby...

Similar Threads

  1. Share your never-to-be-repeated AH mistakes
    By Sinshroud in forum Auction House Discussion
    Replies: 458
    Last Post: August 17th, 2014, 08:41 AM
  2. Share TSM auction groups?
    By sainter in forum General TradeSkillMaster Discussion
    Replies: 0
    Last Post: November 11th, 2012, 08:28 AM
  3. Share the wealth?
    By Schwarz in forum General Discussion
    Replies: 78
    Last Post: October 23rd, 2012, 07:12 AM
  4. [General] Theory on % share per profession
    By Bloomchicken in forum General Discussion
    Replies: 6
    Last Post: July 15th, 2012, 12:42 PM
