Register
Welcome to the Stormspire.net WoW Gold Forums!

Greetings, guest! Welcome to Stormspire.net, your #1 resource for WoW and Diablo 3 Gold, Professions, and everything in between!


If this is your first visit, make sure to check out the Stormspire.net Survival Guide for a quick site orientation. You may also register to unlock all standard forum features, and to get rid of these lovely ads!


Consider Premium Membership if you wish to get the most out of the site.


We proudly host the Official forums for both TradeSkillMaster and The Undermine Journal.

Results 1 to 10 of 10
  1. #1
    Entrepreneur
    Amandria's Avatar
    Posts
    58
    Reputation
    0
    Tagged in
    9 Posts
    Add to this user's reputation

    Long-time history/database for auctions

    Hey,

    I'm currently pulling the hourly AH snapshots for my realm off the blizz API and import those into a database - with a schema pretty much 1:1 to the json files I get. However this does not scale, I'm adding 4MB to my database every hour and even with indexes set my queries tend to get real slow.

    So I want to move to some more sophisticated design and want to know your opinions:

    How long should I keep "raw" data (i.e. the full dumps) in the database? (I plan on keeping the gzip'ed json-files for longer just in case I want to reimport them and look something up.)
    I'm thinking a week should do (gives me maximum freedom in doing weird JOINs all over the place to get all kinds of information without any restriction).

    What aggregate data should I create and store long term? Which would you consider useful?
    Availability and min-buyout? or better avg-buyout? median of the cheapest x%? ...

  2. #2
    US-Stormrage(A)
    Ethereal Contributor
    Xsinthis's Avatar
    Posts
    397
    Reputation
    7
    Tagged in
    95 Posts
    Add to this user's reputation
    well first of all successive snapshots should have overlapping auctions, so you can just update these on the old snapshot with data from the new instead of adding new, and repetitive, rows for them.

    Also, you can try pruning some of the data from the auctions before you record it, you probably don't need all the data options offered by the json

  3. #3
    Entrepreneur
    Amandria's Avatar
    Posts
    58
    Reputation
    0
    Tagged in
    9 Posts
    Add to this user's reputation
    Yeah you're right I probably don't get anything from storing the time left and should move to first and last seen time-stamps (plus updating auctions instead of storing them for each snapshot).

    Another thing to get rid of would that realm+faction and make it a realm-id (plus a table where I can look those up again).

    Let's see how much I can save with that changes.


    Side-note: if anyone is interested in the scripts I'm using to do all this I might clean them up and put them on github or something.

  4. #4
    EU-Defias Brotherhood(A)
    Wind Trader
    Pliaksi's Avatar
    Posts
    156
    Reputation
    56
    Tagged in
    71 Posts
    Add to this user's reputation
    For speed purposes It really depends what DB are you using. For instance if I recall correct the Oracle Express edition utilises only a limited amount of resources (e.g. 1 core up to 1GB RAM etc). I would also check if tables are formatted correct (Numbers and Numbers, Varchars not exceeding the needed capacity...). As for further normalising the data (thus improving performance) I would consider moving out Seller's names / Timeleft into lookup tables and doing a PK/FK relation (though you will need PL/SQL support to automate this).

    As for keeping data - how are you doing it? Are you having separate tables for each import or a single table with timestamps? I personally would use the latter. For historical data there are 2 approaches:
    1. Keep daily unique data.
    2. Keep weekly aggregated data. For this it really depends what analysis you are running on it, but the most statistically significant would be the min, max, count, average, median and stdev. As I personally only work with buyouts, I tend to keep only them.
    Last edited by Pliaksi; August 29th, 2011 at 01:08 AM.
    It's not a bug, it's a feature !

  5. #5
    US-Emerald Dream(Both)
    Moderator
    Kathroman's Avatar
    Posts
    2,451
    Reputation
    110
    Blog Entries
    1
    Tagged in
    710 Posts
    Add to this user's reputation
    @Amandria . I've personally been storing raw data in a MySql DB and it looks like 30MB covers a large-pop realm for about a week's worth of data. As far as the JSON files - I'm saving the most recent to a local folder to be able to better parse it. After that, it get's rewritten by the next one. Since each JSON file is ~6MB itself, even if I stored every single field, I don't think it would be much more than 40MB. Compared to over 1000MB worth of JSON files in 1 week...definitely the better option.

  6. #6
    Entrepreneur
    Amandria's Avatar
    Posts
    58
    Reputation
    0
    Tagged in
    9 Posts
    Add to this user's reputation
    @Pliaksi I am using MySQL 5.1, types are correct, well maybe varchar limits for sellers are to long. Sellers are left in the raw data, timeLeft is an enum type.

    I guess I'll do aggregation on import and have 3 tables which keep updated:
    1. Current snapshot, as raw as possible, in-memory table
    2. Per-auction data (with time first seen, time last seen, seller moved to an extra table)
    3. Historical data (item availability and some statistics on the price)

    Not sure yet how much I want to aggregate historical data (one entry for each scan or for each day)

    @Kathroman I store the gzip'ed JSON files (I put accept-encoding: gzip or something in my wget command to fetch them) which are about 1MB per snapshot.
    In MySQL DB grows about 4MB per snapshot.

  7. #7
    US-Emerald Dream(Both)
    Moderator
    Kathroman's Avatar
    Posts
    2,451
    Reputation
    110
    Blog Entries
    1
    Tagged in
    710 Posts
    Add to this user's reputation
    Quote Originally Posted by Amandria View Post
    @Pliaksi I am using MySQL 5.1, types are correct, well maybe varchar limits for sellers are to long. Sellers are left in the raw data, timeLeft is an enum type.

    I guess I'll do aggregation on import and have 3 tables which keep updated:
    1. Current snapshot, as raw as possible, in-memory table
    2. Per-auction data (with time first seen, time last seen, seller moved to an extra table)
    3. Historical data (item availability and some statistics on the price)

    Not sure yet how much I want to aggregate historical data (one entry for each scan or for each day)

    @Kathroman I store the gzip'ed JSON files (I put accept-encoding: gzip or something in my wget command to fetch them) which are about 1MB per snapshot.
    In MySQL DB grows about 4MB per snapshot.
    Yeah, as mentioned above form others, I'm using an update function via the "Auc" field as a PK for my raw data and including the lastmodified value from the header to keep track of when auctions have been updated. I'm also running a separate analysis table, but the calculations get processed AS the data is being parsed/uploaded, so there isn't any need to store any significant amount of historical "raw" data.

    ...FWIW

  8. #8
    US-Emerald Dream (A)
    Wind Trader
    Stede's Avatar
    Posts
    1,049
    Reputation
    96
    Blog Entries
    2
    Tagged in
    379 Posts
    Add to this user's reputation
    Kath and I have a pretty well-fleshed out framework, but the answer is that it depends on what question you're trying to answer.

    Do you want to look at long-term or cyclical trends? Do you want to dig for correlations in prices among items? Kath and I (he's doing all the DB work) are going with the 15th percentile as Market Value, and creating a processed table for each pull of the raw API data. It includes Quantity, Market Value, itemid, realm, and time of the pull. After that, we ditch the raw API data completely and stick to a DB that includes these hourly processed tables. From there, we have an additional processing step that does a couple things. First, it goes back and cleans off any data that is more than a week old in this DB - it gets popped into an archive. Next, we find the Mean and StdDev of the MV for each item, given the remaining week of data, and we add these two in as variables on the current processed data.

    This essentially gives us a 1-week moving average of the 15th percentile in terms of price. It also means that our archive is complete and will never need additional processing. I can pull January to June and run analyses on it without any additional DB steps.
    If a post has helped you or made you think about something in a different way, please leave feedback by clicking the star in the bottom left of the post.

    Some will; Some won't - So what? Let em die poor.

  9. #9
    US-Emerald Dream(Both)
    Moderator
    Kathroman's Avatar
    Posts
    2,451
    Reputation
    110
    Blog Entries
    1
    Tagged in
    710 Posts
    Add to this user's reputation
    Now, if we could ever get someone to teach Kath how to properly program PHP, we'd really have something.

  10. #10
    US - Burning Legion
    Entrepreneur
    Abeni's Avatar
    Posts
    147
    Reputation
    14
    Tagged in
    52 Posts
    Add to this user's reputation
    So I made a related post on the TUJ forum...just noticed this old thread as well.

    I'm not so interested in long term...for example I would be fine with just the 2 weeks. What I'm looking for is the price and quantity available of Obsidium Ore (for example) as an average over every server. (Such that you can see how the overall price and quantity available has been doing for the past 2 weeks) Perhaps you could separate it by horde and alliance or include both and ideally you would throw out the highest and lowest servers.

    For example there's a lot of rumor and a new post made every time someone thinks bots got ban waved...but no way to REALLY tell from just using TUJ by itself. Also when something new and interesting happens like a patch...new items etc...it would be cool to look up an item...compare that to your realm and see if your realm is riding that wave or falling behind the averages.

    I wouldn't need every item ever posted to be included...mostly raw mats and some choice crafted items from those materials. I put this as a suggestion in the TUJ forum but I'm also wondering if this is possible to do on my own...preferably in at least a semi-automated way.
    Last edited by Abeni; October 4th, 2011 at 04:20 PM.

 

 

Similar Threads

  1. Item Database
    By Bookerz in forum Official Forum of The Undermine Journal
    Replies: 1
    Last Post: June 30th, 2011, 08:03 AM
  2. Designate an addon database for LilSparky's Workshop
    By Der in forum Addons & UI Customization
    Replies: 3
    Last Post: January 27th, 2011, 11:18 AM
  3. The shortest war in history, probably.
    By Thefluffyrocker in forum General Discussion
    Replies: 7
    Last Post: November 29th, 2010, 07:36 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •