Wednesday, August 01, 2007
Using MS Query To Trim Down Large Product Feeds
Data feeds are a popular way of building a website around a merchant's product set. These are typically available as a CSV or XML format download. There's also an abundance of website scripts out there which are ready go with a CSV that you have downloaded from a network or merchant.
Two related issues have arisen for me before. Sometimes these product feeds can be huge files. This causes a problem because you run the risk of quite a performance hit on your web-server if you're expecting to frequently open up and read from a very large text file. The logical step is to open it up and trim it down to just the products related to your site (if appropriate to do that), except this raises another issue because most people's default application for working with CSVs is Microsoft Excel and that has historically had a row limit of about 65,000 rows which means you can't fully open a data feed file with more products than that.
As I mentioned a while back, this has actually been resolved in the latest version of Excel which has some huge new row limit, but if you're like me and still on a previous version, then you might have wondered before how you can best work with these files. The way I do it is to use MS Query.
MS Query is a nice little add on available within Microsoft Office which lets you connect to a CSV, other database file formats, and extract data but crucially you can apply criteria so you only pull out the info you need.
A practical example is that I use HMV's videogame product information on my Toys and Videogames site. There's a whole load of other stuff in that product feed that I don't want - books, DVDs, CDs etc - so I use this method to pull out just the videogame data. Here's how to do it:
1. Open up Excel (I'm using Excel 2003)
2. Click Data -> Import External Data -> New Database Query
3. "New Data Source" should be selected already, so just click OK
4. A box tiled "Create New Data Source" appears. In field 1, Give your Data source some kind of friendly name e.g. Merchant X Product Feed. Then in the drop down box, look for "Microsoft Text Driver (*.txt, *.csv)". Click the Connect button below.
5. A new box should appear titled ODBC Text Setup - untick the check box for Use Current Directory and use the Select Directory button to browse to the directory where your CSV is, and click OK.
6. You will be returned to the "Create New Data Source" window and now the 4th field - a drop down box - is enabled and you select your CSV from a list of available datasources in that directory. Do that and click OK.
7. You have now created your Datasource and you will be back at the first screen. So rather than choose to create a new Datasource (which you did first time round), now select the one you created and click ok.
8. The Query Wizard appears now allows you to select the columns that you want to include in your ouput. Since the aim of the game is to make a nice compact database, try and only include the columns that your website requires - sometimes there's duplication or columns you don't need.
9. Click Next. Now you get to filter out the data that you don't want. Here it comes down to knowing what fields contain what data. But as an example, if you were making a site about Sony products using a feed from a electrical merchant, you might do something like "only include rows where "productname" contains "sony". Click Next, Next, Finish and MS Query will plonk your data right in the workbook for you. Of course, if even after your filtering, you still have more than 65,000 rows, you're back to square one :-)
You can save your query so that this process only takes a few clicks next time, or you assign it to a macro.
So just to recap, the point of doing this is to make it easier to work with product feeds which are too big and help you to reduce the size so that you can increase website performance and make them possible to open in Excel. All that said, I would look at using a dedicated database server rather than a CSV if the product set was anything approaching large.
Two related issues have arisen for me before. Sometimes these product feeds can be huge files. This causes a problem because you run the risk of quite a performance hit on your web-server if you're expecting to frequently open up and read from a very large text file. The logical step is to open it up and trim it down to just the products related to your site (if appropriate to do that), except this raises another issue because most people's default application for working with CSVs is Microsoft Excel and that has historically had a row limit of about 65,000 rows which means you can't fully open a data feed file with more products than that.
As I mentioned a while back, this has actually been resolved in the latest version of Excel which has some huge new row limit, but if you're like me and still on a previous version, then you might have wondered before how you can best work with these files. The way I do it is to use MS Query.
MS Query is a nice little add on available within Microsoft Office which lets you connect to a CSV, other database file formats, and extract data but crucially you can apply criteria so you only pull out the info you need.
A practical example is that I use HMV's videogame product information on my Toys and Videogames site. There's a whole load of other stuff in that product feed that I don't want - books, DVDs, CDs etc - so I use this method to pull out just the videogame data. Here's how to do it:
1. Open up Excel (I'm using Excel 2003)
2. Click Data -> Import External Data -> New Database Query
3. "New Data Source" should be selected already, so just click OK
4. A box tiled "Create New Data Source" appears. In field 1, Give your Data source some kind of friendly name e.g. Merchant X Product Feed. Then in the drop down box, look for "Microsoft Text Driver (*.txt, *.csv)". Click the Connect button below.
5. A new box should appear titled ODBC Text Setup - untick the check box for Use Current Directory and use the Select Directory button to browse to the directory where your CSV is, and click OK.
6. You will be returned to the "Create New Data Source" window and now the 4th field - a drop down box - is enabled and you select your CSV from a list of available datasources in that directory. Do that and click OK.
7. You have now created your Datasource and you will be back at the first screen. So rather than choose to create a new Datasource (which you did first time round), now select the one you created and click ok.
8. The Query Wizard appears now allows you to select the columns that you want to include in your ouput. Since the aim of the game is to make a nice compact database, try and only include the columns that your website requires - sometimes there's duplication or columns you don't need.
9. Click Next. Now you get to filter out the data that you don't want. Here it comes down to knowing what fields contain what data. But as an example, if you were making a site about Sony products using a feed from a electrical merchant, you might do something like "only include rows where "productname" contains "sony". Click Next, Next, Finish and MS Query will plonk your data right in the workbook for you. Of course, if even after your filtering, you still have more than 65,000 rows, you're back to square one :-)
You can save your query so that this process only takes a few clicks next time, or you assign it to a macro.
So just to recap, the point of doing this is to make it easier to work with product feeds which are too big and help you to reduce the size so that you can increase website performance and make them possible to open in Excel. All that said, I would look at using a dedicated database server rather than a CSV if the product set was anything approaching large.
Comments:
<< Home
I don't have words to describe how useful this is going to be on mybathroomfinder.com - thank you thank you thank you!
Post a Comment
<< Home
Subscribe to Posts [Atom]










