Here’s what I did to get a cool-looking tag cloud of data mining jobs:
- Used Yahoo Pipes (I created my own pipe, but this one has more feeds): Job Feed Aggregator by Sean Dolan. This pipe aggregates feeds from different job websites and gives you unique job listings that you can subscribe to via RSS.
- Subscribed to the RSS feed for the keyword “data mining”
- Copied the job descriptions and requirements of many jobs, and saved the text file
- Got the python stemmer
- Applied the python stemmer to the text file. A stemmer truncates words to their roots, so that variants of a word are combined into a single word. (This is the first or second step in text mining.)
- Created a tag cloud using the services of http://www.wordle.net/ . Wordle filters out “stop words,” so I didn’t have to remove them myself. Stop words are common words of a language that don’t necessarily add any value for categorization.
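The stemming and counting steps above can be sketched in a few lines of Python. This is only a rough illustration: `naive_stem`, `stem_counts`, the suffix rules, the stop-word list, and the sample text are all made up for this example. They are not the stemmer used for the tag cloud, and they are far cruder than a real stemming algorithm.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with", "is", "for"}

# Suffixes tried in order. A toy rule set, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least two characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def stem_counts(text):
    """Lowercase, tokenize, drop stop words, stem, and count the stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(w) for w in words if w not in STOP_WORDS)


# Made-up text standing in for the saved job descriptions.
sample = (
    "SAS and SQL experience required. Strong modeling skills. "
    "Built models in SAS. Analytical skill with data mining."
)
counts = stem_counts(sample)
print(counts.most_common(3))
```

Even this toy version turns "SAS" into "sa", because it blindly strips the trailing "s". The resulting counts are what a tag cloud tool would scale the word sizes by.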

Data Mining Jobs Tag Cloud
The most frequent word is: experience. Companies want people with experience in different data mining techniques. You’ll see that some other big words are: SAS (stemmed as sa), Excel, SQL, analytical skills, statistics, and quantitative skills.
And how do you master these skills, you ask?
- Get a graduate degree in statistics, economics, mathematics, computer science, financial engineering, or industrial engineering with emphasis on databases, data mining, and marketing.
- Successfully complete data mining projects using free, open-source data mining tools, such as Weka, R, Orange, and RapidMiner.
- Participate in data mining competitions. SAS’s data mining conference has a data mining competition every year.
Have a look at a detailed study by Pejic Bach, M: Creating profile of data mining specialist
Don’t you hate it when you are working with many fields and want to filter on a particular one, so you apply AutoFilter to all the fields, only to find that the field you were working on has scrolled out of sight and you are looking at cell A1 instead? I did. Not anymore.
Solution: Simply hit the left or the right arrow key, and it will take you to the left or the right cell of the field you were working on.
My gripe about graduate school is that the school focused on well-established software and neither embraced nor encouraged open-source software. If they had taught, or at least introduced, the following open-source software, it would have helped us immensely to produce the best-looking reports with great data analysis. Instead, we had to struggle with SAS/Excel to get the graphs and analysis we needed, and then spend hours formatting in Word. Why didn’t they teach us:
- LaTeX: a powerful typesetting system, where you focus on writing and not on formatting. It takes care of all the headings, page numbering, figure/table/equation numbering, TOC, bibliography/citation, and more. Although the learning curve is rather steep, once you get the hang of it, life becomes so easy. For Windows, you need to get MiKTeX and any LaTeX editor, such as LEd, LyX, or WinEdt
- R: an awesome statistical package with wonderful graphics components. Producing stunning graphics and statistics has never been so easy. It had me at summary. Any software that can produce the following, just from the summary(iris) command, has to be great:

Summary produced by R of the iris data set
In my thesis, I had plenty of equations, and every time I made a significant change, MS Word would happily turn those equations into empty white boxes, and then I had to rewrite them. In retrospect, I find it ridiculous that I was entering citations manually: every time I added a new reference, I had to manually change the bibliography page and the page where I cited that reference. With LaTeX, all of this is a breeze.
Descriptive stats, box plots, normal curves, neural networks, charts with LaTeX equations, and a lot more can all be done easily in R, and the best part is “repeatability.” With simple commands, you can export all the charts as images for various data sets or for various training algorithms. No sweat! Try that with Excel. (I did, some years ago.)
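To make the repeatability point concrete, here is a minimal sketch of the same idea in Python rather than R: one small function, applied in a loop to several data sets. The function name `summarize`, the data set names, and the numbers are all made up for illustration. The output is only similar in spirit to R's summary, not a reimplementation of it.

```python
import statistics


def summarize(values):
    """Return min, quartiles, mean, and max for a list of numbers."""
    data = sorted(values)
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "mean": statistics.mean(data),
        "q3": q3,
        "max": data[-1],
    }


# Made-up numbers, for illustration only.
datasets = {
    "run_a": [1, 2, 3, 4, 5],
    "run_b": [10, 20, 30, 40],
}

# The same analysis is repeated for every data set with no manual steps.
for name, values in datasets.items():
    print(name, summarize(values))
```

Adding a new data set means adding one entry to the dictionary. Nothing else has to be redone by hand, which is exactly what is painful in a spreadsheet.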
Ever since I read the book The Visual Display of Quantitative Information, by Edward R. Tufte, I have been captivated by the idea of creating good design while doing data analysis or dashboard building. Although Excel 2007 charts are much nicer than those of previous versions, I have started disliking Excel charts. I am even developing an eye for picking out the bad information pixels. Apart from Tufte’s books, these books have helped me immensely:
- The Elements of Graphing Data, by William S. Cleveland
- Information Dashboard Design, by Stephen Few
Administrators/executives have neither the time nor the patience to understand complicated data mining algorithms and their results, and when they don’t understand them, those results will most probably never go into “production.” Simple, yet informative, designs and charts have a better chance of going into production, which I am sure every data miner longs for.
I found a course web-site on Information Visualization: http://www.stat.auckland.ac.nz/~ihaka/120/lectures.html
I was trying to enable an add-in for Excel 2007, and I kept getting this error:
Access to the VB project is not trusted
You can turn this message off by going to the Developer tab in the ribbon, clicking the Macro Security button, and checking the box for “Trust access to the VBA project object model.”
A word of caution: be wary of the add-ins/projects that need VBA access. Don’t allow any project/add-in with VBA access, unless you know its exact purpose or the author of that project.