• Home
  • Blog
  • Resume
  • Contact
  • Projects
  • Gallery
  • Amit’s Resume
  • About Nagpur
KEEP IN TOUCH

Free Certificate in Data Mining/Analytics

Feb01
2011
Leave a Comment Written by admin

Analytics or data science has following components:

  • data mining/machine learning/statistics
  • data visualization
  • database management
  • programming

There are some free online courses that cover many of these areas, and these courses are usually part of a degree or a certificate program in data mining. Those who are new or interested in this field can learn a whole lot without paying a dime. Here is the list:

  • Intro to Probability and Statistics (Carnegie Mellon)
  • Machine Learning 101/102
  • GovData (MIT/Harvard)
  • STATS 120: Information Visualisation (The University of Auckland)
  • R Programming (UCLA)
  • CS 229: Machine Learning (Stanford) (videos)
  • Linguistics 420: Statistical Natural Language Processing (Georgetown)
  • SI 508: Networks: Theory and Application (University of Michigan)
  • CS 591: Data Mining (West Virginia University)
  • STATS 782: Computing for Statisticians (The University of Auckland)
  • 6.867: Machine Learning (MIT)
  • Andrew Moore’s Slides on Statistical Data Mining Tutorials
  • Lots of tutorials (Data Mining Tools)
  • Capstone project:  kaggle or kdd (for a bigger list see kdnuggets)

Some free text books:

  • The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman
  • Mining of Massive Datasets by Rajaraman and Ullman

In addition, there is an excellent thread on quora on how to become a data scientist that covers lot of things and is a very good resource on the practice of analytics.

Posted in Books, Data mining - Tagged analytics, courses, data mining, data science, free, machine learning, R, text mining, visualization

Linked Table Query Returning Incorrect Results

Aug18
2010
Leave a Comment Written by admin

We had this annoying problem: a simple query in Access to a linked table (using ODBC and MS SQL server) returned incorrect results, though the same query returned correct results in MS SQL server. By incorrect I mean the field values were different, so rather than returning AZ, the query returned BZ (some other value in that field).

We thought that it was a table size issue, may be the driver could not handle huge data. However, the problem was with the “unique record identifier” while linking the table, as JM reported here. I did not have this problem any longer when I did not select any fields for the unique record identifier while linking the table.

Posted in Access - Tagged Access, error, linked tables, ODBC, SQL server

Microsoft query in Excel and Case statement

Apr15
2010
Leave a Comment Written by admin

It is frustrating that I need to pull the data from SQL server using complex queries and then format it in Excel to ”prettify” the data. Thankfully, I can use the Microsoft query in Excel to get the data from SQL server and create a table or a PivotTable. Today, I encountered a problem with the case statement.

Although my syntax was right, MS Query kept throwing exception about the CASE statement. I read some discussions on-line, and some people had the wrong syntax, such as including the alias of the case statement in the group by statement. I did not have that problem. Finally, I found the solution: It was the order by clause. I was using the alias of the case statement in the order by clause. I removed it and the query ran just fine.

Posted in Error messages - Tagged excel, MS query, SQL

Mining publication data

Mar08
2010
1 Comment Written by admin

I found treasure! Publication and citation data with metadata (author names, addresses, affiliation): http://citeseer.ist.psu.edu/oai.html

I was reading about knowledge management here, which says that knowledge management is nonsense. I agree to a certain degree, not because of the field, but because of its name. How do you manage knowledge? Isn’t knowledge derived? Wasn’t information “science” good enough? (I have problem with “business intelligence” as well…) As the author of that article says, it is a new term coined to attract attention. He does provide some evidence, but I was left unsatisfied.

I thought of performing text mining on publications database, and citeseer has this great resource. I downloaded the data (72 XML files), performed some clean-up, and ran a script to pull citeseer ID, author addresses, and publication dates where the abstract contained the term “knowledge management”. I was interested in seeing the trend of publication and places of publication.

Have a look at this chart:
Publications by year

There is a definite growth in this area, at least in research and publications. It is startling to see a paper published in 1970, and a peak in 2002. As citeseer data ends in 2004, it is possible that it doesn’t have complete publication history of 2004.

Geographic location wise, the US and Europe leads the way in number of publications:
Worldwide Publications

Posted in Uncategorized - Tagged scripting, text mining, visualization

My experiments with sparklines

Sep23
2009
4 Comments Written by admin

For the uninitiated, Sparklines are “data-intense, design-simple, word-sized graphics” according to its inventor Edward Tufte. I always wanted to include them in my trend reports. The challenge: How and Which tool to use?

The data came from this report came from a complex query using SQL server and Access that had this format:

College Major 2005 2006 2007 2008 2009
A B 10 15 14 18 25

Some options I investigated and tried (be sure to read this page, you might find many more):

  • Google’s Chart API (http://code.google.com/apis/chart/types.html): you can embed these in Google spreadsheets, create HTML pages using Excel via VBA, or embed them in Excel sheets. Alas, none would work the way I would like them to work: In-cell graphics
  • Use Sparklines for Excel add-in: this add-in will create great in-cell charts (bar, bullet graphs, sparklines, etc), but copying them down is difficult and resource intensive, and any change you make in the column size will alter the shape of that chart. In addition, I had more than 400 rows to populate–these would be too many objects in a spreadsheet for Excel to handle. My machine froze when I tried to copy it for 10 rows.
  • Use R’s implementation by Jason Dieterle (search on this page for Jason Dieterle (email), January 28, 2008. Works very good. I modified the code to print max and mins only, but the function generates a graphic, which needed to be embeded in LaTex file. It did not work nicely. Charts were too big to fit in a cell of a table.
  • Create bar charts in Excel using REPT function. I tried different font sizes and styles, but it didn’t look pretty, for the range of data varied. (Hint: make the alignment of text 90, use pipe signs, create columns and graphs for each data value, remove gridlines, keep the columns very close. It almost worked.)
  • Use spark package for LaTex, doesn’t work in pdflatex, and you have to play a lot with the settings
  • Use sparklines package for LaTex. Problem was that data needed to be normalized (or scaled) from 0 to 1, and needed extra parameters for min and max points. Solution: create a normalize function in Excel, and write a big formula to produce the exact needed string for the sparklines to work i.e.:
    \begin{sparkline}{5}
    \sparkdot 1 1 blue
    \sparkdot 0.2 0 red
    \spark 0.2 0 0.4 0.0625 0.6 0.5625 0.8 0.75 1 1 /
    \end{sparkline}

Here’s the normalize function:

Public Function Normalize(cell2Normalize As Range, WholeRng As Range)
       Normalize = (cell2Normalize.Value - WorksheetFunction.Min(WholeRng)) / (WorksheetFunction.Max(WholeRng) - WorksheetFunction.Min(WholeRng))
End Function

Here’s the big formula to convert the range for data values from 2005 to 2009 (in the range E2:I2) to the sparkline LaTex environment.

="\begin{sparkline}{5} " & "\sparkdot " & CHOOSE(MATCH(MAX(E2:I2),E2:I2,0),0.2,0.4,0.6,0.8,1) & " " & 1 & " blue " & "\sparkdot " & CHOOSE(MATCH(MIN(E2:I2),E2:I2,0),0.2,0.4,0.6,0.8,1) & " " & 0 & " red " & " \spark 0.2 " & Normalize(E2,E2:I2) & " 0.4 " & Normalize(F2,E2:I2) & " 0.6 " & Normalize(G2,E2:I2) & " 0.8 " & Normalize(H2,E2:I2) & " 1 " & Normalize(I2,E2:I2) & " / \end{sparkline}"

This last option worked beautifully. I dragged the formula down. I selected the data, and clicked on “Convert Table to LaTex” button (using this add-in). Copied the LaTex code to clipboard and pasted it in my LaTex editor. Manually merged the rows for colleges (using \multirow), and generated a beautiful looking pdf.

I was very happy. Printed it in color. Got a request back very soon that there should be total rows.

I forgot about the sparklines and created a report in Access with plain old numbers in less than 15 mins. Gave it back.

I tried to repeat this – Sparkline in Cognos -in Access with no luck.

(I later tried it one more time: Got the data in Excel using External data> Access, created a pivottable with rows and all, did some formatting, copied and pasted values and formats, inserted sparkline code, converted it to LaTex, copied and pasted in LaTex editor, and here’s the beautiful looking sample pdf of 16)

I wish there were simple reporting solutions that included awesome data visualization tools. (BTW, Excel 2010 will have sparklines: link) For this report, I did try Sweave, R, and LaTex, but because of the time constraints I could not investigate it further.

Please comment if you know any other way which meet (or don’t, Tableau is certainly one) these conditions: inexpensive (read free), efficient, and repeatable.

Posted in Uncategorized - Tagged Access, dashboard, excel, LaTex, R, report, sparklines, VBA, visualization
« Older Entries

Tags

Access Alt F8 Books boxplot cells charts count cursor dashboard data mining dbase design error excel excel functions export filter flip LaTex MS query Number Err ODBC pipes Press Alt F11 Public Sub python R random numbers Range Cells report scripting software sparklines SQL SQL server stack columns statistics stemming string tag cloud text mining UDF VBA visualization wildcard

Network

View Ashutosh Nandeshwar's profile on LinkedIn

Recent Comments

  • W. McNabb on The search key was not found in any record in Access
  • Manuel on The search key was not found in any record in Access
  • Wendy Naples on The search key was not found in any record in Access
  • larry on Access Export to Excel (2007)
  • Betty Chou on Projects

EvoLve theme by Blogatize  •  Powered by WordPress nandeshwar.info