Walmart Store Sales Forecasting

Lately I have been working on a machine learning contest on Kaggle for Walmart. It’s about predicting their department-level sales. I had some fun writing code for this competition, and although I didn’t secure a good rank, I learned a lot from it. To my surprise, the people who ranked 2nd and 3rd didn’t really use any features to predict the sales values. Their approach was simple and took far less effort than mine. I tried Polynomial Regression and spent most of my time coming up with new features, which were mostly terms from a multinomial expansion of all the features. I am sure this experience will come in handy someday.

I will update my code soon and post a link to GitHub.

From sentences to vectors

Extracting sentences can be a challenging problem, even more so when you have to deal with clumsy tags in XML. The data I am currently dealing with is in a complicated XML format that looks something like this:

 <ce:para view="all">Rhabdomyolysis is a well appreciated complication of toxins, recreational drugs, and medications, and can occur due to predisposing medical conditions such as polymyositis or dermatomyositis.
 <ce:cross-ref refid="bib1">
 <ce:sup loc="post">1</ce:sup>
 </ce:cross-ref> To date, there is no known association of rhabdomyolysis with hepatitis C virus (HCV) infection. We present a patient with chronic cocaine abuse who developed recurrent rhabdomyolysis after acute HCV infection and propose a previously unrecognized clinical association.
 </ce:para>

It seems the best way to mark the end of sentences while handling punctuation and periods is to build a classifier or a small decision tree. However, I was quite satisfied with the performance of BreakIterator from the java.text package, combined with some tab and newline handling code. I get pretty good sentence separation in most cases, except where an abbreviation such as ‘Mr.’ or ‘etc.’ occurs.

// eElement, iterator (a sentence BreakIterator), cits and temp come from the surrounding code (not shown).
// Flatten newlines and tabs so the iterator sees one continuous string.
String source = eElement.getTextContent().replaceAll("\n", " ").replaceAll("\t", "");
iterator.setText(source);
int start = iterator.first();
int line = 1;
for (int end = iterator.next();
        end != BreakIterator.DONE;
        start = end, end = iterator.next()) {
    String sentence = source.substring(start, end);
    // Check for a missing citation list before iterating over it.
    if (cits != null) {
        for (String cit : cits) {
            // A citation marker shows up either bare (" 1 ") or bracketed (" [1]").
            if (sentence.contains(" " + cit + " ") || sentence.contains(" [" + cit + "]")) {
                System.out.println("Para : " + (temp + 1) + ", "
                        + "Line : " + line + " ==> " + sentence);
                break;
            }
        }
    }
    line++; // sentence counter within the paragraph
}
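
For anyone who wants to try the sentence splitting on its own, here is a minimal, self-contained sketch of the BreakIterator approach described above. The class name SentenceSplitter and the sample text are made up for illustration, and unlike my snippet it turns tabs into spaces rather than dropping them. It does nothing special about abbreviations, so the ‘Mr.’ problem is still there.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the original project.
class SentenceSplitter {

    // Splits text into sentences with java.text.BreakIterator after
    // flattening newlines and tabs into single spaces.
    static List<String> split(String text) {
        String source = text.replaceAll("[\\n\\t]+", " ")
                            .replaceAll(" {2,}", " ")
                            .trim();
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(source);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
                end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String sentence = source.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Made-up sample text with a newline and a tab in it.
        String text = "This is the first sentence.\nThis is\tthe second one.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```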

Following my post on finding opinion statements directed towards citations in an article, I am now trying to convert these sentences into meaningful vectors with rich features. One obvious approach is the bag-of-words model, but that won’t really capture the composition of the sentence or the key features that contribute to forming an opinion about something. I am looking to see if I can do more than just bag of words.
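
To make the limitation concrete, here is a small sketch of a bag-of-words vectorizer. The class name BagOfWords, the tokenization rule (lowercase, split on anything that is not a letter or digit) and the two example sentences are all my own assumptions for illustration. The two sentences express opposite opinions about the cited work, yet they contain the same words and so map to the same vector.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not code from the project.
class BagOfWords {

    // Counts lowercase word tokens; word order is thrown away.
    static Map<String, Integer> vectorize(String sentence) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : sentence.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sentences with opposite meaning but identical words.
        String a = "The method in [1] outperforms our baseline.";
        String b = "Our baseline outperforms the method in [1].";
        System.out.println(vectorize(a));
        System.out.println(vectorize(b));
        System.out.println(vectorize(a).equals(vectorize(b))); // prints true
    }
}
```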

Coursera has some interesting videos on NLP that I need to skim through. Beyond that, Google has a neat word2vec tool that I could try my hand at.

Finding opinion statements in research articles

I am dealing with the problem of finding statements supported by citations in research articles. Concretely, I am interested in positive or negative sentiment directed towards citations in and around a sentence in the article. This gets a bit tricky: I don’t want to run sentiment analysis over every statement that includes a citation, because some sentences just make a (positive or negative) claim and use a citation to support it. In my case, an interesting statement is one where the author talks positively or negatively about the cited work itself. The ultimate goal is to collect all such citation-related statements across all articles and build a network with articles and citations as nodes and positive/negative links as edges. Some valuable information about the roles of such citations could be derived from this article-citation network.
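
One possible shape for that network, as a rough sketch only: a list of signed, directed edges from an article to the work it cites. Everything here (the class CitationNetwork, the Sentiment enum, the netScore helper and the node names) is hypothetical, and the net score is just one simple example of what could be read off such a graph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical data structure for the article-citation network.
class CitationNetwork {

    enum Sentiment { POSITIVE, NEGATIVE }

    // One directed, signed edge: `article` voices `sentiment` about `citation`.
    static final class Edge {
        final String article;
        final String citation;
        final Sentiment sentiment;

        Edge(String article, String citation, Sentiment sentiment) {
            this.article = article;
            this.citation = citation;
            this.sentiment = sentiment;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    void addOpinion(String article, String citation, Sentiment sentiment) {
        edges.add(new Edge(article, citation, sentiment));
    }

    // Positive opinions minus negative opinions about one cited work.
    int netScore(String citation) {
        int score = 0;
        for (Edge e : edges) {
            if (e.citation.equals(citation)) {
                score += (e.sentiment == Sentiment.POSITIVE) ? 1 : -1;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        CitationNetwork net = new CitationNetwork();
        net.addOpinion("articleA", "paperX", Sentiment.POSITIVE);
        net.addOpinion("articleB", "paperX", Sentiment.POSITIVE);
        net.addOpinion("articleC", "paperX", Sentiment.NEGATIVE);
        System.out.println(net.netScore("paperX")); // prints 1
    }
}
```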

To start with, I will probably need to parse all the text and detect sentences that use citations to make a point. Later, I am planning to build a test set of sentences from research articles that give a direct opinion about citations. This is going to be a bit of a challenge. Hopefully, there aren’t many combinations of phrases that form the core of an opinion statement. A neat classifier for opinion-based citation statements would help provide data for sentiment analysis and the subsequent construction of the article-citation network.
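
The first step, picking out sentences that carry a citation, could be sketched as below. This assumes citations appear as bracketed numbers such as [1], [2, 3] or [4-6]; real articles may mark citations differently (superscripts, author-year), so the pattern is only a placeholder. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter for sentences that contain a citation marker.
class CitationSentenceFilter {

    // Assumed marker style: bracketed numbers like [1], [2, 3] or [4-6].
    private static final Pattern MARKER =
            Pattern.compile("\\[\\d+(\\s*[,-]\\s*\\d+)*\\]");

    static boolean hasCitation(String sentence) {
        return MARKER.matcher(sentence).find();
    }

    // Keeps only the sentences that contain at least one marker.
    static List<String> citing(List<String> sentences) {
        List<String> out = new ArrayList<>();
        for (String sentence : sentences) {
            if (hasCitation(sentence)) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sentences for illustration.
        List<String> sentences = Arrays.asList(
                "This effect was reported earlier [1].",
                "We describe our setup in the next section.",
                "The results in [2, 3] are hard to reproduce.");
        for (String sentence : citing(sentences)) {
            System.out.println(sentence);
        }
    }
}
```

A filter like this only finds candidate sentences; deciding whether a candidate actually expresses an opinion about the cited work is the job of the classifier.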

Generating polynomial features in Octave

A few days back I was looking for a way to generate polynomial features in Octave. After a bit of searching on Google, I learned that there is a function called ‘multinom’ in Octave’s ‘specfun’ package. I had a tough time installing ‘specfun’ on OS X Mavericks, so I gave up on that approach and wrote a recursive ‘multinom’ function that generates polynomial features: the terms (monomials) of the multinomial expansion for any degree n.

The code follows:

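function [polyX, indices] = multinom(X, degree)
    % Returns the monomials of exactly the given degree built from the columns of X.
    % For degree > 1, indices(i) is the number of output columns in the block
    % produced by multiplying with feature X(:,i).

    n = size(X, 2);

    if (degree == 1)
        % Base case: the features themselves; zeros mean "no block sizes yet".
        polyX = X;
        indices = zeros(n, 1);
    else
        polyX = [];
        [remaining_PolyX, prev_indices] = multinom(X, degree - 1);

        if (prev_indices(1) == 0)
            % Previous level was degree 1: feature i pairs with features i..n.
            prev_indices = (n:-1:1)';
        else
            % Feature i pairs with every lower-degree term from blocks i..n,
            % so its block size is the suffix sum of the previous block sizes.
            for i = 1:numel(prev_indices)
                prev_indices(i) = sum(prev_indices(i:end));
            end
        end

        for i = 1:n
            % Multiply feature i with the last prev_indices(i) lower-degree columns.
            for j = 0:(prev_indices(i) - 1)
                polyX = [polyX X(:, i) .* remaining_PolyX(:, size(remaining_PolyX, 2) - j)];
            end
        end

        indices = prev_indices;
    end

end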
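
As a cross-check of the idea, here is a small Java sketch that enumerates the same kind of terms for a single sample: every product of `degree` features with non-decreasing feature indices. It is not a port of the Octave function (the class name PolynomialFeatures is made up, and the terms come out in a different order), but the number of terms should agree: n features at degree d give C(n + d - 1, d) monomials.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cross-check, not a translation of the Octave code.
class PolynomialFeatures {

    // Returns all products x[i1] * ... * x[id] with i1 <= ... <= id.
    static double[] monomials(double[] x, int degree) {
        List<Double> out = new ArrayList<>();
        expand(x, degree, 0, 1.0, out);
        double[] result = new double[out.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.get(i);
        }
        return result;
    }

    // Picks one more factor at a time, never going back to a smaller index,
    // so each monomial is produced exactly once.
    private static void expand(double[] x, int remaining, int first,
                               double product, List<Double> out) {
        if (remaining == 0) {
            out.add(product);
            return;
        }
        for (int i = first; i < x.length; i++) {
            expand(x, remaining - 1, i, product * x[i], out);
        }
    }

    public static void main(String[] args) {
        // x1 = 2, x2 = 3, degree 2: x1*x1, x1*x2, x2*x2
        double[] terms = monomials(new double[] {2.0, 3.0}, 2);
        for (double term : terms) {
            System.out.println(term); // prints 4.0, 6.0, 9.0
        }
    }
}
```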

"Start where you are, with whatever you have. Make something out of it and never be satisfied." – George Washington Carver