
Kaggle Avito Demand Prediction Challenge - 22nd Solution

The Avito Demand Prediction Challenge asks Kagglers to predict the "demand" likelihood of an advertisement. If a listed second-hand iPhone 6 is selling for £20,000, then the "demand" is likely to be very low. This is my first competition where I built models using tabular data, text, and images.

I teamed up with Rashmi, Abhimanyu, Yiang, and Samrat, and we finished 22nd among 1,917 teams. So far, I have four silver medals and my rank is 542nd among 83,588 Kagglers.

This was an interesting competition for me. I was about to quit the competition, and Kaggle, because of other commitments in life and work. Just one day before the team merge deadline, Rashmi asked me to join. At that time, my position was around 880th (about the top 50%), while Rashmi's team was around 82nd. So I decided to join and finish the competition on which I had already spent many hours.

Final Ensemble Models

As part of this team, I worked on the final ensemble models. Immediately after joining, I completed five tasks:

  1. Make sure everyone used the same agreed cross-validation schema. This is essential for building an ensemble model (see the fold-assignment sketch after this list).
  2. Provide a model_zoo.md document to keep track of all level-1 models: their train/valid/LB scores, the features used, and the file paths to their OOF/test predictions.
  3. Write merge_oof.py to combine all OOF/test predictions together.
  4. Write R scripts for the glmnet ensemble.
  5. Write Python scripts for the LightGBM ensemble.
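OOF predictions are only stackable if every level-1 model uses exactly the same folds, so the fold assignment has to be fixed once and shared. Below is a minimal sketch of how such a schema can be frozen with a fixed seed; the file names, fold count, and the item_id key are illustrative assumptions, not necessarily the exact setup we used.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Fix the fold assignment once, with a fixed seed, and commit the file to the
# shared repo so every level-1 model uses identical folds.
train = pd.read_csv("train.csv")                 # hypothetical path
folds = pd.DataFrame({"item_id": train["item_id"]})
folds["fold"] = -1

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold_id, (_, valid_idx) in enumerate(kf.split(train)):
    folds.loc[valid_idx, "fold"] = fold_id

folds.to_csv("cv_folds.csv", index=False)        # shared with the whole team
```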

Once a new model was built, the team member updated model_zoo.md and uploaded the data to a private GitHub repo. Then I updated merge_oof.py to include the new model's results and ran the glmnet and LightGBM ensembles. We had this ensemble workflow automated, so it took little effort to see the ensemble model's performance. A rough sketch of the merging step is shown below.
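The merging step is mostly bookkeeping: read each model's OOF and test predictions and line them up column by column. This sketch shows the idea behind a merge_oof.py-style script; the registry entries, file paths, and the deal_probability column name are my own illustration here, taken from the competition target rather than our actual files.

```python
import pandas as pd

# Hypothetical registry: model name -> (oof path, test path), as listed in model_zoo.md
MODEL_ZOO = {
    "lgb_base":  ("preds/lgb_base_oof.csv",  "preds/lgb_base_test.csv"),
    "xgb_text":  ("preds/xgb_text_oof.csv",  "preds/xgb_text_test.csv"),
    "nn_images": ("preds/nn_images_oof.csv", "preds/nn_images_test.csv"),
}

def merge_predictions(zoo):
    """Return (oof, test) frames with one column per level-1 model."""
    oof_cols, test_cols = [], []
    for name, (oof_path, test_path) in zoo.items():
        oof_cols.append(pd.read_csv(oof_path)["deal_probability"].rename(name))
        test_cols.append(pd.read_csv(test_path)["deal_probability"].rename(name))
    return pd.concat(oof_cols, axis=1), pd.concat(test_cols, axis=1)

oof, test = merge_predictions(MODEL_ZOO)
oof.to_csv("l1_oof_merged.csv", index=False)
test.to_csv("l1_test_merged.csv", index=False)
```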

I spent some time analysing the coefficients/weights of the L1 models and tried to exclude models with negative or low weights. To my surprise, it didn't help at all. The final submission is a glmnet ensemble of 41 models (LightGBM + XGBoost + NN).
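The weight analysis amounts to fitting the linear stacker on the merged OOF matrix and inspecting its coefficients. We used glmnet in R for the actual ensemble; the sketch below uses scikit-learn's Ridge as a Python stand-in, and assumes the merged frames from the earlier sketch plus the true target.

```python
import pandas as pd
from sklearn.linear_model import Ridge

# Assumed inputs: merged level-1 OOF predictions and the true target.
X = pd.read_csv("l1_oof_merged.csv")
y = pd.read_csv("train.csv")["deal_probability"]

stacker = Ridge(alpha=1.0)
stacker.fit(X, y)

# Inspect each level-1 model's weight. In our case, dropping the
# negative/low-weight models did not improve the ensemble.
weights = pd.Series(stacker.coef_, index=X.columns).sort_values()
print(weights)
```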

Also, the LightGBM ensemble had a much better CV score, but its LB score was worse. I suspect this is because there was leakage in some L1 models, and glmnet, being a linear model, is more robust to it. Unfortunately, there was not enough time to identify which models had the leakage.
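For completeness, the LightGBM ensemble is the same idea with a tree model trained on the OOF matrix, evaluated with the shared folds so its CV score is directly comparable. A minimal sketch, with purely illustrative hyperparameters and the same assumed file names as above:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# Assumed inputs: merged OOF matrix, target, and the shared fold file.
X = pd.read_csv("l1_oof_merged.csv")
y = pd.read_csv("train.csv")["deal_probability"]
folds = pd.read_csv("cv_folds.csv")["fold"]

oof_pred = np.zeros(len(X))
for fold_id in sorted(folds.unique()):
    trn, val = folds != fold_id, folds == fold_id
    model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=15)
    model.fit(X[trn], y[trn])
    oof_pred[val.values] = model.predict(X[val])

# Competition metric is RMSE on deal_probability.
print("stacker CV RMSE:", np.sqrt(mean_squared_error(y, oof_pred)))
```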

Collaboration

This is my second time working in a team. There is still a lot of room for improvement in how we collaborate compared with a professional data science team, but as a nights-and-weekends project, we did a really good job as a team.

The setup for collaboration:

  1. Slack for discussion. We had channels for general, final_ensemble, random (for cat photos), etc.
  2. We also used Slack for sharing features, which I personally don't like.
  3. A private GitHub repo for sharing code and OOF/test predictions.
  4. Monday.com for managing tasks. It gives a nice overview of what everyone is up to.

We tried very hard to get a gold medal, but other teams worked even harder. At one point we were at 17th, but we finished at 22nd.

Some Kagglers to Avoid

Finally, while waiting out the last hour before the final deadline, we had a lovely discussion about our past disqualification experiences. We were all shocked to discover that, in the Toxic competition, we had been on different teams but had each teamed up with the same person. We shared that person's multiple Kaggle accounts and added them to our personal block-lists.

If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!