Простое решение Salary

Загружаем данные

In [1]:
import pandas
In [3]:
data_train = pandas.read_csv('./salary_train.csv', index_col='Id')
data_test = pandas.read_csv('./salary_test.csv', index_col='Id')
In [4]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Извлекаем признаки из категориальных переменных

In [23]:
%%time

cat_features = [
    'LocationNormalized',
    'ContractType',
    'ContractTime', 
    'Company',
    'Category',
]

cat_vectorizer = DictVectorizer()

X_cat_train = cat_vectorizer.fit_transform(data_train[cat_features].fillna('').T.to_dict().values())
X_cat_test = cat_vectorizer.transform(data_test[cat_features].fillna('').T.to_dict().values())
CPU times: user 17.2 s, sys: 685 ms, total: 17.9 s
Wall time: 17.9 s

Извлекаем признаки из текста

In [18]:
%%time

text_vectorizer = TfidfVectorizer(min_df=10)

X_text_train = text_vectorizer.fit_transform(data_train['FullDescription'])
X_text_test = text_vectorizer.transform(data_test['FullDescription'])
CPU times: user 59.4 s, sys: 3.3 s, total: 1min 2s
Wall time: 1min 3s

Обучаем линейную модель

In [24]:
from scipy import sparse
X_train = sparse.hstack([X_cat_train, X_text_train])
X_test = sparse.hstack([X_cat_test, X_text_test])

y_train = data_train['SalaryNormalized']
In [25]:
from sklearn.linear_model import Ridge
In [29]:
model = Ridge()
Оцениваем качество модели линейной регрессии
In [33]:
%%time

from sklearn.cross_validation import cross_val_score, ShuffleSplit

splits = ShuffleSplit(n=X_train.shape[0], n_iter=3)
print cross_val_score(model, X_train, y_train, scoring='mean_absolute_error')
[-7634.14157941 -7754.42392823 -7651.81712817]
CPU times: user 1min 4s, sys: 1.65 s, total: 1min 6s
Wall time: 1min 6s
Обучаем финальную модель
In [34]:
%%time
model = Ridge().fit(X_train, y_train)
CPU times: user 30.8 s, sys: 618 ms, total: 31.4 s
Wall time: 31.7 s
In [35]:
y_test_pred = model.predict(X_test)
Записываем предсказания в файл
In [36]:
import pandas
In [37]:
pandas.DataFrame(
    {'SalaryPredicted': y_test_pred},
    index=data_test.index,
).to_csv('sample_submission.csv')