I wrote about this before in an earlier post: "Data visualization in edX".

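I have had my eye on edX's data/behavior analytics and visualization for a long time. Progress had been slow, though, because I had never quite untangled the dependencies and data flow in this part of the stack.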

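Recently I went back and carefully reread the architecture, data flow, and dependencies of this part, and it finally clicked. I tried it out and, to my surprise, it worked. In fact, this diagram from the docs already explains it quite clearly: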

Analytics_AWS_Deployment

Currently, both Stanford and edX itself seem to run their analytics workloads on Amazon.

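You know how things are in China: relying on overseas services is painful. So I wanted the analytics module to run in Docker, like a plugin that can be plugged in or pulled out at any time. Once the dependencies were clear, it turned out that the analytics module depends on the outside world only through its input and its output (as it should ^_^): the input is the tracking log, and the output is written to MySQL. That makes the problem much simpler: just set up the environment and dependencies inside Docker. I tried it, and it actually worked. Here are the results first: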

hadoop1.jpeg

hadoop2.jpeg
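To make the input/output contract described above a bit more concrete, here is a minimal sketch of "tracking log in, result table out". It is not the pipeline's own code (the real tasks run on Hadoop). It assumes tracking log lines are JSON objects carrying a `time` string and a `context.course_id`, uses an in-memory SQLite database purely as a stand-in for the MySQL result store, and the table name `course_event_counts` and both function names are hypothetical.

```python
import json
import sqlite3
from collections import Counter


def count_events_per_course_day(log_lines):
    """Count events per (course_id, date) from JSON tracking-log lines.

    Assumes each event has a "time" string starting with YYYY-MM-DD and a
    "context" dict with a "course_id"; lines that do not fit are skipped.
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # malformed line, ignore it
        if not isinstance(event, dict):
            continue
        context = event.get("context")
        timestamp = event.get("time")
        if not isinstance(context, dict) or not isinstance(timestamp, str):
            continue
        course_id = context.get("course_id")
        if not course_id:
            continue
        counts[(course_id, timestamp[:10])] += 1
    return counts


def store_counts(conn, counts):
    """Write the counts into a (hypothetical) result table.

    SQLite stands in for the MySQL result store here.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS course_event_counts ("
        "course_id TEXT, date TEXT, event_count INTEGER, "
        "PRIMARY KEY (course_id, date))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO course_event_counts VALUES (?, ?, ?)",
        [(course, day, n) for (course, day), n in sorted(counts.items())],
    )
    conn.commit()


if __name__ == "__main__":
    sample = [
        '{"time": "2015-01-01T10:00:00+00:00", "event_type": "play_video",'
        ' "context": {"course_id": "edX/DemoX/Demo_Course"}}',
        "this line is not JSON",
    ]
    conn = sqlite3.connect(":memory:")
    store_counts(conn, count_events_per_course_day(sample))
    print(conn.execute("SELECT * FROM course_event_counts").fetchall())
```

The point of the sketch is only that the analytics side needs nothing from the LMS beyond readable log files and a writable database, which is why moving it into a container is feasible.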

The tasks currently available are:

:::text
{AnswerDistributionOneFilePerCourseTask,AnswerDistributionPerCourse,AnswerDistributionToMySQLTaskWorkflow,AnswerDistributionWorkflow,BaseAnswerDistributionTask,BaseHadoopJobTask,CalendarTableTask,CalendarTask,CompositionTask,CourseActivityDailyTask,CourseActivityMonthlyTask,CourseActivityTask,CourseActivityWeeklyTask,CourseEnrollmentChangesPerDay,CourseEnrollmentEventsPerDay,CourseEnrollmentTableTask,CourseEnrollmentTask,CourseEnrollmentValidationPerDateTask,CourseEnrollmentValidationTask,CreateAllEnrollmentValidationEventsTask,CreateEnrollmentValidationEventsForTodayTask,CreateEnrollmentValidationEventsTask,DailyRegistrationsEnrollmentsAndCourses,EnrollmentByBirthYearTask,EnrollmentByEducationLevelTask,EnrollmentByGenderTask,EnrollmentByModeTask,EnrollmentDailyTask,EnrollmentTask,EnrollmentValidationWorkflow,EnrollmentsByWeek,EnrollmentsandRegistrationsWorkflow,EnvironmentParamsContainer,EventExportTask,EventLogSelectionTask,GradeDistFromSqoopToMySQLWorkflow,GradeDistFromSqoopToTSVWorkflow,HistogramFromSqoopToMySQLWorkflowBase,HistogramFromStudentModuleSqoopWorkflowBase,HiveQueryTask,HiveQueryToMysqlTask,HiveTableFromQueryTask,HiveTableTask,ImportAllDatabaseTablesTask,ImportAuthUserProfileTask,ImportAuthUserTask,ImportEnrollmentsIntoMysql,ImportIntoHiveTableTask,ImportLastCountryOfUserToHiveTask,ImportMysqlToHiveTableTask,ImportStudentCourseEnrollmentTask,InsertToMysqlAnswerDistributionTableBase,InsertToMysqlCourseEnrollByCountryTask,InsertToMysqlCourseEnrollByCountryTaskBase,InsertToMysqlCourseEnrollByCountryWorkflow,InsertToMysqlLastCountryOfUserTask,JobTask,LastCountryForEachUser,LastCountryOfUser,LastProblemCheckEvent,MapReduceJobTask,MultiOutputMapReduceJobTask,MysqlInsertTask,MysqlSelectTask,ParseEventLogPerformanceTask,PathSetTask,QueryLastCountryPerCourseTask,QueryLastCountryPerCourseWorkflow,SeqOpenDistFromSqoopToMySQLWorkflow,SeqOpenDistFromSqoopToTSVWorkflow,SqoopImportFromMysql,SqoopImportTask,StudentModulePerCourseAfterImportWorkflow,StudentModulePerCourseTask,Task,TotalEventsDailyTask,TotalEventsReport,TotalEventsReportWorkflow,URLManifestTask,UserActivityTableTask,UserActivityTask,UserRegistrationsPerDay,UsersPerCountry,UsersPerCountryReport,UsersPerCountryReportWorkflow,WeeklyAllUsersAndEnrollments,WeeklyIncrementalUsersAndEnrollments,WrapperTask}

For more details, see Tasks-to-Run-to-Update-Insights and the Stanford analytics task scheduler.

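The next step is to display the analysis results in insights; once that is done, the data visualization work is complete. insights is a standalone server, and the analysis results are stored in the result store (MySQL).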