Data Mining Problem

3.6 Use a flow chart to summarize the following procedures for attribute subset selection:
a- stepwise forward selection
b- stepwise backward elimination
c- a combination of forward selection and backward elimination

Solution
a- Stepwise forward selection:
-The best of the original attributes is picked first.
-Then the next best of the remaining attributes is added to the set, ...
b- Stepwise backward elimination:
-Start from the full attribute set and repeatedly eliminate the worst remaining attribute.
c- Combination of forward selection and backward elimination:
-At each step, select the best attribute and remove the worst from among the remaining attributes.
Decision tree induction:
-A tree is constructed from the given data.
-The set of attributes appearing in the tree forms the reduced attribute subset.
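The greedy loops behind forward selection and backward elimination can be sketched as follows. This is a minimal illustration, not a definitive procedure: the solution does not say how "best" and "worst" are measured, so the `score` callback, the stopping rules, and the toy weights are all assumptions made for the example.

```python
from typing import Callable, FrozenSet, Iterable

# Hypothetical scoring callback: higher means a better attribute subset.
Score = Callable[[FrozenSet[str]], float]


def forward_selection(attributes: Iterable[str], score: Score) -> FrozenSet[str]:
    """Start from the empty set and keep adding the best remaining attribute."""
    remaining = set(attributes)
    selected: FrozenSet[str] = frozenset()
    best = score(selected)
    while remaining:
        # Pick the attribute whose addition gives the highest score.
        candidate = max(sorted(remaining), key=lambda a: score(selected | {a}))
        candidate_score = score(selected | {candidate})
        if candidate_score <= best:
            break  # assumed stopping rule: no remaining attribute improves the score
        selected = selected | {candidate}
        remaining.remove(candidate)
        best = candidate_score
    return selected


def backward_elimination(attributes: Iterable[str], score: Score) -> FrozenSet[str]:
    """Start from the full set and keep removing the worst attribute."""
    selected = frozenset(attributes)
    best = score(selected)
    while selected:
        # The "worst" attribute is the one whose removal leaves the best subset.
        candidate = max(sorted(selected), key=lambda a: score(selected - {a}))
        candidate_score = score(selected - {candidate})
        if candidate_score < best:
            break  # assumed stopping rule: every removal would hurt the score
        selected = selected - {candidate}
        best = candidate_score
    return selected


# Toy score (assumption): usefulness weights minus a small cost per attribute.
WEIGHTS = {"income": 3.0, "age": 2.0, "zip": 0.0, "id": 0.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(WEIGHTS[a] for a in subset) - 0.5 * len(subset)


print(sorted(forward_selection(WEIGHTS, toy_score)))
print(sorted(backward_elimination(WEIGHTS, toy_score)))
```

With this toy score both procedures keep `income` and `age`. In practice the score would come from something like a statistical significance test or the accuracy of a model built on the subset.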
4.1 List and describe the five primitives for specifying a data mining task.

Solution
a- Task-relevant data (the portion of the database to be investigated):
-Database or data warehouse name.
-Database tables or data warehouse cubes.
-Conditions for data selection.
-Relevant attributes or dimensions.
-Data grouping criteria.
b- Types of knowledge to be mined (the data mining functions to be performed):
-Characterization
-Discrimination
-Association
-Classification/prediction
-Clustering
-Outlier analysis
-Other data mining tasks
c- Background knowledge (knowledge about the domain to be mined), in the form of concept hierarchies:
-Schema hierarchy: a total or partial ordering among attributes in the database schema. E.g., street < city < province_or_state < country
-Set-grouping hierarchy: organizes values for a given attribute or dimension into groups of constants or range values. E.g., {20-39} = young, {40-59} = middle_aged
-Operation-derived hierarchy: based on operations specified by users, experts, or the data mining system. E.g., email address: login-name < department < university < country
-Rule-based hierarchy: occurs when a hierarchy is defined by a set of rules. E.g., low_profit_margin(X) <= price(X, P1) and cost(X, P2) and ((P1 - P2) < $50)

More examples:
To specify which concept hierarchy to use:
    use hierarchy <hierarchy> for <attribute_or_dimension>
We use different syntax to define different types of hierarchies.
Schema hierarchies:
    define hierarchy time_hierarchy on date as [day, month, quarter, year]
Set-grouping hierarchies:
    define hierarchy age_hierarchy for age on customer as
    level1: {young, middle_aged, senior} < level0: all
    level2: {20, ..., 39} < level1: young
    level2: {40, ..., 59} < level1: middle_aged
    level2: {60, ..., 89} < level1: senior
Operation-derived hierarchies:
    define hierarchy age_hierarchy for age on customer as
    {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
Rule-based hierarchies:
    define hierarchy profit_margin_hierarchy on item as
    level_1: low_profit_margin < level_0: all
        if (price - cost) <= $50
    level_1: medium_profit_margin < level_0: all
        if ((price - cost) > $50) and ((price - cost) <= $250)
    level_1: high_profit_margin < level_0: all
        if (price - cost) > $250

d- Measurements of pattern interestingness (to evaluate the discovered patterns):
-Simplicity: e.g., (association) rule length, (decision) tree size.
-Certainty: e.g., confidence, P(A|B) = n(A and B) / n(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
-Utility: potential usefulness, e.g., support (association), noise threshold (description).
-Novelty: not previously known, surprising (used to remove redundant rules).
e- Visualization of discovered patterns (how to display the discovered patterns):
-Users with different backgrounds may require different forms of representation to identify patterns of interest. E.g., rules, tables, crosstabs, pie or bar charts, etc.
-Concept hierarchies are also important for visualizing the discovered patterns:
a) Discovered knowledge might be more understandable when represented at a high level of abstraction.
b) Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data.

4.2 Describe why concept hierarchies are useful in data mining.

Solution
They are useful in data mining because they allow the discovery of knowledge at multiple levels of abstraction and provide the structure on which data can be generalized (rolled up) or specialized (drilled down).
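The roll-up described in 4.2 can be illustrated with the age set-grouping hierarchy used in the examples. The sketch below is only an illustration under stated assumptions: the function names, the numeric `level` argument, and the sample ages are invented for the example, while the group boundaries come from the age_hierarchy definition.

```python
from collections import Counter
from typing import Iterable

# Group boundaries taken from the age_hierarchy example (level 1 labels).
AGE_GROUPS = [(20, 39, "young"), (40, 59, "middle_aged"), (60, 89, "senior")]


def generalize_age(age: int, level: int) -> str:
    """Map a raw age to its value at the requested hierarchy level.

    Level 2 is the raw value, level 1 the age group, level 0 is "all".
    """
    if level == 0:
        return "all"
    if level == 1:
        for low, high, label in AGE_GROUPS:
            if low <= age <= high:
                return label
        raise ValueError(f"age {age} is outside the hierarchy")
    return str(age)


def roll_up(ages: Iterable[int], level: int) -> Counter:
    """Count tuples after generalizing every age to the given level."""
    return Counter(generalize_age(a, level) for a in ages)


ages = [23, 35, 41, 67]        # hypothetical sample data
print(roll_up(ages, 2))        # most specific: one count per raw age
print(roll_up(ages, 1))        # rolled up to age groups
print(roll_up(ages, 0))        # rolled up to "all"
```

Moving from level 2 to level 0 is a roll-up, and moving the other way is a drill-down; the same counts are simply viewed at different levels of abstraction.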
4.3 The four major types of concept hierarchies are schema hierarchies, set-grouping hierarchies, operation-derived hierarchies, and rule-based hierarchies.
a- Briefly define each type of hierarchy.
b- For each hierarchy type, provide an example.

Solution
-Schema hierarchy: a total or partial ordering among attributes in the database schema. E.g., street < city < province_or_state < country
-Set-grouping hierarchy: organizes values for a given attribute or dimension into groups of constants or range values. E.g., {20-39} = young, {40-59} = middle_aged
-Operation-derived hierarchy: based on operations specified by users, experts, or the data mining system. E.g., email address: login-name < department < university < country
-Rule-based hierarchy: occurs when a hierarchy is defined by a set of rules. E.g., low_profit_margin(X) <= price(X, P1) and cost(X, P2) and ((P1 - P2) < $50)

More examples:
To specify which concept hierarchy to use:
    use hierarchy <hierarchy> for <attribute_or_dimension>
We use different syntax to define different types of hierarchies.
Schema hierarchies:
    define hierarchy time_hierarchy on date as [day, month, quarter, year]
Set-grouping hierarchies:
    define hierarchy age_hierarchy for age on customer as
    level1: {young, middle_aged, senior} < level0: all
    level2: {20, ..., 39} < level1: young
    level2: {40, ..., 59} < level1: middle_aged
    level2: {60, ..., 89} < level1: senior
Operation-derived hierarchies:
    define hierarchy age_hierarchy for age on customer as
    {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
Rule-based hierarchies:
    define hierarchy profit_margin_hierarchy on item as
    level_1: low_profit_margin < level_0: all
        if (price - cost) <= $50
    level_1: medium_profit_margin < level_0: all
        if ((price - cost) > $50) and ((price - cost) <= $250)
    level_1: high_profit_margin < level_0: all
        if (price - cost) > $250
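The rule-based profit_margin_hierarchy above amounts to a small classification function. The sketch below is an illustrative translation, assuming prices and costs are plain numbers in dollars; the function name is hypothetical, and the $50 and $250 thresholds are the ones stated in the rules.

```python
def profit_margin_level(price: float, cost: float) -> str:
    """Return the level_1 concept for an item, following the three rules."""
    margin = price - cost
    if margin <= 50:
        return "low_profit_margin"
    if margin <= 250:            # here margin is already known to be > 50
        return "medium_profit_margin"
    return "high_profit_margin"  # margin > 250


# Hypothetical items as (price, cost) pairs.
for price, cost in [(100, 60), (300, 100), (400, 100)]:
    print(price, cost, profit_margin_level(price, cost))
```

Because the three conditions partition every possible margin, each item maps to exactly one level_1 concept, which is what lets the rules act as a hierarchy.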
4.4
a- Propose a concept hierarchy for each of the attributes address, status, major, and gpa.
b- What type of concept hierarchy is each one?

Solution
address: schema hierarchy [street, city, state, country]
status: ???
major: ???
gpa: rule-based hierarchy
    if (grade > 90) gpa = A
    else if (grade > 60) gpa = B

4.8 Discuss the importance of establishing a standard data mining query language. List a few of the recent proposals in this area.

Solution
A DMQL can support interactive data mining and facilitate flexible knowledge discovery. A standardized language would:
-Hopefully achieve an effect similar to the one SQL has had on relational databases.
-Provide a foundation for system development and evolution.
-Facilitate information exchange, technology transfer, commercialization, and wide acceptance.

4.9 No coupling, loose coupling, ..
-No coupling: flat file processing; not recommended.
-Loose coupling: fetching data from a DB/DW.
-Semi-tight coupling (enhanced DM performance): provide efficient implementations of essential data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some statistical measures.
-Tight coupling (a uniform information processing environment): the DM system is smoothly integrated into a DB/DW system. Mining queries are optimized based on mining query analysis, data structures, indexing schemes, and the query processing methods of the DB.
5.2 Solution

Birth place    Measure     Programmer        DBA               Both classes
Canada         count       180               20                200
               t-weight    180/300 = 60%     20/100 = 20%      200/400 = 50%
               d-weight    180/200 = 90%     20/200 = 10%      200/200 = 100%
Other          count       120               80                200
               t-weight    120/300 = 40%     80/100 = 80%      200/400 = 50%
               d-weight    120/200 = 60%     80/200 = 40%      200/200 = 100%
Both places    count       300               100               400
               t-weight    300/300 = 100%    100/100 = 100%    400/400 = 100%
               d-weight    300/400 = 75%     100/400 = 25%     100%
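The t-weights and d-weights in the 5.2 crosstab can be recomputed from the four counts. The following is a minimal sketch with hypothetical function names: the t-weight divides a cell by its class total, and the d-weight divides it by its birth-place total across both classes.

```python
# Counts from the 5.2 crosstab: class -> birth place -> count.
COUNTS = {
    "Programmer": {"Canada": 180, "other": 120},
    "DBA": {"Canada": 20, "other": 80},
}


def t_weight(cls: str, place: str) -> float:
    """Share of the class that falls in this birth place (typicality)."""
    return COUNTS[cls][place] / sum(COUNTS[cls].values())


def d_weight(cls: str, place: str) -> float:
    """Share of this birth place that belongs to the class (discrimination)."""
    return COUNTS[cls][place] / sum(row[place] for row in COUNTS.values())


for cls in COUNTS:
    for place in COUNTS[cls]:
        print(f"{cls:10s} {place:6s} "
              f"t = {t_weight(cls, place):.0%}  d = {d_weight(cls, place):.0%}")
```

For Programmer and Canada this gives t = 180/300 = 60% and d = 180/200 = 90%.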
Programmer(x) <=> (birth_place(x) = Canada [t: 60%, d: 90%]) or (birth_place(x) = other [t: 40%, d: 60%])

5.3 ???
5.6 When a new set of tuples, ΔDB, is inserted into the database:
-Generalize ΔDB to the same level of abstraction as the generalized relation R to derive ΔR.
-Union R ∪ ΔR, i.e., merge counts and other statistical information, to produce a new relation R'.
Deletion can be performed in a similar manner.
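The incremental update in 5.6 can be sketched by storing the generalized relation as a table of counts. Everything named here is an assumption made for illustration: the `generalize` callback, the city-to-country mapping, and the sample tuples are hypothetical, and only counts are merged (other statistics would be merged in the same way).

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Tuple

Generalize = Callable[[Tuple], Hashable]


def insert_tuples(relation: Counter, new_tuples: Iterable[Tuple],
                  generalize: Generalize) -> Counter:
    """Generalize the new tuples to the level of R, then merge their counts."""
    delta = Counter(generalize(t) for t in new_tuples)
    return relation + delta


def delete_tuples(relation: Counter, old_tuples: Iterable[Tuple],
                  generalize: Generalize) -> Counter:
    """Generalize the deleted tuples and subtract their counts from R."""
    delta = Counter(generalize(t) for t in old_tuples)
    return relation - delta  # Counter subtraction drops counts that reach zero


# Hypothetical generalization: city -> Canada/other, job title kept as-is.
CANADIAN_CITIES = {"Vancouver", "Toronto"}


def to_generalized(t: Tuple[str, str]) -> Tuple[str, str]:
    city, job = t
    return ("Canada" if city in CANADIAN_CITIES else "other", job)


R = Counter({("Canada", "Programmer"): 180, ("other", "Programmer"): 120})
R_new = insert_tuples(R, [("Toronto", "Programmer"), ("Lima", "Programmer")],
                      to_generalized)
print(R_new)
print(delete_tuples(R_new, [("Toronto", "Programmer")], to_generalized))
```

Only the small set of changed tuples is generalized; the base data behind R is never rescanned, which is the point of updating the generalized relation directly.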
